Understanding 3D Diffusion Models

Isamu Isozaki
8 min read · Nov 13, 2023

This repository on Stable Dreamfusion mainly inspired this blog.

Hi! I think 3d diffusion and 3d generation seem to be the next big thing on everyone’s minds. Stability AI started hiring talented 3d AI engineers and there seems to be impressive instant 3d model generator research coming out of Google and the general research world daily. However, while I have a vague understanding of how these models work, I definitely do not have a concrete understanding.

So, the goal of this blog is to understand the following 2 papers

  1. “DreamFusion: Text-to-3D using 2D Diffusion” which arguably started the whole idea of converting text to 3d models. <github>
  2. “Zero-1-to-3: Zero-shot One Image to 3D Object”. This is the model used in the stable dreamfusion repo for converting images into 3d models.<github>

Now, let’s start with DreamFusion

DreamFusion

This is a pretty remarkable work that frankly blows my mind a bit. The idea is that we can take diffusion models that generate 2d images, such as Stable Diffusion, and use them to generate 3d models!

What is Diffusion?

Now if you want a recap on diffusion models check out here but the main recap is as follows:

Image taken from https://towardsdatascience.com/diffusion-models-made-easy-8414298ce4da. Thanks!

The diffusion process has a forward process, which adds noise to the image, and a reverse process, which takes noise away from the image. You can think of one step of the forward process as multiplying the previous image by some quantity and adding some constant times noise sampled from a normal distribution. So like this!

noisy image = a ⋅ less noisy image + b ⋅ noise

The a and b values change depending on which step of the process we are at, but pretty much all diffusion models follow exactly the above idea!

Every time we do the above, we move forward one timestep. So, the original image is z₀ but if we do the above noising 2 times we get z₂.

Then, it seems pretty obvious that

less noisy image = (noisy image - b ⋅ noise)/a

In fact, since all these a and b values are fixed constants, we can chain the steps together and sometimes say

image = (noisy image - b’ ⋅ noise)/a’

for some modified constants b’ and a’ that depend on the timestep. This estimate can be quite inaccurate, because in practice the noise has to be guessed, but it’s pretty useful for guessing the initial image at random timesteps!
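To make the a’ and b’ shorthand concrete, here is a minimal NumPy sketch. The schedule (a cosine/sine pair so that a’² + b’² = 1) and the function names are my own assumptions for illustration, not something taken from a specific paper or library.

```python
import numpy as np

def noise_coefficients(t, num_steps=1000):
    """Hypothetical schedule: a' shrinks and b' grows with t, with a'^2 + b'^2 = 1."""
    angle = 0.5 * np.pi * t / (num_steps + 1)  # stays strictly below pi/2, so a' > 0
    return np.cos(angle), np.sin(angle)

def add_noise(image, noise, t):
    """noisy image = a' * image + b' * noise"""
    a, b = noise_coefficients(t)
    return a * image + b * noise

def estimate_image(noisy_image, noise, t):
    """image = (noisy image - b' * noise) / a'"""
    a, b = noise_coefficients(t)
    return (noisy_image - b * noise) / a

rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(8, 8, 3))  # stand-in for a real image
noise = rng.standard_normal(image.shape)

noisy = add_noise(image, noise, t=600)
# Exact recovery is only possible here because we pass in the true noise.
recovered = estimate_image(noisy, noise, t=600)
```

With the true noise the recovery is exact. In practice the noise is only an estimate, which is where the errors mentioned above come from.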

Once we train a model so that we can find that noise, we pretty much have a diffusion model that can generate images from pure noise!

The main steps are,

  1. Start with pure noise as the noisy image
  2. Predict noise using the model that will push the image towards less noisy images
  3. Do the calculation above to get a less noisy image!
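The three steps above can be sketched as a loop. This is a toy under loud assumptions: the "model" is a hand-written stand-in that is exact only because the whole dataset is a single known vector, the schedule is hypothetical, and the update is a deterministic DDIM-style step rather than whatever sampler a given library uses.

```python
import numpy as np

NUM_STEPS = 50
target = np.array([0.2, -0.5, 0.9])  # pretend this is the only "image" in the dataset

def coefficients(t):
    """Hypothetical schedule with a^2 + b^2 = 1; t=0 is clean, t=NUM_STEPS is almost pure noise."""
    angle = 0.5 * np.pi * t / (NUM_STEPS + 1)
    return np.cos(angle), np.sin(angle)

def predict_noise(noisy_image, t):
    """Stand-in for the trained network: the exact answer when all data equals `target`."""
    a, b = coefficients(t)
    return (noisy_image - a * target) / b

rng = np.random.default_rng(0)
z = rng.standard_normal(target.shape)          # step 1: start from noise

for t in range(NUM_STEPS, 0, -1):
    eps = predict_noise(z, t)                  # step 2: predict the noise
    a, b = coefficients(t)
    image_guess = (z - b * eps) / a            # current guess of the clean image
    a_prev, b_prev = coefficients(t - 1)
    z = a_prev * image_guess + b_prev * eps    # step 3: move to the less noisy level t-1
```

After the loop, z has landed on the target, because the stand-in model always knows the right answer. A real network only approximates it, so real samplers take many small steps.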

That is the main idea behind diffusion models! For how it is trained etc, I did previously write about it here. But the main idea is

  1. Start with input image z₀
  2. Add some random noise ϵ to it to get a noisy image zₜ
  3. Finally, predict the noise ϵ’(zₜ, t) that can be used to denoise zₜ to eventually get z₀.
  4. The loss is (ϵ’(zₜ, t)-ϵ)²
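As a rough sketch, the four training steps above look like this for a single image. The schedule and both "models" are hypothetical stand-ins I made up for illustration; a real setup would use a neural network and an optimizer.

```python
import numpy as np

def coefficients(t, num_steps=1000):
    """Hypothetical schedule with a^2 + b^2 = 1."""
    angle = 0.5 * np.pi * t / (num_steps + 1)
    return np.cos(angle), np.sin(angle)

def diffusion_loss(predict_noise, image, rng, num_steps=1000):
    t = int(rng.integers(1, num_steps + 1))       # pick a random timestep
    eps = rng.standard_normal(image.shape)        # step 2: sample the noise
    a, b = coefficients(t, num_steps)
    noisy = a * image + b * eps                   # the noisy image z_t
    eps_pred = predict_noise(noisy, t)            # step 3: predict the noise
    return float(np.mean((eps_pred - eps) ** 2))  # step 4: squared error

image = np.random.default_rng(0).uniform(-1.0, 1.0, size=(4, 4, 3))

def zero_model(noisy, t):
    """A bad model that always predicts 'no noise'."""
    return np.zeros_like(noisy)

def oracle_model(noisy, t):
    """A cheating model that already knows the clean image."""
    a, b = coefficients(t)
    return (noisy - a * image) / b

# Same seed for both calls, so both models see the same timestep and noise.
loss_bad = diffusion_loss(zero_model, image, np.random.default_rng(1))
loss_good = diffusion_loss(oracle_model, image, np.random.default_rng(1))
```

Training is just nudging the model's weights so that its loss moves from something like loss_bad towards loss_good.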

In Dreamfusion this loss is introduced as
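Written out in LaTeX, with ϵ’ standing for the model’s noise prediction (the paper writes it as ε with a ϕ subscript), the loss reads roughly:

```latex
\mathcal{L}_{\text{Diff}}(\phi, x) =
  \mathbb{E}_{t \sim \mathcal{U}(0,1),\; \epsilon \sim \mathcal{N}(0, I)}
  \left[ w(t)\, \left\| \epsilon'(\alpha_t x + \sigma_t \epsilon;\, t) - \epsilon \right\|_2^2 \right]
```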

The E is basically the expected value, so you can just think of it as the average over all possible timesteps and noise. The w(t) is just a weight that depends on the timestep t, which we can comfortably ignore. αₜx + σₜϵ is our noisy image, which we can obtain by rearranging the

image = (noisy image - b’ ⋅ noise)/a’

that we talked about before into

noisy image = a’ ⋅ image + b’ ⋅ noise

with a’ = αₜ and b’ = σₜ. The noise terms inside the squared difference are pretty much what we discussed: the predicted noise minus the noise we actually added!

So, we try to correctly guess the noise to denoise a noisy image at any given timestep! Let me know if this part is confusing. I can definitely revise it.

Back to DreamFusion

Now what Dreamfusion says is: let’s say there is a differentiable generator g with some parameters θ, where g(θ) outputs an image! You can think of this as a render of a 3d scene (θ) from a certain camera angle and with certain lighting (g).

Now, how do we make g(θ) look like the output from the diffusion models? The first idea is to optimize θ so that the output of our generator becomes the output of the diffusion model like so!
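In symbols, this first idea is roughly the following, where the loss is the same noise-prediction loss as before, evaluated on the rendered image:

```latex
\theta^{*} = \arg\min_{\theta}\; \mathcal{L}_{\text{Diff}}\bigl(\phi,\; x = g(\theta)\bigr)
```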

Here, ϕ is the parameter of the diffusion model. The main idea here is we assume our diffusion model ϕ has a great understanding of images. Then, it must know which images look normal and which don’t. If our diffusion loss is high and our model can’t predict what noise has been added in the first place, that means the image generated from g(θ) doesn’t look like an actual image. So we can back propagate this loss to update θ and we should be good to go!

However, curiously they found that this didn’t do well. Let us see why! Our diffusion loss is the same noise-prediction loss as before.

Then, we substitute x = g(θ) and take the gradient with respect to θ, which is pretty much just the chain rule
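Written out with the Jacobian in its full chain-rule form, and with zₜ = αₜx + σₜϵ, the gradient looks roughly like this:

```latex
\nabla_\theta \mathcal{L}_{\text{Diff}}(\phi, x = g(\theta)) =
  \mathbb{E}_{t, \epsilon}\left[
    w(t)\, \bigl(\epsilon'(z_t; y, t) - \epsilon\bigr)\,
    \frac{\partial \epsilon'(z_t; y, t)}{\partial z_t}\,
    \frac{\partial x}{\partial \theta}
  \right]
```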

Here y is the conditioning input to the diffusion model, which in this case is the text prompt! So we are trying to find θ such that the image rendered from it corresponds to the text prompt.

One part I suspect is a typo is that ∂ϵ’/zₜ should probably be ∂ϵ’/∂zₜ for a proper chain rule.

As for why ∂zₜ/∂x is missing: the noisy image zₜ is built from the image x like so

zₜ = a’ ⋅ x + b’ ⋅ noise

so ∂zₜ/∂x = a’, which is a constant, so we can just absorb it into w(t). Other than that this seems like a pretty standard chain rule.

Here, the ∂ϵ’/∂zₜ term is expensive. This makes sense because we need to backpropagate through the diffusion model itself to compute it. On top of that, the paper describes this term as poorly conditioned for small noise levels, “as it is trained to approximate the scaled Hessian of the marginal density.” Let me know if anyone has a good explanation for this. My current understanding is that, as seen in other research like eDiff-I, when the noise level is low, ϵ’ pretty much completely ignores the text and focuses on making the noisy image look like a normal image. This is counterproductive for our case since it doesn’t necessarily force our generated image to follow the text prompt. So the result is suboptimal.

Now, what the paper did was say: since this term is expensive and contributes pretty much nothing useful to our gradient, let’s delete it! And we get
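The resulting gradient, which the paper calls Score Distillation Sampling (SDS), is the chain-rule gradient with the ∂ϵ’/∂zₜ factor removed:

```latex
\nabla_\theta \mathcal{L}_{\text{SDS}}(\phi, x = g(\theta)) \triangleq
  \mathbb{E}_{t, \epsilon}\left[
    w(t)\, \bigl(\epsilon'(z_t; y, t) - \epsilon\bigr)\,
    \frac{\partial x}{\partial \theta}
  \right]
```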

Now while deleting the derivative seems a bit counter-intuitive, this can be formally derived. It is pretty interesting, so I will add it here once I have fully digested it.

Now, overall, that’s it. And the main cool thing about this approach is that we never need to backpropagate through our diffusion model! We can just noise the render, run the diffusion model forward once to predict the noise, and use that for the gradient of θ!
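Here is a toy sketch of that update, under heavy assumptions of my own: the "generator" is just g(θ) = θ (so ∂x/∂θ is the identity), the frozen "diffusion model" is a hand-written stand-in that only knows one target vector, and w(t) is set to 1. It is not the Dreamfusion implementation, only the shape of the idea.

```python
import numpy as np

target = np.array([0.2, -0.5, 0.9])  # the only "image" our pretend diffusion model knows

def predict_noise(noisy_image, a, b):
    """Stand-in for the frozen diffusion model (exact when all data equals `target`)."""
    return (noisy_image - a * target) / b

def sds_gradient(theta, rng):
    x = theta                              # toy generator g(theta) = theta
    a = rng.uniform(0.3, 0.9)              # random noise level (hypothetical range)
    b = np.sqrt(1.0 - a ** 2)
    eps = rng.standard_normal(x.shape)
    z_t = a * x + b * eps                  # noise the "render"
    eps_pred = predict_noise(z_t, a, b)    # forward pass only, no backprop through the model
    w_t = 1.0                              # the weighting w(t), set to 1 here
    return w_t * (eps_pred - eps)          # times dx/dtheta, which is the identity here

rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(500):
    theta = theta - 0.1 * sds_gradient(theta, rng)  # plain gradient descent on theta
```

In this toy, predicted noise minus true noise works out to a positive multiple of (θ - target), so every step pulls θ towards what the "diffusion model" considers a good image.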

NeRF: How do we generate a 3d model from θ?

First of all, let’s understand ray tracing! The main idea behind ray tracing is for every single pixel in our image, we shoot a laser. And if it hits an object, we see it. If it doesn’t hit anything we see nothing. The below image helped me with this intuition

Taken from https://developer.nvidia.com/discover/ray-tracing

Now let’s say we have 3d points μ along such a ray. We put each one through an MLP (a multi-layer perceptron, i.e. a small fully connected neural network) and we get

  1. τ = how opaque that point is
  2. c = RGB of the point.

So essentially, we are mapping the 3d coordinates of points along the camera rays to a color and to how dense that color is.
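To make this concrete, here is a small NumPy sketch of one ray: sample points along it, ask a toy MLP for τ and c at each point, and blend them into a pixel. The blending rule is the standard volume-rendering alpha compositing from the NeRF paper. The network sizes, random weights, and function names are assumptions of mine, and this toy MLP only looks at position.

```python
import numpy as np

def tiny_mlp(points, params):
    """Toy 2-layer MLP: 3D point -> (density tau >= 0, RGB in [0, 1])."""
    w1, b1, w2, b2 = params
    hidden = np.maximum(points @ w1 + b1, 0.0)   # ReLU
    out = hidden @ w2 + b2                       # shape (num_points, 4)
    tau = np.log1p(np.exp(out[:, 0]))            # softplus keeps density non-negative
    rgb = 1.0 / (1.0 + np.exp(-out[:, 1:]))      # sigmoid keeps colors in [0, 1]
    return tau, rgb

def render_ray(tau, rgb, step):
    """Alpha-composite the samples along one ray, front to back."""
    alpha = 1.0 - np.exp(-tau * step)            # how much each segment blocks the ray
    # Fraction of the ray that survives up to each sample.
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = transmittance * alpha
    return weights @ rgb, weights

rng = np.random.default_rng(0)
params = (rng.standard_normal((3, 16)), np.zeros(16),
          rng.standard_normal((16, 4)), np.zeros(4))

origin = np.array([0.0, 0.0, -2.0])              # camera position
direction = np.array([0.0, 0.0, 1.0])            # ray through one pixel
depths = np.linspace(0.5, 3.5, 32)
points = origin + depths[:, None] * direction    # the 3D points mu along the ray

tau, rgb = tiny_mlp(points, params)
pixel, weights = render_ray(tau, rgb, step=depths[1] - depths[0])
```

Every operation here is differentiable, which is what lets the gradient from the diffusion model flow back into the MLP weights θ.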

In the dreamfusion code, during training, it seems to randomize the poses like so

if self.training:
    # random pose on the fly
    poses, dirs, thetas, phis, radius = rand_poses(....

so essentially it gets the 3d coordinates and the angle of the ray.
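For intuition, here is a hedged sketch of what sampling a random camera pose can look like. This is not the repo’s rand_poses; sample_random_pose, its angle ranges, and the OpenGL-style camera convention (the camera looks down its own -z axis) are my own assumptions.

```python
import numpy as np

def sample_random_pose(rng, radius_range=(1.0, 1.5)):
    """Hypothetical helper: random camera on a sphere, looking at the origin."""
    radius = rng.uniform(*radius_range)
    # Polar angle measured from the +y axis, kept away from the poles.
    theta = rng.uniform(np.deg2rad(20.0), np.deg2rad(100.0))
    phi = rng.uniform(0.0, 2.0 * np.pi)                  # azimuth around the y axis
    position = radius * np.array([np.sin(theta) * np.sin(phi),
                                  np.cos(theta),
                                  np.sin(theta) * np.cos(phi)])
    forward = -position / np.linalg.norm(position)       # unit vector towards the origin
    right = np.cross(forward, np.array([0.0, 1.0, 0.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    pose = np.eye(4)                                     # camera-to-world matrix
    pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = right, up, -forward, position
    return pose, theta, phi, radius

pose, theta, phi, radius = sample_random_pose(np.random.default_rng(0))
```

Sampling a fresh pose at every training step means the 3d scene has to look right from every angle, not just one.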

Overall, we want to optimize this function during training so that it can map any point viewed from any angle to a density and an RGB color, so that we can take pictures of our generated 3d object from different angles! The pipeline is shown below

However, this was the technique for the case where we want to generate 3d models from text. Can we use the same technique for generating 3d models from images? Not as is. The main issue is this loss function

the y there is our text conditioning, which forces the model to generate images that match the text. So there is nowhere in this pipeline where we can tell it what kind of image we want the result to look like.

Now, here, can we change it so that condition y takes in an image instead?

That is exactly what our next research paper did!

Zero-1-to-3: Zero-shot One Image to 3D Object

The main idea of this paper is: can we finetune Stable Diffusion models so that they can rotate objects, that is, show the object in an input image from a new camera viewpoint? And the answer is yes!

In more detail, they first encode the input image with a powerful image encoder called CLIP. Then they add in the relative camera rotation and translation, (R, T), and concatenate all of these together. Now we have the image, the desired new camera rotation (R), and the desired new camera translation (T), and we want the model to generate a new image with those parameters.

Now, when I first read this idea, I was very curious whether a dataset for training a model like this already exists. And in fact, it does! It is https://objaverse.allenai.org/, which is an open source project. It seems like this was made by a collaboration of “Allen Institute for AI, University of Washington, Columbia University, Stability AI, LAION, and Caltech”. It makes sense why it is so good and simultaneously open source!
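The conditioning step described above (embed the image, then concatenate the camera change) can be sketched very roughly as follows. Everything here is an assumption for illustration: the random vector stands in for a real CLIP embedding, 768 is an assumed embedding size, build_condition is a made-up helper, and the actual model may well encode the pose differently, for example with fewer numbers.

```python
import numpy as np

def build_condition(image_embedding, rotation, translation):
    """Concatenate an image embedding with a flattened relative camera pose (R, T)."""
    pose_vector = np.concatenate([rotation.reshape(-1), translation.reshape(-1)])
    return np.concatenate([image_embedding, pose_vector])

rng = np.random.default_rng(0)
image_embedding = rng.standard_normal(768)   # stand-in for a CLIP image embedding

angle = np.deg2rad(30.0)                     # rotate the camera 30 degrees around the y axis
rotation = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(angle), 0.0, np.cos(angle)]])
translation = np.array([0.0, 0.0, 0.5])      # and move it by 0.5 along z

condition = build_condition(image_embedding, rotation, translation)
```

This combined vector plays the role that the text embedding y plays in a text-to-image model.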

Overall, in Stable Dreamfusion, while I’m still digesting the code, this model has been combined with NeRFs so that you can generate 3d models from images by replacing the y with our CLIP image embeddings. I’m not sure if Stable Dreamfusion takes advantage of the fact that we can rotate the object in the image. I hope it does since that is an amazing feature.

But overall that’s it for this topic. Hope you all enjoyed this article!
