# Understanding K-Diffusion

Recently, at Huggingface we started a research paper reading group(feel free to join in!) and the first paper was “Elucidating the Design Space of Diffusion-Based Generative Models” otherwise known as k diffusion.

If you have been around diffusion models to any capacity, be it in hugging face, Laion, or Eleuther AI, you may have heard of k diffusion. k-diffusion is a paper by one of the legendary AI researchers Tero Karras and has been implemented by another legend Katherine Crowson(another k!) here.

I have done a series on understanding the math of diffusion models in the past here which should be helpful in understanding this paper. As k-diffusion is fundamentally an improvement on normal diffusion models.

# Prerequisites

One thing to note is this research paper is very math-heavy. I do plan to go through the math but I’ll also try to keep the requirements at a minimum. So far, you may need

- Understanding Probability
- Understanding basic calculus(derivatives/integrals)
- Basics of neural network layers
- Understand gradient descent
- In the later sections, having an understanding of the math of DDPMs will be helpful. Feel free to checkout here for my attempt at explaining it but there are definitely better blogs out there too!

# Diffusion Models

Diffusion models, at their core, are models that start with pure noise and then figure out the amount of noise we should add on every timestep and modify the image by that amount of noise. The common convention is to have x₀ be the image with no noise after all the noise is removed and xₜ be an image with noise t timesteps away from x₀ or x₀ with t timesteps of noise added to it.

Important note. In k diffusion, x₀ is pure noise while xᵢ is i steps denoised from pure noise. Now another point that got me confused was the timestep. i is not the timestep. At i=0, we have the max timestep while as we go down, the timesteps decrease too.

The above in practice works very well. Most of the top research and top image generation tools, like stable diffusion, are still based on this technique. However, there are a few problems

- Diffusion models are based on normal distributions and sampling methods from them. While this does make the math very elegant as you can check here, whether it is the “best” is not known
- The whole base of diffusion models is very tightly coupled. In that, we do not know which part we can modify/keep the same in order to not break the entire model.

So, the authors' idea was why don’t we, for now, forget all the theoretical basis of these diffusion models and think more simply in terms of denoising score matching? Now, what is denoising score matching? And before that, what even is core matching?

# Score Matching

The idea of score matching was first proposed in “Estimation of Non-Normalized Statistical Models by Score Matching”. Now before talking about score matching, let’s talk about what motivated it. This, I learned from here.

# Monte Carlo

Here, let’s say you have some kind of probability distribution like below

Now, how do we figure out what kind of distribution this is by just sampling from it? One idea is why don’t we just sample from this distribution a lot of times? Then we should eventually end up with something like the above by just making a histogram! This is called the Monte Carlo method.

However, what if in this distribution each time we sample, we “change” the distribution(ex. taking candy from a pile, making a move in chess)? Also what if this was not just 2 dimensions but 3 dimensions? Then how many times do we need to sample to find the probability distribution? What about 100 dimensions? Overall, the simple method of Monte Carlo just can’t adapt to a more complicated setting like that.

# Markov Chain

Now, the “change” problem, we need a way to keep track of our current state and also we will like to know by which probability you can go to the next state and so on. A bit like below!

This is called a Markov Chain. But now the question is, how do we update the above from sampling?

# Markov Chain Monte Carlo

Markov Chain Monte Carlo(MCMC) approach is pretty much that. We create a Markov chain and then do Monte Carlo sampling on it. One example of this is a random walk. We can have the person, on every step, record the state he is in and if he does this enough, then we can have a probability mapping between states.

However, for updating these Markov Chains, you might notice that the starting point is pretty important. Also, this update method feels a bit random at the beginning so which means we might need to do a lot of iterations here to make this work. This is where score matching comes in.

# Score Matching

Let’s say we have a variable x that is n dimensions and a real number. We can say this with x ∈ ℝⁿ. And let’s say we have a probability density function, which is a function that can take in x and predict the probability of x, as pₓ(). For the sake of our argument, pₓ() is a model we train like a deep neural network.

Now, let’s have θ ∈ ℝᵐ be the parameters of this model so we can say the probability of x is p(x; θ) which means the probability of x given θ.

Now, let’s say we want to estimate this probability distribution from x. You can see how this is a similar problem to the above because we want to find the distribution, given by θ, using observations, such as x.

Now, let’s call the predicted parameter θ’. Now, let’s consider that we have sampled a lot from the distribution so we can say that given x=1, I observed 5 balls falling into my basket while at x=2 I observed 2 balls falling into my basket. Let’s call this function q(x;θ). Now, if we integrate/sum q(x;θ) over all possible values of x, we can get a constant like so

Z(θ) = ∫ q(x;θ) dx

for all x ∈ ℝⁿ(Thanks Vladimir for pointing out my typo!)

Here now, if we divide q(x;θ) with Z(θ), we get our probability distribution p(x; θ) like so(with ξ instead of x)!

However, you might notice that to do this, we need to get every single x ∈ ℝⁿ or at least approximate it using MCMC. Which is a pretty bad approach, especially in high dimensions,

# Score function

So here, let’s introduce the score function. A score function is defined

The score function is the gradient of the log probabilities for every index in the n-dimensional x or ξ in this case. A gradient here is the derivative of the probability wrt to the input x. Thus, the gradient of the log probability wrt to the input is called the score function.

Now, what the score matching paper proposes is given the probability distribution of the pure data, so the probability distribution that is given by p(x),

if we minimize the distribution between the score function of this distribution of the data and the distribution of the data given θ, then we can say θ models our distribution. We can write this as

This intuitively makes sense. Since if the gradient of the probability is the same for every point x, then yes both distributions are the same. And we also see why this is called score matching since we are trying to match the score of the original function!

Now, why is this formulation better? For getting the distribution of the data ψₓ don’t we still need to go through the entire MCMC? Now, it’s time for some math tricks. For the proof, I will skip it for now, but this is equivalent to

Now the reason this is pretty awesome is that none of the functions above, except for p, depends on the base distribution. In fact, most of it is just parameterized by θ and so can be easily computed. Even the probability in the above term can just be thought of as the occurrence frequency. For example, if we have T observations of x, we can write the J as

This is the same as the integral above as T goes to infinity!

In practice, since we want to minimize J, we can make a loss function whose objective is J and we can just train this network to find the best θ!

# Denoising Score Matching

Now, what is denoising score matching? Denoising score matching is introduced in “A Connection Between Score Matching and Denoising Autoencoders”. The paper here starts with a discussion of Denoising Autoencoders.

# Denoising Autoencoders(DAE)

First of all, autoencoders are models where you put in your input data and get it back. Like the below

Here, DAEs are versions of these autoencoders where we denoise our input as so

Now, this denoising task is a pretty hard task because you need to understand the concept behind the noise and then solve the puzzle of getting the original back.

For this paper, we express a noisy image like below

which basically means that we just apply some gaussian noise to the initial input image with standard deviation σ. We also say that x is in a distribution Dₙ which is the distribution of our data like so x ∈ Dₙ

Now, in this context, let’s think about score matching

# Score Matching

Remember how we defined p using Z and q?

Now, we have a slightly different formulation like

Now, E here is called the Energy. When you think about it, if the energy here is high it has a lower probability, and if the energy is low, it has a higher probability. It sounds a bit like how a high-speed ball is hard to pinpoint at position x while a slow-moving one is easier to localize.

Now, let’s rewrite our score-matching objective, given the ground truth distribution of q as

Now, remember here that this is exactly the same as

it’s just a different notation! Now, we can slightly reformulate this as

So now everything is in terms of the noisy x. Interestingly, the above formula and this formula are equivalent(where all the xs are noisy). The proof is pretty interesting but I recommend checking it out yourself! I might include it if I can shorten it.

Then now, let’s say we have a sample pair

where

is the noisy version of x. Here, let’s assume that the underlying distribution is an isotropic normal distribution of standard deviation σI given by

Now I was confused by what isotropic meant too but this nice answer explains it very well.

Here, the probability of x and the noisy version being observed can be given by

From the simple idea that the probability of x and the noisy x being observed is the probability of observing x multiplied by the probability of observing noisy x given x. Just note, for the probability of x, x is not sampled by a normal distribution but instead from the dataset. That’s why we have q₀. Now for the probability of seeing the noisy x given x, it is pretty much the same as the probability of seeing a point in a normal distribution. Now given the formula for a normal distribution(here we will have mean 0),

Let us consider what

is. Now, the input to the function f above will be

And we will also do a log so we get

Note, the mean is 0. So, if we take the derivative wrt to noisy x, we end up with

This gives a pretty nice formulation where the derivative is pushing the noisy image toward x.

Now, why this is nice is we do not need to do the second derivative thing anymore like

instead, we got the ground truth distribution! So we can just plug into

Where the score function is the same thing but for our model. You can see an example of this on page 7 of here.

# Back to K diffusion

Now we know that K-diffusion plans to use score matching in a way to make denoising more practical. However, that is not the only contribution of this paper, in addition to that, they have 2 more contributions. But let’s discover these as we go along.

# Re-defining diffusion

Now, let’s completely forget how ddpms work. In the context of above, let’s think of

which is the distribution of data. Here, let’s consider the idea of adding Gaussian noise σ to this distribution. Let’s denote the distribution of applying Gaussian noise at a certain σ to x as p(x; σ). If we add high levels of Gaussian noise,

We can assume that this will pretty much end up being pure Gaussian noise. Now let’s formulate diffusion simply as denoising from pure noise, a sequence of σs such that we end up with no noise like so

so at the 0th index, we will encounter max noise and as we go on we keep decreasing noise until we reach N where there is no noise. In practice, we keep the index 0 to be the clean image while large indices like T as where the image becomes noisier.

Now, what the authors propose next is, given this basic idea, why don’t we find the differential equation of x with respect to time?

# Ordinal Differential Equation(ODE)

Now, let’s say that σ is some function of time. For example, one idea is having the noise level be

which has some

basic ideas like in the beginning, adding a lot of noise while later on gradually relaxing. However, let’s just stay general. Then, we can get dx(the derivative of data) wrt dt(the derivative of time) as so

This equation was based on the research from “Score-based Generative Modeling through Stochastic Differential Equations”. I’ll fill in the details of this derivation later but will just work with intuition for now. This is mainly because the idea seems pretty complicated and might distract us from our current idea for k diffusion. Feel free to go check the paper!

# Basic intuition

In the above equation, we can say that the derivative of x with respect to time is the negative multiple of

- derivative of the noise wrt to time
- the current noise
- the score which is the gradient of the log probabilities of x given the noise ratio.

Now for 1 and 2, I understood it as just taking the derivative of a σ² term somewhere since the variance is the thing that is increasing on every timestep. Next, we have the gradient of the probability of x given the current noise level wrt to x.

Now, with all this together and the negative sign in the beginning, it kind of resembles gradient descent. In that, for every push in the timestep(dt) we are nudging x towards decreasing the probability of x given the new noise level as we are destroying more information in x as time increases. As the paper put it, “Intuitively, an infinitesimal forward step of this ODE nudges the sample away from the data, at a rate that depends on the change in noise level. Equivalently, a backward step nudges the sample towards the data distribution.”

# Revisiting Denoising Score Matching

Now, remember the idea of given noisy x and x and their probability below,

We ended up with

In the same way, given some denoiser function

We can kind of define the network for like so

where all the cs are just coefficients and F is the network,

we can formulate the objective we want as

Then we can just get the score

which will be the ground truth distribution! The idea is pretty much exactly the same as

where we are pushing the noisy image, x, towards the denoised image D(x; σ). Now, the only part missing is the score part below and we can start training!

However, in this portion, this is just used for the sampling below

# Time-dependent signal scaling

Now, let’s say we want to scale the input x at time t by a function s(t). This can be accomplished by

One interesting thing to note is that for the score part, we divide x by s(t). This is because we want to keep the score modular from the scaling idea so we don’t need to do extra work!

Now, how do we solve this ODE? The first step suggested is to insert

into the above so we don’t need to deal with the gradient. Then, we can use integration schemes(like Euler or Runge-Kutta). Now, what are these integration schemes?

# Integration Schemes

## Euler

Euler’s method is very simple. We start with the initial value of x(the input image). Then, we choose a h that is very small to approximate dt. Then, for every step, we add dx/dt times h to x. You can kind of see how this can approximate x at timestep t by integrating by many rectangles

## Runge-Kutta

Runge-Kutta’s method is just a more complicated and accurate version of Euler where given step-size h, and we are trying to estimate yₙ₊₁ at timestep tₙ₊₁, we have

which looks complicated but surprisingly it works. If you ever dealt with physics that is too hard to integrate, I’m sure you encountered this method before.

# Back to K diffusion

Now, this ends with one of the 3 contributions of k diffusion. One thing the author stressed was that while they presented some puzzle pieces here like the scaling, scheduling, and choice of integration scheme, it can be independent in their formulation and is not required. This work is modular and a toolbox and is up to us to decide what to change and keep.

# Improvements to deterministic sampling

The k diffusion paper starts with a hypothesis for the sampling. The choice of sampling is “largely independent of the other components, such as network architecture and training details” so theoretically, you can use stable diffusion that is not trained for k diffusion but still use their sampler. In fact, they consider the denoiser, D, to be a black box.

# How do we sample from a black box?

Let’s say we have an ODE. When we solve it/integrate it, we are necessarily doing an approximation. The authors call this error the “truncation error”. This truncation error tends to increase by h, the step size, to some power. So it’s best if we keep h small by increasing the number of steps.

For Euler’s method, as we saw above, we just need to evaluate the gradient once per step and we were good to go. But the error is large

For Runge-Kutta’s method, we can evaluate without much error but we need to evaluate the gradient 4 times per step.

What the authors found to be a good middle ground was a method called Heun’s 2nd order method which evaluates the gradient just twice per step.

The authors used Heun’s 2nd order method everywhere except σ=0 as it breaks then, where they use Euler’s method. Here, all these ideas can be put in pseudo code like so!

Now, you may notice that there are timesteps shown above. Given by tᵢ as timestep at step i. How should we determine which timestep to go to next? Should they be constant like in stable diffusion?

What the authors found was interesting. The step size between timesteps “should decrease monotonically with decreasing σ and it does not need to vary on a per-sample basis”. This means that the timesteps and noise level should increase the most in the beginning at index 0, which is pure noise, and then decrease as we become more confident about what kind of image we want which does make sense intuitively. Here’s a visualization of this

One note the σ above is not the timestep but the noise level. At max timestep, we are at the left. The σ at max timestep will be the minimum noise level so σₜ < σ₀

Now they used the notation of

I wasn’t entirely sure on what this was but then I remembered we have the source code!

In crowsonkb’s sample_euler function, we have

`gamma = min(s_churn / (len(sigmas) - 1), 2 ** 0.5 - 1) if s_tmin <= sigmas[i] <= s_tmax else 0.`

sigma_hat = sigmas[i] * (gamma + 1)

if gamma > 0:

x = x + eps * (sigma_hat ** 2 - sigmas[i] ** 2) ** 0.5

dt = sigmas[i + 1] - sigma_hat

x = x + d * dt

Now, there is a bit to digest here. First of all, we can ignore churn and thus gamma for now. The main idea behind them is to make the outputs more stochastic by adding noise. Then, what we find is that dt, at least in the above algorithm is just the difference in noise levels! So curiously having the timesteps corresponding to the noise levels seems to be the strategy. I found that this is explained later on!

Now, if we have

We can define the noise level at the index i as

for some A, B, and ρ. This, with the above, gives

Here ρ is a very interesting parameter. At ρ=3 the truncation errors(the error of the integration scheme) equalize for every index. Now, what happens if we increase ρ? I tested on desmos.

This is at ρ=3

At ρ=7

It becomes steeper. What the authors found was that this steeper noise schedule performed better. This does give us a bit of insight into how diffusion models like to make images. Diffusion models like to first decide on “what” the image is with rough detail in the earlier indices and then in the later indices, refine that. For the k-diffusion repository and this paper, the respective authors set ρ=7.

# Planning σ(t) and s(t)

Now, you may have noticed that terms of

and

And you may have thought, it will be nice if we didn’t have to worry about these. Now you are in luck since that’s exactly what our authors and the paper they got inspiration from called DDIM thought. The formulation is

σ(t)=t and s(t)=1

This means that respectively, the above derivates are 1 and 0. Then the derivative simplifies from this slight nightmare to

the simple

And also this solves are previous puzzlement about why the noise level is t. The answer is because then the math is easier! In the DDIM paper, they may have mentioned why the above is the best formulation but I completely missed it. We may discover why this is later.

Now, here, I was a bit confused since at timestep 0, we will be getting 0 noise, and at timestep 50 we get 50 as noise. The point of confusion I had was confusing indices with timesteps. At index 0, we have the max timestep and then we keep decreasing the timestep as we step through the indices until we reach timestep=σ=0.

I was a bit confused before figuring this out so let me know if there are any mistakes about this anywhere!

Now, let’s graph the solution to dx/dt

This is mainly the authors comparing methods. We start, at timestep 0(yes timestep not index) with the init x of -1 and 1, and check how the gradient changes over time. One thing the authors found was that in the first two methods proposed by “Score-based generative modeling through stochastic differential equations”, the gradient either becomes just a horizontal line or is always curved. However, in their method/DDIMs, the curve is in fact a straight line!

Now, why is this important? It’s important to remember what exactly we do with dx/dt. Once we get it in the sample above, we use dx/dt to predict the x at a lower timestep by multiplying dx/dt by a small step size(at least in Euler). Here, if dx/dt is a flat line, then we will go way off course or if it’s curved, we might get a better chance but still we might drop off the curve like a car going at high speed at a bend. But if we have a straight line, you can go at whatever speed you want, you will end up at your destination which is the ground truth image at timestep 0 which we want.

This straight line can be attributed to our choice of σ(t)=t as intuitively if we remove y amount of noise from x every time we move y timesteps from t, then we are getting a straight line to the ground truth x at timestep 0.

# Stochastic Sampling

One thing the authors noticed was that deterministic sampling curiously leads to worse quality than stochastic sampling which is sampling with some noise. The natural question then is what is the role of this noise?

Let us first generalize this equation

the paper adds two new terms to make an equation

Let’s try to understand these terms at least intuitively

First of all, what is Langevin?

# Langevin

The idea of Langevin dynamics is first found in particles like so

What this says is the force on a particle is a gradient on the energy acting on the particle based on where it is(potential energy), a negative velocity term to say the force to slow down the particle(damping), and then some stochastic term based on the particle’s environment. You might start seeing a connection here.

This idea was applied to generative modeling in “Generative Modeling by Estimating Gradients of the Data Distribution”. The main idea of the paper was that the score function gives “a vector field pointing in the direction where the log data density grows the most” and then we can“produce samples using Langevin dynamics, which approximately works by gradually moving a random initial sample to high density regions along the (estimated) vector field of scores”

We can kind of replace the gradient of the potential energy with our log probability and add a noise function to get below

This is not acceleration but we can see roughly where the inspiration comes from. In each step, we are pushing our image particle towards the location where the probability of the noisy image is closer to 0. Which means we are getting closer to the denoised distribution.

The motivation behind the ϵ coefficients where one is ϵ/2 and the other is square root of ϵ came from the observation in the score function. They first made their score-matching function

Which looks pretty similar to all the other score-matching functions we saw. But then, they observed you can, under certain regularizations, formulate as

We want to minimize I imagined that if we want the same step size for both terms which I’m guessing will get translated above if we have a square root of epsilon for the right term, we will get ϵ/2 which means we can also add ϵ/2 to the left term. This is just intuition with very little math. I’ll fix this up when I get some time!

# Score-Based Generative Modeling Through Stochastic Differential Equations

What the paper “Score-Based Generative Modeling Through Stochastic Differential Equations” mainly did was combine the above paper on Langevin-based dynamics with DDPMs. DDPMs are formulated like

For an explanation of DDPMs, I recommend checking out my article here first. Βs here usually increase from x₀ to xₜ linearly which means that more standard deviations are added at higher indices but on lower indices, the standard deviation is pretty low. But also the coefficient of x becomes larger(close to 1) as we reach the ground truth x₀.

Now, if we combine this with the above

and flip to getting

instead, it kind of makes sense that we get

Here, the s is just the score function(gradient of log probability of x). We can get this by intuitively, just adding both and having 2Β=ϵ. Though this might need some more tricks.

Now here, the paper found we can create an ODE like so

Now, if we look at the DDPM, we can see that we can model

as

where z is some sample from a normal distribution. Then, we get

This is the process of adding noise to the ground truth data.

On the lhs, I need to think more on how exactly where the square root of 1-Β(t) went from below

But I’ll think the general idea of having -1/2 times a step size makes sense.

Now then what is ω? From what I can find this was called the Wiener process. What is a Wiener process?

# Wiener Process

I looked it up here. A Weiner process is defined as

So essentially, every dw will be sampled from a normal distribution which is a pretty awesome property. It’s basically a differential term that is here whenever we want to have a normal distribution in a differential equation!

Now, that was the differential equation for adding noise(forward process). What about the process of removing noise(reverse process)?

For this, we just need to solve the differential equation for the reverse process given by the normal distribution relation we found before

After some calculation, this ends up being something like the function below in the Reverse SDE

Thus, this paper introduced 2 processes. One for noising the data(forward SDE) and one for denoising the data(Reverse SDE)

# Back to K-Diffusion

Now, in k diffusion they slightly modified the above diff equation like so

The intuition is pretty simple. They just changed the step size from Β(t) to 2Β(t)σ(t)². Other than that, it looks pretty much exactly the same except for what the term this paper calls the probability flow which we discussed before.

For the + and — they each have the choice of going backward and forward in time. If we look at this diagram

For the forward process, we want to completely remove the score function. This can be accomplished by having

Then, when we do addition, we are completely removing the score thus allowing the noise to go rampant. But if we keep decreasing Β(t) to even negative values, we are strengthening our denoising/reverse process. In the k diffusion paper, they called Β(t) the “relative rate at which existing noise is replaced with new noise”

# Now why is this stochasticity important?

Now why stochasticity is important, is that the noise and correction mechanism done in the Langevin/score matching paper is very robust. In that, it can add noise and correct it very well while without it, we are trying to denoise by approximating this dynamics which introduces errors when we do it. This is the explanation of the paper and I think it makes sense.

# Overall algorithm

The overall algorithm for this can be shown below

For this algorithm, remember that t=σ(t). Here dω is approximated by ϵ. Now, it’s important to note that the noise injection happens independently of dt. Here, this is done by increasing the timestep and thus noise level by a factor of some γ. Next, we get the noise to move from tᵢ to tᵢ hat. For the intuition on how the square rooted distance between tᵢ² hat to tᵢ² gives the noise injection step, we can get some intuition from the pseudo-code of the score-based paper here

On line 7’s last step, we see the square root of 2 epsilon. I think here, epsilon is just sigma² where beta is a constant 1/2. Then, while not perfectly math-accurate, the distance does slightly make sense. Let me know if anyone can give a proper explanation here!

Now, after the noise is injected, everything else is pretty much the same as algorithm 1.

It seems like in this code, the S churn parameter can be changed to decide how stochastic/how big γ is.

As we can see from the pseudo-code there is an interesting difference where for k diffusion we take the derivative after doing noise injection while in the score matching paper, all the different parts are added at once(gradient is not computed at the noise injected step) which while is more readable doesn’t mean it’s better.

And yes, this sampler got state of the art as we can see below

# Training

Now, how should we train this Denoiser that we have great uses for sampling?

Training directly won’t be good since if we just do x=y+n where x is the noisy image, y is the ground truth and n is the noise, depending on the noise level we are currently on, x can have a very different magnitude from the rest of the training data which is a classic failure case when training neural networks since a lot of the layers will get activated by this one big sample.

So instead of learning this Denoiser directly, can we find a network F from which we can derive our denoiser? One idea is to define the Denoiser like

The main problem here is that any error by the net is amplified by a factor of σ and still x is too large sometimes. To address this issue, this paper came up with

where all the cs are coefficients that modulate given σ. Now in this form, the training loss becomes

inspired from

The main idea for the effective weight is to scale the loss so that high noise levels do not dominate training where λ(σ) is the probability of getting a given σ during training. Now, for the effective weight, it seems like we set it to 1 like below

which is interesting since I am curious where the c out squared came from. Next, for selecting which noise levels to use during training, the paper found that for very high and very low noise levels, it’s hard to properly train. What they found was instead, if you have a log-normal distribution, then you can more easily sample these noise levels and thus get better results!

# Augmentation

One clever detail for the augmentation that I liked in this paper was that they added a parameter to F on how much augmentation the input image had so that during inference, they can set it to 0 to get no augmentation! I am very much looking forward to trying this in my projects.

# Sampling

it seems like for the less diverse datasets, deterministic sampling is better. But on more diverse, stochastic is better.

# Final Words

Thanks for reading! There were a lot of parts I was confused by but hope this helps people get some sense of what K diffusion is