Understanding “Understanding Gradient Descent through the Training Jacobian”
Hi! The main purpose of this blog is just to act as notes for understanding the paper “Understanding Gradient Descent through the Training Jacobian”, which EleutherAI announced on their Discord. I was planning to also go through “Estimating the Probability of Sampling a Trained Neural Network at Random” and “Converting MLPs into Polynomials in Closed Form”, but that would make this blog very long, so I decided to just focus on this paper for now.
The main reason I’m checking out these papers is that I’ve noticed I’ve mostly been reading papers on agent design/prompting recently, and I’m starting to fall short on the mathier, less engineering/application type papers. I felt like reading these more technical papers might help with longer-term, more foundational research.
For those who don’t know EleutherAI, their Discord server is probably where some of the smartest open source researchers gather to discuss AI research in general. They have released models such as GPT-J and Pythia and helped popularize techniques like RoPE, which I think are pretty foundational for current research in LLMs.
Understanding Gradient Descent through the Training Jacobian
The aim of this paper is to understand gradient descent, or in particular, the traits of the parameters that get learnt from gradient descent. Now, first of all, how do the parameters change while doing gradient descent?
The first paper the authors bring up that attempts to tackle this problem is “Measuring the Intrinsic Dimension of Objective Landscapes”.
Measuring the Intrinsic Dimension of Objective Landscapes
In this paper, the authors do gradient descent on a low-dimensional parameter vector θ^(d) of dimension d, generating the full parameters as (roughly)

θ^(D) = θ_0^(D) + P θ^(d)

where D is the original dimension, θ_0^(D) is the initial parameters from a normal distribution, and P is a random D×d matrix whose columns are normalized to unit length.
By only training θ^(d), the authors were able to figure out the minimum dimension d needed to learn a particular task. My favorite quote from this paper is “solving the inverted pendulum problem is 100 times easier than classifying digits from MNIST, and playing Atari Pong from pixels is about as hard as classifying CIFAR-10”.
So one interesting takeaway here is that there may be a much lower-dimensional subspace that is enough to solve a task, and the rest of the dimensions may not be doing much.
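To make this concrete, here is a minimal sketch of this kind of subspace training in JAX (my own toy, not the paper’s code; the sizes D and d and the loss are placeholders):

```python
import jax
import jax.numpy as jnp

D, d = 1000, 50                          # full and "intrinsic" dimension (made-up sizes)
k1, k2 = jax.random.split(jax.random.PRNGKey(0))

theta0 = jax.random.normal(k1, (D,))     # frozen random initialization theta_0^(D)
P = jax.random.normal(k2, (D, d))
P = P / jnp.linalg.norm(P, axis=0)       # columns normalized to unit length

def full_params(z):
    # theta^(D) = theta_0^(D) + P z, where only z (dimension d) is trained
    return theta0 + P @ z

def loss(z):
    # placeholder objective; a real task would run a network built from full_params(z)
    return jnp.sum(full_params(z) ** 2)

z = jnp.zeros(d)
lr = 1e-2
for _ in range(100):
    z = z - lr * jax.grad(loss)(z)       # gradient descent only in the d-dim subspace
```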
Back to Understanding Gradient Descent through the Training Jacobian
In more recent research, there seems to be interest in comparing the final parameters to the initial ones. What most of this research finds is that the change is low rank, which means it can be expressed as some D×d matrix (like the P above) times a d-dimensional update.
One interesting use of this that the authors point to is “ReLoRA: High-Rank Training Through Low-Rank Updates”, which exploits the fact that during pretraining “updates are especially low-rank across relatively short (1K-2K step) segments of training”.
ReLoRA: High-Rank Training Through Low-Rank Updates
Here the question is: how can we exploit the above fact if training is not that low rank all the way through? The answer is to just train a new LoRA for every phase and merge it in! Then, for N phases of LoRA across training, the weight update becomes something like

W_N = W_0 + Σ_{i=1}^{N} B_i A_i

where each B_i A_i is the low-rank update learned during phase i.
Now, when we consider how the optimizer should work here, it gets pretty complicated (because Adam carries momentum and second-moment state across phases), so at every new phase the optimizer state is (partially) reset and the learning rate is re-warmed from 0, giving a jagged schedule roughly like the sketch below.
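Here is a rough sketch of what such a jagged schedule could look like (my own reconstruction, not ReLoRA’s exact code; phase_len and warmup are made-up numbers):

```python
def jagged_lr(step, base_lr=1e-3, phase_len=2000, warmup=100):
    # at the start of every phase the learning rate restarts at 0 and warms back up
    step_in_phase = step % phase_len
    if step_in_phase < warmup:
        return base_lr * step_in_phase / warmup   # linear re-warmup after each reset
    return base_lr                                # flat here (a real schedule would also decay)

# e.g. jagged_lr(0) == 0.0, jagged_lr(50) == 5e-4, jagged_lr(2000) == 0.0 again
```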
This is a pretty interesting technique overall.
Back to Understanding Gradient Descent through the Training Jacobian
Now given this, what does this paper do? The main contributions seem to be
- Compute the training Jacobian
- Do SVD on this matrix. This has the interesting property that we get bases for both the initial parameters and the final parameters in terms of singular vectors, which means we can map between them/get a (first-order) function that takes in initial values and outputs the trained values!
- Analyze the singular value spectrum
Now first of all, what is even the Training Jacobian?
Training Jacobian
In this paper this is described as the Jacobian of the final trained parameters with respect to the initial parameters. So given the number of parameters N, this is an N×N matrix whose (i, j) entry is the derivative

∂(final parameter i) / ∂(initial parameter j)
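To make the definition concrete, here is a minimal JAX sketch of computing a training Jacobian by differentiating through the whole training loop (my own toy setup, a tiny 1-16-1 MLP trained with plain full-batch gradient descent, not the paper’s configuration):

```python
import jax
import jax.numpy as jnp

X = jnp.linspace(-1.0, 1.0, 32).reshape(-1, 1)   # toy 1-d regression data
Y = jnp.sin(3.0 * X)

def model(params, x):
    w1, b1, w2, b2 = params
    h = jnp.tanh(x @ w1 + b1)
    return h @ w2 + b2

def loss(params, x, y):
    return jnp.mean((model(params, x) - y) ** 2)

def unflatten(theta):
    # theta is the flat parameter vector; shapes hard-coded for a 1-16-1 MLP (49 params)
    w1 = theta[:16].reshape(1, 16)
    b1 = theta[16:32]
    w2 = theta[32:48].reshape(16, 1)
    b2 = theta[48:49]
    return (w1, b1, w2, b2)

def train(theta0, steps=200, lr=0.1):
    # f: initial flat parameters -> final flat parameters
    def step(theta, _):
        g = jax.grad(lambda t: loss(unflatten(t), X, Y))(theta)
        return theta - lr * g, None
    theta_final, _ = jax.lax.scan(step, theta0, None, length=steps)
    return theta_final

theta0 = 0.1 * jax.random.normal(jax.random.PRNGKey(0), (49,))
J = jax.jacobian(train)(theta0)    # the 49x49 training Jacobian d(theta_final)/d(theta_0)
U, S, Vt = jnp.linalg.svd(J)       # its singular value spectrum and singular vectors
```

Of course, brute-forcing jax.jacobian through an entire training run only works for very small models, which is presumably part of why the paper sticks to a small MLP.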
Now then, how is this even useful? The main idea is that it shows up in a Taylor expansion! Let’s say f is the function that takes the initial parameters θ_0 and outputs the trained parameters. Then we can approximate it as

f(θ) ≈ f(θ_0) + J(θ_0)(θ − θ_0)

for all θ that are a small distance away from θ_0. Now, let’s imagine a small ball/sphere around θ_0. If we put all the points in that sphere into f, what will be the shape that gets outputted?
First of all, we know the center is at f(θ_0). Then, looking at the outputs J(θ_0)(θ − θ_0) for all the points in the ball, we get (to first order) an ellipsoid. So where do the principal axes of this ellipsoid come from? From the SVD of the Jacobian. In SVD, the decomposition is

J = U Σ V^*
where U and V are both orthonormal matrices while Σ is a diagonal matrix with the largest singular values at the top-left and the smallest at the bottom-right. The columns of U are called the left singular vectors while the columns of V are called the right singular vectors.
V^*’s role is to rotate the incoming input (θ − θ_0) into the singular basis, Σ’s role is to stretch or shrink each coordinate by the corresponding singular value, and U’s role is to rotate the result Σ V^*(θ − θ_0) into the output space. Since the biggest singular values sit at the top-left of Σ, the first column of U is the direction of the biggest deviation, the second column the 2nd biggest, and so on!
So, going back to the question of what a small ball/sphere around θ_0 maps to: the center is at f(θ_0), the first principal axis of the ellipsoid points along the first column of U (with length proportional to σ_1), the 2nd along the 2nd column of U, and so on.
As a side note, for PCA we use the right singular vectors, since there we are interested in the directions of the input space along which the data is stretched!
Now let’s say our loss function is the quadratic

L(θ) = ½ θ^T H θ

where H is a symmetric matrix (the Hessian of this loss).
And let’s say we are in the limit where the learning rate goes to 0 while the number of training steps goes to infinity, i.e. the gradient flow limit. I think the authors mention this mainly because we are about to work with an ODE rather than discrete steps. In this limit, the parameters evolve with training time t as

dθ/dt = −∇L(θ) = −Hθ
Solving this linear ODE, the output of f at time t is

θ(t) = e^{−Ht} θ_0

so the training Jacobian in this setting is J_t(θ_0) = ∂θ(t)/∂θ_0 = e^{−Ht}.
For how we got here, I calculated it roughly like so.
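Assuming H is symmetric, it has an eigendecomposition H = QΛQ^T with Q orthogonal, so we can work in that eigenbasis:

z := Q^T θ  ⇒  dz/dt = −Q^T Hθ = −Λz  ⇒  z_i(t) = e^{−λ_i t} z_i(0)

θ(t) = Q z(t) = Q e^{−Λt} Q^T θ_0 = e^{−Ht} θ_0  ⇒  J_t(θ_0) = ∂θ(t)/∂θ_0 = e^{−Ht}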
In this training setup, when H is positive definite, we converge to 0 regardless of the starting parameters. The dimensionality of training is N, the full dimension of θ, since no lower-dimensional subspace can contain the parameter change from every single possible starting point θ_0 to 0.
However, in neural networks the Hessian is apparently usually quite singular, meaning it has many 0 eigenvalues. What happens then, in the above example, is that along the eigenvectors with 0 eigenvalue we get exp(0·t) = 1, so θ(t) keeps exactly the same component in those directions as θ_0. As the authors say, training does not happen in these directions where the eigenvalue is 0. In this example, H has rank equal to rank(J_t(θ_0) − I), because the 0 eigenvalues of H become eigenvalues of exactly 1 in J_t = e^{−Ht}, which get cancelled when we subtract I. I think I could put an even nicer explanation here eventually using the eigendecomposition, but I think the above gives a bit of intuition.
Now then, what does rank(J_t(θ_0) − I) measure? If I understand right, it counts the directions in which training actually moves, and N − rank(J_t(θ_0) − I) gives the dimension of the subspace where the eigenvalues of H are 0, i.e. the directions in which the ball does not change at all!
Now let’s look at the experiments!
Experiments
The training and differentiation were done in JAX, which I think is primarily used when working with TPUs (though I’ve heard people who use it like it more than PyTorch).
The network is an MLP with one hidden layer of 64 units, trained on the UCI digits dataset, which looks very similar to MNIST.
Now, let’s take a look at the training jacobian!
Analysis of Jacobian
After doing the SVD of the Jacobian, one interesting trait was that most of the singular values are 1!
The authors then checked how similar the corresponding left and right singular vectors are using cosine similarity. Note that this is only possible because the input and the output have the same dimension!
The main interesting point is that where the singular value is 1, the cosine similarity is pretty much 1 too, which means the left and right singular vectors are essentially the same there, but wherever the singular values deviate from 1, the cosine similarity drops.
The authors call this region, where the singular value is 1 and the cosine similarity is pretty much 1, the “bulk subspace”. One of the main points about this subspace is that on it, the map from input to output is basically the identity! To see this, suppose the left and right singular vectors there are exactly the same, so U = V on that block. These matrices are orthonormal, so U U^* = I, and since the corresponding block of Σ is also the identity (all those singular values are 1), this part of the map doesn’t change anything from input to output. The only real change comes from the large singular values at the top of the spectrum and the small ones at the tail.
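Spelling this out in one line (my own phrasing of the argument): on the bulk block, where Σ_bulk ≈ I and U_bulk ≈ V_bulk,

J_bulk = U_bulk Σ_bulk V_bulk^* ≈ U_bulk U_bulk^*

which is just the orthogonal projection onto span(U_bulk), i.e. the identity map on that subspace.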
Now then where do the first 500 singular values come from? What is causing them?
These first ~500 singular directions act to magnify/perturb the initial values, and the authors call them the “chaotic subspace”. The authors argue this comes from the objective being non-convex: if the objective were convex, they argue, all the singular values here would be at most one!
Why are the singular values of this Jacobian at most one if the loss is convex?
I couldn’t find a detailed proof for this but I did find some intuitive reasoning that I liked, so I’ll put it here.
Let’s say we have the set of all possible initial parameters, which we’ll call θ_0s. When we train on a convex objective with a single global minimum, for any two points in θ_0s, the distance between them decreases (or at least does not grow) once we apply f to those points.
Since our Jacobian is the matrix of partial derivatives of the final trained parameters with respect to the initial parameters, its singular values can’t be larger than one, as that would mean expanding the distance between some nearby points, which can never happen here. In other words, the norm of

J(θ_0)(θ − θ_0) ≈ f(θ) − f(θ_0)

is always smaller than or equal to the norm of θ − θ_0. Another way to think of this is that the influence of the initial values on the final trained values shrinks as training goes on! In fact, if we run training on a strongly convex function all the way to convergence, the Jacobian becomes all 0s, since the initial parameters then have no influence on the final parameters at all. The idea is that the more gradient descent steps we do, the smaller this influence becomes.
So, overall, singular values larger than one can only occur because the loss function is non-convex!
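This also ties back to the quadratic example from earlier (my own connection, under the same gradient flow assumptions): there

J_t(θ_0) = e^{−Ht}, with singular values e^{−λ_i t} for the eigenvalues λ_i of H

so every singular value is at most 1 exactly when all λ_i ≥ 0 (a convex quadratic), and a singular value can only exceed 1 if some λ_i < 0, i.e. the loss has a direction of negative curvature.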
Now, one caveat here is that we need a reasonable learning rate. Here the authors mention “Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability”, which explores some interesting mechanisms of how training becomes unstable depending on the learning rate, and which I’ll cover a bit here.
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
The main finding of this paper is a pretty interesting connection between training instability and the largest eigenvalue of the Hessian, where the Hessian is the second derivative of the loss with respect to the parameters. The authors refer to this largest Hessian eigenvalue as the sharpness. What they found is that, very interestingly, as training goes on the sharpness keeps increasing (with cross-entropy late in training being somewhat of an exception). For a quadratic, once the sharpness goes past 2 over the learning rate, training diverges; in neural networks the sharpness instead tends to rise up to that 2/lr threshold and then hover right around it, the “edge of stability”, while below it training is relatively stable.
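On the toy model from the training-Jacobian sketch above, sharpness could be measured roughly like this (my own sketch, reusing loss, unflatten, train, theta0, X and Y from that block):

```python
# sharpness = largest eigenvalue of the loss Hessian at the (toy) trained parameters
H = jax.hessian(lambda t: loss(unflatten(t), X, Y))(train(theta0))
sharpness = jnp.linalg.eigvalsh(H)[-1]   # eigvalsh returns ascending order; compare to 2/lr
```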
One intuition for this comes from a 1-d quadratic objective

L(x) = (a/2) x²

This has its optimum at x = 0. Notice that this is convex for positive a!

Starting from x_0, with gradient descent step size η, we update x_{t+1} as

x_{t+1} = x_t − η a x_t = (1 − ηa) x_t

Then, unrolling from the initial x_0, we have

x_t = (1 − ηa)^t x_0

So when does this diverge? Exactly when |1 − ηa| > 1, i.e. when a is larger than 2/η, and in that case we get an oscillating divergence where the sign flips every step while the magnitude keeps growing.

Apparently the same thing happens in d dimensions along the eigenvectors of the Hessian: each eigenvalue plays the role of a, so a direction starts diverging once its eigenvalue exceeds 2/η!
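Here is a tiny numerical check of the 2/η threshold (my own toy, with made-up values for a and the learning rate):

```python
lr = 0.1
for a in (15.0, 25.0):                 # the threshold is 2 / lr = 20
    x = 1.0
    for _ in range(30):
        x = x - lr * a * x             # gradient step on L(x) = a * x**2 / 2
    print(f"a = {a}: |x| after 30 steps = {abs(x):.3g}")
# a = 15.0 shrinks towards 0, while a = 25.0 blows up (|1 - lr*a| = 1.5 > 1)
```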
This was pretty interesting to me, as I usually just look at gradient norms, but I never knew the Hessian had such an interesting characteristic! And while I don’t think this fully explains why the oscillation in the loss happens, it does give a bit of intuition, which I like.
Back to Understanding Gradient Descent through the Training Jacobian
One final point about this chaotic subspace: despite the vectors living in a huge ambient dimension, the corresponding left and right singular vectors still have a cosine similarity of around 0.6, which the authors mention is very high (random vectors of this dimension would be nearly orthogonal).
Finally, the authors call the subspace of the last ~750 singular values the “stable subspace”, and say those singular values are so small that the corresponding directions have a pretty negligible impact.
How Linear is Training?
Now, a Taylor expansion like

f(θ) ≈ f(θ_0) + J(θ_0)(θ − θ_0)

usually only works for θ in a small ball around θ_0. Here, the authors tried to see how far they can push it before it breaks down! The method they used was to re-run training from the perturbed initial value

θ_0 + λ v_i

Here, v_i is the i-th right singular vector while λ is a hyperparameter.

Then, from the above Taylor expansion, we expect the change in the output to be roughly λ J(θ_0) v_i = λ σ_i u_i, so when we project it onto the left singular vector u_i we hope to see roughly λ σ_i, where σ_i is the singular value. So the response is defined as

f(θ_0 + λ v_i) − f(θ_0)

projected onto u_i.
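In code, the test might look roughly like this (a hypothetical sketch reusing train, theta0, U, S and Vt from the training-Jacobian sketch above, not the paper’s actual procedure):

```python
def response(i, lam):
    v_i, u_i, sigma_i = Vt[i], U[:, i], S[i]
    # retrain from an initialization perturbed along the i-th right singular vector
    delta = train(theta0 + lam * v_i) - train(theta0)
    return u_i @ delta, lam * sigma_i      # measured response vs. linear prediction

measured, predicted = response(i=5, lam=1e-2)
```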
First of all, in the bulk dimensions, even for λ ranging from 10^-3 to 10^3, the measured responses (the blue line in the paper’s figure) match the linear prediction pretty much perfectly.
The orange line tests the same perturbation but projects onto directions orthogonal to u_i, to see whether the output leaks into them. Ideally this should be 0, and in the bulk it is indeed pretty close to 0 (remember this is log scale).
Now for the chaotic region, it breaks down very fast as you can see in the second figure.
Does the Bulk Matter?
Here, the authors checked whether the bulk dimensions even matter for the final trained network g (with input x and parameters θ), i.e. whether perturbing the parameters along them affects the predictions in any way. For this, they compute the KL divergence between the network’s predictions before and after perturbing the trained parameters along each right singular vector.
Very interestingly, on the test set these bulk perturbations had essentially no effect on the KL divergence, while perturbations along the chaotic and stable directions had a significant effect. However, when feeding in white noise or inverted-color images (out-of-distribution data), the bulk perturbations had pretty much the same effect as all the other singular directions!
As the paper puts it nicely, “These results make it fairly intuitive why the bulk subspace is left virtually unchanged by SGD: it simply does not affect the network’s output on the training data distribution.”
Relationship with Parameter-Function Jacobian
The authors introduce another Jacobian, the parameter-function Jacobian: the Jacobian of the final network’s outputs with respect to the network parameters, evaluated at the trained parameters. The authors call this Jacobian the PFJ.
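On the toy model from the earlier sketch, the PFJ could be computed roughly like this (again my own sketch, reusing model, unflatten, train, theta0 and X, not the paper’s code):

```python
def outputs(theta, x):
    return model(unflatten(theta), x).ravel()     # network outputs as one flat vector

theta_final = train(theta0)
PFJ = jax.jacobian(outputs)(theta_final, X)       # shape (num_outputs, num_params)
S_pfj = jnp.linalg.svd(PFJ, compute_uv=False)     # near-zero values span its (approximate) null space
```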
Here, the authors got the PFJ on the test set and for white noise.
As can be seen in the paper’s figure, for the PFJ on the test set the last 600 or so singular values are pretty much 0. This means that perturbing the parameters along the corresponding directions has essentially no effect on the outputs, no matter how much we add! For white noise inputs we do not see this.
Now let’s take a look at the null space of these PFJs, which is spanned by the smallest singular value directions. When we compare it to the bulk subspace, the cosine similarity between the test-set PFJ’s null space and the bulk is pretty high! For the white-noise PFJ, on the other hand, the similarity is only about what you would get from a random subspace.
So the main idea here seems to be that the bulk maps a lot of the deviations in parameter space to (essentially) zero change in the function, potentially as a way of dealing with the excess dimensions.
Does the bulk change based on seeds and labels?
The authors trained the network with different seeds and with shuffled labels, but no: the bulk seems to consistently appear and to be consistently similar in cosine similarity. (I skipped over some details of how the cosine similarity between subspaces is computed, so definitely check out the paper if interested!)
SGD in a small subspace
Now, like the intrinsic-dimension paper I mentioned at the beginning, the authors set up training restricted to a subspace, θ = θ_0 + P θ^(n).
However, one interesting difference is that P here is an N×n projection matrix into one of the 3 subspaces (chaotic, bulk, or stable), whose columns are just n of the right singular vectors belonging to that subspace rather than random directions!
Here we find that training in the bulk subspace performs terribly, while the chaotic subspace does very well and the stable subspace does well once the subspace dimension goes above about 300. (I think the paper flips the stable and chaotic labels in this figure, which may be a typo.) But overall, the bulk seems to resist being trained in.
Overall, this work was very interesting in uncovering this very curious bulk subspace, and I’m very curious about its role. From what I understand, it mainly ignores certain deviations in parameter space and leaves the network’s behavior unchanged. In fact, I’m very curious whether the bulk subspace is even necessary: at least from the current paper, it doesn’t seem to contribute to the model at all and just acts as a way of dealing with the excess parameters, at least from my understanding.
But overall, I got to review a lot of math through this paper. I might cover more of Eleuther’s papers, but they definitely do seem more advanced/mathy than the papers I’m used to, except maybe the k-diffusion/DDPM papers.