Understanding PALP: Prompt Aligned Personalization of Text-to-Image Models

Isamu Isozaki
Feb 17, 2024

To understand this paper, let us first talk about the challenges of personalization methods. The main competitors of this paper are techniques like Dreambooth, Custom Diffusion, and Textual Inversion, where we tune a model to adapt to a new concept. So what is the main limitation of these techniques? To answer that, let us first see how these methods are evaluated.

Evaluation

There are two main metrics used to evaluate personalization techniques. The first is image alignment, which measures how close our generated image is to our input images. The second is text alignment, which measures how close our image is to our prompt! For both, we use CLIP, a model by OpenAI trained so that the similarity between a text and an image is highest when they correspond to each other.

So for images, we feed one image into one side and another image into the other side and measure the similarity between the two embeddings. For text, one of those images is simply replaced by text. There has been research on a curious phenomenon called the modality gap, described in “Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning” from Stanford, where text and image latents have a gap between them, which, as I understand it, is why the official repo for these evaluation metrics, “clip score”, multiplies the text alignment by 2.5.
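To make the two metrics concrete, here is a minimal sketch of how both scores reduce to a cosine similarity between embeddings. The function names are my own, the vectors in the usage lines are made-up stand-ins for real CLIP embeddings, and the 2.5 scale with clipping at zero follows the CLIPScore definition as I understand it.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def image_alignment(generated_emb, reference_embs):
    """Mean similarity between one generated image and the input images."""
    sims = [cosine_similarity(generated_emb, ref) for ref in reference_embs]
    return sum(sims) / len(sims)

def text_alignment(image_emb, text_emb, scale=2.5):
    """Rescaled similarity between an image and its prompt, clipped at zero."""
    return scale * max(cosine_similarity(image_emb, text_emb), 0.0)

# Toy vectors standing in for CLIP embeddings (assumption: real ones are
# produced by the CLIP image and text encoders and have hundreds of dims).
generated = [0.6, 0.8]
references = [[0.8, 0.6], [0.6, 0.8]]
prompt_embedding = [0.8, 0.6]

img_score = image_alignment(generated, references)
txt_score = text_alignment(generated, prompt_embedding)
```

In a real evaluation the only change is where the vectors come from; the scoring itself stays this simple.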

Now that we know how these models are evaluated, how do tuning techniques fare in this landscape? Usually, we are left with a tradeoff.

How do traditional techniques do?

Now, the two main techniques that are the “classics” in this space, in my opinion, are Textual Inversion and Dreambooth.

Textual Inversion

Textual Inversion is arguably the first customization method that allowed low-compute customization of diffusion models on just a few images. The technique inverts the given input images into a token in the input text using gradient descent. One minor issue is that training is slow: around 3000 steps, which can mean hours of waiting.

More importantly, while the output image can be similar to the input sample with a Textual Inversion-based technique, we consistently get low text alignment: the output image doesn’t follow the prompt. Here is an evaluation table from the Custom Diffusion paper demonstrating exactly this

Dreambooth

Dreambooth takes arguably the opposite approach to Textual Inversion, and it turned out to be far more successful. The main idea: why not just train the entire model to fit the concept?

And yes, this worked well for a while and, as can be seen from the table, it can be more consistent. However, you may notice that image alignment drops compared to Textual Inversion while text alignment increases, like a “trade-off”. One thing this paper observed is that many techniques just trade image alignment against text alignment and fail to improve both.

However, why exactly is this? In particular, why is it so hard to keep high text alignment with high image alignment/personalization?

Attention Contamination

I don’t know if this is the correct academic term, but the phenomenon that causes this is very simple. The model learns to ignore the prompt the longer it trains! For example, in the textual inversion training script, we try to generate the same image given all of these placeholder prompts:

imagenet_templates_small = [
    "a photo of a {}",
    "a rendering of a {}",
    "a cropped photo of the {}",
    "the photo of a {}",
    "a photo of a clean {}",
    "a photo of a dirty {}",
    "a dark photo of the {}",
    "a photo of my {}",
    "a photo of the cool {}",
    "a close-up photo of a {}",
    "a bright photo of the {}",
    "a cropped photo of a {}",
    "a photo of the {}",
    "a good photo of the {}",
    "a photo of one {}",
    "a close-up photo of the {}",
    "a rendition of the {}",
    "a photo of the clean {}",
    "a rendition of a {}",
    "a photo of a nice {}",
    "a good photo of a {}",
    "a photo of the nice {}",
    "a photo of the small {}",
    "a photo of the weird {}",
    "a photo of the large {}",
    "a photo of a cool {}",
    "a photo of a small {}",
]

So the model is pushed to generate the same image regardless of the prompt!
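Here is a rough sketch of how such templates are typically used during training: every step picks a random template, fills in the placeholder token, and pairs the resulting caption with one of the same few input images. The helper name and the shortened template list are my own, for illustration only.

```python
import random

# A shortened stand-in for the full template list
TEMPLATES = [
    "a photo of a {}",
    "a rendering of a {}",
    "a dark photo of the {}",
    "a photo of the weird {}",
]

def sample_training_prompt(placeholder_token, rng):
    """Pick a random template and fill in the placeholder token."""
    return rng.choice(TEMPLATES).format(placeholder_token)

rng = random.Random(0)  # seeded only so the sketch is repeatable
captions = [sample_training_prompt("<cat-toy>", rng) for _ in range(5)]
# Every caption differs in wording, yet the training target is always one of
# the same few images, so the wording carries no signal the model must use.
```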

The specific way the model does this, most notably in textual inversion, is pretty interesting. There is a technique to look at attention maps per token called DAAM

If we examine the token responsible for representing our concept, we see that it covers pretty much all the foreground subjects, so it ends up dominating the cross-attention of the generation.

This was generated with the prompt ‘A <cat-toy> next to a man with a friend’. Feel free to test it yourself here, or in code it’s just

from daam import trace, set_seed
from diffusers import DiffusionPipeline
from matplotlib import pyplot as plt
import torch

model_id = 'CompVis/stable-diffusion-v1-4'
device = 'cuda'
# Textual inversion embedding to load on top of the base model
repo_id_embeds = "sd-concepts-library/cat-toy"
subject_token = "<cat-toy>"

pipe = DiffusionPipeline.from_pretrained(model_id, use_auth_token=True, torch_dtype=torch.float16, variant='fp16')
pipe = pipe.to(device)
pipe.load_textual_inversion(repo_id_embeds)
prompt = f'A {subject_token} next to a man with a friend'
gen = set_seed(0)  # for reproducibility

with torch.no_grad():
    with trace(pipe) as tc:
        out = pipe(prompt, num_inference_steps=50, generator=gen)
        # Aggregate the cross-attention maps once for the whole generation
        global_heat_map = tc.compute_global_heat_map()
        # Overlay one heat map per word of the prompt on the generated image
        for word in prompt.split():
            word_heat_map = global_heat_map.compute_word_heat_map(word)
            word_heat_map.plot_overlay(out.images[0])
            plt.show()

In addition, it’s pretty well known that textual inversion tokens, at least those posted on the Hugging Face concepts library, have much larger norms than ordinary tokens, which may be causing this “attention corruption” phenomenon (though it can still happen even at small norms).

One of the best demonstrations of this is multi-subject generation, where the concept tokens in a way “fight” over which one takes control of the entire image, with end results like the ones we get when combining the concepts of a cat and a chair. Individually, each concept can generate decent pictures like the ones below (image inputs are from Open Images).

And the more we “personalize”, the worse this issue becomes. Now, this is not a new problem. The main way researchers have gone about “fixing” it is simply to use segmentation maps, as in “FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention”, to restrict the effect of each token’s cross-attention map to a given area.

Another method is an inference trick where for around the first 10% or so of the diffusion steps we just have it generate normal images, and then we add in our textual inversion tokens for details!
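As a rough sketch of that inference trick: decide per denoising step which prompt conditions the model, using the plain prompt for the first chunk of steps and the personalized one afterwards. The function name, the 10% default, and the example prompts are my own assumptions; wiring this into an actual pipeline (swapping the text embeddings between steps) is not shown.

```python
def prompt_for_step(step, num_steps, base_prompt, personalized_prompt,
                    warmup_fraction=0.1):
    """Return the prompt that should condition denoising step `step` (0-based)."""
    switch_step = round(num_steps * warmup_fraction)
    if step < switch_step:
        return base_prompt          # lay out the scene with a normal prompt
    return personalized_prompt      # then bring in the learned token for details

base = "A cat toy next to a man with a friend"
personalized = "A <cat-toy> next to a man with a friend"
schedule = [prompt_for_step(s, 50, base, personalized) for s in range(50)]
```

With 50 steps and the default fraction, the first 5 steps use the plain prompt and the remaining 45 use the personalized one.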

While these methods do work, they are not necessarily automatic: during training, we have to worry about which segmentation map corresponds to which person and whether the segmentation maps are correct. And for non-object concepts like art styles, this whole structure falls apart. In addition, both of these techniques are a bit hacky.

So, one of the main research questions was: can we increase image alignment/personalization without sacrificing text alignment and without having to deal with masks?

And the answer seems to be yes! The authors do this with a two-part loss. First comes image alignment.

Image Alignment

This is nothing new: it’s an in-between method of Textual Inversion and Dreambooth, where Textual Inversion is done while LoRAs are also added to the model. This is trained to output the input image.
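For readers who haven’t met LoRA before, here is a minimal sketch of the general idea in plain Python: the frozen weight W gets a trainable low-rank correction B·A on top. This is the standard LoRA formulation as I understand it, not the paper’s training code, and which layers the authors attach LoRAs to is not something I am asserting here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, alpha):
    """Compute W x + (alpha / r) * B (A x).

    W: frozen weights, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
A = [[1.0, 0.0]]              # rank r = 1
B_init = [[0.0], [0.0]]       # B starts at zero, so training starts from W
B_trained = [[1.0], [2.0]]    # made-up values after some training

y_start = lora_forward(x, W, A, B_init, alpha=2.0)
y_tuned = lora_forward(x, W, A, B_trained, alpha=2.0)
```

Because B starts at zero, the model’s behavior is unchanged at the start of training and only drifts as far as the low-rank update allows.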

In equation form, it’s the below, where G is our network, which predicts the noise added to our noisy image x_t at timestep t. This is mainly how diffusion models work: starting from pure noise, we keep predicting the added noise and removing it until we get an image! If this sounds confusing, I recommend checking out this article first!

Here, it can be derived that the noisy image is given by

where x is the input image and ϵ is the noise. So you can see that at the maximum timestep we have pure noise, and at the minimum timestep (0) we have x!
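A small sketch of the noising step and the training loss described above, assuming the standard DDPM form x_t = sqrt(ᾱ_t)·x + sqrt(1 − ᾱ_t)·ϵ, where ᾱ_t (written `alpha_bar_t` below, my own name) goes from about 1 at timestep 0 to about 0 at the maximum timestep. The paper’s exact notation may differ, and the toy lists stand in for real image tensors.

```python
import math

def add_noise(x, eps, alpha_bar_t):
    """Forward process: x_t = sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * xi + noise * ei for xi, ei in zip(x, eps)]

def denoising_loss(eps_pred, eps):
    """Mean squared error between the predicted noise and the true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

x = [0.5, -1.0, 2.0]    # toy "image" as a flat list of pixel values
eps = [0.1, 0.2, -0.3]  # the noise that was sampled for this training step

almost_clean = add_noise(x, eps, alpha_bar_t=0.999)  # early timestep, close to x
almost_noise = add_noise(x, eps, alpha_bar_t=0.001)  # late timestep, close to eps
```

In training, G would receive the noised sample, the timestep, and the prompt, and `denoising_loss` would compare its output against `eps`.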

However, so far we are still left with the exact same issue with all previous methods where personalization comes at the cost of text alignment. This is where the main contribution of this paper, the prompt alignment, comes in

Prompt Alignment

The main reason prompt alignment is hard is that, given a new arbitrary prompt about a subject, we usually do not have images of that subject matching that prompt.

So the authors thought: why not push the generation of our subject more towards the target prompt? In particular, the authors tried maximizing the probability that the generated image matches the target prompt.

To accomplish this, we first estimate our input image from a random timestep like so!

which I think comes from below
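My reading of this step, as a sketch: rearranging the standard forward equation x_t = sqrt(ᾱ_t)·x + sqrt(1 − ᾱ_t)·ϵ for x, and plugging in the network’s noise prediction in place of the true noise, gives a one-step estimate of the clean image. The names `predict_x0` and `alpha_bar_t` are my own, and the paper’s exact expression may differ.

```python
import math

def add_noise(x, eps, alpha_bar_t):
    """Forward process: x_t = sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * eps."""
    return [math.sqrt(alpha_bar_t) * xi + math.sqrt(1.0 - alpha_bar_t) * ei
            for xi, ei in zip(x, eps)]

def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Invert the forward process: x0_hat = (x_t - sqrt(1 - a) * eps_pred) / sqrt(a)."""
    return [(xt - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
            for xt, e in zip(x_t, eps_pred)]

x = [0.5, -1.0, 2.0]
eps = [0.1, 0.2, -0.3]
x_t = add_noise(x, eps, alpha_bar_t=0.7)

# With a perfect noise prediction the estimate recovers x (up to rounding);
# with an imperfect prediction it is only an approximation of x.
x0_hat = predict_x0(x_t, eps, alpha_bar_t=0.7)
```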

Then, curiously given this input, we calculate below using pre-trained weights

where y^c doesn’t contain our placeholder token. Looking at

Now then, what does this loss do? If our predicted x is far from the text prompt, this loss will be high: given a noised version of our predicted x, the prompt won’t help the model predict the noise well enough. However, if the predicted sample matches the prompt very well, the loss is relatively low!

Thus, this loss gives us guidance on how well our predicted image matches our prompt! As the inspiration for this loss, the authors pointed to DreamFusion, a text-to-3D method built on a diffusion model!

The fundamental idea is very similar: we replace x in the diffusion loss with a “render” of our 3D scene, and if the “render” doesn’t match the text prompt, which we can tell from the diffusion loss, the render gets updated. Otherwise, we know the render is a good fit for the text prompt! For more details, check out here, where I covered DreamFusion in more detail.

Now, we have

Then let us define the current noise as

which is very typical in classifier free guidance!
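For reference, the usual classifier-free guidance combination looks like the sketch below: start from the unconditional noise prediction and move along the direction towards the conditional one. The function name and the guidance scale values are my own choices for illustration; the scales used in the paper are not assumed here.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine predictions: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]  # toy prediction with an empty prompt
eps_cond = [1.0, 3.0]    # toy prediction with the text prompt

# scale 0 ignores the prompt, scale 1 is the plain conditional prediction,
# and larger scales extrapolate past it.
unguided = classifier_free_guidance(eps_uncond, eps_cond, 0.0)
guided = classifier_free_guidance(eps_uncond, eps_cond, 7.5)
```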

Now then, how do we find the gradient? For this, also inspired by DreamFusion, we can calculate it as

The brief intuition here is just the chain rule where

where the middle term was found, at least in DreamFusion, to be unnecessary and gets discarded, so we are left with

Here, one important thing to note is that the x above is the predicted x.
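For reference, here is the DreamFusion-style chain rule written with this post’s symbols. It is a sketch under my own assumptions: θ stands for the parameters being tuned, x̂ for the predicted clean image, x̂_t for its re-noised version, and w(t) for a timestep weighting that absorbs constant factors. The paper’s exact expression, for example which noise estimate is subtracted and how guidance enters, may differ.

```latex
% Full gradient of the diffusion loss via the chain rule
\nabla_\theta \mathcal{L}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,
      \underbrace{\big(G(\hat{x}_t, t, y^c) - \epsilon\big)}_{\text{noise residual}}\;
      \underbrace{\frac{\partial G(\hat{x}_t, t, y^c)}{\partial \hat{x}_t}}_{\text{network Jacobian (middle term)}}\;
      \underbrace{\frac{\partial \hat{x}}{\partial \theta}}_{\text{how } \hat{x} \text{ depends on } \theta}
    \right]

% Dropping the middle term gives the score-distillation form
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\big(G(\hat{x}_t, t, y^c) - \epsilon\big)\,
      \frac{\partial \hat{x}}{\partial \theta}
    \right]
```

Dropping the Jacobian also means we never backpropagate through the pre-trained network itself, which is a large part of why this gradient is cheap to compute.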

With this change, given the prompt “sketch of a cat”, we see that the prompt-aligned model maintains the general

Over saturation

One issue the authors ran into was that the prompt alignment was too strong, so with the above loss they weren’t able to personalize well. To overcome this, they changed this loss

to this loss

My understanding of what the authors are saying is this

  1. In the original loss function, the loss is minimized if we move our predicted x directly to the center of all the embeddings generated from our y^c (the clean prompt). That conflicts with our objective of wanting to have
  2. In the new function, we try to move our sample to the intersection of the clean prompt and the personalized prompt, a bit like

from the paper “Overcoming catastrophic forgetting in neural networks”, a famous work on continual learning that is very applicable here, since we want our model to do two tasks well: image alignment and prompt alignment.

Computational Complexity

One minor note here is for

of the function, we can just use

This does make sense as

so we can save a bit of time!

Evaluation

Now, let’s see the final result of the paper!

The overall conclusion seems to be that PALP gets a pretty impressive improvement in text alignment while keeping image alignment relatively good. It’s also interesting that the user study results differ a bit from the CLIP scores.

Single-shot?

Finally, it seems the authors also adapted this technique to a single-shot, IP-Adapter-like setting, like here. What they did, to my understanding, is just tune the LoRA using one image.

Overall, it seems much better at stylizing, which seems to be their main point: the other techniques/models overfit the input image and ignore the prompt, while this technique doesn’t. However, I’m not sure it’s a fair comparison, since the point of these encoder-based approaches is fast/instant adaptation, while we do not know the running time of this algorithm.

Overall, I’m curious whether this is the current state of the art, but I think it’s a very valuable research direction to explore.
