Generate My Favorite Dog Using Just 6GB with Stable Diffusion!

Isamu Isozaki
4 min read · Oct 12, 2022


The dog picture above isn’t real. It’s an AI-generated dog based on 6 images of my favorite dog, Frida, who is my roommate’s dog. In this article, I’ll walk through how you can get images like this! If you are interested in the theory, check this series out!

Basic Theory

The main idea behind this technique is Textual Inversion. We already have powerful models that generate images from text, so why not learn a new “word” that represents our subject, such as the dog above? The whole model stays frozen; the only thing we train is the embedding for that new word.
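
To make that concrete, here is a minimal PyTorch-style sketch of the idea (my own illustration with made-up names and sizes, not the actual training script):

import torch

# Textual Inversion's core idea: the text encoder, UNet, and VAE all
# stay frozen; the ONLY thing we train is the embedding vector for a
# brand-new token like "<frida>".
vocab_size, embed_dim = 49409, 768       # roughly SD v1's text-encoder sizes
embedding = torch.nn.Embedding(vocab_size, embed_dim)
embedding.weight.requires_grad_(False)   # freeze every existing word

# Pretend "<frida>" was assigned the last id; clone a starting vector
# from a related word (the initializer token, e.g. "dog") and train it.
new_vector = torch.nn.Parameter(embedding.weight[-1].clone())
optimizer = torch.optim.AdamW([new_vector], lr=5e-4)

# Each training step encodes a prompt like "a photo of <frida>" with
# new_vector substituted in, computes the usual denoising loss on one
# of the 6 photos, and updates only new_vector.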

Requirements

This code requires at least 6GB of NVIDIA GPU VRAM. I’ll also attach a Colab notebook here!
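
Once you’ve installed PyTorch in the setup below, you can check how much VRAM you have with a quick snippet like this:

import torch

# Print the name and total VRAM of the first GPU.
props = torch.cuda.get_device_properties(0)
print(props.name, f"{props.total_memory / 1024**3:.1f} GB")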

Get the code

If you don’t have git, install it from here. Then run

git clone https://github.com/isamu-isozaki/diffusers.git

Setup Dependencies

If you don’t have the CUDA toolkit installed, install it from here! I used version 11.3.

If you don’t have Anaconda, install it from here! Then make a new environment like so:

conda create -n fewshot_diffusion -c pytorch pytorch torchvision torchaudio cudatoolkit=11.3 python=3.8.5 -y

Then you can do

conda activate fewshot_diffusion

You can verify that all the packages are installed correctly by running

python

then in the shell,

import torch
torch.cuda.is_available()
exit()

torch.cuda.is_available() should return True. Congrats, you just finished the hardest part! Now just do

cd diffusers

and then

pip install -e .
cd examples/textual_inversion
pip install accelerate transformers==4.23.1 timm fairscale albumentations wandb
git clone https://github.com/salesforce/BLIP.git

Now go here and sign up for Weights & Biases! This lets you watch what your model is logging during training from a website!
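
For context, all of the script’s logging boils down to Weights & Biases calls like these (a generic sketch, not the script’s exact code):

import wandb

wandb.init(project="textual_inversion")   # hypothetical project name
wandb.log({"loss": 0.123, "step": 100})   # shows up as charts on the website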

Then do

wandb login

And copy and paste the key from here.

Now initialize an 🤗Accelerate environment with:

accelerate config

I answered 0, 0, NO, NO, 0, NO.

Now, make an account at Hugging Face and run

pip install huggingface_hub

then accept the model license here, and finally run

huggingface-cli login

Training

Just do

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --train_data_dir="frida" \
  --learnable_property="object" \
  --placeholder_token="<frida>" \
  --initializer_token="dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 \
  --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_frida" \
  --slice_div=2 \
  --adam_epsilon=1e-8 \
  --log_frequency=100 \
  --save_frequency=500 \
  --num_vec_per_token=10 \
  --adam_weight_decay=1e-2 \
  --mixed_precision=fp16 \
  --gradient_checkpointing

and it should start training on the dog images! Go to wandb and you should see the progress! Note that Medium mangles characters when you copy from here (hyphens and quotes get converted), so double-check the flags, especially pretrained_model_name_or_path!

Training parameters

If you want to change the directory it looks at for images, change --train_data_dir.
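
By default it looks for a folder of images next to the script. Mine looked roughly like this (the filenames are just examples):

frida/
  frida1.jpg
  frida2.jpg
  ...
  frida6.jpg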

--initializer_token is what the model initially thinks your object is. In my case, she was a dog, so I put “dog”, but if you have pictures of yourself, you might want to put “man” or “woman”.

For --num_vec_per_token, I recommend starting at 10 and adjusting from there. This is the number of “words” used to describe the object (there’s a conceptual sketch of this parameter after the examples below), so if you have a complicated-looking object, you might want to increase it. But if you increase it too much, you get monsters like

I got this at 15 tokens.

With 1 token, I got images with less detail than my dog. For example,

So there is a tradeoff there. 10 is pretty good. Apart from the puppy above, I got images like

which are not perfect but pretty good!
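
Conceptually, here is what num_vec_per_token changes (a rough sketch of my understanding, not the fork’s actual code):

import torch

num_vec_per_token = 10
embed_dim = 768  # CLIP text-encoder hidden size in SD v1

# Instead of one trainable vector, the placeholder "<frida>" is backed
# by num_vec_per_token vectors, expanded in the prompt as if it were
# that many descriptive words. More vectors means more capacity to fit
# detail, but also more room to overfit into the monsters above.
frida_vectors = torch.nn.Parameter(torch.randn(num_vec_per_token, embed_dim) * 0.01)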

If you have more GPU memory, you can remove --gradient_checkpointing to speed up training!

Removing --mixed_precision=fp16 trains in full precision, which can give cleaner results!

And you can increase --train_batch_size so the model learns faster and more stably!
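
Once training finishes, everything ends up in --output_dir, so generating new images is just loading it like any other pipeline. Here is a sketch assuming the script saves the full pipeline there, as the upstream diffusers example does:

import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline saved to --output_dir at the end of training.
pipe = StableDiffusionPipeline.from_pretrained(
    "textual_inversion_frida", torch_dtype=torch.float16
).to("cuda")

# Use the placeholder token in your prompt like any other word.
image = pipe("a photo of <frida> sitting on the beach").images[0]
image.save("frida_beach.png")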
