Generate My Favorite Dog Using Just 6GB with Stable Diffusion!
The dog picture above isn’t real. It’s an AI-generated dog based on 6 images of my favorite dog, Frida, my roommate’s dog. In this article, I’ll talk about how you can get images like this! If you are interested in the theory, check out this series!
Basic Theory
The main idea behind this technique is Textual Inversion: we already have powerful models that generate images from text, so why not learn a new word that represents our own subject, such as the dog above?
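To make that concrete, here is a minimal sketch of the core idea in Python. This is just an illustration of what the training script does under the hood, not code from the repo; the placeholder token <frida> and the initializer word "dog" mirror the flags used later in this article.
from transformers import CLIPTextModel, CLIPTokenizer

# Load the tokenizer and text encoder that Stable Diffusion v1-4 uses
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")

# Register a brand-new placeholder token and give it a row in the embedding matrix
tokenizer.add_tokens("<frida>")
text_encoder.resize_token_embeddings(len(tokenizer))

# Initialize the new embedding from an existing word, like --initializer_token="dog"
token_embeds = text_encoder.get_input_embeddings().weight.data
new_id = tokenizer.convert_tokens_to_ids("<frida>")
init_id = tokenizer.encode("dog", add_special_tokens=False)[0]
token_embeds[new_id] = token_embeds[init_id]

# Training then freezes everything except this one embedding row and optimizes it
# with the usual diffusion loss on your handful of images.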
Requirements
This code requires at least 6GB of Nvidia GPU VRAM. I’ll also attach a Colaboratory notebook here!
Get the code
If you don’t have git, install git from here. Then do
git clone https://github.com/isamu-isozaki/diffusers.git
Setup Dependencies
If you don’t have the CUDA toolkit installed, install it from here! I used version 11.3.
If you don’t have Anaconda, install it from here! Then make a new environment like so
conda create -n fewshot_diffusion -c pytorch pytorch torchvision torchaudio cudatoolkit=11.3 python=3.8.5 -y
Then you can do
conda activate fewshot_diffusion
You can verify that all the packages are installed correctly by running
python
then, in the Python shell,
import torch
torch.cuda.is_available()
which should return True. Then leave the shell with
exit()
And congrats. You just finished the hardest part! Now just do
cd diffusers
and then
pip install -e .
cd examples/textual_inversion
pip install accelerate transformers==4.23.1 timm fairscale albumentations wandb
git clone https://github.com/salesforce/BLIP.git
Now go here and press sign up! This creates a Weights & Biases (wandb) account, which lets you watch what your model is logging during training on their website!
Then do
wandb login
And copy and paste the key from here.
Now initialize an 🤗Accelerate environment with:
accelerate config
I did 0, 0, NO, NO, 0, NO (in other words, run locally on this machine with a single GPU and no distributed training).
Now, make an account at Hugging Face and do
pip install huggingface_hub
then click agree here to accept the Stable Diffusion model license, and run
huggingface-cli login
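If you want to double-check that the login worked, here is a quick sanity check in Python; it just prints the account name that your stored token belongs to.
from huggingface_hub import whoami

# Prints your Hugging Face username if the saved token is valid
print(whoami()["name"])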
Training
First, put your training images in a folder called frida inside examples/textual_inversion (that’s where --train_data_dir points). Then just do
accelerate launch textual_inversion.py --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" --train_data_dir="frida" --learnable_property="object" --placeholder_token="<frida>" --initializer_token="dog" --resolution=512 --train_batch_size=1 --gradient_accumulation_steps=4 --max_train_steps=3000 --learning_rate=5.0e-04 --scale_lr --lr_scheduler="constant" --lr_warmup_steps=0 --output_dir="textual_inversion_frida" --slice_div=2 --adam_epsilon=1e-8 --log_frequency=100 --save_frequency=500 --num_vec_per_token=10 --adam_weight_decay=1e-2 --mixed_precision=fp16 --gradient_checkpointing
and it should start training on the dog images! Go to wandb and you should see the progress! One warning: copying the command from this page can mangle the dashes, in particular in the v1-4 part of --pretrained_model_name_or_path, so double-check it before running!
Training parameters
If you want to change the directory it looks at for images, change --train_data_dir.
--initializer_token is what the model initially thinks your object is. In my case, Frida is a dog, so I put “dog”, but if you have a picture of yourself, you might want to put “man” or “woman”.
For --num_vec_per_token, I recommend starting at 10 and adjusting up or down from there. This is the number of word embeddings used to describe the object, so if you have a complicated-looking object you might want to increase the number here. But if you increase it too much, you get monsters like
I got this at 15 tokens.
With 1 token, I got images with less detail than my dog. For example,
So there is a tradeoff there. 10 is pretty good. Apart from the puppy above, I got images like
which are not perfect but pretty good!
If you have more GPU memory, you can remove --gradient_checkpointing to speed up training!
--mixed_precision can be removed too for cleaner results!
And you can increase --train_batch_size so the model learns faster and more stably!
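Once training finishes, generating images with your new concept is the fun part. Here is a minimal sketch assuming the script saves a full pipeline to --output_dir at the end of training, which is how the upstream diffusers textual inversion example behaves; check the textual_inversion_frida folder to see exactly what this fork writes out.
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline written to --output_dir after training
pipe = StableDiffusionPipeline.from_pretrained("textual_inversion_frida", torch_dtype=torch.float16).to("cuda")

# The placeholder token stands in for the learned concept in your prompt
prompt = "a photo of <frida> sitting on a beach"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("frida_beach.png")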