The dog picture above isn’t real. It is an AI-generated image based on 6 photos of my favorite dog, Frida, who is my roommate’s dog. In this article, I’ll only cover what you need to do to get images like this! If you are interested in the theory, check this series out!
The technique behind this is Textual Inversion. We already have cool models that generate images from text, so why not learn a new word that represents our subject, such as the dog above?
This code requires an Nvidia GPU with at least 6GB of VRAM. I’ll also attach a Colaboratory notebook here!
Get the code
If you don’t have git, install it from here. Then run
git clone https://github.com/isamu-isozaki/diffusers.git
If you don’t have the CUDA Toolkit installed, install it from here! I used version 11.3.
If you don’t have Anaconda, install it from here! Then make a new environment with
conda create -n fewshot_diffusion -c pytorch pytorch torchvision torchaudio cudatoolkit=11.3 python=3.8.5 -y
Then you can do
conda activate fewshot_diffusion
You can verify that all the packages are installed correctly by starting Python with
python
then in the Python shell,
import torch; torch.cuda.is_available()
should return True. And congrats, you just finished the hardest part! Now just do
pip install -e .
pip install accelerate transformers==4.23.1 timm fairscale albumentations wandb
git clone https://github.com/salesforce/BLIP.git
Now go here and press sign up! This is wandb, which lets you watch what your model is logging during training on a website!
Then copy and paste the key from here.
Now initialize an 🤗Accelerate environment with:
accelerate config
I answered the prompts with 0, 0, NO, NO, 0, NO
Now, make an account at Hugging Face and run
pip install huggingface_hub
then click agree here. Finally, start training with
accelerate launch textual_inversion.py --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" --train_data_dir="frida" --learnable_property="object" --placeholder_token="<frida>" --initializer_token="dog" --resolution=512 --train_batch_size=1 --gradient_accumulation_steps=4 --max_train_steps=3000 --learning_rate=5.0e-04 --scale_lr --lr_scheduler="constant" --lr_warmup_steps=0 --output_dir="textual_inversion_frida" --slice_div=2 --adam_epsilon=1e-8 --log_frequency=100 --save_frequency=500 --num_vec_per_token=10 --adam_weight_decay=1e-2 --mixed_precision=fp16 --gradient_checkpointing
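and it should start training on the dog images! Go to wandb and you should see the progress! Be careful when copying the command: apparently the numbers in --pretrained_model_name_or_path do not always copy correctly, so double-check that the model name reads exactly CompVis/stable-diffusion-v1-4, with plain hyphens.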
If you want to change the directory it looks in for images, change --train_data_dir.
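Before launching, it can help to confirm that the folder you pass to --train_data_dir really contains your images. Here is a minimal sketch using only the Python standard library. The helper name list_training_images and the set of extensions are my own assumptions for illustration; the training script may accept a different set.

```python
from pathlib import Path

# Extensions treated as images here; an assumption for this sketch,
# the training script may accept a different set.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def list_training_images(train_data_dir):
    """Return the sorted image files found directly inside train_data_dir."""
    folder = Path(train_data_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"{folder} is not a directory")
    return sorted(
        path
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


if __name__ == "__main__":
    import sys

    # Defaults to the "frida" folder used in the training command.
    target = sys.argv[1] if len(sys.argv) > 1 else "frida"
    try:
        images = list_training_images(target)
    except FileNotFoundError as error:
        print(error)
    else:
        print(f"Found {len(images)} image(s) in {target}")
        for image in images:
            print(" ", image.name)
```

In my case this should report the 6 photos of Frida.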
initializer_token is what the model initially assumes your object is. In my case, Frida is a dog, so I put “dog”, but if you are using pictures of yourself, you might want to put “man” or “woman”.
For num_vec_per_token, I recommend starting with 10 and adjusting up or down from there. This is the number of word-like tokens used to describe the object, so if you have a complicated-looking object you might want to increase it. But if you increase it too much, you get monsters like
I got this at 15 tokens.
With 1, I get images with less detail than my actual dog. For example,
So there is a tradeoff. 10 works pretty well: apart from the puppy above, I got images like
which are not perfect but pretty good!
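To make num_vec_per_token more concrete: with a value of 10, the placeholder is described by ten learnable embedding vectors rather than one. The toy sketch below shows one way this could work, with each vector starting as a copy of the initializer token’s embedding. It uses plain Python lists instead of the real text encoder. The helper names, the `_i` suffix naming, and the 4-number embedding are all assumptions for illustration, not the repository’s actual code.

```python
def expand_placeholder(placeholder_token, initializer_embedding, num_vec_per_token):
    """Map one placeholder to several sub-tokens, each with its own vector.

    Every sub-token starts from a copy of the initializer embedding, so
    training begins from "dog" and each vector can then drift on its own.
    """
    if num_vec_per_token < 1:
        raise ValueError("num_vec_per_token must be at least 1")
    return {
        f"{placeholder_token}_{i}": list(initializer_embedding)
        for i in range(num_vec_per_token)
    }


def expand_prompt(prompt, placeholder_token, sub_tokens):
    """Replace the placeholder in a prompt with all of its sub-tokens."""
    return prompt.replace(placeholder_token, " ".join(sub_tokens))


if __name__ == "__main__":
    dog_embedding = [0.1, -0.3, 0.7, 0.2]  # made-up stand-in for "dog"
    vectors = expand_placeholder("<frida>", dog_embedding, 3)
    print(expand_prompt("a photo of <frida>", "<frida>", vectors))
```

More vectors give the model more room to store detail, which fits the tradeoff I saw: too few and the dog looks generic, too many and things get weird.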
If you have more GPU memory, you can remove gradient_checkpointing to speed up training!
You can also remove mixed_precision for cleaner results!
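For a feel of why fp16 can be less clean: half precision keeps far fewer significant digits than full precision. The sketch below round-trips a few values through fp16 using Python’s standard struct module (format code "e"). It only illustrates rounding and is not what the training script does internally; the helper name to_fp16 is my own.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


if __name__ == "__main__":
    # Compare each value with what survives a trip through half precision.
    for value in (5.0e-04, 1.0001, 0.1):
        rounded = to_fp16(value)
        print(f"{value!r} -> {rounded!r} (error {abs(value - rounded):.2e})")
```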
And you can increase train_batch_size so the model learns faster and more stably!
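One thing to keep in mind when changing train_batch_size: the training command passes --scale_lr. In the upstream 🤗 diffusers textual inversion script, that flag multiplies the learning rate by the batch size, the gradient accumulation steps, and the number of processes. Assuming this code keeps that behavior, the effective values can be sketched like this (the function names are my own):

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Number of images that contribute to each optimizer step."""
    return train_batch_size * gradient_accumulation_steps * num_processes


def scaled_learning_rate(learning_rate, train_batch_size,
                         gradient_accumulation_steps, num_processes=1):
    """Learning rate after --scale_lr, assuming upstream diffusers behavior."""
    return learning_rate * effective_batch_size(
        train_batch_size, gradient_accumulation_steps, num_processes
    )


if __name__ == "__main__":
    # Values from the training command: batch size 1, 4 accumulation steps, 1 GPU.
    print(effective_batch_size(1, 4))
    print(scaled_learning_rate(5.0e-04, 1, 4))
```

Under this assumption, going from a batch size of 1 to 2 with everything else unchanged also doubles the learning rate, so you may want to lower gradient_accumulation_steps or learning_rate to compensate.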