Thanks for the response!

Jun 27, 2023

Thanks for the response! The reason for why it works mainly goes into receptive field for the vae. When you train vaes, one thing that ppl found was that compressing the latent less is way better at reconstructing. F4 vaes means we encode a 4x4 patch into one pixel in vae latent. In general the rule of thumb is lower number next to f is better. Sd is f8 which is right around the limit of getting good images. The intuitive idea here is that each pixel of vae just focuses on reconstructing the area it is responsible for which is basically how it works in practice. So, the latent of the image roughly corresponds, at least spatially, to a low res image. Let me know if that makes sense!

Written by Isamu Isozaki

Responses (1)