Generative Models Overview
The field of generative modeling has evolved through three major paradigms, each with distinct strengths and weaknesses. Understanding this landscape helps explain why diffusion models have become the dominant approach for high-quality image generation.
Three Main Approaches
| Model Type | Key Idea | Pros/Cons |
|---|---|---|
| GANs (2014) | Adversarial training: generator vs discriminator | High quality but unstable training, mode collapse |
| VAEs (2013) | Learn latent space, optimize variational bound | Stable but blurry samples |
| Diffusion (2020+) | Iterative denoising process | High quality, stable training, but slow sampling |
GANs: Adversarial Training
Generative Adversarial Networks (Goodfellow et al., 2014) introduced a game-theoretic approach:
- Generator: Creates fake samples to fool the discriminator
- Discriminator: Distinguishes real from fake samples
- Training: Minimax game between the two networks (the objective is given below)
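The minimax game is the value function from Goodfellow et al. (2014): the discriminator D maximizes it while the generator G minimizes it.

$$\min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$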
Challenges:
- Training instability (requires careful hyperparameter tuning)
- Mode collapse (generator produces limited variety)
- No direct likelihood estimation
- Few theoretical convergence guarantees
Successes:
- StyleGAN series achieved photorealistic face generation
- CycleGAN enabled unpaired image-to-image translation
- Dominated 2014-2020 for high-quality generation
VAEs: Variational Autoencoders
Variational Autoencoders (Kingma & Welling, 2013) use probabilistic encoding:
- Encoder: Maps data to latent distribution
- Decoder: Reconstructs data from latent codes
- Training: Maximize the ELBO (evidence lower bound), written out below
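For an encoder q_φ(z|x), decoder p_θ(x|z), and prior p(z), the ELBO is:

$$\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

The first term rewards faithful reconstruction; the second keeps the approximate posterior close to the prior, which is what makes the latent space usable for sampling.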
Challenges:
- Blurry reconstructions and samples (a consequence of pixel-wise losses such as MSE)
- Posterior collapse (the approximate posterior collapses to the prior, so the decoder ignores the latent code)
- Limited generation quality compared to GANs and diffusion
Strengths:
- Stable training
- Interpretable latent space
- Principled probabilistic framework
Diffusion Models: Iterative Denoising
Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) use a fundamentally different approach:
- Forward process: Gradually adds Gaussian noise to the data over T steps
- Reverse process: Learns to denoise step by step
- Training: The network learns to predict the noise added at a randomly sampled timestep (sketched below)
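A minimal sketch of that training step, assuming a PyTorch-style `model(x_t, t)` that predicts the added noise and a precomputed 1-D `alpha_bar` tensor (cumulative products of the noise schedule); the names and interface are illustrative, not taken from any particular codebase:

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, alpha_bar, T=1000):
    """One noise-prediction training step: corrupt x0, predict the noise, take MSE."""
    alpha_bar = alpha_bar.to(x0.device)
    batch = x0.shape[0]
    t = torch.randint(0, T, (batch,), device=x0.device)         # random timestep per sample
    eps = torch.randn_like(x0)                                   # Gaussian noise to add
    a_bar = alpha_bar[t].view(batch, *([1] * (x0.dim() - 1)))    # broadcast to data shape

    # Forward process in closed form: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps

    # "Simple" loss: mean squared error between true and predicted noise
    return F.mse_loss(model(x_t, t), eps)
```

In practice the timestep is also embedded and fed into the network; that detail is hidden inside `model` here.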
Advantages:
- Stable training (no adversarial dynamics)
- High-quality samples (state-of-the-art FID scores)
- Principled probabilistic framework
- Easy conditioning on text, class labels, etc.
- Mode coverage without collapse
Key Insight: Instead of generating in one shot, diffusion models generate through iterative refinement, similar to how an artist sketches and refines.
Why Diffusion Won
Diffusion models have emerged as the dominant approach for several compelling reasons:
1. Quality
State-of-the-art image generation across multiple benchmarks:
- DALL-E 2, Stable Diffusion, Midjourney all use diffusion
- Superior to GANs on FID metrics
- Better mode coverage than GANs
2. Stability
More stable training than GANs:
- No adversarial dynamics
- Simple MSE loss on noise prediction (see the formula below)
- Consistent convergence
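Concretely, the loss is the simplified objective from Ho et al. (2020), with $\bar\alpha_t$ the cumulative noise-schedule product:

$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\big\|\,\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\big)\big\|^2\Big]$$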
3. Flexibility
Easy to condition on multiple modalities, commonly via classifier-free guidance (sketched after this list):
- Text-to-image (DALL-E 2, Stable Diffusion)
- Image editing and inpainting
- Super-resolution and style transfer
- Multi-modal conditioning
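Classifier-free guidance combines conditional and unconditional predictions from the same network. A minimal sketch, assuming a hypothetical `model(x_t, t, cond)` interface in which `cond=None` returns the unconditional prediction:

```python
import torch

@torch.no_grad()
def guided_noise(model, x_t, t, cond, guidance_scale=7.5):
    """Classifier-free guidance: blend conditional and unconditional noise predictions."""
    eps_uncond = model(x_t, t, None)   # prediction with the condition dropped
    eps_cond = model(x_t, t, cond)     # prediction given e.g. a text embedding
    # Push the estimate further in the direction implied by the condition
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Larger `guidance_scale` values trade sample diversity for fidelity to the condition.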
4. Interpretability
Clear iterative denoising process:
- Visualize generation at each step (see the sampling sketch below)
- Control generation granularity
- Understand failure modes
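One concrete way to do this is to keep intermediate states while sampling. Below is a sketch of ancestral DDPM sampling (the update rule from Ho et al., 2020) that stores periodic snapshots; `model`, `betas`, and the snapshot interval are illustrative assumptions:

```python
import torch

@torch.no_grad()
def sample_with_snapshots(model, shape, betas, snapshot_every=100, device="cpu"):
    """Ancestral DDPM sampling that keeps intermediate x_t for visualization."""
    betas = betas.to(device)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)    # start from pure Gaussian noise
    snapshots = []
    for t in reversed(range(betas.shape[0])):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)               # predicted noise at this step
        # Mean of the reverse step given the noise estimate
        x = (x - (1 - alphas[t]) / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)   # inject noise except at the final step
        if t % snapshot_every == 0:
            snapshots.append(x.clone())       # watch the image sharpen along the trajectory
    return x, snapshots
```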
The Diffusion Revolution (2020-2025)
Major milestones that established diffusion as the leading approach:
- 2020: DDPM paper shows sample quality competitive with GANs
- 2021: DDIM enables 10-50x faster sampling
- 2021: Classifier-free guidance enables powerful text conditioning
- 2022: DALL-E 2, Stable Diffusion, and Midjourney demonstrate unprecedented generation quality
- 2023-2025: Diffusion becomes the standard for image, video, and audio generation
Historical Context: GANs dominated 2014-2020, but diffusion models surpassed them in 2021-2022. By 2025, diffusion is the default choice for high-quality generation.
Comparison Table
| Aspect | GANs | VAEs | Diffusion |
|---|---|---|---|
| Sample Quality | High | Medium | Highest |
| Training Stability | Low | High | High |
| Mode Coverage | Poor | Good | Excellent |
| Speed (sampling) | Fast (1 step) | Fast (1 step) | Slow (50-1000 steps) |
| Likelihood | No (implicit) | Yes (ELBO) | Yes (bound) |
| Conditioning | Moderate | Easy | Very easy |
Practical Impact
Modern text-to-image systems all use diffusion models:
- DALL-E 3: OpenAI’s text-to-image system
- Stable Diffusion: Open-source diffusion model
- Midjourney: Commercial diffusion-based generator
- Imagen: Google’s text-to-image system
Understanding diffusion is essential for working with state-of-the-art generative AI.
Related Concepts
- Diffusion Fundamentals - How diffusion models work
- DDPM Paper - Foundational diffusion algorithm
- Contrastive Learning - Used in CLIP for text conditioning
Key Papers
GANs:
- Goodfellow et al. (2014): “Generative Adversarial Networks”
- Karras et al. (2019): “A Style-Based Generator Architecture for Generative Adversarial Networks” (StyleGAN)
VAEs:
- Kingma & Welling (2013): “Auto-Encoding Variational Bayes”
Diffusion:
- Sohl-Dickstein et al. (2015): “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”
- Ho et al. (2020): “Denoising Diffusion Probabilistic Models”
- Song et al. (2020): “Denoising Diffusion Implicit Models”
Learning Resources
Tutorials
- Lilian Weng: What are Diffusion Models? - Comprehensive blog post
- Yang Song: Generative Modeling by Estimating Gradients of the Data Distribution - Score-based perspective
Video Explanations
- Outlier: Diffusion Models | Paper Explanation - Clear introduction
Comparisons
- OpenAI: Diffusion vs GANs comparison - Empirical comparison