Generative Models Overview

The field of generative modeling has evolved through three major paradigms, each with distinct strengths and weaknesses. Understanding this landscape helps explain why diffusion models have become the dominant approach for high-quality image generation.

Three Main Approaches

| Model Type | Key Idea | Pros/Cons |
| --- | --- | --- |
| GANs (2014) | Adversarial training: generator vs. discriminator | High quality but unstable training, mode collapse |
| VAEs (2013) | Learn a latent space, optimize a variational bound | Stable but blurry samples |
| Diffusion (2020+) | Iterative denoising process | High quality, stable training, but slow sampling |

GANs: Adversarial Training

Generative Adversarial Networks (Goodfellow et al., 2014) introduced a game-theoretic approach:

  • Generator: Creates fake samples to fool the discriminator
  • Discriminator: Distinguishes real from fake samples
  • Training: Minimax game between the two networks (a minimal training-loop sketch follows below)
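
A minimal sketch of the adversarial loop, assuming toy MLP networks and the standard non-saturating losses; the architectures, sizes, and hyperparameters here are illustrative, not from any specific paper:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784  # illustrative sizes (e.g., flattened 28x28 images)

# Toy generator and discriminator; real systems use much deeper networks
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):
    """One minimax update. real: (batch, data_dim) tensor of training samples."""
    batch = real.size(0)
    fake = G(torch.randn(batch, latent_dim))

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0
    d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator (non-saturating generator loss)
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

train_step(torch.randn(32, data_dim))  # smoke test with random stand-in data
```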

Challenges:

  • Training instability (requires careful hyperparameter tuning)
  • Mode collapse (generator produces limited variety)
  • No direct likelihood estimation
  • Weak theoretical convergence guarantees

Successes:

  • StyleGAN series achieved photorealistic face generation
  • CycleGAN enabled unpaired image-to-image translation
  • Dominated 2014-2020 for high-quality generation

VAEs: Variational Autoencoders

Variational Autoencoders (Kingma & Welling, 2013) use probabilistic encoding:

  • Encoder: Maps data to a latent distribution $q_\phi(z|x)$
  • Decoder: Reconstructs data from latent codes via $p_\theta(x|z)$
  • Training: Maximize the ELBO (evidence lower bound):

$$\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$
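
A sketch of the negative ELBO as a training loss for a Gaussian-latent VAE, assuming a Bernoulli decoder over pixel values; the `mu` and `logvar` inputs would come from a hypothetical encoder network:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z ~ q_phi(z|x) differentiably via the reparameterization trick."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def negative_elbo(x, x_logits, mu, logvar):
    """x: targets in [0, 1]; x_logits: decoder output; mu, logvar: encoder output."""
    # Reconstruction term: one-sample Monte Carlo estimate of E_q[log p_theta(x|z)]
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # KL(q_phi(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this maximizes the ELBO
```

In a full model, `z = reparameterize(mu, logvar)` would feed the decoder to produce `x_logits`.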

Challenges:

  • Blurry reconstructions (due to MSE loss)
  • Posterior collapse (encoder ignores latent variable)
  • Limited generation quality compared to GANs and diffusion

Strengths:

  • Stable training
  • Interpretable latent space
  • Principled probabilistic framework

Diffusion Models: Iterative Denoising

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) use a fundamentally different approach:

  • Forward process: Gradually add noise over T steps
  • Reverse process: Learn to denoise step-by-step
  • Training: Predict the added noise at each timestep (see the sketch after this list)
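
A minimal DDPM-style training loss in the spirit of Ho et al. (2020); `eps_model` is an assumed noise-prediction network, and the linear schedule values are illustrative:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative products \bar{alpha}_t

def training_loss(eps_model, x0):
    """x0: a batch of clean data; eps_model(x_t, t) predicts the injected noise."""
    batch = x0.size(0)
    t = torch.randint(0, T, (batch,))                        # random timestep per sample
    eps = torch.randn_like(x0)                               # the noise to be predicted
    ab = alpha_bars[t].view(batch, *([1] * (x0.dim() - 1)))  # broadcast to x0's shape
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps             # forward process, closed form
    return F.mse_loss(eps_model(x_t, t), eps)                # simple noise-prediction MSE
```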

Advantages:

  • Stable training (no adversarial dynamics)
  • High-quality samples (state-of-the-art FID scores)
  • Principled probabilistic framework
  • Easy conditioning on text, class labels, etc.
  • Mode coverage without collapse

Key Insight: Instead of generating in one shot, diffusion models generate through iterative refinement, similar to how an artist sketches and refines.
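
To make the iterative-refinement picture concrete, here is a sketch of ancestral DDPM sampling under the same assumptions as the training sketch above (illustrative schedule, assumed `eps_model`):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def sample(eps_model, shape):
    x = torch.randn(shape)  # start from pure noise, x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = eps_model(x, torch.full((shape[0],), t))
        alpha, ab = 1.0 - betas[t], alpha_bars[t]
        # Mean of p_theta(x_{t-1} | x_t), expressed via the predicted noise
        x = (x - (1.0 - alpha) / (1.0 - ab).sqrt() * eps_hat) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # stochastic refinement step
    return x
```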

Why Diffusion Won

Diffusion models have emerged as the dominant approach for several compelling reasons:

1. Quality

State-of-the-art image generation across multiple benchmarks:

  • DALL-E 2, Stable Diffusion, Midjourney all use diffusion
  • Superior to GANs on FID metrics
  • Better mode coverage than GANs

2. Stability

More stable training than GANs:

  • No adversarial dynamics
  • Simple MSE loss on noise prediction
  • Consistent convergence

3. Flexibility

Easy to condition on multiple modalities:

  • Text-to-image (DALL-E 2, Stable Diffusion)
  • Image editing and inpainting
  • Super-resolution and style transfer
  • Multi-modal conditioning

4. Interpretability

Clear iterative denoising process:

  • Visualize generation at each step
  • Control generation granularity
  • Understand failure modes

The Diffusion Revolution (2020-2025)

Major milestones that established diffusion as the leading approach:

  • 2020: DDPM paper shows competitive quality with GANs
  • 2021: DDIM enables 10-50x faster sampling
  • 2021: Classifier-free guidance enables powerful text conditioning (the combination rule is shown after this list)
  • 2022: DALL-E 2, Stable Diffusion, and Midjourney demonstrate unprecedented generation quality
  • 2023-2025: Diffusion becomes the standard for image, video, and audio generation
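
For reference, classifier-free guidance combines the model's conditional and unconditional noise predictions with a guidance weight $w$; the notation below follows common convention rather than this page's own:

$$\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t)\big)$$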

Historical Context: GANs dominated 2014-2020, but diffusion models surpassed them in 2021-2022. By 2025, diffusion is the default choice for high-quality generation.

Comparison Table

| Aspect | GANs | VAEs | Diffusion |
| --- | --- | --- | --- |
| Sample quality | High | Medium | Highest |
| Training stability | Low | High | High |
| Mode coverage | Poor | Good | Excellent |
| Sampling speed | Fast (1 step) | Fast (1 step) | Slow (50-1000 steps) |
| Likelihood | No | Yes | Yes |
| Conditioning | Moderate | Easy | Very easy |

Practical Impact

The most prominent modern text-to-image systems are all built on diffusion models:

  • DALL-E 3: OpenAI’s text-to-image system
  • Stable Diffusion: Open-source diffusion model
  • Midjourney: Commercial diffusion-based generator
  • Imagen: Google’s text-to-image system

Understanding diffusion is essential for working with state-of-the-art generative AI.

Key Papers

GANs:

  • Goodfellow et al. (2014): “Generative Adversarial Networks”
  • Karras et al. (2019): “A Style-Based Generator Architecture for Generative Adversarial Networks” (StyleGAN)

VAEs:

  • Kingma & Welling (2013): “Auto-Encoding Variational Bayes”

Diffusion:

  • Sohl-Dickstein et al. (2015): “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”
  • Ho et al. (2020): “Denoising Diffusion Probabilistic Models”
  • Song et al. (2020): “Denoising Diffusion Implicit Models”
