Generative Models Overview

The field of generative modeling has evolved through three major paradigms, each with distinct strengths and weaknesses. Understanding this landscape helps explain why diffusion models have become the dominant approach for high-quality image generation.

Three Main Approaches

| Model Type | Key Idea | Pros/Cons |
| --- | --- | --- |
| GANs (2014) | Adversarial training: generator vs. discriminator | High quality but unstable training, mode collapse |
| VAEs (2013) | Learn a latent space, optimize a variational bound | Stable but blurry samples |
| Diffusion (2020+) | Iterative denoising process | High quality, stable training, but slow sampling |

GANs: Adversarial Training

Generative Adversarial Networks (Goodfellow et al., 2014) introduced a game-theoretic approach:

  • Generator: Creates fake samples to fool the discriminator
  • Discriminator: Distinguishes real from fake samples
  • Training: Minimax game between the two networks (a minimal training-loop sketch follows below)
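
A minimal sketch of the adversarial loop, assuming toy MLP networks and the standard non-saturating losses; the architectures, sizes, and hyperparameters here are illustrative, not from any specific paper:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784  # illustrative sizes (e.g., flattened 28x28 images)

# Toy generator and discriminator; real systems use much deeper networks
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):
    """One minimax update. real: (batch, data_dim) tensor of training samples."""
    batch = real.size(0)
    fake = G(torch.randn(batch, latent_dim))

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0
    d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator (non-saturating generator loss)
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

train_step(torch.randn(32, data_dim))  # smoke test with random stand-in data
```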

Challenges:

  • Training instability (requires careful hyperparameter tuning)
  • Mode collapse (generator produces limited variety)
  • No direct likelihood estimation
  • Weak theoretical convergence guarantees

Successes:

  • StyleGAN series achieved photorealistic face generation
  • CycleGAN enabled unpaired image-to-image translation
  • Dominated 2014-2020 for high-quality generation

VAEs: Variational Autoencoders

Variational Autoencoders (Kingma & Welling, 2013) use probabilistic encoding:

  • Encoder: Maps data to a latent distribution $q_\phi(z|x)$
  • Decoder: Reconstructs data from latent codes via $p_\theta(x|z)$
  • Training: Maximize the ELBO (evidence lower bound):

$$\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$$
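
A sketch of the negative ELBO as a training loss for a Gaussian-latent VAE, assuming a Bernoulli decoder over pixel values; the `mu` and `logvar` inputs would come from a hypothetical encoder network:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z ~ q_phi(z|x) differentiably via the reparameterization trick."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def negative_elbo(x, x_logits, mu, logvar):
    """x: targets in [0, 1]; x_logits: decoder output; mu, logvar: encoder output."""
    # Reconstruction term: one-sample Monte Carlo estimate of E_q[log p_theta(x|z)]
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # KL(q_phi(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this maximizes the ELBO
```

In a full model, `z = reparameterize(mu, logvar)` would feed the decoder to produce `x_logits`.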

Challenges:

  • Blurry reconstructions (due to MSE loss)
  • Posterior collapse (encoder ignores latent variable)
  • Limited generation quality compared to GANs and diffusion

Strengths:

  • Stable training
  • Interpretable latent space
  • Principled probabilistic framework

Diffusion Models: Iterative Denoising

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) use a fundamentally different approach:

  • Forward process: Gradually add noise over T steps
  • Reverse process: Learn to denoise step-by-step
  • Training: Predict the added noise at each timestep (see the sketch after this list)
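
A minimal DDPM-style training loss in the spirit of Ho et al. (2020); `eps_model` is an assumed noise-prediction network, and the linear schedule values are illustrative:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative products \bar{alpha}_t

def training_loss(eps_model, x0):
    """x0: a batch of clean data; eps_model(x_t, t) predicts the injected noise."""
    batch = x0.size(0)
    t = torch.randint(0, T, (batch,))                        # random timestep per sample
    eps = torch.randn_like(x0)                               # the noise to be predicted
    ab = alpha_bars[t].view(batch, *([1] * (x0.dim() - 1)))  # broadcast to x0's shape
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps             # forward process, closed form
    return F.mse_loss(eps_model(x_t, t), eps)                # simple noise-prediction MSE
```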

Advantages:

  • Stable training (no adversarial dynamics)
  • High-quality samples (state-of-the-art FID scores)
  • Principled probabilistic framework
  • Easy conditioning on text, class labels, etc.
  • Mode coverage without collapse

Key Insight: Instead of generating in one shot, diffusion models generate through iterative refinement, similar to how an artist sketches and refines.
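
To make the iterative-refinement picture concrete, here is a sketch of ancestral DDPM sampling under the same assumptions as the training sketch above (illustrative schedule, assumed `eps_model`):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def sample(eps_model, shape):
    x = torch.randn(shape)  # start from pure noise, x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = eps_model(x, torch.full((shape[0],), t))
        alpha, ab = 1.0 - betas[t], alpha_bars[t]
        # Mean of p_theta(x_{t-1} | x_t), expressed via the predicted noise
        x = (x - (1.0 - alpha) / (1.0 - ab).sqrt() * eps_hat) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # stochastic refinement step
    return x
```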

Why Diffusion Won

Diffusion models have emerged as the dominant approach for several compelling reasons:

1. Quality

State-of-the-art image generation across multiple benchmarks:

  • DALL-E 2, Stable Diffusion, Midjourney all use diffusion
  • Superior to GANs on FID metrics
  • Better mode coverage than GANs

2. Stability

More stable training than GANs:

  • No adversarial dynamics
  • Simple MSE loss on noise prediction
  • Consistent convergence

3. Flexibility

Easy to condition on multiple modalities:

  • Text-to-image (DALL-E 2, Stable Diffusion)
  • Image editing and inpainting
  • Super-resolution and style transfer
  • Multi-modal conditioning

4. Interpretability

Clear iterative denoising process:

  • Visualize generation at each step
  • Control generation granularity
  • Understand failure modes

The Diffusion Revolution (2020-2025)

Major milestones that established diffusion as the leading approach:

  • 2020: DDPM paper shows competitive quality with GANs
  • 2021: DDIM enables 10-50x faster sampling
  • 2021: Classifier-free guidance enables powerful text conditioning (the combination rule is shown after this list)
  • 2022: DALL-E 2, Stable Diffusion, and Midjourney demonstrate unprecedented generation quality
  • 2023-2025: Diffusion becomes the standard for image, video, and audio generation
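
For reference, classifier-free guidance combines the model's conditional and unconditional noise predictions with a guidance weight $w$; the notation below follows common convention rather than this page's own:

$$\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t)\big)$$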

Historical Context: GANs dominated 2014-2020, but diffusion models surpassed them in 2021-2022. By 2025, diffusion is the default choice for high-quality generation.

Comparison Table

| Aspect | GANs | VAEs | Diffusion |
| --- | --- | --- | --- |
| Sample quality | High | Medium | Highest |
| Training stability | Low | High | High |
| Mode coverage | Poor | Good | Excellent |
| Sampling speed | Fast (1 step) | Fast (1 step) | Slow (50-1000 steps) |
| Likelihood | No | Yes | Yes |
| Conditioning | Moderate | Easy | Very easy |

Practical Impact

The most prominent modern text-to-image systems are all built on diffusion models:

  • DALL-E 3: OpenAI’s text-to-image system
  • Stable Diffusion: Open-source diffusion model
  • Midjourney: Commercial diffusion-based generator
  • Imagen: Google’s text-to-image system

Understanding diffusion is essential for working with state-of-the-art generative AI.

Key Papers

GANs:

  • Goodfellow et al. (2014): “Generative Adversarial Networks”
  • Karras et al. (2019): “A Style-Based Generator Architecture for Generative Adversarial Networks” (StyleGAN)

VAEs:

  • Kingma & Welling (2013): “Auto-Encoding Variational Bayes”

Diffusion:

  • Sohl-Dickstein et al. (2015): “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”
  • Ho et al. (2020): “Denoising Diffusion Probabilistic Models”
  • Song et al. (2020): “Denoising Diffusion Implicit Models”
