
Generative Diffusion Models

Learn the mathematics and implementation of diffusion models - the technology that revolutionized image generation and now powers DALL-E, Stable Diffusion, and Midjourney. Understand forward and reverse diffusion processes, noise prediction training, and text conditioning through classifier-free guidance.

Why Diffusion Models Matter

Between 2020 and 2023, diffusion models replaced GANs as the default approach for high-quality image generation:

Timeline:

  • 2020: DDPM paper introduces diffusion for high-quality generation
  • 2021: DDIM enables 20-50x faster sampling
  • 2022: DALL-E 2, Stable Diffusion, Midjourney launch publicly
  • 2023: Diffusion models dominate generative AI (images, video, audio, 3D)

Why Diffusion Won:

  • More stable training than GANs (no mode collapse)
  • Better sample quality than VAEs
  • Flexible conditioning (text, images, sketches)
  • Controllable generation via guidance

Healthcare Applications:

  • Synthetic medical image generation for rare pathologies
  • Privacy-preserving synthetic patient data
  • Data augmentation for imbalanced datasets
  • Conditional generation (symptom text → medical image)

Learning Objectives

After completing this module, you will:

  • Generative Model Understanding: Grasp different generative approaches (GANs, VAEs, Diffusion) and why diffusion won
  • Diffusion Mathematics: Understand forward process (adding noise) and reverse process (learned denoising) with DDPM and DDIM
  • Text Conditioning: Learn how CLIP embeddings guide image generation through classifier-free guidance
  • Implementation Skills: Implement DDPM training and DDIM sampling from scratch
  • Healthcare Application: Apply generative techniques to synthetic medical data generation

Prerequisites

Before starting this module, ensure you have:

Essential Preparation: Watch Welch Labs: “But How Do AI Images Work?” (35 minutes) before starting. This provides crucial visual intuition for diffusion models.

Week 1: Diffusion Fundamentals

Day 1-2: Generative Models Landscape

Core Concept:

Generative Models

What You’ll Learn:

Generative Model Comparison:

| Model | Training | Sampling | Quality | Diversity | Stability |
|---|---|---|---|---|---|
| GANs | Adversarial | Single step (fast) | High | Medium | Unstable (mode collapse) |
| VAEs | Likelihood-based | Single step (fast) | Medium | High | Stable |
| Diffusion | Denoising | Iterative (slow) | Highest | High | Very stable |

Generative Adversarial Networks (GANs):

  • Generator vs Discriminator
  • Pros: Fast sampling, high quality
  • Cons: Training instability, mode collapse, hard to condition

Variational Autoencoders (VAEs):

  • Encoder → Latent space → Decoder
  • Pros: Stable training, explicit likelihood
  • Cons: Blurry samples, posterior collapse

Diffusion Models:

  • Gradually add noise, then learn to denoise
  • Pros: Stable training, highest quality, flexible conditioning
  • Cons: Slow sampling (fixed by DDIM)

Why Diffusion Won (2020-2023):

  1. Quality: Better FID scores than GANs on most benchmarks
  2. Stability: No mode collapse, reliable training
  3. Scalability: Works at massive scale (billions of parameters)
  4. Conditioning: Easy to add text/class/image conditioning
  5. Fast Sampling: DDIM reduced steps from 1000 → 20-50

Learning Resources:

  • Papers: Original GAN, VAE, DDPM papers
  • Reading: Generative models survey
  • Videos: Welch Labs diffusion video (essential viewing)

Exercises:

  • Compare generated samples from GAN, VAE, and diffusion
  • Understand trade-offs for different applications
  • Identify when to use each model type

Checkpoint: Can you explain why diffusion models replaced GANs for most applications?

Day 3-5: Diffusion Fundamentals

Core Concept:

Diffusion Fundamentals

What You’ll Learn:

Forward Diffusion Process (Adding Noise):

Start with a real image $x_0$ and gradually add Gaussian noise over $T$ steps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

Where:

  • $x_0$ is the original image
  • $x_T$ is (approximately) pure noise
  • $\beta_t$ is the noise schedule (how much noise is added at each step)
  • $T$ is typically 1000 steps

Key Insight - Direct Timestep Sampling:

We don’t need to apply noise $T$ times sequentially! We can sample $x_t$ directly from $x_0$:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

Where:

  • $\bar{\alpha}_t = \prod_{i=1}^t (1 - \beta_i)$
  • $\epsilon \sim \mathcal{N}(0, I)$ is standard Gaussian noise

This is the training data generation trick!
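
A minimal PyTorch sketch of this direct sampling, assuming a linear noise schedule; the helper name `q_sample` is illustrative, not from a specific library:

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # \bar{alpha}_t = prod_{i<=t} (1 - beta_i)

def q_sample(x_0, t, noise=None):
    """Sample x_t directly from x_0 for a batch of timesteps t (shape [B])."""
    if noise is None:
        noise = torch.randn_like(x_0)
    # Reshape to [B, 1, 1, 1] so the scalars broadcast over image dimensions
    sqrt_ab = alpha_bars[t].sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus_ab = (1.0 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)
    return sqrt_ab * x_0 + sqrt_one_minus_ab * noise

# Example: inspect noise levels at a few timesteps
x_0 = torch.rand(4, 3, 32, 32) * 2 - 1         # fake batch of images in [-1, 1]
for t_val in [0, 250, 500, 750, 999]:
    t = torch.full((4,), t_val, dtype=torch.long)
    x_t = q_sample(x_0, t)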

Noise Schedules:

  1. Linear Schedule: $\beta_t$ increases linearly from $\beta_1$ to $\beta_T$

    • Simple but not optimal
    • Too much noise too quickly
  2. Cosine Schedule: Smoother noise addition

    • Better for high-resolution images
    • More uniform signal-to-noise ratio across timesteps
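
To compare the two schedules above, here is a sketch with illustrative constants; the cosine form follows the formulation popularized by the Improved DDPM paper (Nichol & Dhariwal, 2021):

import math
import torch

def linear_beta_schedule(T, beta_1=1e-4, beta_T=0.02):
    """Linearly increasing betas, as in the original DDPM setup."""
    return torch.linspace(beta_1, beta_T, T)

def cosine_beta_schedule(T, s=0.008):
    """Cosine schedule: define alpha_bar directly, then derive betas from its ratios."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()

# Compare how much signal remains (alpha_bar) under each schedule
for name, betas in [("linear", linear_beta_schedule(1000)),
                    ("cosine", cosine_beta_schedule(1000))]:
    alpha_bar = torch.cumprod(1 - betas, dim=0)
    print(name, alpha_bar[::250])   # sample a few timesteps

The cosine schedule keeps $\bar{\alpha}_t$ from collapsing to zero too early, which is why it tends to work better at high resolution.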

Alpha/Beta Notation:

  • $\alpha_t = 1 - \beta_t$
  • $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$
  • $\bar{\alpha}_t$ represents how much of the original signal remains at timestep $t$

Reverse Process (Learned Denoising):

Learn to reverse the diffusion process - denoise step by step:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

The model (with parameters $\theta$) predicts how to remove noise at each timestep.

Learning Resources:

  • Papers: “Denoising Diffusion Probabilistic Models” (DDPM)
  • Videos: Welch Labs video on forward/reverse process
  • Code: Implement forward diffusion process

Exercises:

  • Implement forward diffusion (noise addition)
  • Visualize images at different timesteps (t=0, 250, 500, 750, 1000)
  • Implement direct timestep sampling
  • Compare linear vs cosine noise schedules

Checkpoint: Can you explain why we can sample $x_t$ directly from $x_0$?

Day 6-7: DDPM Training

Architecture Paper:

DDPM (Denoising Diffusion Probabilistic Models)

What You’ll Learn:

DDPM’s Key Insight:

Instead of predicting $x_{t-1}$ directly, predict the noise $\epsilon$ that was added:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\, \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \,\right]$$

Where:

  • $\epsilon$ is the actual noise that was added
  • $\epsilon_\theta(x_t, t)$ is the model’s noise prediction
  • $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ is sampled as in the forward process

Why Predict Noise?

  • Simpler objective than predicting $x_0$ or $x_{t-1}$
  • More stable training
  • Better gradient properties
  • Empirically works much better

U-Net Architecture:

DDPM uses a U-Net for the denoising model:

  • Encoder: Downsampling path (convolutions + downsampling)
  • Bottleneck: Lowest resolution with most channels
  • Decoder: Upsampling path (transposed convolutions)
  • Skip Connections: Concatenate encoder features to the corresponding decoder layers (the defining U-Net feature; concatenation rather than ResNet-style addition)

Timestep Embedding:

  • Sinusoidal positional encoding (like transformers)
  • Tells model which timestep it’s denoising
  • Injected via adaptive group normalization
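
A minimal sketch of the sinusoidal timestep embedding plus a small projection MLP; the dimensions here are illustrative, and real U-Nets typically inject the result into every residual block:

import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps t (shape [B]) into [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

# The embedding is usually passed through a small MLP, then used inside each
# U-Net block (e.g., added to feature maps or used to scale/shift group-norm stats)
time_mlp = nn.Sequential(nn.Linear(128, 512), nn.SiLU(), nn.Linear(512, 512))
t = torch.randint(0, 1000, (8,))
t_emb = time_mlp(timestep_embedding(t, 128))   # shape [8, 512]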

Self-Attention Layers:

  • Added at lower resolutions (e.g., 16×16)
  • Captures long-range dependencies
  • Critical for high-quality generation

DDPM Training Algorithm:

# Simplified training loop
for epoch in range(num_epochs):
    for x_0 in dataloader:  # Real images
        # Sample a random timestep for each image in the batch
        t = torch.randint(0, T, (batch_size,))

        # Sample noise
        epsilon = torch.randn_like(x_0)

        # Create noisy image (the sqrt_* terms broadcast over image dimensions)
        x_t = sqrt_alpha_bar[t] * x_0 + sqrt_one_minus_alpha_bar[t] * epsilon

        # Predict noise
        epsilon_pred = model(x_t, t)

        # Compute loss
        loss = mse_loss(epsilon, epsilon_pred)

        # Backprop and update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

DDPM Sampling Algorithm:

# Start from pure noise
x_t = torch.randn(batch_size, 3, H, W)

# Denoise step by step, from t = T-1 down to 0
for t in reversed(range(T)):
    # Predict noise
    epsilon_pred = model(x_t, t)

    # Mean of p(x_{t-1} | x_t) given the predicted noise
    mean = (x_t - beta[t] / sqrt_one_minus_alpha_bar[t] * epsilon_pred) / sqrt_alpha[t]

    # Add noise (except at the final step)
    if t > 0:
        noise = torch.randn_like(x_t)
        x_t = mean + beta[t].sqrt() * noise
    else:
        x_t = mean

# x_t is now the generated image

Learning Resources:

  • Papers: “Denoising Diffusion Probabilistic Models” (Ho et al., 2020)
  • Code:
    • Official DDPM implementation (TensorFlow)
    • Unofficial PyTorch implementations
  • Videos: DDPM paper explained

Exercises:

  • Implement DDPM training loop
  • Train simplified diffusion model on MNIST or CIFAR-10
  • Visualize denoising process at different timesteps
  • Implement U-Net with timestep conditioning

Checkpoint: Can you implement DDPM training and explain why we predict noise instead of images?

Week 2: Fast Sampling and Text Conditioning

Day 8-9: DDIM - Fast Sampling

Architecture Paper:

DDIM (Denoising Diffusion Implicit Models)

What You’ll Learn:

The Problem with DDPM:

  • Needs 1000 steps for generation → very slow
  • 50 seconds per image on GPU
  • Impractical for real-time applications

DDIM’s Solution:

  • Same trained model, different sampling process
  • Skip timesteps without retraining
  • 20-50 steps instead of 1000 → 20-50x speedup!
  • Deterministic generation (same noise → same image)

DDIM Update Rule:

$$x_{t-\Delta t} = \sqrt{\bar{\alpha}_{t-\Delta t}}\, \hat{x}_0 + \sqrt{1 - \bar{\alpha}_{t-\Delta t} - \sigma_t^2}\, \epsilon_\theta(x_t, t) + \sigma_t\, \epsilon$$

Where:

  • $\hat{x}_0 = \dfrac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}$ is the model’s prediction of the clean image
  • $\sigma_t$ controls stochasticity (0 = deterministic)
  • $\Delta t$ is the step size (can be greater than 1)

Key Insights:

  1. Deterministic sampling: Set $\sigma_t = 0$ to get the same image from the same noise
  2. Step skipping: Use subset of timesteps (e.g., [0, 50, 100, …, 1000])
  3. No retraining: Use existing DDPM checkpoint
  4. Interpolation: Can interpolate between images in latent space
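
A minimal sketch of deterministic DDIM sampling ($\sigma_t = 0$) with step skipping, assuming a trained noise predictor `model(x_t, t)` and the precomputed `alpha_bars` from earlier; all names are illustrative:

import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bars, num_steps=50, T=1000):
    """Deterministic DDIM sampling (sigma_t = 0) using num_steps << T."""
    # Evenly spaced subset of timesteps, from high noise down to zero
    timesteps = torch.linspace(T - 1, 0, num_steps).long()
    x_t = torch.randn(shape)                     # start from pure noise

    for i, t in enumerate(timesteps):
        t_batch = torch.full((shape[0],), int(t), dtype=torch.long)
        eps = model(x_t, t_batch)                # predicted noise

        ab_t = alpha_bars[t]
        # alpha_bar of the next (less noisy) timestep; 1.0 once we reach t = 0
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)

        # Predict the clean image, then step toward it
        x0_pred = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
        x_t = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps

    return x_t

With `num_steps=50` this reuses a DDPM checkpoint unchanged; lowering `num_steps` trades quality for speed.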

DDPM vs DDIM:

| Aspect | DDPM | DDIM |
|---|---|---|
| Sampling steps | 1000 | 20-50 |
| Sampling speed | Slow (50s) | Fast (2-3s) |
| Stochastic | Yes | Optional |
| Deterministic | No | Yes |
| Requires retraining | N/A | No |

Practical Tips:

  • 50 steps: Good quality, 10x faster
  • 20 steps: Acceptable quality, 25x faster
  • 10 steps: Lower quality but usable
  • Trade-off quality vs speed based on application

Learning Resources:

  • Papers: “Denoising Diffusion Implicit Models” (Song et al., 2020)
  • Code: DDIM sampling implementation
  • Comparison: Side-by-side DDPM vs DDIM

Exercises:

  • Implement DDIM sampling
  • Compare DDPM (1000 steps) vs DDIM (50 steps) quality
  • Experiment with different step counts
  • Test deterministic generation (same seed → same image)

Checkpoint: Can you explain how DDIM achieves 20-50x speedup without retraining?

Day 10-12: Classifier-Free Guidance

Core Concept:

Classifier-Free Guidance

What You’ll Learn:

Text-to-Image Generation:

How do we guide diffusion models with text prompts like “a cat wearing a hat”?

Early Approach - Classifier Guidance:

  • Train separate classifier on noisy images
  • Use classifier gradients to guide denoising
  • Problems: Needs extra classifier, adversarial artifacts

Classifier-Free Guidance (CFG):

  • No separate classifier needed!
  • Train single model both conditionally and unconditionally
  • Amplify conditioning during sampling

Training with Conditioning:

# During training, randomly drop conditioning with probability p (e.g., 10%)
if random() < 0.1:
    # Unconditional training
    epsilon_pred = model(x_t, t, cond=None)
else:
    # Conditional training
    epsilon_pred = model(x_t, t, cond=text_embedding)

loss = mse_loss(epsilon, epsilon_pred)

Sampling with Guidance:

$$\tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + s \cdot \bigl(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset)\bigr)$$

Where:

  • $c$ is the conditioning (e.g., a CLIP text embedding)
  • $\emptyset$ denotes the empty (unconditional) conditioning
  • $s$ is the guidance scale (typically 7-15)

Intuition:

  • Unconditional prediction: “Generate any image”
  • Conditional prediction: “Generate image matching text”
  • Guidance: Amplify the difference → stronger text adherence
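
A minimal sketch of the guided prediction at sampling time, assuming the model accepts `cond=None` for the unconditional branch; names are illustrative:

def guided_noise_prediction(model, x_t, t, text_embedding, guidance_scale=7.5):
    """Classifier-free guidance: amplify the conditional direction."""
    eps_uncond = model(x_t, t, cond=None)            # "generate any image"
    eps_cond = model(x_t, t, cond=text_embedding)    # "generate image matching text"
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

In practice the two forward passes are usually batched together, so guidance costs one wider pass rather than two separate ones.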

Guidance Scale Effects:

Guidance ScaleEffect
s = 0Ignore conditioning (pure unconditional)
s = 1Standard conditional generation
s = 3-5Moderate text adherence
s = 7-10Strong text adherence (Stable Diffusion default: 7.5)
s = 15+Very strong adherence, may sacrifice quality

Negative Prompts:

Guide away from unwanted concepts:

# Positive prompt: "a beautiful landscape"
# Negative prompt: "blurry, low quality, watermark"
# The negative-prompt prediction replaces the unconditional branch, so guidance
# pushes toward the positive prompt and away from the negative one
epsilon_pred = epsilon_cond_negative + \
    s * (epsilon_cond_positive - epsilon_cond_negative)

Text Conditioning via CLIP:

  • Use CLIP text encoder for text embeddings
  • Cross-attention in U-Net: Query from image features, Key+Value from text embeddings
  • Text embedding injected at multiple resolutions
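
As a sketch of this conditioning pathway (queries from flattened image features, keys and values from CLIP text-token embeddings); using `nn.MultiheadAttention` here is a simplification of the attention blocks in real U-Nets, and the dimensions are illustrative:

import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, img_channels, text_dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=img_channels, num_heads=num_heads,
                                          kdim=text_dim, vdim=text_dim, batch_first=True)
        self.norm = nn.LayerNorm(img_channels)

    def forward(self, img_feat, text_emb):
        # img_feat: [B, C, H, W] -> sequence of H*W image tokens [B, H*W, C]
        B, C, H, W = img_feat.shape
        q = img_feat.flatten(2).transpose(1, 2)
        # Query from image tokens; key and value from text tokens [B, L, text_dim]
        out, _ = self.attn(query=self.norm(q), key=text_emb, value=text_emb)
        out = q + out                                   # residual connection
        return out.transpose(1, 2).reshape(B, C, H, W)

# Example shapes: 16x16 feature map with 320 channels, 77 CLIP text tokens of dim 768
block = CrossAttentionBlock(img_channels=320, text_dim=768)
x = torch.randn(2, 320, 16, 16)
txt = torch.randn(2, 77, 768)
y = block(x, txt)   # [2, 320, 16, 16]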

Learning Resources:

  • Papers:
    • “Classifier-Free Diffusion Guidance” (Ho & Salimans, 2021)
    • “High-Resolution Image Synthesis with Latent Diffusion Models” (Stable Diffusion)
  • Code: Classifier-free guidance implementation
  • Demos: Stable Diffusion, DALL-E 2

Exercises:

  • Implement classifier-free guidance
  • Train conditional diffusion model (class-conditional on CIFAR-10)
  • Experiment with different guidance scales
  • Implement negative prompts
  • Compare conditional vs unconditional generation

Checkpoint: Can you implement classifier-free guidance and explain how it amplifies conditioning?

Day 13-14: Complete Text-to-Image Pipeline

Putting It All Together:

Complete Modern Text-to-Image System:

Text Prompt
  → CLIP Text Encoder → Text Embeddings
  → Cross-Attention in U-Net

Pure Noise (x_T)
  → DDIM Sampling (50 steps) with the U-Net noise predictor (trained with classifier-free guidance)
  → Generated Image (x_0)

Key Components:

  1. CLIP Text Encoder: Convert prompt to embeddings
  2. U-Net with Cross-Attention: Noise predictor conditioned on text
  3. DDIM Sampling: Fast generation (50 steps)
  4. Classifier-Free Guidance: Strong text adherence
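
A condensed sketch of what a `generate(prompt)` call might run, combining these components; the text encoder, U-Net, and `alpha_bars` are assumed to exist, and all names are illustrative:

import torch

@torch.no_grad()
def generate(prompt, text_encoder, unet, alpha_bars,
             num_steps=50, guidance_scale=7.5, T=1000, shape=(1, 3, 64, 64)):
    """Text-to-image sampling: CLIP conditioning + CFG + deterministic DDIM."""
    cond = text_encoder(prompt)                 # CLIP text embeddings
    x_t = torch.randn(shape)                    # start from pure noise

    timesteps = torch.linspace(T - 1, 0, num_steps).long()
    for i, t in enumerate(timesteps):
        t_batch = torch.full((shape[0],), int(t), dtype=torch.long)

        # Classifier-free guidance: combine unconditional and conditional predictions
        eps_uncond = unet(x_t, t_batch, cond=None)
        eps_cond = unet(x_t, t_batch, cond=cond)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

        # Deterministic DDIM step
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        x0_pred = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
        x_t = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps

    return x_t                                  # generated image(s)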

Hands-On Project:

Requirements:

  1. Implement complete text-conditional diffusion model
  2. Train on small dataset (e.g., Oxford Flowers with captions)
  3. Use DDIM for fast sampling
  4. Implement classifier-free guidance
  5. Generate images from text prompts

Advanced Features:

  • Image-to-image (start from a noised real image; see the sketch below)
  • Inpainting (mask regions to regenerate)
  • Style transfer via prompts
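
For instance, image-to-image generation can be sketched as forward-diffusing a real image to an intermediate timestep and then running the usual denoising loop from there; the `strength` parameter and helper name are illustrative:

import torch

def prepare_img2img_start(x_real, alpha_bars, strength=0.6, T=1000):
    """Forward-diffuse a real image to an intermediate timestep t_start;
    the usual DDIM loop then denoises from t_start down to 0."""
    t_start = int(strength * (T - 1))           # higher strength = more change allowed
    ab = alpha_bars[t_start]
    noise = torch.randn_like(x_real)
    x_t = ab.sqrt() * x_real + (1 - ab).sqrt() * noise
    return x_t, t_start

Inpainting follows a similar idea: one common approach re-noises the known (unmasked) region from the original image after each denoising step and pastes it back, so only the masked region is actually regenerated.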

Healthcare Application Example:

Synthetic Medical Image Generation:

# Generate synthetic chest X-rays for data augmentation
prompts = [
    "chest x-ray showing pneumonia in right lung",
    "chest x-ray with normal lung fields",
    "chest x-ray showing cardiomegaly",
]

for prompt in prompts:
    # Generate multiple synthetic examples per condition
    for i in range(10):
        synthetic_xray = diffusion_model.generate(
            prompt=prompt,
            guidance_scale=7.5,
            num_steps=50,
        )
        save_image(synthetic_xray, f"{prompt}_{i}.png")

Benefits:

  • Augment rare pathology datasets
  • Privacy-preserving (no real patient data)
  • Controlled generation (specify exact conditions)

Module Completion Criteria

You have completed this module when you can:

  • ✅ Explain the diffusion process (forward and reverse)
  • ✅ Understand why we predict noise instead of images
  • ✅ Implement DDPM training loop
  • ✅ Implement DDIM sampling for fast generation
  • ✅ Understand and implement classifier-free guidance
  • ✅ Build complete text-to-image pipeline
  • ✅ Apply diffusion models to healthcare problems

Key Resources

Essential Papers (Must Read)

  1. “Denoising Diffusion Probabilistic Models” (Ho et al., 2020) - DDPM
  2. “Denoising Diffusion Implicit Models” (Song et al., 2020) - DDIM
  3. “Classifier-Free Diffusion Guidance” (Ho & Salimans, 2021) - CFG

Essential Video (Must Watch)

Welch Labs: “But How Do AI Images Work?” (35 minutes)

  • Best intuitive explanation of diffusion
  • Visual walkthrough of DALL-E 2 architecture
  • Watch before starting this module

Code Resources

  • HuggingFace Diffusers library (production-ready implementations)
  • Stable Diffusion official code
  • Educational diffusion implementations

Healthcare-Specific Resources

  • Medical image synthesis papers
  • Synthetic data validation studies
  • Privacy-preserving generation techniques

Common Pitfalls

1. Wrong Noise Schedule

Problem: Images don’t denoise properly
Solution: Use a cosine schedule for high-resolution images; validate the alpha/beta values

2. Too Few Sampling Steps

Problem: Low-quality images with DDIM
Solution: Use at least 20-50 steps for good quality

3. Wrong Guidance Scale

Problem: Poor text adherence or unnatural images
Solution: Start with 7.5 and adjust based on results (3-15 range)

4. Forgetting Conditioning Dropout

Problem: Classifier-free guidance doesn’t work
Solution: Train with ~10% conditioning-dropout (unconditional) probability

5. Memory Issues

Problem: Out of memory during training
Solution: Reduce batch size, use gradient accumulation, or train at lower resolution

Success Tips

  1. Start with MNIST/CIFAR-10

    • Validate implementation on simple data
    • Faster iteration cycles
    • Scale up to high-res later
  2. Use Pre-Trained Models First

    • Download Stable Diffusion weights
    • Understand architecture by using it
    • Fine-tune on domain data
  3. Visualize the Process

    • Save intermediate denoising steps
    • Watch how images form from noise
    • Understand what each component does
  4. Monitor Training Carefully

    • Watch sample quality during training
    • Check that conditioning has effect
    • Validate guidance scale range
  5. Healthcare Validation

    • Get clinical expert evaluation of synthetic images
    • Measure realism with FID score
    • Verify utility (does augmentation improve downstream task?)

Next Steps

After completing this module:

  1. Advanced Diffusion: Latent diffusion (Stable Diffusion), video diffusion, 3D generation
  2. Healthcare Applications: Apply to medical data synthesis
  3. Multimodal Integration: Combine with VLM knowledge for advanced conditioning
  4. Research: Explore recent papers on efficient diffusion, controllable generation

Time Investment

Total estimated time: 12-18 hours over 2 weeks

  • Generative models and diffusion fundamentals: 4-6 hours
  • DDPM training: 3-4 hours
  • DDIM fast sampling: 2-3 hours
  • Classifier-free guidance and text conditioning: 3-5 hours
  • Healthcare applications: 2-3 hours

Key Takeaway

“Diffusion models won by being boring.”

Unlike GANs with adversarial training and mode collapse, or VAEs with posterior collapse, diffusion models just work. Stable training. High quality. Flexible conditioning. The iterative denoising process is simple, intuitive, and scales to billions of parameters. DDIM made them fast. Classifier-free guidance made them controllable. Now they power virtually all modern image generation.

Understand the math. Implement DDPM training. Master classifier-free guidance. Apply to healthcare.


Ready to begin? Watch the Welch Labs video, then start with the Generative Models overview.