Generative Diffusion Models
Learn the mathematics and implementation of diffusion models - the technology that revolutionized image generation and now powers DALL-E, Stable Diffusion, and Midjourney. Understand forward and reverse diffusion processes, noise prediction training, and text conditioning through classifier-free guidance.
Why Diffusion Models Matter
Between 2020 and 2023, diffusion models replaced GANs as the default approach for high-quality image generation:
Timeline:
- 2020: DDPM paper introduces diffusion for high-quality generation
- 2021: DDIM enables 20-50x faster sampling
- 2022: DALL-E 2, Stable Diffusion, Midjourney launch publicly
- 2023: Diffusion models dominate generative AI (images, video, audio, 3D)
Why Diffusion Won:
- More stable training than GANs (no mode collapse)
- Better sample quality than VAEs
- Flexible conditioning (text, images, sketches)
- Controllable generation via guidance
Healthcare Applications:
- Synthetic medical image generation for rare pathologies
- Privacy-preserving synthetic patient data
- Data augmentation for imbalanced datasets
- Conditional generation (symptom text → medical image)
Learning Objectives
After completing this module, you will:
- Generative Model Understanding: Grasp different generative approaches (GANs, VAEs, Diffusion) and why diffusion won
- Diffusion Mathematics: Understand forward process (adding noise) and reverse process (learned denoising) with DDPM and DDIM
- Text Conditioning: Learn how CLIP embeddings guide image generation through classifier-free guidance
- Implementation Skills: Implement DDPM training and DDIM sampling from scratch
- Healthcare Application: Apply generative techniques to synthetic medical data generation
Prerequisites
Before starting this module, ensure you have:
- Strong CNN understanding (for image generation)
- Transformer knowledge (for conditioning mechanisms)
- CLIP understanding (for text-guided generation)
- Probability: Basic understanding of Gaussian distributions, Markov chains
Essential Preparation: Watch Welch Labs: “But How Do AI Images Work?” (35 minutes) before starting. This provides crucial visual intuition for diffusion models.
Week 1: Diffusion Fundamentals
Day 1-2: Generative Models Landscape
Core Concept:
Generative Models
What You'll Learn:
Generative Model Comparison:
| Model | Training | Sampling | Quality | Diversity | Stability |
|---|---|---|---|---|---|
| GANs | Adversarial | Single step (fast) | High | Medium | Unstable (mode collapse) |
| VAEs | Likelihood-based | Single step (fast) | Medium | High | Stable |
| Diffusion | Denoising | Iterative (slow) | Highest | High | Very stable |
Generative Adversarial Networks (GANs):
- Generator vs Discriminator
- Pros: Fast sampling, high quality
- Cons: Training instability, mode collapse, hard to condition
Variational Autoencoders (VAEs):
- Encoder → Latent space → Decoder
- Pros: Stable training, explicit likelihood
- Cons: Blurry samples, posterior collapse
Diffusion Models:
- Gradually add noise, then learn to denoise
- Pros: Stable training, highest quality, flexible conditioning
- Cons: Slow sampling (fixed by DDIM)
Why Diffusion Won (2020-2023):
- Quality: Better FID scores than GANs on most benchmarks
- Stability: No mode collapse, reliable training
- Scalability: Works at massive scale (billions of parameters)
- Conditioning: Easy to add text/class/image conditioning
- Fast Sampling: DDIM reduced steps from 1000 → 20-50
Learning Resources:
- Papers: Original GAN, VAE, DDPM papers
- Reading: Generative models survey
- Videos: Welch Labs diffusion video (essential viewing)
Exercises:
- Compare generated samples from GAN, VAE, and diffusion
- Understand trade-offs for different applications
- Identify when to use each model type
Checkpoint: Can you explain why diffusion models replaced GANs for most applications?
Day 3-5: Diffusion Fundamentals
Core Concept:
Diffusion Fundamentals
What You'll Learn:
Forward Diffusion Process (Adding Noise):
Start with a real image $x_0$ and gradually add Gaussian noise over $T$ steps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$

Where:
- $x_0$ is the original image
- $x_T$ is (approximately) pure noise
- $\beta_t$ is the noise schedule (how much noise is added at each step)
- $T = 1000$ steps typically
Key Insight - Direct Timestep Sampling:
We don't need to apply noise $t$ times sequentially! We can sample $x_t$ directly from $x_0$:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$

Where:
- $\epsilon \sim \mathcal{N}(0, I)$ is standard Gaussian noise
This is the training data generation trick!
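A minimal sketch of this one-shot sampling, assuming a precomputed tensor `alpha_bar` holding the cumulative products $\bar{\alpha}_t$ (the helper name `q_sample` is a common convention, not a specific library API):

```python
import torch

def q_sample(x_0, t, alpha_bar):
    """Sample x_t ~ q(x_t | x_0) in a single step instead of t sequential noising steps.

    x_0:       clean images, shape (B, C, H, W)
    t:         integer timesteps, shape (B,)
    alpha_bar: cumulative products of (1 - beta), shape (T,)
    """
    epsilon = torch.randn_like(x_0)                  # standard Gaussian noise
    ab = alpha_bar[t].view(-1, 1, 1, 1)              # per-sample alpha_bar_t, broadcastable
    x_t = torch.sqrt(ab) * x_0 + torch.sqrt(1.0 - ab) * epsilon
    return x_t, epsilon                              # epsilon is kept as the training target
```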
Noise Schedules:
- Linear Schedule: $\beta_t$ increases linearly from $\beta_1$ (e.g., $10^{-4}$) to $\beta_T$ (e.g., $0.02$)
  - Simple but not optimal
  - Too much noise too quickly
- Cosine Schedule: smoother noise addition
  - Better for high-resolution images
  - More uniform signal-to-noise ratio across timesteps
Alpha/Beta Notation:
- $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$; $\bar{\alpha}_t$ represents how much original signal remains at timestep $t$
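A sketch of both schedules, assuming the commonly cited defaults (linear range roughly $10^{-4}$ to $0.02$; cosine form following Nichol & Dhariwal, 2021) rather than required values:

```python
import math
import torch

def linear_beta_schedule(T, beta_1=1e-4, beta_T=0.02):
    """beta_t rises linearly from beta_1 to beta_T."""
    return torch.linspace(beta_1, beta_T, T)

def cosine_beta_schedule(T, s=0.008):
    """Define alpha_bar with a cosine curve, then recover beta_t from consecutive ratios."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    alpha_bar = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - (alpha_bar[1:] / alpha_bar[:-1])
    return betas.clamp(max=0.999).float()

betas = linear_beta_schedule(1000)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # fraction of original signal remaining at each t
```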
Reverse Process (Learned Denoising):
Learn to reverse the diffusion process - denoise step by step:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$
The model predicts how to remove noise at each timestep.
Learning Resources:
- Papers: “Denoising Diffusion Probabilistic Models” (DDPM)
- Videos: Welch Labs video on forward/reverse process
- Code: Implement forward diffusion process
Exercises:
- Implement forward diffusion (noise addition)
- Visualize images at different timesteps (t=0, 250, 500, 750, 1000)
- Implement direct timestep sampling
- Compare linear vs cosine noise schedules
Checkpoint: Can you explain why we can sample $x_t$ directly from $q(x_t \mid x_0)$ instead of applying noise $t$ times?
Day 6-7: DDPM Training
Architecture Paper:
DDPM (Denoising Diffusion Probabilistic Models)
What You'll Learn:
DDPM’s Key Insight:
Instead of predicting $x_0$ directly, predict the noise $\epsilon$ that was added:

$$L_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\big[\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2\,\big]$$

Where:
- $\epsilon$ is the actual noise added
- $\epsilon_\theta(x_t, t)$ is the model's noise prediction
- We sample $t \sim \text{Uniform}\{1, \dots, T\}$ and $\epsilon \sim \mathcal{N}(0, I)$ for each training example
Why Predict Noise?
- Simpler objective than predicting $x_0$ or the posterior mean directly
- More stable training
- Better gradient properties
- Empirically works much better
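The two targets are directly linked through the forward equation: rearranging it gives the clean image implied by a noise prediction,

$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}},$$

so predicting $\epsilon$ implicitly predicts $x_0$ at every timestep; empirically, the noise parameterization simply trains more stably.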
U-Net Architecture:
DDPM uses a U-Net for the denoising model:
- Encoder: Downsampling path (convolutions + downsampling)
- Bottleneck: Lowest resolution with most channels
- Decoder: Upsampling path (transposed convolutions)
- Skip Connections: Concatenate encoder features to the decoder at matching resolutions (channel-wise concatenation, unlike ResNet's additive shortcuts)
Timestep Embedding:
- Sinusoidal positional encoding (like transformers)
- Tells model which timestep it’s denoising
- Injected via adaptive group normalization
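A minimal sketch of the sinusoidal embedding (same construction as transformer positional encodings); the dimension and the downstream MLP are illustrative choices, not a fixed recipe:

```python
import math
import torch

def timestep_embedding(t, dim):
    """Map integer timesteps of shape (B,) to sinusoidal embeddings of shape (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                     # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (B, dim), assumes even dim

# In the U-Net, this embedding is usually passed through a small MLP and then injected
# into each residual block, e.g., as a per-channel scale/shift after group normalization.
```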
Self-Attention Layers:
- Added at lower resolutions (e.g., 16×16)
- Captures long-range dependencies
- Critical for high-quality generation
DDPM Training Algorithm:
```python
# Simplified DDPM training loop
for epoch in range(num_epochs):
    for x_0 in dataloader:  # Real images, shape (B, C, H, W)
        # Sample a random timestep for each image
        t = torch.randint(0, T, (x_0.shape[0],), device=x_0.device)
        # Sample Gaussian noise
        epsilon = torch.randn_like(x_0)
        # Create the noisy image x_t directly from x_0 (direct timestep sampling)
        x_t = (sqrt_alpha_bar[t].view(-1, 1, 1, 1) * x_0
               + sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1) * epsilon)
        # Predict the noise that was added
        epsilon_pred = model(x_t, t)
        # MSE between true and predicted noise
        loss = mse_loss(epsilon, epsilon_pred)
        # Backprop and update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

DDPM Sampling Algorithm:
```python
# Start from pure noise
x_t = torch.randn(batch_size, 3, H, W)

# Denoise step by step, from t = T-1 down to 0
for t in reversed(range(T)):
    # Predict the noise at this timestep (t broadcast across the batch)
    t_batch = torch.full((batch_size,), t, device=x_t.device, dtype=torch.long)
    epsilon_pred = model(x_t, t_batch)
    # Posterior mean: remove the predicted noise contribution
    mean = (x_t - beta[t] / sqrt_one_minus_alpha_bar[t] * epsilon_pred) / sqrt_alpha[t]
    # Add fresh noise (except at the final step)
    if t > 0:
        noise = torch.randn_like(x_t)
        x_t = mean + torch.sqrt(beta[t]) * noise
    else:
        x_t = mean

# x_t is now the generated image
```

Learning Resources:
- Papers: “Denoising Diffusion Probabilistic Models” (Ho et al., 2020)
- Code:
- Official DDPM implementation (TensorFlow)
- Unofficial PyTorch implementations
- Videos: DDPM paper explained
Exercises:
- Implement DDPM training loop
- Train simplified diffusion model on MNIST or CIFAR-10
- Visualize denoising process at different timesteps
- Implement U-Net with timestep conditioning
Checkpoint: Can you implement DDPM training and explain why we predict noise instead of images?
Week 2: Fast Sampling and Text Conditioning
Day 8-9: DDIM - Fast Sampling
Architecture Paper:
DDIM (Denoising Diffusion Implicit Models)
What You'll Learn:
The Problem with DDPM:
- Needs 1000 steps for generation → very slow
- 50 seconds per image on GPU
- Impractical for real-time applications
DDIM’s Solution:
- Same trained model, different sampling process
- Skip timesteps without retraining
- 20-50 steps instead of 1000 → 20-50x speedup!
- Deterministic generation (same noise → same image)
DDIM Update Rule:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t \epsilon_t$$

Where:
- $\sigma_t$ controls stochasticity ($\sigma_t = 0$ gives deterministic sampling)
- the spacing between consecutive sampled timesteps is the step size (can be > 1, which is how steps are skipped)
Key Insights:
- Deterministic sampling: Set $\sigma_t = 0$, get same image from same noise
- Step skipping: Use subset of timesteps (e.g., [0, 50, 100, …, 1000])
- No retraining: Use existing DDPM checkpoint
- Interpolation: Can interpolate between images in latent space
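A sketch of deterministic DDIM sampling ($\sigma_t = 0$) over an evenly strided subset of timesteps, assuming the same `model(x_t, t)` interface and `alpha_bar` tensor as the DDPM code above; this follows the update rule informally rather than any particular library API:

```python
import torch

@torch.no_grad()
def ddim_sample(model, alpha_bar, shape, num_steps=50, device="cpu"):
    """Deterministic DDIM sampling using only num_steps of the original T timesteps."""
    T = alpha_bar.shape[0]
    timesteps = torch.linspace(T - 1, 0, num_steps).long()        # e.g., 50 strided steps
    x_t = torch.randn(shape, device=device)                       # start from pure noise
    for i, t in enumerate(timesteps):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else torch.tensor(1.0)
        t_batch = torch.full((shape[0],), int(t), device=device, dtype=torch.long)
        eps = model(x_t, t_batch)                                  # predicted noise
        x0_pred = (x_t - torch.sqrt(1 - ab_t) * eps) / torch.sqrt(ab_t)      # implied clean image
        x_t = torch.sqrt(ab_prev) * x0_pred + torch.sqrt(1 - ab_prev) * eps  # jump to previous step
    return x_t
```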
DDPM vs DDIM:
| Aspect | DDPM | DDIM |
|---|---|---|
| Sampling steps | 1000 | 20-50 |
| Sampling speed | Slow (50s) | Fast (2-3s) |
| Stochastic | Yes | Optional |
| Deterministic | No | Yes |
| Requires retraining | - | No |
Practical Tips:
- 50 steps: Good quality, ~20x faster
- 20 steps: Acceptable quality, ~50x faster
- 10 steps: Lower quality but usable, ~100x faster
- Trade-off quality vs speed based on application
Learning Resources:
- Papers: “Denoising Diffusion Implicit Models” (Song et al., 2020)
- Code: DDIM sampling implementation
- Comparison: Side-by-side DDPM vs DDIM
Exercises:
- Implement DDIM sampling
- Compare DDPM (1000 steps) vs DDIM (50 steps) quality
- Experiment with different step counts
- Test deterministic generation (same seed → same image)
Checkpoint: Can you explain how DDIM achieves 20-50x speedup without retraining?
Day 10-12: Classifier-Free Guidance
Core Concept:
Classifier-Free Guidance
What You'll Learn:
Text-to-Image Generation:
How do we guide diffusion models with text prompts like “a cat wearing a hat”?
Early Approach - Classifier Guidance:
- Train separate classifier on noisy images
- Use classifier gradients to guide denoising
- Problems: Needs extra classifier, adversarial artifacts
Classifier-Free Guidance (CFG):
- No separate classifier needed!
- Train single model both conditionally and unconditionally
- Amplify conditioning during sampling
Training with Conditioning:
```python
import random

# During training, randomly drop the conditioning with probability p (e.g., 10%)
if random.random() < 0.1:
    # Unconditional training: the model learns to denoise without any prompt
    epsilon_pred = model(x_t, t, cond=None)
else:
    # Conditional training: the model learns to denoise given the text embedding
    epsilon_pred = model(x_t, t, cond=text_embedding)
loss = mse_loss(epsilon, epsilon_pred)
```

Sampling with Guidance:
$$\tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + s\,\big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big)$$

Where:
- $c$ is the conditioning (e.g., CLIP text embedding)
- $\varnothing$ is the empty (unconditional) conditioning
- $s$ is the guidance scale (typically 7-15)
Intuition:
- Unconditional prediction: “Generate any image”
- Conditional prediction: “Generate image matching text”
- Guidance: Amplify the difference → stronger text adherence
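A small sketch of that amplification at sampling time, assuming the model accepts an optional `cond` argument as in the training snippet above (names are illustrative):

```python
def guided_noise_prediction(model, x_t, t, text_embedding, guidance_scale=7.5):
    """Classifier-free guidance: combine unconditional and conditional noise predictions."""
    eps_uncond = model(x_t, t, cond=None)            # "generate any image"
    eps_cond = model(x_t, t, cond=text_embedding)    # "generate an image matching the text"
    # Push the prediction away from the unconditional estimate, toward the conditional one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

In practice the two forward passes are often batched together (conditional and unconditional inputs concatenated), so guidance costs one wider pass rather than two.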
Guidance Scale Effects:
| Guidance Scale | Effect |
|---|---|
| s = 0 | Ignore conditioning (pure unconditional) |
| s = 1 | Standard conditional generation |
| s = 3-5 | Moderate text adherence |
| s = 7-10 | Strong text adherence (Stable Diffusion default: 7.5) |
| s = 15+ | Very strong adherence, may sacrifice quality |
Negative Prompts:
Guide away from unwanted concepts:
```python
# Positive prompt: "a beautiful landscape"
# Negative prompt: "blurry, low quality, watermark"
# In practice the negative-prompt prediction takes the place of the unconditional one:
epsilon_pred = epsilon_cond_negative + \
    s * (epsilon_cond_positive - epsilon_cond_negative)
```

Text Conditioning via CLIP:
- Use CLIP text encoder for text embeddings
- Cross-attention in U-Net: Query from image features, Key+Value from text embeddings
- Text embedding injected at multiple resolutions
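A minimal cross-attention sketch matching the query/key/value split described above; module names and dimensions are illustrative, and real implementations use multiple heads:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Image features attend to text tokens: queries from the image, keys/values from the text."""
    def __init__(self, img_dim, txt_dim, head_dim=64):
        super().__init__()
        self.q = nn.Linear(img_dim, head_dim)
        self.k = nn.Linear(txt_dim, head_dim)
        self.v = nn.Linear(txt_dim, head_dim)
        self.out = nn.Linear(head_dim, img_dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, H*W, img_dim) flattened spatial features from the U-Net
        # txt_tokens: (B, L, txt_dim) CLIP text embeddings
        q, k, v = self.q(img_tokens), self.k(txt_tokens), self.v(txt_tokens)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)  # (B, H*W, L)
        return img_tokens + self.out(attn @ v)   # residual add of text-informed features
```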
Learning Resources:
- Papers:
- “Classifier-Free Diffusion Guidance” (Ho & Salimans, 2021)
- “High-Resolution Image Synthesis with Latent Diffusion Models” (Stable Diffusion)
- Code: Classifier-free guidance implementation
- Demos: Stable Diffusion, DALL-E 2
Exercises:
- Implement classifier-free guidance
- Train conditional diffusion model (class-conditional on CIFAR-10)
- Experiment with different guidance scales
- Implement negative prompts
- Compare conditional vs unconditional generation
Checkpoint: Can you implement classifier-free guidance and explain how it amplifies conditioning?
Day 13-14: Complete Text-to-Image Pipeline
Putting It All Together:
Complete Modern Text-to-Image System:
Text Prompt
↓
CLIP Text Encoder → Text Embeddings
↓
Cross-Attention in U-Net
↑
Pure Noise (x_T) → DDIM Sampling (50 steps) → Generated Image (x_0)
↑
U-Net Noise Predictor
(trained with classifier-free guidance)

Key Components:
- CLIP Text Encoder: Convert prompt to embeddings
- U-Net with Cross-Attention: Noise predictor conditioned on text
- DDIM Sampling: Fast generation (50 steps)
- Classifier-Free Guidance: Strong text adherence
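For a working reference point, the HuggingFace Diffusers library wires these components together; a minimal usage sketch (the model id is illustrative and may have moved on the Hub, and a GPU is assumed):

```python
import torch
from diffusers import StableDiffusionPipeline

# Loads the CLIP text encoder, U-Net, VAE, and a fast scheduler as one pipeline.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a cat wearing a hat",
    num_inference_steps=50,    # fast sampling (DDIM-style strided steps)
    guidance_scale=7.5,        # classifier-free guidance strength
).images[0]
image.save("cat_hat.png")
```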
Hands-On Project:
Requirements:
- Implement complete text-conditional diffusion model
- Train on small dataset (e.g., Oxford Flowers with captions)
- Use DDIM for fast sampling
- Implement classifier-free guidance
- Generate images from text prompts
Advanced Features:
- Image-to-image (start from noisy real image)
- Inpainting (mask regions to regenerate)
- Style transfer via prompts
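For the image-to-image feature, a rough sketch of the idea (reusing the `alpha_bar`, `T`, and model setup from earlier snippets; `real_image` is a preprocessed tensor and `strength` is an illustrative knob):

```python
import torch

# Image-to-image: instead of starting from pure noise, partially noise a real image
# and run the reverse process from an intermediate timestep.
strength = 0.6                                   # fraction of the diffusion process to redo
t_start = int(strength * (T - 1))
ab = alpha_bar[t_start]
x_t = torch.sqrt(ab) * real_image + torch.sqrt(1 - ab) * torch.randn_like(real_image)
# ...then denoise from t_start down to 0 (e.g., with the DDIM loop above) instead of from T.
```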
Healthcare Application Example:
Synthetic Medical Image Generation:
```python
# Generate synthetic chest X-rays for data augmentation
prompts = [
    "chest x-ray showing pneumonia in right lung",
    "chest x-ray with normal lung fields",
    "chest x-ray showing cardiomegaly",
]

for prompt in prompts:
    # Generate multiple synthetic examples per condition
    for i in range(10):
        synthetic_xray = diffusion_model.generate(
            prompt=prompt,
            guidance_scale=7.5,
            num_steps=50,
        )
        # Use a filesystem-safe filename derived from the prompt
        save_image(synthetic_xray, f"{prompt.replace(' ', '_')}_{i}.png")
```

Benefits:
- Augment rare pathology datasets
- Privacy-preserving (no real patient data)
- Controlled generation (specify exact conditions)
Module Completion Criteria
You have completed this module when you can:
- ✅ Explain the diffusion process (forward and reverse)
- ✅ Understand why we predict noise instead of images
- ✅ Implement DDPM training loop
- ✅ Implement DDIM sampling for fast generation
- ✅ Understand and implement classifier-free guidance
- ✅ Build complete text-to-image pipeline
- ✅ Apply diffusion models to healthcare problems
Key Resources
Essential Papers (Must Read)
- “Denoising Diffusion Probabilistic Models” (Ho et al., 2020) - DDPM
- “Denoising Diffusion Implicit Models” (Song et al., 2020) - DDIM
- “Classifier-Free Diffusion Guidance” (Ho & Salimans, 2021) - CFG
Essential Video (Must Watch)
Welch Labs: “But How Do AI Images Work?” (35 minutes)
- Best intuitive explanation of diffusion
- Visual walkthrough of DALL-E 2 architecture
- Watch before starting this module
Code Resources
- HuggingFace Diffusers library (production-ready implementations)
- Stable Diffusion official code
- Educational diffusion implementations
Healthcare-Specific Resources
- Medical image synthesis papers
- Synthetic data validation studies
- Privacy-preserving generation techniques
Common Pitfalls
1. Wrong Noise Schedule
Problem: Images don't denoise properly.
Solution: Use a cosine schedule for high-resolution images and validate the alpha/beta values.
2. Too Few Sampling Steps
Problem: Low-quality images with DDIM.
Solution: Use at least 20-50 steps for good quality.
3. Wrong Guidance Scale
Problem: Poor text adherence or unnatural images.
Solution: Start with 7.5 and adjust based on results (3-15 range).
4. Forgetting Conditioning Dropout
Problem: Classifier-free guidance doesn't work.
Solution: Train with a 10% unconditional (conditioning-dropout) probability.
5. Memory Issues
Problem: Out of memory during training.
Solution: Reduce the batch size, use gradient accumulation, or train at lower resolution.
Success Tips
- Start with MNIST/CIFAR-10
  - Validate the implementation on simple data
  - Faster iteration cycles
  - Scale up to high resolution later
- Use Pre-Trained Models First
  - Download Stable Diffusion weights
  - Understand the architecture by using it
  - Fine-tune on domain data
- Visualize the Process
  - Save intermediate denoising steps
  - Watch how images form from noise
  - Understand what each component does
- Monitor Training Carefully
  - Watch sample quality during training
  - Check that conditioning has an effect
  - Validate the guidance scale range
- Healthcare Validation
  - Get clinical expert evaluation of synthetic images
  - Measure realism with FID score (see the sketch below)
  - Verify utility (does augmentation improve the downstream task?)
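A rough sketch of the FID measurement mentioned above, assuming the torchmetrics library (check its current API; by default it expects uint8 image tensors of shape (N, 3, H, W)):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)    # Inception feature dimension used for FID

# real_images / synthetic_images: uint8 tensors of shape (N, 3, H, W)
fid.update(real_images, real=True)
fid.update(synthetic_images, real=False)
print(f"FID: {fid.compute().item():.2f}")       # lower is better
```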
Next Steps
After completing this module:
- Advanced Diffusion: Latent diffusion (Stable Diffusion), video diffusion, 3D generation
- Healthcare Applications: Apply to medical data synthesis
- Multimodal Integration: Combine with VLM knowledge for advanced conditioning
- Research: Explore recent papers on efficient diffusion, controllable generation
Time Investment
Total estimated time: 12-18 hours over 2 weeks
- Generative models and diffusion fundamentals: 4-6 hours
- DDPM training: 3-4 hours
- DDIM fast sampling: 2-3 hours
- Classifier-free guidance and text conditioning: 3-5 hours
- Healthcare applications: 2-3 hours
Key Takeaway
“Diffusion models won by being boring.”
Unlike GANs with adversarial training and mode collapse, or VAEs with posterior collapse, diffusion models just work. Stable training. High quality. Flexible conditioning. The iterative denoising process is simple, intuitive, and scales to billions of parameters. DDIM made them fast. Classifier-free guidance made them controllable. Now they power virtually all modern image generation.
Understand the math. Implement DDPM training. Master classifier-free guidance. Apply to healthcare.
Ready to begin? Start with Generative Models overview, then watch the Welch Labs video.