Generative Diffusion Models
Learn the mathematics and implementation of diffusion models - the technology that revolutionized image generation and now powers DALL-E, Stable Diffusion, and Midjourney. Understand forward and reverse diffusion processes, noise prediction training, and text conditioning through classifier-free guidance.
Why Diffusion Models Matter
Between 2020 and 2023, diffusion models replaced GANs as the default approach for high-quality image generation:
Timeline:
- 2020: DDPM paper introduces diffusion for high-quality generation
- 2021: DDIM enables 20-50x faster sampling
- 2022: DALL-E 2, Stable Diffusion, Midjourney launch publicly
- 2023: Diffusion models dominate generative AI (images, video, audio, 3D)
Why Diffusion Won:
- More stable training than GANs (no mode collapse)
- Better sample quality than VAEs
- Flexible conditioning (text, images, sketches)
- Controllable generation via guidance
Healthcare Applications:
- Synthetic medical image generation for rare pathologies
- Privacy-preserving synthetic patient data
- Data augmentation for imbalanced datasets
- Conditional generation (symptom text → medical image)
Learning Objectives
After completing this module, you will:
- Generative Model Understanding: Grasp different generative approaches (GANs, VAEs, Diffusion) and why diffusion won
- Diffusion Mathematics: Understand forward process (adding noise) and reverse process (learned denoising) with DDPM and DDIM
- Text Conditioning: Learn how CLIP embeddings guide image generation through classifier-free guidance
- Implementation Skills: Implement DDPM training and DDIM sampling from scratch
- Healthcare Application: Apply generative techniques to synthetic medical data generation
Prerequisites
Before starting this module, ensure you have:
- Strong CNN understanding (for image generation)
- Transformer knowledge (for conditioning mechanisms)
- CLIP understanding (for text-guided generation)
- Probability: Basic understanding of Gaussian distributions, Markov chains
Essential Preparation: Watch Welch Labs: “But How Do AI Images Work?” (35 minutes) before starting. This provides crucial visual intuition for diffusion models.
Week 1: Diffusion Fundamentals
Day 1-2: Generative Models Landscape
Core Concept:
Generative Models
What You'll Learn:
Generative Model Comparison:
| Model | Training | Sampling | Quality | Diversity | Stability |
|---|---|---|---|---|---|
| GANs | Adversarial | Single step (fast) | High | Medium | Unstable (mode collapse) |
| VAEs | Likelihood-based | Single step (fast) | Medium | High | Stable |
| Diffusion | Denoising | Iterative (slow) | Highest | High | Very stable |
Generative Adversarial Networks (GANs):
- Generator vs Discriminator
- Pros: Fast sampling, high quality
- Cons: Training instability, mode collapse, hard to condition
Variational Autoencoders (VAEs):
- Encoder → Latent space → Decoder
- Pros: Stable training, explicit likelihood
- Cons: Blurry samples, posterior collapse
Diffusion Models:
- Gradually add noise, then learn to denoise
- Pros: Stable training, highest quality, flexible conditioning
- Cons: Slow sampling (fixed by DDIM)
Why Diffusion Won (2020-2023):
- Quality: Better FID scores than GANs on most benchmarks
- Stability: No mode collapse, reliable training
- Scalability: Works at massive scale (billions of parameters)
- Conditioning: Easy to add text/class/image conditioning
- Fast Sampling: DDIM reduced steps from 1000 → 20-50
Learning Resources:
- Papers: Original GAN, VAE, DDPM papers
- Reading: Generative models survey
- Videos: Welch Labs diffusion video (essential viewing)
Exercises:
- Compare generated samples from GAN, VAE, and diffusion
- Understand trade-offs for different applications
- Identify when to use each model type
Checkpoint: Can you explain why diffusion models replaced GANs for most applications?
Day 3-5: Diffusion Fundamentals
Core Concept:
Diffusion Fundamentals
What You'll Learn:
Forward Diffusion Process (Adding Noise):
Start with a real image $x_0$ and gradually add Gaussian noise over $T$ steps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$

Where:
- $x_0$ is the original image
- $x_T$ is (approximately) pure noise
- $\beta_t$ is the noise schedule (how much noise is added at each step)
- $T = 1000$ steps typically
Key Insight - Direct Timestep Sampling:
We don't need to apply noise $t$ times sequentially! We can sample $x_t$ directly from $x_0$:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$

Where:
- $\epsilon \sim \mathcal{N}(0, I)$ is standard Gaussian noise
This is the training data generation trick!
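A minimal sketch of this one-shot sampling, assuming a precomputed tensor `alpha_bar` holding the cumulative products $\bar{\alpha}_t$ (the helper name `q_sample` is a common convention, not a specific library API):

```python
import torch

def q_sample(x_0, t, alpha_bar):
    """Sample x_t ~ q(x_t | x_0) in a single step instead of t sequential noising steps.

    x_0:       clean images, shape (B, C, H, W)
    t:         integer timesteps, shape (B,)
    alpha_bar: cumulative products of (1 - beta), shape (T,)
    """
    epsilon = torch.randn_like(x_0)                  # standard Gaussian noise
    ab = alpha_bar[t].view(-1, 1, 1, 1)              # per-sample alpha_bar_t, broadcastable
    x_t = torch.sqrt(ab) * x_0 + torch.sqrt(1.0 - ab) * epsilon
    return x_t, epsilon                              # epsilon is kept as the training target
```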
Noise Schedules:
- Linear Schedule: $\beta_t$ increases linearly from $\beta_1$ (e.g., $10^{-4}$) to $\beta_T$ (e.g., $0.02$)
  - Simple but not optimal
  - Too much noise too quickly
- Cosine Schedule: smoother noise addition
  - Better for high-resolution images
  - More uniform signal-to-noise ratio across timesteps
Alpha/Beta Notation:
- $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$; $\bar{\alpha}_t$ represents how much original signal remains at timestep $t$
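A sketch of both schedules, assuming the commonly cited defaults (linear range roughly $10^{-4}$ to $0.02$; cosine form following Nichol & Dhariwal, 2021) rather than required values:

```python
import math
import torch

def linear_beta_schedule(T, beta_1=1e-4, beta_T=0.02):
    """beta_t rises linearly from beta_1 to beta_T."""
    return torch.linspace(beta_1, beta_T, T)

def cosine_beta_schedule(T, s=0.008):
    """Define alpha_bar with a cosine curve, then recover beta_t from consecutive ratios."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    alpha_bar = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - (alpha_bar[1:] / alpha_bar[:-1])
    return betas.clamp(max=0.999).float()

betas = linear_beta_schedule(1000)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # fraction of original signal remaining at each t
```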
Reverse Process (Learned Denoising):
Learn to reverse the diffusion process - denoise step by step:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$
The model predicts how to remove noise at each timestep.
Learning Resources:
- Papers: “Denoising Diffusion Probabilistic Models” (DDPM)
- Videos: Welch Labs video on forward/reverse process
- Code: Implement forward diffusion process
Exercises:
- Implement forward diffusion (noise addition)
- Visualize images at different timesteps (t=0, 250, 500, 750, 1000)
- Implement direct timestep sampling
- Compare linear vs cosine noise schedules
Checkpoint: Can you explain why we can sample $x_t$ directly from $q(x_t \mid x_0)$ instead of applying noise $t$ times?
Day 6-7: DDPM Training
Architecture Paper:
DDPM (Denoising Diffusion Probabilistic Models)
What You'll Learn:
DDPM’s Key Insight:
Instead of predicting $x_0$ directly, predict the noise $\epsilon$ that was added:

$$L_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\big[\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2\,\big]$$

Where:
- $\epsilon$ is the actual noise added
- $\epsilon_\theta(x_t, t)$ is the model's noise prediction
- We sample $t \sim \text{Uniform}\{1, \dots, T\}$ and $\epsilon \sim \mathcal{N}(0, I)$ for each training example
Why Predict Noise?
- Simpler objective than predicting $x_0$ or the posterior mean directly
- More stable training
- Better gradient properties
- Empirically works much better
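The two targets are directly linked through the forward equation: rearranging it gives the clean image implied by a noise prediction,

$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}},$$

so predicting $\epsilon$ implicitly predicts $x_0$ at every timestep; empirically, the noise parameterization simply trains more stably.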
U-Net Architecture:
DDPM uses a U-Net for the denoising model:
- Encoder: Downsampling path (convolutions + downsampling)
- Bottleneck: Lowest resolution with most channels
- Decoder: Upsampling path (transposed convolutions)
- Skip Connections: Concatenate encoder features to the decoder at matching resolutions (channel-wise concatenation, unlike ResNet's additive shortcuts)
Timestep Embedding:
- Sinusoidal positional encoding (like transformers)
- Tells model which timestep it’s denoising
- Injected via adaptive group normalization
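A minimal sketch of the sinusoidal embedding (same construction as transformer positional encodings); the dimension and the downstream MLP are illustrative choices, not a fixed recipe:

```python
import math
import torch

def timestep_embedding(t, dim):
    """Map integer timesteps of shape (B,) to sinusoidal embeddings of shape (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                     # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (B, dim), assumes even dim

# In the U-Net, this embedding is usually passed through a small MLP and then injected
# into each residual block, e.g., as a per-channel scale/shift after group normalization.
```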
Self-Attention Layers:
- Added at lower resolutions (e.g., 16×16)
- Captures long-range dependencies
- Critical for high-quality generation
DDPM Training Algorithm:
```python
# Simplified DDPM training loop
for epoch in range(num_epochs):
    for x_0 in dataloader:  # Real images, shape (B, C, H, W)
        # Sample a random timestep for each image
        t = torch.randint(0, T, (x_0.shape[0],), device=x_0.device)
        # Sample Gaussian noise
        epsilon = torch.randn_like(x_0)
        # Create the noisy image x_t directly from x_0 (direct timestep sampling)
        x_t = (sqrt_alpha_bar[t].view(-1, 1, 1, 1) * x_0
               + sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1) * epsilon)
        # Predict the noise that was added
        epsilon_pred = model(x_t, t)
        # MSE between true and predicted noise
        loss = mse_loss(epsilon, epsilon_pred)
        # Backprop and update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

DDPM Sampling Algorithm:
```python
# Start from pure noise
x_t = torch.randn(batch_size, 3, H, W)

# Denoise step by step, from t = T-1 down to 0
for t in reversed(range(T)):
    # Predict the noise at this timestep (t broadcast across the batch)
    t_batch = torch.full((batch_size,), t, device=x_t.device, dtype=torch.long)
    epsilon_pred = model(x_t, t_batch)
    # Posterior mean: remove the predicted noise contribution
    mean = (x_t - beta[t] / sqrt_one_minus_alpha_bar[t] * epsilon_pred) / sqrt_alpha[t]
    # Add fresh noise (except at the final step)
    if t > 0:
        noise = torch.randn_like(x_t)
        x_t = mean + torch.sqrt(beta[t]) * noise
    else:
        x_t = mean

# x_t is now the generated image
```

Learning Resources:
- Papers: “Denoising Diffusion Probabilistic Models” (Ho et al., 2020)
- Code:
- Official DDPM implementation (TensorFlow)
- Unofficial PyTorch implementations
- Videos: DDPM paper explained
Exercises:
- Implement DDPM training loop
- Train simplified diffusion model on MNIST or CIFAR-10
- Visualize denoising process at different timesteps
- Implement U-Net with timestep conditioning
Checkpoint: Can you implement DDPM training and explain why we predict noise instead of images?
Week 2: Fast Sampling and Text Conditioning
Day 8-9: DDIM - Fast Sampling
Architecture Paper:
DDIM (Denoising Diffusion Implicit Models)
What You'll Learn:
The Problem with DDPM:
- Needs 1000 steps for generation → very slow
- 50 seconds per image on GPU
- Impractical for real-time applications
DDIM’s Solution:
- Same trained model, different sampling process
- Skip timesteps without retraining
- 20-50 steps instead of 1000 → 20-50x speedup!
- Deterministic generation (same noise → same image)
DDIM Update Rule:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t \epsilon_t$$

Where:
- $\sigma_t$ controls stochasticity ($\sigma_t = 0$ gives deterministic sampling)
- the spacing between consecutive sampled timesteps is the step size (can be > 1, which is how steps are skipped)
Key Insights:
- Deterministic sampling: Set $\sigma_t = 0$, get same image from same noise
- Step skipping: Use subset of timesteps (e.g., [0, 50, 100, …, 1000])
- No retraining: Use existing DDPM checkpoint
- Interpolation: Can interpolate between images in latent space
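A sketch of deterministic DDIM sampling ($\sigma_t = 0$) over an evenly strided subset of timesteps, assuming the same `model(x_t, t)` interface and `alpha_bar` tensor as the DDPM code above; this follows the update rule informally rather than any particular library API:

```python
import torch

@torch.no_grad()
def ddim_sample(model, alpha_bar, shape, num_steps=50, device="cpu"):
    """Deterministic DDIM sampling using only num_steps of the original T timesteps."""
    T = alpha_bar.shape[0]
    timesteps = torch.linspace(T - 1, 0, num_steps).long()        # e.g., 50 strided steps
    x_t = torch.randn(shape, device=device)                       # start from pure noise
    for i, t in enumerate(timesteps):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else torch.tensor(1.0)
        t_batch = torch.full((shape[0],), int(t), device=device, dtype=torch.long)
        eps = model(x_t, t_batch)                                  # predicted noise
        x0_pred = (x_t - torch.sqrt(1 - ab_t) * eps) / torch.sqrt(ab_t)      # implied clean image
        x_t = torch.sqrt(ab_prev) * x0_pred + torch.sqrt(1 - ab_prev) * eps  # jump to previous step
    return x_t
```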
DDPM vs DDIM:
| Aspect | DDPM | DDIM |
|---|---|---|
| Sampling steps | 1000 | 20-50 |
| Sampling speed | Slow (50s) | Fast (2-3s) |
| Stochastic | Yes | Optional |
| Deterministic | No | Yes |
| Requires retraining | - | No |
Practical Tips:
- 50 steps: Good quality, ~20x faster
- 20 steps: Acceptable quality, ~50x faster
- 10 steps: Lower quality but usable, ~100x faster
- Trade-off quality vs speed based on application
Learning Resources:
- Papers: “Denoising Diffusion Implicit Models” (Song et al., 2020)
- Code: DDIM sampling implementation
- Comparison: Side-by-side DDPM vs DDIM
Exercises:
- Implement DDIM sampling
- Compare DDPM (1000 steps) vs DDIM (50 steps) quality
- Experiment with different step counts
- Test deterministic generation (same seed → same image)
Checkpoint: Can you explain how DDIM achieves 20-50x speedup without retraining?
Day 10-12: Classifier-Free Guidance
Core Concept:
Classifier-Free Guidance
What You'll Learn:
Text-to-Image Generation:
How do we guide diffusion models with text prompts like “a cat wearing a hat”?
Early Approach - Classifier Guidance:
- Train separate classifier on noisy images
- Use classifier gradients to guide denoising
- Problems: Needs extra classifier, adversarial artifacts
Classifier-Free Guidance (CFG):
- No separate classifier needed!
- Train single model both conditionally and unconditionally
- Amplify conditioning during sampling
Training with Conditioning:
```python
import random

# During training, randomly drop the conditioning with probability p (e.g., 10%)
if random.random() < 0.1:
    # Unconditional training: the model learns to denoise without any prompt
    epsilon_pred = model(x_t, t, cond=None)
else:
    # Conditional training: the model learns to denoise given the text embedding
    epsilon_pred = model(x_t, t, cond=text_embedding)
loss = mse_loss(epsilon, epsilon_pred)
```

Sampling with Guidance:
$$\tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + s\,\big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big)$$

Where:
- $c$ is the conditioning (e.g., CLIP text embedding)
- $\varnothing$ is the empty (unconditional) conditioning
- $s$ is the guidance scale (typically 7-15)
Intuition:
- Unconditional prediction: “Generate any image”
- Conditional prediction: “Generate image matching text”
- Guidance: Amplify the difference → stronger text adherence
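A small sketch of that amplification at sampling time, assuming the model accepts an optional `cond` argument as in the training snippet above (names are illustrative):

```python
def guided_noise_prediction(model, x_t, t, text_embedding, guidance_scale=7.5):
    """Classifier-free guidance: combine unconditional and conditional noise predictions."""
    eps_uncond = model(x_t, t, cond=None)            # "generate any image"
    eps_cond = model(x_t, t, cond=text_embedding)    # "generate an image matching the text"
    # Push the prediction away from the unconditional estimate, toward the conditional one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

In practice the two forward passes are often batched together (conditional and unconditional inputs concatenated), so guidance costs one wider pass rather than two.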
Guidance Scale Effects:
| Guidance Scale | Effect |
|---|---|
| s = 0 | Ignore conditioning (pure unconditional) |
| s = 1 | Standard conditional generation |
| s = 3-5 | Moderate text adherence |
| s = 7-10 | Strong text adherence (Stable Diffusion default: 7.5) |
| s = 15+ | Very strong adherence, may sacrifice quality |
Negative Prompts:
Guide away from unwanted concepts:
```python
# Positive prompt: "a beautiful landscape"
# Negative prompt: "blurry, low quality, watermark"
# In practice the negative-prompt prediction takes the place of the unconditional one:
epsilon_pred = epsilon_cond_negative + \
    s * (epsilon_cond_positive - epsilon_cond_negative)
```

Text Conditioning via CLIP:
- Use CLIP text encoder for text embeddings
- Cross-attention in U-Net: Query from image features, Key+Value from text embeddings
- Text embedding injected at multiple resolutions
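A minimal cross-attention sketch matching the query/key/value split described above; module names and dimensions are illustrative, and real implementations use multiple heads:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Image features attend to text tokens: queries from the image, keys/values from the text."""
    def __init__(self, img_dim, txt_dim, head_dim=64):
        super().__init__()
        self.q = nn.Linear(img_dim, head_dim)
        self.k = nn.Linear(txt_dim, head_dim)
        self.v = nn.Linear(txt_dim, head_dim)
        self.out = nn.Linear(head_dim, img_dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, H*W, img_dim) flattened spatial features from the U-Net
        # txt_tokens: (B, L, txt_dim) CLIP text embeddings
        q, k, v = self.q(img_tokens), self.k(txt_tokens), self.v(txt_tokens)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)  # (B, H*W, L)
        return img_tokens + self.out(attn @ v)   # residual add of text-informed features
```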
Learning Resources:
- Papers:
- “Classifier-Free Diffusion Guidance” (Ho & Salimans, 2021)
- “High-Resolution Image Synthesis with Latent Diffusion Models” (Stable Diffusion)
- Code: Classifier-free guidance implementation
- Demos: Stable Diffusion, DALL-E 2
Exercises:
- Implement classifier-free guidance
- Train conditional diffusion model (class-conditional on CIFAR-10)
- Experiment with different guidance scales
- Implement negative prompts
- Compare conditional vs unconditional generation
Checkpoint: Can you implement classifier-free guidance and explain how it amplifies conditioning?
Day 13-14: Complete Text-to-Image Pipeline
Putting It All Together:
Complete Modern Text-to-Image System:
Text Prompt
↓
CLIP Text Encoder → Text Embeddings
↓
Cross-Attention in U-Net
↑
Pure Noise (x_T) → DDIM Sampling (50 steps) → Generated Image (x_0)
↑
U-Net Noise Predictor
(trained with classifier-free guidance)

Key Components:
- CLIP Text Encoder: Convert prompt to embeddings
- U-Net with Cross-Attention: Noise predictor conditioned on text
- DDIM Sampling: Fast generation (50 steps)
- Classifier-Free Guidance: Strong text adherence
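For a working reference point, the HuggingFace Diffusers library wires these components together; a minimal usage sketch (the model id is illustrative and may have moved on the Hub, and a GPU is assumed):

```python
import torch
from diffusers import StableDiffusionPipeline

# Loads the CLIP text encoder, U-Net, VAE, and a fast scheduler as one pipeline.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a cat wearing a hat",
    num_inference_steps=50,    # fast sampling (DDIM-style strided steps)
    guidance_scale=7.5,        # classifier-free guidance strength
).images[0]
image.save("cat_hat.png")
```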
Hands-On Project:
Requirements:
- Implement complete text-conditional diffusion model
- Train on small dataset (e.g., Oxford Flowers with captions)
- Use DDIM for fast sampling
- Implement classifier-free guidance
- Generate images from text prompts
Advanced Features:
- Image-to-image (start from noisy real image)
- Inpainting (mask regions to regenerate)
- Style transfer via prompts
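For the image-to-image feature, a rough sketch of the idea (reusing the `alpha_bar`, `T`, and model setup from earlier snippets; `real_image` is a preprocessed tensor and `strength` is an illustrative knob):

```python
import torch

# Image-to-image: instead of starting from pure noise, partially noise a real image
# and run the reverse process from an intermediate timestep.
strength = 0.6                                   # fraction of the diffusion process to redo
t_start = int(strength * (T - 1))
ab = alpha_bar[t_start]
x_t = torch.sqrt(ab) * real_image + torch.sqrt(1 - ab) * torch.randn_like(real_image)
# ...then denoise from t_start down to 0 (e.g., with the DDIM loop above) instead of from T.
```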
Healthcare Application Example:
Synthetic Medical Image Generation:
```python
# Generate synthetic chest X-rays for data augmentation
prompts = [
    "chest x-ray showing pneumonia in right lung",
    "chest x-ray with normal lung fields",
    "chest x-ray showing cardiomegaly",
]

for prompt in prompts:
    # Generate multiple synthetic examples per condition
    for i in range(10):
        synthetic_xray = diffusion_model.generate(
            prompt=prompt,
            guidance_scale=7.5,
            num_steps=50,
        )
        # Use a filesystem-safe filename derived from the prompt
        save_image(synthetic_xray, f"{prompt.replace(' ', '_')}_{i}.png")
```

Benefits:
- Augment rare pathology datasets
- Privacy-preserving (no real patient data)
- Controlled generation (specify exact conditions)
Module Completion Criteria
You have completed this module when you can:
- ✅ Explain the diffusion process (forward and reverse)
- ✅ Understand why we predict noise instead of images
- ✅ Implement DDPM training loop
- ✅ Implement DDIM sampling for fast generation
- ✅ Understand and implement classifier-free guidance
- ✅ Build complete text-to-image pipeline
- ✅ Apply diffusion models to healthcare problems
Key Resources
Essential Papers (Must Read)
- “Denoising Diffusion Probabilistic Models” (Ho et al., 2020) - DDPM
- “Denoising Diffusion Implicit Models” (Song et al., 2020) - DDIM
- “Classifier-Free Diffusion Guidance” (Ho & Salimans, 2021) - CFG
Essential Video (Must Watch)
Welch Labs: “But How Do AI Images Work?” (35 minutes)
- Best intuitive explanation of diffusion
- Visual walkthrough of DALL-E 2 architecture
- Watch before starting this module
Code Resources
- HuggingFace Diffusers library (production-ready implementations)
- Stable Diffusion official code
- Educational diffusion implementations
Healthcare-Specific Resources
- Medical image synthesis papers
- Synthetic data validation studies
- Privacy-preserving generation techniques
Common Pitfalls
1. Wrong Noise Schedule
Problem: Images don't denoise properly.
Solution: Use a cosine schedule for high-resolution images and validate the alpha/beta values.
2. Too Few Sampling Steps
Problem: Low-quality images with DDIM.
Solution: Use at least 20-50 steps for good quality.
3. Wrong Guidance Scale
Problem: Poor text adherence or unnatural images.
Solution: Start with 7.5 and adjust based on results (3-15 range).
4. Forgetting Conditioning Dropout
Problem: Classifier-free guidance doesn't work.
Solution: Train with a 10% unconditional (conditioning-dropout) probability.
5. Memory Issues
Problem: Out of memory during training.
Solution: Reduce the batch size, use gradient accumulation, or train at lower resolution.
Success Tips
- Start with MNIST/CIFAR-10
  - Validate the implementation on simple data
  - Faster iteration cycles
  - Scale up to high resolution later
- Use Pre-Trained Models First
  - Download Stable Diffusion weights
  - Understand the architecture by using it
  - Fine-tune on domain data
- Visualize the Process
  - Save intermediate denoising steps
  - Watch how images form from noise
  - Understand what each component does
- Monitor Training Carefully
  - Watch sample quality during training
  - Check that conditioning has an effect
  - Validate the guidance scale range
- Healthcare Validation
  - Get clinical expert evaluation of synthetic images
  - Measure realism with FID score (see the sketch below)
  - Verify utility (does augmentation improve the downstream task?)
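A rough sketch of the FID measurement mentioned above, assuming the torchmetrics library (check its current API; by default it expects uint8 image tensors of shape (N, 3, H, W)):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)    # Inception feature dimension used for FID

# real_images / synthetic_images: uint8 tensors of shape (N, 3, H, W)
fid.update(real_images, real=True)
fid.update(synthetic_images, real=False)
print(f"FID: {fid.compute().item():.2f}")       # lower is better
```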
Next Steps
After completing this module:
- Advanced Diffusion: Latent diffusion (Stable Diffusion), video diffusion, 3D generation
- Healthcare Applications: Apply to medical data synthesis
- Multimodal Integration: Combine with VLM knowledge for advanced conditioning
- Research: Explore recent papers on efficient diffusion, controllable generation
Time Investment
Total estimated time: 12-18 hours over 2 weeks
- Generative models and diffusion fundamentals: 4-6 hours
- DDPM training: 3-4 hours
- DDIM fast sampling: 2-3 hours
- Classifier-free guidance and text conditioning: 3-5 hours
- Healthcare applications: 2-3 hours
Key Takeaway
“Diffusion models won by being boring.”
Unlike GANs with adversarial training and mode collapse, or VAEs with posterior collapse, diffusion models just work. Stable training. High quality. Flexible conditioning. The iterative denoising process is simple, intuitive, and scales to billions of parameters. DDIM made them fast. Classifier-free guidance made them controllable. Now they power virtually all modern image generation.
Understand the math. Implement DDPM training. Master classifier-free guidance. Apply to healthcare.
Ready to begin? Start with Generative Models overview, then watch the Welch Labs video.