DDPM: Denoising Diffusion Probabilistic Models
Paper: Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.
Key Insight: Instead of trying to predict the denoised image directly, predict the noise that was added. This simple objective leads to stable training and high-quality generation.
Core Innovation
DDPM reformulates diffusion model training as a noise prediction problem:

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$

Where:
- $t \sim \text{Uniform}(1, \dots, T)$: Random timestep
- $x_0$: Original data (e.g., image)
- $\epsilon \sim \mathcal{N}(0, I)$: Sampled noise
- $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$: Noisy version of $x_0$ at timestep $t$
- $\epsilon_\theta(x_t, t)$: Neural network that predicts the noise
In words: Train a network to predict what noise was added at any timestep.
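The noisy version $x_t$ comes from the closed-form forward marginal (with $\alpha_s = 1 - \beta_s$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$):

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, I\right)$$

so $x_t$ can be sampled in a single step for any $t$, which is what makes the training objective cheap to evaluate.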
The Reverse Process
While the forward process adds noise, the reverse process learns to denoise:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$$

The mean is parameterized using the noise prediction:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)$$

The variance $\sigma_t^2$ can be fixed or learned (fixed works well in practice).
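Two fixed choices are used in the paper and give similar results in practice:

$$\sigma_t^2 = \beta_t \qquad \text{or} \qquad \sigma_t^2 = \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t$$

The sampling code later on this page uses the first option.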
Neural Network Architecture
DDPM uses a U-Net with self-attention layers:
Architecture components:
- U-Net backbone: Encoder-decoder with skip connections
- Timestep embedding: Sinusoidal position encoding (like Transformers)
- Self-attention layers: At lower resolutions (16×16, 8×8)
- ResNet blocks: For feature processing at each resolution, conditioned on the timestep embedding (see the sketch after the next list)
Why U-Net?
- Skip connections preserve spatial information
- Multi-scale processing handles different noise levels
- Proven effective for image-to-image tasks
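A minimal sketch of how a ResNet block can consume the timestep embedding defined in the next subsection. The class and attribute names (`ResBlock`, `time_proj`) are illustrative and not taken from the paper's code:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Sketch of a timestep-conditioned ResNet block (illustrative, not the exact DDPM code)."""
    def __init__(self, channels, time_dim):
        super().__init__()
        self.norm1 = nn.GroupNorm(32, channels)   # channels must be divisible by 32 here
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        # Project the timestep embedding so it can be added to the feature maps
        self.time_proj = nn.Linear(time_dim, channels)
        self.norm2 = nn.GroupNorm(32, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.conv1(self.act(self.norm1(x)))
        # Broadcast the projected embedding over the spatial dimensions
        h = h + self.time_proj(self.act(t_emb))[:, :, None, None]
        h = self.conv2(self.act(self.norm2(h)))
        return x + h  # residual connection
```

In the full U-Net, blocks like this are stacked at each resolution, with self-attention inserted at the 16×16 and 8×8 feature maps.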
Timestep Embedding
```python
import math
import torch

def timestep_embedding(t, dim=128):
    """Convert timestep to embedding (sinusoidal, like Transformers)."""
    half_dim = dim // 2
    emb = math.log(10000) / (half_dim - 1)
    emb = torch.exp(torch.arange(half_dim, device=t.device) * -emb)
    emb = t[:, None] * emb[None, :]
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
    return emb
```

Training Algorithm
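The training and sampling code below references a `noise_schedule` object exposing `T`, `beta`, `alpha`, and `alpha_bar`. It is not defined in the snippets, so here is a minimal sketch using the standard linear schedule (the class name and interface are assumptions made to keep the later code self-contained):

```python
import torch

class NoiseSchedule:
    """Minimal linear beta schedule exposing the tensors used by the code below."""
    def __init__(self, T=1000, beta_start=1e-4, beta_end=0.02, device="cpu"):
        self.T = T
        self.beta = torch.linspace(beta_start, beta_end, T, device=device)  # beta_t
        self.alpha = 1.0 - self.beta                                        # alpha_t = 1 - beta_t
        self.alpha_bar = torch.cumprod(self.alpha, dim=0)                   # product of alpha_1..alpha_t
```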
Model Definition
```python
import torch.nn as nn

class DiffusionModel(nn.Module):
    def __init__(self, image_channels=3, model_channels=128):
        super().__init__()
        # UNet stands in for any U-Net implementation with the components
        # described above (timestep embedding, self-attention, ResNet blocks)
        self.unet = UNet(
            in_channels=image_channels,
            out_channels=image_channels,  # Predict noise (same shape as input)
            model_channels=model_channels,
            num_res_blocks=2,
            attention_resolutions=[16, 8],
            channel_mult=[1, 2, 4, 4],
        )

    def forward(self, x, t):
        """
        Predict noise.

        Args:
            x: noisy image (B, C, H, W)
            t: timestep (B,)
        Returns:
            predicted_noise: (B, C, H, W)
        """
        return self.unet(x, t)
```

Training Loop
```python
import torch
import torch.nn.functional as F

def train_step(model, x0, noise_schedule, optimizer):
    """Single training step for DDPM."""
    batch_size = x0.shape[0]
    device = x0.device

    # 1. Sample random timesteps for each image in the batch
    t = torch.randint(0, noise_schedule.T, (batch_size,), device=device)

    # 2. Sample noise
    noise = torch.randn_like(x0)

    # 3. Create noisy images: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    alpha_bar_t = noise_schedule.alpha_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1 - alpha_bar_t) * noise

    # 4. Predict noise
    predicted_noise = model(x_t, t)

    # 5. Compute loss (MSE between predicted and actual noise)
    loss = F.mse_loss(predicted_noise, noise)

    # 6. Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()
```

Full Training Script
```python
def train_ddpm(model, dataloader, noise_schedule, num_epochs=100):
    """Train a DDPM model."""
    device = next(model.parameters()).device
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    model.train()

    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for batch_idx, (images, _) in enumerate(dataloader):
            images = images.to(device)

            # Training step
            loss = train_step(model, images, noise_schedule, optimizer)
            epoch_loss += loss

            if batch_idx % 100 == 0:
                print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss:.4f}")

        avg_loss = epoch_loss / len(dataloader)
        print(f"Epoch {epoch} completed. Average Loss: {avg_loss:.4f}")

        # Generate samples periodically (helper sketched in the Practical Tips section)
        if epoch % 10 == 0:
            generate_and_save_samples(model, noise_schedule, epoch)
```

Sampling Algorithm
Once trained, generate images by starting from pure noise and iteratively denoising:
```python
@torch.no_grad()
def sample_ddpm(model, shape, noise_schedule, device):
    """
    Generate samples using DDPM.

    Args:
        model: trained diffusion model
        shape: (batch_size, channels, height, width)
        noise_schedule: NoiseSchedule object
    Returns:
        generated_images: (B, C, H, W)
    """
    model.eval()
    T = noise_schedule.T

    # Start from pure noise
    x = torch.randn(shape, device=device)

    # Iteratively denoise
    for t in reversed(range(T)):
        # Create batch of timesteps
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)

        # Predict noise
        predicted_noise = model(x, t_batch)

        # Get schedule values
        alpha_t = noise_schedule.alpha[t]
        alpha_bar_t = noise_schedule.alpha_bar[t]
        beta_t = noise_schedule.beta[t]

        # Compute mean of reverse distribution
        mean = (x - (beta_t / torch.sqrt(1 - alpha_bar_t)) * predicted_noise) / torch.sqrt(alpha_t)

        # Add noise (except for the final step)
        if t > 0:
            variance = beta_t
            noise = torch.randn_like(x)
            x = mean + torch.sqrt(variance) * noise
        else:
            # Final step: no noise
            x = mean

    return x
```

Why Predict Noise Instead of the Image?
You might wonder: why predict the noise $\epsilon$ instead of directly predicting the clean image $x_0$?
Empirical reasons:
- Better gradients: Noise prediction provides better learning signal across timesteps
- Stable training: Less variance in the training objective
- Multi-scale learning: Network learns to denoise at all noise levels simultaneously
Theoretical reason: The noise prediction is equivalent to learning the score function $\nabla_{x_t} \log p(x_t)$, which has nice mathematical properties for generation.
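Concretely, the score of the forward marginal can be written in terms of the added noise, so a good noise predictor is (up to a known scale) a score estimator:

$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}, \qquad \epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar{\alpha}_t}\, \nabla_{x_t} \log p(x_t)$$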
Results and Impact
Quantitative Results
On CelebA-HQ and LSUN 256×256:
- FID: Competitive with contemporaneous GANs (e.g., ProgressiveGAN)
- Sample quality: High-fidelity samples
- Mode coverage: No mode collapse (unlike GANs)
On CIFAR-10:
- FID: 3.17, state-of-the-art unconditional generation at publication
- Inception Score: 9.46
- Sample quality: High-quality, diverse samples
Qualitative Impact
DDPM demonstrated:
- ✅ Stable training (no adversarial dynamics)
- ✅ High-quality generation (matching or exceeding GANs)
- ✅ Mode coverage (generates diverse samples)
- ✅ Simple objective (MSE on noise prediction)
Drawback: Slow sampling (~1000 neural network evaluations per image)
Computational Costs
Training:
- Standard deep learning training (comparable to other generative models)
- Can be parallelized across batch and timesteps
Sampling:
- ~1000 steps for high-quality generation
- 1000 forward passes through the U-Net
- Seconds to minutes per image (depending on GPU)
- Not practical for real-time applications
Solution: DDIM (next paper) reduces this to 20-50 steps with minimal quality loss.
Comparison to GANs
| Aspect | DDPM | GAN |
|---|---|---|
| Training stability | Stable | Unstable |
| Mode coverage | Excellent | Mode collapse risk |
| Sample quality | Excellent | Excellent |
| Sampling speed | Slow (1000 steps) | Fast (1 step) |
| Training objective | Simple MSE | Minimax game |
| Likelihood | Yes (variational bound) | No |
Hyperparameters
Critical hyperparameters:
- T (timesteps): 1000 is standard
- Noise schedule: Linear from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$
- Learning rate: 2e-4 works well (Adam optimizer)
- Batch size: As large as GPU memory allows (64-128 typical)
- Model size: 128-256 base channels for U-Net
Less critical:
- EMA (exponential moving average) of weights helps (decay = 0.9999); see the sketch after this list
- Gradient clipping can improve stability (rarely needed)
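A minimal sketch of keeping an EMA copy of the weights (the helper name and interface are illustrative, not from the official codebase):

```python
import copy
import torch

class EMA:
    """Maintain an exponential moving average of model weights for sampling."""
    def __init__(self, model, decay=0.9999):
        self.decay = decay
        self.ema_model = copy.deepcopy(model).eval()
        for p in self.ema_model.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # ema <- decay * ema + (1 - decay) * current weights
        for ema_p, p in zip(self.ema_model.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```

Call `update(model)` after each optimizer step and use `ema_model` (not the raw model) when generating samples.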
Practical Tips
- Start small: Train on 64×64 or 128×128 before scaling to 256×256
- Use EMA: Keep exponential moving average of model weights for generation
- Monitor samples: Visualize generated samples during training (every 10 epochs); a sketch of the `generate_and_save_samples` helper follows this list
- Be patient: Diffusion models need many iterations to train (100K-1M steps)
- Fixed variance works: No need to learn the variance; fixing it to $\sigma_t^2 = \beta_t$ is sufficient
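The training loop above calls a `generate_and_save_samples` helper that is not shown. A minimal sketch, assuming 32×32 RGB images normalized to [-1, 1] and the `sample_ddpm` function defined earlier:

```python
import torchvision

def generate_and_save_samples(model, noise_schedule, epoch, n=16):
    """Sample a small grid of images and write them to disk for monitoring."""
    device = next(model.parameters()).device
    samples = sample_ddpm(model, (n, 3, 32, 32), noise_schedule, device)
    samples = (samples.clamp(-1, 1) + 1) / 2  # map from [-1, 1] back to [0, 1]
    torchvision.utils.save_image(samples, f"samples_epoch_{epoch:04d}.png", nrow=4)
    model.train()  # sample_ddpm put the model in eval mode
```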
Limitations Addressed by Follow-up Work
- Slow sampling: DDIM (2020) enables 20-50x speedup
- Weak conditioning: Classifier-free guidance (2021) enables strong text conditioning
- High resolution: Latent diffusion (2021) enables efficient high-resolution generation
- Architecture: Transformer-based diffusion (DiT, 2023) improves scalability
Related Concepts
- Diffusion Fundamentals - Forward process and noise schedules
- DDIM Paper - Fast sampling with step skipping
- Classifier-Free Guidance - Text conditioning
- Generative Models - Comparison to GANs and VAEs
Key References
- Original Paper - DDPM paper
- Official Code - PyTorch implementation
- Annotated Implementation - Line-by-line explanation
Learning Resources
Paper Explanations
- AI Coffee Break: DDPM Explained - Video walkthrough
- Outlier: Diffusion Models - Clear explanation
Implementation Guides
- Hugging Face Diffusers - Production implementation
- PyTorch Tutorial - Step-by-step guide
- Annotated PyTorch - Commented code
Mathematical Depth
- Lilian Weng: What are Diffusion Models? - Comprehensive treatment
- Calvin Luo: Understanding Diffusion Models - Unified perspective