
DDPM: Denoising Diffusion Probabilistic Models

Paper: Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.

Key Insight: Instead of trying to predict the denoised image directly, predict the noise that was added. This simple objective leads to stable training and high-quality generation.

Core Innovation

DDPM reformulates diffusion model training as a noise prediction problem:

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]$$

Where:

  • $t \sim \text{Uniform}(1, T)$: random timestep
  • $x_0$: original data (e.g., an image)
  • $\epsilon \sim \mathcal{N}(0, I)$: sampled noise
  • $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$: noisy version of $x_0$
  • $\epsilon_\theta(x_t, t)$: neural network that predicts the noise

In words: Train a network to predict what noise was added at any timestep.
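For intuition, at a timestep where $\bar{\alpha}_t = 0.25$ the closed form gives

$$x_t = 0.5\, x_0 + \sqrt{0.75}\, \epsilon \approx 0.5\, x_0 + 0.87\, \epsilon,$$

so the sample is already dominated by noise. Early timesteps are nearly clean, late timesteps are nearly pure noise, and a single network must handle the full range.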

The Reverse Process

While the forward process adds noise, the reverse process learns to denoise:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t))$$

The mean is parameterized using the noise prediction:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)$$

The variance can be fixed or learned; fixing $\Sigma_\theta = \beta_t I$ works well in practice.
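Where the mean formula comes from (a sketch): rearranging the closed-form noising equation gives an estimate of the clean image from the predicted noise,

$$\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t) \right),$$

and substituting $\hat{x}_0$ into the mean of the tractable forward posterior $q(x_{t-1} \mid x_t, x_0)$, using $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$, simplifies to the expression for $\mu_\theta$ above.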

Neural Network Architecture

DDPM uses a U-Net with self-attention layers:

Architecture components:

  • U-Net backbone: Encoder-decoder with skip connections
  • Timestep embedding: Sinusoidal position encoding (like Transformers)
  • Self-attention layers: At lower resolutions (16×16, 8×8)
  • ResNet blocks: For feature processing at each resolution

Why U-Net?

  • Skip connections preserve spatial information
  • Multi-scale processing handles different noise levels
  • Proven effective for image-to-image tasks

Timestep Embedding

import math
import torch

def timestep_embedding(t, dim=128):
    """Convert timesteps to sinusoidal embeddings (like Transformers)."""
    half_dim = dim // 2
    emb = math.log(10000) / (half_dim - 1)
    emb = torch.exp(torch.arange(half_dim) * -emb)
    emb = t[:, None] * emb[None, :]
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
    return emb
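A quick shape check for the embedding (assuming the function as written above):

t = torch.tensor([0, 500, 999])       # batch of three timesteps
emb = timestep_embedding(t, dim=128)
print(emb.shape)                      # torch.Size([3, 128])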

Training Algorithm

Model Definition

import torch.nn as nn

class DiffusionModel(nn.Module):
    def __init__(self, image_channels=3, model_channels=128):
        super().__init__()
        self.unet = UNet(
            in_channels=image_channels,
            out_channels=image_channels,  # predict noise (same shape as input)
            model_channels=model_channels,
            num_res_blocks=2,
            attention_resolutions=[16, 8],
            channel_mult=[1, 2, 4, 4],
        )

    def forward(self, x, t):
        """
        Predict noise.

        Args:
            x: noisy image (B, C, H, W)
            t: timestep (B,)

        Returns:
            predicted_noise: (B, C, H, W)
        """
        return self.unet(x, t)
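A quick smoke test of the wrapper (this assumes a UNet implementation with the constructor signature used above; the shapes are illustrative):

model = DiffusionModel()
x = torch.randn(4, 3, 32, 32)      # batch of noisy images
t = torch.randint(0, 1000, (4,))   # one timestep per image
predicted_noise = model(x, t)      # same shape as x: (4, 3, 32, 32)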

Training Loop

import torch
import torch.nn.functional as F

def train_step(model, x0, noise_schedule, optimizer):
    """Single training step for DDPM."""
    batch_size = x0.shape[0]
    device = x0.device

    # 1. Sample a random timestep for each image in the batch
    t = torch.randint(0, noise_schedule.T, (batch_size,), device=device)

    # 2. Sample noise
    noise = torch.randn_like(x0)

    # 3. Create noisy images via the closed-form forward process
    alpha_bar_t = noise_schedule.alpha_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1 - alpha_bar_t) * noise

    # 4. Predict the noise
    predicted_noise = model(x_t, t)

    # 5. Compute the loss (MSE between predicted and actual noise)
    loss = F.mse_loss(predicted_noise, noise)

    # 6. Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()
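The training step assumes a NoiseSchedule object holding precomputed constants. A minimal sketch using the paper's linear schedule, with attribute names (T, beta, alpha, alpha_bar) chosen to match their usage above:

class NoiseSchedule:
    """Precomputed constants for a linear beta schedule (0-indexed to match the code above)."""
    def __init__(self, T=1000, beta_1=1e-4, beta_T=0.02, device="cpu"):
        self.T = T
        self.beta = torch.linspace(beta_1, beta_T, T, device=device)  # beta_t
        self.alpha = 1.0 - self.beta                                  # alpha_t = 1 - beta_t
        self.alpha_bar = torch.cumprod(self.alpha, dim=0)             # cumulative product of alpha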

Full Training Script

def train_ddpm(model, dataloader, noise_schedule, num_epochs=100):
    """Train a DDPM model."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    device = next(model.parameters()).device  # run on the same device as the model
    model.train()

    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for batch_idx, (images, _) in enumerate(dataloader):
            images = images.to(device)

            # Training step
            loss = train_step(model, images, noise_schedule, optimizer)
            epoch_loss += loss

            if batch_idx % 100 == 0:
                print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss:.4f}")

        avg_loss = epoch_loss / len(dataloader)
        print(f"Epoch {epoch} completed. Average Loss: {avg_loss:.4f}")

        # Generate samples periodically
        if epoch % 10 == 0:
            generate_and_save_samples(model, noise_schedule, epoch)

Sampling Algorithm

Once trained, generate images by starting from pure noise and iteratively denoising:

@torch.no_grad()
def sample_ddpm(model, shape, noise_schedule, device):
    """
    Generate samples using DDPM.

    Args:
        model: trained diffusion model
        shape: (batch_size, channels, height, width)
        noise_schedule: NoiseSchedule object

    Returns:
        generated_images: (B, C, H, W)
    """
    model.eval()
    T = noise_schedule.T

    # Start from pure noise
    x = torch.randn(shape, device=device)

    # Iteratively denoise
    for t in reversed(range(T)):
        # Create a batch of timesteps
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)

        # Predict the noise
        predicted_noise = model(x, t_batch)

        # Get schedule values
        alpha_t = noise_schedule.alpha[t]
        alpha_bar_t = noise_schedule.alpha_bar[t]
        beta_t = noise_schedule.beta[t]

        # Compute the mean of the reverse distribution
        mean = (x - (beta_t / torch.sqrt(1 - alpha_bar_t)) * predicted_noise) / torch.sqrt(alpha_t)

        # Add noise (except at the final step)
        if t > 0:
            variance = beta_t
            noise = torch.randn_like(x)
            x = mean + torch.sqrt(variance) * noise
        else:
            # Final step: no noise
            x = mean

    return x
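Example usage (hypothetical shapes for 32×32 RGB images; assumes the NoiseSchedule sketch above and a model already on the same device):

schedule = NoiseSchedule(T=1000, device="cuda")
samples = sample_ddpm(model, (16, 3, 32, 32), schedule, device="cuda")
# samples: (16, 3, 32, 32), typically rescaled from [-1, 1] to [0, 1] for viewing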

Why Predict Noise Instead of the Image?

You might wonder: why predict the noise $\epsilon$ instead of directly predicting $x_0$?

Empirical reasons:

  1. Better gradients: Noise prediction provides better learning signal across timesteps
  2. Stable training: Less variance in the training objective
  3. Multi-scale learning: Network learns to denoise at all noise levels simultaneously

Theoretical reason: Noise prediction is equivalent to learning the score function $\nabla_x \log p(x)$, which has nice mathematical properties for generation.
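Concretely, the trained noise predictor gives the score of the noised marginal up to scaling:

$$\nabla_{x_t} \log p(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}},$$

so DDPM training amounts to denoising score matching across all noise levels.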

Results and Impact

Quantitative Results

On CelebA-HQ and LSUN 256×256:

  • FID: Competitive with state-of-the-art GANs
  • Sample quality: Similar to ProgressiveGAN on LSUN
  • Mode coverage: No mode collapse (unlike GANs)

On CIFAR-10:

  • FID: 3.17 (Inception score 9.46), state-of-the-art for unconditional generation at the time
  • Sample quality: High-quality diverse samples

Qualitative Impact

DDPM demonstrated:

  • ✅ Stable training (no adversarial dynamics)
  • ✅ High-quality generation (matching or exceeding GANs)
  • ✅ Mode coverage (generates diverse samples)
  • ✅ Simple objective (MSE on noise prediction)

Drawback: Slow sampling (~1000 neural network evaluations per image)

Computational Costs

Training:

  • Standard deep learning training (comparable to other generative models)
  • Can be parallelized across batch and timesteps

Sampling:

  • ~1000 steps for high-quality generation
  • 1000 forward passes through the U-Net
  • Seconds to minutes per image (depending on GPU)
  • Not practical for real-time applications

Solution: DDIM (next paper) reduces this to 20-50 steps with minimal quality loss.

Comparison to GANs

Aspect             | DDPM               | GAN
-------------------|--------------------|--------------------
Training stability | Stable             | Unstable
Mode coverage      | Excellent          | Mode collapse risk
Sample quality     | Excellent          | Excellent
Sampling speed     | Slow (~1000 steps) | Fast (1 step)
Training objective | Simple MSE         | Minimax game
Likelihood         | Yes                | No

Hyperparameters

Critical hyperparameters:

  • T (timesteps): 1000 is standard
  • Noise schedule: Linear from $\beta_1 = 0.0001$ to $\beta_T = 0.02$
  • Learning rate: 2e-4 works well (Adam optimizer)
  • Batch size: As large as GPU memory allows (64-128 typical)
  • Model size: 128-256 base channels for U-Net

Less critical:

  • EMA (exponential moving average) of weights helps (decay = 0.9999)
  • Gradient clipping can improve stability (rarely needed)

Practical Tips

  1. Start small: Train on 64×64 or 128×128 before scaling to 256×256
  2. Use EMA: Keep an exponential moving average of model weights for generation (a minimal sketch follows this list)
  3. Monitor samples: Visualize generated samples during training (every 10 epochs)
  4. Be patient: Diffusion models need many iterations to train (100K-1M steps)
  5. Fixed variance works: No need to learn the variance; fixing it to $\beta_t$ is fine
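A minimal EMA sketch for tip 2 (a hypothetical helper; the shadow copy is used only for sampling):

import copy
import torch

class EMA:
    """Exponential moving average of model weights."""
    def __init__(self, model, decay=0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()  # frozen averaged copy
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1 - self.decay)

Call ema.update(model) after each optimizer step, and generate samples from ema.shadow.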

Limitations Addressed by Follow-up Work

  1. Slow sampling: DDIM (2020) enables 20-50x speedup
  2. Weak conditioning: Classifier-free guidance (2021) enables strong text conditioning
  3. High resolution: Latent diffusion (2021) enables efficient high-resolution generation
  4. Architecture: Transformer-based diffusion (DiT, 2023) improves scalability
