
DDPM: Denoising Diffusion Probabilistic Models

Paper: Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.

Key Insight: Instead of trying to predict the denoised image directly, predict the noise that was added. This simple objective leads to stable training and high-quality generation.

Core Innovation

DDPM reformulates diffusion model training as a noise prediction problem:

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]$$

Where:

  • $t \sim \text{Uniform}(1, T)$: random timestep
  • $x_0$: original data (e.g., an image)
  • $\epsilon \sim \mathcal{N}(0, I)$: sampled noise
  • $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$: noisy version of $x_0$
  • $\epsilon_\theta(x_t, t)$: neural network that predicts the noise

In words: Train a network to predict what noise was added at any timestep.
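For intuition, at a timestep where $\bar{\alpha}_t = 0.25$ the closed form gives

$$x_t = 0.5\, x_0 + \sqrt{0.75}\, \epsilon \approx 0.5\, x_0 + 0.87\, \epsilon,$$

so the sample is already dominated by noise. Early timesteps are nearly clean, late timesteps are nearly pure noise, and a single network must handle the full range.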

The Reverse Process

While the forward process adds noise, the reverse process learns to denoise:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t))$$

The mean is parameterized using the noise prediction:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)$$

The variance can be fixed or learned; fixing $\Sigma_\theta = \beta_t I$ works well in practice.
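Where the mean formula comes from (a sketch): rearranging the closed-form noising equation gives an estimate of the clean image from the predicted noise,

$$\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t) \right),$$

and substituting $\hat{x}_0$ into the mean of the tractable forward posterior $q(x_{t-1} \mid x_t, x_0)$, using $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$, simplifies to the expression for $\mu_\theta$ above.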

Neural Network Architecture

DDPM uses a U-Net with self-attention layers:

Architecture components:

  • U-Net backbone: Encoder-decoder with skip connections
  • Timestep embedding: Sinusoidal position encoding (like Transformers)
  • Self-attention layers: At lower resolutions (16×16, 8×8)
  • ResNet blocks: For feature processing at each resolution

Why U-Net?

  • Skip connections preserve spatial information
  • Multi-scale processing handles different noise levels
  • Proven effective for image-to-image tasks

Timestep Embedding

import math
import torch

def timestep_embedding(t, dim=128):
    """Convert timesteps to sinusoidal embeddings (like Transformers)."""
    half_dim = dim // 2
    emb = math.log(10000) / (half_dim - 1)
    emb = torch.exp(torch.arange(half_dim) * -emb)
    emb = t[:, None] * emb[None, :]
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
    return emb
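A quick shape check for the embedding (assuming the function as written above):

t = torch.tensor([0, 500, 999])       # batch of three timesteps
emb = timestep_embedding(t, dim=128)
print(emb.shape)                      # torch.Size([3, 128])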

Training Algorithm

Model Definition

import torch.nn as nn

class DiffusionModel(nn.Module):
    def __init__(self, image_channels=3, model_channels=128):
        super().__init__()
        self.unet = UNet(
            in_channels=image_channels,
            out_channels=image_channels,  # predict noise (same shape as input)
            model_channels=model_channels,
            num_res_blocks=2,
            attention_resolutions=[16, 8],
            channel_mult=[1, 2, 4, 4],
        )

    def forward(self, x, t):
        """
        Predict noise.

        Args:
            x: noisy image (B, C, H, W)
            t: timestep (B,)

        Returns:
            predicted_noise: (B, C, H, W)
        """
        return self.unet(x, t)
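A quick smoke test of the wrapper (this assumes a UNet implementation with the constructor signature used above; the shapes are illustrative):

model = DiffusionModel()
x = torch.randn(4, 3, 32, 32)      # batch of noisy images
t = torch.randint(0, 1000, (4,))   # one timestep per image
predicted_noise = model(x, t)      # same shape as x: (4, 3, 32, 32)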

Training Loop

import torch
import torch.nn.functional as F

def train_step(model, x0, noise_schedule, optimizer):
    """Single training step for DDPM."""
    batch_size = x0.shape[0]
    device = x0.device

    # 1. Sample a random timestep for each image in the batch
    t = torch.randint(0, noise_schedule.T, (batch_size,), device=device)

    # 2. Sample noise
    noise = torch.randn_like(x0)

    # 3. Create noisy images via the closed-form forward process
    alpha_bar_t = noise_schedule.alpha_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1 - alpha_bar_t) * noise

    # 4. Predict the noise
    predicted_noise = model(x_t, t)

    # 5. Compute the loss (MSE between predicted and actual noise)
    loss = F.mse_loss(predicted_noise, noise)

    # 6. Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()
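The training step assumes a NoiseSchedule object holding precomputed constants. A minimal sketch using the paper's linear schedule, with attribute names (T, beta, alpha, alpha_bar) chosen to match their usage above:

class NoiseSchedule:
    """Precomputed constants for a linear beta schedule (0-indexed to match the code above)."""
    def __init__(self, T=1000, beta_1=1e-4, beta_T=0.02, device="cpu"):
        self.T = T
        self.beta = torch.linspace(beta_1, beta_T, T, device=device)  # beta_t
        self.alpha = 1.0 - self.beta                                  # alpha_t = 1 - beta_t
        self.alpha_bar = torch.cumprod(self.alpha, dim=0)             # cumulative product of alpha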

Full Training Script

def train_ddpm(model, dataloader, noise_schedule, num_epochs=100):
    """Train a DDPM model."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    device = next(model.parameters()).device  # run on the same device as the model
    model.train()

    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for batch_idx, (images, _) in enumerate(dataloader):
            images = images.to(device)

            # Training step
            loss = train_step(model, images, noise_schedule, optimizer)
            epoch_loss += loss

            if batch_idx % 100 == 0:
                print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss:.4f}")

        avg_loss = epoch_loss / len(dataloader)
        print(f"Epoch {epoch} completed. Average Loss: {avg_loss:.4f}")

        # Generate samples periodically
        if epoch % 10 == 0:
            generate_and_save_samples(model, noise_schedule, epoch)

Sampling Algorithm

Once trained, generate images by starting from pure noise and iteratively denoising:

@torch.no_grad()
def sample_ddpm(model, shape, noise_schedule, device):
    """
    Generate samples using DDPM.

    Args:
        model: trained diffusion model
        shape: (batch_size, channels, height, width)
        noise_schedule: NoiseSchedule object

    Returns:
        generated_images: (B, C, H, W)
    """
    model.eval()
    T = noise_schedule.T

    # Start from pure noise
    x = torch.randn(shape, device=device)

    # Iteratively denoise
    for t in reversed(range(T)):
        # Create a batch of timesteps
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)

        # Predict the noise
        predicted_noise = model(x, t_batch)

        # Get schedule values
        alpha_t = noise_schedule.alpha[t]
        alpha_bar_t = noise_schedule.alpha_bar[t]
        beta_t = noise_schedule.beta[t]

        # Compute the mean of the reverse distribution
        mean = (x - (beta_t / torch.sqrt(1 - alpha_bar_t)) * predicted_noise) / torch.sqrt(alpha_t)

        # Add noise (except at the final step)
        if t > 0:
            variance = beta_t
            noise = torch.randn_like(x)
            x = mean + torch.sqrt(variance) * noise
        else:
            # Final step: no noise
            x = mean

    return x
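Example usage (hypothetical shapes for 32×32 RGB images; assumes the NoiseSchedule sketch above and a model already on the same device):

schedule = NoiseSchedule(T=1000, device="cuda")
samples = sample_ddpm(model, (16, 3, 32, 32), schedule, device="cuda")
# samples: (16, 3, 32, 32), typically rescaled from [-1, 1] to [0, 1] for viewing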

Why Predict Noise Instead of the Image?

You might wonder: why predict the noise $\epsilon$ instead of directly predicting $x_0$?

Empirical reasons:

  1. Better gradients: Noise prediction provides better learning signal across timesteps
  2. Stable training: Less variance in the training objective
  3. Multi-scale learning: Network learns to denoise at all noise levels simultaneously

Theoretical reason: Noise prediction is equivalent to learning the score function $\nabla_x \log p(x)$, which has nice mathematical properties for generation.
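Concretely, the trained noise predictor gives the score of the noised marginal up to scaling:

$$\nabla_{x_t} \log p(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}},$$

so DDPM training amounts to denoising score matching across all noise levels.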

Results and Impact

Quantitative Results

On CelebA-HQ and LSUN 256×256:

  • FID: Competitive with state-of-the-art GANs
  • Sample quality: Similar to ProgressiveGAN on LSUN
  • Mode coverage: No mode collapse (unlike GANs)

On CIFAR-10:

  • FID: 3.17 (Inception score 9.46), state-of-the-art for unconditional generation at the time
  • Sample quality: High-quality diverse samples

Qualitative Impact

DDPM demonstrated:

  • ✅ Stable training (no adversarial dynamics)
  • ✅ High-quality generation (matching or exceeding GANs)
  • ✅ Mode coverage (generates diverse samples)
  • ✅ Simple objective (MSE on noise prediction)

Drawback: Slow sampling (~1000 neural network evaluations per image)

Computational Costs

Training:

  • Standard deep learning training (comparable to other generative models)
  • Can be parallelized across batch and timesteps

Sampling:

  • ~1000 steps for high-quality generation
  • 1000 forward passes through the U-Net
  • Seconds to minutes per image (depending on GPU)
  • Not practical for real-time applications

Solution: DDIM (next paper) reduces this to 20-50 steps with minimal quality loss.

Comparison to GANs

Aspect             | DDPM               | GAN
-------------------|--------------------|--------------------
Training stability | Stable             | Unstable
Mode coverage      | Excellent          | Mode collapse risk
Sample quality     | Excellent          | Excellent
Sampling speed     | Slow (~1000 steps) | Fast (1 step)
Training objective | Simple MSE         | Minimax game
Likelihood         | Yes                | No

Hyperparameters

Critical hyperparameters:

  • T (timesteps): 1000 is standard
  • Noise schedule: Linear from $\beta_1 = 0.0001$ to $\beta_T = 0.02$
  • Learning rate: 2e-4 works well (Adam optimizer)
  • Batch size: As large as GPU memory allows (64-128 typical)
  • Model size: 128-256 base channels for U-Net

Less critical:

  • EMA (exponential moving average) of weights helps (decay = 0.9999)
  • Gradient clipping can improve stability (rarely needed)

Practical Tips

  1. Start small: Train on 64×64 or 128×128 before scaling to 256×256
  2. Use EMA: Keep an exponential moving average of model weights for generation (a minimal sketch follows this list)
  3. Monitor samples: Visualize generated samples during training (every 10 epochs)
  4. Be patient: Diffusion models need many iterations to train (100K-1M steps)
  5. Fixed variance works: No need to learn the variance; fixing it to $\beta_t$ is fine
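A minimal EMA sketch for tip 2 (a hypothetical helper; the shadow copy is used only for sampling):

import copy
import torch

class EMA:
    """Exponential moving average of model weights."""
    def __init__(self, model, decay=0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()  # frozen averaged copy
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1 - self.decay)

Call ema.update(model) after each optimizer step, and generate samples from ema.shadow.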

Limitations Addressed by Follow-up Work

  1. Slow sampling: DDIM (2020) enables 20-50x speedup
  2. Weak conditioning: Classifier-free guidance (2021) enables strong text conditioning
  3. High resolution: Latent diffusion (2021) enables efficient high-resolution generation
  4. Architecture: Transformer-based diffusion (DiT, 2023) improves scalability
