
DDIM: Denoising Diffusion Implicit Models

Paper: Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. ICLR 2021.

Key Innovation: DDPM’s main weakness is slow sampling (~1000 neural network evaluations per image). DDIM solves this with deterministic sampling and step skipping, achieving 20-50x speedup with minimal quality loss.

The Problem with DDPM

DDPM sampling requires iterating through all T timesteps:

x_T → x_{T-1} → x_{T-2} → ... → x_1 → x_0
1000 steps = 1000 neural network forward passes = slow

For a U-Net taking 50ms per forward pass:

  • 1000 steps × 50ms = 50 seconds per image 🐌

This is impractical for real-world applications.

DDIM’s Breakthrough

DDIM allows skipping timesteps while maintaining quality:

x_T → x_{900} → x_{800} → ... → x_{100} → x_0
50 steps = 50 neural network forward passes = 20x faster

Same 50ms U-Net:

  • 50 steps × 50ms = 2.5 seconds per image

Critical insight: No retraining needed! Use your existing DDPM model with DDIM sampling.

Core Idea: Deterministic vs Stochastic

DDPM (stochastic):

  • Adds random noise at each step
  • Different samples even with same starting noise
  • Must follow all timesteps sequentially

DDIM (deterministic):

  • No random noise added (optional parameter eta)
  • Same starting noise → same output (reproducible)
  • Can skip timesteps freely
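
A minimal sketch of what determinism buys in practice, using the sample_ddim function defined in the Implementation section below (shapes are illustrative; model and noise_schedule come from your training setup):

import torch

# With eta=0 the sample depends only on the starting noise, so seeding the
# generator before each call reproduces the same image.
torch.manual_seed(42)
img_a = sample_ddim(model, (1, 3, 64, 64), noise_schedule, steps=50, eta=0.0)

torch.manual_seed(42)
img_b = sample_ddim(model, (1, 3, 64, 64), noise_schedule, steps=50, eta=0.0)

print(torch.allclose(img_a, img_b))   # True (up to any nondeterministic CUDA kernels)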

The DDIM Update Formula

Instead of the DDPM update, DDIM uses:

x_{t-\Delta t} = \sqrt{\bar{\alpha}_{t-\Delta t}} \underbrace{\left( \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } x_0} + \sqrt{1-\bar{\alpha}_{t-\Delta t}}\, \epsilon_\theta(x_t, t)

Breaking this down:

  1. Predict x_0: \hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}

    • Use noise prediction to estimate the clean image
  2. Scale predicted x_0: \sqrt{\bar{\alpha}_{t-\Delta t}}\, \hat{x}_0

    • Scale by appropriate alpha value for target timestep
  3. Add noise component: \sqrt{1-\bar{\alpha}_{t-\Delta t}}\, \epsilon_\theta(x_t, t)

    • Add the right amount of noise for target timestep

Key insight: This formula works for any \Delta t, enabling arbitrary step skipping!
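
As a sanity check on the notation, here is the same update written as a minimal single-step function; the full sampler in the next section wraps this logic in a loop over a skipped timestep schedule (alpha_bar here is an assumption: a 1-D tensor of cumulative products from the training noise schedule):

import torch

# One DDIM update x_t -> x_{t-Δt}
def ddim_step(x_t, predicted_noise, alpha_bar, t, t_prev):
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]

    # 1. Predict x0 from the current sample and the noise estimate
    pred_x0 = (x_t - torch.sqrt(1 - a_t) * predicted_noise) / torch.sqrt(a_t)

    # 2. Scale the prediction by the alpha of the target timestep,
    # 3. then re-add the matching amount of predicted noise
    return torch.sqrt(a_prev) * pred_x0 + torch.sqrt(1 - a_prev) * predicted_noise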

Implementation

@torch.no_grad()
def sample_ddim(model, shape, noise_schedule, steps=50, eta=0.0, device='cuda'):
    """
    Fast sampling with DDIM.

    Args:
        model: trained diffusion model (same as DDPM!)
        shape: (batch_size, channels, height, width)
        noise_schedule: NoiseSchedule object
        steps: number of sampling steps (e.g., 50 instead of 1000)
        eta: stochasticity parameter
             eta=0 → fully deterministic (recommended)
             eta=1 → same as DDPM (slow)

    Returns:
        generated_images: (B, C, H, W)
    """
    model.eval()
    T = noise_schedule.T

    # Create timestep schedule (skip steps)
    # e.g., [980, 960, ..., 20, 0] for 50 steps when T=1000
    skip = T // steps
    timesteps = torch.arange(0, T, skip).flip(0)

    # Start from pure noise
    x = torch.randn(shape, device=device)

    # Iteratively denoise with larger steps
    for i in range(len(timesteps) - 1):
        t = timesteps[i].item()
        t_prev = timesteps[i + 1].item()

        # Create batch of timesteps
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)

        # Predict noise at current timestep
        predicted_noise = model(x, t_batch)

        # Get alpha values
        alpha_bar_t = noise_schedule.alpha_bar[t]
        alpha_bar_t_prev = noise_schedule.alpha_bar[t_prev]

        # Predict x0 from current x_t
        pred_x0 = (x - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)

        # Direction pointing to x_t
        dir_xt = torch.sqrt(1 - alpha_bar_t_prev) * predicted_noise

        # DDIM update (deterministic when eta=0)
        x = torch.sqrt(alpha_bar_t_prev) * pred_x0 + dir_xt

    return x

Adding Stochasticity (Optional)

The eta parameter controls randomness:

# Add optional noise for stochastic sampling
if eta > 0 and t_prev > 0:
    sigma_t = eta * torch.sqrt(
        (1 - alpha_bar_t_prev) / (1 - alpha_bar_t)
        * (1 - alpha_bar_t / alpha_bar_t_prev)
    )
    noise = torch.randn_like(x)

    # Modified update with noise
    x = torch.sqrt(alpha_bar_t_prev) * pred_x0 + \
        torch.sqrt(1 - alpha_bar_t_prev - sigma_t**2) * predicted_noise + \
        sigma_t * noise

Spectrum of stochasticity:

  • eta = 0: Fully deterministic (recommended for speed)
  • eta = 0.5: Some randomness
  • eta = 1: Equivalent to DDPM (slow but high quality)
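
Assuming the eta branch above has been folded into sample_ddim's loop (the function as written ignores eta), the spectrum looks like this in practice; shapes are illustrative:

# eta = 0: deterministic DDIM; eta = 1 with all 1000 steps behaves like DDPM
x_det  = sample_ddim(model, (4, 3, 64, 64), noise_schedule, steps=50,   eta=0.0)
x_mid  = sample_ddim(model, (4, 3, 64, 64), noise_schedule, steps=50,   eta=0.5)
x_ddpm = sample_ddim(model, (4, 3, 64, 64), noise_schedule, steps=1000, eta=1.0)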

Choosing Number of Steps

Quality vs speed tradeoff:

| Steps   | Speed   | Quality              | Use Case                  |
|---------|---------|----------------------|---------------------------|
| 10      | Fastest | Lower                | Previews, rapid iteration |
| 20-30   | Fast    | Good                 | Most applications         |
| 50      | Medium  | Very good            | High quality generation   |
| 100-200 | Slower  | Excellent            | Research, best quality    |
| 1000    | Slowest | Marginal improvement | Rarely needed             |

Recommendation: 50 steps provides the best balance of quality and speed for most use cases. Use 20-30 for faster iteration during development.
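
A quick way to pick a step count for your own model is to sweep a few values and compare the outputs side by side. A minimal sketch using the sample_ddim function above (torchvision's save_image is used only for convenience, and samples are assumed to live in [-1, 1]):

import torch
from torchvision.utils import save_image

# Sweep a few step counts and save an image grid for each
for steps in [10, 20, 50, 100]:
    torch.manual_seed(0)                                # same starting noise for a fair comparison
    samples = sample_ddim(model, (8, 3, 64, 64), noise_schedule, steps=steps)
    save_image(samples.clamp(-1, 1) * 0.5 + 0.5,        # map [-1, 1] to [0, 1] for saving
               f"ddim_{steps}_steps.png", nrow=4)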

Performance Comparison

import time

# Generate samples with both methods
model.eval()

# DDPM: 1000 steps
start = time.time()
samples_ddpm = sample_ddpm(model, (4, 3, 64, 64), noise_schedule)
time_ddpm = time.time() - start
print(f"DDPM: {time_ddpm:.2f} seconds")

# DDIM: 50 steps
start = time.time()
samples_ddim = sample_ddim(model, (4, 3, 64, 64), noise_schedule, steps=50)
time_ddim = time.time() - start
print(f"DDIM: {time_ddim:.2f} seconds")

print(f"Speedup: {time_ddpm / time_ddim:.1f}x")

Typical output:

DDPM: 50.3 seconds
DDIM: 2.5 seconds
Speedup: 20.1x

Why DDIM Works: Mathematical Insight

Key theoretical contribution:

DDPM defines a specific diffusion process (stochastic with noise at each step). DDIM shows there are infinitely many processes that:

  1. Have the same marginal distributions q(x_t \mid x_0)
  2. Can use the same trained noise predictor \epsilon_\theta
  3. Allow deterministic sampling and step skipping
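
Concretely, the paper constructs a family of non-Markovian inference distributions, indexed by a noise level σ_t, that all share these marginals. In the \bar{\alpha} notation used above it reads (a sketch, not the paper's exact typography):

q_\sigma(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left( \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\; \frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{\sqrt{1 - \bar{\alpha}_t}},\; \sigma_t^2 I \right)

Setting σ_t = 0 gives the deterministic DDIM update above, while the DDPM choice of σ_t recovers ancestral sampling; the eta parameter simply interpolates between the two.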

Practical insight:

  • The model learns to predict noise
  • We can use that prediction in different ways
  • DDIM’s way is faster without retraining

No Retraining Needed!

The most amazing part of DDIM:

Use your existing DDPM model - just change the sampling algorithm! No need to retrain anything.

This is why DDIM had such immediate impact - everyone could instantly speed up their models 20x.

Conditional DDIM

DDIM works seamlessly with conditioning (text, class labels, etc.):

@torch.no_grad()
def sample_ddim_conditional(model, condition, shape, steps=50):
    """DDIM sampling with conditioning"""
    x = torch.randn(shape)
    timesteps = create_timestep_schedule(steps)

    for i in range(len(timesteps) - 1):
        t = timesteps[i]
        t_prev = timesteps[i + 1]

        # Predict noise with conditioning
        predicted_noise = model(x, t, condition)

        # DDIM update (same as unconditional)
        x = ddim_update(x, predicted_noise, t, t_prev)

    return x
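
A hypothetical call with class-label conditioning; the form of the condition argument depends entirely on how the model was trained, so this is only an illustration:

import torch

# Hypothetical: model was trained with integer class labels as conditioning
labels = torch.tensor([1, 1, 7, 7])          # e.g. four samples from two classes
samples = sample_ddim_conditional(model, labels, shape=(4, 3, 64, 64), steps=50)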

Results and Impact

Quantitative Results

On ImageNet 256×256:

  • 50 steps: Nearly identical FID to 1000-step DDPM
  • 20 steps: Minor quality degradation, still excellent
  • 10 steps: Noticeable but acceptable quality

Impact on the Field

DDIM transformed diffusion models from research curiosities to practical tools:

  • DALL-E 2: Uses DDIM-style sampling
  • Stable Diffusion: DDIM for fast generation (20-50 steps)
  • Midjourney: Optimized sampling based on DDIM insights

Without DDIM, diffusion would still be too slow for production use.

DDIM vs DDPM Summary

| Aspect          | DDPM        | DDIM                     |
|-----------------|-------------|--------------------------|
| Sampling steps  | 1000        | 20-50                    |
| Sampling time   | Slow        | 20-50x faster            |
| Stochastic      | Yes         | Optional (eta parameter) |
| Quality         | Excellent   | Nearly identical         |
| Training        | Standard    | Use same DDPM model      |
| Flexibility     | Fixed steps | Skip any steps           |
| Reproducibility | Stochastic  | Deterministic (eta=0)    |

When to Use DDIM

Use DDIM when:

  • ✅ You need fast generation
  • ✅ You want deterministic outputs (reproducibility)
  • ✅ You’re doing iterative editing (same seed = same result)
  • ✅ Deploying to production
  • ✅ Real-time or interactive applications

Use DDPM when:

  • ❓ You want maximum diversity (debatable)
  • ❓ Research comparison (matching paper conditions)

In practice, DDIM is almost always preferred.

Practical Tips

  1. Start with 50 steps: Good default for quality/speed balance
  2. Use eta=0: Deterministic sampling is faster and often better
  3. Experiment with step schedules: Linear spacing is good, but quadratic or custom schedules can help
  4. Cache timestep embeddings: Precompute them for the fixed DDIM schedule (see the sketch after this list)
  5. Lower steps for previews: Use 10-20 steps during development
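
A sketch of tip 4, under the assumption that the U-Net uses sinusoidal timestep embeddings and can accept precomputed ones (the embedding function and dimension here are illustrative; adapt to your architecture):

import math
import torch

def sinusoidal_embedding(timesteps, dim=128):
    """Standard sinusoidal embedding for a batch of integer timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = timesteps[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# A DDIM run visits a small fixed set of timesteps, so the embeddings can be
# computed once and reused for every step, image, and batch.
ddim_timesteps = torch.linspace(0, 999, 50).long().flip(0)
embedding_cache = {int(t): sinusoidal_embedding(t.reshape(1)) for t in ddim_timesteps}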

Advanced: Custom Timestep Schedules

Different schedules can improve quality:

import math
import torch

def create_timestep_schedule(T, steps, schedule_type='linear'):
    """Create custom timestep schedule (returned in descending order, T-1 → 0)"""
    if schedule_type == 'linear':
        # Uniform spacing
        return torch.linspace(0, T - 1, steps).long().flip(0)
    elif schedule_type == 'quadratic':
        # More steps for early denoising
        t = torch.linspace(0, 1, steps) ** 2
        return (t * (T - 1)).long().flip(0)
    elif schedule_type == 'cosine':
        # Cosine schedule (already descending, so no flip needed)
        t = torch.linspace(0, 1, steps)
        alpha_bar = torch.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2
        return (alpha_bar * (T - 1)).long()
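
A quick way to see how the schedules differ is to print the first few timesteps each one produces (here with T = 1000 and 10 sampling steps):

# Compare the timesteps visited by each schedule type
for kind in ['linear', 'quadratic', 'cosine']:
    print(kind, create_timestep_schedule(1000, 10, kind)[:5].tolist())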

Limitations and Extensions

Limitations:

  • Still slower than GANs (which generate in 1 step)
  • Quality slightly degrades with very few steps (<20)

Extensions:

  • PLMS (2022): Further speedup with Pseudo Linear Multi-Step
  • DPM-Solver (2022): Optimized ODE solvers for diffusion
  • Consistency Models (2023): Single-step generation while maintaining quality
