
DDIM: Denoising Diffusion Implicit Models

Paper: Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. ICLR 2021.

Key Innovation: DDPM’s main weakness is slow sampling (~1000 neural network evaluations per image). DDIM solves this with deterministic sampling and step skipping, achieving 20-50x speedup with minimal quality loss.

The Problem with DDPM

DDPM sampling requires iterating through all T timesteps:

x_T → x_{T-1} → x_{T-2} → ... → x_1 → x_0
1000 steps = 1000 neural network forward passes = slow

For a U-Net taking 50ms per forward pass:

  • 1000 steps × 50ms = 50 seconds per image 🐌

This is impractical for real-world applications.

DDIM’s Breakthrough

DDIM allows skipping timesteps while maintaining quality:

x_T → x_{900} → x_{800} → ... → x_{100} → x_0
50 steps = 50 neural network forward passes = 20x faster

Same 50ms U-Net:

  • 50 steps × 50ms = 2.5 seconds per image

Critical insight: No retraining needed! Use your existing DDPM model with DDIM sampling.

Core Idea: Deterministic vs Stochastic

DDPM (stochastic):

  • Adds random noise at each step
  • Different samples even with same starting noise
  • Must follow all timesteps sequentially

DDIM (deterministic):

  • No random noise added (optional parameter eta)
  • Same starting noise → same output (reproducible)
  • Can skip timesteps freely
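
A minimal sketch of what determinism buys in practice, using the sample_ddim function defined in the Implementation section below (shapes are illustrative; model and noise_schedule come from your training setup):

import torch

# With eta=0 the sample depends only on the starting noise, so seeding the
# generator before each call reproduces the same image.
torch.manual_seed(42)
img_a = sample_ddim(model, (1, 3, 64, 64), noise_schedule, steps=50, eta=0.0)

torch.manual_seed(42)
img_b = sample_ddim(model, (1, 3, 64, 64), noise_schedule, steps=50, eta=0.0)

print(torch.allclose(img_a, img_b))   # True (up to any nondeterministic CUDA kernels)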

The DDIM Update Formula

Instead of the DDPM update, DDIM uses:

x_{t-\Delta t} = \sqrt{\bar{\alpha}_{t-\Delta t}} \underbrace{\left( \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } x_0} + \sqrt{1-\bar{\alpha}_{t-\Delta t}}\, \epsilon_\theta(x_t, t)

Breaking this down:

  1. Predict x_0: \hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}

    • Use noise prediction to estimate the clean image
  2. Scale predicted x_0: \sqrt{\bar{\alpha}_{t-\Delta t}}\, \hat{x}_0

    • Scale by appropriate alpha value for target timestep
  3. Add noise component: \sqrt{1-\bar{\alpha}_{t-\Delta t}}\, \epsilon_\theta(x_t, t)

    • Add the right amount of noise for target timestep

Key insight: This formula works for any \Delta t, enabling arbitrary step skipping!
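
As a sanity check on the notation, here is the same update written as a minimal single-step function; the full sampler in the next section wraps this logic in a loop over a skipped timestep schedule (alpha_bar here is an assumption: a 1-D tensor of cumulative products from the training noise schedule):

import torch

# One DDIM update x_t -> x_{t-Δt}
def ddim_step(x_t, predicted_noise, alpha_bar, t, t_prev):
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]

    # 1. Predict x0 from the current sample and the noise estimate
    pred_x0 = (x_t - torch.sqrt(1 - a_t) * predicted_noise) / torch.sqrt(a_t)

    # 2. Scale the prediction by the alpha of the target timestep,
    # 3. then re-add the matching amount of predicted noise
    return torch.sqrt(a_prev) * pred_x0 + torch.sqrt(1 - a_prev) * predicted_noise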

Implementation

@torch.no_grad()
def sample_ddim(model, shape, noise_schedule, steps=50, eta=0.0, device='cuda'):
    """
    Fast sampling with DDIM.

    Args:
        model: trained diffusion model (same as DDPM!)
        shape: (batch_size, channels, height, width)
        noise_schedule: NoiseSchedule object
        steps: number of sampling steps (e.g., 50 instead of 1000)
        eta: stochasticity parameter
             eta=0 → fully deterministic (recommended)
             eta=1 → same as DDPM (slow)

    Returns:
        generated_images: (B, C, H, W)
    """
    model.eval()
    T = noise_schedule.T

    # Create timestep schedule (skip steps)
    # e.g., [980, 960, ..., 20, 0] for 50 steps when T=1000
    skip = T // steps
    timesteps = torch.arange(0, T, skip).flip(0)

    # Start from pure noise
    x = torch.randn(shape, device=device)

    # Iteratively denoise with larger steps
    for i in range(len(timesteps) - 1):
        t = timesteps[i].item()
        t_prev = timesteps[i + 1].item()

        # Create batch of timesteps
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)

        # Predict noise at current timestep
        predicted_noise = model(x, t_batch)

        # Get alpha values
        alpha_bar_t = noise_schedule.alpha_bar[t]
        alpha_bar_t_prev = noise_schedule.alpha_bar[t_prev]

        # Predict x0 from current x_t
        pred_x0 = (x - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)

        # Direction pointing to x_t
        dir_xt = torch.sqrt(1 - alpha_bar_t_prev) * predicted_noise

        # DDIM update (deterministic when eta=0)
        x = torch.sqrt(alpha_bar_t_prev) * pred_x0 + dir_xt

    return x

Adding Stochasticity (Optional)

The eta parameter controls randomness:

# Add optional noise for stochastic sampling
if eta > 0 and t_prev > 0:
    sigma_t = eta * torch.sqrt(
        (1 - alpha_bar_t_prev) / (1 - alpha_bar_t)
        * (1 - alpha_bar_t / alpha_bar_t_prev)
    )
    noise = torch.randn_like(x)

    # Modified update with noise
    x = torch.sqrt(alpha_bar_t_prev) * pred_x0 + \
        torch.sqrt(1 - alpha_bar_t_prev - sigma_t**2) * predicted_noise + \
        sigma_t * noise

Spectrum of stochasticity:

  • eta = 0: Fully deterministic (recommended for speed)
  • eta = 0.5: Some randomness
  • eta = 1: Equivalent to DDPM (slow but high quality)
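
Assuming the eta branch above has been folded into sample_ddim's loop (the function as written ignores eta), the spectrum looks like this in practice; shapes are illustrative:

# eta = 0: deterministic DDIM; eta = 1 with all 1000 steps behaves like DDPM
x_det  = sample_ddim(model, (4, 3, 64, 64), noise_schedule, steps=50,   eta=0.0)
x_mid  = sample_ddim(model, (4, 3, 64, 64), noise_schedule, steps=50,   eta=0.5)
x_ddpm = sample_ddim(model, (4, 3, 64, 64), noise_schedule, steps=1000, eta=1.0)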

Choosing Number of Steps

Quality vs speed tradeoff:

| Steps   | Speed   | Quality              | Use Case                  |
|---------|---------|----------------------|---------------------------|
| 10      | Fastest | Lower                | Previews, rapid iteration |
| 20-30   | Fast    | Good                 | Most applications         |
| 50      | Medium  | Very good            | High quality generation   |
| 100-200 | Slower  | Excellent            | Research, best quality    |
| 1000    | Slowest | Marginal improvement | Rarely needed             |

Recommendation: 50 steps provides the best balance of quality and speed for most use cases. Use 20-30 for faster iteration during development.
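
A quick way to pick a step count for your own model is to sweep a few values and compare the outputs side by side. A minimal sketch using the sample_ddim function above (torchvision's save_image is used only for convenience, and samples are assumed to live in [-1, 1]):

import torch
from torchvision.utils import save_image

# Sweep a few step counts and save an image grid for each
for steps in [10, 20, 50, 100]:
    torch.manual_seed(0)                                # same starting noise for a fair comparison
    samples = sample_ddim(model, (8, 3, 64, 64), noise_schedule, steps=steps)
    save_image(samples.clamp(-1, 1) * 0.5 + 0.5,        # map [-1, 1] to [0, 1] for saving
               f"ddim_{steps}_steps.png", nrow=4)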

Performance Comparison

import time

# Generate samples with both methods
model.eval()

# DDPM: 1000 steps
start = time.time()
samples_ddpm = sample_ddpm(model, (4, 3, 64, 64), noise_schedule)
time_ddpm = time.time() - start
print(f"DDPM: {time_ddpm:.2f} seconds")

# DDIM: 50 steps
start = time.time()
samples_ddim = sample_ddim(model, (4, 3, 64, 64), noise_schedule, steps=50)
time_ddim = time.time() - start
print(f"DDIM: {time_ddim:.2f} seconds")

print(f"Speedup: {time_ddpm / time_ddim:.1f}x")

Typical output:

DDPM: 50.3 seconds
DDIM: 2.5 seconds
Speedup: 20.1x

Why DDIM Works: Mathematical Insight

Key theoretical contribution:

DDPM defines a specific diffusion process (stochastic with noise at each step). DDIM shows there are infinitely many processes that:

  1. Have the same marginal distributions q(x_t \mid x_0)
  2. Can use the same trained noise predictor \epsilon_\theta
  3. Allow deterministic sampling and step skipping
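
Concretely, the paper constructs a family of non-Markovian inference distributions, indexed by a noise level σ_t, that all share these marginals. In the \bar{\alpha} notation used above it reads (a sketch, not the paper's exact typography):

q_\sigma(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left( \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\; \frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{\sqrt{1 - \bar{\alpha}_t}},\; \sigma_t^2 I \right)

Setting σ_t = 0 gives the deterministic DDIM update above, while the DDPM choice of σ_t recovers ancestral sampling; the eta parameter simply interpolates between the two.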

Practical insight:

  • The model learns to predict noise
  • We can use that prediction in different ways
  • DDIM’s way is faster without retraining

No Retraining Needed!

The most amazing part of DDIM:

Use your existing DDPM model - just change the sampling algorithm! No need to retrain anything.

This is why DDIM had such immediate impact - everyone could instantly speed up their models 20x.

Conditional DDIM

DDIM works seamlessly with conditioning (text, class labels, etc.):

@torch.no_grad()
def sample_ddim_conditional(model, condition, shape, steps=50):
    """DDIM sampling with conditioning"""
    x = torch.randn(shape)
    timesteps = create_timestep_schedule(steps)

    for i in range(len(timesteps) - 1):
        t = timesteps[i]
        t_prev = timesteps[i + 1]

        # Predict noise with conditioning
        predicted_noise = model(x, t, condition)

        # DDIM update (same as unconditional)
        x = ddim_update(x, predicted_noise, t, t_prev)

    return x
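
A hypothetical call with class-label conditioning; the form of the condition argument depends entirely on how the model was trained, so this is only an illustration:

import torch

# Hypothetical: model was trained with integer class labels as conditioning
labels = torch.tensor([1, 1, 7, 7])          # e.g. four samples from two classes
samples = sample_ddim_conditional(model, labels, shape=(4, 3, 64, 64), steps=50)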

Results and Impact

Quantitative Results

On ImageNet 256×256:

  • 50 steps: Nearly identical FID to 1000-step DDPM
  • 20 steps: Minor quality degradation, still excellent
  • 10 steps: Noticeable but acceptable quality

Impact on the Field

DDIM transformed diffusion models from research curiosities to practical tools:

  • DALL-E 2: Uses DDIM-style sampling
  • Stable Diffusion: DDIM for fast generation (20-50 steps)
  • Midjourney: Optimized sampling based on DDIM insights

Without DDIM, diffusion would still be too slow for production use.

DDIM vs DDPM Summary

| Aspect          | DDPM        | DDIM                     |
|-----------------|-------------|--------------------------|
| Sampling steps  | 1000        | 20-50                    |
| Sampling time   | Slow        | 20-50x faster            |
| Stochastic      | Yes         | Optional (eta parameter) |
| Quality         | Excellent   | Nearly identical         |
| Training        | Standard    | Use same DDPM model      |
| Flexibility     | Fixed steps | Skip any steps           |
| Reproducibility | Stochastic  | Deterministic (eta=0)    |

When to Use DDIM

Use DDIM when:

  • ✅ You need fast generation
  • ✅ You want deterministic outputs (reproducibility)
  • ✅ You’re doing iterative editing (same seed = same result)
  • ✅ Deploying to production
  • ✅ Real-time or interactive applications

Use DDPM when:

  • ❓ You want maximum diversity (debatable)
  • ❓ Research comparison (matching paper conditions)

In practice, DDIM is almost always preferred.

Practical Tips

  1. Start with 50 steps: Good default for quality/speed balance
  2. Use eta=0: Deterministic sampling is faster and often better
  3. Experiment with step schedules: Linear spacing is good, but quadratic or custom schedules can help
  4. Cache timestep embeddings: Precompute them for the fixed DDIM schedule (see the sketch after this list)
  5. Lower steps for previews: Use 10-20 steps during development
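
A sketch of tip 4, under the assumption that the U-Net uses sinusoidal timestep embeddings and can accept precomputed ones (the embedding function and dimension here are illustrative; adapt to your architecture):

import math
import torch

def sinusoidal_embedding(timesteps, dim=128):
    """Standard sinusoidal embedding for a batch of integer timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = timesteps[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# A DDIM run visits a small fixed set of timesteps, so the embeddings can be
# computed once and reused for every step, image, and batch.
ddim_timesteps = torch.linspace(0, 999, 50).long().flip(0)
embedding_cache = {int(t): sinusoidal_embedding(t.reshape(1)) for t in ddim_timesteps}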

Advanced: Custom Timestep Schedules

Different schedules can improve quality:

import math
import torch

def create_timestep_schedule(T, steps, schedule_type='linear'):
    """Create custom timestep schedule (returned in descending order, T-1 → 0)"""
    if schedule_type == 'linear':
        # Uniform spacing
        return torch.linspace(0, T - 1, steps).long().flip(0)
    elif schedule_type == 'quadratic':
        # More steps for early denoising
        t = torch.linspace(0, 1, steps) ** 2
        return (t * (T - 1)).long().flip(0)
    elif schedule_type == 'cosine':
        # Cosine schedule (already descending, so no flip needed)
        t = torch.linspace(0, 1, steps)
        alpha_bar = torch.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2
        return (alpha_bar * (T - 1)).long()
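
A quick way to see how the schedules differ is to print the first few timesteps each one produces (here with T = 1000 and 10 sampling steps):

# Compare the timesteps visited by each schedule type
for kind in ['linear', 'quadratic', 'cosine']:
    print(kind, create_timestep_schedule(1000, 10, kind)[:5].tolist())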

Limitations and Extensions

Limitations:

  • Still slower than GANs (which generate in 1 step)
  • Quality slightly degrades with very few steps (<20)

Extensions:

  • PLMS (2022): Further speedup with Pseudo Linear Multi-Step
  • DPM-Solver (2022): Optimized ODE solvers for diffusion
  • Consistency Models (2023): Single-step generation while maintaining quality
