Diffusion Fundamentals
Diffusion models work on a beautifully simple principle: gradually add noise to data until it becomes pure Gaussian noise, then learn to reverse the process.
The Core Idea
Think of the diffusion process like this:
- Taking a clear image
- Adding a tiny bit of noise
- Repeating 1000 times until you have pure static
- Training a neural network to undo each noise step
Once trained, you can start from pure noise and iteratively denoise to generate new samples.
Forward Process: Adding Noise
Given data $x_0$ (e.g., an image), we gradually corrupt it over $T$ steps (typically $T = 1000$):

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)$$

where $\beta_t$ is a noise schedule (e.g., linearly increasing from 0.0001 to 0.02).
What this means:
- At each step $t$, we add a small amount of Gaussian noise
- The noise amount is controlled by $\beta_t$
- After $T$ steps, the data becomes indistinguishable from noise (a one-step code sketch follows this list)
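To make the single-step formula concrete, here is a minimal sketch of one forward step in PyTorch; the function name `forward_step` and the toy tensor shapes are illustrative assumptions, not part of any particular library.

import torch

def forward_step(x_prev, beta_t):
    """One forward step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise"""
    noise = torch.randn_like(x_prev)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise

# Toy example: apply one small noise step to a random "image"
x_prev = torch.rand(1, 3, 8, 8)   # stand-in for an image tensor
x_t = forward_step(x_prev, beta_t=0.0001)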
Visualization of the Process
x_0 (clear image) → x_1 → x_2 → ... → x_999 → x_1000 (pure noise)
  [tiny noise]      [more noise]       [almost noise]   [complete noise]

At different timesteps:
- t=0: Original image (100% signal)
- t=250: Slightly noisy (recognizable)
- t=500: Very noisy (barely recognizable)
- t=1000: Pure Gaussian noise (0% signal)
Key Mathematical Property
The forward process has a crucial property: we can jump directly to any step $t$ without computing all intermediate steps:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\mathbf{I}\right)$$

or equivalently $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$,
where:

$$\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$

Why this matters: During training, we can sample any timestep $t$ directly, making training efficient. We don’t need to run the forward process 1000 times for each training example.
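As a quick sketch of why this closed form holds (not a full proof), unroll the one-step update and merge the independent Gaussian noise terms:

$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\,\epsilon_{t-1} \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\,\bar{\epsilon}_{t-2} \\
    &\;\;\vdots \\
    &= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})
\end{aligned}
$$

Each merge works because the sum of two independent zero-mean Gaussians is again Gaussian, with variances adding: $\alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$.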
Understanding Alpha and Beta
Let’s clarify the notation:
- $\beta_t$: Variance of noise added at step $t$
  - Small values (0.0001 to 0.02)
  - Controls how much noise is added
  - Typically increases over time (noise schedule)
- $\alpha_t = 1 - \beta_t$: Retention factor
  - How much of the previous step is kept
  - Close to 1 (0.98 to 0.9999)
- $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$: Cumulative retention
  - How much of the original signal remains at step $t$
  - Decreases from 1.0 (at $t = 0$) to near 0 (at $t = T$)
Intuition: $\bar{\alpha}_t$ tells us the signal-to-noise ratio at timestep $t$ (the snippet after this list puts rough numbers on it):
- At $t = 0$: $\bar{\alpha}_t \approx 1$ (100% signal, 0% noise)
- At an intermediate $t$: $\bar{\alpha}_t$ is partway between (e.g., 30% signal, 70% noise)
- At $t = T$: $\bar{\alpha}_t \approx 0$ (0% signal, 100% noise)
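To put rough numbers on this intuition, the snippet below computes $\bar{\alpha}_t$ under the linear schedule used in this note (0.0001 to 0.02 over 1000 steps); the exact percentages depend on the schedule, so treat the printed values as illustrative.

import torch

T = 1000
beta = torch.linspace(0.0001, 0.02, T)
alpha_bar = torch.cumprod(1.0 - beta, dim=0)

# Cumulative retention at a few (0-indexed) timesteps
for t in [0, 249, 499, 999]:
    print(f"step {t + 1}: alpha_bar = {alpha_bar[t].item():.4f}")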
Noise Schedule
The noise schedule is critical for quality:
Linear Schedule (Original DDPM)
- $\beta_t$ increases linearly from $\beta_1 = 0.0001$ to $\beta_T = 0.02$
- Simple but effective
- Standard for most applications
Cosine Schedule (Improved)
- Smoother noise addition
- Better results for high-resolution images
- Prevents too much noise from being added early in the process (see the sketch below)
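A minimal sketch of the cosine schedule, following the formulation in Nichol & Dhariwal (2021); the offset s = 0.008 and the clipping of $\beta_t$ at 0.999 are the values suggested in that paper, and the function name is an assumption for illustration.

import math
import torch

def cosine_schedule(T=1000, s=0.008):
    """Cosine schedule: alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)"""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T + s) / (1 + s)) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    # Recover per-step betas from the cumulative product, clipped for numerical stability
    beta = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return beta.clamp(max=0.999).float()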
Implementation
import torch


class NoiseSchedule:
    """Precompute noise schedule values for efficiency"""

    def __init__(self, T=1000, beta_start=0.0001, beta_end=0.02):
        self.T = T
        # Linear schedule
        self.beta = torch.linspace(beta_start, beta_end, T)
        # Compute alpha values
        self.alpha = 1.0 - self.beta
        # Compute cumulative product (this is bar_alpha_t)
        self.alpha_bar = torch.cumprod(self.alpha, dim=0)
        # For the reverse process: bar_alpha_{t-1}, with bar_alpha_0 = 1
        self.alpha_bar_prev = torch.cat([
            torch.tensor([1.0]),
            self.alpha_bar[:-1]
        ])

Forward Diffusion Implementation
def forward_diffusion(x0, t, noise_schedule):
    """
    Apply forward diffusion to x0 at timestep t

    Args:
        x0: original image (B, C, H, W)
        t: timestep (B,) - integers from 0 to T-1
        noise_schedule: object with alpha_bar values
    Returns:
        noisy_image: x_t
        noise: the noise that was added
    """
    # Get noise schedule values for timestep t
    alpha_bar_t = noise_schedule.alpha_bar[t]
    # Sample Gaussian noise
    noise = torch.randn_like(x0)
    # Add noise according to the formula:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    sqrt_alpha_bar_t = torch.sqrt(alpha_bar_t).view(-1, 1, 1, 1)
    sqrt_one_minus_alpha_bar_t = torch.sqrt(1 - alpha_bar_t).view(-1, 1, 1, 1)
    noisy_image = sqrt_alpha_bar_t * x0 + sqrt_one_minus_alpha_bar_t * noise
    return noisy_image, noise

Example Usage
# Setup
T = 1000
noise_schedule = NoiseSchedule(T=T)

# Load an image
x0 = load_image()  # Shape: (1, 3, 256, 256)

# Sample a random timestep
t = torch.randint(0, T, (1,))  # e.g., t = 500

# Apply forward diffusion
x_t, noise = forward_diffusion(x0, t, noise_schedule)

# Now x_t is the noisy version of x0 at timestep t
# noise is the Gaussian noise that was added
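To reproduce the progression sketched in the visualization above, the following snippet builds on NoiseSchedule and forward_diffusion, with a random tensor standing in for a real image; the printed statistics are just a rough proxy for how noisy each $x_t$ is.

# Generate noisy versions of the same "image" at several timesteps
x0 = torch.rand(1, 3, 256, 256)   # stand-in for a loaded image
noise_schedule = NoiseSchedule(T=1000)

for step in [0, 250, 500, 999]:
    t = torch.tensor([step])
    x_t, _ = forward_diffusion(x0, t, noise_schedule)
    print(f"t={step}: mean={x_t.mean().item():.3f}, std={x_t.std().item():.3f}")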
Why This Process Works
The forward process is:
- Simple: Just adding Gaussian noise with a schedule
- Tractable: We can compute the distribution $q(x_t \mid x_0)$ exactly
- Reversible: In theory, we can reverse it step-by-step
- Trainable: We can learn a neural network to reverse it
The key insight is that while the forward process destroys information, we can train a neural network to recover it step-by-step.
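As a preview of how that recovery is learned, here is a sketch of the simplified training objective from Ho et al. (2020), reusing the forward_diffusion helper above; `model` is a placeholder for any noise-prediction network (typically a U-Net) that takes the noisy image and the timestep.

import torch.nn.functional as F

def training_loss(model, x0, noise_schedule):
    """Sketch of the simplified DDPM objective: predict the added noise, minimize MSE."""
    B = x0.shape[0]
    # Sample a random timestep for each example in the batch
    t = torch.randint(0, noise_schedule.T, (B,))
    # Corrupt the clean images using the forward process defined above
    x_t, noise = forward_diffusion(x0, t, noise_schedule)
    # The model predicts the noise from the noisy image and timestep
    predicted_noise = model(x_t, t)
    return F.mse_loss(predicted_noise, noise)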
Connection to Score-Based Models
Diffusion models are closely related to score-based generative models:
- Score function: $\nabla_{x_t} \log p(x_t)$ (gradient of log density)
- Noise prediction: Learning $\epsilon_\theta(x_t, t)$ is equivalent to learning the score
- Connection: $\nabla_{x_t} \log p(x_t) \approx -\dfrac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$
This theoretical connection provides deeper understanding of why diffusion models work.
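In code, converting a trained noise predictor into a score estimate is a one-liner; this sketch assumes the NoiseSchedule defined above and a hypothetical noise-prediction `model`.

def predicted_score(model, x_t, t, noise_schedule):
    """Score estimate: score(x_t) ≈ -eps_theta(x_t, t) / sqrt(1 - alpha_bar_t)"""
    eps = model(x_t, t)
    scale = torch.sqrt(1 - noise_schedule.alpha_bar[t]).view(-1, 1, 1, 1)
    return -eps / scale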
Reverse Process Preview
While the forward process adds noise, the reverse process learns to denoise:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

The neural network learns to predict:
- Either the noise $\epsilon_\theta(x_t, t)$ that was added
- Or the mean $\mu_\theta(x_t, t)$ of the reverse distribution
- Or the score function $\nabla_{x_t} \log p(x_t)$ (all three are related; see the identity below)
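For example, in the DDPM parameterization the reverse mean is obtained directly from the predicted noise:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

so a network trained to predict noise implicitly defines the reverse mean.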
We’ll cover the reverse process in detail in the DDPM paper analysis.
Related Concepts
- Generative Models - Why diffusion vs GANs/VAEs
- DDPM Paper - Training the reverse process
- DDIM Paper - Fast sampling with step skipping
Key Papers
- Sohl-Dickstein et al. (2015): “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” - Original diffusion paper
- Ho et al. (2020): “Denoising Diffusion Probabilistic Models” - Modern formulation
- Song et al. (2020): “Score-Based Generative Modeling through Stochastic Differential Equations” - Score-based view
Learning Resources
Mathematical Foundations
- Lilian Weng: Diffusion Models - Comprehensive mathematical treatment
- Yang Song: Score-Based Models - Connection to score matching
Visual Explanations
- AI Coffee Break: Diffusion Models Explained - Visual walkthrough
- Outlier: Diffusion Models - Clear introduction
Implementation
- Hugging Face Diffusers - Production implementations
- PyTorch Tutorial - Step-by-step implementation