
Diffusion Fundamentals

Diffusion models work on a beautifully simple principle: gradually add noise to data until it becomes pure Gaussian noise, then learn to reverse the process.

The Core Idea

Think of the diffusion process like:

  1. Taking a clear image
  2. Adding a tiny bit of noise
  3. Repeating 1000 times until you have pure static
  4. Training a neural network to undo each noise step

Once trained, you can start from pure noise and iteratively denoise to generate new samples.

Forward Process: Adding Noise

Given data $x_0$ (e.g., an image), we gradually corrupt it over $T$ steps (typically $T = 1000$):

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

where $\beta_t$ is set by a noise schedule (e.g., increasing linearly from 0.0001 to 0.02).

What this means:

  • At each step $t$, we add a small amount of Gaussian noise (see the short sketch after this list)
  • The amount of noise is controlled by $\beta_t$
  • After $T$ steps, the data is indistinguishable from pure noise
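Concretely, a single forward step takes only a couple of lines. This is a minimal PyTorch sketch; forward_step, x_prev, and beta_t are illustrative names, with x_prev standing for $x_{t-1}$ and beta_t for the scalar $\beta_t$:

import math
import torch

def forward_step(x_prev, beta_t):
    """One application of q(x_t | x_{t-1}): scale down the signal and add fresh Gaussian noise."""
    noise = torch.randn_like(x_prev)
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * noise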

Visualization of the Process

x_0 (clear image) → x_1 (tiny noise) → x_2 (more noise) → ... → x_999 (almost pure noise) → x_1000 (pure noise)

At different timesteps:

  • t=0: Original image (100% signal)
  • t=250: Slightly noisy (recognizable)
  • t=500: Very noisy (barely recognizable)
  • t=1000: Pure Gaussian noise (0% signal)

Key Mathematical Property

The forward process has a crucial property: we can jump directly to any step $t$ without computing all the intermediate steps:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)$$

where:

  • $\alpha_t = 1 - \beta_t$
  • $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i = \prod_{i=1}^t (1-\beta_i)$

Why this matters: During training, we can sample any timestep directly, making training efficient. We don’t need to run the forward process 1000 times for each training example.
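As a quick sanity check, the following sketch accumulates the per-step signal coefficient and noise variance for 500 steps of the linear schedule and compares them with $\sqrt{\bar{\alpha}_t}$ and $1-\bar{\alpha}_t$ from the closed form (the variable names are for illustration only):

import torch

T = 1000
beta = torch.linspace(0.0001, 0.02, T)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)

# Run the per-step recursion x_t = sqrt(alpha_t) * x_{t-1} + sqrt(beta_t) * noise
# and track the coefficient on x_0 and the total noise variance
signal_coeff = torch.tensor(1.0)
noise_var = torch.tensor(0.0)
for i in range(500):
    signal_coeff = signal_coeff * torch.sqrt(alpha[i])
    noise_var = alpha[i] * noise_var + beta[i]

# Both pairs should agree up to floating-point error
print(signal_coeff.item(), torch.sqrt(alpha_bar[499]).item())
print(noise_var.item(), (1.0 - alpha_bar[499]).item())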

Understanding Alpha and Beta

Let’s clarify the notation:

  • $\beta_t$: Variance of the noise added at step $t$

    • Small values (0.0001 to 0.02)
    • Controls how much noise is added
    • Typically increases over time (the noise schedule)
  • $\alpha_t = 1 - \beta_t$: Retention factor

    • How much of the previous step is kept
    • Close to 1 (0.98 to 0.9999)
  • $\bar{\alpha}_t = \prod_{i=1}^t (1-\beta_i)$: Cumulative retention

    • How much of the original $x_0$ remains at step $t$
    • Decreases from 1.0 (at $t=0$) to near 0 (at $t=T$)

Intuition: $\bar{\alpha}_t$ tells us how much of the original signal survives at timestep $t$:

  • At $t=0$: $\bar{\alpha}_0 = 1$ (100% signal, 0% noise)
  • At $t=500$: $\bar{\alpha}_{500} \approx 0.08$ with the linear schedule above (about 8% signal, 92% noise)
  • At $t=1000$: $\bar{\alpha}_{1000} \approx 0$ (essentially 0% signal, 100% noise)
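These values are easy to inspect directly. A minimal sketch for the linear schedule (the exact numbers depend on the schedule you choose):

import torch

T = 1000
beta = torch.linspace(0.0001, 0.02, T)
alpha_bar = torch.cumprod(1.0 - beta, dim=0)

# alpha_bar[t-1] corresponds to timestep t in the notation above
for t in [1, 250, 500, 750, 1000]:
    print(f"t={t:4d}  alpha_bar={alpha_bar[t-1].item():.4f}")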

Noise Schedule

The noise schedule $\{\beta_1, \beta_2, \ldots, \beta_T\}$ is critical for quality:

Linear Schedule (Original DDPM)

  • $\beta_t$ increases linearly from $\beta_1 = 0.0001$ to $\beta_T = 0.02$
  • Simple but effective
  • Standard for most applications

Cosine Schedule (Improved)

  • Smoother noise addition
  • Better results for high-resolution images
  • Prevents too much noise from being added early in the process
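For reference, here is a sketch of the cosine schedule along the lines of Nichol & Dhariwal's "Improved Denoising Diffusion Probabilistic Models" (the offset s = 0.008 and the 0.999 clipping follow their suggested defaults; treat this as an approximation rather than a drop-in reference implementation):

import math
import torch

def cosine_beta_schedule(T=1000, s=0.008):
    # alpha_bar follows a squared-cosine curve from ~1 down to ~0
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    # Recover per-step betas from consecutive alpha_bar ratios
    beta = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return beta.clamp(max=0.999).float()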

Implementation

import torch

class NoiseSchedule:
    """Precompute noise schedule values for efficiency."""

    def __init__(self, T=1000, beta_start=0.0001, beta_end=0.02):
        self.T = T

        # Linear schedule
        self.beta = torch.linspace(beta_start, beta_end, T)

        # Per-step retention factor: alpha_t = 1 - beta_t
        self.alpha = 1.0 - self.beta

        # Cumulative product (this is alpha_bar_t)
        self.alpha_bar = torch.cumprod(self.alpha, dim=0)

        # Shifted version, used later by the reverse process
        self.alpha_bar_prev = torch.cat([
            torch.tensor([1.0]),
            self.alpha_bar[:-1]
        ])

Forward Diffusion Implementation

def forward_diffusion(x0, t, noise_schedule):
    """
    Apply forward diffusion to x0 at timestep t.

    Args:
        x0: original image, shape (B, C, H, W)
        t: timestep, shape (B,) - integers from 0 to T-1
        noise_schedule: object with precomputed alpha_bar values

    Returns:
        noisy_image: x_t
        noise: the noise that was added
    """
    # Get noise schedule values for timestep t
    alpha_bar_t = noise_schedule.alpha_bar[t]

    # Sample Gaussian noise
    noise = torch.randn_like(x0)

    # Add noise according to the formula:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    sqrt_alpha_bar_t = torch.sqrt(alpha_bar_t).view(-1, 1, 1, 1)
    sqrt_one_minus_alpha_bar_t = torch.sqrt(1 - alpha_bar_t).view(-1, 1, 1, 1)

    noisy_image = sqrt_alpha_bar_t * x0 + sqrt_one_minus_alpha_bar_t * noise
    return noisy_image, noise

Example Usage

# Setup
T = 1000
noise_schedule = NoiseSchedule(T=T)

# Load an image (load_image is a placeholder for your own data loading)
x0 = load_image()  # Shape: (1, 3, 256, 256)

# Sample a random timestep
t = torch.randint(0, T, (1,))  # e.g., t = 500

# Apply forward diffusion
x_t, noise = forward_diffusion(x0, t, noise_schedule)

# Now x_t is the noisy version of x0 at timestep t
# noise is the Gaussian noise that was added

Why This Process Works

The forward process is:

  1. Simple: Just adding Gaussian noise with a schedule
  2. Tractable: We can compute the distribution exactly
  3. Reversible: In theory, we can reverse it step-by-step
  4. Trainable: We can learn a neural network to reverse it

The key insight is that while the forward process destroys information, we can train a neural network to recover it step-by-step.
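Putting the pieces together, a minimal training step might look like the sketch below. Here model stands in for any noise-prediction network (e.g., a U-Net) that takes the noisy image and the timestep as inputs; the loss follows the simplified DDPM objective of regressing the added noise:

import torch
import torch.nn.functional as F

def training_step(model, x0, noise_schedule, optimizer):
    # Pick a random timestep for each image in the batch
    t = torch.randint(0, noise_schedule.T, (x0.shape[0],))

    # Corrupt the clean images with the forward process
    x_t, noise = forward_diffusion(x0, t, noise_schedule)

    # Train the network to predict the noise that was added
    predicted_noise = model(x_t, t)
    loss = F.mse_loss(predicted_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()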

Connection to Score-Based Models

Diffusion models are closely related to score-based generative models:

  • Score function: $\nabla_{x_t} \log p(x_t)$ (the gradient of the log density)
  • Noise prediction: learning $\epsilon_\theta(x_t, t)$ is equivalent to learning the score
  • Connection: $\epsilon_\theta(x_t, t) \approx -\sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t} \log p(x_t)$

This theoretical connection provides deeper understanding of why diffusion models work.
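A brief sketch of the standard argument: for the conditional $q(x_t \mid x_0)$ given by the closed form above, the score is

$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}, \quad \text{since } x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon.$$

Rearranging gives $\epsilon = -\sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t} \log q(x_t \mid x_0)$, so a network trained to predict the noise is implicitly estimating the score.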

Reverse Process Preview

While the forward process adds noise, the reverse process learns to denoise:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

The neural network learns to predict:

  • Either the noise that was added
  • Or the mean of the reverse distribution
  • Or the score function

We’ll cover the reverse process in detail in the DDPM paper analysis.
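As a preview, here is a minimal sketch of one reverse step under the noise-prediction parameterization, using the fixed variance choice $\sigma_t^2 = \beta_t$ from DDPM (model again stands in for a trained noise-prediction network; this is a simplified illustration, not the full sampler):

@torch.no_grad()
def reverse_step(model, x_t, t, noise_schedule):
    # Look up schedule values for this timestep (t is a Python int here)
    beta_t = noise_schedule.beta[t]
    alpha_t = noise_schedule.alpha[t]
    alpha_bar_t = noise_schedule.alpha_bar[t]

    # Predict the noise, then form the mean of p_theta(x_{t-1} | x_t)
    predicted_noise = model(x_t, torch.tensor([t]))
    mean = (x_t - beta_t / torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_t)

    if t == 0:
        return mean  # no noise is added at the final step
    return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)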

Key Papers

  • Sohl-Dickstein et al. (2015): “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” - Original diffusion paper
  • Ho et al. (2020): “Denoising Diffusion Probabilistic Models” - Modern formulation
  • Song et al. (2020): “Score-Based Generative Modeling through Stochastic Differential Equations” - Score-based view
