
Diffusion Fundamentals

Diffusion models work on a beautifully simple principle: gradually add noise to data until it becomes pure Gaussian noise, then learn to reverse the process.

The Core Idea

Think of the diffusion process like:

  1. Taking a clear image
  2. Adding a tiny bit of noise
  3. Repeating 1000 times until you have pure static
  4. Training a neural network to undo each noise step

Once trained, you can start from pure noise and iteratively denoise to generate new samples.

Forward Process: Adding Noise

Given data $x_0$ (e.g., an image), we gradually corrupt it over $T$ steps (typically $T = 1000$):

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

where $\beta_t$ is set by a noise schedule (e.g., increasing linearly from 0.0001 to 0.02).

What this means:

  • At each step $t$, we add a small amount of Gaussian noise (see the short sketch after this list)
  • The amount of noise is controlled by $\beta_t$
  • After $T$ steps, the data is indistinguishable from pure noise
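Concretely, a single forward step takes only a couple of lines. This is a minimal PyTorch sketch; forward_step, x_prev, and beta_t are illustrative names, with x_prev standing for $x_{t-1}$ and beta_t for the scalar $\beta_t$:

import math
import torch

def forward_step(x_prev, beta_t):
    """One application of q(x_t | x_{t-1}): scale down the signal and add fresh Gaussian noise."""
    noise = torch.randn_like(x_prev)
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * noise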

Visualization of the Process

x_0 (clear image) → x_1 (tiny noise) → x_2 (more noise) → ... → x_999 (almost pure noise) → x_1000 (pure noise)

At different timesteps:

  • t=0: Original image (100% signal)
  • t=250: Slightly noisy (recognizable)
  • t=500: Very noisy (barely recognizable)
  • t=1000: Pure Gaussian noise (0% signal)

Key Mathematical Property

The forward process has a crucial property: we can jump directly to any step $t$ without computing all the intermediate steps:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)$$

where:

  • $\alpha_t = 1 - \beta_t$
  • $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i = \prod_{i=1}^t (1-\beta_i)$

Why this matters: During training, we can sample any timestep directly, making training efficient. We don’t need to run the forward process 1000 times for each training example.
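As a quick sanity check, the following sketch accumulates the per-step signal coefficient and noise variance for 500 steps of the linear schedule and compares them with $\sqrt{\bar{\alpha}_t}$ and $1-\bar{\alpha}_t$ from the closed form (the variable names are for illustration only):

import torch

T = 1000
beta = torch.linspace(0.0001, 0.02, T)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)

# Run the per-step recursion x_t = sqrt(alpha_t) * x_{t-1} + sqrt(beta_t) * noise
# and track the coefficient on x_0 and the total noise variance
signal_coeff = torch.tensor(1.0)
noise_var = torch.tensor(0.0)
for i in range(500):
    signal_coeff = signal_coeff * torch.sqrt(alpha[i])
    noise_var = alpha[i] * noise_var + beta[i]

# Both pairs should agree up to floating-point error
print(signal_coeff.item(), torch.sqrt(alpha_bar[499]).item())
print(noise_var.item(), (1.0 - alpha_bar[499]).item())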

Understanding Alpha and Beta

Let’s clarify the notation:

  • $\beta_t$: Variance of the noise added at step $t$

    • Small values (0.0001 to 0.02)
    • Controls how much noise is added
    • Typically increases over time (the noise schedule)
  • $\alpha_t = 1 - \beta_t$: Retention factor

    • How much of the previous step is kept
    • Close to 1 (0.98 to 0.9999)
  • $\bar{\alpha}_t = \prod_{i=1}^t (1-\beta_i)$: Cumulative retention

    • How much of the original $x_0$ remains at step $t$
    • Decreases from 1.0 (at $t=0$) to near 0 (at $t=T$)

Intuition: $\bar{\alpha}_t$ tells us how much of the original signal survives at timestep $t$:

  • At $t=0$: $\bar{\alpha}_0 = 1$ (100% signal, 0% noise)
  • At $t=500$: $\bar{\alpha}_{500} \approx 0.08$ with the linear schedule above (about 8% signal, 92% noise)
  • At $t=1000$: $\bar{\alpha}_{1000} \approx 0$ (essentially 0% signal, 100% noise)
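These values are easy to inspect directly. A minimal sketch for the linear schedule (the exact numbers depend on the schedule you choose):

import torch

T = 1000
beta = torch.linspace(0.0001, 0.02, T)
alpha_bar = torch.cumprod(1.0 - beta, dim=0)

# alpha_bar[t-1] corresponds to timestep t in the notation above
for t in [1, 250, 500, 750, 1000]:
    print(f"t={t:4d}  alpha_bar={alpha_bar[t-1].item():.4f}")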

Noise Schedule

The noise schedule $\{\beta_1, \beta_2, \ldots, \beta_T\}$ is critical for quality:

Linear Schedule (Original DDPM)

  • $\beta_t$ increases linearly from $\beta_1 = 0.0001$ to $\beta_T = 0.02$
  • Simple but effective
  • Standard for most applications

Cosine Schedule (Improved)

  • Smoother noise addition
  • Better results for high-resolution images
  • Prevents too much noise from being added early in the process
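For reference, here is a sketch of the cosine schedule along the lines of Nichol & Dhariwal's "Improved Denoising Diffusion Probabilistic Models" (the offset s = 0.008 and the 0.999 clipping follow their suggested defaults; treat this as an approximation rather than a drop-in reference implementation):

import math
import torch

def cosine_beta_schedule(T=1000, s=0.008):
    # alpha_bar follows a squared-cosine curve from ~1 down to ~0
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    # Recover per-step betas from consecutive alpha_bar ratios
    beta = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return beta.clamp(max=0.999).float()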

Implementation

import torch

class NoiseSchedule:
    """Precompute noise schedule values for efficiency."""

    def __init__(self, T=1000, beta_start=0.0001, beta_end=0.02):
        self.T = T

        # Linear schedule
        self.beta = torch.linspace(beta_start, beta_end, T)

        # Per-step retention factor: alpha_t = 1 - beta_t
        self.alpha = 1.0 - self.beta

        # Cumulative product (this is alpha_bar_t)
        self.alpha_bar = torch.cumprod(self.alpha, dim=0)

        # Shifted version, used later by the reverse process
        self.alpha_bar_prev = torch.cat([
            torch.tensor([1.0]),
            self.alpha_bar[:-1]
        ])

Forward Diffusion Implementation

def forward_diffusion(x0, t, noise_schedule):
    """
    Apply forward diffusion to x0 at timestep t.

    Args:
        x0: original image, shape (B, C, H, W)
        t: timestep, shape (B,) - integers from 0 to T-1
        noise_schedule: object with precomputed alpha_bar values

    Returns:
        noisy_image: x_t
        noise: the noise that was added
    """
    # Get noise schedule values for timestep t
    alpha_bar_t = noise_schedule.alpha_bar[t]

    # Sample Gaussian noise
    noise = torch.randn_like(x0)

    # Add noise according to the formula:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    sqrt_alpha_bar_t = torch.sqrt(alpha_bar_t).view(-1, 1, 1, 1)
    sqrt_one_minus_alpha_bar_t = torch.sqrt(1 - alpha_bar_t).view(-1, 1, 1, 1)

    noisy_image = sqrt_alpha_bar_t * x0 + sqrt_one_minus_alpha_bar_t * noise
    return noisy_image, noise

Example Usage

# Setup
T = 1000
noise_schedule = NoiseSchedule(T=T)

# Load an image (load_image is a placeholder for your own data loading)
x0 = load_image()  # Shape: (1, 3, 256, 256)

# Sample a random timestep
t = torch.randint(0, T, (1,))  # e.g., t = 500

# Apply forward diffusion
x_t, noise = forward_diffusion(x0, t, noise_schedule)

# Now x_t is the noisy version of x0 at timestep t
# noise is the Gaussian noise that was added

Why This Process Works

The forward process is:

  1. Simple: Just adding Gaussian noise with a schedule
  2. Tractable: We can compute the distribution exactly
  3. Reversible: In theory, we can reverse it step-by-step
  4. Trainable: We can learn a neural network to reverse it

The key insight is that while the forward process destroys information, we can train a neural network to recover it step-by-step.
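Putting the pieces together, a minimal training step might look like the sketch below. Here model stands in for any noise-prediction network (e.g., a U-Net) that takes the noisy image and the timestep as inputs; the loss follows the simplified DDPM objective of regressing the added noise:

import torch
import torch.nn.functional as F

def training_step(model, x0, noise_schedule, optimizer):
    # Pick a random timestep for each image in the batch
    t = torch.randint(0, noise_schedule.T, (x0.shape[0],))

    # Corrupt the clean images with the forward process
    x_t, noise = forward_diffusion(x0, t, noise_schedule)

    # Train the network to predict the noise that was added
    predicted_noise = model(x_t, t)
    loss = F.mse_loss(predicted_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()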

Connection to Score-Based Models

Diffusion models are closely related to score-based generative models:

  • Score function: $\nabla_{x_t} \log p(x_t)$ (the gradient of the log density)
  • Noise prediction: learning $\epsilon_\theta(x_t, t)$ is equivalent to learning the score
  • Connection: $\epsilon_\theta(x_t, t) \approx -\sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t} \log p(x_t)$

This theoretical connection provides deeper understanding of why diffusion models work.
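A brief sketch of the standard argument: for the conditional $q(x_t \mid x_0)$ given by the closed form above, the score is

$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}, \quad \text{since } x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon.$$

Rearranging gives $\epsilon = -\sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t} \log q(x_t \mid x_0)$, so a network trained to predict the noise is implicitly estimating the score.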

Reverse Process Preview

While the forward process adds noise, the reverse process learns to denoise:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

The neural network learns to predict:

  • Either the noise that was added
  • Or the mean of the reverse distribution
  • Or the score function

We’ll cover the reverse process in detail in the DDPM paper analysis.
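As a preview, here is a minimal sketch of one reverse step under the noise-prediction parameterization, using the fixed variance choice $\sigma_t^2 = \beta_t$ from DDPM (model again stands in for a trained noise-prediction network; this is a simplified illustration, not the full sampler):

@torch.no_grad()
def reverse_step(model, x_t, t, noise_schedule):
    # Look up schedule values for this timestep (t is a Python int here)
    beta_t = noise_schedule.beta[t]
    alpha_t = noise_schedule.alpha[t]
    alpha_bar_t = noise_schedule.alpha_bar[t]

    # Predict the noise, then form the mean of p_theta(x_{t-1} | x_t)
    predicted_noise = model(x_t, torch.tensor([t]))
    mean = (x_t - beta_t / torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_t)

    if t == 0:
        return mean  # no noise is added at the final step
    return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)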

Key Papers

  • Sohl-Dickstein et al. (2015): “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” - Original diffusion paper
  • Ho et al. (2020): “Denoising Diffusion Probabilistic Models” - Modern formulation
  • Song et al. (2020): “Score-Based Generative Modeling through Stochastic Differential Equations” - Score-based view
