
Classifier-Free Guidance

Classifier-free guidance (CFG) is the breakthrough technique that enables high-quality text-to-image generation in DALL-E, Stable Diffusion, and Midjourney. It solves the key challenge: how to make diffusion models strongly follow text prompts without sacrificing generation quality.

The Goal

Generate images conditioned on text prompts:

Prompt: "A cat wearing a wizard hat, digital art" → Diffusion model generates corresponding image

Challenge: Simple conditioning (just adding text embeddings) produces weak adherence to prompts and lower quality than unconditional generation.

Text Encoding with CLIP

From the CLIP paper, we learned that CLIP creates a shared embedding space for text and images. We use CLIP’s text encoder to get semantic embeddings:

import clip

# Load CLIP model
clip_model, preprocess = clip.load("ViT-B/32")

# Encode text
text = "A cat wearing a wizard hat, digital art"
text_tokens = clip.tokenize([text])
text_embedding = clip_model.encode_text(text_tokens)  # Shape: (1, 512)

Conditional Diffusion Model

Modify the U-Net to accept text conditioning:

class ConditionalDiffusionModel(nn.Module):
    def __init__(self, text_embed_dim=512):
        super().__init__()
        self.unet = UNet(
            in_channels=3,
            out_channels=3,
            context_dim=text_embed_dim,  # Add text context
            num_res_blocks=2,
            attention_resolutions=[16, 8],
        )

    def forward(self, x, t, text_embedding=None):
        """
        Predict noise conditioned on text

        Args:
            x: noisy image (B, C, H, W)
            t: timestep (B,)
            text_embedding: CLIP text embedding (B, 512) or None

        Returns:
            predicted_noise: (B, C, H, W)
        """
        return self.unet(x, t, context=text_embedding)

Cross-Attention for Conditioning

Inside the U-Net, use cross-attention to inject text information:

import math

import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    def __init__(self, dim, context_dim):
        super().__init__()
        self.scale = math.sqrt(dim)  # store the scaling factor for the attention logits
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(context_dim, dim)
        self.to_v = nn.Linear(context_dim, dim)

    def forward(self, x, context):
        """
        x: image features (B, H*W, dim)
        context: text embedding (B, seq_len, context_dim)
        """
        q = self.to_q(x)        # Query from image
        k = self.to_k(context)  # Key from text
        v = self.to_v(context)  # Value from text

        # Attention: image features attend to text
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        out = attn @ v
        return out

How it works:

  1. Image features act as queries
  2. Text embedding provides keys and values
  3. Attention allows each pixel to “look at” relevant text concepts
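
As a quick shape check, here is a small usage sketch of the CrossAttentionBlock above (the sizes are illustrative, not from any particular model):

import torch

# Illustrative sizes: a 16x16 feature map with 320 channels attending to 77 CLIP tokens
block = CrossAttentionBlock(dim=320, context_dim=512)

image_features = torch.randn(4, 16 * 16, 320)  # (B, H*W, dim)
text_context = torch.randn(4, 77, 512)         # (B, seq_len, context_dim)

out = block(image_features, text_context)
print(out.shape)  # torch.Size([4, 256, 320]) -- one fused feature per spatial location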

The Problem with Simple Conditioning

Training with just text embeddings produces:

  • ❌ Weak adherence to prompts
  • ❌ Lower quality than unconditional generation
  • ❌ Trade-off between following prompt and image quality

Why? The model learns to generate good images OR follow prompts, but not both strongly.

The Classifier-Free Guidance Solution

Key idea: Train the model to work both with and without text conditioning, then interpolate during sampling to amplify the conditioning effect.

Training with Classifier-Free Guidance

import torch
import torch.nn.functional as F


def train_step_cfg(model, x0, text_embedding, noise_schedule, optimizer, drop_prob=0.1):
    """
    Train with classifier-free guidance

    Args:
        drop_prob: probability of dropping conditioning (e.g., 0.1 = 10%)
    """
    batch_size = x0.shape[0]
    device = x0.device

    # Randomly drop conditioning for some samples
    mask = torch.rand(batch_size, device=device) < drop_prob
    text_embedding_masked = text_embedding.clone()
    text_embedding_masked[mask] = 0  # Zero out for unconditional

    # Standard diffusion training
    t = torch.randint(0, noise_schedule.T, (batch_size,), device=device)
    noise = torch.randn_like(x0)
    alpha_bar_t = noise_schedule.alpha_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1 - alpha_bar_t) * noise

    # Predict noise with masked conditioning
    predicted_noise = model(x_t, t, text_embedding_masked)
    loss = F.mse_loss(predicted_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()

What this does:

  • 90% of the time: train with text conditioning
  • 10% of the time: train without conditioning (unconditional)
  • Model learns both conditional and unconditional generation simultaneously
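
For context, a minimal training loop around train_step_cfg might look like the sketch below; dataloader, encode_text, and noise_schedule are assumed to exist (they are not defined in this section), and the hyperparameters are illustrative.

import torch

model = ConditionalDiffusionModel(text_embed_dim=512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
num_epochs = 10  # illustrative

for epoch in range(num_epochs):
    for images, captions in dataloader:          # assumed to yield (images, list of captions)
        text_embedding = encode_text(captions)   # CLIP text embeddings, (B, 512)
        loss = train_step_cfg(
            model, images, text_embedding,
            noise_schedule, optimizer,
            drop_prob=0.1,                        # 10% of samples trained unconditionally
        )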

The Guidance Formula

During generation, combine conditional and unconditional predictions:

\tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + s \cdot \big( \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset) \big)

where:

  • c is the text conditioning (CLIP embedding)
  • ∅ denotes the unconditional case (no text)
  • s is the guidance scale (typically 7.5)

In words:

  1. Predict noise with text: ε_cond
  2. Predict noise without text: ε_uncond
  3. Amplify the difference: ε_final = ε_uncond + s · (ε_cond - ε_uncond)

Intuition: The unconditional model generates “average” images. The conditional model shifts toward the text prompt. By amplifying this shift (with s > 1), we get much stronger conditioning.
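
A toy numeric example of the amplification (made-up numbers, with one scalar standing in for a whole noise tensor): a conditional/unconditional difference of 0.3 becomes a shift of 2.25 at the default scale.

import torch

eps_uncond = torch.tensor([0.20])  # "average image" prediction
eps_cond = torch.tensor([0.50])    # prediction shifted toward the prompt
s = 7.5

eps_final = eps_uncond + s * (eps_cond - eps_uncond)
print(eps_final)  # tensor([2.4500]) -- the 0.30 shift is amplified 7.5x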

Sampling with Classifier-Free Guidance

@torch.no_grad()
def sample_with_cfg(model, text_embedding, shape, steps=50, guidance_scale=7.5):
    """
    Sample with classifier-free guidance

    Args:
        guidance_scale: strength of conditioning
            1.0 = no guidance (conditional only)
            7.5 = strong guidance (standard)
            15+ = very strong (may reduce quality)
    """
    model.eval()
    x = torch.randn(shape)
    timesteps = create_timestep_schedule(steps)

    for i in range(len(timesteps) - 1):
        t = timesteps[i]
        t_prev = timesteps[i + 1]
        t_batch = torch.full((shape[0],), t)

        # Unconditional prediction (no text)
        noise_uncond = model(x, t_batch, text_embedding=None)

        # Conditional prediction (with text)
        noise_cond = model(x, t_batch, text_embedding=text_embedding)

        # Classifier-free guidance
        noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

        # DDIM step with guided noise prediction
        x = ddim_step(x, noise_pred, t, t_prev)

    return x

Guidance Scale: The Key Hyperparameter

The guidance scale s controls the trade-off between prompt adherence and sample quality:

| Guidance Scale | Effect | Use Case |
| --- | --- | --- |
| s = 1.0 | No guidance (conditional prediction only) | Maximum diversity, weak prompt adherence |
| s = 3-5 | Mild guidance, natural looking | Subtle prompts, artistic freedom |
| s = 7.5 | Standard, good balance | Default for most prompts |
| s = 10-15 | Strong adherence to prompt | Detailed specifications |
| s = 20+ | Very strong, may add artifacts | Forcing specific attributes |

# Generate with different guidance scales
scales = [1.0, 5.0, 7.5, 15.0]
prompt = "A cat wearing a wizard hat"

for scale in scales:
    image = sample_with_cfg(
        model,
        encode_text(prompt),
        (1, 3, 256, 256),
        guidance_scale=scale,
    )
    save_image(image, f"cat_wizard_scale_{scale}.png")

Practical recommendation: Start with guidance scale = 7.5 for most prompts. Increase for more prompt adherence, decrease for more natural/diverse images.

Negative Prompts

A powerful extension: guide generation away from unwanted concepts.

@torch.no_grad()
def sample_with_negative_prompt(
    model, positive_text, negative_text, shape, steps=50, guidance_scale=7.5
):
    """
    Generate with negative prompts

    Args:
        positive_text: what you want ("beautiful landscape")
        negative_text: what you don't want ("blurry, low quality")
    """
    positive_emb = encode_text(positive_text)
    negative_emb = encode_text(negative_text)

    x = torch.randn(shape)
    timesteps = create_timestep_schedule(steps)

    for i in range(len(timesteps) - 1):
        t = timesteps[i]
        t_prev = timesteps[i + 1]
        t_batch = torch.full((shape[0],), t)

        # Negative prompt as "unconditional"
        noise_negative = model(x, t_batch, text_embedding=negative_emb)

        # Positive prompt as "conditional"
        noise_positive = model(x, t_batch, text_embedding=positive_emb)

        # Guidance away from negative, toward positive
        noise_pred = noise_negative + guidance_scale * (noise_positive - noise_negative)

        x = ddim_step(x, noise_pred, t, t_prev)

    return x

Example usage:

positive = "A beautiful sunset over mountains, highly detailed, 8k" negative = "blurry, low quality, distorted, ugly, watermark" image = sample_with_negative_prompt(model, positive, negative, (1, 3, 512, 512))

Negative prompts dramatically improve quality in practice and are standard in all modern text-to-image systems.

Why Classifier-Free Guidance Works

Intuition

  • The unconditional model ε_θ(x_t, t, ∅) generates “average” images
  • The conditional model ε_θ(x_t, t, c) shifts toward the text prompt
  • The difference (ε_cond - ε_uncond) is the “direction” toward the prompt
  • Amplifying this difference with s > 1 gives much stronger conditioning

Mathematical Insight

Classifier-free guidance is equivalent to adding the gradient of a classifier, ∇_x log p(c | x), but we don’t need an actual classifier (hence “classifier-free”). The model implicitly learns this classification signal through conditioning.

Formally, it approximates:

\nabla_x \log p(x_t \mid c) \approx \nabla_x \log p(x_t) + s \cdot \nabla_x \log p(c \mid x_t)
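
One way to see why, sketched here under the standard assumption that the noise prediction is a scaled score, ε_θ(x_t, t) ≈ -σ_t ∇_x log p(x_t) with σ_t = sqrt(1 - ᾱ_t): Bayes’ rule rewrites the classifier gradient as a difference of scores, and substituting the noise-prediction parameterization recovers the guidance formula.

% Bayes' rule: the classifier gradient is a difference of scores
\nabla_x \log p(c \mid x_t) = \nabla_x \log p(x_t \mid c) - \nabla_x \log p(x_t)

% Substituting \epsilon_\theta \approx -\sigma_t \nabla_x \log p into the guided score
\tilde{\epsilon}_\theta(x_t, t, c)
  = -\sigma_t \left( \nabla_x \log p(x_t) + s \cdot \nabla_x \log p(c \mid x_t) \right)
  = \epsilon_\theta(x_t, t, \emptyset) + s \cdot \left( \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset) \right)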

Comparison to Classifier Guidance

| Aspect | Classifier Guidance | Classifier-Free Guidance |
| --- | --- | --- |
| Requires separate classifier | Yes (must train a classifier on noisy images) | No |
| Training complexity | Two models | One model |
| Sample quality | Good | Better |
| Flexibility | Limited | High (negative prompts, etc.) |
| Current use | Obsolete | Industry standard |

All modern text-to-image systems (DALL-E 2, Stable Diffusion, Midjourney, Imagen) use classifier-free guidance.

Healthcare Application: Conditional Synthetic Data

Generate synthetic medical data with specific characteristics:

Medical Image Generation

# Chest X-ray with specific pathology
positive_prompt = "chest X-ray showing pneumonia, clear infiltrates, medical imaging"
negative_prompt = "healthy lungs, normal, no pathology, artifacts"

synthetic_xray = sample_with_negative_prompt(
    medical_diffusion_model,
    positive_prompt,
    negative_prompt,
    (1, 1, 512, 512),
    guidance_scale=10.0,  # Strong adherence to pathology
)

Rare Condition Synthesis

# Generate synthetic EHR data for a rare disease
condition_prompt = "patient trajectory with diabetic ketoacidosis and sepsis, ICU admission"
normal_prompt = "healthy patient, normal vitals, stable condition"

synthetic_trajectory = generate_ehr_sequence(
    ehr_diffusion_model,
    condition_prompt,
    normal_prompt,
    guidance_scale=12.0,  # Very strong adherence to the rare condition
)

Applications:

  • Data augmentation for rare diseases
  • Privacy-preserving synthetic datasets
  • Testing clinical decision support systems
  • Training models on balanced datasets

Practical Tips

  1. Default guidance scale: Start with 7.5
  2. Use negative prompts: Dramatically improves quality
    • Common negative: “blurry, low quality, distorted, ugly, watermark, text”
  3. Detailed prompts work better:
    • ❌ Weak: “A cat”
    • ✅ Better: “A fluffy orange cat sitting on a windowsill, digital art, highly detailed, trending on artstation”
  4. Experiment with scales: Different prompts need different guidance strengths
    • Abstract concepts: lower guidance (5-7)
    • Specific objects: higher guidance (10-15)
  5. Trade-offs: Higher guidance → better prompt adherence but possible artifacts

Computational Cost

Classifier-free guidance requires two forward passes per timestep:

  • One unconditional pass
  • One conditional pass

This doubles the computational cost compared to simple conditional generation, but the quality improvement is worth it.

Optimization: Some implementations batch both passes together for efficiency.
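
In practice, many implementations (the diffusers pipelines, for example) concatenate the unconditional and conditional inputs into one batch so the U-Net runs once per timestep. A minimal sketch against the toy model above, assuming a zeroed embedding stands in for the unconditional branch as in train_step_cfg:

import torch

def guided_noise_batched(model, x, t_batch, text_embedding, guidance_scale):
    """Compute the CFG noise prediction with a single forward pass per timestep."""
    null_embedding = torch.zeros_like(text_embedding)  # assumption: zeros represent "no text"

    x_in = torch.cat([x, x], dim=0)                              # (2B, C, H, W)
    t_in = torch.cat([t_batch, t_batch], dim=0)                  # (2B,)
    ctx_in = torch.cat([null_embedding, text_embedding], dim=0)  # (2B, 512)

    noise_uncond, noise_cond = model(x_in, t_in, ctx_in).chunk(2, dim=0)
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)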

Extensions and Variations

Multi-Conditional Guidance

Combine multiple conditions (text + style + layout):

\epsilon = \epsilon_{\text{uncond}} + s_1 \cdot (\epsilon_{\text{text}} - \epsilon_{\text{uncond}}) + s_2 \cdot (\epsilon_{\text{style}} - \epsilon_{\text{uncond}})

Adaptive Guidance

Vary guidance scale across timesteps:

  • Early steps (high noise): lower guidance for structure
  • Late steps (low noise): higher guidance for details
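
There is no single standard schedule; one simple, hypothetical choice is a linear ramp from a low scale on the noisiest steps to a higher scale on the final steps:

def adaptive_guidance_scale(t, T, s_min=3.0, s_max=10.0):
    """Linearly ramp guidance from s_min (early, high-noise steps)
    to s_max (late, low-noise steps). t counts down from T-1 to 0."""
    progress = 1.0 - t / (T - 1)  # 0 at the noisiest step, 1 at the last step
    return s_min + (s_max - s_min) * progress

# Inside the sampling loop, replace the fixed scale with, e.g.:
#   scale = adaptive_guidance_scale(t, noise_schedule.T)
#   noise_pred = noise_uncond + scale * (noise_cond - noise_uncond)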

Compositional Generation

Guide toward multiple prompts simultaneously:

# "Cat" AND "Wizard hat" AND "Digital art style" noise = uncond + s1*(cat - uncond) + s2*(wizard - uncond) + s3*(style - uncond)

Key Papers

  • Ho & Salimans (2021): “Classifier-Free Diffusion Guidance” - Original technique
  • Nichol et al. (2021): “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” - First major application
  • Ramesh et al. (2022): “Hierarchical Text-Conditional Image Generation with CLIP Latents” - DALL-E 2
  • Rombach et al. (2022): “High-Resolution Image Synthesis with Latent Diffusion Models” - Stable Diffusion
