Classifier-Free Guidance
Classifier-free guidance (CFG) is the key technique behind high-quality text-to-image generation in DALL-E 2, Stable Diffusion, and Midjourney. It solves the central challenge: making diffusion models follow text prompts strongly without sacrificing generation quality.
The Goal
Generate images conditioned on text prompts:
Prompt: "A cat wearing a wizard hat, digital art"
→ Diffusion model generates the corresponding image

Challenge: Simple conditioning (just adding text embeddings) produces weak adherence to prompts and lower quality than unconditional generation.
Text Encoding with CLIP
From the CLIP paper, we learned that CLIP creates a shared embedding space for text and images. We use CLIP’s text encoder to get semantic embeddings:
```python
import clip

# Load CLIP model
clip_model, preprocess = clip.load("ViT-B/32")

# Encode text
text = "A cat wearing a wizard hat, digital art"
text_tokens = clip.tokenize([text])
text_embedding = clip_model.encode_text(text_tokens)  # Shape: (1, 512)
```
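Depending on how the downstream diffusion model was trained to consume these features, the text embedding is often L2-normalized first; this step is an assumption here, not something every pipeline does:

```python
# Optional (assumption): L2-normalize the CLIP text features, as is common
# when the conditioning model expects unit-norm embeddings
text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
```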
Conditional Diffusion Model
Modify the U-Net to accept text conditioning:
```python
import torch
import torch.nn as nn

class ConditionalDiffusionModel(nn.Module):
    def __init__(self, text_embed_dim=512):
        super().__init__()
        self.unet = UNet(
            in_channels=3,
            out_channels=3,
            context_dim=text_embed_dim,      # Add text context
            num_res_blocks=2,
            attention_resolutions=[16, 8],
        )

    def forward(self, x, t, text_embedding=None):
        """
        Predict noise conditioned on text

        Args:
            x: noisy image (B, C, H, W)
            t: timestep (B,)
            text_embedding: CLIP text embedding (B, 512) or None
        Returns:
            predicted_noise: (B, C, H, W)
        """
        return self.unet(x, t, context=text_embedding)
```
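A quick, illustrative smoke test of this interface (the `UNet` above and the shapes below are assumptions, not a reference implementation):

```python
import torch

# Hypothetical usage check: verifies only that shapes line up
model = ConditionalDiffusionModel(text_embed_dim=512)
x = torch.randn(4, 3, 64, 64)              # batch of noisy images (B, C, H, W)
t = torch.randint(0, 1000, (4,))           # one timestep per sample
text_emb = torch.randn(4, 512)             # stand-in for CLIP text embeddings
predicted_noise = model(x, t, text_embedding=text_emb)
assert predicted_noise.shape == x.shape    # noise prediction matches the image shape
```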
Cross-Attention for Conditioning
Inside the U-Net, use cross-attention to inject text information:
```python
import math

class CrossAttentionBlock(nn.Module):
    def __init__(self, dim, context_dim):
        super().__init__()
        self.dim = dim
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(context_dim, dim)
        self.to_v = nn.Linear(context_dim, dim)

    def forward(self, x, context):
        """
        x: image features (B, H*W, dim)
        context: text embedding (B, seq_len, context_dim)
        """
        q = self.to_q(x)        # Query from image
        k = self.to_k(context)  # Key from text
        v = self.to_v(context)  # Value from text

        # Attention: image features attend to text
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.dim), dim=-1)
        out = attn @ v
        return out
```
How it works:
- Image features act as queries
- Text embedding provides keys and values
- Attention allows each pixel to “look at” relevant text concepts
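As an illustrative shape check (the sizes below are made up; 77 tokens matches CLIP's text encoder length):

```python
import torch

# Each of the 16*16 spatial locations attends over the text tokens
block = CrossAttentionBlock(dim=320, context_dim=512)
image_feats = torch.randn(2, 16 * 16, 320)   # (B, H*W, dim)
text_context = torch.randn(2, 77, 512)       # (B, seq_len, context_dim)
out = block(image_feats, text_context)
print(out.shape)  # torch.Size([2, 256, 320])
```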
The Problem with Simple Conditioning
Training with just text embeddings produces:
- ❌ Weak adherence to prompts
- ❌ Lower quality than unconditional generation
- ❌ Trade-off between following prompt and image quality
Why? The model learns to generate good images OR follow prompts, but not both strongly.
The Classifier-Free Guidance Solution
Key idea: Train the model to work both with and without text conditioning, then interpolate during sampling to amplify the conditioning effect.
Training with Classifier-Free Guidance
```python
import torch.nn.functional as F

def train_step_cfg(model, x0, text_embedding, noise_schedule, optimizer, drop_prob=0.1):
    """
    Train with classifier-free guidance

    Args:
        drop_prob: probability of dropping conditioning (e.g., 0.1 = 10%)
    """
    batch_size = x0.shape[0]
    device = x0.device

    # Randomly drop conditioning for some samples
    mask = torch.rand(batch_size, device=device) < drop_prob
    text_embedding_masked = text_embedding.clone()
    text_embedding_masked[mask] = 0  # Zero out for unconditional

    # Standard diffusion training
    t = torch.randint(0, noise_schedule.T, (batch_size,), device=device)
    noise = torch.randn_like(x0)
    alpha_bar_t = noise_schedule.alpha_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1 - alpha_bar_t) * noise

    # Predict noise with masked conditioning
    predicted_noise = model(x_t, t, text_embedding_masked)
    loss = F.mse_loss(predicted_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
What this does:
- 90% of the time: train with text conditioning
- 10% of the time: train without conditioning (unconditional)
- Model learns both conditional and unconditional generation simultaneously
The Guidance Formula
During generation, combine conditional and unconditional predictions:

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + s \cdot \big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big)$$

where:
- $c$ is the text conditioning (CLIP embedding)
- $\varnothing$ is unconditional (no text)
- $s$ is the guidance scale (typically 7.5)

In words:
- Predict noise with text: $\epsilon_\theta(x_t, c)$
- Predict noise without text: $\epsilon_\theta(x_t, \varnothing)$
- Amplify the difference: $\epsilon_\theta(x_t, \varnothing) + s \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))$

Intuition: The unconditional model generates "average" images. The conditional model shifts toward the text prompt. By amplifying this shift (with $s > 1$), we get much stronger conditioning.
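A tiny sanity check of the formula on toy numbers (nothing here is model-specific):

```python
import torch

noise_uncond = torch.tensor([0.10, 0.20])  # stand-in for the unconditional prediction
noise_cond = torch.tensor([0.30, 0.00])    # stand-in for the conditional prediction

def guide(s):
    return noise_uncond + s * (noise_cond - noise_uncond)

print(guide(1.0))  # tensor([0.3000, 0.0000]) -> s=1 just returns the conditional prediction
print(guide(7.5))  # tensor([1.6000, -1.3000]) -> the conditional shift is amplified 7.5x
```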
Sampling with Classifier-Free Guidance
```python
@torch.no_grad()
def sample_with_cfg(model, text_embedding, shape, steps=50, guidance_scale=7.5):
    """
    Sample with classifier-free guidance

    Args:
        guidance_scale: strength of conditioning
            1.0 = no guidance (conditional only)
            7.5 = strong guidance (standard)
            15+ = very strong (may reduce quality)
    """
    model.eval()
    x = torch.randn(shape)
    timesteps = create_timestep_schedule(steps)

    for i in range(len(timesteps) - 1):
        t = timesteps[i]
        t_prev = timesteps[i + 1]
        t_batch = torch.full((shape[0],), t)

        # Unconditional prediction: use the same "null" conditioning as training
        # (a zeroed embedding), so the two branches match
        null_embedding = torch.zeros_like(text_embedding)
        noise_uncond = model(x, t_batch, text_embedding=null_embedding)

        # Conditional prediction (with text)
        noise_cond = model(x, t_batch, text_embedding=text_embedding)

        # Classifier-free guidance
        noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

        # DDIM step with guided noise prediction
        x = ddim_step(x, noise_pred, t, t_prev)

    return x
```
Guidance Scale: The Key Hyperparameter
The guidance scale controls the trade-off between prompt adherence and sample quality:
| Guidance Scale | Effect | Use Case |
|---|---|---|
| s = 1.0 | No guidance, just conditional | Maximum diversity, weak prompt adherence |
| s = 3-5 | Mild guidance, natural looking | Subtle prompts, artistic freedom |
| s = 7.5 | Standard, good balance | Default for most prompts |
| s = 10-15 | Strong adherence to prompt | Detailed specifications |
| s = 20+ | Very strong, may add artifacts | Forcing specific attributes |
```python
# Generate with different guidance scales
scales = [1.0, 5.0, 7.5, 15.0]
prompt = "A cat wearing a wizard hat"

for scale in scales:
    image = sample_with_cfg(
        model,
        encode_text(prompt),
        (1, 3, 256, 256),
        guidance_scale=scale
    )
    save_image(image, f"cat_wizard_scale_{scale}.png")
```
Practical recommendation: Start with guidance scale = 7.5 for most prompts. Increase for more prompt adherence, decrease for more natural/diverse images.
Negative Prompts
A powerful extension: guide generation away from unwanted concepts.
```python
@torch.no_grad()
def sample_with_negative_prompt(
    model,
    positive_text,
    negative_text,
    shape,
    steps=50,
    guidance_scale=7.5
):
    """
    Generate with negative prompts

    Args:
        positive_text: what you want ("beautiful landscape")
        negative_text: what you don't want ("blurry, low quality")
    """
    positive_emb = encode_text(positive_text)
    negative_emb = encode_text(negative_text)

    x = torch.randn(shape)
    timesteps = create_timestep_schedule(steps)

    for i in range(len(timesteps) - 1):
        t = timesteps[i]
        t_prev = timesteps[i + 1]
        t_batch = torch.full((shape[0],), t)

        # Negative prompt as "unconditional"
        noise_negative = model(x, t_batch, text_embedding=negative_emb)

        # Positive prompt as "conditional"
        noise_positive = model(x, t_batch, text_embedding=positive_emb)

        # Guidance away from negative, toward positive
        noise_pred = noise_negative + guidance_scale * (noise_positive - noise_negative)

        x = ddim_step(x, noise_pred, t, t_prev)

    return x
```
Example usage:
positive = "A beautiful sunset over mountains, highly detailed, 8k"
negative = "blurry, low quality, distorted, ugly, watermark"
image = sample_with_negative_prompt(model, positive, negative, (1, 3, 512, 512))Negative prompts dramatically improve quality in practice and are standard in all modern text-to-image systems.
Why Classifier-Free Guidance Works
Intuition
- The unconditional model generates “average” images
- The conditional model shifts toward the text prompt
- The difference is the “direction” toward the prompt
- By amplifying this with $s > 1$, we get much stronger conditioning
Mathematical Insight
Classifier-free guidance is equivalent to adding the gradient of a classifier $\nabla_{x_t} \log p(c \mid x_t)$ to the score, but we don't need an actual classifier (hence "classifier-free"). The model implicitly learns classification through conditioning.
Formally, sampling with guidance scale $s$ approximates sampling from the sharpened distribution

$$\tilde{p}_\theta(x_t \mid c) \propto p_\theta(x_t)\, p_\theta(c \mid x_t)^{s}$$
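A short derivation sketch (standard score-based identities; $\sigma_t$ is the noise level at step $t$ and $\varnothing$ the null conditioning):

$$\nabla_{x_t} \log p(x_t \mid c) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(c \mid x_t)$$

Because the noise prediction is a scaled score, $\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \approx -\sigma_t \nabla_{x_t} \log p(c \mid x_t)$, and therefore

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + s\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big) \approx \epsilon_\theta(x_t, \varnothing) - s\,\sigma_t \nabla_{x_t} \log p(c \mid x_t)$$

That is, guidance behaves like adding a classifier gradient with weight $s$, without ever training a classifier on noisy images.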
Comparison to Classifier Guidance
| Aspect | Classifier Guidance | Classifier-Free Guidance |
|---|---|---|
| Requires separate classifier | Yes (must train classifier on noisy images) | No |
| Training complexity | Two models | One model |
| Sample quality | Good | Better |
| Flexibility | Limited | High (negative prompts, etc.) |
| Current use | Obsolete | Industry standard |
All modern text-to-image systems (DALL-E 2, Stable Diffusion, Midjourney, Imagen) use classifier-free guidance.
Healthcare Application: Conditional Synthetic Data
Generate synthetic medical data with specific characteristics:
Medical Image Generation
```python
# Chest X-ray with specific pathology
positive_prompt = "chest X-ray showing pneumonia, clear infiltrates, medical imaging"
negative_prompt = "healthy lungs, normal, no pathology, artifacts"

synthetic_xray = sample_with_negative_prompt(
    medical_diffusion_model,
    positive_prompt,
    negative_prompt,
    (1, 1, 512, 512),
    guidance_scale=10.0  # Strong adherence to pathology
)
```
Rare Condition Synthesis
```python
# Generate synthetic EHR data for rare disease
condition_prompt = "patient trajectory with diabetic ketoacidosis and sepsis, ICU admission"
normal_prompt = "healthy patient, normal vitals, stable condition"

synthetic_trajectory = generate_ehr_sequence(
    ehr_diffusion_model,
    condition_prompt,
    normal_prompt,
    guidance_scale=12.0  # Very strong adherence to rare condition
)
```
Applications:
- Data augmentation for rare diseases
- Privacy-preserving synthetic datasets
- Testing clinical decision support systems
- Training models on balanced datasets
Practical Tips
- Default guidance scale: Start with 7.5
- Use negative prompts: Dramatically improves quality
- Common negative: “blurry, low quality, distorted, ugly, watermark, text”
- Detailed prompts work better:
- ❌ Weak: “A cat”
- ✅ Better: “A fluffy orange cat sitting on a windowsill, digital art, highly detailed, trending on artstation”
- Experiment with scales: Different prompts need different guidance strengths
- Abstract concepts: lower guidance (5-7)
- Specific objects: higher guidance (10-15)
- Trade-offs: Higher guidance → better prompt adherence but possible artifacts
Computational Cost
Classifier-free guidance requires two forward passes per timestep:
- One unconditional pass
- One conditional pass
This doubles the computational cost compared to simple conditional generation, but the quality improvement is worth it.
Optimization: Some implementations batch both passes together for efficiency.
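A minimal sketch of that batching trick, assuming the same model interface as the sampling code above (the null conditioning is a zeroed embedding, matching the training setup):

```python
import torch

def guided_noise_batched(model, x, t_batch, text_embedding, guidance_scale):
    # Stack [unconditional, conditional] so a single forward pass covers both branches
    null_embedding = torch.zeros_like(text_embedding)
    x_in = torch.cat([x, x], dim=0)
    t_in = torch.cat([t_batch, t_batch], dim=0)
    context = torch.cat([null_embedding, text_embedding], dim=0)

    noise = model(x_in, t_in, text_embedding=context)
    noise_uncond, noise_cond = noise.chunk(2, dim=0)

    # Same guidance formula as before, at the cost of one (larger) batch
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```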
Extensions and Variations
Multi-Conditional Guidance
Combine multiple conditions (text + style + layout):
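The source does not include an example here, so the following is only an illustrative sketch: each condition contributes its own guided direction with its own scale (`style_emb`, `s_text`, and `s_style` are hypothetical):

```python
# Hypothetical multi-conditional guidance: one guided direction per condition
noise_uncond = model(x, t_batch, text_embedding=null_embedding)
noise_text = model(x, t_batch, text_embedding=text_emb)     # prompt condition
noise_style = model(x, t_batch, text_embedding=style_emb)   # style condition

noise_pred = (
    noise_uncond
    + s_text * (noise_text - noise_uncond)
    + s_style * (noise_style - noise_uncond)
)
```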
Adaptive Guidance
Vary the guidance scale across timesteps (see the sketch after this list):
- Early steps (high noise): lower guidance for structure
- Late steps (low noise): higher guidance for details
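One way to implement this, sketched with an illustrative linear ramp (the endpoints are arbitrary, not from the source):

```python
def adaptive_guidance_scale(step, total_steps, low=3.0, high=10.0):
    # Early steps (high noise): low guidance, let global structure form
    # Late steps (low noise): high guidance, push details toward the prompt
    progress = step / max(total_steps - 1, 1)
    return low + progress * (high - low)
```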
Compositional Generation
Guide toward multiple prompts simultaneously:
# "Cat" AND "Wizard hat" AND "Digital art style"
noise = uncond + s1*(cat - uncond) + s2*(wizard - uncond) + s3*(style - uncond)Related Concepts
- CLIP Paper - Text encoder for conditioning
- DDPM Paper - Base diffusion algorithm
- DDIM Paper - Fast sampling (used with CFG)
- Multimodal Foundations - Cross-modal alignment
Key Papers
- Ho & Salimans (2021): “Classifier-Free Diffusion Guidance” - Original technique
- Nichol et al. (2021): “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” - First major application
- Ramesh et al. (2022): “Hierarchical Text-Conditional Image Generation with CLIP Latents” - DALL-E 2
- Rombach et al. (2022): “High-Resolution Image Synthesis with Latent Diffusion Models” - Stable Diffusion
Learning Resources
Tutorials
- Hugging Face: Guidance Scale - Practical guide
- Stable Diffusion Guide - Using negative prompts
Paper Explanations
- AI Coffee Break: Classifier-Free Guidance - Video explanation
- Lilian Weng: Diffusion Models - Mathematical treatment
Implementation
- Hugging Face Diffusers - Production code
- Stable Diffusion - Reference implementation