
Classifier-Free Guidance

Classifier-free guidance (CFG) is the breakthrough technique that enables high-quality text-to-image generation in DALL-E, Stable Diffusion, and Midjourney. It solves the key challenge: how to make diffusion models strongly follow text prompts without sacrificing generation quality.

The Goal

Generate images conditioned on text prompts:

Prompt: "A cat wearing a wizard hat, digital art" → Diffusion model generates corresponding image

Challenge: Simple conditioning (just adding text embeddings) produces weak adherence to prompts and lower quality than unconditional generation.

Text Encoding with CLIP

From the CLIP paper, we learned that CLIP creates a shared embedding space for text and images. We use CLIP’s text encoder to get semantic embeddings:

import clip

# Load CLIP model
clip_model, preprocess = clip.load("ViT-B/32")

# Encode text
text = "A cat wearing a wizard hat, digital art"
text_tokens = clip.tokenize([text])
text_embedding = clip_model.encode_text(text_tokens)  # Shape: (1, 512)

Conditional Diffusion Model

Modify the U-Net to accept text conditioning:

class ConditionalDiffusionModel(nn.Module):
    def __init__(self, text_embed_dim=512):
        super().__init__()
        self.unet = UNet(
            in_channels=3,
            out_channels=3,
            context_dim=text_embed_dim,  # Add text context
            num_res_blocks=2,
            attention_resolutions=[16, 8],
        )

    def forward(self, x, t, text_embedding=None):
        """
        Predict noise conditioned on text

        Args:
            x: noisy image (B, C, H, W)
            t: timestep (B,)
            text_embedding: CLIP text embedding (B, 512) or None

        Returns:
            predicted_noise: (B, C, H, W)
        """
        return self.unet(x, t, context=text_embedding)

Cross-Attention for Conditioning

Inside the U-Net, use cross-attention to inject text information:

import math

import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    def __init__(self, dim, context_dim):
        super().__init__()
        self.scale = math.sqrt(dim)  # store the scaling factor for the attention logits
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(context_dim, dim)
        self.to_v = nn.Linear(context_dim, dim)

    def forward(self, x, context):
        """
        x: image features (B, H*W, dim)
        context: text embedding (B, seq_len, context_dim)
        """
        q = self.to_q(x)        # Query from image
        k = self.to_k(context)  # Key from text
        v = self.to_v(context)  # Value from text

        # Attention: image features attend to text
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        out = attn @ v
        return out

How it works:

  1. Image features act as queries
  2. Text embedding provides keys and values
  3. Attention allows each pixel to “look at” relevant text concepts
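
As a quick shape check, here is a small usage sketch of the CrossAttentionBlock above (the sizes are illustrative, not from any particular model):

import torch

# Illustrative sizes: a 16x16 feature map with 320 channels attending to 77 CLIP tokens
block = CrossAttentionBlock(dim=320, context_dim=512)

image_features = torch.randn(4, 16 * 16, 320)  # (B, H*W, dim)
text_context = torch.randn(4, 77, 512)         # (B, seq_len, context_dim)

out = block(image_features, text_context)
print(out.shape)  # torch.Size([4, 256, 320]) -- one fused feature per spatial location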

The Problem with Simple Conditioning

Training with just text embeddings produces:

  • ❌ Weak adherence to prompts
  • ❌ Lower quality than unconditional generation
  • ❌ Trade-off between following prompt and image quality

Why? The model learns to generate good images OR follow prompts, but not both strongly.

The Classifier-Free Guidance Solution

Key idea: Train the model to work both with and without text conditioning, then interpolate during sampling to amplify the conditioning effect.

Training with Classifier-Free Guidance

import torch
import torch.nn.functional as F


def train_step_cfg(model, x0, text_embedding, noise_schedule, optimizer, drop_prob=0.1):
    """
    Train with classifier-free guidance

    Args:
        drop_prob: probability of dropping conditioning (e.g., 0.1 = 10%)
    """
    batch_size = x0.shape[0]
    device = x0.device

    # Randomly drop conditioning for some samples
    mask = torch.rand(batch_size, device=device) < drop_prob
    text_embedding_masked = text_embedding.clone()
    text_embedding_masked[mask] = 0  # Zero out for unconditional

    # Standard diffusion training
    t = torch.randint(0, noise_schedule.T, (batch_size,), device=device)
    noise = torch.randn_like(x0)
    alpha_bar_t = noise_schedule.alpha_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1 - alpha_bar_t) * noise

    # Predict noise with masked conditioning
    predicted_noise = model(x_t, t, text_embedding_masked)
    loss = F.mse_loss(predicted_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()

What this does:

  • 90% of the time: train with text conditioning
  • 10% of the time: train without conditioning (unconditional)
  • Model learns both conditional and unconditional generation simultaneously
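
For context, a minimal training loop around train_step_cfg might look like the sketch below; dataloader, encode_text, and noise_schedule are assumed to exist (they are not defined in this section), and the hyperparameters are illustrative.

import torch

model = ConditionalDiffusionModel(text_embed_dim=512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
num_epochs = 10  # illustrative

for epoch in range(num_epochs):
    for images, captions in dataloader:          # assumed to yield (images, list of captions)
        text_embedding = encode_text(captions)   # CLIP text embeddings, (B, 512)
        loss = train_step_cfg(
            model, images, text_embedding,
            noise_schedule, optimizer,
            drop_prob=0.1,                        # 10% of samples trained unconditionally
        )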

The Guidance Formula

During generation, combine conditional and unconditional predictions:

\tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + s \cdot \big( \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset) \big)

where:

  • c is the text conditioning (CLIP embedding)
  • ∅ denotes the unconditional case (no text)
  • s is the guidance scale (typically 7.5)

In words:

  1. Predict noise with text: ε_cond
  2. Predict noise without text: ε_uncond
  3. Amplify the difference: ε_final = ε_uncond + s · (ε_cond - ε_uncond)

Intuition: The unconditional model generates “average” images. The conditional model shifts toward the text prompt. By amplifying this shift (with s > 1), we get much stronger conditioning.
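
A toy numeric example of the amplification (made-up numbers, with one scalar standing in for a whole noise tensor): a conditional/unconditional difference of 0.3 becomes a shift of 2.25 at the default scale.

import torch

eps_uncond = torch.tensor([0.20])  # "average image" prediction
eps_cond = torch.tensor([0.50])    # prediction shifted toward the prompt
s = 7.5

eps_final = eps_uncond + s * (eps_cond - eps_uncond)
print(eps_final)  # tensor([2.4500]) -- the 0.30 shift is amplified 7.5x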

Sampling with Classifier-Free Guidance

@torch.no_grad()
def sample_with_cfg(model, text_embedding, shape, steps=50, guidance_scale=7.5):
    """
    Sample with classifier-free guidance

    Args:
        guidance_scale: strength of conditioning
            1.0 = no guidance (conditional only)
            7.5 = strong guidance (standard)
            15+ = very strong (may reduce quality)
    """
    model.eval()
    x = torch.randn(shape)
    timesteps = create_timestep_schedule(steps)

    for i in range(len(timesteps) - 1):
        t = timesteps[i]
        t_prev = timesteps[i + 1]
        t_batch = torch.full((shape[0],), t)

        # Unconditional prediction (no text)
        noise_uncond = model(x, t_batch, text_embedding=None)

        # Conditional prediction (with text)
        noise_cond = model(x, t_batch, text_embedding=text_embedding)

        # Classifier-free guidance
        noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

        # DDIM step with guided noise prediction
        x = ddim_step(x, noise_pred, t, t_prev)

    return x

Guidance Scale: The Key Hyperparameter

The guidance scale s controls the trade-off between prompt adherence and sample quality:

| Guidance Scale | Effect | Use Case |
| --- | --- | --- |
| s = 1.0 | No guidance (conditional prediction only) | Maximum diversity, weak prompt adherence |
| s = 3-5 | Mild guidance, natural looking | Subtle prompts, artistic freedom |
| s = 7.5 | Standard, good balance | Default for most prompts |
| s = 10-15 | Strong adherence to prompt | Detailed specifications |
| s = 20+ | Very strong, may add artifacts | Forcing specific attributes |

# Generate with different guidance scales
scales = [1.0, 5.0, 7.5, 15.0]
prompt = "A cat wearing a wizard hat"

for scale in scales:
    image = sample_with_cfg(
        model,
        encode_text(prompt),
        (1, 3, 256, 256),
        guidance_scale=scale,
    )
    save_image(image, f"cat_wizard_scale_{scale}.png")

Practical recommendation: Start with guidance scale = 7.5 for most prompts. Increase for more prompt adherence, decrease for more natural/diverse images.

Negative Prompts

A powerful extension: guide generation away from unwanted concepts.

@torch.no_grad()
def sample_with_negative_prompt(
    model, positive_text, negative_text, shape, steps=50, guidance_scale=7.5
):
    """
    Generate with negative prompts

    Args:
        positive_text: what you want ("beautiful landscape")
        negative_text: what you don't want ("blurry, low quality")
    """
    positive_emb = encode_text(positive_text)
    negative_emb = encode_text(negative_text)

    x = torch.randn(shape)
    timesteps = create_timestep_schedule(steps)

    for i in range(len(timesteps) - 1):
        t = timesteps[i]
        t_prev = timesteps[i + 1]
        t_batch = torch.full((shape[0],), t)

        # Negative prompt as "unconditional"
        noise_negative = model(x, t_batch, text_embedding=negative_emb)

        # Positive prompt as "conditional"
        noise_positive = model(x, t_batch, text_embedding=positive_emb)

        # Guidance away from negative, toward positive
        noise_pred = noise_negative + guidance_scale * (noise_positive - noise_negative)

        x = ddim_step(x, noise_pred, t, t_prev)

    return x

Example usage:

positive = "A beautiful sunset over mountains, highly detailed, 8k" negative = "blurry, low quality, distorted, ugly, watermark" image = sample_with_negative_prompt(model, positive, negative, (1, 3, 512, 512))

Negative prompts dramatically improve quality in practice and are standard in all modern text-to-image systems.

Why Classifier-Free Guidance Works

Intuition

  • The unconditional model ε_θ(x_t, t, ∅) generates “average” images
  • The conditional model ε_θ(x_t, t, c) shifts toward the text prompt
  • The difference (ε_cond - ε_uncond) is the “direction” toward the prompt
  • Amplifying this difference with s > 1 gives much stronger conditioning

Mathematical Insight

Classifier-free guidance is equivalent to adding the gradient of a classifier, ∇_x log p(c | x), but we don’t need an actual classifier (hence “classifier-free”). The model implicitly learns this classification signal through conditioning.

Formally, it approximates:

\nabla_x \log p(x_t \mid c) \approx \nabla_x \log p(x_t) + s \cdot \nabla_x \log p(c \mid x_t)
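
One way to see why, sketched here under the standard assumption that the noise prediction is a scaled score, ε_θ(x_t, t) ≈ -σ_t ∇_x log p(x_t) with σ_t = sqrt(1 - ᾱ_t): Bayes’ rule rewrites the classifier gradient as a difference of scores, and substituting the noise-prediction parameterization recovers the guidance formula.

% Bayes' rule: the classifier gradient is a difference of scores
\nabla_x \log p(c \mid x_t) = \nabla_x \log p(x_t \mid c) - \nabla_x \log p(x_t)

% Substituting \epsilon_\theta \approx -\sigma_t \nabla_x \log p into the guided score
\tilde{\epsilon}_\theta(x_t, t, c)
  = -\sigma_t \left( \nabla_x \log p(x_t) + s \cdot \nabla_x \log p(c \mid x_t) \right)
  = \epsilon_\theta(x_t, t, \emptyset) + s \cdot \left( \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset) \right)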

Comparison to Classifier Guidance

| Aspect | Classifier Guidance | Classifier-Free Guidance |
| --- | --- | --- |
| Requires separate classifier | Yes (must train a classifier on noisy images) | No |
| Training complexity | Two models | One model |
| Sample quality | Good | Better |
| Flexibility | Limited | High (negative prompts, etc.) |
| Current use | Obsolete | Industry standard |

All modern text-to-image systems (DALL-E 2, Stable Diffusion, Midjourney, Imagen) use classifier-free guidance.

Healthcare Application: Conditional Synthetic Data

Generate synthetic medical data with specific characteristics:

Medical Image Generation

# Chest X-ray with specific pathology
positive_prompt = "chest X-ray showing pneumonia, clear infiltrates, medical imaging"
negative_prompt = "healthy lungs, normal, no pathology, artifacts"

synthetic_xray = sample_with_negative_prompt(
    medical_diffusion_model,
    positive_prompt,
    negative_prompt,
    (1, 1, 512, 512),
    guidance_scale=10.0,  # Strong adherence to pathology
)

Rare Condition Synthesis

# Generate synthetic EHR data for a rare disease
condition_prompt = "patient trajectory with diabetic ketoacidosis and sepsis, ICU admission"
normal_prompt = "healthy patient, normal vitals, stable condition"

synthetic_trajectory = generate_ehr_sequence(
    ehr_diffusion_model,
    condition_prompt,
    normal_prompt,
    guidance_scale=12.0,  # Very strong adherence to the rare condition
)

Applications:

  • Data augmentation for rare diseases
  • Privacy-preserving synthetic datasets
  • Testing clinical decision support systems
  • Training models on balanced datasets

Practical Tips

  1. Default guidance scale: Start with 7.5
  2. Use negative prompts: Dramatically improves quality
    • Common negative: “blurry, low quality, distorted, ugly, watermark, text”
  3. Detailed prompts work better:
    • ❌ Weak: “A cat”
    • ✅ Better: “A fluffy orange cat sitting on a windowsill, digital art, highly detailed, trending on artstation”
  4. Experiment with scales: Different prompts need different guidance strengths
    • Abstract concepts: lower guidance (5-7)
    • Specific objects: higher guidance (10-15)
  5. Trade-offs: Higher guidance → better prompt adherence but possible artifacts

Computational Cost

Classifier-free guidance requires two forward passes per timestep:

  • One unconditional pass
  • One conditional pass

This doubles the computational cost compared to simple conditional generation, but the quality improvement is worth it.

Optimization: Some implementations batch both passes together for efficiency.
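
In practice, many implementations (the diffusers pipelines, for example) concatenate the unconditional and conditional inputs into one batch so the U-Net runs once per timestep. A minimal sketch against the toy model above, assuming a zeroed embedding stands in for the unconditional branch as in train_step_cfg:

import torch

def guided_noise_batched(model, x, t_batch, text_embedding, guidance_scale):
    """Compute the CFG noise prediction with a single forward pass per timestep."""
    null_embedding = torch.zeros_like(text_embedding)  # assumption: zeros represent "no text"

    x_in = torch.cat([x, x], dim=0)                              # (2B, C, H, W)
    t_in = torch.cat([t_batch, t_batch], dim=0)                  # (2B,)
    ctx_in = torch.cat([null_embedding, text_embedding], dim=0)  # (2B, 512)

    noise_uncond, noise_cond = model(x_in, t_in, ctx_in).chunk(2, dim=0)
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)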

Extensions and Variations

Multi-Conditional Guidance

Combine multiple conditions (text + style + layout):

\epsilon = \epsilon_{\text{uncond}} + s_1 \cdot (\epsilon_{\text{text}} - \epsilon_{\text{uncond}}) + s_2 \cdot (\epsilon_{\text{style}} - \epsilon_{\text{uncond}})

Adaptive Guidance

Vary guidance scale across timesteps:

  • Early steps (high noise): lower guidance for structure
  • Late steps (low noise): higher guidance for details
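
There is no single standard schedule; one simple, hypothetical choice is a linear ramp from a low scale on the noisiest steps to a higher scale on the final steps:

def adaptive_guidance_scale(t, T, s_min=3.0, s_max=10.0):
    """Linearly ramp guidance from s_min (early, high-noise steps)
    to s_max (late, low-noise steps). t counts down from T-1 to 0."""
    progress = 1.0 - t / (T - 1)  # 0 at the noisiest step, 1 at the last step
    return s_min + (s_max - s_min) * progress

# Inside the sampling loop, replace the fixed scale with, e.g.:
#   scale = adaptive_guidance_scale(t, noise_schedule.T)
#   noise_pred = noise_uncond + scale * (noise_cond - noise_uncond)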

Compositional Generation

Guide toward multiple prompts simultaneously:

# "Cat" AND "Wizard hat" AND "Digital art style" noise = uncond + s1*(cat - uncond) + s2*(wizard - uncond) + s3*(style - uncond)

Key Papers

  • Ho & Salimans (2021): “Classifier-Free Diffusion Guidance” - Original technique
  • Nichol et al. (2021): “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” - First major application
  • Ramesh et al. (2022): “Hierarchical Text-Conditional Image Generation with CLIP Latents” - DALL-E 2
  • Rombach et al. (2022): “High-Resolution Image Synthesis with Latent Diffusion Models” - Stable Diffusion
