
DALL-E 2: Hierarchical Text-Conditional Image Generation

DALL-E 2 (Ramesh et al., 2022) is OpenAI's breakthrough text-to-image system and one of the models that sparked the generative AI wave of 2022. Unlike simple conditional diffusion, DALL-E 2 uses a two-stage pipeline that separates semantic understanding from image generation.

Core Innovation

Two-stage design: Separate “what to generate” (semantic understanding) from “how to generate” (image synthesis)

Text prompt → [Prior] → CLIP image embedding → [Decoder] → Generated image

The Two-Stage Pipeline

DALL-E 2 consists of two separate models working in sequence:

  1. Prior: Maps text to CLIP image embedding space
  2. Decoder: Generates images from CLIP embeddings via diffusion

Why Two Stages?

Advantages:

  • Semantic guidance: CLIP embeddings capture high-level semantics
  • Modular design: Can swap components independently
  • Flexibility: Supports both text-to-image and image-to-image
  • Better control: CLIP latent space is more interpretable than pixel space
  • Image variations: Can generate variations by perturbing embeddings

Stage 1: The Prior

Goal: Map text embeddings to CLIP image embedding space

Input: Text prompt (string)
Output: CLIP image embedding (512-dim vector)

Why a Prior?

CLIP aligns text and image embeddings in a shared space, but they’re not identical:

  • Text embedding: What CLIP extracts from text descriptions
  • Image embedding: What CLIP extracts from actual images

The prior learns to bridge this gap, predicting what image embedding would correspond to a given text.
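
The gap is easy to observe: even for a caption that correctly describes an image, the CLIP text embedding and the CLIP image embedding are far from identical. A minimal check, assuming the openai/CLIP pip package (`clip`) with the ViT-B/32 model (512-dim embeddings) and a placeholder image file:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image path; substitute any photo of a cat.
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)  # (1, 512)
    text_emb = model.encode_text(text)     # (1, 512)

# Normalize and compare: matched pairs score higher than mismatched ones,
# but the cosine similarity is well below 1 -- text and image embeddings
# occupy different regions of the shared space.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print((image_emb @ text_emb.T).item())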

Prior Architecture Options

The DALL-E 2 paper explores two prior architectures:

Option 1: Autoregressive Prior

A GPT-style transformer that generates CLIP image embeddings autoregressively:

import torch
import torch.nn as nn

class AutoregressivePrior(nn.Module):
    """
    Transformer that generates a CLIP image embedding autoregressively
    """
    def __init__(self, clip_embed_dim=512, num_layers=12):
        super().__init__()
        # GPT stands in for a decoder-only transformer backbone
        # (schematic placeholder, not a specific library class)
        self.transformer = GPT(
            vocab_size=0,  # Continuous tokens, no discrete vocabulary
            embed_dim=clip_embed_dim,
            num_layers=num_layers,
            num_heads=8,
        )

    def forward(self, text_embedding):
        """
        Generate CLIP image embedding from text embedding

        Args:
            text_embedding: CLIP text encoding (B, 512)
        Returns:
            image_embedding: CLIP image encoding (B, 512)
        """
        return self.transformer.generate(text_embedding)

Option 2: Diffusion Prior (Used in DALL-E 2)

Performs diffusion directly in CLIP embedding space:

class DiffusionPrior(nn.Module):
    """
    Diffusion model operating in CLIP embedding space
    """
    def __init__(self, clip_embed_dim=512):
        super().__init__()
        # Small transformer for denoising
        self.model = Transformer(
            dim=clip_embed_dim,
            num_layers=6,
            num_heads=8,
        )

    def forward(self, noisy_image_emb, t, text_emb):
        """
        Denoise CLIP image embedding conditioned on text

        Args:
            noisy_image_emb: noised CLIP image embedding (B, 512)
            t: diffusion timestep
            text_emb: CLIP text embedding (B, 512)
        Returns:
            predicted_noise: (B, 512)
        """
        return self.model(noisy_image_emb, t, context=text_emb)


@torch.no_grad()
def sample_prior(diffusion_prior, text_embedding, steps=64):
    """Generate CLIP image embedding from text"""
    # Start from noise in embedding space (same shape, device, dtype as the text embedding)
    image_emb = torch.randn_like(text_embedding)

    # Diffusion in embedding space (much smaller than pixel space!)
    for t in reversed(range(steps)):
        predicted_noise = diffusion_prior(image_emb, t, text_embedding)
        image_emb = ddpm_step(image_emb, predicted_noise, t)

    return image_emb

Key insight: The prior does diffusion in embedding space (512-dim vectors), not pixel space (256×256×3 images). Much more efficient!
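
The sample_prior loop above relies on a ddpm_step helper that the snippet leaves undefined. A minimal sketch of what such a step could look like, assuming standard DDPM ancestral sampling with an illustrative linear beta schedule (the real schedule is a tuned hyperparameter):

import torch

# Illustrative linear noise schedule matching the 64 prior sampling steps
STEPS = 64
betas = torch.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

def ddpm_step(x_t, predicted_noise, t):
    """One reverse-diffusion update: estimate x_{t-1} from x_t and the predicted noise."""
    alpha_t = alphas[t]
    alpha_bar_t = alphas_cumprod[t]
    # Posterior mean of x_{t-1} given the noise prediction
    mean = (x_t - (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_t)
    if t == 0:
        return mean
    # Add noise scaled by the posterior standard deviation (here sqrt(beta_t))
    return mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)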

Stage 2: The Decoder

Goal: Generate high-resolution image from CLIP image embedding

Input: CLIP image embedding (512-dim vector)
Output: RGB image (1024×1024×3), produced as a 64×64 image by the base decoder and refined to full resolution by two upsamplers (see below)

Decoder Architecture

Standard U-Net diffusion model conditioned on CLIP embeddings:

class DALLE2Decoder(nn.Module):
    """
    Diffusion decoder that generates images from CLIP embeddings
    """
    def __init__(self, clip_embed_dim=512):
        super().__init__()
        self.unet = UNet(
            in_channels=3,
            out_channels=3,
            context_dim=clip_embed_dim,  # Condition on CLIP embedding
            base_channels=256,
            channel_mult=[1, 2, 3, 4],
            attention_resolutions=[32, 16, 8],
        )

    def forward(self, noisy_image, t, clip_image_embedding):
        """
        Denoise image conditioned on CLIP image embedding

        Args:
            noisy_image: (B, 3, H, W)
            t: timestep
            clip_image_embedding: (B, 512)
        Returns:
            predicted_noise: (B, 3, H, W)
        """
        return self.unet(noisy_image, t, context=clip_image_embedding)
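
At sampling time, the complete pipeline below drives this decoder with classifier-free guidance via a sample_with_cfg helper that the snippets leave undefined. A minimal sketch, assuming the decoder was trained with random conditioning dropout so that a zeroed embedding acts as the unconditional branch (ddpm_step is the reverse-diffusion update sketched earlier):

@torch.no_grad()
def sample_with_cfg(decoder, clip_image_embedding, shape, steps=50, guidance_scale=4.0):
    """Sample an image with classifier-free guidance."""
    x = torch.randn(shape)
    # Zero embedding stands in for the unconditional case
    # (assumes conditioning dropout during training)
    null_embedding = torch.zeros_like(clip_image_embedding)
    for t in reversed(range(steps)):
        eps_cond = decoder(x, t, clip_image_embedding)
        eps_uncond = decoder(x, t, null_embedding)
        # Push the prediction away from unconditional toward conditional
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        x = ddpm_step(x, eps, t)
    return x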

Multi-Resolution Generation

DALL-E 2 generates images progressively through cascaded diffusion:

  1. 64×64 base: Fast generation, captures overall structure
  2. 256×256 upsampler: Adds details and coherence
  3. 1024×1024 upsampler: Final high-resolution output

Each upsampler is a separate diffusion model conditioned on the lower-resolution image.
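
A sketch of how such an upsampler stage could be wired, assuming the common approach of bilinearly upsampling the low-resolution image and concatenating it with the noisy high-resolution input along the channel dimension; the UpsamplerDiffusion and sample_upsampler names mirror the placeholders used in the pipeline code below:

import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsamplerDiffusion(nn.Module):
    """Diffusion model that refines a low-resolution image into a higher-resolution one."""
    def __init__(self, low_res, high_res, clip_embed_dim=512):
        super().__init__()
        self.high_res = high_res
        # 6 input channels: 3 for the noisy high-res image,
        # 3 for the upsampled low-res conditioning image
        self.unet = UNet(
            in_channels=6,
            out_channels=3,
            context_dim=clip_embed_dim,
        )

    def forward(self, noisy_high_res, t, low_res_image, clip_embedding):
        cond = F.interpolate(low_res_image, size=self.high_res, mode="bilinear")
        x = torch.cat([noisy_high_res, cond], dim=1)
        return self.unet(x, t, context=clip_embedding)


@torch.no_grad()
def sample_upsampler(upsampler, low_res_image, clip_embedding, steps=50):
    """Run reverse diffusion at the higher resolution, conditioned on the low-res image."""
    b = low_res_image.shape[0]
    x = torch.randn(b, 3, upsampler.high_res, upsampler.high_res)
    for t in reversed(range(steps)):
        eps = upsampler(x, t, low_res_image, clip_embedding)
        x = ddpm_step(x, eps, t)
    return x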

Complete DALL-E 2 Pipeline

class DALLE2(nn.Module):
    def __init__(self):
        super().__init__()
        # Frozen CLIP for encoding
        self.clip = load_clip_model()
        self.clip.eval()

        # Prior: text embedding → image embedding
        self.prior = DiffusionPrior(clip_embed_dim=512)

        # Decoder: image embedding → 64×64 image
        self.decoder_64 = DALLE2Decoder(clip_embed_dim=512)

        # Upsamplers for higher resolution
        self.upsampler_256 = UpsamplerDiffusion(64, 256)
        self.upsampler_1024 = UpsamplerDiffusion(256, 1024)

    @torch.no_grad()
    def generate(self, text_prompt, guidance_scale=4.0):
        """
        Generate image from text

        Args:
            text_prompt: string (e.g., "A cat wearing a wizard hat")
            guidance_scale: classifier-free guidance strength
        Returns:
            image: (1, 3, 1024, 1024)
        """
        # 1. Encode text with CLIP
        text_embedding = self.clip.encode_text(text_prompt)  # (1, 512)

        # 2. Prior: generate CLIP image embedding
        clip_image_embedding = sample_prior(
            self.prior,
            text_embedding,
            steps=64,  # Prior uses fewer steps
        )  # (1, 512)

        # 3. Decoder: generate 64×64 image
        image_64 = sample_with_cfg(
            self.decoder_64,
            clip_image_embedding,
            shape=(1, 3, 64, 64),
            steps=50,
            guidance_scale=guidance_scale,
        )

        # 4. Upsample to 256×256
        image_256 = sample_upsampler(
            self.upsampler_256,
            image_64,
            clip_image_embedding,
        )

        # 5. Upsample to 1024×1024
        image_1024 = sample_upsampler(
            self.upsampler_1024,
            image_256,
            clip_image_embedding,
        )

        return image_1024
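
None of these components ship with trained weights here, so the class is a shape-level illustration; assuming trained weights are loaded into each module, usage reduces to a single call:

dalle2 = DALLE2()  # assumes pretrained prior, decoder, and upsampler weights are loaded
image = dalle2.generate("A cat wearing a wizard hat", guidance_scale=4.0)
print(image.shape)  # expected: (1, 3, 1024, 1024)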

Image Variations and Editing

DALL-E 2’s two-stage design enables powerful capabilities beyond text-to-image:

Image Variations

Generate variations by perturbing CLIP embeddings:

@torch.no_grad()
def generate_variations(dalle2, input_image, num_variations=4):
    """Generate variations of an existing image"""
    # 1. Encode image with CLIP
    clip_image_embedding = dalle2.clip.encode_image(input_image)

    # 2. Add noise to embedding for diversity
    noisy_embeddings = clip_image_embedding + 0.1 * torch.randn(num_variations, 512)

    # 3. Decode each embedding to an image
    variations = []
    for emb in noisy_embeddings:
        # Keep a batch dimension of 1 for the decoder
        image = sample_with_cfg(dalle2.decoder_64, emb.unsqueeze(0), (1, 3, 64, 64))
        variations.append(image)
    return variations

Text-Guided Image Editing

Interpolate between image and text embeddings:

@torch.no_grad()
def text_guided_edit(dalle2, input_image, text_edit, alpha=0.5):
    """
    Edit image based on text

    Args:
        input_image: original image
        text_edit: editing instruction (e.g., "make it winter")
        alpha: interpolation weight (0 = original, 1 = full edit)
    """
    # Encode image
    image_emb = dalle2.clip.encode_image(input_image)

    # Encode text and generate corresponding image embedding
    text_emb = dalle2.clip.encode_text(text_edit)
    text_image_emb = sample_prior(dalle2.prior, text_emb)

    # Interpolate embeddings
    edited_emb = (1 - alpha) * image_emb + alpha * text_image_emb

    # Generate edited image
    edited_image = sample_with_cfg(dalle2.decoder_64, edited_emb, (1, 3, 64, 64))
    return edited_image

Training DALL-E 2

Training happens in three stages:

1. Train CLIP (or use pre-trained)

# CLIP is pre-trained on 400M image-text pairs
clip_model = load_pretrained_clip()  # Frozen during DALL-E 2 training

2. Train the Prior

# Train prior to map text embeddings to image embeddings
for text, image in dataset:
    text_emb = clip.encode_text(text)
    image_emb = clip.encode_image(image)  # Ground-truth target

    # Diffusion training in embedding space
    loss = train_diffusion_prior(prior, text_emb, target=image_emb)
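
Expanding the loop above into a single training step, a sketch of a standard noise-prediction objective in embedding space (the schedule and loss form are illustrative; the DALL-E 2 paper actually trains its prior to predict the unnoised embedding directly, which changes only the regression target):

import torch
import torch.nn.functional as F

NUM_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, NUM_STEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def prior_training_step(prior, clip, text, image, optimizer):
    """One gradient step: teach the prior to denoise CLIP image embeddings
    conditioned on the matching CLIP text embedding."""
    with torch.no_grad():
        text_emb = clip.encode_text(text)
        image_emb = clip.encode_image(image)  # target of the prior

    # Sample a random timestep per example and noise the target embedding
    t = torch.randint(0, NUM_STEPS, (image_emb.shape[0],))
    noise = torch.randn_like(image_emb)
    alpha_bar = alphas_cumprod[t].unsqueeze(-1)
    noisy_emb = torch.sqrt(alpha_bar) * image_emb + torch.sqrt(1 - alpha_bar) * noise

    # Predict the noise and regress it with MSE
    predicted_noise = prior(noisy_emb, t, text_emb)
    loss = F.mse_loss(predicted_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The decoder training step (next subsection) has the same skeleton, except that the pixel tensor is noised and the CLIP image embedding is used purely as conditioning.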

3. Train the Decoder

# Train decoder to generate images from CLIP embeddings
for image in dataset:
    image_emb = clip.encode_image(image)

    # Diffusion training in pixel space
    loss = train_diffusion_decoder(decoder, image, conditioning=image_emb)

Comparison to Stable Diffusion

Aspect        | DALL-E 2        | Stable Diffusion
Prior         | Yes (two-stage) | No (one-stage)
Text encoder  | CLIP            | CLIP
Latent space  | CLIP embeddings | VAE latents
Resolution    | 1024×1024       | 512×512 (v1), 1024×1024 (SDXL)
Open source   | No              | Yes
Efficiency    | Moderate        | High (latent diffusion)
Variations    | Native support  | Requires additional techniques

Stable Diffusion simplified the architecture:

  • No prior (text conditioning is injected directly into the U-Net via cross-attention)
  • Diffusion in VAE latent space (more efficient)
  • Open source and widely adopted
  • Single-stage pipeline is simpler

Key Results

Quantitative Performance

  • CLIP score: 0.63
  • Human evaluation: evaluators prefer DALL-E 2 samples to ground-truth images about 5% of the time
  • Prompt following: Strong adherence to complex multi-object prompts
  • Photorealism: Photorealistic images for many categories

Qualitative Capabilities

  • Complex scene composition with multiple objects
  • Style transfer (“in the style of Picasso”)
  • Unlikely combinations (“astronaut riding a horse”)
  • Variations maintain semantic content while changing details

Historical Impact

DALL-E 2 (April 2022) was the first text-to-image system whose results reached and stunned a broad, non-specialist audience:

  • Demonstrated diffusion + CLIP could generate photorealistic images from arbitrary text
  • Sparked the generative AI revolution (2022-present)
  • Inspired Stable Diffusion, Midjourney, and other systems
  • Showed that multimodal pre-training enables powerful generation

Limitations

  • Slow: Two-stage generation requires more compute
  • Closed source: Cannot replicate exactly (proprietary)
  • Complexity: More components to train and tune than one-stage systems
  • Resource intensive: Requires significant training compute
  • Text rendering: Struggles with generating readable text in images
  • Fine details: Sometimes misses small details in complex prompts

Modern Alternatives

While DALL-E 2 was groundbreaking, newer systems have emerged:

  • DALL-E 3 (2023): Improved prompt following with caption refinement
  • Stable Diffusion (2022): Open source latent diffusion, widely adopted
  • Midjourney (2022): Closed source, exceptional artistic quality
  • Imagen (2022): Google’s approach with T5 text encoder

Key Takeaways

  1. Two-stage pipeline: Prior (text → CLIP embedding) + Decoder (embedding → image)
  2. CLIP latent space: Semantic guidance through pre-trained CLIP embeddings
  3. Diffusion in embedding space: Prior operates on 512-dim vectors, not pixels
  4. Multi-resolution cascade: Progressive generation from 64×64 → 1024×1024
  5. Image variations: Natural support via CLIP embedding perturbation
  6. Historical significance: Sparked the generative AI revolution in 2022
  7. Trade-offs: Two stages add complexity but enable flexibility and control

Citation

@article{ramesh2022hierarchical,
  title={Hierarchical Text-Conditional Image Generation with CLIP Latents},
  author={Ramesh, Aditya and Dhariwal, Prafulla and Nichol, Alex and Chu, Casey and Chen, Mark},
  journal={arXiv preprint arXiv:2204.06125},
  year={2022}
}
