
DALL-E 2: Hierarchical Text-Conditional Image Generation

DALL-E 2 (Ramesh et al., 2022) is OpenAI's breakthrough text-to-image system and one of the models that sparked the generative AI wave of 2022. Unlike simple conditional diffusion, DALL-E 2 uses a two-stage pipeline that separates semantic understanding from image generation.

Core Innovation

Two-stage design: Separate “what to generate” (semantic understanding) from “how to generate” (image synthesis)

Text prompt → [Prior] → CLIP image embedding → [Decoder] → Generated image

The Two-Stage Pipeline

DALL-E 2 consists of two separate models working in sequence:

  1. Prior: Maps text to CLIP image embedding space
  2. Decoder: Generates images from CLIP embeddings via diffusion

Why Two Stages?

Advantages:

  • Semantic guidance: CLIP embeddings capture high-level semantics
  • Modular design: Can swap components independently
  • Flexibility: Supports both text-to-image and image-to-image
  • Better control: CLIP latent space is more interpretable than pixel space
  • Image variations: Can generate variations by perturbing embeddings

Stage 1: The Prior

Goal: Map text embeddings to CLIP image embedding space

Input: Text prompt (string)
Output: CLIP image embedding (512-dim vector)

Why a Prior?

CLIP aligns text and image embeddings in a shared space, but they’re not identical:

  • Text embedding: What CLIP extracts from text descriptions
  • Image embedding: What CLIP extracts from actual images

The prior learns to bridge this gap, predicting what image embedding would correspond to a given text.
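
The gap is easy to observe: even for a caption that correctly describes an image, the CLIP text embedding and the CLIP image embedding are far from identical. A minimal check, assuming the openai/CLIP pip package (`clip`) with the ViT-B/32 model (512-dim embeddings) and a placeholder image file:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image path; substitute any photo of a cat.
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)  # (1, 512)
    text_emb = model.encode_text(text)     # (1, 512)

# Normalize and compare: matched pairs score higher than mismatched ones,
# but the cosine similarity is well below 1 -- text and image embeddings
# occupy different regions of the shared space.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print((image_emb @ text_emb.T).item())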

Prior Architecture Options

The DALL-E 2 paper explores two prior architectures:

Option 1: Autoregressive Prior

A GPT-style transformer that generates CLIP image embeddings autoregressively:

import torch
import torch.nn as nn

class AutoregressivePrior(nn.Module):
    """
    Transformer that generates a CLIP image embedding autoregressively
    """
    def __init__(self, clip_embed_dim=512, num_layers=12):
        super().__init__()
        # GPT stands in for a decoder-only transformer backbone
        # (schematic placeholder, not a specific library class)
        self.transformer = GPT(
            vocab_size=0,  # Continuous tokens, no discrete vocabulary
            embed_dim=clip_embed_dim,
            num_layers=num_layers,
            num_heads=8,
        )

    def forward(self, text_embedding):
        """
        Generate CLIP image embedding from text embedding

        Args:
            text_embedding: CLIP text encoding (B, 512)
        Returns:
            image_embedding: CLIP image encoding (B, 512)
        """
        return self.transformer.generate(text_embedding)

Option 2: Diffusion Prior (Used in DALL-E 2)

Performs diffusion directly in CLIP embedding space:

class DiffusionPrior(nn.Module):
    """
    Diffusion model operating in CLIP embedding space
    """
    def __init__(self, clip_embed_dim=512):
        super().__init__()
        # Small transformer for denoising
        self.model = Transformer(
            dim=clip_embed_dim,
            num_layers=6,
            num_heads=8,
        )

    def forward(self, noisy_image_emb, t, text_emb):
        """
        Denoise CLIP image embedding conditioned on text

        Args:
            noisy_image_emb: noised CLIP image embedding (B, 512)
            t: diffusion timestep
            text_emb: CLIP text embedding (B, 512)
        Returns:
            predicted_noise: (B, 512)
        """
        return self.model(noisy_image_emb, t, context=text_emb)


@torch.no_grad()
def sample_prior(diffusion_prior, text_embedding, steps=64):
    """Generate CLIP image embedding from text"""
    # Start from noise in embedding space (same shape, device, dtype as the text embedding)
    image_emb = torch.randn_like(text_embedding)

    # Diffusion in embedding space (much smaller than pixel space!)
    for t in reversed(range(steps)):
        predicted_noise = diffusion_prior(image_emb, t, text_embedding)
        image_emb = ddpm_step(image_emb, predicted_noise, t)

    return image_emb

Key insight: The prior does diffusion in embedding space (512-dim vectors), not pixel space (256×256×3 images). Much more efficient!
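
The sample_prior loop above relies on a ddpm_step helper that the snippet leaves undefined. A minimal sketch of what such a step could look like, assuming standard DDPM ancestral sampling with an illustrative linear beta schedule (the real schedule is a tuned hyperparameter):

import torch

# Illustrative linear noise schedule matching the 64 prior sampling steps
STEPS = 64
betas = torch.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

def ddpm_step(x_t, predicted_noise, t):
    """One reverse-diffusion update: estimate x_{t-1} from x_t and the predicted noise."""
    alpha_t = alphas[t]
    alpha_bar_t = alphas_cumprod[t]
    # Posterior mean of x_{t-1} given the noise prediction
    mean = (x_t - (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_t)
    if t == 0:
        return mean
    # Add noise scaled by the posterior standard deviation (here sqrt(beta_t))
    return mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)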

Stage 2: The Decoder

Goal: Generate high-resolution image from CLIP image embedding

Input: CLIP image embedding (512-dim vector)
Output: RGB image (1024×1024×3), produced as a 64×64 image by the base decoder and refined to full resolution by two upsamplers (see below)

Decoder Architecture

Standard U-Net diffusion model conditioned on CLIP embeddings:

class DALLE2Decoder(nn.Module):
    """
    Diffusion decoder that generates images from CLIP embeddings
    """
    def __init__(self, clip_embed_dim=512):
        super().__init__()
        self.unet = UNet(
            in_channels=3,
            out_channels=3,
            context_dim=clip_embed_dim,  # Condition on CLIP embedding
            base_channels=256,
            channel_mult=[1, 2, 3, 4],
            attention_resolutions=[32, 16, 8],
        )

    def forward(self, noisy_image, t, clip_image_embedding):
        """
        Denoise image conditioned on CLIP image embedding

        Args:
            noisy_image: (B, 3, H, W)
            t: timestep
            clip_image_embedding: (B, 512)
        Returns:
            predicted_noise: (B, 3, H, W)
        """
        return self.unet(noisy_image, t, context=clip_image_embedding)
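
At sampling time, the complete pipeline below drives this decoder with classifier-free guidance via a sample_with_cfg helper that the snippets leave undefined. A minimal sketch, assuming the decoder was trained with random conditioning dropout so that a zeroed embedding acts as the unconditional branch (ddpm_step is the reverse-diffusion update sketched earlier):

@torch.no_grad()
def sample_with_cfg(decoder, clip_image_embedding, shape, steps=50, guidance_scale=4.0):
    """Sample an image with classifier-free guidance."""
    x = torch.randn(shape)
    # Zero embedding stands in for the unconditional case
    # (assumes conditioning dropout during training)
    null_embedding = torch.zeros_like(clip_image_embedding)
    for t in reversed(range(steps)):
        eps_cond = decoder(x, t, clip_image_embedding)
        eps_uncond = decoder(x, t, null_embedding)
        # Push the prediction away from unconditional toward conditional
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        x = ddpm_step(x, eps, t)
    return x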

Multi-Resolution Generation

DALL-E 2 generates images progressively through cascaded diffusion:

  1. 64×64 base: Fast generation, captures overall structure
  2. 256×256 upsampler: Adds details and coherence
  3. 1024×1024 upsampler: Final high-resolution output

Each upsampler is a separate diffusion model conditioned on the lower-resolution image.
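
A sketch of how such an upsampler stage could be wired, assuming the common approach of bilinearly upsampling the low-resolution image and concatenating it with the noisy high-resolution input along the channel dimension; the UpsamplerDiffusion and sample_upsampler names mirror the placeholders used in the pipeline code below:

import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsamplerDiffusion(nn.Module):
    """Diffusion model that refines a low-resolution image into a higher-resolution one."""
    def __init__(self, low_res, high_res, clip_embed_dim=512):
        super().__init__()
        self.high_res = high_res
        # 6 input channels: 3 for the noisy high-res image,
        # 3 for the upsampled low-res conditioning image
        self.unet = UNet(
            in_channels=6,
            out_channels=3,
            context_dim=clip_embed_dim,
        )

    def forward(self, noisy_high_res, t, low_res_image, clip_embedding):
        cond = F.interpolate(low_res_image, size=self.high_res, mode="bilinear")
        x = torch.cat([noisy_high_res, cond], dim=1)
        return self.unet(x, t, context=clip_embedding)


@torch.no_grad()
def sample_upsampler(upsampler, low_res_image, clip_embedding, steps=50):
    """Run reverse diffusion at the higher resolution, conditioned on the low-res image."""
    b = low_res_image.shape[0]
    x = torch.randn(b, 3, upsampler.high_res, upsampler.high_res)
    for t in reversed(range(steps)):
        eps = upsampler(x, t, low_res_image, clip_embedding)
        x = ddpm_step(x, eps, t)
    return x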

Complete DALL-E 2 Pipeline

class DALLE2(nn.Module):
    def __init__(self):
        super().__init__()
        # Frozen CLIP for encoding
        self.clip = load_clip_model()
        self.clip.eval()

        # Prior: text embedding → image embedding
        self.prior = DiffusionPrior(clip_embed_dim=512)

        # Decoder: image embedding → 64×64 image
        self.decoder_64 = DALLE2Decoder(clip_embed_dim=512)

        # Upsamplers for higher resolution
        self.upsampler_256 = UpsamplerDiffusion(64, 256)
        self.upsampler_1024 = UpsamplerDiffusion(256, 1024)

    @torch.no_grad()
    def generate(self, text_prompt, guidance_scale=4.0):
        """
        Generate image from text

        Args:
            text_prompt: string (e.g., "A cat wearing a wizard hat")
            guidance_scale: classifier-free guidance strength
        Returns:
            image: (1, 3, 1024, 1024)
        """
        # 1. Encode text with CLIP
        text_embedding = self.clip.encode_text(text_prompt)  # (1, 512)

        # 2. Prior: generate CLIP image embedding
        clip_image_embedding = sample_prior(
            self.prior,
            text_embedding,
            steps=64,  # Prior uses fewer steps
        )  # (1, 512)

        # 3. Decoder: generate 64×64 image
        image_64 = sample_with_cfg(
            self.decoder_64,
            clip_image_embedding,
            shape=(1, 3, 64, 64),
            steps=50,
            guidance_scale=guidance_scale,
        )

        # 4. Upsample to 256×256
        image_256 = sample_upsampler(
            self.upsampler_256,
            image_64,
            clip_image_embedding,
        )

        # 5. Upsample to 1024×1024
        image_1024 = sample_upsampler(
            self.upsampler_1024,
            image_256,
            clip_image_embedding,
        )

        return image_1024
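
None of these components ship with trained weights here, so the class is a shape-level illustration; assuming trained weights are loaded into each module, usage reduces to a single call:

dalle2 = DALLE2()  # assumes pretrained prior, decoder, and upsampler weights are loaded
image = dalle2.generate("A cat wearing a wizard hat", guidance_scale=4.0)
print(image.shape)  # expected: (1, 3, 1024, 1024)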

Image Variations and Editing

DALL-E 2’s two-stage design enables powerful capabilities beyond text-to-image:

Image Variations

Generate variations by perturbing CLIP embeddings:

@torch.no_grad()
def generate_variations(dalle2, input_image, num_variations=4):
    """Generate variations of an existing image"""
    # 1. Encode image with CLIP
    clip_image_embedding = dalle2.clip.encode_image(input_image)

    # 2. Add noise to embedding for diversity
    noisy_embeddings = clip_image_embedding + 0.1 * torch.randn(num_variations, 512)

    # 3. Decode each embedding to an image
    variations = []
    for emb in noisy_embeddings:
        # Keep a batch dimension of 1 for the decoder
        image = sample_with_cfg(dalle2.decoder_64, emb.unsqueeze(0), (1, 3, 64, 64))
        variations.append(image)
    return variations

Text-Guided Image Editing

Interpolate between image and text embeddings:

@torch.no_grad()
def text_guided_edit(dalle2, input_image, text_edit, alpha=0.5):
    """
    Edit image based on text

    Args:
        input_image: original image
        text_edit: editing instruction (e.g., "make it winter")
        alpha: interpolation weight (0 = original, 1 = full edit)
    """
    # Encode image
    image_emb = dalle2.clip.encode_image(input_image)

    # Encode text and generate corresponding image embedding
    text_emb = dalle2.clip.encode_text(text_edit)
    text_image_emb = sample_prior(dalle2.prior, text_emb)

    # Interpolate embeddings
    edited_emb = (1 - alpha) * image_emb + alpha * text_image_emb

    # Generate edited image
    edited_image = sample_with_cfg(dalle2.decoder_64, edited_emb, (1, 3, 64, 64))
    return edited_image

Training DALL-E 2

Training happens in three stages:

1. Train CLIP (or use pre-trained)

# CLIP is pre-trained on 400M image-text pairs
clip_model = load_pretrained_clip()  # Frozen during DALL-E 2 training

2. Train the Prior

# Train prior to map text embeddings to image embeddings
for text, image in dataset:
    text_emb = clip.encode_text(text)
    image_emb = clip.encode_image(image)  # Ground-truth target

    # Diffusion training in embedding space
    loss = train_diffusion_prior(prior, text_emb, target=image_emb)
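
Expanding the loop above into a single training step, a sketch of a standard noise-prediction objective in embedding space (the schedule and loss form are illustrative; the DALL-E 2 paper actually trains its prior to predict the unnoised embedding directly, which changes only the regression target):

import torch
import torch.nn.functional as F

NUM_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, NUM_STEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def prior_training_step(prior, clip, text, image, optimizer):
    """One gradient step: teach the prior to denoise CLIP image embeddings
    conditioned on the matching CLIP text embedding."""
    with torch.no_grad():
        text_emb = clip.encode_text(text)
        image_emb = clip.encode_image(image)  # target of the prior

    # Sample a random timestep per example and noise the target embedding
    t = torch.randint(0, NUM_STEPS, (image_emb.shape[0],))
    noise = torch.randn_like(image_emb)
    alpha_bar = alphas_cumprod[t].unsqueeze(-1)
    noisy_emb = torch.sqrt(alpha_bar) * image_emb + torch.sqrt(1 - alpha_bar) * noise

    # Predict the noise and regress it with MSE
    predicted_noise = prior(noisy_emb, t, text_emb)
    loss = F.mse_loss(predicted_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The decoder training step (next subsection) has the same skeleton, except that the pixel tensor is noised and the CLIP image embedding is used purely as conditioning.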

3. Train the Decoder

# Train decoder to generate images from CLIP embeddings
for image in dataset:
    image_emb = clip.encode_image(image)

    # Diffusion training in pixel space
    loss = train_diffusion_decoder(decoder, image, conditioning=image_emb)

Comparison to Stable Diffusion

Aspect        | DALL-E 2        | Stable Diffusion
Prior         | Yes (two-stage) | No (one-stage)
Text encoder  | CLIP            | CLIP
Latent space  | CLIP embeddings | VAE latents
Resolution    | 1024×1024       | 512×512 (v1), 1024×1024 (SDXL)
Open source   | No              | Yes
Efficiency    | Moderate        | High (latent diffusion)
Variations    | Native support  | Requires additional techniques

Stable Diffusion simplified the architecture:

  • No prior (text conditioning is injected directly into the U-Net via cross-attention)
  • Diffusion in VAE latent space (more efficient)
  • Open source and widely adopted
  • Single-stage pipeline is simpler

Key Results

Quantitative Performance

  • CLIP score: 0.63
  • Human evaluation: evaluators prefer DALL-E 2 samples to ground-truth images about 5% of the time
  • Prompt following: Strong adherence to complex multi-object prompts
  • Photorealism: Photorealistic images for many categories

Qualitative Capabilities

  • Complex scene composition with multiple objects
  • Style transfer (“in the style of Picasso”)
  • Unlikely combinations (“astronaut riding a horse”)
  • Variations maintain semantic content while changing details

Historical Impact

DALL-E 2 (April 2022) was the first text-to-image system whose results reached and stunned a broad, non-specialist audience:

  • Demonstrated diffusion + CLIP could generate photorealistic images from arbitrary text
  • Sparked the generative AI revolution (2022-present)
  • Inspired Stable Diffusion, Midjourney, and other systems
  • Showed that multimodal pre-training enables powerful generation

Limitations

  • Slow: Two-stage generation requires more compute
  • Closed source: Cannot replicate exactly (proprietary)
  • Complexity: More components to train and tune than one-stage systems
  • Resource intensive: Requires significant training compute
  • Text rendering: Struggles with generating readable text in images
  • Fine details: Sometimes misses small details in complex prompts

Modern Alternatives

While DALL-E 2 was groundbreaking, newer systems have emerged:

  • DALL-E 3 (2023): Improved prompt following with caption refinement
  • Stable Diffusion (2022): Open source latent diffusion, widely adopted
  • Midjourney (2022): Closed source, exceptional artistic quality
  • Imagen (2022): Google’s approach with T5 text encoder

Key Takeaways

  1. Two-stage pipeline: Prior (text → CLIP embedding) + Decoder (embedding → image)
  2. CLIP latent space: Semantic guidance through pre-trained CLIP embeddings
  3. Diffusion in embedding space: Prior operates on 512-dim vectors, not pixels
  4. Multi-resolution cascade: Progressive generation from 64×64 → 1024×1024
  5. Image variations: Natural support via CLIP embedding perturbation
  6. Historical significance: Sparked the generative AI revolution in 2022
  7. Trade-offs: Two stages add complexity but enable flexibility and control

Citation

@article{ramesh2022hierarchical,
  title={Hierarchical Text-Conditional Image Generation with CLIP Latents},
  author={Ramesh, Aditya and Dhariwal, Prafulla and Nichol, Alex and Chu, Casey and Chen, Mark},
  journal={arXiv preprint arXiv:2204.06125},
  year={2022}
}
