CLIP: Contrastive Language-Image Pre-training
Radford et al. (2021) - “Learning Transferable Visual Models From Natural Language Supervision”
Published: ICML 2021 · Citations: 25,000+ (as of 2024) · Code: openai/CLIP
The Revolutionary Insight
While ImageNet changed computer vision by scaling labeled data, CLIP asks: what if we scale supervision from natural language instead? By training on 400 million image-text pairs scraped from the internet, CLIP learns visual concepts from text descriptions—enabling zero-shot transfer to new tasks without task-specific training.
Key breakthrough: Use contrastive learning to align images and text in a shared embedding space, then leverage natural language as a flexible interface for any vision task.
Core Innovation
From Supervised to Natural Language Supervision
Traditional approach (ImageNet):
- Collect images
- Manually label with predefined classes
- Train classifier on those classes
- Result: Works only on those specific 1000 classes
CLIP approach:
- Collect image-text pairs from the internet (no manual labeling!)
- Train to align images with their natural captions
- At test time, describe classes with text
- Result: Works on any classes described in natural language
This shift from closed-vocabulary to open-vocabulary learning is transformative.
The CLIP Training Objective
Given a batch of $N$ (image, text) pairs, CLIP minimizes a symmetric contrastive loss

$$\mathcal{L} = \tfrac{1}{2}\left(\mathcal{L}_{\text{I}\rightarrow\text{T}} + \mathcal{L}_{\text{T}\rightarrow\text{I}}\right),$$

where each direction uses the InfoNCE contrastive loss, e.g. for the image-to-text direction

$$\mathcal{L}_{\text{I}\rightarrow\text{T}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(I_i, T_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(I_i, T_j)/\tau\big)},$$

with $\mathrm{sim}(\cdot,\cdot)$ the cosine similarity of the projected embeddings and $\tau$ a learnable temperature.

Intuition: For each image, the model must identify its matching text caption among $N$ options. This is an $N$-way classification task where the "classes" are dynamically defined by the batch.
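A compact, self-contained sketch of this symmetric loss (the full model implementation appears later in this note; the batch size and embedding width below are illustrative):

import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, d) L2-normalized embeddings of paired images and texts
    logits = image_emb @ text_emb.T / temperature                      # (N, N) similarity matrix
    labels = torch.arange(image_emb.size(0), device=image_emb.device)  # positives sit on the diagonal
    loss_i2t = F.cross_entropy(logits, labels)                         # image → text direction
    loss_t2i = F.cross_entropy(logits.T, labels)                       # text → image direction
    return (loss_i2t + loss_t2i) / 2

# Illustrative usage with random, normalized 512-d embeddings for a batch of 8 pairs
image_emb = F.normalize(torch.randn(8, 512), dim=-1)
text_emb = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_loss(image_emb, text_emb))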
Architecture
CLIP consists of four main components:
1. Image Encoder
Transforms images into fixed-dimensional embeddings:
Options:
- ResNet-50/101 (modified with attention pooling instead of global average pooling; a sketch of this pooling head follows below)
- Vision Transformer (ViT) in multiple sizes:
- ViT-B/32: Base model with 32×32 patches
- ViT-B/16: Base model with 16×16 patches
- ViT-L/14: Large model with 14×14 patches (best performance)
Input: 224×224 RGB image. Output: visual features before projection; the width depends on the backbone (e.g., 768-d for ViT-B, 1024-d for ViT-L)
See Vision Transformer for ViT details.
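The attention-pooling head mentioned above can be sketched roughly as follows. This is a simplified stand-in built on nn.MultiheadAttention; the dimensions and head count are illustrative, not the exact released configuration.

import torch
import torch.nn as nn

class AttentionPool2d(nn.Module):
    # Simplified attention pooling: a mean-pooled "query" token attends over all spatial positions.
    def __init__(self, spatial_dim, feature_dim, num_heads, output_dim):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.randn(spatial_dim ** 2 + 1, feature_dim) / feature_dim ** 0.5)
        self.attn = nn.MultiheadAttention(feature_dim, num_heads)   # expects (seq, batch, dim)
        self.proj = nn.Linear(feature_dim, output_dim)

    def forward(self, x):
        # x: (batch, feature_dim, H, W) feature map from the ResNet trunk
        x = x.flatten(start_dim=2).permute(2, 0, 1)                 # (H*W, batch, dim)
        x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0)      # prepend mean token as the query
        x = x + self.pos_embed[:, None, :]                          # add positional embedding
        pooled, _ = self.attn(query=x[:1], key=x, value=x)          # attend from the mean token
        return self.proj(pooled.squeeze(0))                         # (batch, output_dim)

# Example: pool a 7×7×2048 ResNet feature map into 1024-d image features
pool = AttentionPool2d(spatial_dim=7, feature_dim=2048, num_heads=8, output_dim=1024)
print(pool(torch.randn(2, 2048, 7, 7)).shape)   # torch.Size([2, 1024])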
2. Text Encoder
Transformer encoder (similar to GPT) for text:
Architecture:
- 12 layers (standard CLIP)
- 8 attention heads
- 512-dimensional embeddings
- Causal attention mask (like GPT)
- Max sequence length: 77 tokens
Input: BPE-tokenized text (49,152-token vocabulary). Output: Text features taken at the [EOS] token position (512-d)
Why [EOS] token? Similar to BERT’s [CLS], it aggregates information from the entire sequence.
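A minimal sketch of how the text feature is pulled from the [EOS] position. The released openai/CLIP code locates it with argmax because the [EOS] token has the highest token id; the tensors below are random stand-ins for illustration.

import torch

# token_ids: padded token ids; hidden: per-token transformer outputs (random stand-ins here)
token_ids = torch.randint(0, 49152, (4, 77))      # (batch, seq_len)
hidden = torch.randn(4, 77, 512)                  # (batch, seq_len, width)

# In the released code the [EOS] token has the largest id, so argmax finds its position.
eos_positions = token_ids.argmax(dim=-1)                              # (batch,)
text_features = hidden[torch.arange(hidden.size(0)), eos_positions]   # (batch, 512)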
3. Projection Heads
Linear projections to a shared multimodal embedding space:
# Both modalities project to same dimensionality
vision_projection = nn.Linear(vision_dim, embed_dim) # e.g., 768 → 512
text_projection = nn.Linear(text_dim, embed_dim)     # e.g., 512 → 512

Shared space dimension: Typically 512 (balances expressiveness and efficiency)
4. Contrastive Learning
Symmetric contrastive loss aligns the embeddings in both directions.
Complete PyTorch Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
class CLIP(nn.Module):
def __init__(self,
vision_encoder,
text_encoder,
vision_dim=768,
text_dim=512,
embed_dim=512,
temperature=0.07):
"""
CLIP: Contrastive Language-Image Pre-training.
Args:
vision_encoder: Image encoder (ResNet or ViT)
text_encoder: Text encoder (Transformer)
vision_dim: Dimension of vision features
text_dim: Dimension of text features
embed_dim: Dimension of shared embedding space
temperature: Initial temperature for contrastive loss
"""
super().__init__()
self.vision_encoder = vision_encoder
self.text_encoder = text_encoder
# Projection heads to shared embedding space
self.vision_projection = nn.Linear(vision_dim, embed_dim, bias=False)
self.text_projection = nn.Linear(text_dim, embed_dim, bias=False)
# Learnable temperature (log-parameterized for stability)
self.log_temperature = nn.Parameter(torch.log(torch.tensor(temperature)))
def encode_image(self, images):
"""
Encode images to normalized embeddings.
Args:
images: (batch_size, 3, H, W)
Returns:
Normalized image embeddings (batch_size, embed_dim)
"""
vision_features = self.vision_encoder(images)
image_embeddings = self.vision_projection(vision_features)
image_embeddings = F.normalize(image_embeddings, dim=-1)
return image_embeddings
def encode_text(self, texts):
"""
Encode text to normalized embeddings.
Args:
texts: (batch_size, seq_len) - tokenized text
Returns:
Normalized text embeddings (batch_size, embed_dim)
"""
text_features = self.text_encoder(texts)
text_embeddings = self.text_projection(text_features)
text_embeddings = F.normalize(text_embeddings, dim=-1)
return text_embeddings
def forward(self, images, texts):
"""
Forward pass computing contrastive loss.
Args:
images: (batch_size, 3, H, W)
texts: (batch_size, seq_len)
Returns:
loss: Scalar contrastive loss
logits_per_image: Similarity matrix (batch_size, batch_size)
logits_per_text: Similarity matrix transposed
"""
# Encode both modalities
image_embeddings = self.encode_image(images)
text_embeddings = self.encode_text(texts)
# Compute similarity matrix
temperature = self.log_temperature.exp()
logits_per_image = image_embeddings @ text_embeddings.T / temperature
logits_per_text = logits_per_image.T
# Labels: diagonal elements are positive pairs
batch_size = len(images)
labels = torch.arange(batch_size, device=images.device)
# Symmetric contrastive loss
loss_i2t = F.cross_entropy(logits_per_image, labels)
loss_t2i = F.cross_entropy(logits_per_text, labels)
loss = (loss_i2t + loss_t2i) / 2
        return loss, logits_per_image, logits_per_text

Zero-Shot Classification
The killer application of CLIP:
def zero_shot_classify(clip_model, image, class_names, templates=None):
"""
Classify image using natural language class descriptions.
Args:
clip_model: Trained CLIP model
image: Input image tensor (1, 3, H, W)
class_names: List of class names ["cat", "dog", ...]
templates: Optional prompt templates
Returns:
predicted_class: Index of most likely class
probabilities: Softmax probabilities for each class
"""
if templates is None:
templates = ["a photo of a {}."]
with torch.no_grad():
# Encode image once
image_embedding = clip_model.encode_image(image) # (1, embed_dim)
# Encode all class descriptions
text_embeddings = []
for class_name in class_names:
# Use multiple templates and average (ensemble prompts)
class_embeddings = []
for template in templates:
text = template.format(class_name)
                text_tokens = tokenize(text)  # assumed tokenizer, e.g. clip.tokenize
embedding = clip_model.encode_text(text_tokens)
class_embeddings.append(embedding)
            # Average over templates, then re-normalize the ensemble embedding
            class_embedding = torch.stack(class_embeddings).mean(dim=0)
            class_embedding = F.normalize(class_embedding, dim=-1)
            text_embeddings.append(class_embedding)
        # Concatenate (not stack) so the result is 2-D: (num_classes, embed_dim)
        text_embeddings = torch.cat(text_embeddings)
# Compute similarity scores
logits = image_embedding @ text_embeddings.T # (1, num_classes)
probabilities = F.softmax(logits / 0.01, dim=-1).squeeze(0)
predicted_class = logits.argmax(dim=-1).item()
return predicted_class, probabilities
# Example usage
class_names = ["cat", "dog", "bird", "horse"]
templates = [
"a photo of a {}.",
"a picture of a {}.",
"an image showing a {}."
]
pred_idx, probs = zero_shot_classify(clip, image, class_names, templates)
print(f"Predicted: {class_names[pred_idx]}")
for name, prob in zip(class_names, probs):
print(f" {name}: {prob:.3f}")Image-Text Retrieval
def retrieve_images(clip_model, text_query, image_database):
"""
Retrieve most relevant images for a text query.
Args:
clip_model: Trained CLIP model
text_query: Natural language query string
image_database: Tensor of images (N, 3, H, W)
Returns:
top_k_indices: Indices of most similar images
similarities: Similarity scores
"""
with torch.no_grad():
# Encode text query
        text_tokens = tokenize(text_query)  # assumed tokenizer, e.g. clip.tokenize
text_embedding = clip_model.encode_text(text_tokens) # (1, embed_dim)
# Encode all images
image_embeddings = clip_model.encode_image(image_database) # (N, embed_dim)
# Compute similarities
similarities = (text_embedding @ image_embeddings.T).squeeze(0) # (N,)
# Get top-k most similar
top_k_indices = similarities.topk(k=10).indices
return top_k_indices, similarities[top_k_indices]
# Example
results, scores = retrieve_images(
clip,
"a golden retriever playing in the park",
image_database
)

Training Details
Dataset
WIT (WebImageText): 400 million (image, text) pairs collected from the internet:
- Scraped from various websites
- Includes alt-text, captions, titles
- No manual cleaning or verification
- Noisy but diverse
Why web data? Unlike ImageNet’s 1000 manually defined classes, web data provides:
- Natural language diversity (any concept can appear)
- Massive scale (400M pairs vs. 1.3M ImageNet images)
- Free supervision (no annotation cost)
- Real-world distribution
Training Setup
Batch size: 32,768 image-text pairs
- Each positive pair is contrasted against 32,767 in-batch negatives
- Requires distributed training across many GPUs
- Essential for learning discriminative embeddings
Optimization (a minimal config sketch follows this list):
- AdamW optimizer
- Decoupled weight decay
- Cosine learning rate schedule with warmup
- Mixed precision (float16) training
Compute: ~256 V100 GPUs for 12 days (ViT-L/14 model)
Augmentation: Minimal (just random crop and resize)
- Unlike self-supervised vision methods (e.g., SimCLR), CLIP does not need heavy image augmentation; the paired text already provides a rich training signal
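A minimal sketch of such a setup in PyTorch (AdamW is Adam with decoupled weight decay). The hyperparameter values, step counts, and `loader` below are illustrative placeholders rather than the paper's exact configuration, and `model` is assumed to be the CLIP module defined earlier.

import math
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4,
                              betas=(0.9, 0.98), eps=1e-6, weight_decay=0.2)

warmup_steps, total_steps = 2_000, 100_000

def lr_lambda(step):
    # Linear warmup followed by cosine decay
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()          # mixed-precision (float16) training

for images, texts in loader:                  # `loader` yields batches of image/text tensors
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss, _, _ = model(images, texts)     # CLIP forward pass from the implementation above
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()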
Model Configurations
| Model | Vision Encoder | Params | Top-1 ImageNet (Zero-Shot) |
|---|---|---|---|
| RN50 | ResNet-50 | 102M | 59.6% |
| RN101 | ResNet-101 | 119M | 62.3% |
| ViT-B/32 | ViT-Base, 32×32 patches | 151M | 63.3% |
| ViT-B/16 | ViT-Base, 16×16 patches | 149M | 68.3% |
| ViT-L/14 | ViT-Large, 14×14 patches | 428M | 75.5% |
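The released openai/CLIP package exposes these configurations by name; assuming that package is installed, they can be listed and loaded like this:

import clip

print(clip.available_models())
# e.g. ['RN50', 'RN101', ..., 'ViT-B/32', 'ViT-B/16', 'ViT-L/14']

model, preprocess = clip.load("ViT-B/32")  # downloads the weights on first use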
Key Design Decisions
1. Contrastive vs Predictive Objective
Why contrastive? The paper compared:
- Predictive: Predict exact caption (like image captioning)
- Contrastive: Match image to caption among alternatives
Result: Contrastive learning scales better and transfers more effectively.
Reason: Contrastive learning captures semantic alignment without requiring exact word prediction.
2. Symmetric Loss
Both directions trained:
- Image → Text: “Which caption matches this image?”
- Text → Image: “Which image matches this caption?”
Benefit: Enables bidirectional retrieval and stronger alignment.
3. Batch Size
Massive batches (32,768 pairs) crucial for performance:
- More negatives = harder contrastive task = better representations
- Compute cost: the N×N similarity matrix grows quadratically with batch size, but the scale is critical for representation quality; in distributed training the embeddings are gathered across GPUs before computing it (see the sketch below)
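A minimal sketch of how such global batches are typically assembled across GPUs. This is a common pattern in open-source CLIP training code, not something spelled out in the paper; it assumes torch.distributed is already initialized and `image_emb`/`text_emb` are the per-GPU embeddings.

import torch
import torch.distributed as dist

def gather_embeddings(local_emb):
    # Collect embeddings from every GPU so each rank scores its images
    # against the full global batch of texts (and vice versa).
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_emb) for _ in range(world_size)]
    dist.all_gather(gathered, local_emb)
    gathered[dist.get_rank()] = local_emb   # keep the gradient path for the local shard
    return torch.cat(gathered, dim=0)

# all_text_emb = gather_embeddings(text_emb)
# logits = image_emb @ all_text_emb.T / temperature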
4. Prompt Engineering
Text encoding uses prompt templates:
# Simple prompt
"a photo of a {class}"
# More descriptive
"a photo of a {class}, a type of {category}"
# Context-specific
"a satellite photo of a {class}" # for satellite imagery
"a CT scan showing {class}" # for medical imagesFinding: Prompt design significantly affects performance. Use ensemble of prompts for robustness.
5. No Fine-Tuning Required
Unlike traditional transfer learning:
- No need to train on downstream task data
- Just encode text descriptions of classes
- Immediate application to new tasks
Results
Zero-Shot Transfer Performance
ImageNet:
- CLIP ViT-L/14: 75.5% top-1 accuracy
- Without any ImageNet training!
- Matches ResNet-50 trained on ImageNet (76.2%)
27 additional datasets:
- Tested on diverse domains: OCR, satellite, texture, action recognition, etc.
- Competitive with task-specific models on many tasks
- Particularly strong on fine-grained classification
Few-Shot Learning
When fine-tuned on small amounts of labeled data:
- ~16 labeled examples per class: linear probes on CLIP features are often competitive with models trained on far more labeled data
- Linear probe on top of frozen features works well (see the sketch after this list)
- Adaptation is data-efficient
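A minimal linear-probe sketch in the spirit of the paper's evaluation protocol (logistic regression on frozen CLIP features); `model`, `loader`, and the regularization constant are placeholders.

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

# Assumes `model` is a loaded CLIP model and `loader` yields (images, labels) batches.
features, labels = [], []
with torch.no_grad():
    for images, targets in loader:
        feats = model.encode_image(images)        # frozen CLIP image features
        features.append(feats.cpu().numpy())
        labels.append(targets.numpy())
features = np.concatenate(features)
labels = np.concatenate(labels)

# Logistic regression as the linear probe; C is a placeholder hyperparameter.
probe = LogisticRegression(C=0.316, max_iter=1000)
probe.fit(features, labels)
print("train accuracy:", probe.score(features, labels))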
Robustness
Zero-shot CLIP shows superior robustness to distribution shift:
- ImageNet variants (ImageNet-A, ImageNet-R, ObjectNet): Smaller accuracy drop
- Why? Natural language supervision provides more robust features than fixed ImageNet classes
Limitations
Despite impressive results, CLIP has notable weaknesses:
- Fine-grained classification: Struggles with subtle differences (e.g., car models, flower species)
- Counting: Poor at “how many X are in this image?”
- Compositional reasoning: Difficulty with spatial relationships (“X to the left of Y”)
- Abstract concepts: Better on concrete objects than abstract ideas
- Rare concepts: Depends on frequency in training data
- Biases: Reflects biases in web data (gender, race, etc.)
- OCR: Not trained to read text in images
Impact and Applications
Direct Applications
Zero-shot classification:
- Classify into any categories without training
- Useful for long-tail distributions
Image-text retrieval:
- Search images with natural language
- Generate captions by retrieving similar images
Prompt-based adaptation:
- Change behavior by changing prompts
- No model retraining needed
As Foundation for Other Models
CLIP embeddings used in:
- DALL-E 2: Text-to-image generation (CLIP guides diffusion)
- Stable Diffusion: Open-source text-to-image (CLIP text encoder)
- Flamingo: Vision-language generalist model
- GPT-4V: Multimodal GPT (likely uses CLIP-like pre-training)
Research Directions
Improvements:
- OpenCLIP: Open-source reproduction with better data curation
- ALIGN (Google): 1.8B noisy image-text pairs
- Florence: Hierarchical vision-language pre-training
- CoCa: Contrastive + captioning objectives combined
Extensions:
- Video-CLIP: Extend to video understanding
- Audio-CLIP: Add audio modality
- 3D-CLIP: Apply to 3D data
Practical Considerations
Using Pre-trained CLIP
import torch
import clip
from PIL import Image
# Load pre-trained model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
# Preprocess image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
# Tokenize text
text = clip.tokenize(["a cat", "a dog"]).to(device)
# Compute features
with torch.no_grad():
image_features = model.encode_image(image)
text_features = model.encode_text(text)
# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
# Similarity
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(similarity)  # [[0.95, 0.05]] - 95% cat, 5% dog

Prompt Engineering Tips
Use descriptive prompts:
- Bad: "cat"
- Good: "a photo of a cat"
- Better: "a high quality photo of a cat"
Ensemble multiple prompts:
templates = [
"a photo of a {}",
"a picture of a {}",
"{} in the scene",
"a rendering of a {}",
]
# Average embeddings from all templates
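A minimal sketch of that averaging step, assuming `clip` and a loaded `model` as in the loading example above: encode each filled-in template, normalize, average, then re-normalize.

import torch
import torch.nn.functional as F
import clip  # openai/CLIP package, as in the loading example

def ensemble_text_embedding(model, class_name, templates, device="cpu"):
    # Encode every filled-in template for this class, normalize, and average.
    tokens = clip.tokenize([t.format(class_name) for t in templates]).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)           # (num_templates, embed_dim)
    emb = F.normalize(emb, dim=-1).mean(dim=0)    # average the normalized embeddings
    return F.normalize(emb, dim=-1)               # re-normalize the ensemble embedding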
Domain-specific prompts:
- Medical: "a CT scan showing {}"
- Satellite: "a satellite image of {}"
- Sketches: "a sketch of {}"
Fine-Tuning Strategies
Linear probe (freeze CLIP, train classifier):
# Freeze CLIP
for param in clip_model.parameters():
param.requires_grad = False
# Add classifier
classifier = nn.Linear(embed_dim, num_classes).to(device)
# Train only classifier
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

Full fine-tuning (unfreeze all):
# Unfreeze CLIP
for param in clip_model.parameters():
param.requires_grad = True
# Lower learning rate for pre-trained weights
optimizer = torch.optim.Adam([
{'params': clip_model.parameters(), 'lr': 1e-6},
{'params': classifier.parameters(), 'lr': 1e-3}
])

For Healthcare Applications
CLIP’s approach is directly applicable to medical multimodal learning:
Adapting CLIP to Healthcare
Replace components:
- Vision encoder: Process medical images (X-rays, CT scans) or symptom sketches
- Text encoder: Use ClinicalBERT for medical text
- Training data: (image, report) pairs from medical databases
Benefits:
- Learn from naturally occurring (image, report) pairs
- Zero-shot classification of rare conditions
- Interpretable through text descriptions
- No need for extensive manual labeling
Example: Symptom sketch + text multimodal model
# Vision encoder: Process 3D body sketches
sketch_encoder = ResNet50() # or ViT
# Text encoder: ClinicalBERT for symptom descriptions
text_encoder = ClinicalBERT()
# Train with contrastive loss on (sketch, symptom text) pairs
# Zero-shot: "chest pain in left side" → retrieve similar sketches

Key Takeaways
- Natural language supervision scales better than fixed-class supervised learning
- Contrastive learning aligns vision and language in a shared space
- Zero-shot transfer emerges from massive-scale pre-training
- Prompt engineering is crucial for downstream performance
- Symmetric loss enables bidirectional retrieval
- Large batches (32,768!) critical for learning discriminative features
- Foundation model for many modern multimodal systems
Related Concepts
- Contrastive Learning - The training objective for CLIP
- Multimodal Learning - Foundations of multimodal models
- Vision Transformer - CLIP’s vision encoder option
- Attention Mechanism - Used in both encoders
- Transfer Learning - CLIP enables zero-shot transfer
- Text Generation - Prompting strategies
Learning Resources
Original Paper
- Radford et al. (2021), "Learning Transferable Visual Models From Natural Language Supervision" (arXiv:2103.00020)
Official Resources
- openai/CLIP repository (official model code and pre-trained weights)
Open-Source Implementations
- OpenCLIP - Improved open-source reproduction
- CLIP-as-service - Production deployment
- Hugging Face Transformers
Explanations
- Yannic Kilcher - CLIP Paper Explained
- Lilian Weng - Vision-Language Models
- Hugging Face Blog - CLIP Tutorial