Contrastive Learning
Contrastive learning is a self-supervised learning approach that learns representations by pulling similar examples together and pushing dissimilar examples apart in embedding space. It has become the foundation for modern vision-language models, self-supervised pre-training, and multimodal alignment.
Core Principle
Instead of learning from labeled data, contrastive learning exploits natural structure in data:
- Positive pairs: Examples that should be similar (same class, image-caption pairs, augmentations of same image)
- Negative pairs: Examples that should be different (different classes, mismatched pairs)
The model learns to:
- Maximize similarity between positive pairs
- Minimize similarity between negative pairs
- Create a semantically meaningful embedding space
This approach scales to billions of examples because it requires no manual labels—just the natural pairing or augmentation of data.
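As a concrete illustration of where positives and negatives come from, consider a batch of paired data such as image-caption pairs: matching indices are positives and every other pairing in the batch is a negative. A minimal sketch (the boolean masks are purely illustrative):

```python
import torch

batch_size = 4  # a toy batch of (image, caption) pairs
# positive_mask[i, j] is True only when image i and caption j were paired in the data
positive_mask = torch.eye(batch_size, dtype=torch.bool)
negative_mask = ~positive_mask  # all mismatched image-caption combinations
```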
The InfoNCE Loss ⭐
The most widely used contrastive loss is InfoNCE (Information Noise-Contrastive Estimation), introduced with Contrastive Predictive Coding and popularized at scale by CLIP:

$$
\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\!\big(\text{sim}(z_i, z_i^{+}) / \tau\big)}{\sum_{j=1}^{N} \exp\!\big(\text{sim}(z_i, z_j) / \tau\big)}
$$

where:
- $z_i$ is the anchor example (e.g., an image embedding)
- $z_i^{+}$ is the positive example (e.g., the matching text embedding)
- $z_j$ are all examples in the batch (including the positive)
- $\text{sim}(\cdot,\cdot)$ is the similarity function (usually cosine similarity)
- $\tau$ is the temperature parameter (controls distribution sharpness)
- $N$ is the batch size
Intuition
For each anchor $z_i$:
- Numerator: Maximize similarity with its positive pair
- Denominator: Contrast against all other examples in the batch
This is essentially an $N$-way classification problem: the model must identify which of the $N$ examples in the batch is the correct match. The batch provides implicit negative samples without requiring explicit annotation.
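Because the loss is just a softmax cross-entropy over the batch, a minimal numeric sketch (with made-up similarity values) makes the classification view explicit:

```python
import torch
import torch.nn.functional as F

# Cosine similarities between one anchor image and 3 candidate texts;
# index 0 is the true caption (values are made up for illustration).
sims = torch.tensor([[0.8, 0.2, 0.1]])
target = torch.tensor([0])

for tau in (0.07, 0.5):
    loss = F.cross_entropy(sims / tau, target)
    print(f"tau={tau}: InfoNCE loss = {loss.item():.4f}")
```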
Multimodal Contrastive Loss
For vision-language pairs (image $I_i$, text $T_i$), the loss is symmetric:

$$
\mathcal{L} = \tfrac{1}{2}\left(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\right)
$$

- $\mathcal{L}_{I \to T}$: Image-to-text retrieval (can we find the right text for this image?)
- $\mathcal{L}_{T \to I}$: Text-to-image retrieval (can we find the right image for this text?)
This bidirectional loss ensures the embedding space is aligned in both directions.
PyTorch Implementation
Basic Contrastive Loss
```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """
    Compute symmetric contrastive loss for image-text pairs.

    Args:
        image_embeddings: (batch_size, embed_dim)
        text_embeddings: (batch_size, embed_dim)
        temperature: Temperature parameter τ

    Returns:
        Scalar contrastive loss
    """
    batch_size = image_embeddings.shape[0]

    # Normalize embeddings to unit hypersphere
    # This makes similarity = cosine similarity
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Compute similarity matrix: (batch_size, batch_size)
    # logits[i, j] = similarity between image i and text j
    logits = image_embeddings @ text_embeddings.T / temperature

    # Labels: diagonal elements are positive pairs
    labels = torch.arange(batch_size, device=logits.device)

    # Image-to-text loss: for each image, find its text
    loss_i2t = F.cross_entropy(logits, labels)

    # Text-to-image loss: for each text, find its image
    loss_t2i = F.cross_entropy(logits.T, labels)

    # Symmetric loss
    loss = (loss_i2t + loss_t2i) / 2
    return loss
```
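A quick sanity check, assuming randomly initialized embeddings (shapes are arbitrary): an untrained model cannot distinguish the correct pairing, so the loss should come out roughly around log(batch_size).

```python
batch_size, embed_dim = 256, 512
image_embeddings = torch.randn(batch_size, embed_dim, requires_grad=True)
text_embeddings = torch.randn(batch_size, embed_dim, requires_grad=True)

loss = contrastive_loss(image_embeddings, text_embeddings)
loss.backward()
print(loss.item())  # roughly in the vicinity of log(256) ≈ 5.5 for random embeddings
```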
With Learnable Temperature

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    def __init__(self, init_temperature=0.07, learnable=True):
        super().__init__()
        # Temperature as learnable parameter (log-space for stability)
        self.log_temperature = nn.Parameter(
            torch.log(torch.tensor(init_temperature)),
            requires_grad=learnable
        )

    def forward(self, image_embeddings, text_embeddings):
        temperature = torch.exp(self.log_temperature)

        # Normalize
        image_embeddings = F.normalize(image_embeddings, dim=-1)
        text_embeddings = F.normalize(text_embeddings, dim=-1)

        # Compute logits
        logits = image_embeddings @ text_embeddings.T / temperature
        labels = torch.arange(len(logits), device=logits.device)

        # Symmetric loss
        loss_i2t = F.cross_entropy(logits, labels)
        loss_t2i = F.cross_entropy(logits.T, labels)
        return (loss_i2t + loss_t2i) / 2
```
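A sketch of how this module is typically wired into training; image_encoder, text_encoder, and dataloader are placeholders. The key point is that criterion.parameters() (the learnable temperature) goes into the optimizer alongside the encoder parameters.

```python
criterion = ContrastiveLoss(init_temperature=0.07, learnable=True)
optimizer = torch.optim.AdamW(
    list(image_encoder.parameters())      # placeholder encoders
    + list(text_encoder.parameters())
    + list(criterion.parameters()),       # the learnable temperature trains with the model
    lr=1e-4,
)

for images, texts in dataloader:          # placeholder dataloader
    loss = criterion(image_encoder(images), text_encoder(texts))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

CLIP additionally clamps the learned scale during training so the logits cannot grow without bound.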
Self-Supervised Vision (SimCLR-style)

```python
def simclr_loss(z_i, z_j, temperature=0.5):
    """
    Contrastive loss for self-supervised vision (SimCLR).

    Args:
        z_i, z_j: Two augmented views of the same batch (batch_size, embed_dim)
        temperature: Temperature parameter

    Returns:
        Contrastive loss
    """
    batch_size = z_i.shape[0]

    # Normalize
    z_i = F.normalize(z_i, dim=-1)
    z_j = F.normalize(z_j, dim=-1)

    # Concatenate both views
    representations = torch.cat([z_i, z_j], dim=0)  # (2 * batch_size, embed_dim)

    # Similarity matrix
    similarity_matrix = representations @ representations.T / temperature

    # Mask diagonal (self-similarity)
    mask = torch.eye(2 * batch_size, device=similarity_matrix.device).bool()
    similarity_matrix = similarity_matrix.masked_fill(mask, -1e9)

    # Positive pairs: (i, i + batch_size) and (i + batch_size, i)
    positive_indices = torch.cat([
        torch.arange(batch_size, 2 * batch_size),
        torch.arange(0, batch_size)
    ]).to(similarity_matrix.device)

    # Compute loss
    loss = F.cross_entropy(similarity_matrix, positive_indices)
    return loss
```

Key Components
1. Similarity Function
Cosine similarity (most common):

$$
\text{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$
Why normalize?
- Makes similarity scale-invariant
- Bounds similarity to [-1, 1]
- Focuses on direction, not magnitude
- Prevents representation collapse
Alternatives:
- Dot product (without normalization)
- Euclidean distance (less common)
- Learned similarity metrics
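A quick check of the normalization point above: after L2 normalization, a plain dot product and cosine similarity coincide, which is why the implementations normalize first and then use matrix products.

```python
import torch
import torch.nn.functional as F

u, v = torch.randn(4, 128), torch.randn(4, 128)
cos = F.cosine_similarity(u, v, dim=-1)
dot_normalized = (F.normalize(u, dim=-1) * F.normalize(v, dim=-1)).sum(dim=-1)
print(torch.allclose(cos, dot_normalized, atol=1e-6))  # True (up to numerical eps)
```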
2. Temperature Parameter τ
Temperature controls the "sharpness" of the softmax distribution over negatives:

- Low temperature (e.g., τ = 0.01):
  - Sharp, confident predictions
  - Strong penalty for incorrect negatives
  - Can lead to overconfident embeddings
- High temperature (e.g., τ = 0.5):
  - Smooth, uncertain predictions
  - Gentler training signal
  - May not separate embeddings enough
- Typical values: 0.07 (CLIP), 0.1-0.5 (SimCLR)
Learnable temperature: Most modern models learn τ as a parameter during training, often in log-space for stability.
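A quick illustration of the sharpening effect, using made-up similarity values:

```python
import torch

sims = torch.tensor([0.8, 0.6, 0.1])  # made-up cosine similarities
for tau in (0.01, 0.07, 0.5):
    probs = torch.softmax(sims / tau, dim=0)
    print(f"tau={tau}: {probs.tolist()}")
# Lower tau concentrates nearly all probability on the highest-similarity entry.
```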
3. Batch Size and Negative Sampling
Larger batches provide more negative samples:
Benefits:
- Better gradient estimates
- More diverse negatives
- Stronger contrastive signal
- More robust representations
Challenges:
- Memory cost (the similarity matrix is $O(N^2)$ in the batch size $N$)
- Computation cost
- Requires large-batch optimization techniques
Typical batch sizes:
- SimCLR: 4,096-8,192
- CLIP: 32,768 (!)
- MoCo: Uses queue of 65,536 negatives
Solutions for limited memory:
- Gradient accumulation (note: without caching embeddings across mini-batches, this does not increase the number of in-batch negatives)
- Negative queues (MoCo approach)
- Distributed training across GPUs (see the all-gather sketch below)
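For the distributed option, a common pattern in open-source contrastive training code, sketched here under the assumption that torch.distributed is already initialized, is to all-gather embeddings from every GPU so each rank contrasts against the global batch, splicing the local shard back in to preserve its gradients:

```python
import torch
import torch.distributed as dist

def gather_with_grad(local_emb):
    """All-gather embeddings from every rank so each GPU contrasts against the
    global batch. Gathered copies carry no gradient, so the local shard is
    spliced back in to keep its gradient path."""
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_emb) for _ in range(world_size)]
    dist.all_gather(gathered, local_emb)
    gathered[dist.get_rank()] = local_emb  # re-insert the differentiable local shard
    return torch.cat(gathered, dim=0)

# Usage idea: logits = image_emb @ gather_with_grad(text_emb).T / temperature,
# with target indices offset by rank * local_batch_size.
```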
Contrastive Learning Variants
1. SimCLR (Self-Supervised Vision)
Key idea: Different augmentations of the same image are positive pairs.
```python
# Two random augmentations of each image
x_i = augment(image)  # e.g., crop + color jitter + blur
x_j = augment(image)  # different random augmentation

# Encode both views
z_i = encoder(x_i)
z_j = encoder(x_j)

# Contrastive loss: pull z_i and z_j together
loss = simclr_loss(z_i, z_j)
```

Why it works: Augmentations preserve semantic content while changing appearance, so the model learns representations that are invariant to those appearance changes.
2. MoCo (Momentum Contrast)
Key idea: Use a momentum-updated encoder and a queue of negatives.
Advantages:
- Large number of negatives (65k+) without large batches
- Consistent negative representations (momentum encoder)
- Memory-efficient
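A minimal sketch of the two core MoCo mechanics (momentum update and negative queue), simplified from the paper's pseudocode; encoder_q and encoder_k are assumed to be architecturally identical query and key encoders, and the queue size is assumed divisible by the batch size.

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # Key encoder tracks the query encoder as an exponential moving average.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def enqueue_dequeue(queue, ptr, keys):
    # Overwrite the oldest entries of the fixed-size negative queue with the
    # newest keys; queue has shape (queue_size, embed_dim).
    batch_size = keys.shape[0]
    queue[ptr:ptr + batch_size] = keys
    return (ptr + batch_size) % queue.shape[0]
```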
3. CLIP (Vision-Language)
Key idea: Natural image-text pairs from the internet as positives.
Advantages:
- Scalable to billions of pairs (no manual labeling)
- Open-vocabulary (not limited to fixed classes)
- Zero-shot transfer to new tasks
See CLIP for details.
4. Triplet Loss
An earlier contrastive approach:

$$
\mathcal{L}_{\text{triplet}} = \max\big(0,\; d(a, p) - d(a, n) + \alpha\big)
$$

where $a$ is the anchor, $p$ is the positive, $n$ is the negative, $d$ is a distance function, and $\alpha$ is the margin.
Difference from InfoNCE:
- InfoNCE contrasts against all negatives in batch
- Triplet loss uses a single negative at a time
- InfoNCE generally more effective and efficient
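For reference, PyTorch ships a built-in triplet loss, so a minimal example looks like this (random tensors stand in for real embeddings):

```python
import torch
import torch.nn.functional as F

# Random tensors stand in for real anchor/positive/negative embeddings.
anchor, positive, negative = (torch.randn(32, 128) for _ in range(3))

# PyTorch's built-in triplet loss: max(0, d(a, p) - d(a, n) + margin)
loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
```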
Why Contrastive Learning Works
1. Self-Supervision
No manual labels needed—supervision comes from:
- Natural co-occurrence (image-caption pairs)
- Data augmentation (different views of same image)
- Temporal consistency (frames in a video)
This enables scaling to billions of training examples.
2. Implicit Hard Negative Mining
Within each batch, the model encounters:
- Easy negatives: Very different examples (pushed apart easily)
- Hard negatives: Similar but different examples (more informative)
Hard negatives provide the strongest learning signal. Larger batches increase probability of hard negatives.
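To inspect hard negatives directly, one simple diagnostic is to look at the highest-similarity mismatched entries in the batch similarity matrix; image_embeddings and text_embeddings here are assumed to be (batch_size, embed_dim) encoder outputs.

```python
import torch
import torch.nn.functional as F

with torch.no_grad():
    sims = F.normalize(image_embeddings, dim=-1) @ F.normalize(text_embeddings, dim=-1).T
    sims.fill_diagonal_(float("-inf"))      # exclude each anchor's own positive
    hardest_negative = sims.argmax(dim=1)   # most confusable text for each image
```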
3. Invariance Learning
Contrastive learning forces the model to:
- Ignore superficial variations (lighting, viewpoint, style)
- Capture semantic content
- Learn robust, transferable features
4. Representation Quality
Contrastive pre-training creates representations that:
- Transfer well to downstream tasks
- Enable zero-shot capabilities (e.g., CLIP)
- Require less fine-tuning data
- Often generalize better than supervised pre-training on a fixed label set
Practical Considerations
False Negatives
Problem: Batch may accidentally contain similar pairs that aren’t matched.
Example: In a batch of dog images with captions, “golden retriever on grass” and “labrador on grass” may be semantically similar but treated as negatives.
Impact: Model pushes these apart incorrectly, hurting performance.
Solutions:
- Use larger, more diverse datasets (reduces collision probability)
- Carefully curate training data
- Use supervised contrastive learning (multiple positives per anchor; sketched below)
- Apply soft labels instead of hard negatives
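A simplified sketch of the supervised contrastive idea (loosely following SupCon; an illustrative reduction rather than the paper's exact formulation): every other same-label example in the batch is treated as a positive.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Every other example with the same label counts as a positive
    (illustrative simplification of the SupCon loss)."""
    z = F.normalize(embeddings, dim=-1)
    logits = z @ z.T / temperature

    n = z.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, -1e9)  # never contrast with self

    # Positive mask from labels (excluding self)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0  # anchors with at least one positive
    mean_log_prob_pos = (log_prob * pos_mask.float()).sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()
```

The key difference from InfoNCE is that the positive set per anchor can contain several examples rather than exactly one.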
Representation Collapse
Problem: Model maps all inputs to the same point (trivial solution).
Why it happens:
- A constant encoder makes all similarities equal, so the loss stalls at the uninformative value $\log N$ and provides no useful training signal
- No explicit penalty against collapse in basic contrastive loss
Solutions:
- Normalization (forces embeddings onto unit sphere)
- Momentum (key) encoder with stop-gradient (MoCo)
- Batch normalization or LayerNorm
- Predictor head with different architecture (BYOL)
Computational Cost
Contrastive learning is expensive:
- Large batches for good negatives
- The similarity matrix is $O(N^2)$ in the batch size
- Multiple augmentations per image
Solutions:
- Distributed training (data parallelism)
- Gradient accumulation
- Efficient negative sampling (MoCo queue)
- Mixed precision training (float16)
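A minimal mixed-precision training sketch using PyTorch AMP; the encoders, optimizer, and dataloader are placeholders.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for images, texts in dataloader:  # placeholder dataloader, encoders, optimizer
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = contrastive_loss(image_encoder(images), text_encoder(texts))
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```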
Hyperparameter Sensitivity
Key hyperparameters:
- Temperature: Typically 0.07-0.5, often learned
- Batch size: Larger is better (but more expensive)
- Augmentations: Critical for self-supervised vision
- Optimizer: AdamW with warmup common
- Learning rate: Often higher than supervised learning
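A common recipe, sketched with a placeholder model and illustrative step counts: AdamW plus linear warmup into cosine decay.

```python
import math
import torch

# Placeholder model; learning rate, weight decay, and step counts are illustrative.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)

def warmup_cosine(step, warmup_steps=2_000, total_steps=100_000):
    # Linear warmup followed by cosine decay.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
# Call scheduler.step() once per optimization step (not per epoch) with this setup.
```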
Applications
Vision-Language Models
- CLIP: Aligns images and text
- ALIGN: Similar to CLIP with noisy web data
- Florence: Large-scale vision-language foundation model
Self-Supervised Vision
- SimCLR: Augmentation-based contrastive learning
- MoCo v3: Momentum contrast with ViT
- DINO: Self-distillation with no labels
Multimodal Healthcare
- Align EHR text with medical images
- Align symptom descriptions with body sketches
- Contrastive pre-training on (report, image) pairs
Recommendation Systems
- Align user embeddings with item embeddings
- Session-based contrastive learning
- Multi-view contrastive learning
Contrastive vs Supervised Learning
| Aspect | Supervised Learning | Contrastive Learning |
|---|---|---|
| Labels | Requires manual class labels | Uses natural pairing or augmentation |
| Scalability | Limited by annotation cost | Scales to billions of examples |
| Flexibility | Fixed to predefined classes | Open-vocabulary, flexible |
| Transfer | Often requires fine-tuning | Strong zero-shot capabilities |
| Data Efficiency | Needs many labeled examples | Pre-trains on unlabeled data |
| Representation | Task-specific | General-purpose |
Key Takeaways
- Self-supervision: Learn from natural structure, not manual labels
- InfoNCE loss: Contrast positive pairs against batch negatives
- Temperature: Controls distribution sharpness, often learned
- Batch size matters: More negatives = better representations = more cost
- False negatives: Can hurt performance on small, homogeneous datasets
- Scalability: Enables training on billions of examples
- Transfer learning: Pre-trained contrastive models transfer exceptionally well
Related Concepts
- Multimodal Learning - Contrastive learning aligns modalities
- CLIP - Large-scale contrastive vision-language pre-training
- Optimization - Training with contrastive losses
- Regularization - Preventing representation collapse
- Vision Transformer - Often used as encoder in contrastive learning
Learning Resources
Foundational Papers
- van den Oord et al. (2018) - Representation Learning with Contrastive Predictive Coding (CPC, introduced InfoNCE)
- Chen et al. (2020) - SimCLR: A Simple Framework for Contrastive Learning
- He et al. (2020) - Momentum Contrast for Unsupervised Visual Representation Learning (MoCo)
- Radford et al. (2021) - CLIP: Learning Transferable Visual Models From Natural Language Supervision
Tutorials
- Lilian Weng - Contrastive Representation Learning
- Google Research - Understanding Contrastive Learning
- Hugging Face - Contrastive Learning Guide