
Contrastive Learning

Contrastive learning is a self-supervised learning approach that learns representations by pulling similar examples together and pushing dissimilar examples apart in embedding space. It has become the foundation for modern vision-language models, self-supervised pre-training, and multimodal alignment.

Core Principle

Instead of learning from labeled data, contrastive learning exploits natural structure in data:

  • Positive pairs: Examples that should be similar (same class, image-caption pairs, augmentations of same image)
  • Negative pairs: Examples that should be different (different classes, mismatched pairs)

The model learns to:

  1. Maximize similarity between positive pairs
  2. Minimize similarity between negative pairs
  3. Create a semantically meaningful embedding space

This approach scales to billions of examples because it requires no manual labels—just the natural pairing or augmentation of data.

The InfoNCE Loss ⭐

The most widely used contrastive loss is InfoNCE (Information Noise-Contrastive Estimation), popularized by CLIP:

$$\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\text{sim}(x_i, x_i^+) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(x_i, x_j) / \tau)}$$

where:

  • $x_i$ is the anchor example (e.g., an image embedding)
  • $x_i^+$ is the positive example (e.g., the matching text embedding)
  • $x_j$ ranges over all examples in the batch (including the positive)
  • $\text{sim}$ is the similarity function (usually cosine similarity)
  • $\tau$ is the temperature parameter (controls distribution sharpness)
  • $N$ is the batch size

Intuition

For each anchor $x_i$:

  1. Numerator: Maximize similarity with its positive pair $x_i^+$
  2. Denominator: Contrast against all other examples in the batch

This is essentially a classification problem where the model must identify which of the $N$ examples is the correct match. The batch provides implicit negative samples without requiring explicit annotation.

Multimodal Contrastive Loss

For vision-language pairs (image $I$, text $T$), the loss is symmetric:

$$\mathcal{L}_{\text{contrastive}} = \frac{1}{2}\left(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\right)$$

  • $\mathcal{L}_{I \to T}$: Image-to-text retrieval (can we find the right text for this image?)
  • $\mathcal{L}_{T \to I}$: Text-to-image retrieval (can we find the right image for this text?)

This bidirectional loss ensures the embedding space is aligned in both directions.

PyTorch Implementation

Basic Contrastive Loss

import torch
import torch.nn.functional as F


def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """
    Compute symmetric contrastive loss for image-text pairs.

    Args:
        image_embeddings: (batch_size, embed_dim)
        text_embeddings: (batch_size, embed_dim)
        temperature: Temperature parameter τ

    Returns:
        Scalar contrastive loss
    """
    batch_size = image_embeddings.shape[0]

    # Normalize embeddings to unit hypersphere
    # This makes similarity = cosine similarity
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Compute similarity matrix: (batch_size, batch_size)
    # logits[i, j] = similarity between image i and text j
    logits = image_embeddings @ text_embeddings.T / temperature

    # Labels: diagonal elements are positive pairs
    labels = torch.arange(batch_size, device=logits.device)

    # Image-to-text loss: for each image, find its text
    loss_i2t = F.cross_entropy(logits, labels)

    # Text-to-image loss: for each text, find its image
    loss_t2i = F.cross_entropy(logits.T, labels)

    # Symmetric loss
    loss = (loss_i2t + loss_t2i) / 2

    return loss
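
As a quick sanity check, the loss can be called on random tensors; the batch size and embedding dimension below are illustrative, not taken from any particular model.

# Illustrative usage with random embeddings
image_emb = torch.randn(256, 512, requires_grad=True)
text_emb = torch.randn(256, 512, requires_grad=True)

loss = contrastive_loss(image_emb, text_emb, temperature=0.07)
loss.backward()  # gradients flow back into both sets of embeddings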

With Learnable Temperature

import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveLoss(nn.Module):
    def __init__(self, init_temperature=0.07, learnable=True):
        super().__init__()
        # Temperature as learnable parameter (log-space for stability)
        self.log_temperature = nn.Parameter(
            torch.log(torch.tensor(init_temperature)),
            requires_grad=learnable
        )

    def forward(self, image_embeddings, text_embeddings):
        temperature = torch.exp(self.log_temperature)

        # Normalize
        image_embeddings = F.normalize(image_embeddings, dim=-1)
        text_embeddings = F.normalize(text_embeddings, dim=-1)

        # Compute logits
        logits = image_embeddings @ text_embeddings.T / temperature

        labels = torch.arange(len(logits), device=logits.device)

        # Symmetric loss
        loss_i2t = F.cross_entropy(logits, labels)
        loss_t2i = F.cross_entropy(logits.T, labels)

        return (loss_i2t + loss_t2i) / 2

Self-Supervised Vision (SimCLR-style)

def simclr_loss(z_i, z_j, temperature=0.5):
    """
    Contrastive loss for self-supervised vision (SimCLR).

    Args:
        z_i, z_j: Two augmented views of the same batch (batch_size, embed_dim)
        temperature: Temperature parameter

    Returns:
        Contrastive loss
    """
    batch_size = z_i.shape[0]

    # Normalize
    z_i = F.normalize(z_i, dim=-1)
    z_j = F.normalize(z_j, dim=-1)

    # Concatenate both views
    representations = torch.cat([z_i, z_j], dim=0)  # (2 * batch_size, embed_dim)

    # Similarity matrix
    similarity_matrix = representations @ representations.T / temperature

    # Mask diagonal (self-similarity)
    mask = torch.eye(2 * batch_size, device=similarity_matrix.device).bool()
    similarity_matrix = similarity_matrix.masked_fill(mask, -1e9)

    # Positive pairs: (i, i + batch_size) and (i + batch_size, i)
    positive_indices = torch.cat([
        torch.arange(batch_size, 2 * batch_size),
        torch.arange(0, batch_size)
    ]).to(similarity_matrix.device)

    # Compute loss
    loss = F.cross_entropy(similarity_matrix, positive_indices)

    return loss

Key Components

1. Similarity Function

Cosine similarity (most common):

$$\text{sim}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{x}{\|x\|} \cdot \frac{y}{\|y\|}$$

Why normalize?

  • Makes similarity scale-invariant
  • Bounds similarity to [-1, 1]
  • Focuses on direction, not magnitude
  • Prevents representation collapse

Alternatives:

  • Dot product (without normalization)
  • Euclidean distance (less common)
  • Learned similarity metrics
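
The following sketch (not from the original article) verifies that a matrix product of L2-normalized embeddings reproduces pairwise cosine similarity; the shapes are arbitrary.

import torch
import torch.nn.functional as F

x = torch.randn(4, 128)
y = torch.randn(4, 128)

# Cosine similarity via explicit normalization + matrix product
cos_matrix = F.normalize(x, dim=-1) @ F.normalize(y, dim=-1).T

# Same values for matching rows, using the built-in helper
cos_diag = F.cosine_similarity(x, y, dim=-1)

print(torch.allclose(cos_matrix.diag(), cos_diag, atol=1e-6))  # True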

2. Temperature Parameter τ

Temperature controls the “sharpness” of the distribution:

  • Low temperature (e.g., τ = 0.01):

    • Sharp, confident predictions
    • Strong penalty for incorrect negatives
    • Can lead to overconfident embeddings
  • High temperature (e.g., τ = 0.5):

    • Smooth, uncertain predictions
    • Gentler training signal
    • May not separate embeddings enough
  • Typical values: 0.07 (CLIP), 0.1-0.5 (SimCLR)

Learnable temperature: Most modern models learn τ as a parameter during training, often in log-space for stability.
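
The effect of temperature is easy to see numerically; the similarity values below are made up purely for illustration.

import torch

sims = torch.tensor([0.9, 0.5, 0.1])  # one anchor vs. three candidates

for tau in (0.01, 0.07, 0.5):
    probs = torch.softmax(sims / tau, dim=0)
    print(f"tau={tau}: {[round(p, 3) for p in probs.tolist()]}")

# tau=0.01 -> essentially one-hot on the best match
# tau=0.5  -> much flatter distribution over all candidates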

3. Batch Size and Negative Sampling

Larger batches provide more negative samples:

Benefits:

  • Better gradient estimates
  • More diverse negatives
  • Stronger contrastive signal
  • More robust representations

Challenges:

  • Memory cost (similarity matrix is $O(N^2)$)
  • Computation cost
  • Requires large-batch optimization techniques

Typical batch sizes:

  • SimCLR: 4,096-8,192
  • CLIP: 32,768 (!)
  • MoCo: Uses queue of 65,536 negatives

Solutions for limited memory:

  • Gradient accumulation (accumulate over multiple mini-batches)
  • Negative queues (MoCo approach; see the sketch after this list)
  • Distributed training across GPUs
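
As a rough illustration of the negative-queue idea, here is a simplified MoCo-style sketch; it is not the full MoCo implementation, the class and function names are invented for this example, and queries/keys are assumed to be already L2-normalized.

import torch
import torch.nn.functional as F


class NegativeQueue:
    """Simplified FIFO queue of past key embeddings (MoCo-style sketch)."""

    def __init__(self, embed_dim, queue_size=65536):
        # Initialize with random normalized vectors until real keys fill the queue
        self.queue = F.normalize(torch.randn(queue_size, embed_dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):
        # Overwrite the oldest entries (assumes queue_size is a multiple of the batch size)
        batch = keys.shape[0]
        self.queue[self.ptr:self.ptr + batch] = keys
        self.ptr = (self.ptr + batch) % self.queue.shape[0]


def moco_style_loss(queries, keys, queue, temperature=0.07):
    # queries, keys: L2-normalized (batch, embed_dim); keys come from the momentum encoder
    pos = (queries * keys).sum(dim=-1, keepdim=True)   # (batch, 1)
    neg = queries @ queue.queue.T                       # (batch, queue_size)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(len(queries), dtype=torch.long, device=logits.device)  # positive is index 0
    return F.cross_entropy(logits, labels)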

Contrastive Learning Variants

1. SimCLR (Self-Supervised Vision)

Key idea: Different augmentations of the same image are positive pairs.

# Two random augmentations of each image
x_i = augment(image)  # e.g., crop + color jitter + blur
x_j = augment(image)  # different random augmentation

# Encode both views
z_i = encoder(x_i)
z_j = encoder(x_j)

# Contrastive loss: pull z_i and z_j together
loss = simclr_loss(z_i, z_j)

Why it works: Augmentations preserve semantic content while changing appearance, so the model learns representations that are invariant to those changes.
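
For reference, a SimCLR-style augmentation pipeline might look like the following; the exact operations and parameters here are illustrative rather than the paper's precise recipe.

from torchvision import transforms

# Illustrative SimCLR-style augmentations for 224x224 crops
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])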

2. MoCo (Momentum Contrast)

Key idea: Use a momentum-updated encoder and a queue of negatives.

Advantages:

  • Large number of negatives (65k+) without large batches
  • Consistent negative representations (momentum encoder)
  • Memory-efficient
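
The momentum update itself is an exponential moving average of the query encoder's weights. A minimal sketch (m = 0.999 is a commonly used value):

import torch


@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # Key encoder parameters drift slowly toward the query encoder's parameters
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1 - m)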

3. CLIP (Vision-Language)

Key idea: Natural image-text pairs from the internet as positives.

Advantages:

  • Scalable to billions of pairs (no manual labeling)
  • Open-vocabulary (not limited to fixed classes)
  • Zero-shot transfer to new tasks

See CLIP for details.

4. Triplet Loss

An earlier contrastive approach:

$$\mathcal{L}_{\text{triplet}} = \max\left(0,\; \text{sim}(a, n) - \text{sim}(a, p) + \text{margin}\right)$$

where $a$ is the anchor, $p$ the positive, and $n$ the negative.

Difference from InfoNCE:

  • InfoNCE contrasts against all negatives in batch
  • Triplet loss uses single negative at a time
  • InfoNCE generally more effective and efficient
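
A minimal cosine-similarity version of this hinge, written to match the formula above (the margin value is illustrative; PyTorch also ships a distance-based nn.TripletMarginLoss):

import torch
import torch.nn.functional as F


def triplet_loss(anchor, positive, negative, margin=0.2):
    # Push sim(a, p) above sim(a, n) by at least `margin`
    sim_ap = F.cosine_similarity(anchor, positive, dim=-1)
    sim_an = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(sim_an - sim_ap + margin).mean()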

Why Contrastive Learning Works

1. Self-Supervision

No manual labels needed—supervision comes from:

  • Natural co-occurrence (image-caption pairs)
  • Data augmentation (different views of same image)
  • Temporal consistency (frames in a video)

This enables scaling to billions of training examples.

2. Implicit Hard Negative Mining

Within each batch, the model encounters:

  • Easy negatives: Very different examples (pushed apart easily)
  • Hard negatives: Similar but different examples (more informative)

Hard negatives provide the strongest learning signal. Larger batches increase probability of hard negatives.

3. Invariance Learning

Contrastive learning forces the model to:

  • Ignore superficial variations (lighting, viewpoint, style)
  • Capture semantic content
  • Learn robust, transferable features

4. Representation Quality

Contrastive pre-training creates representations that:

  • Transfer well to downstream tasks
  • Enable zero-shot capabilities (e.g., CLIP)
  • Require less fine-tuning data
  • Generalize better than supervised pre-training

Practical Considerations

False Negatives

Problem: Batch may accidentally contain similar pairs that aren’t matched.

Example: In a batch of dog images with captions, “golden retriever on grass” and “labrador on grass” may be semantically similar but treated as negatives.

Impact: Model pushes these apart incorrectly, hurting performance.

Solutions:

  • Use larger, more diverse datasets (reduces collision probability)
  • Carefully curate training data
  • Use supervised contrastive learning (multiple positives per anchor; sketched below)
  • Apply soft labels instead of hard negatives
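
As a sketch of the supervised contrastive idea referenced above (multiple positives per anchor, along the lines of SupCon): this simplified version is an illustration, not a drop-in reference implementation.

import torch
import torch.nn.functional as F


def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    # labels: 1D LongTensor; all same-label examples in the batch are positives for each anchor
    embeddings = F.normalize(embeddings, dim=-1)
    logits = embeddings @ embeddings.T / temperature

    n = len(labels)
    self_mask = torch.eye(n, dtype=torch.bool, device=logits.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    # Log-softmax over all non-self candidates
    logits = logits.masked_fill(self_mask, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average log-probability of positives per anchor; skip anchors with no positive
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0).sum(dim=1) / pos_counts)
    return per_anchor[pos_mask.any(dim=1)].mean()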

Representation Collapse

Problem: Model maps all inputs to the same point (trivial solution).

Why it happens:

  • Pulling positives together can be satisfied trivially by mapping every input to the same point
  • If the encoder outputs a constant, all similarities are equal and the loss stalls at its chance level ($\log N$), giving no useful learning signal; variants that drop explicit negatives have even less protection against this

Solutions:

  • Normalization (forces embeddings onto unit sphere)
  • Stop-gradient or a momentum-updated target encoder (MoCo, BYOL)
  • Batch normalization or LayerNorm
  • Predictor head with different architecture (BYOL)

Computational Cost

Contrastive learning is expensive:

  • Large batches for good negatives
  • Similarity matrix is $O(N^2)$
  • Multiple augmentations per image

Solutions:

  • Distributed training (data parallelism)
  • Gradient accumulation
  • Efficient negative sampling (MoCo queue)
  • Mixed precision training (float16)
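
A minimal mixed-precision training step for the contrastive loss above; this is a sketch, and image_encoder, text_encoder, optimizer, and dataloader are placeholders rather than components defined in this article.

import torch

scaler = torch.cuda.amp.GradScaler()

for images, texts in dataloader:          # placeholder: yields paired image/text batches
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # run the forward pass in float16 where safe
        img_emb = image_encoder(images)   # placeholder encoders
        txt_emb = text_encoder(texts)
        loss = contrastive_loss(img_emb, txt_emb)
    scaler.scale(loss).backward()         # scale the loss to avoid float16 underflow
    scaler.step(optimizer)
    scaler.update()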

Hyperparameter Sensitivity

Key hyperparameters:

  • Temperature: Typically 0.07-0.5, often learned
  • Batch size: Larger is better (but more expensive)
  • Augmentations: Critical for self-supervised vision
  • Optimizer: AdamW with warmup common
  • Learning rate: Often higher than supervised learning

Applications

Vision-Language Models

  • CLIP: Aligns images and text
  • ALIGN: Similar to CLIP with noisy web data
  • Florence: Large-scale vision-language foundation model

Self-Supervised Vision

  • SimCLR: Augmentation-based contrastive learning
  • MoCo v3: Momentum contrast with ViT
  • DINO: Self-distillation with no labels

Multimodal Healthcare

  • Align EHR text with medical images
  • Align symptom descriptions with body sketches
  • Contrastive pre-training on (report, image) pairs

Recommendation Systems

  • Align user embeddings with item embeddings
  • Session-based contrastive learning
  • Multi-view contrastive learning

Contrastive vs Supervised Learning

Aspect | Supervised Learning | Contrastive Learning
--- | --- | ---
Labels | Requires manual class labels | Uses natural pairing or augmentation
Scalability | Limited by annotation cost | Scales to billions of examples
Flexibility | Fixed to predefined classes | Open-vocabulary, flexible
Transfer | Often requires fine-tuning | Strong zero-shot capabilities
Data efficiency | Needs many labeled examples | Pre-trains on unlabeled data
Representation | Task-specific | General-purpose

Key Takeaways

  1. Self-supervision: Learn from natural structure, not manual labels
  2. InfoNCE loss: Contrast positive pairs against batch negatives
  3. Temperature: Controls distribution sharpness, often learned
  4. Batch size matters: More negatives = better representations = more cost
  5. False negatives: Can hurt performance on small, homogeneous datasets
  6. Scalability: Enables training on billions of examples
  7. Transfer learning: Pre-trained contrastive models transfer exceptionally well
