
Self-Supervised Learning

Self-supervised learning is a paradigm for training deep learning models on unlabeled data by designing pretext tasks that create supervision signals from the data itself.

The Core Motivation

Problem: Labeled data is expensive and scarce, especially in specialized domains requiring expert annotation.

Solution: Learn useful representations from abundant unlabeled data before fine-tuning on limited labeled examples.

Key Insight: We can create learning tasks from the data structure itself, without requiring human labels.

Why Self-Supervised Learning Matters

The Data Reality

In many domains, we face an imbalance:

  • Abundant unlabeled data: Millions of examples (web images, text, medical records)
  • Scarce labeled data: Only a small subset has outcome labels
  • Expensive annotation: Expert review is costly and time-consuming

The Solution: Two-Stage Training

# Stage 1: Self-supervised pre-training on ALL data
# Learn general representations without labels
pretrain_data = all_unlabeled_examples           # Millions of examples
model = pretrain_self_supervised(pretrain_data)

# Stage 2: Supervised fine-tuning on LABELED data
# Adapt pre-trained representations to specific task
labeled_data = small_labeled_subset              # Thousands of examples
final_model = fine_tune(model, labeled_data, task="classification")

This two-stage approach powers modern foundation models (BERT, GPT, CLIP) - pre-train on massive unlabeled datasets, then fine-tune for specific applications.

Two Main Paradigms

Self-supervised learning has converged on two complementary approaches:

1. Contrastive Learning

Core Idea: Pull similar examples together, push dissimilar examples apart in representation space.

Method:

  1. Create multiple augmented views of the same data point
  2. Make their representations similar (positive pairs)
  3. Make them different from other data points (negative pairs)
  4. Learn representations that capture semantic similarity

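As a minimal sketch of steps 2-3, here is the NT-Xent objective popularized by SimCLR, assuming PyTorch and that z1 and z2 are the embeddings of the two augmented views of the same batch (the temperature value is illustrative):

import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss over a batch of positive pairs (z1[i], z2[i])."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # [2n, dim], unit norm
    sim = z @ z.T / temperature                          # cosine similarities
    sim.fill_diagonal_(float('-inf'))                    # a view is not its own negative
    idx = torch.arange(n, device=z.device)
    targets = torch.cat([idx + n, idx])                  # each view's positive is its pair
    return F.cross_entropy(sim, targets)                 # positive vs. all other samples
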
Key Methods:

  • SimCLR (Simple Framework for Contrastive Learning)
  • MoCo (Momentum Contrast)
  • BYOL (Bootstrap Your Own Latent)
  • CLIP (for vision-language)

See Contrastive Learning for detailed coverage.

2. Masked Prediction

Core Idea: Predict missing parts of the input to learn contextual relationships.

Method:

  1. Hide/mask portions of the input data
  2. Train model to reconstruct the masked portions
  3. Forces model to understand context and structure
  4. Learn representations that capture dependencies

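A minimal PyTorch sketch of steps 1-3 for text, in the spirit of BERT's masked language modeling; encoder, lm_head, and mask_token_id are assumed to be supplied by the caller, and the 15% masking rate is just the conventional default:

import torch
import torch.nn.functional as F

def masked_lm_step(encoder, lm_head, token_ids, mask_token_id, mask_prob=0.15):
    """One masked-prediction step: hide random tokens, score only hidden positions."""
    mask = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    inputs = token_ids.clone()
    inputs[mask] = mask_token_id                  # replace chosen tokens with [MASK]
    labels = token_ids.clone()
    labels[~mask] = -100                          # ignore loss on visible tokens
    hidden = encoder(inputs)                      # [batch, seq, dim]
    logits = lm_head(hidden)                      # [batch, seq, vocab]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=-100)
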
Key Methods:

  • BERT (Bidirectional Encoder Representations from Transformers) - masks words in text
  • MAE (Masked Autoencoders) - masks image patches
  • Masked language modeling as a general pre-training objective for language models

See Masked Prediction for detailed coverage.

Comparison to Supervised Learning

Approach | Training Data | Labeled Data Required | Performance on Small Labeled Sets
Supervised Only | Only labeled examples | 100% | Poor (overfits easily with limited data)
Self-Supervised + Fine-tuning | All unlabeled + labeled subset | Labels only for fine-tuning (~10-20%) | Excellent (leverages all data)
Transfer Learning | Pre-trained on different domain | Target domain labels | Good (if domains are similar)

Key Advantages

  1. Data Efficiency: Leverage millions of unlabeled examples to improve performance on thousands of labeled examples
  2. Better Generalization: Pre-trained representations reduce overfitting on small labeled sets
  3. Domain Adaptation: Learn domain-specific patterns before task-specific fine-tuning
  4. Cost Effective: Dramatically reduces labeling requirements

Domain-Specific Applications

Healthcare Example

Scenario: 8M emergency department visits, but only 100K have specific outcome labels (e.g., hospital admission within 7 days).

Approach:

  1. Pre-train on all 8M visits using self-supervised learning (no labels needed)
  2. Fine-tune on 100K labeled visits for admission prediction
  3. Achieve much better performance than training only on 100K labeled examples

Result: Self-supervised pre-training can improve performance by 20-40% compared to supervised-only training on the labeled subset.

Other Applications

  • Medical Imaging: Pre-train on millions of unlabeled scans, fine-tune on smaller labeled datasets
  • Clinical NLP: Pre-train on all clinical notes (like ClinicalBERT)
  • Scientific Domains: Learn from abundant experimental data before task-specific prediction
  • Industry: Pre-train on internal proprietary data without requiring expensive labeling

The Foundation Model Era

Self-supervised learning is the core training paradigm for foundation models:

Language:

  • GPT series: Predict the next token (autoregressive language modeling, closely related to masked prediction)
  • BERT: Masked language modeling
  • Clinical language models: Pre-train on medical text

Vision:

  • Vision Transformers: Masked image modeling (MAE)
  • CLIP: Contrastive vision-language learning
  • Medical imaging models: Contrastive or masked pre-training

Multimodal:

  • CLIP: Contrastive image-text pairs
  • Flamingo, BLIP-2: Multi-stage self-supervised + supervised pre-training
  • Healthcare VLMs: Image-report contrastive learning

Key Principles

1. Design Good Pretext Tasks

The pretext task should:

  • Be solvable from the data alone (no labels)
  • Force the model to learn useful representations
  • Transfer well to downstream tasks

2. Data Augmentation is Critical

For contrastive learning:

  • Create views that preserve semantic content
  • Augmentations should be strong enough to provide learning signal
  • But not so strong they destroy the underlying concept
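
For concreteness, here is a SimCLR-style image augmentation pipeline sketched with torchvision; the specific transforms and parameters are illustrative choices, not a prescription:

from torchvision import transforms

# Two independent draws from this pipeline give two "views" of one image:
# different enough to provide a learning signal, but each still showing the
# same underlying object.
contrastive_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])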

3. Large-Scale Pre-Training

Self-supervised learning benefits from:

  • Large unlabeled datasets (millions to billions of examples)
  • Long pre-training (multiple epochs over large data)
  • Computational resources (distributed training)

4. Fine-Tuning Strategies

After pre-training:

  • Full fine-tuning: Update all parameters (best performance, requires more labeled data)
  • Linear probing: Freeze representations, train only final layer (data efficient)
  • Few-shot learning: Fine-tune with very few examples (most data efficient)
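
A minimal sketch of the linear probing option in PyTorch, assuming the encoder exposes a known feature dimension; the names and learning rate here are illustrative:

import torch
import torch.nn as nn

def linear_probe(encoder, feature_dim, num_classes):
    """Freeze the pre-trained encoder; train only a linear classifier on top."""
    for p in encoder.parameters():
        p.requires_grad = False        # keep pre-trained representations fixed
    encoder.eval()                     # also freeze batch-norm / dropout behavior
    classifier = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
    return classifier, optimizer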

Implementation Pattern

import torch
import torch.nn as nn

class SelfSupervisedModel(nn.Module):
    """General pattern for self-supervised learning"""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder  # Backbone (ResNet, Transformer, etc.)
        self.projection_head = nn.Linear(encoder.output_dim, 128)

    def forward(self, x):
        # Extract features
        features = self.encoder(x)
        # Project to embedding space
        embeddings = self.projection_head(features)
        return embeddings

# Pre-training loop (contrastive example)
def pretrain(model, unlabeled_data, epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(epochs):
        for batch in unlabeled_data:
            # Create two augmented views
            # (augment() and contrastive_loss() are user-supplied, e.g.
            # random crops/color jitter and an NT-Xent-style loss)
            view1, view2 = augment(batch), augment(batch)

            # Get embeddings
            z1 = model(view1)
            z2 = model(view2)

            # Contrastive loss (pull view1 and view2 together)
            loss = contrastive_loss(z1, z2)

            # Update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

# Fine-tuning loop
def fine_tune(pretrained_model, labeled_data, num_classes, epochs=20):
    # Replace projection head with task-specific head
    model = pretrained_model.encoder
    classifier = nn.Linear(model.output_dim, num_classes)

    optimizer = torch.optim.Adam([
        {'params': model.parameters(), 'lr': 1e-4},       # Lower LR for pre-trained
        {'params': classifier.parameters(), 'lr': 1e-3},  # Higher LR for new layer
    ])

    for epoch in range(epochs):
        for batch_x, batch_y in labeled_data:
            features = model(batch_x)
            logits = classifier(features)
            loss = nn.CrossEntropyLoss()(logits, batch_y)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model, classifier

Historical Context

Self-supervised learning has evolved from early approaches to become the dominant pre-training paradigm:

Early Work (2000s-2015):

  • Autoencoders for unsupervised feature learning
  • Restricted Boltzmann Machines (RBMs)
  • Word2Vec (2013) - an early, widely adopted self-supervised approach to learning word embeddings

Contrastive Learning Era (2018-2020):

  • Instance discrimination (2018)
  • MoCo (2019), SimCLR (2020) - breakthrough contrastive methods
  • BYOL (2020) - contrastive without explicit negatives

Masked Prediction Era (2018-present):

  • BERT (2018) - masked language modeling revolution
  • GPT series (2018-present) - autoregressive language modeling
  • MAE (2021) - masked autoencoders for images

Multimodal Era (2021-present):

  • CLIP (2021) - vision-language contrastive learning
  • Modern foundation models combine both paradigms

Further Reading

Papers

  • SimCLR: “A Simple Framework for Contrastive Learning of Visual Representations” (Chen et al., 2020)
  • MoCo: “Momentum Contrast for Unsupervised Visual Representation Learning” (He et al., 2020)
  • BYOL: “Bootstrap Your Own Latent” (Grill et al., 2020)
  • BERT: “BERT: Pre-training of Deep Bidirectional Transformers” (Devlin et al., 2018)
  • MAE: “Masked Autoencoders Are Scalable Vision Learners” (He et al., 2021)

Tutorials

  • Lilian Weng: “Self-Supervised Representation Learning” - Comprehensive blog post
  • SimCLR Paper Explained: Step-by-step walkthrough (Yannic Kilcher)
  • BERT Illustrated: Visual guide to masked language modeling (Jay Alammar)

Code

  • SimCLR: Official TensorFlow implementation
  • MoCo: Official PyTorch implementation
  • Hugging Face Transformers: BERT and other self-supervised models
  • timm library: Pre-trained vision models with self-supervised variants