
Advanced Training Topics

Master cutting-edge concepts in deep learning training: self-supervised learning methods that enable learning from unlabeled data, and a modern understanding of training dynamics, including the double descent phenomenon.

Optional Advanced Module

This path covers advanced topics that enhance your deep learning understanding but aren’t strictly required for basic applications. Focus on foundation and intermediate paths first, then return here when ready for advanced theory.

Learning Objectives

By the end of this path, you will:

  • Understand self-supervised learning: Master the two main paradigms (contrastive learning and masked prediction)
  • Design pre-training strategies: Create effective pre-training approaches for domains with limited labels
  • Understand modern generalization: Grasp double descent and why overparameterized models generalize
  • Apply advanced techniques: Use self-supervised pre-training in your own projects

Prerequisites

Required paths:

Required knowledge:

  • Neural network training fundamentals
  • Bias-variance tradeoff
  • Regularization techniques
  • Transformer architecture

Path Structure

Duration: 1 week (8-12 hours total)

Format: Concept-driven with theoretical depth

Week 1: Self-Supervised Learning and Training Dynamics

Day 1-2: Self-Supervised Learning Foundations (3-4 hours)

Core Concept

Self-Supervised Learning - Learning from unlabeled data

Key topics:

  • Motivation: Why self-supervised learning?
  • Two-stage training paradigm (pre-training + fine-tuning)
  • The data efficiency problem
  • Foundation model pre-training strategy

Learning activities:

  1. Read the self-supervised learning concept page
  2. Understand the two main paradigms:
    • Contrastive learning (pull similar together, push dissimilar apart); see the loss sketch after this list
    • Masked prediction (reconstruct hidden portions)
  3. Study real-world applications (BERT, GPT, CLIP pre-training)
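
To make the contrastive paradigm concrete, here is a minimal PyTorch sketch of an InfoNCE-style objective, the loss family behind SimCLR and CLIP. The function name, batch size, and embedding dimension are illustrative assumptions, not code from either paper:

import torch
import torch.nn.functional as F

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Contrastive loss over a batch of paired embeddings.
    z_a[i] and z_b[i] are two views of the same example (the positive pair);
    every other row in the batch acts as a negative."""
    z_a = F.normalize(z_a, dim=1)              # unit-norm so dot products are cosine similarities
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(z_a.shape[0])       # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Usage: 8 paired embeddings of dimension 128
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))

Minimizing this loss pulls each positive pair together while pushing it away from every other example in the batch, which is exactly the "pull similar together, push dissimilar apart" idea above.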

Healthcare connection:

  • Scenario: 8M emergency department visits, only 100K with outcome labels
  • Solution: Pre-train on all 8M visits (self-supervised) → fine-tune on 100K labeled examples
  • Result: 20-40% improvement over training only on labeled data

Review these concepts for deeper understanding:

Checkpoint:

  • Can explain the motivation for self-supervised learning
  • Understand the difference between contrastive and masked prediction
  • Know when to use self-supervised pre-training in your own projects

Day 3-4: Masked Prediction Methods (3-4 hours)

Core Concept

Masked Prediction - Learning by predicting hidden input

Key topics:

  • BERT: Masked language modeling (hide 15% of tokens, predict them)
  • MAE: Masked autoencoders for images (hide 75% of patches!)
  • EHR applications: Masked event prediction for patient trajectories

Deep dive:

  1. BERT masking strategy: 80-10-10 rule (80% mask, 10% random, 10% keep)
  2. MAE architecture: Asymmetric encoder-decoder design for efficiency
  3. Why high mask ratios work: 75% masking forces global understanding

Implementation focus:

# BERT-style masked prediction
def mask_events(sequence, mask_prob=0.15):
    # Select 15% of positions
    # Apply 80-10-10 masking strategy
    # Train model to predict masked events
    ...

# MAE-style masked prediction
def mask_image_patches(image, mask_ratio=0.75):
    # Mask 75% of image patches
    # Encode only visible patches (efficient!)
    # Decode to reconstruct full image
    ...
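
A fleshed-out version of these two stubs might look like the sketch below. This is a hedged illustration, not BERT's or MAE's reference code: the [MASK] token id, vocabulary size, and tensor shapes are assumptions.

import torch

def mask_events(sequence, mask_prob=0.15, mask_token_id=1, vocab_size=5000, ignore_index=-100):
    """BERT-style 80-10-10 masking for a 1-D tensor of integer event/token ids."""
    sequence, labels = sequence.clone(), sequence.clone()
    selected = torch.rand(sequence.shape) < mask_prob        # ~15% of positions get predicted
    labels[~selected] = ignore_index                         # loss is computed only on selected positions
    roll = torch.rand(sequence.shape)
    sequence[selected & (roll < 0.8)] = mask_token_id        # 80% of selected: replace with [MASK]
    swap = selected & (roll >= 0.8) & (roll < 0.9)           # 10% of selected: replace with a random id
    sequence[swap] = torch.randint(0, vocab_size, sequence.shape)[swap]
    return sequence, labels                                  # remaining 10%: left unchanged

def mask_image_patches(patches, mask_ratio=0.75):
    """MAE-style masking for patches of shape (num_patches, patch_dim): keep a random 25%."""
    num_keep = int(patches.shape[0] * (1 - mask_ratio))
    perm = torch.randperm(patches.shape[0])
    keep_idx, masked_idx = perm[:num_keep], perm[num_keep:]
    return patches[keep_idx], keep_idx, masked_idx           # the encoder sees only the kept patches

In a full MAE, a lightweight decoder then takes the encoded visible patches plus placeholder mask tokens and reconstructs the pixel values of the masked patches.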

Healthcare application:

  • Mask medical event codes in EHR sequences
  • Model learns temporal patterns and clinical relationships
  • Fine-tune for specific prediction tasks (readmission, mortality)

Checkpoint:

  • Understand BERT’s 80-10-10 masking strategy and why it works
  • Know why MAE uses 75% masking (not 15% like BERT)
  • Can design a masked prediction task for sequential data

Day 5-7: Training Dynamics and Double Descent (2-4 hours)

Core Concept

Training Dynamics and Double Descent - Why overparameterization helps

Key topics:

  • Double descent curve: Test error goes down → up → down again
  • Interpolation threshold: The worst performance zone
  • Overparameterization benefits: Larger models generalize better
  • Implicit regularization: How SGD finds good solutions

Critical insights:

  1. Classical view (wrong): More parameters → overfitting
  2. Modern reality: Past interpolation threshold, more parameters → better generalization
  3. Grokking: Generalization can happen long after perfect train accuracy

The three regimes:

Regime | Performance | Example
Underparameterized | Improves with size | Small linear models
Interpolation threshold | Worst performance | Just fits training data
Overparameterized | Improves with size! | Modern deep networks
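
The down-up-down shape can be reproduced in a few lines with random features and minimum-norm least squares. This is an illustrative sketch, not from the double descent paper; the constants are arbitrary, and the error peak typically appears where the number of features matches the number of training points:

import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 1000, 5
X_train, X_test = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
w_true = rng.normal(size=d)
y_train = X_train @ w_true + 0.3 * rng.normal(size=n_train)   # noisy labels
y_test = X_test @ w_true

for n_features in [5, 10, 20, 40, 80, 320, 1280]:
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)          # random ReLU feature map
    Phi_train = np.maximum(0.0, X_train @ W)
    Phi_test = np.maximum(0.0, X_test @ W)
    coef = np.linalg.pinv(Phi_train) @ y_train                 # minimum-norm least squares
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"{n_features:5d} features  test MSE = {test_mse:.3f}")

Test error usually falls at first, spikes near 40 features (the interpolation threshold, where the model just barely fits the 40 training points), then falls again in the overparameterized regime.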

Practical implications:

  • ✅ Use large models when possible
  • ✅ Don’t fear perfect training accuracy
  • ✅ Train past perfect accuracy (grokking!)
  • ❌ Don’t stop at interpolation threshold
  • ❌ Don’t assume perfect train acc = overfitting

Essential viewing:

  • “What the Books Get Wrong about AI (Double Descent)” by Welch Labs
  • This 20-minute video is one of the most important in the entire curriculum

Checkpoint:

  • Understand the double descent phenomenon
  • Know why the interpolation threshold is dangerous
  • Can explain why larger models generalize better
  • Watched the Welch Labs double descent video

Key Concepts Summary

Three Core Concepts

  1. Self-Supervised Learning
    • Learning from unlabeled data
    • Two paradigms: contrastive and masked prediction
    • Foundation for modern pre-training
  2. Masked Prediction
    • BERT for sequences (mask 15% of tokens)
    • MAE for images (mask 75% of patches)
    • Applications to EHR and sequential data
  3. Training Dynamics
    • Double descent curve
    • Overparameterization benefits
    • Grokking and delayed generalization

Supporting Concepts

Practical Applications

Domain-Specific Pre-Training

Healthcare example:

# Schematic two-stage pipeline: mask_ehr_events and cross_entropy stand in for
# project-specific helpers (e.g. torch.nn.functional.cross_entropy).

# Stage 1: Self-supervised pre-training (ALL data, no labels)
def pretrain_on_all_visits(model, all_8M_visits, optimizer):
    """Pre-train on all visits using masked event prediction."""
    for epoch in range(100):
        for batch in all_8M_visits:
            # Mask ~15% of events (80-10-10 strategy)
            masked_events, targets = mask_ehr_events(batch)
            # Predict the masked events
            predictions = model(masked_events)
            loss = cross_entropy(predictions, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model  # Pre-trained representations

# Stage 2: Supervised fine-tuning (LABELED data only)
def finetune_on_labeled(pretrained_model, labeled_100K_visits, optimizer):
    """Fine-tune on the smaller labeled subset."""
    for epoch in range(20):
        for batch_x, batch_y in labeled_100K_visits:
            predictions = pretrained_model(batch_x)
            loss = cross_entropy(predictions, batch_y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return pretrained_model  # Final task-specific model

Result: 20-40% improvement vs. training only on 100K labeled examples

When to Use Self-Supervised Learning

Use self-supervised pre-training when:

  • ✅ You have abundant unlabeled data (millions of examples)
  • ✅ Labels are expensive (require expert annotation)
  • ✅ You have a small labeled subset (thousands to hundreds of thousands)
  • ✅ Domain-specific patterns exist in unlabeled data

Examples:

  • Medical imaging: Millions of scans, limited diagnoses
  • Clinical text: All clinical notes (unlabeled) → specific outcome prediction (labeled)
  • EHR sequences: All patient visits → specific outcome labels
  • Scientific data: Abundant experimental data → limited annotations

Design Principles

Choosing the paradigm:

Use Case | Paradigm | Method | Example
Sequential data | Masked prediction | BERT-style | EHR events, clinical notes, time series
Images | Masked prediction or contrastive | MAE or SimCLR | Medical scans, satellite imagery
Multimodal | Contrastive | CLIP-style | Image-text pairs, radiology reports
Text | Masked prediction | BERT/GPT | Clinical notes, scientific papers

Advanced Topics

The Lottery Ticket Hypothesis

Dense networks contain sparse “winning ticket” subnetworks:

  • Train a large network → find the important weights → retrain the sparse subnetwork (see the sketch below)
  • Offers one explanation for why overparameterization helps training
  • Enables efficient model compression
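
As a rough illustration of the pruning step, the sketch below builds a global magnitude mask. The function name and sparsity level are assumptions, and a faithful lottery-ticket experiment would also rewind the surviving weights to their original initialization before retraining:

import torch

def winning_ticket_mask(trained_weights, sparsity=0.8):
    """Keep the largest-magnitude weights globally; prune the rest."""
    all_magnitudes = torch.cat([w.abs().flatten() for w in trained_weights])
    threshold = torch.quantile(all_magnitudes, sparsity)      # cut-off below which weights are pruned
    return [(w.abs() > threshold).float() for w in trained_weights]

# During retraining, multiply each weight tensor (and its gradient) by its mask
# so that pruned connections stay at zero.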

Grokking Phenomenon

Models can keep improving on held-out data long after reaching perfect training accuracy:

  • Phase 1: Memorize the training data (fast)
  • Phase 2: Implicit regularization gradually simplifies the solution (slow)
  • Phase 3: Sudden generalization (grokking!)

Lesson: Don’t stop training too early!

Scaling Laws

Performance scales predictably with model size and data:

  • Larger models consistently outperform smaller ones when trained on enough data
  • Compute-optimal training uses roughly 20 tokens per parameter (the Chinchilla rule; see the worked example below)
  • See Language Model Scaling Laws
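
As a quick worked example of the Chinchilla rule of thumb (the model size below is arbitrary and only for illustration):

# tokens ≈ 20 × parameters
params = 7e9                    # a hypothetical 7B-parameter model
optimal_tokens = 20 * params    # 1.4e11, i.e. roughly 140 billion training tokens
print(f"{optimal_tokens:.2e} tokens")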

Assessment

Test your understanding of advanced training topics:

Self-Supervised Learning

  • Can explain the motivation for self-supervised learning
  • Know the difference between contrastive and masked prediction
  • Understand BERT’s masking strategy (80-10-10 rule)
  • Know why MAE uses 75% masking for images
  • Can design a self-supervised pre-training task for a new domain

Training Dynamics

  • Understand the double descent phenomenon
  • Can explain why test error peaks at the interpolation threshold
  • Know why overparameterized models generalize better
  • Understand implicit regularization via SGD
  • Can explain grokking and delayed generalization

Practical Application

  • Can design a two-stage training strategy (pre-train + fine-tune)
  • Know when to use self-supervised learning in your projects
  • Understand tradeoffs between contrastive and masked prediction
  • Can debug training issues using double descent insights

Next Steps

After completing this path, you’re ready for:

Advanced Paths

  • Continue exploring advanced deep learning research
  • Study specific self-supervised methods in detail (SimCLR, MoCo, BYOL)
  • Dive deeper into neural tangent kernel theory

Domain Applications

  • Apply self-supervised pre-training to your domain
  • Design custom pre-training tasks for your data type
  • Implement two-stage training pipelines

Healthcare Specialization

Resources

Essential Papers

Self-Supervised Learning:

  • SimCLR: “A Simple Framework for Contrastive Learning of Visual Representations” (Chen et al., 2020)
  • MoCo: “Momentum Contrast for Unsupervised Visual Representation Learning” (He et al., 2020)
  • BERT: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (Devlin et al., 2018)
  • MAE: “Masked Autoencoders Are Scalable Vision Learners” (He et al., 2021)

Training Dynamics:

  • Double Descent: “Deep Double Descent” (Nakkiran et al., 2019)
  • Lottery Ticket: “The Lottery Ticket Hypothesis” (Frankle & Carbin, 2019)
  • Grokking: “Grokking: Generalization Beyond Overfitting” (Power et al., 2021)

Videos

Essential viewing:

  • “What the Books Get Wrong about AI (Double Descent)” - Welch Labs (20 min) - Must watch!
  • “BERT Explained” - CodeEmporium
  • “Masked Autoencoders” - Yannic Kilcher

Blog Posts

  • “Self-Supervised Representation Learning” - Lilian Weng (comprehensive overview)
  • “The Illustrated BERT” - Jay Alammar (visual BERT guide)
  • “Understanding Double Descent” - Stanford CS229 notes

Code & Tutorials

  • Hugging Face Transformers: BERT and variants
  • timm library: MAE and vision self-supervised models
  • SimCLR implementation: PyTorch tutorial
  • BERT from scratch: MinBERT tutorial

Study Tips

  1. Start with self-supervised learning: Understand the motivation before diving into methods
  2. Watch the Welch Labs video: Double descent is counterintuitive - visual explanation helps
  3. Connect to previous concepts: Self-supervised learning builds on contrastive learning from the VLMs path
  4. Focus on principles: Don’t get lost in implementation details
  5. Think about your domain: How would you apply these concepts to your own data?

Time Estimates

  • Self-supervised learning foundations: 3-4 hours
  • Masked prediction methods: 3-4 hours
  • Training dynamics and double descent: 2-4 hours
  • Total: 8-12 hours over 1 week

This is an optional advanced module - take your time and revisit concepts as needed!