
Advanced Training Topics

Master cutting-edge concepts in deep learning training: self-supervised learning methods that enable learning from unlabeled data, and a modern understanding of training dynamics, including the double descent phenomenon.

Optional Advanced Module

This path covers advanced topics that enhance your deep learning understanding but aren’t strictly required for basic applications. Focus on foundation and intermediate paths first, then return here when ready for advanced theory.

Learning Objectives

By the end of this path, you will:

  • Understand self-supervised learning: Master the two main paradigms (contrastive learning and masked prediction)
  • Design pre-training strategies: Create effective pre-training approaches for domains with limited labels
  • Understand modern generalization: Grasp double descent and why overparameterized models generalize
  • Apply advanced techniques: Use self-supervised pre-training in your own projects

Prerequisites

Required paths:

Required knowledge:

  • Neural network training fundamentals
  • Bias-variance tradeoff
  • Regularization techniques
  • Transformer architecture

Path Structure

Duration: 1 week (8-12 hours total)

Format: Concept-driven with theoretical depth

Week 1: Self-Supervised Learning and Training Dynamics

Day 1-2: Self-Supervised Learning Foundations (3-4 hours)

Core Concept

Self-Supervised Learning - Learning from unlabeled data

Key topics:

  • Motivation: Why self-supervised learning?
  • Two-stage training paradigm (pre-training + fine-tuning)
  • The data efficiency problem
  • Foundation model pre-training strategy

Learning activities:

  1. Read the self-supervised learning concept page
  2. Understand the two main paradigms:
    • Contrastive learning (pull similar together, push dissimilar apart); see the loss sketch after this list
    • Masked prediction (reconstruct hidden portions)
  3. Study real-world applications (BERT, GPT, CLIP pre-training)
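
To make the contrastive paradigm concrete, here is a minimal PyTorch sketch of an InfoNCE-style objective, the loss family behind SimCLR and CLIP. The function name, batch size, and embedding dimension are illustrative assumptions, not code from either paper:

import torch
import torch.nn.functional as F

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Contrastive loss over a batch of paired embeddings.
    z_a[i] and z_b[i] are two views of the same example (the positive pair);
    every other row in the batch acts as a negative."""
    z_a = F.normalize(z_a, dim=1)              # unit-norm so dot products are cosine similarities
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(z_a.shape[0])       # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Usage: 8 paired embeddings of dimension 128
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))

Minimizing this loss pulls each positive pair together while pushing it away from every other example in the batch, which is exactly the "pull similar together, push dissimilar apart" idea above.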

Healthcare connection:

  • Scenario: 8M emergency department visits, only 100K with outcome labels
  • Solution: Pre-train on all 8M visits (self-supervised) → fine-tune on 100K labeled examples
  • Result: 20-40% improvement over training only on labeled data

Review these concepts for deeper understanding:

Checkpoint:

  • Can explain the motivation for self-supervised learning
  • Understand the difference between contrastive and masked prediction
  • Know when to use self-supervised pre-training in your own projects

Day 3-4: Masked Prediction Methods (3-4 hours)

Core Concept

Masked Prediction - Learning by predicting hidden input

Key topics:

  • BERT: Masked language modeling (hide 15% of tokens, predict them)
  • MAE: Masked autoencoders for images (hide 75% of patches!)
  • EHR applications: Masked event prediction for patient trajectories

Deep dive:

  1. BERT masking strategy: 80-10-10 rule (80% mask, 10% random, 10% keep)
  2. MAE architecture: Asymmetric encoder-decoder design for efficiency
  3. Why high mask ratios work: 75% masking forces global understanding

Implementation focus:

# BERT-style masked prediction
def mask_events(sequence, mask_prob=0.15):
    # Select 15% of positions
    # Apply 80-10-10 masking strategy
    # Train model to predict masked events
    ...

# MAE-style masked prediction
def mask_image_patches(image, mask_ratio=0.75):
    # Mask 75% of image patches
    # Encode only visible patches (efficient!)
    # Decode to reconstruct full image
    ...
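
A fleshed-out version of these two stubs might look like the sketch below. This is a hedged illustration, not BERT's or MAE's reference code: the [MASK] token id, vocabulary size, and tensor shapes are assumptions.

import torch

def mask_events(sequence, mask_prob=0.15, mask_token_id=1, vocab_size=5000, ignore_index=-100):
    """BERT-style 80-10-10 masking for a 1-D tensor of integer event/token ids."""
    sequence, labels = sequence.clone(), sequence.clone()
    selected = torch.rand(sequence.shape) < mask_prob        # ~15% of positions get predicted
    labels[~selected] = ignore_index                         # loss is computed only on selected positions
    roll = torch.rand(sequence.shape)
    sequence[selected & (roll < 0.8)] = mask_token_id        # 80% of selected: replace with [MASK]
    swap = selected & (roll >= 0.8) & (roll < 0.9)           # 10% of selected: replace with a random id
    sequence[swap] = torch.randint(0, vocab_size, sequence.shape)[swap]
    return sequence, labels                                  # remaining 10%: left unchanged

def mask_image_patches(patches, mask_ratio=0.75):
    """MAE-style masking for patches of shape (num_patches, patch_dim): keep a random 25%."""
    num_keep = int(patches.shape[0] * (1 - mask_ratio))
    perm = torch.randperm(patches.shape[0])
    keep_idx, masked_idx = perm[:num_keep], perm[num_keep:]
    return patches[keep_idx], keep_idx, masked_idx           # the encoder sees only the kept patches

In a full MAE, a lightweight decoder then takes the encoded visible patches plus placeholder mask tokens and reconstructs the pixel values of the masked patches.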

Healthcare application:

  • Mask medical event codes in EHR sequences
  • Model learns temporal patterns and clinical relationships
  • Fine-tune for specific prediction tasks (readmission, mortality)

Checkpoint:

  • Understand BERT’s 80-10-10 masking strategy and why it works
  • Know why MAE uses 75% masking (not 15% like BERT)
  • Can design a masked prediction task for sequential data

Day 5-7: Training Dynamics and Double Descent (2-4 hours)

Core Concept

Training Dynamics and Double Descent - Why overparameterization helps

Key topics:

  • Double descent curve: Test error goes down → up → down again
  • Interpolation threshold: The worst performance zone
  • Overparameterization benefits: Larger models generalize better
  • Implicit regularization: How SGD finds good solutions

Critical insights:

  1. Classical view (wrong): More parameters → overfitting
  2. Modern reality: Past interpolation threshold, more parameters → better generalization
  3. Grokking: Generalization can happen long after perfect train accuracy

The three regimes:

Regime | Performance | Example
Underparameterized | Improves with size | Small linear models
Interpolation threshold | Worst performance | Just fits training data
Overparameterized | Improves with size! | Modern deep networks
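
The down-up-down shape can be reproduced in a few lines with random features and minimum-norm least squares. This is an illustrative sketch, not from the double descent paper; the constants are arbitrary, and the error peak typically appears where the number of features matches the number of training points:

import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 1000, 5
X_train, X_test = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
w_true = rng.normal(size=d)
y_train = X_train @ w_true + 0.3 * rng.normal(size=n_train)   # noisy labels
y_test = X_test @ w_true

for n_features in [5, 10, 20, 40, 80, 320, 1280]:
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)          # random ReLU feature map
    Phi_train = np.maximum(0.0, X_train @ W)
    Phi_test = np.maximum(0.0, X_test @ W)
    coef = np.linalg.pinv(Phi_train) @ y_train                 # minimum-norm least squares
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"{n_features:5d} features  test MSE = {test_mse:.3f}")

Test error usually falls at first, spikes near 40 features (the interpolation threshold, where the model just barely fits the 40 training points), then falls again in the overparameterized regime.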

Practical implications:

  • ✅ Use large models when possible
  • ✅ Don’t fear perfect training accuracy
  • ✅ Train past perfect accuracy (grokking!)
  • ❌ Don’t stop at interpolation threshold
  • ❌ Don’t assume perfect train acc = overfitting

Essential viewing:

  • “What the Books Get Wrong about AI (Double Descent)” by Welch Labs
  • This 20-minute video is one of the most important in the entire curriculum

Checkpoint:

  • Understand the double descent phenomenon
  • Know why the interpolation threshold is dangerous
  • Can explain why larger models generalize better
  • Watched the Welch Labs double descent video

Key Concepts Summary

Three Core Concepts

  1. Self-Supervised Learning
    • Learning from unlabeled data
    • Two paradigms: contrastive and masked prediction
    • Foundation for modern pre-training
  2. Masked Prediction
    • BERT for sequences (mask 15% of tokens)
    • MAE for images (mask 75% of patches)
    • Applications to EHR and sequential data
  3. Training Dynamics
    • Double descent curve
    • Overparameterization benefits
    • Grokking and delayed generalization

Supporting Concepts

Practical Applications

Domain-Specific Pre-Training

Healthcare example:

# Schematic two-stage pipeline: mask_ehr_events and cross_entropy stand in for
# project-specific helpers (e.g. torch.nn.functional.cross_entropy).

# Stage 1: Self-supervised pre-training (ALL data, no labels)
def pretrain_on_all_visits(model, all_8M_visits, optimizer):
    """Pre-train on all visits using masked event prediction."""
    for epoch in range(100):
        for batch in all_8M_visits:
            # Mask ~15% of events (80-10-10 strategy)
            masked_events, targets = mask_ehr_events(batch)
            # Predict the masked events
            predictions = model(masked_events)
            loss = cross_entropy(predictions, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model  # Pre-trained representations

# Stage 2: Supervised fine-tuning (LABELED data only)
def finetune_on_labeled(pretrained_model, labeled_100K_visits, optimizer):
    """Fine-tune on the smaller labeled subset."""
    for epoch in range(20):
        for batch_x, batch_y in labeled_100K_visits:
            predictions = pretrained_model(batch_x)
            loss = cross_entropy(predictions, batch_y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return pretrained_model  # Final task-specific model

Result: 20-40% improvement vs. training only on 100K labeled examples

When to Use Self-Supervised Learning

Use self-supervised pre-training when:

  • ✅ You have abundant unlabeled data (millions of examples)
  • ✅ Labels are expensive (require expert annotation)
  • ✅ You have a small labeled subset (thousands to hundreds of thousands)
  • ✅ Domain-specific patterns exist in unlabeled data

Examples:

  • Medical imaging: Millions of scans, limited diagnoses
  • Clinical text: All clinical notes (unlabeled) → specific outcome prediction (labeled)
  • EHR sequences: All patient visits → specific outcome labels
  • Scientific data: Abundant experimental data → limited annotations

Design Principles

Choosing the paradigm:

Use Case | Paradigm | Method | Example
Sequential data | Masked prediction | BERT-style | EHR events, clinical notes, time series
Images | Masked prediction or contrastive | MAE or SimCLR | Medical scans, satellite imagery
Multimodal | Contrastive | CLIP-style | Image-text pairs, radiology reports
Text | Masked prediction | BERT/GPT | Clinical notes, scientific papers

Advanced Topics

The Lottery Ticket Hypothesis

Dense networks contain sparse “winning ticket” subnetworks:

  • Train a large network → find the important weights → retrain the sparse subnetwork (see the sketch below)
  • Offers one explanation for why overparameterization helps training
  • Enables efficient model compression
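
As a rough illustration of the pruning step, the sketch below builds a global magnitude mask. The function name and sparsity level are assumptions, and a faithful lottery-ticket experiment would also rewind the surviving weights to their original initialization before retraining:

import torch

def winning_ticket_mask(trained_weights, sparsity=0.8):
    """Keep the largest-magnitude weights globally; prune the rest."""
    all_magnitudes = torch.cat([w.abs().flatten() for w in trained_weights])
    threshold = torch.quantile(all_magnitudes, sparsity)      # cut-off below which weights are pruned
    return [(w.abs() > threshold).float() for w in trained_weights]

# During retraining, multiply each weight tensor (and its gradient) by its mask
# so that pruned connections stay at zero.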

Grokking Phenomenon

Models can keep improving on held-out data long after reaching perfect training accuracy:

  • Phase 1: Memorize the training data (fast)
  • Phase 2: Implicit regularization gradually simplifies the solution (slow)
  • Phase 3: Sudden generalization (grokking!)

Lesson: Don’t stop training too early!

Scaling Laws

Performance scales predictably with model size and data:

  • Larger models consistently outperform smaller ones when trained on enough data
  • Compute-optimal training uses roughly 20 tokens per parameter (the Chinchilla rule; see the worked example below)
  • See Language Model Scaling Laws
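
As a quick worked example of the Chinchilla rule of thumb (the model size below is arbitrary and only for illustration):

# tokens ≈ 20 × parameters
params = 7e9                    # a hypothetical 7B-parameter model
optimal_tokens = 20 * params    # 1.4e11, i.e. roughly 140 billion training tokens
print(f"{optimal_tokens:.2e} tokens")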

Assessment

Test your understanding of advanced training topics:

Self-Supervised Learning

  • Can explain the motivation for self-supervised learning
  • Know the difference between contrastive and masked prediction
  • Understand BERT’s masking strategy (80-10-10 rule)
  • Know why MAE uses 75% masking for images
  • Can design a self-supervised pre-training task for a new domain

Training Dynamics

  • Understand the double descent phenomenon
  • Can explain why test error peaks at the interpolation threshold
  • Know why overparameterized models generalize better
  • Understand implicit regularization via SGD
  • Can explain grokking and delayed generalization

Practical Application

  • Can design a two-stage training strategy (pre-train + fine-tune)
  • Know when to use self-supervised learning in your projects
  • Understand tradeoffs between contrastive and masked prediction
  • Can debug training issues using double descent insights

Next Steps

After completing this path, you’re ready for:

Advanced Paths

  • Continue exploring advanced deep learning research
  • Study specific self-supervised methods in detail (SimCLR, MoCo, BYOL)
  • Dive deeper into neural tangent kernel theory

Domain Applications

  • Apply self-supervised pre-training to your domain
  • Design custom pre-training tasks for your data type
  • Implement two-stage training pipelines

Healthcare Specialization

Resources

Essential Papers

Self-Supervised Learning:

  • SimCLR: “A Simple Framework for Contrastive Learning of Visual Representations” (Chen et al., 2020)
  • MoCo: “Momentum Contrast for Unsupervised Visual Representation Learning” (He et al., 2020)
  • BERT: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (Devlin et al., 2018)
  • MAE: “Masked Autoencoders Are Scalable Vision Learners” (He et al., 2021)

Training Dynamics:

  • Double Descent: “Deep Double Descent” (Nakkiran et al., 2019)
  • Lottery Ticket: “The Lottery Ticket Hypothesis” (Frankle & Carbin, 2019)
  • Grokking: “Grokking: Generalization Beyond Overfitting” (Power et al., 2021)

Videos

Essential viewing:

  • “What the Books Get Wrong about AI (Double Descent)” - Welch Labs (20 min) - Must watch!
  • “BERT Explained” - CodeEmporium
  • “Masked Autoencoders” - Yannic Kilcher

Blog Posts

  • “Self-Supervised Representation Learning” - Lilian Weng (comprehensive overview)
  • “The Illustrated BERT” - Jay Alammar (visual BERT guide)
  • “Understanding Double Descent” - Stanford CS229 notes

Code & Tutorials

  • Hugging Face Transformers: BERT and variants
  • timm library: MAE and vision self-supervised models
  • SimCLR implementation: PyTorch tutorial
  • BERT from scratch: MinBERT tutorial

Study Tips

  1. Start with self-supervised learning: Understand the motivation before diving into methods
  2. Watch the Welch Labs video: Double descent is counterintuitive - visual explanation helps
  3. Connect to previous concepts: Self-supervised learning builds on contrastive learning from the VLMs path
  4. Focus on principles: Don’t get lost in implementation details
  5. Think about your domain: How would you apply these concepts to your own data?

Time Estimates

  • Self-supervised learning foundations: 3-4 hours
  • Masked prediction methods: 3-4 hours
  • Training dynamics and double descent: 2-4 hours
  • Total: 8-12 hours over 1 week

This is an optional advanced module - take your time and revisit concepts as needed!