
The Attention Mechanism & The Rise of Transformers

The attention mechanism and transformers represent the most significant architectural innovation in modern AI. This module deconstructs the “Attention Is All You Need” paper, teaching you how attention solves the limitations of recurrent networks and enables modeling of long-range dependencies in sequences.

Why This Module Is Critical

Understanding transformers is absolutely essential for modern AI research:

  • Dominant Architecture: Transformers power GPT, BERT, Vision Transformers, CLIP, and virtually all state-of-the-art models
  • Sequence Modeling Foundation: Essential for modeling EHR event sequences, clinical notes, and temporal patient data
  • Multimodal Fusion: Attention enables alignment across modalities (images + text, symptoms + EHR)
  • Research Requirement: Can’t read modern papers without deep transformer understanding

For healthcare AI specifically:

  • ETHOS (the baseline for the EmergAI thesis) uses a transformer architecture
  • Clinical language models (ClinicalBERT, Med-PaLM) are transformer-based
  • Multimodal medical AI requires cross-attention mechanisms

Learning Objectives

After completing this module, you will:

  • Understand RNN Limitations: Grasp why recurrent architectures struggle with long sequences and why attention emerged as the solution
  • Master Attention Mechanism: Deeply understand query-key-value framework, scaled dot-product attention, and multi-head attention
  • Deconstruct Transformers: Break down the complete transformer architecture - encoder, decoder, positional encodings, and training
  • Build a Sequence Modeling Foundation: Prepare for modeling sequential data (patient trajectories, clinical notes, time series)

Prerequisites

Before starting this module, ensure you have:

  • Neural Network Foundations - Strong backpropagation understanding
  • CNN Foundations - Architectural thinking
  • Sequence Concepts: Basic understanding of sequence-to-sequence problems
  • Linear Algebra: Matrix operations, dot products, softmax

The Most Important Equation in Modern AI

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

By the end of this module, you will understand every element of this equation and why it revolutionized AI.

Week 1: From RNNs to Attention

Day 1-2: The Problem Transformers Solve

Core Concept:

RNN Limitations

What You’ll Learn:

  • Recurrent neural networks (RNNs) and sequential processing
  • LSTM and GRU as attempted solutions
  • Fundamental limitations:
    • Vanishing/exploding gradients over long sequences
    • Sequential processing prevents parallelization
    • Fixed-size context bottleneck
    • Difficulty capturing long-range dependencies

Why This Matters:

  • Patient event sequences can be very long (years of history)
  • Clinical notes have long-range dependencies (diagnosis depends on symptoms mentioned paragraphs earlier)
  • Understanding what transformers fix helps you use them correctly

Learning Resources:

  • Videos: CS231n RNN lecture (optional background)
  • Reading: Colah’s blog “Understanding LSTM Networks”
  • Papers: Original LSTM paper (Hochreiter & Schmidhuber, 1997)

Checkpoint: Can you explain why RNNs struggle with sequences longer than 100 timesteps?

Day 3-5: The Attention Mechanism

Core Concept:

Attention Mechanism

What You’ll Learn:

  • Core Intuition: Weighted combination of values based on query-key similarity
  • Alignment Scores: Measuring similarity between query and keys
  • Attention Weights: Softmax over alignment scores
  • Context Vector: Weighted sum of values
  • Key Innovation: A constant-length path between any two positions in the sequence (an RNN needs O(n) sequential steps)

The Attention Process (Step-by-Step):

  1. Compute Alignment Scores: How much should we focus on each position?

    • Compare query with each key
    • Various scoring functions: dot product, additive (Bahdanau-style), multiplicative/general (Luong-style)
  2. Attention Weights: Normalize scores with softmax

    • Convert scores to probability distribution
    • Sum to 1, all positive
  3. Context Vector: Weighted combination of values

    • Multiply each value by its attention weight
    • Sum weighted values

Intuitive Example (Patient Diagnosis):

  • Query: Current symptoms
  • Keys: Past medical history events
  • Values: Diagnostic information from those events
  • Attention: Focuses on historically relevant events for current diagnosis
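Putting the three steps together, here is a minimal NumPy sketch of basic (unscaled) attention for a single query; the function name and toy shapes are illustrative, not taken from any library.

```python
import numpy as np

def basic_attention(query, keys, values):
    """Single-query attention: score, softmax, weighted sum of values."""
    scores = keys @ query                       # (m,) alignment scores via dot product
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights = weights / weights.sum()           # (m,) attention weights, sum to 1
    context = weights @ values                  # (d_v,) weighted sum of the values
    return context, weights

# Toy example: 4 "memory" positions with 3-dim keys and 2-dim values
keys = np.random.randn(4, 3)
values = np.random.randn(4, 2)
query = np.random.randn(3)
context, weights = basic_attention(query, keys, values)
print(weights, context)
```

The printed weights sum to 1, and the context vector is the corresponding weighted combination of the value rows.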

Learning Resources:

  • Reading:
    • Jay Alammar’s “Visualizing Attention” (essential)
    • Lilian Weng’s “Attention? Attention!” (comprehensive)
    • Distill.pub “Attention and Augmented Recurrent Neural Networks”
  • Videos: 3Blue1Brown attention visualization
  • Interactive: Online attention visualizers

Exercises:

  • Implement simple attention mechanism in NumPy
  • Visualize attention weights on toy examples
  • Compute attention manually for small example

Checkpoint: Can you implement basic attention (without transformers) from scratch?

Day 6-7: Scaled Dot-Product Attention

Core Concept:

Scaled Dot-Product Attention

What You’ll Learn:

  • Query-Key-Value (QKV) Framework: The universal attention formulation
  • Dot Product Similarity: Why dot product works as alignment function
  • Scaling Factor: Why divide by sqrt(d_k)?
  • Masking: Padding masks and causal masks
  • Computational Complexity: O(n²·d) for sequence length n and dimension d

The Formula Deconstructed:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
  • Q (Query): “What am I looking for?” - shape (n, d_k)
  • K (Key): “What can I match with?” - shape (m, d_k)
  • V (Value): “What information do I return?” - shape (m, d_v)
  • QK^T: Compute similarity between all query-key pairs - shape (n, m)
  • √d_k: Scaling prevents softmax saturation for large d_k
  • softmax: Normalize scores to probability distribution
  • Result: Weighted combination of values - shape (n, d_v)

Why Scaling Matters:

  • For large d_k, dot products grow large in magnitude
  • Pushes softmax into saturation regions (gradients ≈ 0)
  • Division by √d_k keeps the variance of the scores near 1 (assuming unit-variance query and key components), as the quick check below illustrates
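A quick NumPy check of this claim, with illustrative sample sizes: the variance of raw dot products between unit-variance random vectors grows linearly with d_k, and dividing by √d_k brings it back to roughly 1.

```python
import numpy as np

d_k = 512
q = np.random.randn(10000, d_k)   # unit-variance query components
k = np.random.randn(10000, d_k)   # unit-variance key components

raw_scores = (q * k).sum(axis=1)  # one dot product per row
print(raw_scores.var())                     # ≈ d_k (≈ 512)
print((raw_scores / np.sqrt(d_k)).var())    # ≈ 1 after scaling
```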

Masking Types:

  1. Padding Mask: Ignore padding tokens introduced when batching variable-length sequences
  2. Causal Mask: Prevent attending to future tokens (GPT-style)
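Below is a minimal PyTorch sketch of scaled dot-product attention with an optional boolean mask (True = may attend); the masking convention and tensor shapes are assumptions for illustration. PyTorch 2.x also provides torch.nn.functional.scaled_dot_product_attention to compare against.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (..., n, d_k), k: (..., m, d_k), v: (..., m, d_v); mask: True = attend."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (..., n, m)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # block masked positions
    weights = torch.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ v, weights                            # output: (..., n, d_v)

# Causal mask for a decoder: position i may attend only to positions <= i
n = 5
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
q = k = v = torch.randn(1, n, 16)
out, attn = scaled_dot_product_attention(q, k, v, mask=causal)
```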

Learning Resources:

  • Papers: “Attention Is All You Need” (Section 3.2.1)
  • Videos: 3Blue1Brown scaled attention explanation
  • Code: PyTorch scaled dot-product attention implementation

Exercises:

  • Implement scaled dot-product attention in PyTorch
  • Visualize effect of scaling factor
  • Implement padding and causal masking
  • Compute attention complexity for different sequence lengths

Checkpoint: Can you explain why we scale by √d_k and not just d_k?

Week 2: Multi-Head Attention and Transformers

Day 8-9: Multi-Head Attention

Core Concept:

Multi-Head Attention

What You’ll Learn:

  • Multiple Attention “Heads”: Run scaled dot-product attention in parallel
  • Different Representation Subspaces: Each head learns different relationships
  • Concatenation and Projection: Combine head outputs
  • Why Multiple Heads?: Attend to different aspects simultaneously

The Multi-Head Formulation:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

What Different Heads Learn:

  • Some heads attend to syntactic relationships (subject-verb)
  • Some heads attend to semantic relationships (entity-attribute)
  • Some heads attend to positional relationships (next token, previous token)
  • Empirically discovered patterns through attention visualization

Practical Design Choices:

  • Typical: 8-16 heads for most models
  • d_model = 512 → 8 heads × 64 dimensions per head
  • More heads = more capacity but more computation
  • Diminishing returns beyond 16 heads
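The sketch below implements multi-head attention under the usual convention d_model = h × d_head (e.g. 512 = 8 × 64); the class and parameter names are illustrative, and the scaled dot-product step is inlined so the block is self-contained.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Illustrative multi-head attention: project, split into heads, attend, merge."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # output projection W^O

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        # Project and reshape to (batch, heads, seq_len, d_head)
        def split(x, proj):
            return proj(x).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        out = weights @ v                                    # (b, heads, n, d_head)
        out = out.transpose(1, 2).contiguous().view(b, -1, self.n_heads * self.d_head)
        return self.w_o(out)                                 # concat heads, then W^O

x = torch.randn(2, 10, 512)            # (batch, seq_len, d_model)
mha = MultiHeadAttention(512, 8)
print(mha(x, x, x).shape)              # torch.Size([2, 10, 512])
```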

Learning Resources:

  • Papers:
    • “Attention Is All You Need” (Section 3.2.2)
    • “Are Sixteen Heads Really Better than One?” (head pruning study)
  • Reading: Voita et al. “Analyzing Multi-Head Self-Attention”
  • Code: Implement multi-head attention in PyTorch

Exercises:

  • Implement multi-head attention from scratch
  • Visualize different head attention patterns
  • Experiment with different numbers of heads
  • Understand parameter count calculation

Checkpoint: Can you implement multi-head attention without references and explain what different heads might learn?

Day 10-12: The Complete Transformer Architecture

Architecture Paper:

Attention Is All You Need (Vaswani et al., 2017)

What You’ll Learn:

  • Complete Transformer: Encoder-decoder architecture
  • Encoder Stack: Multi-head self-attention + feed-forward
  • Decoder Stack: Masked self-attention + encoder-decoder attention + feed-forward
  • Positional Encoding: Injecting position information
  • Layer Normalization: Stabilizing training
  • Residual Connections: Enabling deep stacks

Encoder Architecture:

Encoder Block (×N):

  1. Multi-Head Self-Attention
  2. Add & Norm (residual + layer norm)
  3. Feed-Forward Network (2-layer MLP)
  4. Add & Norm

Decoder Architecture:

Decoder Block (×N):

  1. Masked Multi-Head Self-Attention (causal)
  2. Add & Norm
  3. Multi-Head Cross-Attention (attending to encoder)
  4. Add & Norm
  5. Feed-Forward Network
  6. Add & Norm
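For concreteness, here is a sketch of a single post-norm encoder block as listed above, using PyTorch's built-in nn.MultiheadAttention for the attention sub-layer; the defaults follow the paper's base model (d_model = 512, 8 heads, d_ff = 2048, dropout 0.1), but the rest of the wiring is an illustrative assumption rather than a reference implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Post-norm transformer encoder block: self-attention + FFN, each with Add & Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(                 # 2-layer position-wise MLP
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.drop(attn_out))      # Add & Norm after self-attention
        x = self.norm2(x + self.drop(self.ffn(x)))   # Add & Norm after feed-forward
        return x

x = torch.randn(2, 50, 512)                          # (batch, seq_len, d_model)
print(EncoderBlock()(x).shape)                       # torch.Size([2, 50, 512])
```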

Three Types of Attention in Transformers:

  1. Encoder Self-Attention: Each position attends to all positions in input

    • Bidirectional (can see full sequence)
    • Used in BERT-style models
  2. Decoder Self-Attention: Each position attends only to previous positions

    • Causal masking (can’t see future)
    • Used in GPT-style models
  3. Encoder-Decoder Cross-Attention: Decoder attends to encoder outputs

    • Machine translation, image captioning
    • Query from decoder, Key+Value from encoder

Positional Encoding:

  • Transformers have no inherent notion of position
  • Need to inject positional information
  • Sinusoidal functions or learned embeddings
  • Added to input embeddings
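A sketch of the sinusoidal encoding from the paper (sine for even dimensions, cosine for odd), added directly to the input embeddings; the sequence length and d_model below are illustrative.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                         # (max_len, d_model)

embeddings = torch.randn(2, 100, 512)                     # (batch, seq_len, d_model)
x = embeddings + sinusoidal_positional_encoding(100, 512) # broadcasts over the batch
```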

Learning Resources:

  • Papers: “Attention Is All You Need” (read 3+ times)
  • Reading:
    • The Illustrated Transformer (Jay Alammar) - Essential
    • The Annotated Transformer (Harvard NLP) - Line-by-line implementation
  • Videos:
    • 3Blue1Brown: Attention in transformers
    • CS231n: Transformers lecture
  • Code: Implement transformer from scratch following Annotated Transformer

Exercises:

  • Draw complete transformer architecture from memory
  • Implement transformer encoder and decoder
  • Understand all three attention types
  • Compute parameter counts
  • Trace a forward pass through the architecture

Checkpoint:

  • Can you draw the complete transformer architecture?
  • Can you explain the purpose of each component?
  • Can you identify which attention type to use for different tasks?

Day 13-14: Training and Practical Considerations

Training Transformers:

Optimizer:

  • Adam with a custom learning rate schedule
  • Warmup followed by decay is standard (inverse square root decay in the original paper; cosine decay is common in more recent models)

Learning Rate Schedule:

lr = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))
  • Increases linearly during warmup
  • Decreases proportional to inverse square root after
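A direct transcription of the schedule above into a small function, plus one assumed way to wire it into a PyTorch optimizer via LambdaLR (base lr = 1.0 so the returned value is the actual learning rate); the Adam betas and eps follow the paper.

```python
import torch

def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lr = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))."""
    step = max(step, 1)  # avoid a zero learning rate at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Illustrative wiring: LambdaLR multiplies the base lr (1.0) by noam_lr(step)
model = torch.nn.Linear(512, 512)
opt = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=noam_lr)
```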

Label Smoothing:

  • Soft targets instead of hard one-hot
  • Improves generalization
  • Prevents overconfidence
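In recent PyTorch versions this is a one-liner via the label_smoothing argument of nn.CrossEntropyLoss; the value 0.1 matches the paper, and the shapes below are illustrative.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # soft targets instead of one-hot
logits = torch.randn(8, 10000)                        # (batch, vocab_size)
targets = torch.randint(0, 10000, (8,))
loss = criterion(logits, targets)
```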

Common Issues:

  1. Exploding Attention Scores: Use proper scaling
  2. Vanishing Gradients: Residual connections help
  3. Unstable Training: Layer norm + learning rate warmup
  4. Memory Constraints: Gradient checkpointing, mixed precision

Checkpoint: Can you explain the learning rate warmup schedule and why it’s needed?

Module Completion Criteria

You have completed this module when you can:

  • ✅ Explain why attention is revolutionary and what problems it solves
  • ✅ Implement scaled dot-product attention from scratch
  • ✅ Implement multi-head attention from scratch
  • ✅ Draw and explain the complete transformer architecture from memory
  • ✅ Distinguish between encoder self-attention, decoder self-attention, and cross-attention
  • ✅ Understand when to use encoder-only (BERT), decoder-only (GPT), or encoder-decoder (T5) architectures
  • ✅ Read transformer code in papers and understand it immediately
  • ✅ Debug attention-related implementation issues

Key Resources

Essential Paper (Read 3+ Times)

“Attention Is All You Need” (Vaswani et al., 2017)

  • The most important paper in modern AI
  • Introduced the transformer architecture
  • Foundation for GPT, BERT, T5, and virtually all modern models

Essential Reading

  1. The Illustrated Transformer (Jay Alammar)

    • Best visual explanation of transformers
    • Start here before reading the paper
  2. The Annotated Transformer (Harvard NLP)

    • Line-by-line PyTorch implementation
    • Code along with this
  3. Attention? Attention! (Lilian Weng)

    • Comprehensive attention survey
    • Historical context and variants

Videos

  • 3Blue1Brown: Attention in transformers (beautiful visualizations)
  • CS231n: Transformers lecture (Stanford)

Code Resources

  • The Annotated Transformer (Harvard NLP) - Reference implementation
  • PyTorch nn.MultiheadAttention - Production implementation
  • HuggingFace Transformers library - Pre-trained models

Common Pitfalls

1. Confusing Attention Types

Problem: Mixing up self-attention vs cross-attention, or encoder vs decoder attention
Solution: Draw diagrams showing Query/Key/Value sources for each type

2. Forgetting Scaling Factor

Problem: Attention scores explode for large d_k
Solution: Always divide by √d_k in scaled dot-product attention

3. Wrong Masking

Problem: Causal mask in the encoder, or no causal mask in the decoder
Solution:

  • Encoder: No causal mask (bidirectional)
  • Decoder: Causal mask (can’t see future)
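A sketch of building both mask types as boolean tensors (True = may attend), consistent with the masking convention assumed in the earlier scaled dot-product sketch; PAD_ID and the token ids are illustrative.

```python
import torch

PAD_ID = 0  # illustrative padding token id

token_ids = torch.tensor([[5, 7, 2, PAD_ID, PAD_ID]])            # (batch=1, seq_len=5)

# Padding mask: True where the token is real, False where it is padding
padding_mask = (token_ids != PAD_ID).unsqueeze(1)                # (1, 1, 5), broadcasts over query positions

# Causal mask: lower-triangular so position i sees only positions <= i
seq_len = token_ids.size(1)
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # (5, 5)

# Decoder self-attention typically combines both
decoder_mask = causal_mask & padding_mask                        # (1, 5, 5)
```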

4. Positional Encoding Issues

Problem: Forgetting to add positional encodings
Solution: Transformers have no sense of position without positional encodings!

5. Dimension Mismatches

Problem: Q, K, V have wrong dimensions
Solution: Carefully track tensor shapes through the attention computation

Success Tips

  1. Read the Paper Multiple Times

    • First read: Get high-level understanding
    • Second read: Understand each equation
    • Third read: Implementation details
    • Minimum 3 reads for “Attention Is All You Need”
  2. Implement from Scratch

    • Don’t just use nn.MultiheadAttention
    • Build your own attention module
    • Debug dimension errors - that’s where learning happens
  3. Visualize Attention

    • Plot attention weight matrices
    • See what the model attends to
    • Helps debug and understand model behavior
  4. Understand Task-Architecture Mapping

    • Classification → Encoder-only (BERT)
    • Generation → Decoder-only (GPT)
    • Seq2Seq → Encoder-decoder (T5)
  5. Follow The Annotated Transformer

    • Best code walkthrough available
    • Implement line-by-line
    • Test each component

Connection to Healthcare AI

Transformers are essential for healthcare applications:

EHR Analysis (Your Thesis):

  • ETHOS uses a transformer encoder for patient event sequences
  • Self-attention captures dependencies between medical events
  • Positional encoding represents temporal information

Clinical NLP:

  • ClinicalBERT: BERT fine-tuned on clinical notes
  • Med-PaLM: GPT-style model for medical QA
  • BioClinicalBERT: Domain-specific pre-training

Multimodal Healthcare AI:

  • Cross-attention fuses imaging + EHR + text
  • Vision encoder (ViT) + language encoder (BERT) + fusion (cross-attention)

Next Steps

After completing this module, you have several paths:

  1. Immediate Next: Module 4: Language Models (GPT)

    • Build on transformer knowledge
    • Implement complete decoder-only model
    • Understand modern LLMs
  2. Parallel Study:

    • Read BERT paper (encoder-only transformers)
    • Explore T5 (encoder-decoder at scale)
  3. Advanced Topics:

    • Vision Transformers (ViT) - Transformers for images
    • CLIP - Vision-language transformers
    • Efficient transformers (Linformer, Reformer)

Time Investment

Total estimated time: 12-16 hours over 2 weeks

  • Papers: 4-6 hours (Attention Is All You Need, 3+ reads)
  • Reading: 2-3 hours (Illustrated Transformer, Annotated Transformer)
  • Videos: 2-3 hours (3Blue1Brown, lectures)
  • Implementation: 4-6 hours (Code along with Annotated Transformer)

This is the most important module for your thesis. Transformers are the foundation of ETHOS and all modern sequence modeling. Don’t rush.

Key Takeaway

“Attention Is All You Need” - Vaswani et al., 2017

This single architectural innovation revolutionized AI. By replacing recurrence with attention, transformers enabled:

  • Parallelization (faster training)
  • Long-range dependencies (better modeling)
  • Transfer learning at scale (foundation models)

Understanding transformers deeply is non-negotiable for modern AI research. Read the paper 3+ times. Implement from scratch. Master this architecture.


Ready to begin? Start with RNN Limitations.