Clinical Language Models
Transformer language models like GPT and BERT can be adapted to model patient event sequences and clinical text, enabling trajectory prediction, risk stratification, and clinical decision support. This concept bridges language modeling techniques with healthcare applications.
Overview
Language models treat text as sequences of tokens. In healthcare, we can apply the same principles to:
- Patient event sequences: ICD codes, procedures, medications, lab results as discrete tokens
- Clinical text: Medical notes, radiology reports, discharge summaries as natural language
- Hybrid approaches: Combining structured events with unstructured text
Key insight: Medical events are already discrete tokens (ICD codes, procedure codes). Unlike text that needs character-level or subword tokenization, healthcare events are natural units—making them ideal for transformer models.
Event Sequence Tokenization
From Text Tokens to Medical Event Tokens
Text GPT:
```python
# Tokenize natural language
text = "The patient has chest pain"
tokens = tokenizer.encode(text)  # [464, 5827, 468, 7721, 2356]
```
Healthcare GPT:
```python
# Tokenize patient trajectory
trajectory = [
    "ICD:I21.0",          # Acute myocardial infarction
    "PROC:PCI",           # Percutaneous coronary intervention
    "MED:aspirin",        # Medication
    "LAB:troponin_high",  # Lab result
    "VISIT:followup_30d"  # Follow-up visit
]

# Create vocabulary from medical coding systems
vocab = {
    "ICD:I21.0": 1,
    "PROC:PCI": 2,
    "MED:aspirin": 3,
    "LAB:troponin_high": 4,
    "VISIT:followup_30d": 5,
    # ... thousands more event codes
}

# Simple tokenization (events are already tokens!)
tokens = [vocab[event] for event in trajectory]  # [1, 2, 3, 4, 5]
```
BPE for Medical Event Patterns
Byte-Pair Encoding can learn common clinical workflows:
```python
# Frequent co-occurring events can be merged into a single token:
#   "chest_pain" + "ECG" + "troponin_test" → "acute_cardiac_workup"
#
# Benefits:
# - Captures clinical protocols
# - Reduces sequence length
# - Learns domain knowledge from data
```
Key Advantage: Medical events from coding systems (ICD-10, CPT, ATC) are pre-tokenized at a clinically meaningful granularity. No need for character or subword tokenization; use codes directly as vocabulary.
Autoregressive Trajectory Prediction
Modeling Patient Futures
GPT’s autoregressive formulation maps directly to predicting patient trajectories:
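Formally, a patient trajectory of events $e_1, \dots, e_T$ is modeled by factorizing its joint probability into a product of next-event predictions, exactly as a text GPT factorizes a sentence:

$$
P(e_1, e_2, \dots, e_T) = \prod_{t=1}^{T} P(e_t \mid e_1, \dots, e_{t-1})
$$

At inference time, the model predicts (or samples) $e_{T+1}$ given the observed history.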
Example:
```python
# Given patient history
history = ["ICD:I21.0", "PROC:PCI", "MED:aspirin"]

# Model predicts next likely events
next_events = model.generate(history, top_k=5)
# Predicted: ["MED:statin", "VISIT:followup_30d", "LAB:lipid_panel", ...]
```
Clinical Use Cases
1. Disease Progression Modeling
- Input: Current diagnoses and treatments
- Output: Likely future events and complications
- Application: Anticipate deterioration, allocate resources
2. Treatment Trajectory Generation
- Input: Patient diagnosis and condition
- Output: Typical care pathway
- Application: Clinical decision support, care planning
3. Outcome Risk Stratification
- Input: Patient event history
- Output: Probability of adverse outcomes
- Application: Identify high-risk patients for early intervention (see the scoring sketch after this list)
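A minimal sketch of how a decoder-style model's next-event distribution can be turned into a risk score for outcome stratification. The `model` output shape, the tokenizer's `code_to_id` mapping, and the adverse-event codes are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def adverse_event_risk(model, tokenizer, history, adverse_codes):
    """Probability mass the model assigns to adverse events as the next step."""
    input_ids = torch.tensor([tokenizer.encode(history)])
    with torch.no_grad():
        logits = model(input_ids)              # (1, seq_len, vocab_size), assumed
    next_probs = F.softmax(logits[0, -1], dim=-1)
    adverse_ids = [tokenizer.code_to_id[c] for c in adverse_codes]
    return next_probs[adverse_ids].sum().item()

# Illustrative call
risk = adverse_event_risk(
    model, tokenizer,
    history=["ICD:I21.0", "PROC:PCI", "MED:aspirin"],
    adverse_codes=["ICD:I46.9", "VISIT:ICU_admit"],  # cardiac arrest, ICU admission
)
```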
Encoder vs Decoder for Healthcare
Understanding both encoder-only (BERT/ETHOS) and decoder-only (GPT) architectures provides a complete perspective on clinical AI:
| Aspect | Encoder (BERT, ETHOS) | Decoder (GPT) |
|---|---|---|
| Attention | Bidirectional (see past & future) | Causal (see only past) |
| Use Case | Representation learning | Sequential generation |
| Training | Masked language modeling | Autoregressive prediction |
| Output | Fixed-size representation | Step-by-step generation |
When to Use Encoder (ETHOS Approach)
Best for: Patient representation, cohort analysis, risk classification
```python
# Encode entire patient trajectory into fixed representation
trajectory = ["event1", "event2", "event3", "event4"]
embedding = encoder(trajectory)  # Single vector [768]

# Use embedding for:
# - Patient similarity (find similar cases)
# - Risk scores (classify into risk categories)
# - Cohort identification (clustering)
# - Zero-shot prediction (transfer to new tasks)
```
Example: ETHOS uses bidirectional attention to create patient representations for zero-shot outcome prediction.
When to Use Decoder (GPT Approach)
Best for: Trajectory generation, sequential prediction, what-if scenarios
```python
# Generate future trajectory step-by-step
history = ["event1", "event2"]
future = decoder.generate(history, max_new_tokens=10)
# Output: ["event3", "event4", "event5", ...]

# Use for:
# - Trajectory prediction (what happens next?)
# - Counterfactual reasoning (what if different treatment?)
# - Synthetic data generation (augmentation)
```
Example: Train GPT on patient event sequences to generate plausible future trajectories for rare conditions (data augmentation).
Hybrid Encoder-Decoder
Best for: Complex clinical tasks requiring both understanding and generation
```python
# Encoder-decoder (T5-style) for healthcare
input_trajectory = ["diagnosis", "symptoms", "test_results"]
encoder_hidden = encoder(input_trajectory)

# Decoder generates treatment plan
treatment_plan = decoder.generate(encoder_hidden)
# Output: ["med1", "med2", "procedure", "followup"]
```
Practical Recommendation:
- Use encoder (ETHOS) for risk stratification and patient representation
- Use decoder (GPT) for trajectory forecasting and synthetic data
- Use encoder-decoder for complex generation tasks (e.g., care plan generation)
Temporal Modeling in Healthcare
Time-Aware Tokenization
Healthcare trajectories unfold over time. Incorporate temporal information:
Option 1: Time tokens
```python
tokens = ["ICD:I21.0", "TIME:day_0", "PROC:PCI", "TIME:day_0",
          "MED:aspirin", "TIME:day_1", "VISIT:followup", "TIME:day_30"]
```
Option 2: Time delta tokens
```python
tokens = ["ICD:I21.0", "PROC:PCI", "DELTA:+1day", "MED:aspirin",
          "DELTA:+29days", "VISIT:followup"]
```
Option 3: Time-aware positional encodings
```python
# Modify positional encoding to capture time intervals
pos_encoding = sinusoidal_encoding(time_in_days)
```
Causal Attention for Temporal Dependencies
Decoder models with causal attention naturally respect temporal order (a minimal mask sketch follows the list below):
- Events can only attend to past events (no information leakage)
- Model learns temporal patterns (e.g., “acute event → immediate treatment → delayed monitoring”)
- Can predict timing of future events, not just event types
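The mechanism behind the first point is the standard causal mask of decoder-only transformers, shown here as a minimal sketch with illustrative attention scores:

```python
import torch

seq_len = 5  # five events in the trajectory
# Lower-triangular mask: position t may attend to positions 0..t only
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Applied inside attention: disallowed (future) positions are set to -inf
scores = torch.randn(seq_len, seq_len)                   # illustrative attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)                  # each event attends only to its past
```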
Zero-Shot Trajectory Prediction
Generalizing to Rare Conditions
Large clinical language models can predict outcomes for rare conditions never seen during training:
Mechanism:
```python
# Trained on general patient population
model = ClinicalGPT(trained_on="all_patients")

# Zero-shot prediction for rare condition
rare_patient_history = [rare_condition_events]
prediction = model.predict(rare_patient_history)
# Model generalizes from similar patterns in common conditions
```
Why it works:
- Model learns general clinical patterns (symptoms → diagnosis → treatment)
- Rare conditions share underlying patterns with common ones
- Transfer learning enables generalization
Example: Model trained on common cardiac conditions can predict reasonable treatment pathways for rare congenital heart defects by transferring knowledge about general cardiac care patterns.
Synthetic Data Generation
Augmentation for Rare Conditions
Problem: Insufficient data for rare conditions
```python
# Real data: Only 10 patients with rare condition X
rare_data = [trajectory_1, trajectory_2, ..., trajectory_10]
```
Solution: Generate synthetic trajectories
```python
# Fine-tune clinical GPT on rare condition subset
gpt = fine_tune(pretrained_clinical_gpt, rare_data)

# Generate synthetic patient trajectories
synthetic_trajectories = []
for _ in range(100):
    trajectory = gpt.generate(initial_event="rare_condition_X")
    synthetic_trajectories.append(trajectory)

# Augmented dataset: 10 real + 100 synthetic
augmented_data = rare_data + synthetic_trajectories
```
Validation Requirements
Synthetic medical data requires rigorous validation:
- Clinical plausibility: Expert review for medical correctness
- Statistical similarity: Match real data distributions (see the frequency-comparison sketch below)
- Downstream performance: Improve model accuracy on real test set
- Bias amplification check: Ensure no exacerbation of existing biases
Critical Caution: Synthetic medical data should augment, never replace, real patient data. Always validate with clinical experts before use in training or decision-making systems.
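A minimal sketch of one such statistical-similarity check: compare per-code frequency distributions of real and synthetic trajectories. The metric choice and helper names are illustrative assumptions, and passing this check says nothing about clinical plausibility:

```python
from collections import Counter

def code_frequencies(trajectories):
    """Relative frequency of each event code across a set of trajectories."""
    counts = Counter(code for traj in trajectories for code in traj)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}

def total_variation_distance(real, synthetic):
    """0 = identical code distributions, 1 = completely disjoint."""
    real_freq, synth_freq = code_frequencies(real), code_frequencies(synthetic)
    codes = set(real_freq) | set(synth_freq)
    return 0.5 * sum(abs(real_freq.get(c, 0.0) - synth_freq.get(c, 0.0)) for c in codes)

# Small distance suggests synthetic trajectories match real code usage;
# expert review is still required before any downstream use.
distance = total_variation_distance(rare_data, synthetic_trajectories)
```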
Attention Interpretability
Visualizing Clinical Reasoning
Attention weights reveal which past events influence predictions:
```python
# Patient trajectory
trajectory = [
    "ICD:I21.0",       # Acute MI (day 0)
    "PROC:PCI",        # Intervention (day 0)
    "MED:aspirin",     # Medication (day 1)
    "MED:statin",      # Statin started (day 1)
    "VISIT:followup",  # Follow-up (day 30)
    "LAB:lipid_high"   # High lipids detected → predict next
]

# Extract attention patterns
attention = model.get_attention(trajectory)
# Shows: "LAB:lipid_high" strongly attends to "MED:statin" and "VISIT:followup"
# Interpretation: Model recognizes statin therapy requires lipid monitoring
```
Clinical Insights:
- Do attention patterns align with clinical guidelines?
- Which historical events drive predictions?
- Are relationships medically sensible or spurious?
Application: Explain model predictions to clinicians, validate medical reasoning, identify potential errors.
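One way to obtain attention weights like those used above, assuming a Hugging Face-style causal transformer that accepts `output_attentions=True`; the mapping from positions back to event codes is illustrative:

```python
import torch

with torch.no_grad():
    # input_ids: tokenized trajectory, shape (1, seq_len)
    outputs = model(input_ids, output_attentions=True)   # assumed HF-style API

last_layer = outputs.attentions[-1]               # (batch, heads, seq_len, seq_len)
from_last_event = last_layer.mean(dim=1)[0, -1]   # average heads, row for final event

# Top events the final event attends to
scores, positions = from_last_event.topk(3)
for score, pos in zip(scores.tolist(), positions.tolist()):
    print(trajectory[pos], round(score, 3))
```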
Implementation: From NanoGPT to Healthcare GPT
Step 1: Healthcare Tokenizer
```python
class HealthcareTokenizer:
    def __init__(self, vocab_path):
        # Load medical coding vocabulary (ICD, CPT, ATC, LOINC)
        self.vocab = load_medical_vocab(vocab_path)
        self.code_to_id = {code: idx for idx, code in enumerate(self.vocab)}
        self.id_to_code = {idx: code for code, idx in self.code_to_id.items()}

    def encode(self, trajectory):
        return [self.code_to_id[event] for event in trajectory]

    def decode(self, tokens):
        return [self.id_to_code[token] for token in tokens]
```
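A quick round-trip usage sketch of the tokenizer above; the vocabulary file path and the resulting ids are illustrative, and `load_medical_vocab` is assumed to resolve the file:

```python
# Hypothetical vocabulary file containing one medical code per line
tokenizer = HealthcareTokenizer("medical_vocab.txt")

trajectory = ["ICD:I21.0", "PROC:PCI", "MED:aspirin"]
token_ids = tokenizer.encode(trajectory)   # e.g. [1021, 3544, 87]
restored = tokenizer.decode(token_ids)     # ["ICD:I21.0", "PROC:PCI", "MED:aspirin"]
assert restored == trajectory
```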
Step 2: Same GPT Architecture
```python
# Identical architecture to text GPT, different vocabulary
model = GPT(
    vocab_size=len(medical_vocab),  # 50k-100k medical codes
    block_size=512,                 # Longer context for patient history
    n_layer=12,                     # Depth
    n_embd=768,                     # Embedding dimension
    n_head=12                       # Attention heads
)
```
Step 3: Training on Patient Trajectories
```python
# Load patient event sequences
trajectories = load_ehr_trajectories(dataset="MIMIC-IV")

# Standard autoregressive training
for trajectory in trajectories:
    # Input: all events except last
    # Target: all events except first (shifted by 1)
    x, y = trajectory[:-1], trajectory[1:]

    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Step 4: Generation and Evaluation
```python
# Generate patient trajectory
initial_event = ["ICD:I21.0"]  # Start with acute MI
generated = model.generate(initial_event, max_new_tokens=20, temperature=0.8)

# Evaluation metrics:
# 1. Perplexity on held-out patient sequences
# 2. Clinical plausibility (expert review)
# 3. Statistical similarity to real trajectories
# 4. Downstream task performance (risk prediction)
```
Key Healthcare Models
See Healthcare Foundation Models for comprehensive coverage of:
- BEHRT: BERT for patient event sequences
- Med-BERT: Medical event prediction with pre-training
- ClinicalBERT: BERT for clinical notes
- GatorTron: Large-scale clinical language model (8.9B parameters)
Thesis Application: ETHOS Comparison
Your EmergAI thesis uses ETHOS (encoder-only transformer). Understanding the decoder-only (GPT) approach completes the picture:
ETHOS Advantages:
- Bidirectional context (see future and past during training)
- Efficient patient representation (single embedding)
- Zero-shot task transfer
GPT Advantages:
- Natural trajectory generation
- Synthetic data augmentation
- Sequential prediction with timing
Potential Thesis Extensions:
- Compare encoder vs decoder for trajectory prediction
- Use GPT to generate synthetic EmergAI patient trajectories
- Propose hybrid encoder-decoder for EmergAI
- Interpret ETHOS via attention visualization
Key Takeaways
- Medical events are natural tokens: No complex tokenization needed—use coding systems directly
- Encoder for representation, decoder for generation: Choose architecture based on task
- Temporal modeling critical: Healthcare sequences unfold over time—incorporate temporal information
- Zero-shot generalization: Pre-training enables transfer to rare conditions
- Synthetic data with caution: Can augment but must be validated by experts
- Interpretability required: Attention visualization essential for clinical trust
Learning Resources
Papers
- BEHRT (2020): BERT applied to EHR event sequences
- Med-BERT (2021): Medical event prediction with hierarchical tokenization
- GatorTron (2022): 8.9B parameter clinical language model
- ETHOS (2023): Zero-shot patient trajectory prediction (thesis baseline)
Code Examples
- Karpathy’s NanoGPT: Foundation for implementation
- Hugging Face Medical NLP: Pre-trained clinical models
- MIMIC-IV tutorials: EHR data processing
Courses
- Stanford CS224N: NLP fundamentals
- Fast.ai NLP: Practical deep learning for text
- Healthcare NLP with Transformers (DeepLearning.AI)
Related Concepts
- GPT Architecture - Decoder-only transformer fundamentals
- Language Model Training - Training techniques for LMs
- Healthcare Foundation Models - Survey of medical LMs
- EHR NLP & Tokenization - Processing clinical text and codes
- Text Generation - Sampling strategies for trajectory generation