Clinical Language Models

Transformer language models like GPT and BERT can be adapted to model patient event sequences and clinical text, enabling trajectory prediction, risk stratification, and clinical decision support. This concept bridges language modeling techniques with healthcare applications.

Overview

Language models treat text as sequences of tokens. In healthcare, we can apply the same principles to:

  1. Patient event sequences: ICD codes, procedures, medications, lab results as discrete tokens
  2. Clinical text: Medical notes, radiology reports, discharge summaries as natural language
  3. Hybrid approaches: Combining structured events with unstructured text

Key insight: Medical events are already discrete tokens (ICD codes, procedure codes). Unlike raw text, which must be split into characters or subwords, healthcare events arrive as natural units, making them a good fit for transformer models.

Event Sequence Tokenization

From Text Tokens to Medical Event Tokens

Text GPT:

# Tokenize natural language
text = "The patient has chest pain"
tokens = tokenizer.encode(text)   # [464, 5827, 468, 7721, 2356]

Healthcare GPT:

# Tokenize a patient trajectory
trajectory = [
    "ICD:I21.0",          # Acute myocardial infarction
    "PROC:PCI",           # Percutaneous coronary intervention
    "MED:aspirin",        # Medication
    "LAB:troponin_high",  # Lab result
    "VISIT:followup_30d"  # Follow-up visit
]

# Create vocabulary from medical coding systems
vocab = {
    "ICD:I21.0": 1,
    "PROC:PCI": 2,
    "MED:aspirin": 3,
    "LAB:troponin_high": 4,
    "VISIT:followup_30d": 5,
    # ... thousands more event codes
}

# Simple tokenization (events are already tokens!)
tokens = [vocab[event] for event in trajectory]   # [1, 2, 3, 4, 5]

BPE for Medical Event Patterns

Byte-Pair Encoding can learn common clinical workflows:

# Frequent co-occurring events
"chest_pain" + "ECG" + "troponin_test"  →  "acute_cardiac_workup"

# Benefits:
# - Captures clinical protocols
# - Reduces sequence length
# - Learns domain knowledge from data
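
The pseudo-code above can be made concrete. Below is a minimal sketch, not from the source, of a single BPE-style merge step over event sequences: it counts adjacent event pairs across trajectories and fuses the most frequent pair into one composite token. The toy trajectories and the merged token name are illustrative.

from collections import Counter

def most_frequent_pair(trajectories):
    """Count adjacent event pairs across all trajectories and return the top one."""
    pair_counts = Counter()
    for traj in trajectories:
        pair_counts.update(zip(traj, traj[1:]))
    return pair_counts.most_common(1)[0][0]

def merge_pair(trajectory, pair, merged_token):
    """Replace each occurrence of `pair` with the merged composite token."""
    out, i = [], 0
    while i < len(trajectory):
        if i + 1 < len(trajectory) and (trajectory[i], trajectory[i + 1]) == pair:
            out.append(merged_token)
            i += 2
        else:
            out.append(trajectory[i])
            i += 1
    return out

# Toy example (hypothetical event sequences)
trajectories = [
    ["chest_pain", "ECG", "troponin_test", "ICD:I21.0"],
    ["chest_pain", "ECG", "troponin_test", "discharge"],
]
pair = most_frequent_pair(trajectories)                 # ("chest_pain", "ECG")
merged = "+".join(pair)                                 # "chest_pain+ECG"
trajectories = [merge_pair(t, pair, merged) for t in trajectories]

Repeating the merge step until a target vocabulary size is reached yields composite tokens like the "acute_cardiac_workup" example above.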

Key Advantage: Medical events from coding systems (ICD-10, CPT, ATC) are pre-tokenized at a clinically meaningful granularity. There is no need for character or subword tokenization; the codes can be used directly as the vocabulary.

Autoregressive Trajectory Prediction

Modeling Patient Futures

GPT’s autoregressive formulation maps directly to predicting patient trajectories:

P(\text{event}_1, \ldots, \text{event}_n) = \prod_{i=1}^{n} P(\text{event}_i \mid \text{event}_1, \ldots, \text{event}_{i-1})

Example:

# Given patient history
history = ["ICD:I21.0", "PROC:PCI", "MED:aspirin"]

# Model predicts next likely events
next_events = model.generate(history, top_k=5)
# Predicted: ["MED:statin", "VISIT:followup_30d", "LAB:lipid_panel", ...]
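
The factorization above is exactly what a trained decoder scores. A minimal PyTorch sketch, assuming a hypothetical model that maps a (1, seq_len) tensor of token ids to (1, seq_len, vocab_size) next-token logits:

import torch
import torch.nn.functional as F

def trajectory_log_prob(model, token_ids):
    """Sum of log P(event_i | event_1..i-1), i.e. the log of the product above."""
    x = token_ids[:, :-1]                        # prefixes
    targets = token_ids[:, 1:]                   # events to be predicted
    logits = model(x)                            # (1, seq_len - 1, vocab_size)
    log_probs = F.log_softmax(logits, dim=-1)
    event_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return event_log_probs.sum()

Exponentiating the negative mean per-event log-probability gives the perplexity metric used in evaluation later on this page.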

Clinical Use Cases

1. Disease Progression Modeling

  • Input: Current diagnoses and treatments
  • Output: Likely future events and complications
  • Application: Anticipate deterioration, allocate resources

2. Treatment Trajectory Generation

  • Input: Patient diagnosis and condition
  • Output: Typical care pathway
  • Application: Clinical decision support, care planning

3. Outcome Risk Stratification

  • Input: Patient event history
  • Output: Probability of adverse outcomes
  • Application: Identify high-risk patients for early intervention (a minimal classification-head sketch follows this list)
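
To make use case 3 concrete, here is a minimal sketch, assuming a pretrained encoder that maps a batch of tokenized trajectories to a single embedding per patient; the module name and dimensions are illustrative, not from the source.

import torch
import torch.nn as nn

class RiskHead(nn.Module):
    """Binary adverse-outcome classifier on top of a frozen trajectory encoder."""

    def __init__(self, encoder, embed_dim=768):
        super().__init__()
        self.encoder = encoder                 # hypothetical: returns (batch, embed_dim)
        self.classifier = nn.Linear(embed_dim, 1)

    def forward(self, token_ids):
        with torch.no_grad():                  # keep the pretrained encoder frozen
            embedding = self.encoder(token_ids)
        logit = self.classifier(embedding)
        return torch.sigmoid(logit)            # probability of the adverse outcome

The head can be trained with binary cross-entropy on labeled outcomes while the encoder stays fixed, the usual setup when labeled data is scarce.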

Encoder vs Decoder for Healthcare

Understanding both encoder-only (BERT/ETHOS) and decoder-only (GPT) architectures provides a complete perspective on clinical AI:

Aspect     | Encoder (BERT, ETHOS)                | Decoder (GPT)
Attention  | Bidirectional (sees past & future)   | Causal (sees only the past)
Use Case   | Representation learning              | Sequential generation
Training   | Masked language modeling             | Autoregressive prediction
Output     | Fixed-size representation            | Step-by-step generation

When to Use Encoder (ETHOS Approach)

Best for: Patient representation, cohort analysis, risk classification

# Encode an entire patient trajectory into a fixed representation
trajectory = ["event1", "event2", "event3", "event4"]
embedding = encoder(trajectory)   # Single vector [768]

# Use the embedding for:
# - Patient similarity (find similar cases)
# - Risk scores (classify into risk categories)
# - Cohort identification (clustering)
# - Zero-shot prediction (transfer to new tasks)

Example: ETHOS uses bidirectional attention to create patient representations for zero-shot outcome prediction.

When to Use Decoder (GPT Approach)

Best for: Trajectory generation, sequential prediction, what-if scenarios

# Generate a future trajectory step-by-step
history = ["event1", "event2"]
future = decoder.generate(history, max_new_tokens=10)
# Output: ["event3", "event4", "event5", ...]

# Use for:
# - Trajectory prediction (what happens next?)
# - Counterfactual reasoning (what if a different treatment?)
# - Synthetic data generation (augmentation)

Example: Train GPT on patient event sequences to generate plausible future trajectories for rare conditions (data augmentation).

Hybrid Encoder-Decoder

Best for: Complex clinical tasks requiring both understanding and generation

# Encoder-decoder (T5-style) for healthcare
input_trajectory = ["diagnosis", "symptoms", "test_results"]
encoder_hidden = encoder(input_trajectory)

# Decoder generates a treatment plan
treatment_plan = decoder.generate(encoder_hidden)
# Output: ["med1", "med2", "procedure", "followup"]

Practical Recommendation:

  • Use encoder (ETHOS) for risk stratification and patient representation
  • Use decoder (GPT) for trajectory forecasting and synthetic data
  • Use encoder-decoder for complex generation tasks (e.g., care plan generation)

Temporal Modeling in Healthcare

Time-Aware Tokenization

Healthcare trajectories unfold over time. Incorporate temporal information:

Option 1: Time tokens

tokens = ["ICD:I21.0", "TIME:day_0", "PROC:PCI", "TIME:day_0", "MED:aspirin", "TIME:day_1", "VISIT:followup", "TIME:day_30"]

Option 2: Time delta tokens

tokens = ["ICD:I21.0", "PROC:PCI", "DELTA:+1day", "MED:aspirin", "DELTA:+29days", "VISIT:followup"]

Option 3: Time-aware positional encodings

# Modify the positional encoding to capture time intervals
pos_encoding = sinusoidal_encoding(time_in_days)
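
A minimal sketch of what such an encoding could look like, swapping the integer position in the standard sinusoidal formula for elapsed time in days; the function name and dimensions are assumptions for illustration.

import torch

def time_sinusoidal_encoding(time_in_days, d_model=768):
    """Sinusoidal encoding indexed by elapsed time instead of token position.

    time_in_days: 1-D float tensor of event times, e.g. torch.tensor([0., 0., 1., 30.])
    Returns a (num_events, d_model) tensor.
    """
    positions = time_in_days.unsqueeze(1)                      # (n, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)    # even embedding dimensions
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) * dims / d_model)
    angles = positions * freqs                                 # (n, d_model / 2)
    encoding = torch.zeros(len(time_in_days), d_model)
    encoding[:, 0::2] = torch.sin(angles)
    encoding[:, 1::2] = torch.cos(angles)
    return encoding

Events that are close in time receive similar encodings regardless of how many tokens separate them, which plain position indices cannot express.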

Causal Attention for Temporal Dependencies

Decoder models with causal attention naturally respect temporal order:

  • Events can only attend to past events (no information leakage)
  • Model learns temporal patterns (e.g., “acute event → immediate treatment → delayed monitoring”)
  • Can predict timing of future events, not just event types
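
The "no information leakage" property comes directly from the attention mask. A minimal sketch of masked scaled dot-product attention over per-event embeddings (a simplified single-head version, not a full transformer layer):

import torch
import torch.nn.functional as F

def causal_self_attention(x):
    """x: (seq_len, d) event embeddings; returns attended values and attention weights."""
    seq_len, d = x.shape
    scores = (x @ x.T) / d ** 0.5                                 # pairwise similarity
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))              # block attention to future events
    weights = F.softmax(scores, dim=-1)                           # each row sums to 1 over past events
    return weights @ x, weights

# weights[i, j] == 0 for every j > i: event i never sees later events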

Zero-Shot Trajectory Prediction

Generalizing to Rare Conditions

Large clinical language models can predict outcomes for rare conditions never seen during training:

Mechanism:

# Trained on the general patient population
model = ClinicalGPT(trained_on="all_patients")

# Zero-shot prediction for a rare condition
rare_patient_history = [rare_condition_events]
prediction = model.predict(rare_patient_history)
# Model generalizes from similar patterns in common conditions

Why it works:

  • Model learns general clinical patterns (symptoms → diagnosis → treatment)
  • Rare conditions share underlying patterns with common ones
  • Transfer learning enables generalization

Example: A model trained on common cardiac conditions can predict reasonable treatment pathways for rare congenital heart defects by transferring knowledge about general cardiac care patterns.

Synthetic Data Generation

Augmentation for Rare Conditions

Problem: Insufficient data for rare conditions

# Real data: only 10 patients with rare condition X
rare_data = [trajectory_1, trajectory_2, ..., trajectory_10]

Solution: Generate synthetic trajectories

# Fine-tune a clinical GPT on the rare-condition subset
gpt = fine_tune(pretrained_clinical_gpt, rare_data)

# Generate synthetic patient trajectories
synthetic_trajectories = []
for _ in range(100):
    trajectory = gpt.generate(initial_event="rare_condition_X")
    synthetic_trajectories.append(trajectory)

# Augmented dataset: 10 real + 100 synthetic
augmented_data = rare_data + synthetic_trajectories

Validation Requirements

Synthetic medical data requires rigorous validation:

  1. Clinical plausibility: Expert review for medical correctness
  2. Statistical similarity: Match real data distributions (see the sketch after this list)
  3. Downstream performance: Improve model accuracy on real test set
  4. Bias amplification check: Ensure no exacerbation of existing biases
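
As one concrete check of statistical similarity (point 2), compare the per-code frequency distributions of real and synthetic trajectories, for example with the Jensen-Shannon distance. A minimal sketch; the threshold mentioned in the comment is illustrative, not a validated cut-off.

from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def code_distribution(trajectories, vocab):
    """Relative frequency of each code across a set of trajectories."""
    counts = Counter(code for traj in trajectories for code in traj)
    freqs = np.array([counts[code] for code in vocab], dtype=float)
    return freqs / freqs.sum()

def distribution_gap(real, synthetic, vocab):
    """Jensen-Shannon distance between real and synthetic code distributions."""
    return jensenshannon(code_distribution(real, vocab),
                         code_distribution(synthetic, vocab))

# Flag synthetic sets whose code usage drifts far from the real data,
# e.g. distribution_gap(rare_data, synthetic_trajectories, vocab) > 0.1
# (the 0.1 threshold is illustrative only)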

Critical Caution: Synthetic medical data should augment, never replace, real patient data. Always validate with clinical experts before use in training or decision-making systems.

Attention Interpretability

Visualizing Clinical Reasoning

Attention weights reveal which past events influence predictions:

# Patient trajectory
trajectory = [
    "ICD:I21.0",       # Acute MI (day 0)
    "PROC:PCI",        # Intervention (day 0)
    "MED:aspirin",     # Medication (day 1)
    "MED:statin",      # Statin started (day 1)
    "VISIT:followup",  # Follow-up (day 30)
    "LAB:lipid_high"   # High lipids detected → predict next
]

# Extract attention patterns
attention = model.get_attention(trajectory)

# Shows: "LAB:lipid_high" strongly attends to "MED:statin" and "VISIT:followup"
# Interpretation: the model recognizes that statin therapy requires lipid monitoring
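
With a Hugging Face Transformers model, the same inspection can be done through the library's API by requesting attentions at inference time. A minimal sketch; the checkpoint name and token ids are placeholders, assuming a causal LM trained on the medical event vocabulary.

import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint name for a causal LM trained on event tokens
model = AutoModelForCausalLM.from_pretrained("clinical-event-gpt", output_attentions=True)
model.eval()

input_ids = torch.tensor([[1, 2, 3, 4, 5, 6]])         # tokenized trajectory (placeholder ids)
with torch.no_grad():
    outputs = model(input_ids)

# outputs.attentions: one (batch, heads, seq_len, seq_len) tensor per layer
last_layer = outputs.attentions[-1][0].mean(dim=0)     # average over heads
print(last_layer[-1])                                  # how the final event attends to the history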

Clinical Insights:

  • Do attention patterns align with clinical guidelines?
  • Which historical events drive predictions?
  • Are relationships medically sensible or spurious?

Application: Explain model predictions to clinicians, validate medical reasoning, identify potential errors.

Implementation: From NanoGPT to Healthcare GPT

Step 1: Healthcare Tokenizer

class HealthcareTokenizer:
    def __init__(self, vocab_path):
        # Load the medical coding vocabulary (ICD, CPT, ATC, LOINC)
        self.vocab = load_medical_vocab(vocab_path)
        self.code_to_id = {code: idx for idx, code in enumerate(self.vocab)}
        self.id_to_code = {idx: code for code, idx in self.code_to_id.items()}

    def encode(self, trajectory):
        return [self.code_to_id[event] for event in trajectory]

    def decode(self, tokens):
        return [self.id_to_code[token] for token in tokens]

Step 2: Same GPT Architecture

# Identical architecture to a text GPT, different vocabulary
model = GPT(
    vocab_size=len(medical_vocab),  # 50k-100k medical codes
    block_size=512,                 # Longer context for patient history
    n_layer=12,                     # Depth
    n_embd=768,                     # Embedding dimension
    n_head=12                       # Attention heads
)

Step 3: Training on Patient Trajectories

# Load patient event sequences
trajectories = load_ehr_trajectories(dataset="MIMIC-IV")

# Standard autoregressive training
for trajectory in trajectories:
    # Input: all events except the last
    # Target: all events except the first (shifted by 1)
    x, y = trajectory[:-1], trajectory[1:]

    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Step 4: Generation and Evaluation

# Generate a patient trajectory
initial_event = ["ICD:I21.0"]   # Start with acute MI
generated = model.generate(initial_event, max_new_tokens=20, temperature=0.8)

# Evaluation metrics:
# 1. Perplexity on held-out patient sequences
# 2. Clinical plausibility (expert review)
# 3. Statistical similarity to real trajectories
# 4. Downstream task performance (risk prediction)
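
For the first metric, perplexity is the exponential of the mean per-event negative log-likelihood on held-out trajectories. A minimal sketch, reusing the hypothetical model interface from the training step above (token-id tensors in, logits out):

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, held_out_trajectories):
    """exp(mean negative log-likelihood per predicted event)."""
    total_nll, total_events = 0.0, 0
    for trajectory in held_out_trajectories:           # 1-D tensors of token ids
        x, y = trajectory[:-1].unsqueeze(0), trajectory[1:].unsqueeze(0)
        logits = model(x)                              # (1, seq_len - 1, vocab_size)
        nll = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1), reduction="sum")
        total_nll += nll.item()
        total_events += y.numel()
    return math.exp(total_nll / total_events)

Lower perplexity means the model assigns higher probability to real held-out sequences; it should always be read alongside the clinical plausibility review.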

Key Healthcare Models

See Healthcare Foundation Models for comprehensive coverage of:

  • BEHRT: BERT for patient event sequences
  • Med-BERT: Medical event prediction with pre-training
  • ClinicalBERT: BERT for clinical notes
  • GatorTron: Large-scale clinical language model (8.9B parameters)

Thesis Application: ETHOS Comparison

Your EmergAI thesis uses ETHOS (an encoder-only transformer). Understanding the decoder-only (GPT) approach completes the picture:

ETHOS Advantages:

  • Bidirectional context (see future and past during training)
  • Efficient patient representation (single embedding)
  • Zero-shot task transfer

GPT Advantages:

  • Natural trajectory generation
  • Synthetic data augmentation
  • Sequential prediction with timing

Potential Thesis Extensions:

  1. Compare encoder vs decoder for trajectory prediction
  2. Use GPT to generate synthetic EmergAI patient trajectories
  3. Propose hybrid encoder-decoder for EmergAI
  4. Interpret ETHOS via attention visualization

Key Takeaways

  1. Medical events are natural tokens: No complex tokenization needed—use coding systems directly
  2. Encoder for representation, decoder for generation: Choose architecture based on task
  3. Temporal modeling critical: Healthcare sequences unfold over time—incorporate temporal information
  4. Zero-shot generalization: Pre-training enables transfer to rare conditions
  5. Synthetic data with caution: Can augment but must be validated by experts
  6. Interpretability required: Attention visualization essential for clinical trust

Learning Resources

Papers

  • BEHRT (2020): BERT applied to EHR event sequences
  • Med-BERT (2021): Medical event prediction with hierarchical tokenization
  • GatorTron (2022): 8.9B parameter clinical language model
  • ETHOS (2023): Zero-shot patient trajectory prediction (thesis baseline)

Code Examples

  • Karpathy’s NanoGPT: Foundation for implementation
  • Hugging Face Medical NLP: Pre-trained clinical models
  • MIMIC-IV tutorials: EHR data processing

Courses

  • Stanford CS224N: NLP fundamentals
  • Fast.ai NLP: Practical deep learning for text
  • Healthcare NLP with Transformers (DeepLearning.AI)