Clinical Language Models

Transformer language models like GPT and BERT can be adapted to model patient event sequences and clinical text, enabling trajectory prediction, risk stratification, and clinical decision support. This concept bridges language modeling techniques with healthcare applications.

Overview

Language models treat text as sequences of tokens. In healthcare, we can apply the same principles to:

  1. Patient event sequences: ICD codes, procedures, medications, lab results as discrete tokens
  2. Clinical text: Medical notes, radiology reports, discharge summaries as natural language
  3. Hybrid approaches: Combining structured events with unstructured text

Key insight: Medical events are already discrete tokens (ICD codes, procedure codes). Unlike raw text, which must be split into characters or subwords, healthcare events arrive as natural units, making them a good fit for transformer models.

Event Sequence Tokenization

From Text Tokens to Medical Event Tokens

Text GPT:

# Tokenize natural language
text = "The patient has chest pain"
tokens = tokenizer.encode(text)   # [464, 5827, 468, 7721, 2356]

Healthcare GPT:

# Tokenize a patient trajectory
trajectory = [
    "ICD:I21.0",          # Acute myocardial infarction
    "PROC:PCI",           # Percutaneous coronary intervention
    "MED:aspirin",        # Medication
    "LAB:troponin_high",  # Lab result
    "VISIT:followup_30d"  # Follow-up visit
]

# Create vocabulary from medical coding systems
vocab = {
    "ICD:I21.0": 1,
    "PROC:PCI": 2,
    "MED:aspirin": 3,
    "LAB:troponin_high": 4,
    "VISIT:followup_30d": 5,
    # ... thousands more event codes
}

# Simple tokenization (events are already tokens!)
tokens = [vocab[event] for event in trajectory]   # [1, 2, 3, 4, 5]

BPE for Medical Event Patterns

Byte-Pair Encoding can learn common clinical workflows:

# Frequent co-occurring events
"chest_pain" + "ECG" + "troponin_test"  →  "acute_cardiac_workup"

# Benefits:
# - Captures clinical protocols
# - Reduces sequence length
# - Learns domain knowledge from data
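
The pseudo-code above can be made concrete. Below is a minimal sketch, not from the source, of a single BPE-style merge step over event sequences: it counts adjacent event pairs across trajectories and fuses the most frequent pair into one composite token. The toy trajectories and the merged token name are illustrative.

from collections import Counter

def most_frequent_pair(trajectories):
    """Count adjacent event pairs across all trajectories and return the top one."""
    pair_counts = Counter()
    for traj in trajectories:
        pair_counts.update(zip(traj, traj[1:]))
    return pair_counts.most_common(1)[0][0]

def merge_pair(trajectory, pair, merged_token):
    """Replace each occurrence of `pair` with the merged composite token."""
    out, i = [], 0
    while i < len(trajectory):
        if i + 1 < len(trajectory) and (trajectory[i], trajectory[i + 1]) == pair:
            out.append(merged_token)
            i += 2
        else:
            out.append(trajectory[i])
            i += 1
    return out

# Toy example (hypothetical event sequences)
trajectories = [
    ["chest_pain", "ECG", "troponin_test", "ICD:I21.0"],
    ["chest_pain", "ECG", "troponin_test", "discharge"],
]
pair = most_frequent_pair(trajectories)                 # ("chest_pain", "ECG")
merged = "+".join(pair)                                 # "chest_pain+ECG"
trajectories = [merge_pair(t, pair, merged) for t in trajectories]

Repeating the merge step until a target vocabulary size is reached yields composite tokens like the "acute_cardiac_workup" example above.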

Key Advantage: Medical events from coding systems (ICD-10, CPT, ATC) are pre-tokenized at a clinically meaningful granularity. There is no need for character or subword tokenization; the codes can be used directly as the vocabulary.

Autoregressive Trajectory Prediction

Modeling Patient Futures

GPT’s autoregressive formulation maps directly to predicting patient trajectories:

P(\text{event}_1, \ldots, \text{event}_n) = \prod_{i=1}^{n} P(\text{event}_i \mid \text{event}_1, \ldots, \text{event}_{i-1})

Example:

# Given patient history
history = ["ICD:I21.0", "PROC:PCI", "MED:aspirin"]

# Model predicts next likely events
next_events = model.generate(history, top_k=5)
# Predicted: ["MED:statin", "VISIT:followup_30d", "LAB:lipid_panel", ...]
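
The factorization above is exactly what a trained decoder scores. A minimal PyTorch sketch, assuming a hypothetical model that maps a (1, seq_len) tensor of token ids to (1, seq_len, vocab_size) next-token logits:

import torch
import torch.nn.functional as F

def trajectory_log_prob(model, token_ids):
    """Sum of log P(event_i | event_1..i-1), i.e. the log of the product above."""
    x = token_ids[:, :-1]                        # prefixes
    targets = token_ids[:, 1:]                   # events to be predicted
    logits = model(x)                            # (1, seq_len - 1, vocab_size)
    log_probs = F.log_softmax(logits, dim=-1)
    event_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return event_log_probs.sum()

Exponentiating the negative mean per-event log-probability gives the perplexity metric used in evaluation later on this page.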

Clinical Use Cases

1. Disease Progression Modeling

  • Input: Current diagnoses and treatments
  • Output: Likely future events and complications
  • Application: Anticipate deterioration, allocate resources

2. Treatment Trajectory Generation

  • Input: Patient diagnosis and condition
  • Output: Typical care pathway
  • Application: Clinical decision support, care planning

3. Outcome Risk Stratification

  • Input: Patient event history
  • Output: Probability of adverse outcomes
  • Application: Identify high-risk patients for early intervention (a minimal classification-head sketch follows this list)
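
To make use case 3 concrete, here is a minimal sketch, assuming a pretrained encoder that maps a batch of tokenized trajectories to a single embedding per patient; the module name and dimensions are illustrative, not from the source.

import torch
import torch.nn as nn

class RiskHead(nn.Module):
    """Binary adverse-outcome classifier on top of a frozen trajectory encoder."""

    def __init__(self, encoder, embed_dim=768):
        super().__init__()
        self.encoder = encoder                 # hypothetical: returns (batch, embed_dim)
        self.classifier = nn.Linear(embed_dim, 1)

    def forward(self, token_ids):
        with torch.no_grad():                  # keep the pretrained encoder frozen
            embedding = self.encoder(token_ids)
        logit = self.classifier(embedding)
        return torch.sigmoid(logit)            # probability of the adverse outcome

The head can be trained with binary cross-entropy on labeled outcomes while the encoder stays fixed, the usual setup when labeled data is scarce.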

Encoder vs Decoder for Healthcare

Understanding both encoder-only (BERT/ETHOS) and decoder-only (GPT) architectures provides a complete perspective on clinical AI:

Aspect     | Encoder (BERT, ETHOS)                | Decoder (GPT)
Attention  | Bidirectional (sees past & future)   | Causal (sees only the past)
Use Case   | Representation learning              | Sequential generation
Training   | Masked language modeling             | Autoregressive prediction
Output     | Fixed-size representation            | Step-by-step generation

When to Use Encoder (ETHOS Approach)

Best for: Patient representation, cohort analysis, risk classification

# Encode an entire patient trajectory into a fixed representation
trajectory = ["event1", "event2", "event3", "event4"]
embedding = encoder(trajectory)   # Single vector [768]

# Use the embedding for:
# - Patient similarity (find similar cases)
# - Risk scores (classify into risk categories)
# - Cohort identification (clustering)
# - Zero-shot prediction (transfer to new tasks)

Example: ETHOS uses bidirectional attention to create patient representations for zero-shot outcome prediction.

When to Use Decoder (GPT Approach)

Best for: Trajectory generation, sequential prediction, what-if scenarios

# Generate a future trajectory step-by-step
history = ["event1", "event2"]
future = decoder.generate(history, max_new_tokens=10)
# Output: ["event3", "event4", "event5", ...]

# Use for:
# - Trajectory prediction (what happens next?)
# - Counterfactual reasoning (what if a different treatment?)
# - Synthetic data generation (augmentation)

Example: Train GPT on patient event sequences to generate plausible future trajectories for rare conditions (data augmentation).

Hybrid Encoder-Decoder

Best for: Complex clinical tasks requiring both understanding and generation

# Encoder-decoder (T5-style) for healthcare
input_trajectory = ["diagnosis", "symptoms", "test_results"]
encoder_hidden = encoder(input_trajectory)

# Decoder generates a treatment plan
treatment_plan = decoder.generate(encoder_hidden)
# Output: ["med1", "med2", "procedure", "followup"]

Practical Recommendation:

  • Use encoder (ETHOS) for risk stratification and patient representation
  • Use decoder (GPT) for trajectory forecasting and synthetic data
  • Use encoder-decoder for complex generation tasks (e.g., care plan generation)

Temporal Modeling in Healthcare

Time-Aware Tokenization

Healthcare trajectories unfold over time. Incorporate temporal information:

Option 1: Time tokens

tokens = ["ICD:I21.0", "TIME:day_0", "PROC:PCI", "TIME:day_0", "MED:aspirin", "TIME:day_1", "VISIT:followup", "TIME:day_30"]

Option 2: Time delta tokens

tokens = ["ICD:I21.0", "PROC:PCI", "DELTA:+1day", "MED:aspirin", "DELTA:+29days", "VISIT:followup"]

Option 3: Time-aware positional encodings

# Modify the positional encoding to capture time intervals
pos_encoding = sinusoidal_encoding(time_in_days)
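
A minimal sketch of what such an encoding could look like, swapping the integer position in the standard sinusoidal formula for elapsed time in days; the function name and dimensions are assumptions for illustration.

import torch

def time_sinusoidal_encoding(time_in_days, d_model=768):
    """Sinusoidal encoding indexed by elapsed time instead of token position.

    time_in_days: 1-D float tensor of event times, e.g. torch.tensor([0., 0., 1., 30.])
    Returns a (num_events, d_model) tensor.
    """
    positions = time_in_days.unsqueeze(1)                      # (n, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)    # even embedding dimensions
    freqs = torch.exp(-torch.log(torch.tensor(10000.0)) * dims / d_model)
    angles = positions * freqs                                 # (n, d_model / 2)
    encoding = torch.zeros(len(time_in_days), d_model)
    encoding[:, 0::2] = torch.sin(angles)
    encoding[:, 1::2] = torch.cos(angles)
    return encoding

Events that are close in time receive similar encodings regardless of how many tokens separate them, which plain position indices cannot express.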

Causal Attention for Temporal Dependencies

Decoder models with causal attention naturally respect temporal order:

  • Events can only attend to past events (no information leakage)
  • Model learns temporal patterns (e.g., “acute event → immediate treatment → delayed monitoring”)
  • Can predict timing of future events, not just event types
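
The "no information leakage" property comes directly from the attention mask. A minimal sketch of masked scaled dot-product attention over per-event embeddings (a simplified single-head version, not a full transformer layer):

import torch
import torch.nn.functional as F

def causal_self_attention(x):
    """x: (seq_len, d) event embeddings; returns attended values and attention weights."""
    seq_len, d = x.shape
    scores = (x @ x.T) / d ** 0.5                                 # pairwise similarity
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))              # block attention to future events
    weights = F.softmax(scores, dim=-1)                           # each row sums to 1 over past events
    return weights @ x, weights

# weights[i, j] == 0 for every j > i: event i never sees later events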

Zero-Shot Trajectory Prediction

Generalizing to Rare Conditions

Large clinical language models can predict outcomes for rare conditions never seen during training:

Mechanism:

# Trained on the general patient population
model = ClinicalGPT(trained_on="all_patients")

# Zero-shot prediction for a rare condition
rare_patient_history = [rare_condition_events]
prediction = model.predict(rare_patient_history)
# Model generalizes from similar patterns in common conditions

Why it works:

  • Model learns general clinical patterns (symptoms → diagnosis → treatment)
  • Rare conditions share underlying patterns with common ones
  • Transfer learning enables generalization

Example: A model trained on common cardiac conditions can predict reasonable treatment pathways for rare congenital heart defects by transferring knowledge about general cardiac care patterns.

Synthetic Data Generation

Augmentation for Rare Conditions

Problem: Insufficient data for rare conditions

# Real data: only 10 patients with rare condition X
rare_data = [trajectory_1, trajectory_2, ..., trajectory_10]

Solution: Generate synthetic trajectories

# Fine-tune a clinical GPT on the rare-condition subset
gpt = fine_tune(pretrained_clinical_gpt, rare_data)

# Generate synthetic patient trajectories
synthetic_trajectories = []
for _ in range(100):
    trajectory = gpt.generate(initial_event="rare_condition_X")
    synthetic_trajectories.append(trajectory)

# Augmented dataset: 10 real + 100 synthetic
augmented_data = rare_data + synthetic_trajectories

Validation Requirements

Synthetic medical data requires rigorous validation:

  1. Clinical plausibility: Expert review for medical correctness
  2. Statistical similarity: Match real data distributions (see the sketch after this list)
  3. Downstream performance: Improve model accuracy on real test set
  4. Bias amplification check: Ensure no exacerbation of existing biases
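
As one concrete check of statistical similarity (point 2), compare the per-code frequency distributions of real and synthetic trajectories, for example with the Jensen-Shannon distance. A minimal sketch; the threshold mentioned in the comment is illustrative, not a validated cut-off.

from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def code_distribution(trajectories, vocab):
    """Relative frequency of each code across a set of trajectories."""
    counts = Counter(code for traj in trajectories for code in traj)
    freqs = np.array([counts[code] for code in vocab], dtype=float)
    return freqs / freqs.sum()

def distribution_gap(real, synthetic, vocab):
    """Jensen-Shannon distance between real and synthetic code distributions."""
    return jensenshannon(code_distribution(real, vocab),
                         code_distribution(synthetic, vocab))

# Flag synthetic sets whose code usage drifts far from the real data,
# e.g. distribution_gap(rare_data, synthetic_trajectories, vocab) > 0.1
# (the 0.1 threshold is illustrative only)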

Critical Caution: Synthetic medical data should augment, never replace, real patient data. Always validate with clinical experts before use in training or decision-making systems.

Attention Interpretability

Visualizing Clinical Reasoning

Attention weights reveal which past events influence predictions:

# Patient trajectory
trajectory = [
    "ICD:I21.0",       # Acute MI (day 0)
    "PROC:PCI",        # Intervention (day 0)
    "MED:aspirin",     # Medication (day 1)
    "MED:statin",      # Statin started (day 1)
    "VISIT:followup",  # Follow-up (day 30)
    "LAB:lipid_high"   # High lipids detected → predict next
]

# Extract attention patterns
attention = model.get_attention(trajectory)

# Shows: "LAB:lipid_high" strongly attends to "MED:statin" and "VISIT:followup"
# Interpretation: the model recognizes that statin therapy requires lipid monitoring
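
With a Hugging Face Transformers model, the same inspection can be done through the library's API by requesting attentions at inference time. A minimal sketch; the checkpoint name and token ids are placeholders, assuming a causal LM trained on the medical event vocabulary.

import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint name for a causal LM trained on event tokens
model = AutoModelForCausalLM.from_pretrained("clinical-event-gpt", output_attentions=True)
model.eval()

input_ids = torch.tensor([[1, 2, 3, 4, 5, 6]])         # tokenized trajectory (placeholder ids)
with torch.no_grad():
    outputs = model(input_ids)

# outputs.attentions: one (batch, heads, seq_len, seq_len) tensor per layer
last_layer = outputs.attentions[-1][0].mean(dim=0)     # average over heads
print(last_layer[-1])                                  # how the final event attends to the history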

Clinical Insights:

  • Do attention patterns align with clinical guidelines?
  • Which historical events drive predictions?
  • Are relationships medically sensible or spurious?

Application: Explain model predictions to clinicians, validate medical reasoning, identify potential errors.

Implementation: From NanoGPT to Healthcare GPT

Step 1: Healthcare Tokenizer

class HealthcareTokenizer:
    def __init__(self, vocab_path):
        # Load the medical coding vocabulary (ICD, CPT, ATC, LOINC)
        self.vocab = load_medical_vocab(vocab_path)
        self.code_to_id = {code: idx for idx, code in enumerate(self.vocab)}
        self.id_to_code = {idx: code for code, idx in self.code_to_id.items()}

    def encode(self, trajectory):
        return [self.code_to_id[event] for event in trajectory]

    def decode(self, tokens):
        return [self.id_to_code[token] for token in tokens]

Step 2: Same GPT Architecture

# Identical architecture to a text GPT, different vocabulary
model = GPT(
    vocab_size=len(medical_vocab),  # 50k-100k medical codes
    block_size=512,                 # Longer context for patient history
    n_layer=12,                     # Depth
    n_embd=768,                     # Embedding dimension
    n_head=12                       # Attention heads
)

Step 3: Training on Patient Trajectories

# Load patient event sequences
trajectories = load_ehr_trajectories(dataset="MIMIC-IV")

# Standard autoregressive training
for trajectory in trajectories:
    # Input: all events except the last
    # Target: all events except the first (shifted by 1)
    x, y = trajectory[:-1], trajectory[1:]

    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Step 4: Generation and Evaluation

# Generate a patient trajectory
initial_event = ["ICD:I21.0"]   # Start with acute MI
generated = model.generate(initial_event, max_new_tokens=20, temperature=0.8)

# Evaluation metrics:
# 1. Perplexity on held-out patient sequences
# 2. Clinical plausibility (expert review)
# 3. Statistical similarity to real trajectories
# 4. Downstream task performance (risk prediction)
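
For the first metric, perplexity is the exponential of the mean per-event negative log-likelihood on held-out trajectories. A minimal sketch, reusing the hypothetical model interface from the training step above (token-id tensors in, logits out):

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, held_out_trajectories):
    """exp(mean negative log-likelihood per predicted event)."""
    total_nll, total_events = 0.0, 0
    for trajectory in held_out_trajectories:           # 1-D tensors of token ids
        x, y = trajectory[:-1].unsqueeze(0), trajectory[1:].unsqueeze(0)
        logits = model(x)                              # (1, seq_len - 1, vocab_size)
        nll = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1), reduction="sum")
        total_nll += nll.item()
        total_events += y.numel()
    return math.exp(total_nll / total_events)

Lower perplexity means the model assigns higher probability to real held-out sequences; it should always be read alongside the clinical plausibility review.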

Key Healthcare Models

See Healthcare Foundation Models for comprehensive coverage of:

  • BEHRT: BERT for patient event sequences
  • Med-BERT: Medical event prediction with pre-training
  • ClinicalBERT: BERT for clinical notes
  • GatorTron: Large-scale clinical language model (8.9B parameters)

Thesis Application: ETHOS Comparison

Your EmergAI thesis uses ETHOS (an encoder-only transformer). Understanding the decoder-only (GPT) approach completes the picture:

ETHOS Advantages:

  • Bidirectional context (see future and past during training)
  • Efficient patient representation (single embedding)
  • Zero-shot task transfer

GPT Advantages:

  • Natural trajectory generation
  • Synthetic data augmentation
  • Sequential prediction with timing

Potential Thesis Extensions:

  1. Compare encoder vs decoder for trajectory prediction
  2. Use GPT to generate synthetic EmergAI patient trajectories
  3. Propose hybrid encoder-decoder for EmergAI
  4. Interpret ETHOS via attention visualization

Key Takeaways

  1. Medical events are natural tokens: No complex tokenization needed—use coding systems directly
  2. Encoder for representation, decoder for generation: Choose architecture based on task
  3. Temporal modeling critical: Healthcare sequences unfold over time—incorporate temporal information
  4. Zero-shot generalization: Pre-training enables transfer to rare conditions
  5. Synthetic data with caution: Can augment but must be validated by experts
  6. Interpretability required: Attention visualization essential for clinical trust

Learning Resources

Papers

  • BEHRT (2020): BERT applied to EHR event sequences
  • Med-BERT (2021): Medical event prediction with hierarchical tokenization
  • GatorTron (2022): 8.9B parameter clinical language model
  • ETHOS (2023): Zero-shot patient trajectory prediction (thesis baseline)

Code Examples

  • Karpathy’s NanoGPT: Foundation for implementation
  • Hugging Face Medical NLP: Pre-trained clinical models
  • MIMIC-IV tutorials: EHR data processing

Courses

  • Stanford CS224N: NLP fundamentals
  • Fast.ai NLP: Practical deep learning for text
  • Healthcare NLP with Transformers (DeepLearning.AI)