Medical Event Tokenization and Clinical NLP
Overview
To apply transformer models to electronic health records, we must convert structured medical events (diagnosis codes, procedures, medications) and unstructured clinical text into token sequences. This process requires specialized tokenization strategies and domain-specific language models.
Medical Event Tokenization
Challenge
Transform heterogeneous EHR data into sequences suitable for transformer processing, analogous to tokenizing words in natural language.
Approach 1: Code-Level Tokenization
Strategy: Treat each medical code as an atomic token (vocabulary item).
```python
# Build vocabulary from medical codes
vocab = {
    '[PAD]': 0,
    '[CLS]': 1,
    '[SEP]': 2,
    '[MASK]': 3,
    # ICD-10 diagnosis codes
    'I21.0': 4,     # STEMI
    'I10': 5,       # Hypertension
    'E11.9': 6,     # Diabetes Type 2
    # CPT procedure codes
    '93000': 7,     # ECG
    '84484': 8,     # Troponin test (quantitative)
    # ATC medication codes
    'C01AA05': 9,   # Digoxin
    'C07AB03': 10,  # Atenolol
    # ... 50,000+ total codes
}

# Patient trajectory as token sequence
trajectory = [
    vocab['[CLS]'],
    vocab['I21.0'],    # Diagnosis: STEMI
    vocab['93000'],    # Procedure: ECG
    vocab['84484'],    # Lab: Troponin
    vocab['C01AA05'],  # Medication: Digoxin
    vocab['[SEP]']
]
```

Advantages:
- Simple and interpretable
- Direct correspondence to medical events
- Compatible with standard transformer architectures
Disadvantages:
- Large vocabulary size (50,000+ codes)
- No handling of rare/unseen codes (a fallback sketch follows this list)
- Ignores hierarchical structure
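To make the rare-code limitation concrete, here is a minimal sketch of the usual fallback under the toy vocabulary above: any code missing from the vocabulary collapses to a reserved `[UNK]` token and its clinical meaning is lost. The `[UNK]` entry and the `encode_trajectory` helper are illustrative additions, not part of a standard API.

```python
# Illustrative fallback: unseen codes collapse to a reserved [UNK] token.
vocab['[UNK]'] = len(vocab)  # assumed extra entry, not in the vocab above

def encode_trajectory(codes, vocab):
    """Map medical codes to token IDs, falling back to [UNK] for unseen codes."""
    return [vocab.get(code, vocab['[UNK]']) for code in codes]

# 'I21.4' is absent from the toy vocabulary, so its specific meaning is lost.
ids = encode_trajectory(['[CLS]', 'I21.0', 'I21.4', '[SEP]'], vocab)
```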
Approach 2: Hierarchical Tokenization
Strategy: Decompose codes into hierarchical components using medical ontology structure.
```python
# Hierarchical representation of an ICD-10 code
def hierarchical_tokenize(icd10_code):
    """
    I21.0 → ['I', 'I21', 'I21.0']
    Circulatory → Acute MI → Anterior STEMI
    """
    tokens = []
    if len(icd10_code) >= 1:
        tokens.append(icd10_code[0])    # chapter, e.g. 'I'
    if len(icd10_code) >= 3:
        tokens.append(icd10_code[:3])   # category, e.g. 'I21'
    if len(icd10_code) >= 4:
        tokens.append(icd10_code)       # full code, e.g. 'I21.0'
    return tokens

# Example
tokens = hierarchical_tokenize('I21.0')
# Returns: ['I', 'I21', 'I21.0']
```

Advantages:
- Captures semantic relationships (I21.0 is a type of I21)
- Enables generalization across hierarchy levels
- Useful for rare code handling (see the back-off sketch below)
Disadvantages:
- Increases sequence length (3x per code)
- More complex modeling
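To illustrate the rare-code advantage noted above, the sketch below backs an unseen full code off to its most specific known ancestor in the hierarchy. `hierarchical_backoff` and the `known_tokens` set are illustrative names layered on the `hierarchical_tokenize` function above, not a standard library call.

```python
# Back off an unseen ICD-10 code to coarser hierarchy levels.
# known_tokens is assumed to hold every token seen during training.
known_tokens = {'I', 'I21', 'I21.0', 'I10', 'E11', 'E11.9'}

def hierarchical_backoff(icd10_code, known_tokens, unk='[UNK]'):
    """Return the most specific known ancestor of a code, else [UNK]."""
    for token in reversed(hierarchical_tokenize(icd10_code)):
        if token in known_tokens:
            return token
    return unk

# 'I21.9' was never seen, but its category 'I21' (acute MI) was,
# so the model still receives a meaningful signal instead of [UNK].
print(hierarchical_backoff('I21.9', known_tokens))  # -> 'I21'
```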
Approach 3: Byte-Pair Encoding (BPE)
Strategy: Learn subword units from code sequences, similar to GPT tokenization.
```python
# Train BPE on medical codes
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Initialize BPE tokenizer; split trajectories on whitespace so each
# code string (e.g. "I21.0 93000 84484") becomes a unit for BPE merges
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

# Train on code corpus (ehr_code_sequences: an iterable of
# whitespace-separated code strings, one per patient or visit)
trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=['[PAD]', '[CLS]', '[SEP]', '[MASK]']
)
tokenizer.train_from_iterator(ehr_code_sequences, trainer)

# Tokenize rare code
rare_code = 'I21.09'  # Not in training set
tokens = tokenizer.encode(rare_code).tokens
# May decompose as: ['I21', '.', '09']
```

Advantages:
- Handles out-of-vocabulary codes
- Learns common code patterns
- Reduces vocabulary size
Disadvantages:
- Less interpretable decompositions
- May not respect medical semantics
Temporal Encoding
Time is critical in healthcare—the sequence and spacing of events matter for prediction.
Time Since Admission
```python
import torch
import math

def temporal_positional_encoding(times, d_model=768):
    """
    Encode time deltas using sinusoidal positional encoding

    Args:
        times: (seq_len,) tensor of days since admission
        d_model: embedding dimension
    Returns:
        (seq_len, d_model) temporal embeddings
    """
    seq_len = len(times)
    pe = torch.zeros(seq_len, d_model)
    position = times.unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Example usage
event_times = torch.tensor([0.0, 0.5, 1.0, 3.0, 7.0])  # Days since admission
temporal_emb = temporal_positional_encoding(event_times)
```

Learned Temporal Embeddings
```python
import torch
import torch.nn as nn

class LearnedTemporalEmbedding(nn.Module):
    def __init__(self, max_days=365, d_model=768):
        super().__init__()
        # Discretize time into bins (one embedding per day)
        self.max_days = max_days
        self.time_embedding = nn.Embedding(max_days, d_model)

    def forward(self, event_times):
        # Convert float days to integer bins, clipped to the embedding range
        time_bins = torch.clamp(event_times.long(), 0, self.max_days - 1)
        return self.time_embedding(time_bins)
```

Clinical Natural Language Processing
The Challenge
Clinical notes contain:
- Medical jargon: “diaphoresis”, “dyspnea”, “tachycardia”
- Abbreviations: “MI”, “CHF”, “SOB”, “CP”
- Negation: “No chest pain”, “denies dyspnea”
- Temporal expressions: “2 days ago”, “ongoing since morning”
- Misspellings: Common in rapid clinical documentation
Standard language models (BERT, GPT) trained on Wikipedia/books perform poorly on clinical text.
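To see the vocabulary mismatch directly, the short sketch below tokenizes a jargon-heavy note fragment with a general-purpose BERT tokenizer. The example sentence is made up, and the exact subword splits depend on the tokenizer version, so treat the output as illustrative rather than a benchmark.

```python
from transformers import AutoTokenizer

# General-purpose WordPiece vocabulary (not trained on clinical text)
general_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Pt c/o SOB and diaphoresis, denies CP."

# Clinical jargon and abbreviations fall outside the general vocabulary,
# so they tend to be shredded into many short subword pieces.
print(general_tok.tokenize(text))
```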
ClinicalBERT
BERT pre-trained on clinical notes from the MIMIC-III database. Newer variants can also be trained on MIMIC-IV, which includes 269,573 additional ED admission notes.
```python
from transformers import AutoTokenizer, AutoModel

# Load ClinicalBERT
model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode clinical text
text = "Patient presents with acute chest pain radiating to left arm, diaphoresis, and dyspnea. Troponin elevated."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
outputs = model(**inputs)

# Extract text embedding (CLS token)
text_embedding = outputs.last_hidden_state[:, 0, :]  # (1, 768)
```

Paper: Publicly Available Clinical BERT Embeddings (Alsentzer et al., 2019)
ClinicalBERT vs General BERT
| Aspect | General BERT | ClinicalBERT |
|---|---|---|
| Pre-training Data | Wikipedia, BooksCorpus | MIMIC-III clinical notes (2M notes) |
| Vocabulary | General English words | Medical terminology, abbreviations |
| Medical NER | Poor (40-50% F1) | Excellent (85-90% F1) |
| Clinical Tasks | Low performance | State-of-the-art |
| Use Case | General NLP | Healthcare-specific NLP |
Medical Named Entity Recognition (NER)
Extract medical entities from text:
```python
from transformers import pipeline

# Load medical NER pipeline
ner = pipeline("ner", model="alvaroalon2/biobert_diseases_ner", aggregation_strategy="simple")

# Extract entities
text = "Patient diagnosed with acute myocardial infarction and Type 2 diabetes mellitus"
entities = ner(text)
# Output:
# [
#   {'entity_group': 'Disease', 'word': 'acute myocardial infarction', 'score': 0.99},
#   {'entity_group': 'Disease', 'word': 'Type 2 diabetes mellitus', 'score': 0.98}
# ]
```

Multimodal EHR Representation
Combining structured codes and clinical text:
```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MultimodalEHREncoder(nn.Module):
    def __init__(self, vocab_size=50000, d_model=768):
        super().__init__()
        # Structured event embeddings
        self.event_embedding = nn.Embedding(vocab_size, d_model)
        self.temporal_embedding = LearnedTemporalEmbedding(max_days=365, d_model=d_model)
        # Clinical text encoder (ClinicalBERT)
        self.text_encoder = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
        # Transformer encoder over the combined sequence (batch-first tensors)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True),
            num_layers=6
        )

    def forward(self, event_ids, event_times, clinical_notes):
        # Embed structured events
        event_emb = self.event_embedding(event_ids)           # (batch, seq_len, d_model)
        temporal_emb = self.temporal_embedding(event_times)   # (batch, seq_len, d_model)
        structured_emb = event_emb + temporal_emb

        # Encode clinical text (clinical_notes: dict of tensors from the ClinicalBERT tokenizer)
        text_outputs = self.text_encoder(**clinical_notes)
        text_emb = text_outputs.last_hidden_state[:, 0:1, :]  # (batch, 1, d_model) CLS embedding

        # Concatenate text and structured event sequences
        combined_emb = torch.cat([text_emb, structured_emb], dim=1)  # (batch, seq_len+1, d_model)

        # Encode with transformer
        encoded = self.transformer(combined_emb)
        return encoded
```

Related Concepts
- EHR Structure and Medical Coding: Foundation for understanding what’s being tokenized
- Healthcare Foundation Models: Models like ETHOS and BEHRT that use these tokenization strategies
- BPE Tokenization: General tokenization technique applied to medical codes
- Transformer Architecture: The model architecture these tokens feed into
Learning Resources
Papers
- Publicly Available Clinical BERT Embeddings (Alsentzer et al., 2019)
- ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission (Huang et al., 2019)
- BioBERT: A pre-trained biomedical language representation model (Lee et al., 2019)
- Med-BERT: Pre-trained Contextualized Embeddings on Large-Scale Structured EHRs for Disease Prediction (Rasmy et al., 2020)
Models
- emilyalsentzer/Bio_ClinicalBERT - Clinical BERT on HuggingFace
- dmis-lab/biobert-v1.1 - BioBERT for biomedical text
- alvaroalon2/biobert_diseases_ner - Medical NER model
Tools
- MIMIC-IV Code Mappings - Standard code vocabularies
- Hugging Face Medical Models - Collection of clinical NLP models
Applications
- Clinical note classification: Identify diagnosis from admission notes (see the sketch after this list)
- Adverse event detection: Extract drug reactions from clinical text
- Symptom extraction: Parse patient-reported symptoms
- Medical question answering: Answer clinical queries using EHR data
- Disease progression modeling: Combine codes and notes for temporal prediction
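As a minimal sketch of the first application (clinical note classification), the snippet below puts a linear classification head on ClinicalBERT's CLS embedding. `NoteClassifier`, the three-way label set, and the example note are illustrative assumptions rather than a prescribed pipeline.

```python
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class NoteClassifier(nn.Module):
    """Illustrative diagnosis classifier: ClinicalBERT CLS embedding + linear head."""
    def __init__(self, num_diagnoses, model_name="emilyalsentzer/Bio_ClinicalBERT"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_diagnoses)

    def forward(self, **inputs):
        cls_emb = self.encoder(**inputs).last_hidden_state[:, 0, :]  # (batch, hidden)
        return self.head(cls_emb)                                    # (batch, num_diagnoses) logits

# Usage (hypothetical 3-way diagnosis label set)
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = NoteClassifier(num_diagnoses=3)
inputs = tokenizer("Acute chest pain radiating to left arm, troponin elevated.",
                   return_tensors="pt", truncation=True, max_length=512)
logits = model(**inputs)
```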
Key Takeaways
- Code-level tokenization treats each medical code as a token (simple, interpretable)
- Hierarchical tokenization leverages ICD-10/ATC structure for generalization
- BPE tokenization handles rare codes through subword decomposition
- Temporal encoding is critical—use positional encoding for time deltas
- ClinicalBERT substantially outperforms general BERT on clinical text (on the order of 40 F1 points on medical NER in the comparison above)
- Multimodal approaches combine structured codes and clinical text for richer representations