Medical Event Tokenization and Clinical NLP

Overview

To apply transformer models to electronic health records, we must convert structured medical events (diagnosis codes, procedures, medications) and unstructured clinical text into token sequences. This process requires specialized tokenization strategies and domain-specific language models.

Medical Event Tokenization

Challenge

Transform heterogeneous EHR data into sequences suitable for transformer processing, analogous to tokenizing words in natural language.

Approach 1: Code-Level Tokenization

Strategy: Treat each medical code as an atomic token (vocabulary item).

# Build vocabulary from medical codes
vocab = {
    '[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3,
    # ICD-10 diagnosis codes
    'I21.0': 4,    # STEMI
    'I10': 5,      # Hypertension
    'E11.9': 6,    # Diabetes Type 2
    # CPT procedure codes
    '93000': 7,    # ECG
    '82550': 8,    # Troponin test
    # ATC medication codes
    'C01AA05': 9,  # Digoxin
    'C07AB03': 10, # Atenolol
    # ... 50,000+ total codes
}

# Patient trajectory as token sequence
trajectory = [
    vocab['[CLS]'],
    vocab['I21.0'],    # Diagnosis: STEMI
    vocab['93000'],    # Procedure: ECG
    vocab['82550'],    # Lab: Troponin
    vocab['C01AA05'],  # Medication: Digoxin
    vocab['[SEP]'],
]

Advantages:

  • Simple and interpretable
  • Direct correspondence to medical events
  • Compatible with standard transformer architectures

Disadvantages:

  • Large vocabulary size (50,000+ codes)
  • No handling of rare/unseen codes (see the sketch after this list)
  • Ignores hierarchical structure
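
The rare-code problem is easy to demonstrate. A minimal sketch (the '[UNK]' fallback token and the encode_trajectory helper are illustrative additions, not part of the vocabulary above):

# Hypothetical fallback: map unseen codes to a dedicated [UNK] token
UNK_ID = 11  # assumes '[UNK]' was added to the vocabulary above

def encode_trajectory(codes, vocab):
    """Map medical codes to token IDs; all unseen codes collapse to [UNK]."""
    return [vocab.get(code, UNK_ID) for code in codes]

# 'I21.09' (another acute MI subtype) was never observed in training,
# so it becomes [UNK] even though it is semantically close to I21.0
ids = encode_trajectory(['I21.0', 'I21.09'], vocab)
# → [4, 11]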

Approach 2: Hierarchical Tokenization

Strategy: Decompose codes into hierarchical components using medical ontology structure.

# Hierarchical representation of an ICD-10 code
def hierarchical_tokenize(icd10_code):
    """
    I21.0 → ['I', 'I21', 'I21.0']
    Circulatory → Acute MI → Anterior STEMI
    """
    tokens = []
    if len(icd10_code) >= 1:
        tokens.append(icd10_code[0])   # Chapter, e.g. 'I' (circulatory system)
    if len(icd10_code) >= 3:
        tokens.append(icd10_code[:3])  # Category, e.g. 'I21' (acute MI)
    if len(icd10_code) >= 4:
        tokens.append(icd10_code)      # Specific code, e.g. 'I21.0'
    return tokens

# Example
tokens = hierarchical_tokenize('I21.0')
# Returns: ['I', 'I21', 'I21.0']

Advantages:

  • Captures semantic relationships (I21.0 is a type of I21)
  • Enables generalization across hierarchy levels
  • Useful for rare code handling (see the embedding sketch below)

Disadvantages:

  • Increases sequence length (3x per code)
  • More complex modeling
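
In practice the generalization benefit comes from sharing parameters across hierarchy levels, e.g. embedding a code as the sum of its level embeddings so a rare code still receives signal from its chapter and category. A minimal sketch (the class name, the level_vocab_sizes argument, and the summation scheme are assumptions for illustration, not from a specific paper):

import torch
import torch.nn as nn

class HierarchicalCodeEmbedding(nn.Module):
    """Embed a code as the sum of its chapter, category, and full-code embeddings."""
    def __init__(self, level_vocab_sizes, d_model=768):
        super().__init__()
        # One embedding table per hierarchy level, e.g. [chapter, category, specific]
        self.levels = nn.ModuleList(
            nn.Embedding(size, d_model) for size in level_vocab_sizes
        )

    def forward(self, level_ids):
        # level_ids: (batch, seq_len, 3) IDs for [chapter, category, specific].
        # A rare code with a poorly-trained 'specific' embedding still gets
        # useful signal from its chapter and category embeddings.
        return sum(emb(level_ids[..., i]) for i, emb in enumerate(self.levels))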

Approach 3: Byte-Pair Encoding (BPE)

Strategy: Learn subword units from code sequences, similar to GPT tokenization.

# Train BPE on medical codes
from tokenizers import Tokenizer, models, trainers

# Initialize BPE tokenizer
tokenizer = Tokenizer(models.BPE())

# Train on a code corpus; ehr_code_sequences is an iterable of
# code strings, one per patient trajectory
trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=['[PAD]', '[CLS]', '[SEP]', '[MASK]'],
)
tokenizer.train_from_iterator(ehr_code_sequences, trainer)

# Tokenize a rare code
rare_code = 'I21.09'  # Not in training set
tokens = tokenizer.encode(rare_code).tokens
# May decompose as: ['I21', '.', '09']

Advantages:

  • Handles out-of-vocabulary codes
  • Learns common code patterns
  • Reduces vocabulary size

Disadvantages:

  • Less interpretable decompositions
  • May not respect medical semantics

Temporal Encoding

Time is critical in healthcare—the sequence and spacing of events matter for prediction.

Time Since Admission

import torch
import math

def temporal_positional_encoding(times, d_model=768):
    """
    Encode time deltas using sinusoidal positional encoding.

    Args:
        times: (seq_len,) tensor of days since admission
        d_model: embedding dimension
    Returns:
        (seq_len, d_model) temporal embeddings
    """
    seq_len = len(times)
    pe = torch.zeros(seq_len, d_model)
    position = times.unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Example usage
event_times = torch.tensor([0.0, 0.5, 1.0, 3.0, 7.0])  # Days since admission
temporal_emb = temporal_positional_encoding(event_times)

Learned Temporal Embeddings

import torch
import torch.nn as nn

class LearnedTemporalEmbedding(nn.Module):
    def __init__(self, max_days=365, d_model=768):
        super().__init__()
        # Discretize time into integer-day bins
        self.max_days = max_days
        self.time_embedding = nn.Embedding(max_days, d_model)

    def forward(self, event_times):
        # Convert float days to integer bins, clamped to the embedding range
        time_bins = torch.clamp(event_times.long(), 0, self.max_days - 1)
        return self.time_embedding(time_bins)
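
Either scheme produces embeddings of the same shape, so they are interchangeable in the model: the sinusoidal version handles arbitrary real-valued deltas with no parameters, while the learned version requires binning but can adapt to the data. A quick usage comparison (assumes the function and class defined above):

# Compare the two schemes on the same timestamps
event_times = torch.tensor([0.0, 0.5, 1.0, 3.0, 7.0])  # days since admission

sinusoidal_emb = temporal_positional_encoding(event_times)     # (5, 768), fixed
learned_emb = LearnedTemporalEmbedding(365, 768)(event_times)  # (5, 768), trainable

# Either is added element-wise to event embeddings before the transformer:
# input_emb = event_emb + temporal_emb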

Clinical Natural Language Processing

The Challenge

Clinical notes contain:

  • Medical jargon: “diaphoresis”, “dyspnea”, “tachycardia”
  • Abbreviations: “MI”, “CHF”, “SOB”, “CP”
  • Negation: “No chest pain”, “denies dyspnea” (see the rule-based sketch below)
  • Temporal expressions: “2 days ago”, “ongoing since morning”
  • Misspellings: Common in rapid clinical documentation

Standard language models (BERT, GPT) trained on Wikipedia/books perform poorly on clinical text.
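
Negation in particular is often handled with rule-based systems such as NegEx, either before or alongside neural models. A heavily simplified sketch of the idea (the trigger list and token window are illustrative; the real NegEx algorithm is considerably more elaborate):

import re

# Tiny subset of NegEx-style negation triggers
NEGATION_TRIGGERS = re.compile(r'\b(no|denies|without|negative for)\b', re.IGNORECASE)

def is_negated(sentence: str, concept: str, window: int = 5) -> bool:
    """Return True if a negation trigger appears within `window` tokens before the concept."""
    tokens = sentence.lower().split()
    concept_tokens = concept.lower().split()
    for i in range(len(tokens) - len(concept_tokens) + 1):
        if tokens[i:i + len(concept_tokens)] == concept_tokens:
            preceding = ' '.join(tokens[max(0, i - window):i])
            return bool(NEGATION_TRIGGERS.search(preceding))
    return False

is_negated("Patient denies chest pain or dyspnea", "chest pain")  # → True
is_negated("Patient reports chest pain", "chest pain")            # → False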

ClinicalBERT

BERT pre-trained on clinical notes from the MIMIC-III database. Newer variants can also be trained on MIMIC-IV, which includes 269,573 additional ED admission notes.

from transformers import AutoTokenizer, AutoModel

# Load ClinicalBERT
model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode clinical text
text = ("Patient presents with acute chest pain radiating to left arm, "
        "diaphoresis, and dyspnea. Troponin elevated.")
inputs = tokenizer(text, return_tensors="pt", padding=True,
                   truncation=True, max_length=512)
outputs = model(**inputs)

# Extract text embedding (CLS token)
text_embedding = outputs.last_hidden_state[:, 0, :]  # (1, 768)

Paper: Publicly Available Clinical BERT Embeddings (Alsentzer et al., 2019)

ClinicalBERT vs General BERT

| Aspect | General BERT | ClinicalBERT |
|---|---|---|
| Pre-training Data | Wikipedia, BooksCorpus | MIMIC-III clinical notes (2M notes) |
| Vocabulary | General English words | Medical terminology, abbreviations |
| Medical NER | Poor (40-50% F1) | Excellent (85-90% F1) |
| Clinical Tasks | Low performance | State-of-the-art |
| Use Case | General NLP | Healthcare-specific NLP |

Medical Named Entity Recognition (NER)

Extract medical entities from text:

from transformers import pipeline

# Load medical NER pipeline
ner = pipeline("ner", model="alvaroalon2/biobert_diseases_ner",
               aggregation_strategy="simple")

# Extract entities
text = "Patient diagnosed with acute myocardial infarction and Type 2 diabetes mellitus"
entities = ner(text)

# Output:
# [
#   {'entity_group': 'Disease', 'word': 'acute myocardial infarction', 'score': 0.99},
#   {'entity_group': 'Disease', 'word': 'Type 2 diabetes mellitus', 'score': 0.98}
# ]

Multimodal EHR Representation

Combining structured codes and clinical text:

import torch
import torch.nn as nn
from transformers import AutoModel

class MultimodalEHREncoder(nn.Module):
    def __init__(self, vocab_size=50000, d_model=768):
        super().__init__()
        # Structured event embeddings
        self.event_embedding = nn.Embedding(vocab_size, d_model)
        self.temporal_embedding = LearnedTemporalEmbedding(max_days=365, d_model=d_model)
        # Clinical text encoder (ClinicalBERT)
        self.text_encoder = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
        # Transformer encoder for the combined sequence (batch-first tensors)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True),
            num_layers=6,
        )

    def forward(self, event_ids, event_times, clinical_notes):
        # Embed structured events
        event_emb = self.event_embedding(event_ids)          # (batch, seq_len, d_model)
        temporal_emb = self.temporal_embedding(event_times)  # (batch, seq_len, d_model)
        structured_emb = event_emb + temporal_emb

        # Encode clinical text; clinical_notes is a tokenizer output dict
        text_outputs = self.text_encoder(**clinical_notes)
        text_emb = text_outputs.last_hidden_state[:, 0:1, :]  # (batch, 1, d_model) CLS

        # Prepend the text embedding to the structured event sequence
        combined_emb = torch.cat([text_emb, structured_emb], dim=1)  # (batch, seq_len+1, d_model)

        # Encode with transformer
        encoded = self.transformer(combined_emb)
        return encoded
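
A usage sketch for the encoder above (dummy inputs; the tokenizer output is unpacked as the text encoder's keyword arguments):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
encoder = MultimodalEHREncoder()

event_ids = torch.randint(0, 50000, (2, 10))  # 2 patients, 10 events each
event_times = torch.rand(2, 10) * 30          # event times in days
notes = tokenizer(["Chest pain, troponin elevated.",
                   "Routine follow-up, no complaints."],
                  return_tensors="pt", padding=True, truncation=True)

encoded = encoder(event_ids, event_times, notes)  # (2, 11, 768)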

Related Concepts

  • EHR Structure and Medical Coding: Foundation for understanding what’s being tokenized
  • Healthcare Foundation Models: Models like ETHOS and BEHRT that use these tokenization strategies
  • BPE Tokenization: General tokenization technique applied to medical codes
  • Transformer Architecture: The model architecture these tokens feed into

Applications

  • Clinical note classification: Identify diagnosis from admission notes (see the sketch after this list)
  • Adverse event detection: Extract drug reactions from clinical text
  • Symptom extraction: Parse patient-reported symptoms
  • Medical question answering: Answer clinical queries using EHR data
  • Disease progression modeling: Combine codes and notes for temporal prediction
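
As a sketch of the first application, ClinicalBERT can be fine-tuned for note classification with a standard sequence-classification head (num_labels and the example labels are assumptions; the classification head is randomly initialized until fine-tuned, and the training loop is omitted):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT", num_labels=5  # e.g. 5 diagnosis categories
)
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

inputs = tokenizer("Admission note: acute chest pain, elevated troponin.",
                   return_tensors="pt", truncation=True, max_length=512)
logits = model(**inputs).logits  # (1, 5) scores over diagnosis categories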

Key Takeaways

  1. Code-level tokenization treats each medical code as a token (simple, interpretable)
  2. Hierarchical tokenization leverages ICD-10/ATC structure for generalization
  3. BPE tokenization handles rare codes through subword decomposition
  4. Temporal encoding is critical—use positional encoding for time deltas
  5. ClinicalBERT outperforms general BERT on medical text by a wide margin (roughly 40 F1 points on medical NER)
  6. Multimodal approaches combine structured codes and clinical text for richer representations