Medical Event Tokenization and Clinical NLP
Overview
To apply transformer models to electronic health records, we must convert structured medical events (diagnosis codes, procedures, medications) and unstructured clinical text into token sequences. This process requires specialized tokenization strategies and domain-specific language models.
Medical Event Tokenization
Challenge
Transform heterogeneous EHR data into sequences suitable for transformer processing, analogous to tokenizing words in natural language.
Approach 1: Code-Level Tokenization
Strategy: Treat each medical code as an atomic token (vocabulary item).
```python
# Build vocabulary from medical codes
vocab = {
    '[PAD]': 0,
    '[CLS]': 1,
    '[SEP]': 2,
    '[MASK]': 3,
    # ICD-10 diagnosis codes
    'I21.0': 4,     # STEMI
    'I10': 5,       # Hypertension
    'E11.9': 6,     # Diabetes Type 2
    # CPT procedure codes
    '93000': 7,     # ECG
    '84484': 8,     # Troponin test (quantitative)
    # ATC medication codes
    'C01AA05': 9,   # Digoxin
    'C07AB03': 10,  # Atenolol
    # ... 50,000+ total codes
}

# Patient trajectory as token sequence
trajectory = [
    vocab['[CLS]'],
    vocab['I21.0'],    # Diagnosis: STEMI
    vocab['93000'],    # Procedure: ECG
    vocab['84484'],    # Lab: Troponin
    vocab['C01AA05'],  # Medication: Digoxin
    vocab['[SEP]']
]
```

Advantages:
- Simple and interpretable
- Direct correspondence to medical events
- Compatible with standard transformer architectures
Disadvantages:
- Large vocabulary size (50,000+ codes)
- No handling of rare/unseen codes (a fallback sketch follows this list)
- Ignores hierarchical structure
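To make the rare-code limitation concrete, here is a minimal sketch of the usual fallback under the toy vocabulary above: any code missing from the vocabulary collapses to a reserved `[UNK]` token and its clinical meaning is lost. The `[UNK]` entry and the `encode_trajectory` helper are illustrative additions, not part of a standard API.

```python
# Illustrative fallback: unseen codes collapse to a reserved [UNK] token.
vocab['[UNK]'] = len(vocab)  # assumed extra entry, not in the vocab above

def encode_trajectory(codes, vocab):
    """Map medical codes to token IDs, falling back to [UNK] for unseen codes."""
    return [vocab.get(code, vocab['[UNK]']) for code in codes]

# 'I21.4' is absent from the toy vocabulary, so its specific meaning is lost.
ids = encode_trajectory(['[CLS]', 'I21.0', 'I21.4', '[SEP]'], vocab)
```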
Approach 2: Hierarchical Tokenization
Strategy: Decompose codes into hierarchical components using medical ontology structure.
```python
# Hierarchical representation of an ICD-10 code
def hierarchical_tokenize(icd10_code):
    """
    I21.0 → ['I', 'I21', 'I21.0']
    Circulatory → Acute MI → Anterior STEMI
    """
    tokens = []
    if len(icd10_code) >= 1:
        tokens.append(icd10_code[0])    # chapter, e.g. 'I'
    if len(icd10_code) >= 3:
        tokens.append(icd10_code[:3])   # category, e.g. 'I21'
    if len(icd10_code) >= 4:
        tokens.append(icd10_code)       # full code, e.g. 'I21.0'
    return tokens

# Example
tokens = hierarchical_tokenize('I21.0')
# Returns: ['I', 'I21', 'I21.0']
```

Advantages:
- Captures semantic relationships (I21.0 is a type of I21)
- Enables generalization across hierarchy levels
- Useful for rare code handling (see the back-off sketch below)
Disadvantages:
- Increases sequence length (3x per code)
- More complex modeling
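To illustrate the rare-code advantage noted above, the sketch below backs an unseen full code off to its most specific known ancestor in the hierarchy. `hierarchical_backoff` and the `known_tokens` set are illustrative names layered on the `hierarchical_tokenize` function above, not a standard library call.

```python
# Back off an unseen ICD-10 code to coarser hierarchy levels.
# known_tokens is assumed to hold every token seen during training.
known_tokens = {'I', 'I21', 'I21.0', 'I10', 'E11', 'E11.9'}

def hierarchical_backoff(icd10_code, known_tokens, unk='[UNK]'):
    """Return the most specific known ancestor of a code, else [UNK]."""
    for token in reversed(hierarchical_tokenize(icd10_code)):
        if token in known_tokens:
            return token
    return unk

# 'I21.9' was never seen, but its category 'I21' (acute MI) was,
# so the model still receives a meaningful signal instead of [UNK].
print(hierarchical_backoff('I21.9', known_tokens))  # -> 'I21'
```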
Approach 3: Byte-Pair Encoding (BPE)
Strategy: Learn subword units from code sequences, similar to GPT tokenization.
```python
# Train BPE on medical codes
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Initialize BPE tokenizer; split trajectories on whitespace so each
# code string (e.g. "I21.0 93000 84484") becomes a unit for BPE merges
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

# Train on code corpus (ehr_code_sequences: an iterable of
# whitespace-separated code strings, one per patient or visit)
trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=['[PAD]', '[CLS]', '[SEP]', '[MASK]']
)
tokenizer.train_from_iterator(ehr_code_sequences, trainer)

# Tokenize rare code
rare_code = 'I21.09'  # Not in training set
tokens = tokenizer.encode(rare_code).tokens
# May decompose as: ['I21', '.', '09']
```

Advantages:
- Handles out-of-vocabulary codes
- Learns common code patterns
- Reduces vocabulary size
Disadvantages:
- Less interpretable decompositions
- May not respect medical semantics
Temporal Encoding
Time is critical in healthcare—the sequence and spacing of events matter for prediction.
Time Since Admission
```python
import torch
import math

def temporal_positional_encoding(times, d_model=768):
    """
    Encode time deltas using sinusoidal positional encoding

    Args:
        times: (seq_len,) tensor of days since admission
        d_model: embedding dimension
    Returns:
        (seq_len, d_model) temporal embeddings
    """
    seq_len = len(times)
    pe = torch.zeros(seq_len, d_model)
    position = times.unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Example usage
event_times = torch.tensor([0.0, 0.5, 1.0, 3.0, 7.0])  # Days since admission
temporal_emb = temporal_positional_encoding(event_times)
```

Learned Temporal Embeddings
```python
import torch
import torch.nn as nn

class LearnedTemporalEmbedding(nn.Module):
    def __init__(self, max_days=365, d_model=768):
        super().__init__()
        # Discretize time into bins (one embedding per day)
        self.max_days = max_days
        self.time_embedding = nn.Embedding(max_days, d_model)

    def forward(self, event_times):
        # Convert float days to integer bins, clipped to the embedding range
        time_bins = torch.clamp(event_times.long(), 0, self.max_days - 1)
        return self.time_embedding(time_bins)
```

Clinical Natural Language Processing
The Challenge
Clinical notes contain:
- Medical jargon: “diaphoresis”, “dyspnea”, “tachycardia”
- Abbreviations: “MI”, “CHF”, “SOB”, “CP”
- Negation: “No chest pain”, “denies dyspnea”
- Temporal expressions: “2 days ago”, “ongoing since morning”
- Misspellings: Common in rapid clinical documentation
Standard language models (BERT, GPT) trained on Wikipedia/books perform poorly on clinical text.
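To see the vocabulary mismatch directly, the short sketch below tokenizes a jargon-heavy note fragment with a general-purpose BERT tokenizer. The example sentence is made up, and the exact subword splits depend on the tokenizer version, so treat the output as illustrative rather than a benchmark.

```python
from transformers import AutoTokenizer

# General-purpose WordPiece vocabulary (not trained on clinical text)
general_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Pt c/o SOB and diaphoresis, denies CP."

# Clinical jargon and abbreviations fall outside the general vocabulary,
# so they tend to be shredded into many short subword pieces.
print(general_tok.tokenize(text))
```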
ClinicalBERT
BERT pre-trained on clinical notes from the MIMIC-III database. Newer variants can also be trained on MIMIC-IV, which includes 269,573 additional ED admission notes.
```python
from transformers import AutoTokenizer, AutoModel

# Load ClinicalBERT
model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode clinical text
text = "Patient presents with acute chest pain radiating to left arm, diaphoresis, and dyspnea. Troponin elevated."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
outputs = model(**inputs)

# Extract text embedding (CLS token)
text_embedding = outputs.last_hidden_state[:, 0, :]  # (1, 768)
```

Paper: Publicly Available Clinical BERT Embeddings (Alsentzer et al., 2019)
ClinicalBERT vs General BERT
| Aspect | General BERT | ClinicalBERT |
|---|---|---|
| Pre-training Data | Wikipedia, BooksCorpus | MIMIC-III clinical notes (2M notes) |
| Vocabulary | General English words | Medical terminology, abbreviations |
| Medical NER | Poor (40-50% F1) | Excellent (85-90% F1) |
| Clinical Tasks | Low performance | State-of-the-art |
| Use Case | General NLP | Healthcare-specific NLP |
Medical Named Entity Recognition (NER)
Extract medical entities from text:
```python
from transformers import pipeline

# Load medical NER pipeline
ner = pipeline("ner", model="alvaroalon2/biobert_diseases_ner", aggregation_strategy="simple")

# Extract entities
text = "Patient diagnosed with acute myocardial infarction and Type 2 diabetes mellitus"
entities = ner(text)
# Output:
# [
#   {'entity_group': 'Disease', 'word': 'acute myocardial infarction', 'score': 0.99},
#   {'entity_group': 'Disease', 'word': 'Type 2 diabetes mellitus', 'score': 0.98}
# ]
```

Multimodal EHR Representation
Combining structured codes and clinical text:
```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MultimodalEHREncoder(nn.Module):
    def __init__(self, vocab_size=50000, d_model=768):
        super().__init__()
        # Structured event embeddings
        self.event_embedding = nn.Embedding(vocab_size, d_model)
        self.temporal_embedding = LearnedTemporalEmbedding(max_days=365, d_model=d_model)
        # Clinical text encoder (ClinicalBERT)
        self.text_encoder = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
        # Transformer encoder over the combined sequence (batch-first tensors)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True),
            num_layers=6
        )

    def forward(self, event_ids, event_times, clinical_notes):
        # Embed structured events
        event_emb = self.event_embedding(event_ids)           # (batch, seq_len, d_model)
        temporal_emb = self.temporal_embedding(event_times)   # (batch, seq_len, d_model)
        structured_emb = event_emb + temporal_emb

        # Encode clinical text (clinical_notes: dict of tensors from the ClinicalBERT tokenizer)
        text_outputs = self.text_encoder(**clinical_notes)
        text_emb = text_outputs.last_hidden_state[:, 0:1, :]  # (batch, 1, d_model) CLS embedding

        # Concatenate text and structured event sequences
        combined_emb = torch.cat([text_emb, structured_emb], dim=1)  # (batch, seq_len+1, d_model)

        # Encode with transformer
        encoded = self.transformer(combined_emb)
        return encoded
```

Related Concepts
- EHR Structure and Medical Coding: Foundation for understanding what’s being tokenized
- Healthcare Foundation Models: Models like ETHOS and BEHRT that use these tokenization strategies
- BPE Tokenization: General tokenization technique applied to medical codes
- Transformer Architecture: The model architecture these tokens feed into
Learning Resources
Papers
- Publicly Available Clinical BERT Embeddings (Alsentzer et al., 2019)
- ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission (Huang et al., 2019)
- BioBERT: A pre-trained biomedical language representation model (Lee et al., 2019)
- Med-BERT: Pre-trained Contextualized Embeddings on Large-Scale Structured EHRs for Disease Prediction (Rasmy et al., 2020)
Models
- emilyalsentzer/Bio_ClinicalBERT - Clinical BERT on HuggingFace
- dmis-lab/biobert-v1.1 - BioBERT for biomedical text
- alvaroalon2/biobert_diseases_ner - Medical NER model
Tools
- MIMIC-IV Code Mappings - Standard code vocabularies
- Hugging Face Medical Models - Collection of clinical NLP models
Applications
- Clinical note classification: Identify diagnosis from admission notes (see the sketch after this list)
- Adverse event detection: Extract drug reactions from clinical text
- Symptom extraction: Parse patient-reported symptoms
- Medical question answering: Answer clinical queries using EHR data
- Disease progression modeling: Combine codes and notes for temporal prediction
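As a minimal sketch of the first application (clinical note classification), the snippet below puts a linear classification head on ClinicalBERT's CLS embedding. `NoteClassifier`, the three-way label set, and the example note are illustrative assumptions rather than a prescribed pipeline.

```python
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class NoteClassifier(nn.Module):
    """Illustrative diagnosis classifier: ClinicalBERT CLS embedding + linear head."""
    def __init__(self, num_diagnoses, model_name="emilyalsentzer/Bio_ClinicalBERT"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_diagnoses)

    def forward(self, **inputs):
        cls_emb = self.encoder(**inputs).last_hidden_state[:, 0, :]  # (batch, hidden)
        return self.head(cls_emb)                                    # (batch, num_diagnoses) logits

# Usage (hypothetical 3-way diagnosis label set)
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = NoteClassifier(num_diagnoses=3)
inputs = tokenizer("Acute chest pain radiating to left arm, troponin elevated.",
                   return_tensors="pt", truncation=True, max_length=512)
logits = model(**inputs)
```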
Key Takeaways
- Code-level tokenization treats each medical code as a token (simple, interpretable)
- Hierarchical tokenization leverages ICD-10/ATC structure for generalization
- BPE tokenization handles rare codes through subword decomposition
- Temporal encoding is critical—use positional encoding for time deltas
- ClinicalBERT substantially outperforms general BERT on clinical text (on the order of 40 F1 points on medical NER in the comparison above)
- Multimodal approaches combine structured codes and clinical text for richer representations