Electronic Health Records: Structure and Medical Coding
Overview
Electronic Health Records (EHRs) are digital repositories of patient healthcare information, capturing structured and unstructured data across multiple encounters. Understanding EHR structure and medical coding systems is fundamental to applying machine learning in healthcare.
EHR Data Components
An EHR contains multiple data types representing a patient’s healthcare journey:
Structured Data
- Demographics: Age, sex, race, ethnicity
- Diagnoses: ICD-10 codes (International Classification of Diseases)
- Procedures: CPT codes (Current Procedural Terminology)
- Medications: ATC codes (Anatomical Therapeutic Chemical)
- Lab results: Continuous values (glucose, troponin, lactate, etc.)
- Vital signs: Temperature, heart rate, blood pressure, respiratory rate
Unstructured Data
- Clinical notes: Free-text documentation from clinicians
- Imaging reports: Radiology findings and interpretations
- Discharge summaries: Comprehensive encounter documentation
Temporal Structure
EHR data is inherently temporal—events occur as sequences over time, forming patient trajectories. This temporal dimension is critical for predictive modeling.
Patient Trajectory Example
Patient ID: 12345
Timeline:
2023-01-15 14:30 | ED arrival, chief complaint: chest pain
2023-01-15 14:45 | Vital signs: HR 95, BP 140/90, RR 18
2023-01-15 15:00 | ECG performed (CPT: 93000)
2023-01-15 15:15 | Troponin ordered (CPT: 84484)
2023-01-15 16:00 | Diagnosis: I21.0 (STEMI - anterior wall)
2023-01-15 16:15 | Medication: C01AA05 (digoxin)
2023-01-15 16:30 | Admission to cardiac ICU
2023-01-20 10:00 | Discharge, outcome: recovered
Modeling these temporal sequences is the core task for healthcare prediction models like ETHOS and BEHRT.
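As a minimal sketch of how such a timeline could be turned into an ordered sequence, the snippet below sorts a few events and attaches the minutes elapsed since the first one. The event list and the `build_trajectory` helper are illustrative assumptions, not part of any particular model's pipeline.

```python
from datetime import datetime

# Hypothetical (timestamp, description) pairs mirroring the timeline above,
# deliberately listed out of order.
raw_events = [
    ("2023-01-15 15:00", "ECG performed"),
    ("2023-01-15 14:30", "ED arrival"),
    ("2023-01-15 16:00", "Diagnosis: I21.0"),
    ("2023-01-20 10:00", "Discharge"),
]


def build_trajectory(events):
    """Sort events chronologically and attach minutes since the first event."""
    parsed = [(datetime.strptime(ts, "%Y-%m-%d %H:%M"), desc) for ts, desc in events]
    parsed.sort(key=lambda item: item[0])
    start = parsed[0][0]
    return [(int((ts - start).total_seconds() // 60), desc) for ts, desc in parsed]


trajectory = build_trajectory(raw_events)
for minutes, desc in trajectory:
    print(f"+{minutes:>5} min | {desc}")
```

Relative offsets like these are one simple way to expose elapsed time to a sequence model; absolute timestamps or learned time embeddings are alternatives.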
Medical Coding Systems
ICD-10: International Classification of Diseases
Structure: Hierarchical classification of diagnoses
- Example: I21.0 = ST elevation myocardial infarction (anterior wall)
  - I = Diseases of the circulatory system (chapter)
  - I21 = Acute myocardial infarction (category)
  - I21.0 = Anterior wall STEMI (specific code)
- Volume: 70,000+ diagnosis codes
- Version: ICD-10-CM (Clinical Modification) used in U.S.
- Hierarchy: Enables multi-level granularity for ML models
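The multi-level granularity can be sketched with simple string prefixes. The `icd10_levels` helper below is a hypothetical illustration. It treats the first letter as a stand-in for the chapter, which is a simplification: official chapters are defined by code ranges, and some letters (such as D and H) are split across two chapters.

```python
def icd10_levels(code):
    """Return the leading letter, 3-character category, and full code."""
    normalized = code.strip().upper()
    compact = normalized.replace(".", "")
    if len(compact) < 3 or not compact[0].isalpha():
        raise ValueError(f"not an ICD-10-style code: {code!r}")
    return {
        "chapter_letter": compact[0],  # simplification: letter, not true chapter
        "category": compact[:3],       # e.g. I21 = acute myocardial infarction
        "full": normalized,
    }


levels = icd10_levels("I21.0")

# Rolling a visit's diagnoses up to category level shrinks the vocabulary.
categories = {icd10_levels(c)["category"] for c in ["I21.0", "I10", "E11.9"]}
print(levels, sorted(categories))
```

Rolling codes up to the category level trades specificity for a much smaller and denser feature space.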
ATC: Anatomical Therapeutic Chemical Classification
Structure: Hierarchical classification of medications
- Example: C01AA05 = Digoxin
  - C = Cardiovascular system (anatomical main group)
  - C01 = Cardiac therapy (therapeutic subgroup)
  - C01A = Cardiac glycosides (pharmacological subgroup)
  - C01AA = Digitalis glycosides (chemical subgroup)
  - C01AA05 = Digoxin (chemical substance)
- Volume: 6,000+ medication codes
- Use: Standardized drug classification for research and clinical systems
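Because ATC encodes its five levels as fixed-length prefixes (1, 3, 4, 5, and 7 characters), the hierarchy can be recovered by slicing the code. The helpers below are an illustrative sketch. `atc_levels` and `shared_depth` are hypothetical names, not functions from an existing library.

```python
# Prefix lengths of the five ATC levels, from anatomical group to substance.
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_levels(code):
    """Split a full 7-character ATC code into its five hierarchical prefixes."""
    code = code.strip().upper()
    if len(code) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {code!r}")
    return [code[:n] for n in ATC_LEVEL_LENGTHS]


def shared_depth(code_a, code_b):
    """Count how many leading ATC levels two drugs have in common."""
    depth = 0
    for a, b in zip(atc_levels(code_a), atc_levels(code_b)):
        if a != b:
            break
        depth += 1
    return depth


digoxin = atc_levels("C01AA05")
depth = shared_depth("C01AA05", "C07AB03")  # digoxin vs. atenolol
print(digoxin, depth)
```

A shared-depth score like this is one simple way to express that two drugs are pharmacologically close, for example when grouping rare medications.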
CPT: Current Procedural Terminology
Structure: Flat classification of medical procedures
- Example: 93000 = Electrocardiogram, routine ECG with interpretation
- Volume: 10,000+ procedure codes
- Maintained by: American Medical Association (AMA)
Example Data Representation
Structured representation of a patient visit:
visit = {
    'patient_id': 'P12345',
    'timestamp': '2023-01-15 14:30:00',
    'demographics': {
        'age': 65,
        'sex': 'M',
        'race': 'White'
    },
    'diagnoses': ['I21.0', 'I10', 'E11.9'],    # STEMI, hypertension, diabetes
    'procedures': ['93000', '84484'],          # ECG, troponin test
    'medications': ['C01AA05', 'C07AB03'],     # digoxin, atenolol
    'vitals': {
        'heart_rate': 95,
        'systolic_bp': 140,
        'diastolic_bp': 90,
        'temperature': 37.2                    # degrees Celsius
    },
    'chief_complaint': 'Chest pain radiating to left arm',
    'outcome': 'admitted'
}
MIMIC Database
The Medical Information Mart for Intensive Care (MIMIC) is the gold standard public EHR dataset for research.
MIMIC-III
- Admissions: 53,423 distinct hospital admissions of adult patients to critical care units
- Timespan: 2001-2012
- Data: Demographics, vitals, labs, medications, procedures, clinical notes
- Coding: ICD-9 diagnosis codes only
- Access: Free after completing CITI Data or Specimens Only Research course
MIMIC-IV (v3.1, October 2024)
- Patients: Over 65,000 ICU patients, over 200,000 ED patients
- Timespan: 2008-2022 (v3.0 added 2020-2022 data)
- Data Source: iMDsoft MetaVision clinical information system (more homogeneous than MIMIC-III)
- v3.1 Recent Improvements:
  - Standardized primary language data
  - Expanded insurance to 6 categories (Medicare, Medicaid, Private, Self-pay, No charge, Other)
  - Added provider/caregiver tracking (caregiver_id, provider table, order_provider_id)
- v3.0 Improvements:
  - Enhanced mortality data with state death records (+15,621 ICU patient deaths)
  - Schema simplification (core module removed)
- General Enhancements:
  - More precise digital information sources (electronic medicine administration record)
  - Better temporal information (diagnosis offset, charting offset times)
  - Both ICD-9 and ICD-10 diagnosis codes
  - 269,573 additional ED admission notes
  - Improved data quality and standardization
- Structure: Modular design (hosp, icu, ed modules) highlighting data provenance
- Access: PhysioNet credentialing; available on BigQuery (November 2024)
- Note: Neonatal data released separately
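Because MIMIC-IV mixes ICD-9 and ICD-10 diagnoses, a pipeline typically needs to keep the coding version attached to each code. The sketch below uses made-up rows and assumes they carry `hadm_id`, `icd_code`, and `icd_version` fields in the style of the hosp module's diagnosis table, with codes stored without the decimal point. Verify the column names against the release you download.

```python
from collections import defaultdict

# Made-up example rows; not real MIMIC-IV data.
rows = [
    {"hadm_id": 100, "icd_code": "41011", "icd_version": 9},
    {"hadm_id": 100, "icd_code": "4019", "icd_version": 9},
    {"hadm_id": 200, "icd_code": "I2109", "icd_version": 10},
    {"hadm_id": 200, "icd_code": "I10", "icd_version": 10},
]


def namespaced_code(row):
    """Prefix a code with its ICD version so the two vocabularies never mix."""
    return f"ICD{row['icd_version']}:{row['icd_code']}"


def codes_by_admission(rows):
    """Group version-prefixed diagnosis codes by hospital admission."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["hadm_id"]].append(namespaced_code(row))
    return dict(grouped)


admissions = codes_by_admission(rows)
print(admissions)
```

Prefixing is the simplest option. Mapping ICD-9 codes to ICD-10 with a crosswalk is an alternative when a single vocabulary is required.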
Learn more: MIMIC Database
Because it is public and well documented, MIMIC is a practical proxy for understanding the structure of large-scale EHR datasets such as the EmergAI dataset of 8M ED visits.
Challenges in EHR Data
1. Missing Data
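EHR data is always incomplete: not all tests are ordered for all patients. Missingness can be:
- Random: Unrelated to patient state, whether observed or unobserved (MCAR)
- Dependent on observed data: Explained by recorded variables such as age or triage level (MAR)
- Informative: Related to unrecorded patient state, for example a lab ordered only when the clinician suspects severe illness (MNAR)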
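One common way to let a model exploit informative missingness is to pair each lab value with a "was it measured" indicator. The snippet below is a minimal sketch. The lab names, the `encode_labs` helper, and the zero placeholder for unmeasured values are all illustrative assumptions.

```python
# Hypothetical fixed panel of labs the model expects as input.
LAB_NAMES = ("glucose", "troponin", "lactate")


def encode_labs(labs):
    """Return value and measured-indicator features for each expected lab."""
    features = {}
    for name in LAB_NAMES:
        value = labs.get(name)
        measured = value is not None
        features[f"{name}_measured"] = 1 if measured else 0
        # Placeholder for unmeasured labs; the indicator carries the signal.
        features[name] = float(value) if measured else 0.0
    return features


# Troponin was never ordered; lactate was ordered but has no result.
encoded = encode_labs({"glucose": 110, "lactate": None})
print(encoded)
```

The indicator columns let the model learn that the decision to order a test is itself a signal, rather than hiding that decision behind an imputed value.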
2. Temporal Sparsity
Events are irregularly spaced in time. Some patients have frequent encounters, others have gaps of months or years.
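One way to handle irregular spacing is to convert the gap between consecutive events into a discrete token. The sketch below assumes arbitrary bucket edges and token names chosen for illustration. Real models define their own intervals.

```python
from datetime import datetime

# Hypothetical bucket upper bounds in hours, with illustrative token names.
GAP_BUCKETS = [(1, "GAP_<1H"), (24, "GAP_1H-1D"), (24 * 30, "GAP_1D-30D")]
LONGEST_GAP = "GAP_>30D"


def gap_token(hours):
    """Map an elapsed time in hours to a coarse gap token."""
    for upper, label in GAP_BUCKETS:
        if hours < upper:
            return label
    return LONGEST_GAP


def gap_tokens(timestamps):
    """Return one gap token per pair of consecutive (sorted) timestamps."""
    times = sorted(datetime.fromisoformat(ts) for ts in timestamps)
    return [
        gap_token((later - earlier).total_seconds() / 3600)
        for earlier, later in zip(times, times[1:])
    ]


events = ["2023-01-15T14:30", "2023-01-15T15:00", "2023-01-20T10:00", "2023-09-01T09:00"]
tokens = gap_tokens(events)
print(tokens)
```

Interleaving tokens like these with event tokens lets a sequence model distinguish a patient seen twice in one hour from one who returns after several months.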
3. High Dimensionality
- Tens of thousands of possible diagnosis codes
- Thousands of procedure and medication codes
- Creates sparse, high-dimensional feature spaces
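The sparsity can be seen even in a toy multi-hot encoding. The visits and helper names below are made up for illustration. With a real vocabulary of tens of thousands of codes, each visit would activate only a tiny fraction of positions.

```python
# Toy visits, each a list of diagnosis codes (illustrative data).
visits = [
    ["I21.0", "I10", "E11.9"],
    ["I10"],
    ["E11.9", "J18.9"],
]

# Vocabulary: every distinct code gets a stable column index.
vocab = {code: idx for idx, code in enumerate(sorted({c for v in visits for c in v}))}


def multi_hot(codes, vocab):
    """Dense 0/1 vector with a 1 at each code's vocabulary index."""
    vector = [0] * len(vocab)
    for code in codes:
        vector[vocab[code]] = 1
    return vector


def sparse_indices(codes, vocab):
    """Sparse alternative: store only the indices of the active codes."""
    return sorted(vocab[code] for code in codes)


dense = [multi_hot(v, vocab) for v in visits]
sparse = [sparse_indices(v, vocab) for v in visits]
density = sum(map(sum, dense)) / (len(dense) * len(vocab))
print(dense, sparse, density)
```

Storing only active indices, or learning dense embeddings per code, are the usual ways to avoid materializing the full matrix.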
4. Hierarchical Relationships
Medical codes have inherent hierarchies (e.g., I21.0 is a subtype of I21). Models can leverage this structure.
5. Free Text Processing
Clinical notes require specialized NLP models trained on medical text (see ClinicalBERT).
Related Concepts
- EHR Tokenization and Clinical NLP: How to convert EHR events into token sequences for transformers
- Healthcare Foundation Models: Models like ETHOS and BEHRT that process EHR sequences
Learning Resources
Datasets
- MIMIC-III Clinical Database - Public ICU dataset (2001-2012)
- MIMIC-IV - Enhanced version with contemporary data, ED data, and ICD-9/10 codes
- eICU Collaborative Research Database - Multi-center ICU data
Papers
- MIMIC-III: A freely accessible critical care database (Johnson et al., 2016)
- MIMIC-IV, a freely accessible electronic health record dataset (Johnson et al., 2023) - Describes MIMIC-IV v2.2
- PhysioNet Release Notes - Latest version information (v3.1, October 2024)
- The MIMIC Code Repository - SQL queries and analysis scripts
Tools
- MIMIC-Extract - Preprocessing pipeline
- MIMIC-IV Data Loader - Loading utilities
Applications
- Mortality prediction: Predict ICU/hospital mortality risk
- Readmission prediction: Identify high-risk patients for 30-day readmission
- Length of stay: Estimate hospital/ICU duration
- Disease progression: Model temporal evolution of chronic conditions
- Adverse event detection: Identify complications and drug interactions
Key Takeaways
- EHRs contain structured (codes, labs, vitals) and unstructured (notes, reports) data
- Medical events form temporal sequences (patient trajectories) that must be modeled
- Coding systems standardize clinical events: ICD-10 and ATC are hierarchical, which ML models can exploit, while CPT is largely flat
- MIMIC database is the standard benchmark for EHR research
- EHR data presents unique challenges: missingness, sparsity, dimensionality, temporality