
Electronic Health Records: Structure and Medical Coding

Overview

Electronic Health Records (EHRs) are digital repositories of patient healthcare information, capturing structured and unstructured data across multiple encounters. Understanding EHR structure and medical coding systems is fundamental to applying machine learning in healthcare.

EHR Data Components

An EHR contains multiple data types representing a patient’s healthcare journey:

Structured Data

  • Demographics: Age, sex, race, ethnicity
  • Diagnoses: ICD-10 codes (International Classification of Diseases)
  • Procedures: CPT codes (Current Procedural Terminology)
  • Medications: ATC codes (Anatomical Therapeutic Chemical)
  • Lab results: Continuous values (glucose, troponin, lactate, etc.)
  • Vital signs: Temperature, heart rate, blood pressure, respiratory rate

Unstructured Data

  • Clinical notes: Free-text documentation from clinicians
  • Imaging reports: Radiology findings and interpretations
  • Discharge summaries: Comprehensive encounter documentation

Temporal Structure

EHR data is inherently temporal—events occur as sequences over time, forming patient trajectories. This temporal dimension is critical for predictive modeling.

Patient Trajectory Example

Patient ID: 12345
Timeline:
  2023-01-15 14:30 | ED arrival, chief complaint: chest pain
  2023-01-15 14:45 | Vital signs: HR 95, BP 140/90, RR 18
  2023-01-15 15:00 | ECG performed (CPT: 93000)
  2023-01-15 15:15 | Troponin ordered (CPT: 84484)
  2023-01-15 16:00 | Diagnosis: I21.0 (STEMI - anterior wall)
  2023-01-15 16:15 | Medication: C01AA05 (digoxin)
  2023-01-15 16:30 | Admission to cardiac ICU
  2023-01-20 10:00 | Discharge, outcome: recovered

Modeling these temporal sequences is the core task for healthcare prediction models like ETHOS and BEHRT.
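As a minimal sketch (not how ETHOS or BEHRT actually ingest data), a timeline like the one above can be held as a time-ordered list of events. The `Event` class, its field names, and `build_trajectory` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One timestamped entry in a patient trajectory (hypothetical schema)."""
    time: datetime
    kind: str   # e.g. "diagnosis", "procedure", "discharge"
    value: str  # a code or a short free-text description


def build_trajectory(rows):
    """Parse (timestamp, kind, value) tuples and return events in time order."""
    events = [
        Event(datetime.strptime(ts, "%Y-%m-%d %H:%M"), kind, value)
        for ts, kind, value in rows
    ]
    return sorted(events, key=lambda e: e.time)


# Rows are deliberately out of order to show that sorting restores the timeline.
rows = [
    ("2023-01-15 16:00", "diagnosis", "I21.0"),
    ("2023-01-15 14:30", "arrival", "chest pain"),
    ("2023-01-15 15:00", "procedure", "93000"),
    ("2023-01-20 10:00", "discharge", "recovered"),
]
trajectory = build_trajectory(rows)

# Minutes elapsed since the first event, a simple relative-time feature.
start = trajectory[0].time
minutes_since_start = [
    int((e.time - start).total_seconds() // 60) for e in trajectory
]
print(minutes_since_start)  # [0, 30, 90, 6930]
```

Storing absolute timestamps and deriving relative times on demand keeps the raw record intact while still giving a model the elapsed-time signal.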

Medical Coding Systems

ICD-10: International Classification of Diseases

Structure: Hierarchical classification of diagnoses

  • Example: I21.0 = ST elevation myocardial infarction (anterior wall)

    • I = Diseases of the circulatory system (chapter)
    • I21 = Acute myocardial infarction (category)
    • I21.0 = Anterior wall STEMI (specific code)
  • Volume: 70,000+ diagnosis codes in ICD-10-CM

  • Version: ICD-10-CM (Clinical Modification) is the variant used in the U.S.

  • Hierarchy: Enables multi-level granularity for ML models
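The multi-level granularity can be sketched with plain string prefixes. This is an illustrative helper, not part of any coding library. The function name `icd10_levels` is an assumption, and using the first letter as the chapter is only an approximation because chapters and letters do not map one-to-one.

```python
def icd10_levels(code: str) -> dict:
    """Split an ICD-10 code into coarse-to-fine levels by string prefix.

    Assumption: the first letter stands in for the chapter. Some chapters
    span several letters and some letters are shared between chapters, so
    a real pipeline would use an official chapter range table instead.
    """
    code = code.strip().upper()
    if len(code) < 3 or not code[0].isalpha():
        raise ValueError(f"not an ICD-10 code: {code!r}")
    return {
        "chapter_letter": code[0],  # e.g. "I"
        "category": code[:3],       # e.g. "I21"
        "full": code,               # e.g. "I21.0"
    }


levels = icd10_levels("I21.0")
print(levels)
# {'chapter_letter': 'I', 'category': 'I21', 'full': 'I21.0'}
```

A model can then be fed whichever level suits the task: full codes for fine-grained phenotyping, or categories when the data are too sparse to learn each code separately.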

ATC: Anatomical Therapeutic Chemical Classification

Structure: Hierarchical classification of medications

  • Example: C01AA05 = Digoxin

    • C = Cardiovascular system (anatomical main group)
    • C01 = Cardiac therapy (therapeutic subgroup)
    • C01A = Cardiac glycosides (pharmacological subgroup)
    • C01AA = Digitalis glycosides (chemical subgroup)
    • C01AA05 = Digoxin (chemical substance)
  • Volume: 6,000+ medication codes

  • Use: Standardized drug classification for research and clinical systems
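ATC codes have five levels, each a fixed-length prefix of the full seven-character code, so the hierarchy can be recovered by slicing. The helper below is a small sketch; the names `ATC_PREFIX_LENGTHS` and `atc_levels` are illustrative assumptions.

```python
from typing import List

# Character length of each ATC level, from anatomical group to substance.
ATC_PREFIX_LENGTHS = (1, 3, 4, 5, 7)


def atc_levels(code: str) -> List[str]:
    """Return the five hierarchical prefixes of a full (7-character) ATC code."""
    code = code.strip().upper()
    if len(code) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {code!r}")
    return [code[:n] for n in ATC_PREFIX_LENGTHS]


print(atc_levels("C01AA05"))
# ['C', 'C01', 'C01A', 'C01AA', 'C01AA05']
```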

CPT: Current Procedural Terminology

Structure: Flat classification of medical procedures

  • Example: 93000 = Electrocardiogram, routine ECG with interpretation
  • Volume: 10,000+ procedure codes
  • Maintained by: American Medical Association (AMA)

Example Data Representation

Structured representation of a patient visit:

visit = {
    'patient_id': 'P12345',
    'timestamp': '2023-01-15 14:30:00',
    'demographics': {'age': 65, 'sex': 'M', 'race': 'White'},
    'diagnoses': ['I21.0', 'I10', 'E11.9'],   # STEMI, hypertension, diabetes
    'procedures': ['93000', '84484'],         # ECG, troponin test
    'medications': ['C01AA05', 'C07AB03'],    # digoxin, atenolol
    'vitals': {
        'heart_rate': 95,
        'systolic_bp': 140,
        'diastolic_bp': 90,
        'temperature': 37.2,
    },
    'chief_complaint': 'Chest pain radiating to left arm',
    'outcome': 'admitted',
}
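One hedged way to prepare such a record for a sequence model is to flatten its coded fields into prefixed string tokens. The prefixes (`DX:`, `PX:`, `RX:`) and the helper name `visit_to_tokens` are illustrative assumptions, not a format used by any particular model.

```python
# Maps each coded field of a visit dict to a short token prefix (assumed names).
FIELD_PREFIXES = (
    ("diagnoses", "DX"),
    ("procedures", "PX"),
    ("medications", "RX"),
)


def visit_to_tokens(visit: dict) -> list:
    """Flatten the coded fields of one visit into a list of prefixed tokens."""
    tokens = []
    for field, prefix in FIELD_PREFIXES:
        # Missing fields are treated as empty rather than as errors.
        tokens.extend(f"{prefix}:{code}" for code in visit.get(field, []))
    return tokens


# A trimmed-down visit with only the coded fields.
example_visit = {
    "diagnoses": ["I21.0", "I10"],
    "procedures": ["93000"],
    "medications": ["C01AA05"],
}
tokens = visit_to_tokens(example_visit)
print(tokens)
# ['DX:I21.0', 'DX:I10', 'PX:93000', 'RX:C01AA05']
```

The prefixes keep the code systems apart, so a numeric CPT code can never collide with a token from another vocabulary.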

MIMIC Database

The Medical Information Mart for Intensive Care (MIMIC) is the most widely used publicly available EHR dataset for critical care research.

MIMIC-III

  • Admissions: 53,423 distinct hospital admissions of adult patients to critical care units
  • Timespan: 2001-2012
  • Data: Demographics, vitals, labs, medications, procedures, clinical notes
  • Coding: ICD-9 diagnosis codes (no ICD-10)
  • Access: Free via PhysioNet after credentialing, completing the CITI "Data or Specimens Only Research" course, and signing the data use agreement

MIMIC-IV (v3.1, October 2024)

  • Patients: Over 65,000 ICU patients, over 200,000 ED patients
  • Timespan: 2008-2022 (v3.0 added 2020-2022 data)
  • Data Source: iMDsoft MetaVision clinical information system (more homogeneous than MIMIC-III)
  • v3.1 Recent Improvements:
    • Standardized primary language data
    • Expanded insurance to 6 categories (Medicare, Medicaid, Private, Self-pay, No charge, Other)
    • Added provider/caregiver tracking (caregiver_id, provider table, order_provider_id)
  • v3.0 Improvements:
    • Enhanced mortality data with state death records (+15,621 ICU patient deaths)
    • Schema simplification (core module removed)
  • General Enhancements:
    • More precise digital information sources (electronic medicine administration record)
    • Better temporal information (diagnosis offset, charting offset times)
    • Both ICD-9 and ICD-10 diagnosis codes
    • 269,573 additional ED admission notes
    • Improved data quality and standardization
  • Structure: Modular design (hosp, icu, ed modules) highlighting data provenance
  • Access: PhysioNet credentialing; available on BigQuery (November 2024)
  • Note: Neonatal data released separately

Learn more: MIMIC Database 

MIMIC is a useful proxy for understanding the structure of large-scale EHR datasets such as the EmergAI dataset (8M ED visits).

Challenges in EHR Data

1. Missing Data

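EHR data is always incomplete because not every test is ordered for every patient. Missingness can be:

  • Random (MCAR): Unrelated to patient state, observed or unobserved
  • Informative (MAR, MNAR): Related to patient state, either through other recorded variables (MAR) or through the unrecorded value itself (MNAR). For example, lactate is typically measured only when a clinician already suspects severe illness.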
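Because the decision to order a test can itself carry signal, a common approach is to keep an explicit missingness flag next to any imputed value. The sketch below uses mean imputation purely as a placeholder; the function name `impute_with_indicators` and the `_missing` suffix are illustrative assumptions.

```python
from statistics import mean


def impute_with_indicators(rows, features):
    """Mean-impute missing values and add a 0/1 `<name>_missing` flag for each."""
    out = [dict(r) for r in rows]  # copy so the input rows are not modified
    for name in features:
        observed = [r[name] for r in rows if r.get(name) is not None]
        fill = mean(observed) if observed else 0.0
        for r in out:
            missing = r.get(name) is None
            r[f"{name}_missing"] = int(missing)
            if missing:
                r[name] = fill
    return out


rows = [
    {"glucose": 110.0, "lactate": None},  # lactate never ordered
    {"glucose": 90.0, "lactate": 4.0},
    {"glucose": None, "lactate": 2.0},    # glucose never ordered
]
imputed = impute_with_indicators(rows, ["glucose", "lactate"])
print(imputed[0])
# {'glucose': 110.0, 'lactate': 3.0, 'glucose_missing': 0, 'lactate_missing': 1}
```

The flag lets a downstream model distinguish "measured and normal" from "never measured", which plain imputation would erase.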

2. Temporal Sparsity

Events are irregularly spaced in time. Some patients have frequent encounters, others have gaps of months or years.
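A simple way to expose this irregularity to a model is to compute the gap between consecutive encounters and map it to a coarse bucket. This is a sketch only: the bucket boundaries and label names below are arbitrary assumptions, not values taken from any published model.

```python
from datetime import date


def gaps_in_days(encounter_dates):
    """Return the number of days between consecutive encounters."""
    ordered = sorted(encounter_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def bucket_gap(days):
    """Map a gap in days to a coarse label (cut points are assumptions)."""
    if days < 1:
        return "same_day"
    if days < 30:
        return "under_1_month"
    if days < 365:
        return "under_1_year"
    return "1_year_plus"


encounters = [date(2021, 3, 1), date(2021, 3, 4), date(2021, 9, 20), date(2023, 1, 15)]
gaps = gaps_in_days(encounters)
labels = [bucket_gap(g) for g in gaps]
print(gaps)    # [3, 200, 482]
print(labels)  # ['under_1_month', 'under_1_year', '1_year_plus']
```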

3. High Dimensionality

  • Tens of thousands of possible diagnosis codes
  • Thousands of procedure and medication codes
  • Creates sparse, high-dimensional feature spaces
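The sparsity is easy to see with a multi-hot encoding, where each patient becomes a 0/1 vector over the code vocabulary. The toy example below is a sketch with hypothetical helper names (`build_vocab`, `multi_hot`) and only four codes; with tens of thousands of codes the fraction of non-zero entries would be far smaller.

```python
def build_vocab(patients):
    """Assign a column index to every distinct code, in sorted order."""
    codes = sorted({c for patient_codes in patients.values() for c in patient_codes})
    return {code: i for i, code in enumerate(codes)}


def multi_hot(codes, vocab):
    """Encode one patient's codes as a 0/1 vector over the vocabulary."""
    vec = [0] * len(vocab)
    for code in codes:
        if code in vocab:  # codes outside the vocabulary are ignored
            vec[vocab[code]] = 1
    return vec


patients = {
    "P1": ["I21.0", "I10"],
    "P2": ["E11.9"],
    "P3": ["I10", "E11.9", "J18.9"],
}
vocab = build_vocab(patients)
matrix = [multi_hot(codes, vocab) for codes in patients.values()]

# Fraction of non-zero entries in the patient-by-code matrix.
density = sum(map(sum, matrix)) / (len(matrix) * len(vocab))
print(matrix[0], density)  # [0, 1, 1, 0] 0.5
```

At realistic vocabulary sizes a dense list per patient is wasteful, so storing only the indices of the non-zero entries (or using a sparse matrix type) is usually preferable.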

4. Hierarchical Relationships

Medical codes have inherent hierarchies (e.g., I21.0 is a subtype of I21). Models can leverage this structure.
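One hedged way to use the hierarchy is to add each code's parent category to a patient's feature set, so that related codes share at least one feature. The helper name `with_ancestors` is an assumption, and truncating to three characters is a simplification that applies to ICD-10 style codes only.

```python
def with_ancestors(codes):
    """Return the codes plus their 3-character ICD-10 categories."""
    expanded = set()
    for code in codes:
        expanded.add(code)
        expanded.add(code[:3])  # parent category, e.g. "I21" for "I21.0"
    return expanded


patient_a = with_ancestors(["I21.0"])
patient_b = with_ancestors(["I21.4"])

# The two specific codes differ, but both patients now share the parent.
shared = patient_a & patient_b
print(shared)  # {'I21'}
```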

5. Free Text Processing

Clinical notes require specialized NLP models trained on medical text (see ClinicalBERT).

Related Topics

  • EHR Tokenization and Clinical NLP: How to convert EHR events into token sequences for transformers
  • Healthcare Foundation Models: Models like ETHOS and BEHRT that process EHR sequences

Applications

  • Mortality prediction: Predict ICU/hospital mortality risk
  • Readmission prediction: Identify high-risk patients for 30-day readmission
  • Length of stay: Estimate hospital/ICU duration
  • Disease progression: Model temporal evolution of chronic conditions
  • Adverse event detection: Identify complications and drug interactions

Key Takeaways

  1. EHRs contain structured (codes, labs, vitals) and unstructured (notes, reports) data
  2. Medical events form temporal sequences (patient trajectories) that must be modeled
  3. Coding systems give ML models a standardized vocabulary: ICD-10 and ATC are hierarchical, while CPT is largely flat
  4. The MIMIC database is a standard benchmark for EHR research
  5. EHR data presents unique challenges: missingness, sparsity, dimensionality, temporality