Healthcare AI Datasets

This page catalogs publicly available datasets for healthcare AI research. These datasets enable reproducible research and provide benchmarks for comparing approaches across electronic health records (EHR), medical imaging, clinical NLP, and multimodal data.

Electronic Health Records (EHR)

MIMIC-III

Medical Information Mart for Intensive Care

  • Size: 40,000+ ICU patients, 53,000+ hospital admissions
  • Hospital: Beth Israel Deaconess Medical Center (Boston)
  • Years: 2001-2012
  • Data types: Demographics, vital signs, lab tests, medications, procedures, diagnoses, clinical notes
  • Access: Free after completing ethics course and signing data use agreement
  • Link: https://mimic.mit.edu/ 

Common tasks:

  • Mortality prediction
  • Length of stay prediction
  • Readmission prediction
  • Phenotype classification
  • Clinical note processing
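
As a rough illustration of how labels for the first two tasks are often derived, the sketch below computes length of stay and an in-hospital mortality flag from rows shaped like the MIMIC-III ADMISSIONS table. The rows are synthetic, the column names follow the lowercase form used in database builds, and the 3-day "long stay" threshold is an assumption chosen only for the example.

```python
from datetime import datetime

# Synthetic rows shaped like the MIMIC-III ADMISSIONS table (values are made up).
# MIMIC dates are shifted into the future as part of de-identification.
admissions = [
    {"subject_id": 1, "hadm_id": 100, "admittime": "2130-01-01 08:00:00",
     "dischtime": "2130-01-04 20:00:00", "hospital_expire_flag": 0},
    {"subject_id": 2, "hadm_id": 101, "admittime": "2145-06-10 23:30:00",
     "dischtime": "2145-06-11 11:30:00", "hospital_expire_flag": 1},
]

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def length_of_stay_days(row):
    """Hospital length of stay in fractional days."""
    admit = datetime.strptime(row["admittime"], TIME_FORMAT)
    discharge = datetime.strptime(row["dischtime"], TIME_FORMAT)
    return (discharge - admit).total_seconds() / 86400.0


def build_labels(rows, long_stay_threshold_days=3.0):
    """Return one label dict per admission (the threshold is illustrative)."""
    labels = []
    for row in rows:
        los = length_of_stay_days(row)
        labels.append({
            "hadm_id": row["hadm_id"],
            "los_days": los,
            "long_stay": int(los > long_stay_threshold_days),
            "in_hospital_mortality": int(row["hospital_expire_flag"]),
        })
    return labels


labels = build_labels(admissions)
for label in labels:
    print(label)
```

Published benchmarks define these tasks with their own cohort filters and prediction windows, so labels built this way are a starting point rather than a drop-in replacement for a benchmark protocol.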

Code libraries:

MIMIC-IV

Updated version of MIMIC with contemporary data and improvements (current version: v3.1, October 2024)

  • Size: Over 65,000 ICU patients, over 200,000 Emergency Department patients
  • Hospital: Same as MIMIC-III (Beth Israel Deaconess Medical Center)
  • Years: 2008-2022 (v3.0 added 2020-2022 data in July 2024)
  • Data Source: iMDsoft MetaVision clinical information system (more comprehensive and homogeneous than MIMIC-III)
  • v3.1 Improvements (October 2024):
    • Fixed itemid values in d_labitems and labevents tables
    • Standardized primary language data (previously "?")
    • Expanded insurance categories to 6: Medicare, Medicaid, Private, Self-pay, No charge, Other
    • Added provider/caregiver tracking: caregiver_id, provider table, order_provider_id columns
    • Data cleaning: removed 23,093 hadm_id and 3,762 stay_id
  • v3.0 Improvements (July 2024):
    • Extended timespan to include 2020-2022 patient stays
    • Enhanced mortality data: dod (date of death) now includes out-of-hospital mortality from state death records (+15,621 ICU patient death records)
    • Schema simplification: core module removed, admissions/patients/transfers moved to hosp module
  • General Improvements over MIMIC-III:
    • More precise digital information (electronic medicine administration record)
    • Better temporal information (diagnosis offset, charting offset times)
    • Both ICD-9 and ICD-10 diagnosis codes (vs. ICD-9 only in MIMIC-III)
    • Modular data organization (hosp, icu, ed modules) highlighting data provenance
    • 269,573 additional ED admission notes not in MIMIC-III
  • Note: Neonatal data released separately
  • Access: PhysioNet credentialing required; also available on BigQuery (as of November 25, 2024) for large-scale analysis without local infrastructure
  • Link: https://mimic.mit.edu/ 
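
Because MIMIC-IV mixes ICD-9 and ICD-10 codes, it usually helps to carry the code version alongside the code itself. The sketch below groups diagnoses per admission from rows shaped like the hosp module's diagnoses_icd table; the rows are synthetic, and the "ICD9:"/"ICD10:" prefix format is a hypothetical convention chosen for this example.

```python
from collections import defaultdict

# Synthetic rows shaped like hosp.diagnoses_icd (values are made up).
diagnoses = [
    {"hadm_id": 1, "seq_num": 2, "icd_code": "25000", "icd_version": 9},
    {"hadm_id": 1, "seq_num": 1, "icd_code": "4280", "icd_version": 9},
    {"hadm_id": 2, "seq_num": 1, "icd_code": "I509", "icd_version": 10},
]


def codes_by_admission(rows):
    """Group codes per admission in seq_num order, tagging each with its ICD version."""
    grouped = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r["hadm_id"], r["seq_num"])):
        grouped[row["hadm_id"]].append(f"ICD{row['icd_version']}:{row['icd_code']}")
    return dict(grouped)


codes = codes_by_admission(diagnoses)
print(codes)
```

Keeping the version in the token avoids treating an ICD-9 code and an ICD-10 code as the same label when building a vocabulary for phenotyping or ICD coding models.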

eICU Collaborative Research Database

Multi-center ICU database

  • Size: 200,000+ ICU stays
  • Hospitals: 200+ hospitals across US
  • Years: 2014-2015
  • Advantage: Geographic diversity, multiple EHR systems
  • Data: Vital signs, labs, medications, diagnoses, notes (limited)
  • Access: Same as MIMIC (PhysioNet credentialing)
  • Link: https://eicu-crd.mit.edu/ 

Use cases:

  • Multi-site validation
  • Generalization across hospitals
  • Hospital-level variation analysis

CPRD (UK)

Clinical Practice Research Datalink

  • Size: 60+ million patients
  • Setting: Primary care (general practitioners)
  • Location: United Kingdom
  • Data: Longitudinal patient records, prescriptions, referrals, lab results
  • Access: Requires protocol approval and fees
  • Link: https://cprd.com/ 

Advantages:

  • Very large population
  • Longitudinal (follow patients over decades)
  • Representative of UK population

Medical Imaging

ChestX-ray14

Large-scale chest X-ray dataset

  • Size: 112,120 frontal chest X-rays, 30,805 patients
  • Labels: 14 disease classes (pneumonia, mass, effusion, infiltration, nodule, etc.)
  • Source: NIH Clinical Center
  • Label quality: Extracted from radiology reports (noisy labels)
  • Access: Free download
  • Link: https://nihcc.app.box.com/v/ChestXray-NIHCC 

Tasks:

  • Multi-label disease classification
  • Disease localization (bounding boxes available for subset)
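
For multi-label classification, the pipe-separated findings string in the ChestX-ray14 metadata is typically turned into a 14-dimensional multi-hot vector. The following is a minimal sketch of that encoding; the class order is an arbitrary choice for the example, and the input strings are made up.

```python
# The 14 ChestX-ray14 disease classes; the order here is an arbitrary choice.
CLASSES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]


def encode_findings(finding_labels):
    """Convert a string such as 'Mass|Nodule' into a multi-hot list.

    'No Finding' matches none of the classes and yields an all-zero vector.
    """
    present = set(finding_labels.split("|"))
    return [int(name in present) for name in CLASSES]


example = encode_findings("Cardiomegaly|Effusion")
healthy = encode_findings("No Finding")
print(example)
print(healthy)
```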

CheXpert

Stanford chest X-ray dataset with uncertainty labels

  • Size: 224,316 chest X-rays, 65,240 patients
  • Labels: 14 observations with uncertainty (positive, negative, uncertain)
  • Source: Stanford Hospital
  • Years: 2002-2017
  • Advantage: Uncertainty labels model real-world ambiguity, higher quality than ChestX-ray14
  • Access: Free after registration
  • Link: https://stanfordmlgroup.github.io/competitions/chexpert/ 

Tasks:

  • Robust learning with label uncertainty
  • Comparison of uncertainty handling methods
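
A simple way to compare uncertainty handling methods is to map the uncertain label to a fixed target or to mask it out of the loss. The sketch below assumes the usual CheXpert encoding (1.0 positive, 0.0 negative, -1.0 uncertain, blank for unmentioned) and implements three basic policies; the function name is hypothetical, and treating unmentioned observations as negative is an assumption of this example.

```python
def apply_uncertainty_policy(label, policy):
    """Map one raw label to (target, use_in_loss).

    label:  1.0, 0.0, -1.0 (uncertain), or None (observation not mentioned)
    policy: "zeros", "ones", or "ignore"
    """
    if label is None:
        # Assumption for this sketch: unmentioned observations count as negative.
        return 0.0, True
    if label != -1.0:
        return float(label), True
    if policy == "zeros":
        return 0.0, True
    if policy == "ones":
        return 1.0, True
    if policy == "ignore":
        # The target is a placeholder; the False flag excludes it from the loss.
        return 0.0, False
    raise ValueError(f"unknown policy: {policy}")


raw_labels = [1.0, -1.0, 0.0, None]
for policy in ("zeros", "ones", "ignore"):
    print(policy, [apply_uncertainty_policy(x, policy) for x in raw_labels])
```

Which policy works best can differ per observation, so the choice is commonly made separately for each label on a validation set.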

RSNA Challenges

Radiological Society of North America competitions

Annual competitions with curated medical imaging datasets:

The Cancer Imaging Archive (TCIA)

Multi-modal cancer imaging

  • Size: 60+ million images
  • Types: CT, MRI, PET, microscopy
  • Cancers: Lung, breast, brain, prostate, and more
  • Access: Free, some datasets require application
  • Link: https://www.cancerimagingarchive.net/ 

Notable collections:

  • LIDC-IDRI: Lung nodules with radiologist annotations
  • TCGA: Cancer genomics + imaging
  • QIN: Quantitative imaging for cancer

ISIC (Skin Lesions)

International Skin Imaging Collaboration

  • Size: 50,000+ dermoscopic images
  • Labels: Melanoma, benign nevi, and other skin lesions
  • Metadata: Age, sex, anatomical site
  • Challenges: Annual challenges for melanoma detection
  • Access: Free download
  • Link: https://www.isic-archive.com/ 

Clinical NLP

i2b2/n2c2 Challenges

Informatics for Integrating Biology and the Bedside / National NLP Clinical Challenges

Shared tasks with annotated clinical notes:

  • 2006: De-identification
  • 2008: Obesity and comorbidities
  • 2009: Medication extraction
  • 2010: Concept and relation extraction
  • 2012: Temporal relations
  • 2014: De-identification and heart disease risk factors
  • 2018: Adverse drug events
  • 2019: Clinical concept normalization
  • 2022: Contextualized medication event extraction

  • Access: Register for data use agreement
  • Link: https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/

MIMIC-III/IV Clinical Notes

Free-text clinical documentation

  • MIMIC-III: 2 million notes (discharge summaries, nursing notes, radiology reports, etc.)
  • MIMIC-IV: Includes 269,573 additional ED admission notes beyond MIMIC-III
  • Types: 15+ note categories
  • Source: Part of MIMIC-III/IV databases
  • Tasks: Summarization, ICD coding, named entity recognition
  • Access: Via MIMIC-III/IV PhysioNet access

Clinical Case Reports

Published medical case studies

  • Size: Thousands of published case reports
  • Source: Medical journals (BMJ Case Reports, Journal of Medical Case Reports)
  • Format: Structured narratives with diagnosis, treatment, outcome
  • Access: Many are open access
  • Use cases: Training clinical QA systems, rare disease recognition

Multimodal Datasets

MIMIC-CXR

Chest X-rays + radiology reports

  • Size: 377,110 images, 227,835 reports
  • Patients: 65,379 patients
  • Source: MIMIC database
  • Pairing: Images linked to reports (vision-language)
  • Tasks: Report generation, image-text retrieval, visual QA
  • Access: Via MIMIC credentialing
  • Link: https://physionet.org/content/mimic-cxr/ 

PadChest

Chest X-rays + reports (Spanish)

  • Size: 160,000+ images, 109,000 reports
  • Source: Hospital Universitario de San Juan, Spain
  • Language: Spanish reports
  • Labels: Extracted from reports + manual validation
  • Access: Free for research
  • Link: http://bimcv.cipf.es/bimcv-projects/padchest/ 

OpenI

Chest X-rays + reports (Indiana University)

  • Size: 7,470 images, 3,955 reports
  • Source: Indiana University
  • Quality: High-quality manual annotations
  • Tasks: Report generation benchmark
  • Access: Free download
  • Link: https://openi.nlm.nih.gov/ 

Genetics and Molecular

UK Biobank

Large-scale biomedical database

  • Size: 500,000 participants
  • Data: Genomics, imaging (MRI, retinal), EHR, lifestyle
  • Follow-up: Longitudinal with ongoing tracking
  • Access: Requires application and fees
  • Link: https://www.ukbiobank.ac.uk/ 

GTEx

Genotype-Tissue Expression

  • Size: 17,000 samples, 54 tissue types
  • Data: Gene expression, genetic variants
  • Use: Understand genetic regulation, gene-disease links
  • Access: Free via dbGaP
  • Link: https://gtexportal.org/ 

TCGA

The Cancer Genome Atlas

  • Size: 20,000 samples, 33 cancer types
  • Data: Genomics, transcriptomics, epigenomics, imaging
  • Use: Cancer research, multi-omics integration
  • Access: Free via GDC Data Portal
  • Link: https://www.cancer.gov/tcga 

Time-Series Physiological Data

PhysioNet

Repository of physiological signals

  • Size: 100+ databases
  • Types: ECG, EEG, blood pressure, respiratory
  • Challenges: Annual challenges (e.g., AF detection, sepsis prediction)
  • Access: Most are freely available
  • Link: https://physionet.org/ 

Notable datasets:

  • MIT-BIH Arrhythmia Database: ECG arrhythmia annotations
  • MIMIC-III/IV Waveforms: High-frequency physiological waveforms from ICU
  • Computing in Cardiology Challenges: Annual competitions

PTB-XL

Large ECG dataset

  • Size: 21,837 ECGs, 18,885 patients
  • Source: Physikalisch-Technische Bundesanstalt (Germany)
  • Labels: Cardiologist annotations, diagnostic statements
  • Tasks: Arrhythmia classification, multi-label diagnosis
  • Access: Free download
  • Link: https://physionet.org/content/ptb-xl/ 
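
For multi-label diagnosis, the annotations in the PTB-XL metadata are stored as a dictionary-like string per record (the scp_codes column), which needs parsing before use. The sketch below parses such strings and builds multi-hot targets; the two example strings are made up, and building the vocabulary from the data itself is a simplification for illustration.

```python
import ast

# Made-up examples in the style of the scp_codes column (statement -> likelihood).
raw_scp_codes = [
    "{'NORM': 100.0, 'SR': 0.0}",
    "{'IMI': 50.0, 'AFIB': 0.0}",
]


def parse_scp_codes(cell):
    """Safely parse one scp_codes cell into a {statement: likelihood} dict."""
    parsed = ast.literal_eval(cell)
    return {str(code): float(likelihood) for code, likelihood in parsed.items()}


records = [parse_scp_codes(cell) for cell in raw_scp_codes]

# Build a label vocabulary from the data, then one multi-hot row per record.
vocabulary = sorted({code for record in records for code in record})
multi_hot = [[int(code in record) for code in vocabulary] for record in records]

print(vocabulary)
print(multi_hot)
```

ast.literal_eval only evaluates Python literals, which makes it a safer choice than eval for parsing strings read from a file.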

Specialized Datasets

Diabetic Retinopathy

  • Kaggle DR Detection: 35,000 retinal images (2015 competition)
  • Messidor-2: 1,748 images with DR grading
  • EyePACS: 10,000+ images
  • Use: Screening for diabetic eye disease

COVID-19

  • COVID-19 Image Data Collection: 1000+ chest X-rays/CTs
  • CORD-19: 1M+ scientific papers on COVID-19
  • COVIDx: 30,000+ chest X-rays
  • Note: Many datasets were rapidly released during pandemic

Alzheimer’s Disease

  • ADNI: Alzheimer’s Disease Neuroimaging Initiative

Swedish Healthcare Data (EmergAI Project)

Akademiska Sjukhuset Emergency Department

  • Size: 8 million ED visits
  • Hospital: Uppsala University Hospital
  • Data: Structured EHR (diagnoses, procedures, triage, outcomes)
  • Access: Restricted (research collaboration required)

Symptoms.se

  • Size: ~2,000 symptom reports
  • Data: Patient-drawn 3D body sketches + text descriptions
  • Unique: Patient-reported symptoms with spatial information
  • Access: Via collaboration with Uppsala University

Accessing Datasets

PhysioNet Credentialing

Required for MIMIC, eICU, and other sensitive datasets:

  1. Complete CITI “Data or Specimens Only Research” course (https://about.citiprogram.org/)
  2. Create PhysioNet account
  3. Upload course completion certificate
  4. Sign data use agreement for specific dataset
  5. Wait for approval (~1-2 weeks)
  6. Download data

Institutional Review Board (IRB)

Some datasets require:

  • IRB approval from your institution
  • Protocol describing research use
  • Evidence of data security measures

Data Use Agreements

Common restrictions:

  • ✅ No re-identification attempts
  • ✅ No redistribution of data
  • ✅ Only specified research uses
  • ✅ Acknowledge data source in publications
  • ✅ Report security breaches

Preprocessing Pipelines

MIMIC-Extract

Extract and preprocess MIMIC data:

    # Illustrative invocation; check the repository README for the actual script name and flags
    git clone https://github.com/MLforHealth/MIMIC_Extract
    cd MIMIC_Extract
    python extract.py --task mortality --timewindow 24h

Outputs:

  • Feature matrices
  • Labels (mortality, readmission, etc.)
  • Train/val/test splits

MedCAT

Medical Concept Annotation Tool for clinical text:

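    from medcat.cat import CAT

    # Load a pre-trained medical concept annotator ("model_pack" is a placeholder path)
    cat = CAT.load_model_pack("model_pack")

    # Extract medical entities
    text = "Patient has diabetes and hypertension"
    result = cat.get_entities(text)

    # get_entities returns a dict; each detected entity carries its concept ID (CUI)
    for entity in result["entities"].values():
        print(entity["source_value"], entity["cui"])
    # With a SNOMED CT model pack, expect concepts such as
    # diabetes (73211009) and hypertension (38341003)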

Benchmarks and Leaderboards

Papers with Code - Medical

Track state-of-the-art:

Grand Challenges

Medical imaging competitions:


Best Practices

Data Splits

  • Temporal split: Train on past, test on future (avoids leaking future information into training)
  • Patient-level split: Same patient never in both train and test (prevents leakage through repeated visits)
  • Site-level split: For multi-site data, hold out entire sites for testing (tests generalization)
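
The three split strategies above can be sketched in a few lines. This is a minimal illustration on synthetic visit records; the field names (patient_id, site, year), the hash-based assignment, and the salt value are all assumptions made for the example, not a prescribed protocol.

```python
import hashlib


def patient_bucket(patient_id, salt="split-v1"):
    """Deterministically map a patient ID to a number in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 16 ** 8


def patient_level_split(rows, test_fraction=0.2):
    """All rows of a given patient land on the same side of the split."""
    train, test = [], []
    for row in rows:
        in_test = patient_bucket(row["patient_id"]) < test_fraction
        (test if in_test else train).append(row)
    return train, test


def temporal_split(rows, cutoff_year):
    """Train on records before the cutoff, test on records from the cutoff onward."""
    train = [row for row in rows if row["year"] < cutoff_year]
    test = [row for row in rows if row["year"] >= cutoff_year]
    return train, test


def site_level_split(rows, held_out_sites):
    """Hold out entire sites to test generalization across hospitals."""
    train = [row for row in rows if row["site"] not in held_out_sites]
    test = [row for row in rows if row["site"] in held_out_sites]
    return train, test


# Synthetic visits: 50 patients, 2 visits each, spread over 3 sites and 5 years.
visits = [
    {"patient_id": p, "site": "ABC"[p % 3], "year": 2010 + (p + v) % 5}
    for p in range(50) for v in range(2)
]

train_p, test_p = patient_level_split(visits, test_fraction=0.2)
train_t, test_t = temporal_split(visits, cutoff_year=2013)
train_s, test_s = site_level_split(visits, held_out_sites={"C"})
print(len(train_p), len(test_p), len(train_t), len(test_t), len(train_s), len(test_s))
```

Hashing the patient ID keeps the assignment stable when new records are added later, at the cost of the test fraction being only approximately met.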

Handling Imbalance

  • Many healthcare datasets are highly imbalanced
  • Disease prevalence often 1-5%
  • Use appropriate metrics (AUPRC, not accuracy)
  • Consider resampling or weighted loss functions
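
A small worked example shows why accuracy misleads at low prevalence. The sketch below uses synthetic labels at 2% prevalence, computes accuracy for a classifier that always predicts "negative", and computes AUPRC (as average precision) for a hypothetical model's scores. The scores are made up, and the function assumes there are no tied scores.

```python
def average_precision(y_true, scores):
    """AUPRC as the mean of precision values at the rank of each positive.

    Assumes no tied scores; ties would need a defined ordering.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# 100 synthetic cases with 2 positives (2% prevalence).
y_true = [1, 1] + [0] * 98
# Hypothetical model scores: the positives end up ranked 2nd and 5th.
scores = [0.9, 0.6] + [0.95, 0.8, 0.7] + [0.5 - 0.001 * k for k in range(95)]

# A classifier that always predicts "negative" looks excellent on accuracy.
always_negative = [0] * len(y_true)
accuracy = sum(int(p == t) for p, t in zip(always_negative, y_true)) / len(y_true)

ap = average_precision(y_true, scores)

# A common weighting for the positive class in a weighted loss: n_negative / n_positive.
n_pos = sum(y_true)
pos_weight = (len(y_true) - n_pos) / n_pos

print(f"accuracy of always-negative baseline: {accuracy:.2f}")
print(f"average precision of the model: {ap:.2f}")
print(f"positive class weight: {pos_weight:.1f}")
```

A random ranking would score an average precision near the prevalence (about 0.02 here), which gives AUPRC a meaningful baseline that accuracy lacks.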

Missing Data

  • EHR data has systematic missingness (not MCAR)
  • Lab tests ordered based on clinical suspicion
  • Handle explicitly, don’t drop missing values
  • Model missingness as a feature
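
One common way to handle missingness explicitly is to keep the imputed value and add features that record whether and how recently a measurement was taken. The sketch below does this for a single lab series with forward filling; the function name and the example lactate values are made up for illustration.

```python
def add_missingness_features(series):
    """Forward-fill a time-ordered series and describe its missingness.

    series: list of values ordered in time, with None where nothing was measured.
    Returns (filled, measured, steps_since_measurement), one entry per time step.
    """
    filled, measured, steps_since = [], [], []
    last_value, gap = None, None
    for value in series:
        if value is not None:
            last_value, gap = value, 0
            measured.append(1)
        else:
            # Before the first measurement there is nothing to carry forward.
            gap = None if gap is None else gap + 1
            measured.append(0)
        filled.append(last_value)
        steps_since.append(gap)
    return filled, measured, steps_since


lactate = [None, 2.1, None, None, 4.0]
filled, measured, steps_since = add_missingness_features(lactate)
print(filled)
print(measured)
print(steps_since)
```

Entries before the first measurement stay as None, so a downstream step still has to decide how to fill them, for example with a value estimated from the training set only.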

Privacy

  • Even “de-identified” data can be re-identified
  • Use secure computing environment
  • Don’t share preprocessed data without permission
  • Follow data use agreements strictly

Success Criteria

You’re ready to work with healthcare datasets when you can:

  • ✅ Navigate PhysioNet credentialing process
  • ✅ Download and preprocess MIMIC-III/IV data
  • ✅ Handle missing values and temporal dependencies
  • ✅ Create appropriate train/val/test splits (temporal, patient-level)
  • ✅ Use domain-appropriate evaluation metrics (AUPRC, calibration)
  • ✅ Follow data use agreements and privacy requirements
  • ✅ Cite datasets properly in publications


Next Steps

  1. Identify dataset(s) relevant to your research
  2. Complete credentialing if needed (start early, takes time)
  3. Download and explore data
  4. Use existing preprocessing pipelines when available (MIMIC-Extract, MedCAT)
  5. Follow benchmark protocols for fair comparison
  6. Properly acknowledge data sources in publications

The following datasets are not medical, but they demonstrate analogous transfer learning applications in other domains that require expert-level visual diagnosis. They face similar challenges: limited data, class imbalance, and the need for interpretability.

Environmental Monitoring

  • Planet Amazon Rainforest: Multi-label satellite imagery for deforestation tracking and land use classification
    • Domain shift from natural images to satellite perspective
    • Multi-label classification (multiple conditions per image)
    • Similar to screening multiple pathologies in a single medical image

Industrial Inspection

  • Severstal Steel Defect Detection: Industrial quality control dataset
    • Extreme class imbalance (rare defects, like rare diseases)
    • High precision requirements for deployment
    • Transfer learning from ImageNet to industrial inspection domain

Agricultural Pathology

  • Plant Disease Recognition: 87,000 labeled images of healthy and diseased crop leaves across 38 classes
    • Visual diagnosis of pathology (analogous to medical diagnosis)
    • Variation in imaging conditions (like medical imaging protocols)
    • Requires expert-level interpretation

Why these matter for healthcare AI: These datasets demonstrate that the same transfer learning principles, data augmentation strategies, and interpretability techniques used in medical imaging apply broadly to any domain requiring visual expert diagnosis.