
Experimental Design for Machine Learning

Rigorous experimental design is crucial for credible research. This guide covers baselines, ablation studies, data splits, and statistical testing for ML experiments.

Baselines

Always compare your method against strong baselines. You need multiple comparison points:

Essential Baselines

  1. Strongest existing method: State-of-the-art on your problem (e.g., ETHOS for EHR modeling)
  2. Ablated versions: Your model without each component
  3. Simple baseline: Logistic regression or similar on basic features
  4. Human performance: Expert judgment (if measurable)

Example Experimental Setup

# Define all experiments upfront
experiments = {
    'baseline_simple': LogisticRegression(
        features=['age', 'sex', 'chief_complaint']
    ),
    'baseline_sota': ETHOS(
        structured_ehr_only=True
    ),
    'ablation_no_sketch': MultimodalModel(
        use_ehr=True, use_text=True, use_sketch=False
    ),
    'ablation_no_text': MultimodalModel(
        use_ehr=True, use_text=False, use_sketch=True
    ),
    'full_model': MultimodalModel(
        use_ehr=True, use_text=True, use_sketch=True
    ),
}

Ablation Studies

Systematically remove components to understand their individual contributions.
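One convenient pattern is to drive the ablations off the experiments dictionary defined above, training every variant under identical conditions. The sketch below is illustrative only: train_on and evaluate_on are hypothetical helpers standing in for your own training and evaluation code, and train_data / test_data are splits like those defined in the Data Splits section below.

# Sketch: train every variant in `experiments` on the same split and
# evaluate it with the same metrics (train_on / evaluate_on are hypothetical)
results = {}
for name, model in experiments.items():
    trained = train_on(model, train_data)
    results[name] = evaluate_on(trained, test_data)   # e.g. {'accuracy': ..., 'auroc': ...}

for name, metrics in sorted(results.items(), key=lambda kv: -kv[1]['auroc']):
    print(f"{name}: accuracy={metrics['accuracy']:.2f}  auroc={metrics['auroc']:.2f}")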

Example Ablation Results

Model Variant      | EHR | Text | Sketch | Accuracy | AUROC
Full model         | ✓   | ✓    | ✓      | 0.89     | 0.91
No sketch          | ✓   | ✓    | ✗      | 0.87     | 0.89
No text            | ✓   | ✗    | ✓      | 0.85     | 0.87
EHR only (ETHOS)   | ✓   | ✗    | ✗      | 0.83     | 0.85
Text + sketch only | ✗   | ✓    | ✓      | 0.78     | 0.80

Interpreting Ablations

From this table, you can conclude:

  • Text contributes about 4 percentage points of accuracy (0.89 → 0.85 when removed from the full model)
  • Sketch contributes about 2 percentage points (0.89 → 0.87 when removed)
  • EHR is the strongest single signal (0.89 → 0.78 without it)
  • Multimodal fusion adds 6 percentage points over EHR alone (0.83 → 0.89)
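These contributions are simple differences between rows of the table, so keeping ablation results in a dictionary lets you compute them directly. A small sketch using the illustrative accuracy values above:

# Accuracy values from the ablation table above (illustrative)
acc = {
    'full': 0.89,
    'no_sketch': 0.87,
    'no_text': 0.85,
    'ehr_only': 0.83,
    'text_sketch_only': 0.78,
}

print(f"Text contribution:   {acc['full'] - acc['no_text']:+.2f}")
print(f"Sketch contribution: {acc['full'] - acc['no_sketch']:+.2f}")
print(f"EHR contribution:    {acc['full'] - acc['text_sketch_only']:+.2f}")
print(f"Fusion vs EHR alone: {acc['full'] - acc['ehr_only']:+.2f}")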

Data Splits

Proper data splitting is critical to avoid overfitting and data leakage.

Standard Split

# Simple 70/15/15 split
train_data = data[:int(0.7 * len(data))]                      # 70%
val_data = data[int(0.7 * len(data)):int(0.85 * len(data))]   # 15%
test_data = data[int(0.85 * len(data)):]                      # 15%

# Temporal split avoids data leakage
train_data = data[data.date < '2024-01-01']
val_data = data[(data.date >= '2024-01-01') & (data.date < '2024-07-01')]
test_data = data[data.date >= '2024-07-01']

Why temporal splits matter: For sequential data like EHR, random splits can leak future information into training. Patient events are correlated over time, and the data distribution shifts between periods, so temporal splits give a more realistic estimate of deployment performance.

Cross-Validation

For robust evaluation, use k-fold cross-validation:

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(data)):
    train_data = data[train_idx]   # index rows by position (use .iloc for DataFrames)
    val_data = data[val_idx]

    model = train_model(train_data)
    score = evaluate_model(model, val_data)
    scores.append(score)
    print(f"Fold {fold+1}: {score:.4f}")

print(f"Mean: {np.mean(scores):.4f} ± {np.std(scores):.4f}")

Statistical Significance

Don’t just report point estimates - show that improvements are statistically significant.

Paired t-test

from scipy import stats

# Compare two models using 5-fold CV
model_a_scores = [0.87, 0.88, 0.86, 0.89, 0.87]
model_b_scores = [0.91, 0.90, 0.92, 0.89, 0.91]

# Paired t-test
t_statistic, p_value = stats.ttest_rel(model_a_scores, model_b_scores)

print(f"t-statistic: {t_statistic:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("✓ Improvement is statistically significant")
else:
    print("✗ Improvement is not statistically significant")

Confidence Intervals

Always report confidence intervals, not just point estimates:

import numpy as np
from scipy import stats

def confidence_interval(scores, confidence=0.95):
    n = len(scores)
    mean = np.mean(scores)
    std_err = stats.sem(scores)
    margin = std_err * stats.t.ppf((1 + confidence) / 2, n - 1)
    return mean, mean - margin, mean + margin

mean, lower, upper = confidence_interval(model_a_scores)
print(f"Accuracy: {mean:.3f} (95% CI: [{lower:.3f}, {upper:.3f}])")

Evaluation Metrics

Choose metrics appropriate for your problem:

Metric               | When to Use
Accuracy             | Balanced classes
AUROC                | Imbalanced classes (common in healthcare)
AUPRC                | Rare events (e.g., mortality)
Sensitivity (Recall) | When false negatives are costly
Specificity          | When false positives are costly
F1 Score             | Balance of precision and recall
Calibration          | When probability estimates must be accurate

Multiple Metrics

Report multiple metrics to give a complete picture:

from sklearn.metrics import (
    accuracy_score,
    roc_auc_score,
    average_precision_score,
    f1_score,
)

metrics = {
    'accuracy': accuracy_score(y_true, y_pred),
    'auroc': roc_auc_score(y_true, y_prob),
    'auprc': average_precision_score(y_true, y_prob),
    'f1': f1_score(y_true, y_pred),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
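The metric table above also lists calibration, which the snippet does not cover. A minimal sketch for reporting it alongside the discrimination metrics, using scikit-learn's calibration_curve and brier_score_loss with the same y_true / y_prob arrays assumed above:

import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Brier score: mean squared error of the predicted probabilities (lower is better)
brier = brier_score_loss(y_true, y_prob)

# Reliability curve: observed event rate vs. mean predicted probability per bin
frac_positives, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
calibration_gap = np.abs(frac_positives - mean_predicted).mean()

print(f"Brier score: {brier:.4f}")
print(f"Mean calibration gap across bins: {calibration_gap:.4f}")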

Common Mistakes to Avoid

Critical Mistakes

  1. Testing on tuning data: Never test on the validation set used for hyperparameter tuning
  2. No temporal splits: Random splits cause data leakage in time-series data
  3. Cherry-picking seeds: Don’t report the best random seed - run multiple seeds and report the spread (see the sketch after this list)
  4. Missing confidence intervals: Always report uncertainty
  5. Weak baselines: Compare against the strongest existing methods
  6. Data leakage: Ensure strict separation between train/val/test
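For mistake 3, a simple pattern is to repeat training with several seeds and report the spread rather than the single best run. A minimal sketch, reusing the hypothetical train_model / evaluate_model helpers from elsewhere in this guide and assuming train_model accepts a seed argument:

import numpy as np

# Train with several seeds and report mean ± std, never the best single run
seeds = [0, 1, 2, 3, 4]
scores = []
for seed in seeds:
    model = train_model(train_data, seed=seed)   # the seed argument is an assumption
    scores.append(evaluate_model(model, val_data))

print(f"Score over {len(seeds)} seeds: {np.mean(scores):.4f} ± {np.std(scores):.4f}")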

Example: Avoiding Data Leakage

Wrong (leakage):

# Normalize on entire dataset
scaler.fit(all_data)
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)

Correct (no leakage):

# Fit scaler only on training data
scaler.fit(train_data)
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)

Hyperparameter Tuning

Use the validation set for tuning, test set only for final evaluation:

# Grid search on validation set
learning_rates = [1e-5, 1e-4, 1e-3]
batch_sizes = [16, 32, 64]

best_val_score = 0
best_params = {}

for lr in learning_rates:
    for bs in batch_sizes:
        model = train_model(train_data, lr=lr, batch_size=bs)
        val_score = evaluate_model(model, val_data)
        if val_score > best_val_score:
            best_val_score = val_score
            best_params = {'lr': lr, 'batch_size': bs}

# Final evaluation on test set with best params
final_model = train_model(train_data, **best_params)
test_score = evaluate_model(final_model, test_data)

print(f"Best params: {best_params}")
print(f"Val score: {best_val_score:.4f}")
print(f"Test score: {test_score:.4f}")

Reproducibility Checklist

To ensure reproducibility:

  • Set random seeds (np.random.seed(42), torch.manual_seed(42)) - see the sketch after this checklist
  • Document Python/PyTorch versions
  • Log all hyperparameters
  • Save model checkpoints
  • Record hardware details (GPU type, RAM)
  • Version control your code
  • Document data preprocessing steps
  • Provide training logs
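The first few checklist items can be handled by a small helper that fixes the seeds and records the software environment. A minimal sketch, assuming NumPy and PyTorch are the libraries in use:

import random
import sys

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix the random seeds for Python, NumPy, and PyTorch."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def log_environment() -> None:
    """Record software versions alongside the experiment log."""
    print(f"Python: {sys.version.split()[0]}")
    print(f"NumPy: {np.__version__}")
    print(f"PyTorch: {torch.__version__}")

set_seed(42)
log_environment()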

Sample Experiment Log

Keep detailed logs of all experiments:

Experiment: full_multimodal_v1
Date: 2025-01-15
Model: MultimodalTransformer
Architecture: EHR encoder (12 layers) + Text encoder (ClinicalBERT) + Sketch encoder (ResNet50)
Data: 50k ED visits (train: 35k, val: 7.5k, test: 7.5k)
Hyperparameters:
  - learning_rate: 1e-4
  - batch_size: 32
  - epochs: 50
  - optimizer: Adam
Hardware: A100 GPU, 40GB RAM
Training time: 6 hours
Results:
  - Train AUROC: 0.92
  - Val AUROC: 0.89
  - Test AUROC: 0.88
Notes: Model shows slight overfitting. Try dropout=0.2 next.

Key Takeaways

  1. Multiple baselines: Simple, state-of-the-art, and ablated versions
  2. Ablation studies: Systematically test each component’s contribution
  3. Proper splits: Temporal splits for sequential data, avoid leakage
  4. Statistical tests: Report confidence intervals and p-values
  5. Multiple metrics: AUROC, AUPRC, F1, calibration for complete picture
  6. Reproducibility: Document everything for replication
  7. Separate tuning and testing: Use validation for hyperparameters, test only once