Experimental Design for Machine Learning
Rigorous experimental design is crucial for credible research. This guide covers baselines, ablation studies, data splits, and statistical testing for ML experiments.
Baselines
Always compare your method against strong baselines. You need multiple comparison points:
Essential Baselines
- Strongest existing method: State-of-the-art on your problem (e.g., ETHOS for EHR modeling)
- Ablated versions: Your model without each component
- Simple baseline: Logistic regression or similar on basic features
- Human performance: Expert judgment (if measurable)
Example Experimental Setup
```python
# Define all experiments upfront so every variant is trained and evaluated
# under identical conditions. LogisticRegression, ETHOS, and MultimodalModel
# stand in for your own model classes or wrappers.
experiments = {
    'baseline_simple': LogisticRegression(
        features=['age', 'sex', 'chief_complaint']
    ),
    'baseline_sota': ETHOS(
        structured_ehr_only=True
    ),
    'ablation_no_sketch': MultimodalModel(   # EHR + text, sketch removed
        use_ehr=True,
        use_text=True,
        use_sketch=False
    ),
    'ablation_no_text': MultimodalModel(     # EHR + sketch, text removed
        use_ehr=True,
        use_text=False,
        use_sketch=True
    ),
    'full_model': MultimodalModel(           # all three modalities
        use_ehr=True,
        use_text=True,
        use_sketch=True
    ),
}
```
Ablation Studies
Systematically remove components to understand their individual contributions.
Example Ablation Results
| Model Variant | EHR | Text | Sketch | Accuracy | AUROC |
|---|---|---|---|---|---|
| Full model | ✓ | ✓ | ✓ | 0.89 | 0.91 |
| No sketch | ✓ | ✓ | ✗ | 0.87 | 0.89 |
| No text | ✓ | ✗ | ✓ | 0.85 | 0.87 |
| EHR only (ETHOS) | ✓ | ✗ | ✗ | 0.83 | 0.85 |
| Text + sketch only | ✗ | ✓ | ✓ | 0.78 | 0.80 |
Interpreting Ablations
From this table, you can conclude:
- Text contributes about 4 accuracy points (0.89 → 0.85 when removed from the full model)
- Sketch contributes about 2 accuracy points (0.89 → 0.87 when removed from the full model)
- EHR is the strongest single signal (0.89 → 0.78 without it)
- Multimodal fusion provides a 6-point gain over EHR alone (0.83 → 0.89); the arithmetic is spelled out in the sketch below
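Scripting these comparisons keeps them consistent if the table is regenerated. A minimal sketch using the accuracy column above (the variant names are shorthand for the table rows):
```python
# Accuracy values copied from the ablation table above
accuracy = {
    'full': 0.89,
    'no_sketch': 0.87,         # EHR + text
    'no_text': 0.85,           # EHR + sketch
    'ehr_only': 0.83,
    'text_sketch_only': 0.78,
}

print(f"Sketch contribution:  {accuracy['full'] - accuracy['no_sketch']:+.2f}")
print(f"Text contribution:    {accuracy['full'] - accuracy['no_text']:+.2f}")
print(f"Fusion gain over EHR: {accuracy['full'] - accuracy['ehr_only']:+.2f}")
```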
Data Splits
Proper data splitting is critical to avoid overfitting and data leakage.
Standard Split
```python
# Simple 70/15/15 split
train_data = data[:int(0.7 * len(data))]                     # 70%
val_data = data[int(0.7 * len(data)):int(0.85 * len(data))]  # 15%
test_data = data[int(0.85 * len(data)):]                     # 15%
```
Temporal Split (Recommended for Sequential Data)
```python
# Temporal split avoids data leakage
train_data = data[data.date < '2024-01-01']
val_data = data[(data.date >= '2024-01-01') & (data.date < '2024-07-01')]
test_data = data[data.date >= '2024-07-01']
```
Why temporal splits matter: For sequential data such as EHR, random splits leak information because a patient's earlier and later events end up on both sides of the split, and the data distribution shifts over time. Temporal splits evaluate the model the way it will be deployed: trained on the past, tested on the future.
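As a quick sanity check, you can assert that the splits do not overlap in time. A minimal sketch, assuming `data.date` is a proper datetime column:
```python
# Every training example must precede every validation example,
# and every validation example must precede every test example.
assert train_data.date.max() < val_data.date.min()
assert val_data.date.max() < test_data.date.min()
```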
Cross-Validation
For robust evaluation, use k-fold cross-validation:
```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(data)):
    train_data = data.iloc[train_idx]   # use plain indexing instead of .iloc for NumPy arrays
    val_data = data.iloc[val_idx]
    model = train_model(train_data)
    score = evaluate_model(model, val_data)
    scores.append(score)
    print(f"Fold {fold+1}: {score:.4f}")

print(f"Mean: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
```
Statistical Significance
Don’t just report point estimates - show that improvements are statistically significant.
Paired t-test
```python
from scipy import stats

# Compare two models evaluated on the same 5 CV folds
model_a_scores = [0.87, 0.88, 0.86, 0.89, 0.87]
model_b_scores = [0.91, 0.90, 0.92, 0.89, 0.91]

# Paired t-test (pairs scores fold by fold)
t_statistic, p_value = stats.ttest_rel(model_a_scores, model_b_scores)
print(f"t-statistic: {t_statistic:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("✓ Improvement is statistically significant")
else:
    print("✗ Improvement is not statistically significant")
```
Confidence Intervals
Always report confidence intervals, not just point estimates:
```python
import numpy as np
from scipy import stats

def confidence_interval(scores, confidence=0.95):
    n = len(scores)
    mean = np.mean(scores)
    std_err = stats.sem(scores)
    margin = std_err * stats.t.ppf((1 + confidence) / 2, n - 1)
    return mean, mean - margin, mean + margin

mean, lower, upper = confidence_interval(model_a_scores)
print(f"Accuracy: {mean:.3f} (95% CI: [{lower:.3f}, {upper:.3f}])")
```
Evaluation Metrics
Choose metrics appropriate for your problem:
| Metric | When to Use |
|---|---|
| Accuracy | Balanced classes |
| AUROC | Imbalanced classes (common in healthcare) |
| AUPRC | Rare events (e.g., mortality) |
| Sensitivity (Recall) | When false negatives are costly |
| Specificity | When false positives are costly |
| F1 Score | Balance precision and recall |
| Calibration | Probability estimates must be accurate (see the sketch below) |
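The Calibration row is easy to overlook because it is not a single threshold metric. A minimal sketch of one way to check it with scikit-learn, assuming `y_true` and `y_prob` come from your validation or test predictions:
```python
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Lower Brier score = better-calibrated probability estimates
print(f"Brier score: {brier_score_loss(y_true, y_prob):.4f}")

# Reliability curve: predicted probability vs. observed frequency per bin
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
for pred, true in zip(prob_pred, prob_true):
    print(f"predicted {pred:.2f} -> observed {true:.2f}")
```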
Multiple Metrics
Report multiple metrics to give a complete picture:
```python
from sklearn.metrics import (
    accuracy_score, roc_auc_score,
    average_precision_score, f1_score
)

# y_true: ground-truth labels, y_pred: hard predictions, y_prob: predicted probabilities
metrics = {
    'accuracy': accuracy_score(y_true, y_pred),
    'auroc': roc_auc_score(y_true, y_prob),
    'auprc': average_precision_score(y_true, y_prob),
    'f1': f1_score(y_true, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```
Common Mistakes to Avoid
Critical Mistakes
- Testing on tuning data: Never test on the validation set used for hyperparameter tuning
- No temporal splits: Random splits cause data leakage in time-series data
- Cherry-picking seeds: Don’t report the best random seed; run multiple seeds and report the spread (see the sketch after this list)
- Missing confidence intervals: Always report uncertainty
- Weak baselines: Compare against the strongest existing methods
- Data leakage: Ensure strict separation between train/val/test
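To make the multi-seed point concrete, here is a minimal sketch. It reuses the hypothetical `train_model` / `evaluate_model` helpers from the examples above and assumes `train_model` accepts a `seed` argument that controls all sources of randomness:
```python
import numpy as np

# Train and evaluate with several seeds; report mean ± std, not the best run
seeds = [0, 1, 2, 3, 4]
scores = []
for seed in seeds:
    model = train_model(train_data, seed=seed)  # seed= is an assumed parameter
    scores.append(evaluate_model(model, val_data))

print(f"Mean over {len(seeds)} seeds: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
```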
Example: Avoiding Data Leakage
Wrong (leakage):
```python
# Normalize on entire dataset -- test statistics leak into the scaler
scaler.fit(all_data)
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)
```
Correct (no leakage):
```python
# Fit scaler only on training data
scaler.fit(train_data)
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)
```
Hyperparameter Tuning
Use the validation set for tuning, test set only for final evaluation:
```python
# Grid search on the validation set
learning_rates = [1e-5, 1e-4, 1e-3]
batch_sizes = [16, 32, 64]

best_val_score = 0
best_params = {}
for lr in learning_rates:
    for bs in batch_sizes:
        model = train_model(train_data, lr=lr, batch_size=bs)
        val_score = evaluate_model(model, val_data)
        if val_score > best_val_score:
            best_val_score = val_score
            best_params = {'lr': lr, 'batch_size': bs}

# Final evaluation on the test set with the best params
final_model = train_model(train_data, **best_params)
test_score = evaluate_model(final_model, test_data)
print(f"Best params: {best_params}")
print(f"Val score: {best_val_score:.4f}")
print(f"Test score: {test_score:.4f}")
```
Reproducibility Checklist
To ensure reproducibility:
- Set random seeds (`np.random.seed(42)`, `torch.manual_seed(42)`; see the snippet after this list)
- Document Python/PyTorch versions
- Log all hyperparameters
- Save model checkpoints
- Record hardware details (GPU type, RAM)
- Version control your code
- Document data preprocessing steps
- Provide training logs
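For the seed item, a minimal sketch of the kind of helper many projects use; the exact set of calls depends on your stack, and the `set_seed` name is just illustrative:
```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe to call even without a GPU

set_seed(42)
```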
Sample Experiment Log
Keep detailed logs of all experiments:
```text
Experiment: full_multimodal_v1
Date: 2025-01-15
Model: MultimodalTransformer
Architecture: EHR encoder (12 layers) + Text encoder (ClinicalBERT) + Sketch encoder (ResNet50)
Data: 50k ED visits (train: 35k, val: 7.5k, test: 7.5k)
Hyperparameters:
  - learning_rate: 1e-4
  - batch_size: 32
  - epochs: 50
  - optimizer: Adam
Hardware: A100 GPU, 40GB RAM
Training time: 6 hours
Results:
  - Train AUROC: 0.92
  - Val AUROC: 0.89
  - Test AUROC: 0.88
Notes: Model shows slight overfitting. Try dropout=0.2 next.
```
Related Concepts
- Optimization algorithms for training
- Regularization techniques to prevent overfitting
- Training best practices
Related Guides
- Formulating Research Questions - What to test
- Structuring Papers - Reporting your results
- Reading Papers - Understanding existing methods
Key Takeaways
- Multiple baselines: Simple, state-of-the-art, and ablated versions
- Ablation studies: Systematically test each component’s contribution
- Proper splits: Temporal splits for sequential data, avoid leakage
- Statistical tests: Report confidence intervals and p-values
- Multiple metrics: AUROC, AUPRC, F1, calibration for complete picture
- Reproducibility: Document everything for replication
- Separate tuning and testing: Use validation for hyperparameters, test only once