Practical Training Considerations
Theory is important, but getting neural networks to train successfully requires attention to many practical details. This section covers the essential techniques that make the difference between success and failure.
Weight Initialization
Problem: Poor initialization can break training before it even starts.
Why Initialization Matters
All zeros: Fails to break symmetry. Every neuron in a layer computes the same output and receives the same gradient, so they all learn the same thing.
```python
# BAD: Don't do this!
W = np.zeros((input_dim, hidden_dim))
```
Too large: Activations explode, gradients explode, loss becomes NaN
Too small: Activations vanish, gradients vanish, no learning
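To see this concretely, here is a quick sketch (the layer width, depth, and the 0.01/1.0 scales are illustrative choices, not from the text) that pushes random data through ten tanh layers at two initialization scales:

```python
import numpy as np

# Illustrative sketch: propagate data through 10 tanh layers and watch
# how the weight scale affects activation magnitudes.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 100))

for scale in [0.01, 1.0]:
    h = X
    for _ in range(10):
        W = rng.standard_normal((100, 100)) * scale
        h = np.tanh(h @ W)
    print(f"scale={scale}: std of layer-10 activations = {h.std():.4f}")

# scale=0.01: activations shrink toward 0 layer by layer (vanishing)
# scale=1.0:  tanh saturates near +/-1, so gradients vanish instead
```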
Xavier/Glorot Initialization
For layers with sigmoid or tanh activation:
```python
# Xavier initialization: scale weights by 1/sqrt(fan_in)
W = np.random.randn(input_dim, hidden_dim) / np.sqrt(input_dim)
```
Intuition: The variance of the activations stays roughly constant across layers.
He Initialization
For layers with ReLU activation:
```python
# He initialization (for ReLU): scale weights by sqrt(2/fan_in)
W = np.random.randn(input_dim, hidden_dim) * np.sqrt(2.0 / input_dim)
```
Why different? ReLU zeroes out roughly half of the activations, so the weights need twice the variance (the factor of 2) to compensate.
- ReLU activation: Use He initialization
- Sigmoid/Tanh activation: Use Xavier initialization
- Always initialize biases to zero: `b = np.zeros(hidden_dim)`
Complete Initialization Example
```python
import numpy as np

class Network:
    def __init__(self, layer_dims, activation='relu'):
        """
        Initialize network with proper weight initialization.

        Args:
            layer_dims: List [input_dim, hidden1, hidden2, ..., output_dim]
            activation: 'relu', 'sigmoid', or 'tanh'
        """
        self.params = {}
        self.activation = activation
        for i in range(1, len(layer_dims)):
            # Choose initialization based on activation
            if activation == 'relu':
                # He initialization: std = sqrt(2 / fan_in)
                std = np.sqrt(2.0 / layer_dims[i-1])
            else:
                # Xavier initialization: std = sqrt(1 / fan_in)
                std = np.sqrt(1.0 / layer_dims[i-1])
            self.params[f'W{i}'] = np.random.randn(layer_dims[i-1], layer_dims[i]) * std
            self.params[f'b{i}'] = np.zeros(layer_dims[i])
```
Learning Rate Selection
The most important hyperparameter. Too high → divergence. Too low → no learning.
Learning Rate Range Test
Strategy: Exponentially increase learning rate and plot loss:
```python
import numpy as np
import matplotlib.pyplot as plt

def learning_rate_range_test(model, X, y, min_lr=1e-6, max_lr=10, num_steps=100):
    """
    Find a good learning rate by exponentially increasing it.

    Returns:
        learning_rates: Array of tested learning rates
        losses: Corresponding losses
    """
    learning_rates = np.logspace(np.log10(min_lr), np.log10(max_lr), num_steps)
    losses = []
    for lr in learning_rates:
        # Take a single optimization step at this learning rate
        loss, grads = model.loss(X, y)
        for param in model.params:
            model.params[param] -= lr * grads[param]
        losses.append(loss)
    return learning_rates, losses

# Run test
lrs, losses = learning_rate_range_test(model, X_batch, y_batch)

# Plot results
plt.semilogx(lrs, losses)
plt.xlabel('Learning Rate (log scale)')
plt.ylabel('Loss')
plt.title('Learning Rate Range Test')
plt.show()
```
How to interpret:
- Loss decreases: Good learning rates
- Loss plateaus: Learning rate too small
- Loss explodes: Learning rate too large
- Pick: Highest learning rate where loss still decreases steadily
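A common heuristic for turning the plot into a number (this rule of thumb is an assumption, not part of the original test) is to smooth the curve, find the loss minimum, and back off by a factor of 10:

```python
# Smooth the noisy per-step losses, find the minimum, then divide by 10
# to land in the steadily-decreasing region rather than at the cliff edge.
losses = np.asarray(losses)
smoothed = np.convolve(losses, np.ones(5) / 5, mode='valid')
suggested_lr = lrs[np.argmin(smoothed)] / 10
print(f"Suggested learning rate: {suggested_lr:.2e}")
```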
Monitoring Training
What to Track
Essential metrics:
- Training loss (should decrease)
- Validation loss (should decrease, then plateau)
- Training accuracy
- Validation accuracy
- Learning rate (if using schedule)
Advanced metrics:
- Gradient magnitudes (check for vanishing/exploding)
- Weight magnitudes
- Activation distributions
Training Monitor
```python
import numpy as np
import matplotlib.pyplot as plt

class TrainingMonitor:
    def __init__(self):
        self.train_loss = []
        self.val_loss = []
        self.train_acc = []
        self.val_acc = []
        self.grad_norms = []

    def record(self, train_loss, val_loss, train_acc, val_acc, grads=None):
        self.train_loss.append(train_loss)
        self.val_loss.append(val_loss)
        self.train_acc.append(train_acc)
        self.val_acc.append(val_acc)
        if grads is not None:
            # Global gradient norm across all parameters
            grad_norm = np.sqrt(sum(np.sum(g**2) for g in grads.values()))
            self.grad_norms.append(grad_norm)

    def plot(self):
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))

        # Loss curves
        axes[0, 0].plot(self.train_loss, label='Train')
        axes[0, 0].plot(self.val_loss, label='Validation')
        axes[0, 0].set_xlabel('Epoch')
        axes[0, 0].set_ylabel('Loss')
        axes[0, 0].set_title('Loss Curves')
        axes[0, 0].legend()

        # Accuracy curves
        axes[0, 1].plot(self.train_acc, label='Train')
        axes[0, 1].plot(self.val_acc, label='Validation')
        axes[0, 1].set_xlabel('Epoch')
        axes[0, 1].set_ylabel('Accuracy')
        axes[0, 1].set_title('Accuracy Curves')
        axes[0, 1].legend()

        # Gradient norms
        if self.grad_norms:
            axes[1, 0].plot(self.grad_norms)
            axes[1, 0].set_xlabel('Epoch')
            axes[1, 0].set_ylabel('Gradient Norm')
            axes[1, 0].set_title('Gradient Magnitudes')

        plt.tight_layout()
        plt.show()
```
Debugging Neural Networks
Common issues and how to fix them:
Problem: Loss is NaN
Causes:
- Learning rate too high
- Numerical instability (e.g., log(0))
- Exploding gradients
Solutions:
- Reduce learning rate by 10x
- Add gradient clipping, applied per parameter array: `grads = {k: np.clip(g, -clip_value, clip_value) for k, g in grads.items()}`
- Check for numerical stability in the loss computation (see the sketch after this list)
- Verify weight initialization
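If the instability is in the loss itself, the usual culprit is computing `log(softmax)` directly and hitting `log(0)`. A minimal numerically stable softmax cross-entropy, as a sketch (not necessarily how your model implements it):

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    """Numerically stable softmax cross-entropy.

    scores: (N, C) raw class scores; y: (N,) integer labels.
    """
    # Subtracting the row max keeps exp() from overflowing...
    shifted = scores - scores.max(axis=1, keepdims=True)
    # ...and working in log-space avoids log(0) entirely.
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()
```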
Problem: Loss Not Decreasing
Causes:
- Learning rate too low
- Dead ReLUs (all neurons outputting 0)
- Bug in gradient computation
Solutions:
- Increase learning rate
- Check gradient magnitudes: `print(np.mean(np.abs(grads['W1'])))`
- Verify that a gradient check passes (see the sketch after this list)
- Try different initialization
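A minimal centered-difference gradient check, as a sketch (assuming, as in the earlier examples, that `model.loss(X, y)` returns `(loss, grads)` and parameters live in `model.params`):

```python
import numpy as np

def gradient_check(model, X, y, param_name, h=1e-5, num_checks=5):
    # Compare analytic gradients to centered finite differences at a few
    # random coordinates; relative error should be below ~1e-5.
    _, grads = model.loss(X, y)
    W = model.params[param_name]
    for _ in range(num_checks):
        idx = tuple(np.random.randint(d) for d in W.shape)
        old = W[idx]
        W[idx] = old + h
        loss_plus, _ = model.loss(X, y)
        W[idx] = old - h
        loss_minus, _ = model.loss(X, y)
        W[idx] = old  # restore the original value
        numeric = (loss_plus - loss_minus) / (2 * h)
        analytic = grads[param_name][idx]
        rel_err = abs(numeric - analytic) / max(abs(numeric), abs(analytic), 1e-12)
        print(f"{param_name}{idx}: relative error = {rel_err:.2e}")
```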
Problem: Vanishing Gradients
Symptoms: Gradients very small (< 1e-6), early layers not learning
Solutions:
- Use ReLU instead of sigmoid/tanh
- Use batch normalization
- Use better initialization (He for ReLU)
- Skip connections (in deeper networks)
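To confirm vanishing gradients, compare gradient norms layer by layer (assuming the `grads` dict is keyed `'W1'`, `'W2'`, ... as in the `Network` class above); the telltale sign is norms that shrink sharply toward the early layers:

```python
# Per-layer gradient norms: with vanishing gradients, W1's norm is
# orders of magnitude smaller than the final layer's.
for name in sorted(k for k in grads if k.startswith('W')):
    print(f"{name}: grad norm = {np.linalg.norm(grads[name]):.2e}")
```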
Problem: Exploding Gradients
Symptoms: Gradients very large (> 1), loss becomes NaN
Solutions:
- Reduce learning rate
- Gradient clipping: `grads = {k: np.clip(g, -5, 5) for k, g in grads.items()}` (element-wise value clipping; a direction-preserving variant is sketched after this list)
- Better weight initialization
- Check for bugs in backward pass
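Element-wise value clipping changes the gradient's direction; clipping by global norm rescales the whole gradient and preserves it. A sketch, assuming the usual dict of parameter gradients (the threshold 5.0 is an arbitrary example):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # If the combined L2 norm of all gradients exceeds max_norm,
    # rescale every gradient by the same factor (direction preserved).
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        for k in grads:
            grads[k] *= scale
    return grads
```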
Problem: Dead ReLUs
Symptoms: Many ReLU units always output 0
Solutions:
- Lower learning rate
- Better initialization (He initialization)
- Try Leaky ReLU:
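A minimal sketch (the 0.01 slope is the conventional default); the small negative slope means the unit still passes gradient when its input is negative, so it can recover:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but negative inputs keep a small slope instead of
    # being zeroed, so the unit can never die completely.
    return np.where(x > 0, x, alpha * x)
```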
Debugging Checklist
Step 1: Overfit a tiny dataset (10 examples)
- Can the model memorize them perfectly?
- If not, bug in model or optimization
Step 2: Check gradient implementation
- Run gradient check
- Should get relative error < 1e-5
Step 3: Monitor gradient magnitudes
- Too small (< 1e-6): Vanishing gradients
- Too large (> 1): Exploding gradients
Step 4: Visualize predictions
- What is the model actually predicting?
- Are predictions reasonable?
Step 5: Start simple, add complexity
- Begin with small network, simple data
- Gradually increase complexity
- Identify where things break
Hyperparameter Tuning
Hyperparameters to Tune
Priority 1 (tune these first):
- Learning rate
- Network architecture (layer sizes, number of layers)
Priority 2:
- Regularization strength
- Dropout rate
Priority 3:
- Batch size
- Optimizer parameters (momentum, betas)
Tuning Strategy
Random search > Grid search
```python
import random
import numpy as np

best_val_acc = 0
best_params = None

for trial in range(50):  # Try 50 random combinations
    # Sample hyperparameters (log-uniform for lr and reg)
    lr = 10 ** np.random.uniform(-5, -2)  # 1e-5 to 1e-2
    hidden_dim = random.choice([50, 100, 200, 500])
    reg = 10 ** np.random.uniform(-5, -2)
    dropout = random.uniform(0.3, 0.7)

    # Train model
    model = TwoLayerNet(input_dim, hidden_dim, output_dim)
    history = train(model, X_train, y_train, X_val, y_val,
                    learning_rate=lr, reg=reg, dropout=dropout)

    # Track best
    val_acc = history['val_acc_history'][-1]
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_params = {'lr': lr, 'hidden_dim': hidden_dim,
                       'reg': reg, 'dropout': dropout}

print(f"Best validation accuracy: {best_val_acc:.4f}")
print(f"Best hyperparameters: {best_params}")
```
Why is random search better than grid search?
- More efficient exploration
- When some hyperparameters matter more than others, random search tries many more distinct values of the important ones
- Easier to add more trials
Sanity Checks
Before training on full dataset:
Check 1: Overfit Small Dataset
```python
# Take 10 examples (the first 10 here; a random subset also works)
small_X = X_train[:10]
small_y = y_train[:10]

# Train until perfect accuracy
lr = 1e-3  # any reasonable learning rate
for epoch in range(100):
    loss, grads = model.loss(small_X, small_y)
    for param in model.params:
        model.params[param] -= lr * grads[param]
    if epoch % 10 == 0:
        train_acc = (model.predict(small_X) == small_y).mean()
        print(f"Epoch {epoch}: accuracy = {train_acc:.2f}")
# Should reach 100% accuracy within 100 epochs
```
If you can't overfit 10 examples, there's a bug!
Check 2: Initial Loss
For a softmax classifier with `num_classes` classes, the initial loss should be close to ln(num_classes), the loss of uniform random guessing:
```python
# Check initial loss (before training)
loss, _ = model.loss(X_train[:100], y_train[:100])
expected_loss = np.log(num_classes)
print(f"Initial loss: {loss:.4f}, expected: {expected_loss:.4f}")

# Should be close!
assert abs(loss - expected_loss) < 0.1, "Initial loss is wrong!"
```
Check 3: Loss Decreases
```python
# Loss should decrease after a few optimization steps
lr = 1e-3  # any reasonable learning rate
loss_before, _ = model.loss(X_batch, y_batch)
for _ in range(10):
    loss, grads = model.loss(X_batch, y_batch)
    for param in model.params:
        model.params[param] -= lr * grads[param]
loss_after, _ = model.loss(X_batch, y_batch)
assert loss_after < loss_before, "Loss not decreasing!"
```
Learning Resources
Reading
- Andrej Karpathy: A Recipe for Training Neural Networks ⭐⭐⭐ Essential reading
- CS231n: Tips/Tricks for Training Neural Networks
Related Concepts
- Weight Initialization - Detailed initialization strategies
- Gradient Checking - Verifying gradients
- Learning Rate Schedules - Adaptive learning rates
- Optimization - Training algorithms
Next Steps
- Apply these techniques in MNIST implementation
- Learn about batch normalization
- Study gradient checking in detail
- Explore advanced learning rate schedules