
Practical Training Considerations

Theory is important, but getting neural networks to train successfully requires attention to many practical details. This section covers the essential techniques that make the difference between success and failure.

Weight Initialization

Problem: Poor initialization can break training before it even starts.

Why Initialization Matters

All zeros: Fails to break symmetry, so all neurons in a layer compute the same output, receive the same gradient, and learn the same thing

    # BAD: Don't do this!
    W = np.zeros((input_dim, hidden_dim))

Too large: Activations explode, gradients explode, loss becomes NaN

Too small: Activations vanish, gradients vanish, no learning
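To illustrate the symmetry problem with identical initial weights, here is a minimal sketch using a tiny one-hidden-layer tanh network with a squared-error loss. The network, the helper name `hidden_weight_gradient`, and the data are assumptions made up for this illustration. A constant (non-zero) initialization is used alongside the all-zeros case so the identical gradients are visible; with exact zeros the weight gradient in this network is simply zero everywhere.

```python
import numpy as np

def hidden_weight_gradient(W1, W2, X, y):
    """Gradient of a mean squared-error loss w.r.t. W1 for a tiny tanh network."""
    h = np.tanh(X @ W1)                 # hidden activations, shape (N, H)
    pred = h @ W2                       # predictions, shape (N, 1)
    dpred = 2.0 * (pred - y) / len(X)   # dLoss/dpred
    dh = dpred @ W2.T                   # backprop into the hidden layer
    return X.T @ (dh * (1.0 - h**2))    # shape (D, H): one column per hidden unit

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = rng.normal(size=(32, 1))

# Constant initialization: every hidden unit starts out identical.
dW1_const = hidden_weight_gradient(np.full((4, 3), 0.5), np.full((3, 1), 0.5), X, y)
# All-zeros initialization: no signal reaches the hidden weights at all.
dW1_zero = hidden_weight_gradient(np.zeros((4, 3)), np.zeros((3, 1)), X, y)
# Random initialization: hidden units start out different.
dW1_rand = hidden_weight_gradient(rng.normal(size=(4, 3)), rng.normal(size=(3, 1)), X, y)

same_const = np.allclose(dW1_const, dW1_const[:, :1])  # all columns equal?
same_rand = np.allclose(dW1_rand, dW1_rand[:, :1])
all_zero = bool(np.all(dW1_zero == 0.0))
print(same_const, all_zero, same_rand)
```

When every column of the gradient is the same, every hidden unit receives the same update and the units can never become different from one another.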

Xavier/Glorot Initialization

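For layers with sigmoid or tanh activation (shown here in the common fan-in form; Glorot and Bengio's original formulation uses variance $2 / (n_{\text{in}} + n_{\text{out}})$):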

$$W \sim \mathcal{N}\left(0, \frac{1}{n_{\text{in}}}\right)$$

    # Xavier initialization
    W = np.random.randn(input_dim, hidden_dim) / np.sqrt(input_dim)

Intuition: With this scaling, the variance of the activations stays roughly constant from layer to layer instead of growing or shrinking.
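A short derivation sketch of where the $1/n_{\text{in}}$ comes from. It assumes zero-mean, mutually independent weights and inputs and an activation that is roughly linear around zero; the symbols $z_j$, $x_i$, and $w_{ij}$ are introduced here for illustration only.

```latex
\begin{aligned}
z_j &= \sum_{i=1}^{n_{\text{in}}} w_{ij}\, x_i \\
\operatorname{Var}(z_j) &= \sum_{i=1}^{n_{\text{in}}} \operatorname{Var}(w_{ij})\,\operatorname{Var}(x_i)
  = n_{\text{in}}\,\operatorname{Var}(w)\,\operatorname{Var}(x) \\
\operatorname{Var}(z_j) = \operatorname{Var}(x) &\iff \operatorname{Var}(w) = \frac{1}{n_{\text{in}}}
\end{aligned}
```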

He Initialization

For layers with ReLU activation:

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$$

    # He initialization (for ReLU)
    W = np.random.randn(input_dim, hidden_dim) * np.sqrt(2.0 / input_dim)

Why different? ReLU zeroes every negative pre-activation, which for zero-centered inputs is about half of them, so the weight variance is doubled to keep the signal from shrinking layer by layer.
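The effect can be checked empirically. The following is a rough sketch, not a benchmark: it pushes random data through a stack of ReLU layers and reports the root-mean-square activation at the end. The function name `relu_stack_rms`, the depth, and the width are arbitrary choices made for this illustration.

```python
import numpy as np

def relu_stack_rms(gain, depth=20, width=512, seed=0):
    """RMS activation after `depth` ReLU layers with weight variance gain / fan_in."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(256, width))            # unit-variance input batch
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * np.sqrt(gain / width)
        x = np.maximum(0.0, x @ W)               # linear layer followed by ReLU
    return float(np.sqrt(np.mean(x**2)))

rms_xavier = relu_stack_rms(gain=1.0)   # variance 1 / n_in
rms_he = relu_stack_rms(gain=2.0)       # variance 2 / n_in
print(f"Xavier scaling: {rms_xavier:.2e}   He scaling: {rms_he:.2e}")
```

With the $1/n_{\text{in}}$ scaling the signal should shrink by roughly a factor of $\sqrt{2}$ per ReLU layer, while the $2/n_{\text{in}}$ scaling should keep it on the order of the input scale.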

Rule of Thumb
  • ReLU activation: Use He initialization
  • Sigmoid/Tanh activation: Use Xavier initialization
  • Always initialize biases to zero: b = np.zeros(hidden_dim)

Complete Initialization Example

    import numpy as np

    class Network:
        def __init__(self, layer_dims, activation='relu'):
            """
            Initialize network with proper weight initialization.

            Args:
                layer_dims: List [input_dim, hidden1, hidden2, ..., output_dim]
                activation: 'relu', 'sigmoid', or 'tanh'
            """
            self.params = {}
            self.activation = activation

            for i in range(1, len(layer_dims)):
                # Choose initialization based on activation
                if activation == 'relu':
                    # He initialization
                    std = np.sqrt(2.0 / layer_dims[i-1])
                else:
                    # Xavier initialization
                    std = np.sqrt(1.0 / layer_dims[i-1])

                # Weights are random and scaled by std; biases start at zero
                self.params[f'W{i}'] = np.random.randn(layer_dims[i-1], layer_dims[i]) * std
                self.params[f'b{i}'] = np.zeros(layer_dims[i])

Learning Rate Selection

The learning rate is usually the most important hyperparameter. Too high and training diverges; too low and the model barely learns.

Learning Rate Range Test

Strategy: Exponentially increase learning rate and plot loss:

    import numpy as np
    import matplotlib.pyplot as plt

    def learning_rate_range_test(model, X, y, min_lr=1e-6, max_lr=10, num_steps=100):
        """
        Find good learning rate by exponentially increasing it.

        Note: this updates model.params in place, so run it on a throwaway
        copy of the model or re-initialize the model afterwards.

        Returns:
            learning_rates: Array of tested learning rates
            losses: Corresponding losses
        """
        learning_rates = np.logspace(np.log10(min_lr), np.log10(max_lr), num_steps)
        losses = []

        for lr in learning_rates:
            # Take a single optimization step
            # (the recorded loss is measured just before this step's update)
            loss, grads = model.loss(X, y)
            for param in model.params:
                model.params[param] -= lr * grads[param]
            losses.append(loss)

        return learning_rates, losses

    # Run test
    lrs, losses = learning_rate_range_test(model, X_batch, y_batch)

    # Plot results
    plt.semilogx(lrs, losses)
    plt.xlabel('Learning Rate (log scale)')
    plt.ylabel('Loss')
    plt.title('Learning Rate Range Test')
    plt.show()

How to interpret:

  • Loss decreases: Good learning rates
  • Loss plateaus: Learning rate too small
  • Loss explodes: Learning rate too large
  • Pick: Highest learning rate where loss still decreases steadily
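Reading the curve by eye is usually enough, but the choice can also be roughed out in code. The sketch below uses a different, more conservative heuristic than the rule above: it picks the learning rate where the smoothed loss is falling fastest, looking only at the part of the curve before the loss minimum. The function name `suggest_learning_rate`, the smoothing window, and the synthetic loss curve are all assumptions made up for this example.

```python
import numpy as np

def suggest_learning_rate(learning_rates, losses, smooth=5):
    """Learning rate at the steepest descent of the smoothed range-test curve."""
    lrs = np.asarray(learning_rates, dtype=float)
    loss = np.asarray(losses, dtype=float)

    # Drop everything from the first NaN/inf onward (training has diverged there).
    finite = np.isfinite(loss)
    if not finite.all():
        cut = int(np.argmin(finite))     # index of the first non-finite loss
        lrs, loss = lrs[:cut], loss[:cut]
    if len(loss) < smooth + 2:
        raise ValueError("not enough finite loss values to analyze")

    # Moving average, with each averaged value aligned to the center of its window.
    smoothed = np.convolve(loss, np.ones(smooth) / smooth, mode="valid")
    offset = smooth // 2
    lrs = lrs[offset:offset + len(smoothed)]

    # Only consider the curve up to its minimum; after that the loss is rising.
    stop = max(int(np.argmin(smoothed)) + 1, 2)
    slopes = np.gradient(smoothed[:stop], np.log10(lrs[:stop]))
    return float(lrs[int(np.argmin(slopes))])

# Synthetic curve: flat, then a drop centered near lr = 1e-2, then divergence.
log_lr = np.linspace(-6, 1, 100)
synthetic_lrs = 10.0 ** log_lr
synthetic_losses = 2.3 - 1.5 / (1.0 + np.exp(-3.0 * (log_lr + 2.0))) + np.exp(4.0 * log_lr)
suggested = suggest_learning_rate(synthetic_lrs, synthetic_losses)
print(f"Suggested learning rate: {suggested:.1e}")
```

Treat the result as a starting point for a short training run, not as a final answer.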

Monitoring Training

What to Track

Essential metrics:

  • Training loss (should decrease)
  • Validation loss (should decrease, then plateau)
  • Training accuracy
  • Validation accuracy
  • Learning rate (if using schedule)

Advanced metrics:

  • Gradient magnitudes (check for vanishing/exploding)
  • Weight magnitudes
  • Activation distributions

Training Monitor

    import numpy as np
    import matplotlib.pyplot as plt

    class TrainingMonitor:
        def __init__(self):
            self.train_loss = []
            self.val_loss = []
            self.train_acc = []
            self.val_acc = []
            self.grad_norms = []

        def record(self, train_loss, val_loss, train_acc, val_acc, grads=None):
            self.train_loss.append(train_loss)
            self.val_loss.append(val_loss)
            self.train_acc.append(train_acc)
            self.val_acc.append(val_acc)

            if grads is not None:
                # Compute the global L2 norm over all gradient arrays
                grad_norm = np.sqrt(sum(np.sum(g**2) for g in grads.values()))
                self.grad_norms.append(grad_norm)

        def plot(self):
            fig, axes = plt.subplots(2, 2, figsize=(15, 10))

            # Loss curves
            axes[0, 0].plot(self.train_loss, label='Train')
            axes[0, 0].plot(self.val_loss, label='Validation')
            axes[0, 0].set_xlabel('Iteration')
            axes[0, 0].set_ylabel('Loss')
            axes[0, 0].set_title('Loss Curves')
            axes[0, 0].legend()

            # Accuracy curves
            axes[0, 1].plot(self.train_acc, label='Train')
            axes[0, 1].plot(self.val_acc, label='Validation')
            axes[0, 1].set_xlabel('Epoch')
            axes[0, 1].set_ylabel('Accuracy')
            axes[0, 1].set_title('Accuracy Curves')
            axes[0, 1].legend()

            # Gradient norms
            if self.grad_norms:
                axes[1, 0].plot(self.grad_norms)
                axes[1, 0].set_xlabel('Iteration')
                axes[1, 0].set_ylabel('Gradient Norm')
                axes[1, 0].set_title('Gradient Magnitudes')

            axes[1, 1].axis('off')  # fourth panel is unused
            plt.tight_layout()
            plt.show()

Debugging Neural Networks

Common issues and how to fix them:

Problem: Loss is NaN

Causes:

  • Learning rate too high
  • Numerical instability (e.g., log(0))
  • Exploding gradients

Solutions:

  • Reduce learning rate by 10x
  • Add gradient clipping, for example by clipping each gradient array elementwise: grads[k] = np.clip(grads[k], -clip_value, clip_value)
  • Check for numerical stability in loss computation
  • Verify weight initialization
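For the numerical-stability cause, one common fix is to compute the softmax cross-entropy through log-sum-exp instead of exponentiating raw scores. The sketch below assumes a softmax classifier with integer class labels; the function name `softmax_cross_entropy` and the extreme example scores are made up for illustration.

```python
import numpy as np

def softmax_cross_entropy(scores, labels):
    """Mean cross-entropy from raw class scores, computed via log-sum-exp."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # largest score becomes 0
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Deliberately extreme scores that overflow a naive implementation.
scores = np.array([[1000.0, 0.0, -1000.0],
                   [0.0, 1000.0, 0.0]])
labels = np.array([2, 1])

stable_loss = softmax_cross_entropy(scores, labels)

# Naive version for comparison: exp(1000) overflows to inf and poisons the result.
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    exp_scores = np.exp(scores)
    naive_probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    naive_loss = float(-np.log(naive_probs[np.arange(len(labels)), labels]).mean())

print(stable_loss, naive_loss)
```

Subtracting the row maximum does not change the softmax output, but it guarantees that no exponent is positive, so nothing overflows.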

Problem: Loss Not Decreasing

Causes:

  • Learning rate too low
  • Dead ReLUs (all neurons outputting 0)
  • Bug in gradient computation

Solutions:

  • Increase learning rate
  • Check gradient magnitudes: print(np.mean(np.abs(grads['W1'])))
  • Verify gradient check passes
  • Try different initialization
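A gradient check compares the analytic gradient against a finite-difference estimate. The following is a minimal sketch under stated assumptions: the helper names `numerical_gradient` and `relative_error`, the step size, and the toy loss are chosen for this example and are not taken from the model code in this guide.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Centered-difference estimate of the gradient of scalar f at x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f(x)
        x[idx] = old - eps
        f_minus = f(x)
        x[idx] = old                      # restore the original value
        grad[idx] = (f_plus - f_minus) / (2.0 * eps)
    return grad

def relative_error(a, b):
    """Largest elementwise relative difference between two arrays."""
    return float(np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b))))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W = rng.normal(size=(3, 2))

def toy_loss(W_):
    # Toy loss: half the sum of squared tanh activations
    return 0.5 * np.sum(np.tanh(X @ W_) ** 2)

h = np.tanh(X @ W)
analytic = X.T @ (h * (1.0 - h**2))       # hand-derived gradient of toy_loss
numeric = numerical_gradient(toy_loss, W)
err = relative_error(analytic, numeric)
print(f"relative error: {err:.2e}")
```

To check a real model, replace `toy_loss` with a function that writes the perturbed array into the model's parameters and returns the loss, then compare against the gradient from the backward pass.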

Problem: Vanishing Gradients

Symptoms: Gradients very small (< 1e-6), early layers not learning

Solutions:

  • Use ReLU instead of sigmoid/tanh
  • Use batch normalization
  • Use better initialization (He for ReLU)
  • Skip connections (in deeper networks)

Problem: Exploding Gradients

Symptoms: Gradient norms that grow by orders of magnitude between iterations (values far above 1 are a rough warning sign), loss becomes NaN

Solutions:

  • Reduce learning rate
  • Gradient clipping, applied to each gradient array g in grads: np.clip(g, -5, 5)
  • Better weight initialization
  • Check for bugs in backward pass
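An alternative to elementwise clipping is clipping by global norm, which rescales all gradients together and so preserves the update direction. This is a sketch that assumes gradients are stored in a dict of NumPy arrays, as in the other examples in this guide; the function name `clip_gradients_by_norm` and the example values are hypothetical.

```python
import numpy as np

def clip_gradients_by_norm(grads, max_norm):
    """Rescale all gradients so their combined L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g**2) for g in grads.values())))
    scale = min(1.0, max_norm / (total_norm + 1e-12))   # never scale up
    clipped = {name: g * scale for name, g in grads.items()}
    return clipped, total_norm

grads = {"W1": np.full((2, 2), 3.0), "b1": np.full(2, 4.0)}
clipped, norm_before = clip_gradients_by_norm(grads, max_norm=5.0)
norm_after = float(np.sqrt(sum(np.sum(g**2) for g in clipped.values())))
print(f"norm before: {norm_before:.3f}, after: {norm_after:.3f}")
```

The returned `total_norm` is worth logging: if clipping triggers on most steps, the learning rate or the initialization probably needs attention.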

Problem: Dead ReLUs

Symptoms: Many ReLU units always output 0

Solutions:

  • Lower learning rate
  • Better initialization (He initialization)
  • Try Leaky ReLU: $f(x) = \max(0.01x, x)$
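Dead units can be detected by checking, for each unit, whether its pre-activation is non-positive on every example in a batch. The sketch below assumes you can get at a layer's pre-activation matrix of shape (batch, units); the names `dead_relu_fraction` and `leaky_relu` and the fabricated batch are for illustration only.

```python
import numpy as np

def dead_relu_fraction(pre_activations):
    """Fraction of units whose pre-activation is <= 0 for every example in the batch."""
    return float(np.mean(np.all(pre_activations <= 0, axis=0)))

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: passes positives through, scales negatives by `slope`."""
    return np.maximum(slope * x, x)

rng = np.random.default_rng(0)
z = rng.normal(size=(128, 10))           # fake pre-activations: 128 examples, 10 units
z[:, :3] = -np.abs(z[:, :3]) - 1.0       # force three units negative on every example

frac = dead_relu_fraction(z)
print(f"dead units: {frac:.0%}")
print(leaky_relu(np.array([-2.0, 3.0])))
```

A unit that is dead on one batch may still fire on another, so it is safer to measure this over a large batch or several batches before drawing conclusions.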

Debugging Checklist

Systematic Debugging

Step 1: Overfit a tiny dataset (10 examples)

  • Can the model memorize them perfectly?
  • If not, bug in model or optimization

Step 2: Check gradient implementation

  • Run gradient check
  • Should get relative error < 1e-5

Step 3: Monitor gradient magnitudes

  • Too small (< 1e-6): Vanishing gradients
  • Too large (far above 1, or growing rapidly): Exploding gradients

Step 4: Visualize predictions

  • What is the model actually predicting?
  • Are predictions reasonable?

Step 5: Start simple, add complexity

  • Begin with small network, simple data
  • Gradually increase complexity
  • Identify where things break

Hyperparameter Tuning

Hyperparameters to Tune

Priority 1 (tune these first):

  • Learning rate
  • Network architecture (layer sizes, number of layers)

Priority 2:

  • Regularization strength
  • Dropout rate

Priority 3:

  • Batch size
  • Optimizer parameters (momentum, betas)

Tuning Strategy

Random search > Grid search

    import random
    import numpy as np

    best_val_acc = 0
    best_params = None

    for trial in range(50):  # Try 50 random combinations
        # Sample hyperparameters from ranges
        lr = 10 ** np.random.uniform(-5, -2)  # 1e-5 to 1e-2, log-uniform
        hidden_dim = random.choice([50, 100, 200, 500])
        reg = 10 ** np.random.uniform(-5, -2)
        dropout = random.uniform(0.3, 0.7)

        # Train model
        model = TwoLayerNet(input_dim, hidden_dim, output_dim)
        history = train(model, X_train, y_train, X_val, y_val,
                        learning_rate=lr, reg=reg, dropout=dropout)

        # Track best
        val_acc = history['val_acc_history'][-1]
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_params = {'lr': lr, 'hidden_dim': hidden_dim,
                           'reg': reg, 'dropout': dropout}

    print(f"Best validation accuracy: {best_val_acc:.4f}")
    print(f"Best hyperparameters: {best_params}")

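Why is random search better than grid search?

  • More efficient exploration: every trial tries a new value of every hyperparameter
  • Better when hyperparameters differ in importance, because the few that matter get many distinct values instead of the same handful repeated
  • Easier to add more trials, since any number of extra samples can be appended without redesigning a grid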
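The coverage argument can be made concrete by counting how many distinct learning rates each strategy tries under the same budget. This sketch uses a made-up budget of 16 runs over two hypothetical hyperparameters (learning rate and regularization strength) and does no training.

```python
import numpy as np

rng = np.random.default_rng(0)
budget = 16  # total number of training runs we can afford

# Grid search: 4 learning rates x 4 regularization strengths = 16 runs.
grid_lr = np.logspace(-5, -2, 4)
grid_reg = np.logspace(-5, -2, 4)
grid_trials = [(lr, reg) for lr in grid_lr for reg in grid_reg]

# Random search: 16 independent log-uniform draws of each hyperparameter.
random_trials = [(10 ** rng.uniform(-5, -2), 10 ** rng.uniform(-5, -2))
                 for _ in range(budget)]

distinct_lr_grid = len({lr for lr, _ in grid_trials})
distinct_lr_random = len({lr for lr, _ in random_trials})
print(f"distinct learning rates: grid={distinct_lr_grid}, random={distinct_lr_random}")
```

If the learning rate turns out to be the hyperparameter that matters most, the grid has effectively tested it at only four settings, while random search has tested it at sixteen.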

Sanity Checks

Before training on full dataset:

Check 1: Overfit Small Dataset

    # Take the first 10 training examples
    small_X = X_train[:10]
    small_y = y_train[:10]

    # Train until perfect accuracy
    for epoch in range(100):
        loss, grads = model.loss(small_X, small_y)

        # Update parameters (plain SGD; learning_rate is the value you plan to train with)
        for param in model.params:
            model.params[param] -= learning_rate * grads[param]

        if epoch % 10 == 0:
            train_acc = (model.predict(small_X) == small_y).mean()
            print(f"Epoch {epoch}: accuracy = {train_acc:.2f}")

    # Should reach 100% accuracy within 100 epochs

If you can’t overfit 10 examples, there’s a bug!

Check 2: Initial Loss

For a softmax classifier with $C$ classes:

$$L_{\text{initial}} \approx -\log(1/C) = \log(C)$$

    # Check initial loss (before training, with regularization turned off)
    loss, _ = model.loss(X_train[:100], y_train[:100])  # model.loss returns (loss, grads)
    expected_loss = np.log(num_classes)
    print(f"Initial loss: {loss:.4f}, expected: {expected_loss:.4f}")

    # Should be close!
    assert abs(loss - expected_loss) < 0.1, "Initial loss is wrong!"
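As a quick reference, the expected initial loss can be tabulated for a few class counts. The sketch assumes the untrained model assigns equal probability to every class; the class counts are arbitrary examples.

```python
import numpy as np

expected_initial_loss = {}
for num_classes in (2, 10, 100, 1000):
    uniform_probs = np.full(num_classes, 1.0 / num_classes)  # untrained model: all classes equally likely
    expected_initial_loss[num_classes] = float(-np.log(uniform_probs[0]))
    print(f"C = {num_classes:4d}: expected initial loss = {expected_initial_loss[num_classes]:.4f}")
```

An initial loss far above these values usually points to weights that are too large; a value far below suggests label leakage or a bug in the loss.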

Check 3: Loss Decreases

    # Loss should decrease after a few iterations
    loss_before, _ = model.loss(X_batch, y_batch)  # model.loss returns (loss, grads)

    for _ in range(10):
        loss, grads = model.loss(X_batch, y_batch)
        # Update parameters...

    loss_after, _ = model.loss(X_batch, y_batch)
    assert loss_after < loss_before, "Loss not decreasing!"

Learning Resources

Reading

  • Weight Initialization - Detailed initialization strategies
  • Gradient Checking - Verifying gradients
  • Learning Rate Schedules - Adaptive learning rates
  • Optimization - Training algorithms

Next Steps

  1. Apply these techniques in the MNIST implementation
  2. Learn about batch normalization
  3. Study gradient checking in detail
  4. Explore advanced learning rate schedules