Backpropagation
This is the foundation of everything in deep learning. You must understand backpropagation deeply: not just conceptually, but mathematically and in code. Spend extra time here if needed; everything else builds on this.
Backpropagation is the algorithm that makes training deep neural networks possible. It efficiently computes gradients by propagating error backward through the network using the chain rule of calculus.
Understanding Backpropagation
The Chain Rule
For composed functions, the chain rule tells us how to compute derivatives:
If $q = f(x)$ and $L = g(q)$, then:
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial q} \cdot \frac{\partial q}{\partial x}$$
Neural Network Perspective:
- $x$ = input or parameter
- $q$ = intermediate activation
- $L$ = final loss
- We want $\frac{\partial L}{\partial x}$ to know how to update $x$
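To see the chain rule in action numerically, here is a minimal sketch (an illustrative example added for this note, with made-up functions) that composes two scalar functions and compares the chain-rule gradient against a finite-difference estimate:

```python
# Chain rule on q = f(x) = x**2 and L = g(q) = 3*q
x = 2.0
q = x ** 2              # intermediate value
L = 3.0 * q             # final "loss"

dL_dq = 3.0             # local gradient of g at q
dq_dx = 2.0 * x         # local gradient of f at x
dL_dx = dL_dq * dq_dx   # chain rule: 3 * 2x = 12.0

# Sanity check with a central finite difference
eps = 1e-6
numeric = (3.0 * (x + eps) ** 2 - 3.0 * (x - eps) ** 2) / (2 * eps)
print(dL_dx, numeric)   # both ≈ 12.0
```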
Computational Graphs
A computational graph represents the forward pass as a directed acyclic graph (DAG):
- Nodes = variables or operations
- Edges = data flow
Example:
x ──┐
    ├──> [+] ──> q ──┐
y ──┘                ├──> [×] ──> f
                     │
z ───────────────────┘
Forward pass: Compute values from inputs to output.
Backward pass: Compute gradients from output to inputs.
Local vs. Global Gradients
Local gradient: Derivative of the operation with respect to its immediate inputs
Global gradient: Derivative of the final loss with respect to this variable
Chain Rule Connection:
The global gradient at each node = (upstream gradient) × (local gradient)
Backpropagation is not magic—it’s just the chain rule applied systematically to compute how each weight affects the final loss. Each layer receives a gradient from the layer above and passes gradients to the layer below.
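To make "upstream × local" concrete, the following sketch (illustrative values only) runs the graph from the example above, which computes $f = (x + y) \times z$, forward and then backward node by node:

```python
# Forward pass through the graph f = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass: start with df/df = 1 and apply upstream * local at each node
df_df = 1.0
# [×] node: local gradient w.r.t. q is z, w.r.t. z is q
df_dq = df_df * z    # -4
df_dz = df_df * q    #  3
# [+] node: local gradient is 1 for both inputs
df_dx = df_dq * 1.0  # -4
df_dy = df_dq * 1.0  # -4

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```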
The Backpropagation Algorithm
Forward Pass
- Start with inputs
- Compute each operation in order
- Store intermediate values (cache)
- Compute loss at output
Backward Pass
- Start with the gradient of the loss w.r.t. itself (= 1)
- For each operation in reverse order:
- Receive gradient from upstream
- Compute local gradients
- Multiply: upstream gradient × local gradient
- Pass gradients to inputs
- Accumulate gradients for parameters
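One way to see both passes in code is a toy operation object that caches its inputs during the forward pass and applies upstream × local during the backward pass. This is a minimal sketch; the class and method names are invented for illustration and are not part of the implementation later in this section:

```python
class MultiplyGate:
    """Toy node: caches inputs in forward, applies upstream * local in backward."""
    def forward(self, a, b):
        self.a, self.b = a, b      # cache for the backward pass
        return a * b

    def backward(self, upstream):
        da = upstream * self.b     # local gradient w.r.t. a is b
        db = upstream * self.a     # local gradient w.r.t. b is a
        return da, db

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)      # forward: -12.0
da, db = gate.backward(1.0)        # backward with upstream gradient 1
print(out, da, db)                 # -12.0 -4.0 3.0
```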
Mathematical Derivation for Two-Layer Network
Let’s derive backpropagation step-by-step for a two-layer network.
Forward Pass
Given input $X \in \mathbb{R}^{N \times D}$ and labels $y \in \{0, \dots, C-1\}^N$:
Layer 1: $z_1 = X W_1 + b_1, \quad h = \mathrm{ReLU}(z_1)$
Layer 2: $s = h W_2 + b_2$
Loss (softmax + cross-entropy): $p_{i,j} = \dfrac{e^{s_{i,j}}}{\sum_k e^{s_{i,k}}}, \quad L = -\dfrac{1}{N} \sum_{i=1}^{N} \log p_{i, y_i}$
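For reference, here is a minimal NumPy sketch of this forward pass. It is an illustrative stand-in, not the exact forward method used later, although the cache tuple follows the ordering unpacked by the backward() implementation below:

```python
import numpy as np

def two_layer_forward(X, W1, b1, W2, b2):
    """Forward pass for the two-layer network described above."""
    z1 = X @ W1 + b1                    # Layer 1 pre-activation
    h = np.maximum(0, z1)               # ReLU
    scores = h @ W2 + b2                # Layer 2 scores
    return scores, (X, z1, h, W1, W2)   # cache ordering matches backward() below

# Example shapes (made up): N=4 examples, D=3 features, H=5 hidden, C=2 classes
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
W1, b1 = rng.standard_normal((3, 5)) * 0.01, np.zeros(5)
W2, b2 = rng.standard_normal((5, 2)) * 0.01, np.zeros(2)
scores, cache = two_layer_forward(X, W1, b1, W2, b2)
print(scores.shape)  # (4, 2)
```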
Backward Pass
We need: $\dfrac{\partial L}{\partial W_1}$, $\dfrac{\partial L}{\partial b_1}$, $\dfrac{\partial L}{\partial W_2}$, $\dfrac{\partial L}{\partial b_2}$
Step 1: Gradient at output (softmax + cross-entropy)
The gradient has a beautiful closed form:
$$\frac{\partial L}{\partial s_{i,j}} = \frac{1}{N} \left( p_{i,j} - \mathbb{1}[j = y_i] \right)$$
In code: dS = probs.copy(); dS[range(N), y] -= 1; dS /= N
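As a quick sanity check of this closed form, the sketch below (illustrative, with random data) compares the analytic gradient $(p_{i,j} - \mathbb{1}[j = y_i])/N$ against a central finite difference on one entry of the score matrix:

```python
import numpy as np

N, C = 4, 3
rng = np.random.default_rng(1)
scores = rng.standard_normal((N, C))
y = rng.integers(0, C, size=N)

def softmax_ce_loss(s):
    s = s - s.max(axis=1, keepdims=True)                      # stability shift
    probs = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(N), y]).mean()

# Analytic gradient from the closed form above
shifted = scores - scores.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
dS = probs.copy()
dS[np.arange(N), y] -= 1
dS /= N

# Central finite difference on a single entry for comparison
eps = 1e-5
s_plus, s_minus = scores.copy(), scores.copy()
s_plus[0, 0] += eps
s_minus[0, 0] -= eps
numeric = (softmax_ce_loss(s_plus) - softmax_ce_loss(s_minus)) / (2 * eps)
print(dS[0, 0], numeric)  # should match to several decimal places
```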
Step 2: Backprop through layer 2
$$\frac{\partial L}{\partial W_2} = h^\top \frac{\partial L}{\partial s}, \qquad \frac{\partial L}{\partial b_2} = \sum_{i=1}^{N} \frac{\partial L}{\partial s_i}, \qquad \frac{\partial L}{\partial h} = \frac{\partial L}{\partial s} W_2^\top$$
Step 3: Backprop through ReLU
$$\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial h} \odot \mathbb{1}[z_1 > 0]$$
where $\odot$ is element-wise multiplication and $\mathbb{1}[z_1 > 0]$ is 1 if $z_1 > 0$, else 0.
Step 4: Backprop through layer 1
$$\frac{\partial L}{\partial W_1} = X^\top \frac{\partial L}{\partial z_1}, \qquad \frac{\partial L}{\partial b_1} = \sum_{i=1}^{N} \frac{\partial L}{\partial z_{1,i}}$$
Implementation
Complete backpropagation implementation:
def backward(self, X, y, cache, reg=0.0):
"""
Compute gradients using backpropagation.
Args:
X: Input data (N, D)
y: Labels (N,)
cache: Values from forward pass
reg: Regularization strength
Returns:
grads: Dictionary containing gradients
"""
X, z1, h, W1, W2 = cache
N = X.shape[0]
# Compute scores and probabilities
scores = h @ W2 + self.params['b2']
scores_shifted = scores - np.max(scores, axis=1, keepdims=True)
exp_scores = np.exp(scores_shifted)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
# ================== BACKWARD PASS ==================
# Gradient of loss w.r.t. scores (softmax + cross-entropy)
dscores = probs.copy()
dscores[range(N), y] -= 1
dscores /= N
# Backprop into W2 and b2
dW2 = h.T @ dscores
db2 = np.sum(dscores, axis=0)
dW2 += reg * W2 # Add regularization gradient
# Backprop into hidden layer
dh = dscores @ W2.T
# Backprop through ReLU
dh[z1 <= 0] = 0
# Backprop into W1 and b1
dW1 = X.T @ dh
db1 = np.sum(dh, axis=0)
dW1 += reg * W1 # Add regularization gradient
grads = {
'W1': dW1,
'b1': db1,
'W2': dW2,
'b2': db2
}
return grads
Training Loop
def train(self, X, y, X_val, y_val,
learning_rate=1e-3, reg=1e-5,
num_epochs=100, batch_size=200,
verbose=True):
"""Train the network using SGD with backpropagation."""
num_train = X.shape[0]
iterations_per_epoch = max(num_train // batch_size, 1)
loss_history = []
train_acc_history = []
val_acc_history = []
for epoch in range(num_epochs):
# Shuffle training data
indices = np.random.choice(num_train, num_train, replace=False)
for it in range(iterations_per_epoch):
# Sample minibatch
batch_indices = indices[it * batch_size:(it + 1) * batch_size]
X_batch = X[batch_indices]
y_batch = y[batch_indices]
# Forward pass
scores, cache = self.forward(X_batch)
# Compute loss
loss = self.loss(X_batch, y_batch, reg)
loss_history.append(loss)
# Backward pass
grads = self.backward(X_batch, y_batch, cache, reg)
# Update parameters (SGD)
self.params['W1'] -= learning_rate * grads['W1']
self.params['b1'] -= learning_rate * grads['b1']
self.params['W2'] -= learning_rate * grads['W2']
self.params['b2'] -= learning_rate * grads['b2']
# Check accuracy
train_acc = (self.predict(X) == y).mean()
val_acc = (self.predict(X_val) == y_val).mean()
train_acc_history.append(train_acc)
val_acc_history.append(val_acc)
if verbose:
print(f'Epoch {epoch+1}/{num_epochs}: '
f'loss {loss:.4f}, '
f'train_acc {train_acc:.4f}, '
f'val_acc {val_acc:.4f}')
return {
'loss_history': loss_history,
'train_acc_history': train_acc_history,
'val_acc_history': val_acc_history
}
Gradient Checking
Always verify your backprop implementation with numerical gradients!
Numerical Gradient
Approximate gradient using finite differences:
$$\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + \epsilon) - L(\theta - \epsilon)}{2\epsilon}$$
where $\epsilon$ is a small value (e.g., $10^{-5}$).
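Before the full implementation, the formula is easy to try on a single scalar function (a toy example, not part of the network code):

```python
# Central difference on f(x) = x**3 at x = 2; exact derivative is 3*x**2 = 12
f = lambda x: x ** 3
x, eps = 2.0, 1e-5
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(numeric)  # ≈ 12.0, with an error on the order of eps**2
```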
Implementation
def gradient_check(self, X, y, reg=0.0, epsilon=1e-5):
"""
Verify backpropagation gradients using numerical gradients.
Args:
X: Input data (small batch, e.g., 5 examples)
y: Labels
reg: Regularization strength
epsilon: Small value for finite differences
"""
# Compute analytical gradients
loss = self.loss(X, y, reg)
scores, cache = self.forward(X)
grads = self.backward(X, y, cache, reg)
# Check each parameter
for param_name in ['W1', 'b1', 'W2', 'b2']:
param = self.params[param_name]
grad = grads[param_name]
# Compute numerical gradient
num_grad = np.zeros_like(param)
it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
while not it.finished:
idx = it.multi_index
# f(x + epsilon)
old_value = param[idx]
param[idx] = old_value + epsilon
loss_plus = self.loss(X, y, reg)
# f(x - epsilon)
param[idx] = old_value - epsilon
loss_minus = self.loss(X, y, reg)
# Numerical gradient
num_grad[idx] = (loss_plus - loss_minus) / (2 * epsilon)
# Restore original value
param[idx] = old_value
it.iternext()
# Compare
diff = np.linalg.norm(grad - num_grad) / np.linalg.norm(grad + num_grad)
print(f'{param_name}: relative error = {diff:.2e}')
if diff > 1e-5:
print(f' WARNING: Gradient check failed for {param_name}!')
else:
print(f' ✓ Gradient check passed for {param_name}')
Expected Results:
- Relative error < $10^{-7}$: Excellent! Gradients are correct
- Relative error < $10^{-5}$: Good, likely correct
- Relative error < $10^{-3}$: Suspicious, double-check
- Relative error > $10^{-3}$: ERROR! Bug in backpropagation
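A typical workflow runs the check on a handful of examples with regularization switched off. In the sketch below, the class name TwoLayerNet and its constructor arguments are assumptions for illustration; substitute your own network class that provides the gradient_check method above:

```python
import numpy as np

# Tiny synthetic problem: 5 examples, 4 features, 3 classes
rng = np.random.default_rng(42)
X_small = rng.standard_normal((5, 4))
y_small = rng.integers(0, 3, size=5)

# Hypothetical network class wrapping the methods from this section
net = TwoLayerNet(input_size=4, hidden_size=10, output_size=3)
net.gradient_check(X_small, y_small, reg=0.0)  # expect relative errors below 1e-5
```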
Gradient checking is non-negotiable. Always verify your backprop implementation with numerical gradients before training. This technique will save you countless hours of debugging mysterious failures.
Debugging Tips
Common Issues:
| Symptom | Likely Cause | Solution |
|---|---|---|
| Gradient check fails | Bug in backward pass | Check shapes, transpose operations |
| NaN values | Numerical instability | Add stability tricks (shift scores; see the sketch below the table) |
| Loss increases | Gradients have wrong sign | Check minus signs in updates |
| Loss doesn’t decrease | Gradients too small/large | Check gradient magnitudes, learning rate |
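The "shift scores" trick in the NaN row works because softmax is invariant to subtracting a per-row constant, while np.exp overflows for large inputs. A small demonstration (illustrative only):

```python
import numpy as np

scores = np.array([[1000.0, 1001.0, 999.0]])

# Naive softmax overflows: exp(1000) = inf, giving nan probabilities
naive = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Shifting by the row max leaves softmax unchanged but keeps exp() finite
shifted = scores - scores.max(axis=1, keepdims=True)
stable = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(naive)   # [[nan nan nan]] (with overflow warnings)
print(stable)  # approximately [[0.245 0.665 0.090]]
```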
Debugging Strategy:
- Start with tiny network (2 inputs, 3 hidden, 2 outputs)
- Use small batch (5 examples)
- Turn off regularization initially
- Verify gradient check passes
- Visualize gradients (check for vanishing/exploding; see the sketch after this list)
- Gradually increase complexity
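A minimal way to check for vanishing or exploding gradients, assuming the grads dictionary returned by the backward() implementation above, is to print per-parameter gradient norms during training:

```python
import numpy as np

def report_gradient_norms(grads):
    """Print the L2 norm of each gradient; tiny values hint at vanishing,
    huge values at exploding gradients."""
    for name, g in grads.items():
        print(f'{name}: ||grad|| = {np.linalg.norm(g):.3e}')

# Example usage inside the training loop, e.g. once per epoch:
# grads = self.backward(X_batch, y_batch, cache, reg)
# report_gradient_norms(grads)
```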
Learning Resources
Videos (⭐⭐⭐ ESSENTIAL)
- Welch Labs: Neural Networks Demystified Part 4 - Backpropagation (9 min) - Watch multiple times!
- 3Blue1Brown: Backpropagation (14 min) + Backpropagation calculus (10 min)
- CS231n Lecture 4: Neural Networks & Backpropagation (79 min) - Most important lecture
Reading
- CS231n: Backpropagation Notes
- CS231n: Computational Graphs
- Nielsen: Neural Networks and Deep Learning - Chapter 2
Related Concepts
- Multi-Layer Perceptrons - Architecture that uses backprop
- Optimization Algorithms - Using gradients to update weights
- Computational Graphs - Framework for backprop
- Automatic Differentiation - Modern implementation
Next Steps
- Master optimization algorithms (SGD, Adam, momentum)
- Learn about regularization (dropout, weight decay)
- Understand batch normalization
- Implement complete network in MNIST example