
Backpropagation

Most Important Section

This is the foundation of everything in deep learning. You must understand backpropagation deeply: not just conceptually, but also mathematically and in code. Spend extra time here if needed. Everything else builds on this.

Backpropagation is the algorithm that makes training deep neural networks possible. It efficiently computes gradients by propagating error backward through the network using the chain rule of calculus.

Understanding Backpropagation

The Chain Rule

For composed functions, the chain rule tells us how to compute derivatives:

If $y = f(u)$ and $u = g(x)$, then:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

Neural Network Perspective:

  • $x$ = input or parameter
  • $u$ = intermediate activation
  • $y$ = final loss
  • We want $\frac{dy}{dx}$ to know how to update $x$ (checked numerically in the sketch after this list)
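
The chain rule is easy to verify numerically. Below is a minimal sketch, assuming the toy functions $f(u) = \sin(u)$ and $g(x) = x^2$ (chosen purely for illustration): the analytic chain-rule derivative should match a finite-difference estimate.

```python
import numpy as np

# Toy check of the chain rule for y = f(g(x)) with f(u) = sin(u), g(x) = x**2.
# Analytic chain rule: dy/dx = dy/du * du/dx = cos(x**2) * 2x.
x = 1.5
u = x ** 2
dy_dx_analytic = np.cos(u) * 2 * x

# Finite-difference estimate of dy/dx for comparison
h = 1e-5
dy_dx_numeric = (np.sin((x + h) ** 2) - np.sin((x - h) ** 2)) / (2 * h)

print(dy_dx_analytic, dy_dx_numeric)  # the two values should agree to many decimal places
```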

Computational Graphs

A computational graph represents the forward pass as a directed acyclic graph (DAG):

  • Nodes = variables or operations
  • Edges = data flow

Example: $f(x, y, z) = (x + y) \cdot z$

```
x ──┐
    ├──> [+] ──> q ──┐
y ──┘                ├──> [×] ──> f
z ───────────────────┘
```

  • Forward pass: Compute values from inputs to output
  • Backward pass: Compute gradients from output to inputs
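
Working this example end to end, using arbitrary input values chosen only for illustration:

```python
# Forward and backward pass for the example graph f(x, y, z) = (x + y) * z.
x, y, z = -2.0, 5.0, -4.0

# Forward pass: compute values from inputs to output
q = x + y          # q = 3.0
f = q * z          # f = -12.0

# Backward pass: compute gradients from output back to inputs
df = 1.0           # df/df = 1
dq = z * df        # local gradient of [×] w.r.t. q is z
dz = q * df        # local gradient of [×] w.r.t. z is q
dx = 1.0 * dq      # local gradient of [+] w.r.t. x is 1
dy = 1.0 * dq      # local gradient of [+] w.r.t. y is 1

print(dx, dy, dz)  # -4.0 -4.0 3.0
```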

Local vs. Global Gradients

Local gradient: Derivative of an operation's output with respect to one of its immediate inputs:

$$\frac{\partial f}{\partial x}$$

Global gradient: Derivative of the final loss with respect to this variable:

$$\frac{\partial L}{\partial x}$$

Chain Rule Connection:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial f} \cdot \frac{\partial f}{\partial x}$$

The global gradient at each node = (upstream gradient) × (local gradient)
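
For instance, at the $[\times]$ node of the example graph above, if $f$ feeds into some final loss $L$, the upstream gradient is $\frac{\partial L}{\partial f}$ and the local gradient with respect to $q$ is $\frac{\partial f}{\partial q} = z$, so

$$\frac{\partial L}{\partial q} = \frac{\partial L}{\partial f} \cdot z$$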

Key Insight

Backpropagation is not magic—it’s just the chain rule applied systematically to compute how each weight affects the final loss. Each layer receives a gradient from the layer above and passes gradients to the layer below.

The Backpropagation Algorithm

Forward Pass

  1. Start with inputs
  2. Compute each operation in order
  3. Store intermediate values (cache)
  4. Compute loss at output

Backward Pass

  1. Start with gradient of loss w.r.t. loss (= 1)
  2. For each operation in reverse order:
    • Receive gradient from upstream
    • Compute local gradients
    • Multiply: upstream gradient × local gradient
    • Pass gradients to inputs
  3. Accumulate gradients for parameters (a generic sketch of this loop follows this list)
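
A minimal sketch of this recipe over a recorded "tape" of operations. The op interface here (`op.inputs`, `op.output`, `op.backward(upstream)`) is hypothetical and only illustrates the control flow, not an API used elsewhere on this page.

```python
# Generic backward pass over a tape of ops recorded during the forward pass.
# Assumes each op knows its input/output node ids and that op.backward(upstream)
# returns, per input, the product (upstream gradient x local gradient).
def backprop(tape, loss_node):
    grads = {loss_node: 1.0}                       # 1. dL/dL = 1
    for op in reversed(tape):                      # 2. walk operations in reverse order
        upstream = grads[op.output]                #    gradient received from above
        for inp, g in zip(op.inputs, op.backward(upstream)):
            grads[inp] = grads.get(inp, 0.0) + g   # 3. accumulate into each input
    return grads
```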

Mathematical Derivation for Two-Layer Network

Let’s derive backpropagation step-by-step for a two-layer network.

Forward Pass

Given input $X \in \mathbb{R}^{N \times D}$ and integer class labels $y$ (one of $C$ classes per example):

Layer 1:

$$Z_1 = XW_1 + b_1 \quad \text{(linear)}$$

$$H = \max(0, Z_1) \quad \text{(ReLU)}$$

Layer 2:

$$S = HW_2 + b_2 \quad \text{(scores)}$$

Loss (softmax + cross-entropy):

$$P = \text{softmax}(S), \qquad P_{i,k} = \frac{e^{S_{i,k}}}{\sum_j e^{S_{i,j}}}$$

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log P_{i,y_i}$$
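
A minimal NumPy sketch of this forward pass as a standalone function (the class-based implementation used later caches slightly different values; shapes follow $X$: (N, D), $W_1$: (D, H), $W_2$: (H, C)):

```python
import numpy as np

def forward_two_layer(X, y, W1, b1, W2, b2):
    Z1 = X @ W1 + b1                          # layer 1: linear
    H = np.maximum(0, Z1)                     # layer 1: ReLU
    S = H @ W2 + b2                           # layer 2: class scores

    # Softmax + cross-entropy (shift scores for numerical stability)
    S_shifted = S - np.max(S, axis=1, keepdims=True)
    expS = np.exp(S_shifted)
    P = expS / np.sum(expS, axis=1, keepdims=True)

    N = X.shape[0]
    loss = -np.mean(np.log(P[np.arange(N), y]))

    cache = (X, Z1, H, P)                     # values needed by the backward pass
    return loss, cache
```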

Backward Pass

We need: $\frac{\partial L}{\partial W_1}$, $\frac{\partial L}{\partial b_1}$, $\frac{\partial L}{\partial W_2}$, $\frac{\partial L}{\partial b_2}$

Step 1: Gradient at output (softmax + cross-entropy)

The gradient has a beautiful closed form:

$$\frac{\partial L}{\partial S} = \frac{1}{N}\left(P - Y_{\text{one-hot}}\right)$$

where $Y_{\text{one-hot}}$ is the one-hot encoding of the labels and the $\frac{1}{N}$ comes from averaging the loss over the batch.

In code: `dS = probs.copy(); dS[range(N), y] -= 1; dS /= N`
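
To see where this comes from, consider a single example $i$ with loss $L_i = -\log P_{i,y_i}$. Differentiating the softmax and the log together gives

$$\frac{\partial L_i}{\partial S_{i,k}} = P_{i,k} - \mathbb{1}[k = y_i]$$

and averaging over the $N$ examples contributes the factor of $\frac{1}{N}$.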

Step 2: Backprop through layer 2

$$\frac{\partial L}{\partial W_2} = H^T \frac{\partial L}{\partial S}$$

$$\frac{\partial L}{\partial b_2} = \sum_{i=1}^{N} \frac{\partial L}{\partial S_i}$$

$$\frac{\partial L}{\partial H} = \frac{\partial L}{\partial S} W_2^T$$

Step 3: Backprop through ReLU

$$\frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial H} \odot \mathbb{1}[Z_1 > 0]$$

where $\odot$ is element-wise multiplication and $\mathbb{1}[Z_1 > 0]$ is 1 where $Z_1 > 0$ and 0 elsewhere.
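
In NumPy this is a single masked multiply; the in-place masking used in the full implementation below is equivalent, as this toy check shows (the arrays here are made-up values for illustration):

```python
import numpy as np

Z1 = np.array([[-1.0, 2.0], [3.0, -4.0]])   # toy pre-activation values
dH = np.array([[0.5, 0.5], [0.5, 0.5]])     # toy upstream gradient

dZ1_mask = dH * (Z1 > 0)                    # explicit element-wise mask

dZ1_inplace = dH.copy()                     # copy, then zero where ReLU was inactive
dZ1_inplace[Z1 <= 0] = 0

print(np.allclose(dZ1_mask, dZ1_inplace))   # True
```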

Step 4: Backprop through layer 1

$$\frac{\partial L}{\partial W_1} = X^T \frac{\partial L}{\partial Z_1}$$

$$\frac{\partial L}{\partial b_1} = \sum_{i=1}^{N} \frac{\partial L}{\partial Z_{1,i}}$$

Implementation

Complete backpropagation implementation:

```python
def backward(self, X, y, cache, reg=0.0):
    """
    Compute gradients using backpropagation.

    Args:
        X: Input data (N, D)
        y: Labels (N,)
        cache: Values from forward pass
        reg: Regularization strength

    Returns:
        grads: Dictionary containing gradients
    """
    X, z1, h, W1, W2 = cache
    N = X.shape[0]

    # Compute scores and probabilities
    scores = h @ W2 + self.params['b2']
    scores_shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(scores_shifted)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    # ================== BACKWARD PASS ==================

    # Gradient of loss w.r.t. scores (softmax + cross-entropy)
    dscores = probs.copy()
    dscores[range(N), y] -= 1
    dscores /= N

    # Backprop into W2 and b2
    dW2 = h.T @ dscores
    db2 = np.sum(dscores, axis=0)
    dW2 += reg * W2  # Add regularization gradient

    # Backprop into hidden layer
    dh = dscores @ W2.T

    # Backprop through ReLU
    dh[z1 <= 0] = 0

    # Backprop into W1 and b1
    dW1 = X.T @ dh
    db1 = np.sum(dh, axis=0)
    dW1 += reg * W1  # Add regularization gradient

    grads = {
        'W1': dW1,
        'b1': db1,
        'W2': dW2,
        'b2': db2
    }

    return grads
```

Training Loop

```python
def train(self, X, y, X_val, y_val,
          learning_rate=1e-3, reg=1e-5,
          num_epochs=100, batch_size=200, verbose=True):
    """Train the network using SGD with backpropagation."""
    num_train = X.shape[0]
    iterations_per_epoch = max(num_train // batch_size, 1)

    loss_history = []
    train_acc_history = []
    val_acc_history = []

    for epoch in range(num_epochs):
        # Shuffle training data
        indices = np.random.choice(num_train, num_train, replace=False)

        for it in range(iterations_per_epoch):
            # Sample minibatch
            batch_indices = indices[it * batch_size:(it + 1) * batch_size]
            X_batch = X[batch_indices]
            y_batch = y[batch_indices]

            # Forward pass
            scores, cache = self.forward(X_batch)

            # Compute loss
            loss = self.loss(X_batch, y_batch, reg)
            loss_history.append(loss)

            # Backward pass
            grads = self.backward(X_batch, y_batch, cache, reg)

            # Update parameters (SGD)
            self.params['W1'] -= learning_rate * grads['W1']
            self.params['b1'] -= learning_rate * grads['b1']
            self.params['W2'] -= learning_rate * grads['W2']
            self.params['b2'] -= learning_rate * grads['b2']

        # Check accuracy
        train_acc = (self.predict(X) == y).mean()
        val_acc = (self.predict(X_val) == y_val).mean()
        train_acc_history.append(train_acc)
        val_acc_history.append(val_acc)

        if verbose:
            print(f'Epoch {epoch+1}/{num_epochs}: '
                  f'loss {loss:.4f}, '
                  f'train_acc {train_acc:.4f}, '
                  f'val_acc {val_acc:.4f}')

    return {
        'loss_history': loss_history,
        'train_acc_history': train_acc_history,
        'val_acc_history': val_acc_history
    }
```
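
A hypothetical usage sketch: the `TwoLayerNet` class name, its constructor signature, and the random toy data below are assumptions made for illustration; only the methods shown on this page are defined here.

```python
import numpy as np

# Toy data: 1000 training / 200 validation examples, 50 features, 10 classes
np.random.seed(0)
X_train = np.random.randn(1000, 50)
y_train = np.random.randint(0, 10, size=1000)
X_val = np.random.randn(200, 50)
y_val = np.random.randint(0, 10, size=200)

# Assumed constructor signature (hypothetical)
net = TwoLayerNet(input_dim=50, hidden_dim=100, num_classes=10)

history = net.train(X_train, y_train, X_val, y_val,
                    learning_rate=1e-3, reg=1e-5,
                    num_epochs=20, batch_size=100)

print(f"final val accuracy: {history['val_acc_history'][-1]:.3f}")
```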

Gradient Checking

Always verify your backprop implementation with numerical gradients!

Numerical Gradient

Approximate gradient using finite differences:

$$\frac{\partial f}{\partial x} \approx \frac{f(x + h) - f(x - h)}{2h}$$

where $h$ is a small value (e.g., $10^{-5}$).
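
As a standalone illustration with a toy function (chosen only for this sketch, not part of the network code):

```python
# Centered-difference approximation for f(x) = x**3, whose derivative is 3 * x**2.
def f(x):
    return x ** 3

x, h = 2.0, 1e-5
numeric = (f(x + h) - f(x - h)) / (2 * h)
analytic = 3 * x ** 2

print(numeric, analytic)  # both approximately 12.0
```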

Implementation

```python
def gradient_check(self, X, y, reg=0.0, epsilon=1e-5):
    """
    Verify backpropagation gradients using numerical gradients.

    Args:
        X: Input data (small batch, e.g., 5 examples)
        y: Labels
        reg: Regularization strength
        epsilon: Small value for finite differences
    """
    # Compute analytical gradients
    loss = self.loss(X, y, reg)
    scores, cache = self.forward(X)
    grads = self.backward(X, y, cache, reg)

    # Check each parameter
    for param_name in ['W1', 'b1', 'W2', 'b2']:
        param = self.params[param_name]
        grad = grads[param_name]

        # Compute numerical gradient
        num_grad = np.zeros_like(param)
        it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index

            # f(x + epsilon)
            old_value = param[idx]
            param[idx] = old_value + epsilon
            loss_plus = self.loss(X, y, reg)

            # f(x - epsilon)
            param[idx] = old_value - epsilon
            loss_minus = self.loss(X, y, reg)

            # Numerical gradient
            num_grad[idx] = (loss_plus - loss_minus) / (2 * epsilon)

            # Restore original value
            param[idx] = old_value
            it.iternext()

        # Compare
        diff = np.linalg.norm(grad - num_grad) / np.linalg.norm(grad + num_grad)
        print(f'{param_name}: relative error = {diff:.2e}')

        if diff > 1e-5:
            print(f'  WARNING: Gradient check failed for {param_name}!')
        else:
            print(f'  ✓ Gradient check passed for {param_name}')
```
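
The relative error reported by this check compares the analytic and numerical gradients as

$$\text{relative error} = \frac{\lVert g_{\text{analytic}} - g_{\text{numerical}} \rVert}{\lVert g_{\text{analytic}} + g_{\text{numerical}} \rVert}$$

which is what the `np.linalg.norm` line above computes.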

Expected Results:

  • Relative error $< 10^{-7}$: Excellent! Gradients are correct
  • Relative error $< 10^{-5}$: Good, likely correct
  • Relative error $< 10^{-3}$: Suspicious, double-check
  • Relative error $> 10^{-3}$: ERROR! Bug in backpropagation

Critical Skill

Gradient checking is non-negotiable. Always verify your backprop implementation with numerical gradients before training. This technique will save you countless hours of debugging mysterious failures.

Debugging Tips

Common Issues:

| Symptom | Likely Cause | Solution |
| --- | --- | --- |
| Gradient check fails | Bug in backward pass | Check shapes, transpose operations |
| NaN values | Numerical instability | Add stability tricks (shift scores) |
| Loss increases | Gradients have wrong sign | Check minus signs in updates |
| Loss doesn't decrease | Gradients too small/large | Check gradient magnitudes, learning rate |

Debugging Strategy:

  1. Start with tiny network (2 inputs, 3 hidden, 2 outputs)
  2. Use small batch (5 examples)
  3. Turn off regularization initially
  4. Verify gradient check passes
  5. Visualize gradients (check for vanishing/exploding)
  6. Gradually increase complexity

Next Steps

  1. Master optimization algorithms (SGD, Adam, momentum)
  2. Learn about regularization (dropout, weight decay)
  3. Understand batch normalization
  4. Implement a complete network in the MNIST example