
Backpropagation

Most Important Section

This is the foundation of everything in deep learning. You must understand backpropagation deeply: not just conceptually, but also mathematically and in code. Spend extra time here if needed. Everything else builds on this.

Backpropagation is the algorithm that makes training deep neural networks possible. It efficiently computes gradients by propagating error backward through the network using the chain rule of calculus.

Understanding Backpropagation

The Chain Rule

For composed functions, the chain rule tells us how to compute derivatives:

If $y = f(u)$ and $u = g(x)$, then:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

Neural Network Perspective:

  • $x$ = input or parameter
  • $u$ = intermediate activation
  • $y$ = final loss
  • We want $\frac{dy}{dx}$ to know how to update $x$ (checked numerically in the sketch after this list)
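
The chain rule is easy to verify numerically. Below is a minimal sketch, assuming the toy functions $f(u) = \sin(u)$ and $g(x) = x^2$ (chosen purely for illustration): the analytic chain-rule derivative should match a finite-difference estimate.

```python
import numpy as np

# Toy check of the chain rule for y = f(g(x)) with f(u) = sin(u), g(x) = x**2.
# Analytic chain rule: dy/dx = dy/du * du/dx = cos(x**2) * 2x.
x = 1.5
u = x ** 2
dy_dx_analytic = np.cos(u) * 2 * x

# Finite-difference estimate of dy/dx for comparison
h = 1e-5
dy_dx_numeric = (np.sin((x + h) ** 2) - np.sin((x - h) ** 2)) / (2 * h)

print(dy_dx_analytic, dy_dx_numeric)  # the two values should agree to many decimal places
```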

Computational Graphs

A computational graph represents the forward pass as a directed acyclic graph (DAG):

  • Nodes = variables or operations
  • Edges = data flow

Example: $f(x, y, z) = (x + y) \cdot z$

```
x ──┐
    ├──> [+] ──> q ──┐
y ──┘                ├──> [×] ──> f
z ───────────────────┘
```

  • Forward pass: Compute values from inputs to output
  • Backward pass: Compute gradients from output to inputs
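
Working this example end to end, using arbitrary input values chosen only for illustration:

```python
# Forward and backward pass for the example graph f(x, y, z) = (x + y) * z.
x, y, z = -2.0, 5.0, -4.0

# Forward pass: compute values from inputs to output
q = x + y          # q = 3.0
f = q * z          # f = -12.0

# Backward pass: compute gradients from output back to inputs
df = 1.0           # df/df = 1
dq = z * df        # local gradient of [×] w.r.t. q is z
dz = q * df        # local gradient of [×] w.r.t. z is q
dx = 1.0 * dq      # local gradient of [+] w.r.t. x is 1
dy = 1.0 * dq      # local gradient of [+] w.r.t. y is 1

print(dx, dy, dz)  # -4.0 -4.0 3.0
```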

Local vs. Global Gradients

Local gradient: Derivative of an operation's output with respect to one of its immediate inputs:

$$\frac{\partial f}{\partial x}$$

Global gradient: Derivative of the final loss with respect to this variable:

$$\frac{\partial L}{\partial x}$$

Chain Rule Connection:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial f} \cdot \frac{\partial f}{\partial x}$$

The global gradient at each node = (upstream gradient) × (local gradient)
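
For instance, at the $[\times]$ node of the example graph above, if $f$ feeds into some final loss $L$, the upstream gradient is $\frac{\partial L}{\partial f}$ and the local gradient with respect to $q$ is $\frac{\partial f}{\partial q} = z$, so

$$\frac{\partial L}{\partial q} = \frac{\partial L}{\partial f} \cdot z$$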

Key Insight

Backpropagation is not magic—it’s just the chain rule applied systematically to compute how each weight affects the final loss. Each layer receives a gradient from the layer above and passes gradients to the layer below.

The Backpropagation Algorithm

Forward Pass

  1. Start with inputs
  2. Compute each operation in order
  3. Store intermediate values (cache)
  4. Compute loss at output

Backward Pass

  1. Start with gradient of loss w.r.t. loss (= 1)
  2. For each operation in reverse order:
    • Receive gradient from upstream
    • Compute local gradients
    • Multiply: upstream gradient × local gradient
    • Pass gradients to inputs
  3. Accumulate gradients for parameters (a generic sketch of this loop follows this list)
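
A minimal sketch of this recipe over a recorded "tape" of operations. The op interface here (`op.inputs`, `op.output`, `op.backward(upstream)`) is hypothetical and only illustrates the control flow, not an API used elsewhere on this page.

```python
# Generic backward pass over a tape of ops recorded during the forward pass.
# Assumes each op knows its input/output node ids and that op.backward(upstream)
# returns, per input, the product (upstream gradient x local gradient).
def backprop(tape, loss_node):
    grads = {loss_node: 1.0}                       # 1. dL/dL = 1
    for op in reversed(tape):                      # 2. walk operations in reverse order
        upstream = grads[op.output]                #    gradient received from above
        for inp, g in zip(op.inputs, op.backward(upstream)):
            grads[inp] = grads.get(inp, 0.0) + g   # 3. accumulate into each input
    return grads
```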

Mathematical Derivation for Two-Layer Network

Let’s derive backpropagation step-by-step for a two-layer network.

Forward Pass

Given input $X \in \mathbb{R}^{N \times D}$ and integer class labels $y$ (one of $C$ classes per example):

Layer 1:

$$Z_1 = XW_1 + b_1 \quad \text{(linear)}$$

$$H = \max(0, Z_1) \quad \text{(ReLU)}$$

Layer 2:

$$S = HW_2 + b_2 \quad \text{(scores)}$$

Loss (softmax + cross-entropy):

$$P = \text{softmax}(S), \qquad P_{i,k} = \frac{e^{S_{i,k}}}{\sum_j e^{S_{i,j}}}$$

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log P_{i,y_i}$$
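
A minimal NumPy sketch of this forward pass as a standalone function (the class-based implementation used later caches slightly different values; shapes follow $X$: (N, D), $W_1$: (D, H), $W_2$: (H, C)):

```python
import numpy as np

def forward_two_layer(X, y, W1, b1, W2, b2):
    Z1 = X @ W1 + b1                          # layer 1: linear
    H = np.maximum(0, Z1)                     # layer 1: ReLU
    S = H @ W2 + b2                           # layer 2: class scores

    # Softmax + cross-entropy (shift scores for numerical stability)
    S_shifted = S - np.max(S, axis=1, keepdims=True)
    expS = np.exp(S_shifted)
    P = expS / np.sum(expS, axis=1, keepdims=True)

    N = X.shape[0]
    loss = -np.mean(np.log(P[np.arange(N), y]))

    cache = (X, Z1, H, P)                     # values needed by the backward pass
    return loss, cache
```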

Backward Pass

We need: $\frac{\partial L}{\partial W_1}$, $\frac{\partial L}{\partial b_1}$, $\frac{\partial L}{\partial W_2}$, $\frac{\partial L}{\partial b_2}$

Step 1: Gradient at output (softmax + cross-entropy)

The gradient has a beautiful closed form:

$$\frac{\partial L}{\partial S} = \frac{1}{N}\left(P - Y_{\text{one-hot}}\right)$$

where $Y_{\text{one-hot}}$ is the one-hot encoding of the labels and the $\frac{1}{N}$ comes from averaging the loss over the batch.

In code: `dS = probs.copy(); dS[range(N), y] -= 1; dS /= N`
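
To see where this comes from, consider a single example $i$ with loss $L_i = -\log P_{i,y_i}$. Differentiating the softmax and the log together gives

$$\frac{\partial L_i}{\partial S_{i,k}} = P_{i,k} - \mathbb{1}[k = y_i]$$

and averaging over the $N$ examples contributes the factor of $\frac{1}{N}$.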

Step 2: Backprop through layer 2

$$\frac{\partial L}{\partial W_2} = H^T \frac{\partial L}{\partial S}$$

$$\frac{\partial L}{\partial b_2} = \sum_{i=1}^{N} \frac{\partial L}{\partial S_i}$$

$$\frac{\partial L}{\partial H} = \frac{\partial L}{\partial S} W_2^T$$

Step 3: Backprop through ReLU

$$\frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial H} \odot \mathbb{1}[Z_1 > 0]$$

where $\odot$ is element-wise multiplication and $\mathbb{1}[Z_1 > 0]$ is 1 where $Z_1 > 0$ and 0 elsewhere.
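
In NumPy this is a single masked multiply; the in-place masking used in the full implementation below is equivalent, as this toy check shows (the arrays here are made-up values for illustration):

```python
import numpy as np

Z1 = np.array([[-1.0, 2.0], [3.0, -4.0]])   # toy pre-activation values
dH = np.array([[0.5, 0.5], [0.5, 0.5]])     # toy upstream gradient

dZ1_mask = dH * (Z1 > 0)                    # explicit element-wise mask

dZ1_inplace = dH.copy()                     # copy, then zero where ReLU was inactive
dZ1_inplace[Z1 <= 0] = 0

print(np.allclose(dZ1_mask, dZ1_inplace))   # True
```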

Step 4: Backprop through layer 1

$$\frac{\partial L}{\partial W_1} = X^T \frac{\partial L}{\partial Z_1}$$

$$\frac{\partial L}{\partial b_1} = \sum_{i=1}^{N} \frac{\partial L}{\partial Z_{1,i}}$$

Implementation

Complete backpropagation implementation:

```python
def backward(self, X, y, cache, reg=0.0):
    """
    Compute gradients using backpropagation.

    Args:
        X: Input data (N, D)
        y: Labels (N,)
        cache: Values from forward pass
        reg: Regularization strength

    Returns:
        grads: Dictionary containing gradients
    """
    X, z1, h, W1, W2 = cache
    N = X.shape[0]

    # Compute scores and probabilities
    scores = h @ W2 + self.params['b2']
    scores_shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(scores_shifted)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    # ================== BACKWARD PASS ==================

    # Gradient of loss w.r.t. scores (softmax + cross-entropy)
    dscores = probs.copy()
    dscores[range(N), y] -= 1
    dscores /= N

    # Backprop into W2 and b2
    dW2 = h.T @ dscores
    db2 = np.sum(dscores, axis=0)
    dW2 += reg * W2  # Add regularization gradient

    # Backprop into hidden layer
    dh = dscores @ W2.T

    # Backprop through ReLU
    dh[z1 <= 0] = 0

    # Backprop into W1 and b1
    dW1 = X.T @ dh
    db1 = np.sum(dh, axis=0)
    dW1 += reg * W1  # Add regularization gradient

    grads = {
        'W1': dW1,
        'b1': db1,
        'W2': dW2,
        'b2': db2
    }

    return grads
```

Training Loop

```python
def train(self, X, y, X_val, y_val,
          learning_rate=1e-3, reg=1e-5,
          num_epochs=100, batch_size=200, verbose=True):
    """Train the network using SGD with backpropagation."""
    num_train = X.shape[0]
    iterations_per_epoch = max(num_train // batch_size, 1)

    loss_history = []
    train_acc_history = []
    val_acc_history = []

    for epoch in range(num_epochs):
        # Shuffle training data
        indices = np.random.choice(num_train, num_train, replace=False)

        for it in range(iterations_per_epoch):
            # Sample minibatch
            batch_indices = indices[it * batch_size:(it + 1) * batch_size]
            X_batch = X[batch_indices]
            y_batch = y[batch_indices]

            # Forward pass
            scores, cache = self.forward(X_batch)

            # Compute loss
            loss = self.loss(X_batch, y_batch, reg)
            loss_history.append(loss)

            # Backward pass
            grads = self.backward(X_batch, y_batch, cache, reg)

            # Update parameters (SGD)
            self.params['W1'] -= learning_rate * grads['W1']
            self.params['b1'] -= learning_rate * grads['b1']
            self.params['W2'] -= learning_rate * grads['W2']
            self.params['b2'] -= learning_rate * grads['b2']

        # Check accuracy
        train_acc = (self.predict(X) == y).mean()
        val_acc = (self.predict(X_val) == y_val).mean()
        train_acc_history.append(train_acc)
        val_acc_history.append(val_acc)

        if verbose:
            print(f'Epoch {epoch+1}/{num_epochs}: '
                  f'loss {loss:.4f}, '
                  f'train_acc {train_acc:.4f}, '
                  f'val_acc {val_acc:.4f}')

    return {
        'loss_history': loss_history,
        'train_acc_history': train_acc_history,
        'val_acc_history': val_acc_history
    }
```
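
A hypothetical usage sketch: the `TwoLayerNet` class name, its constructor signature, and the random toy data below are assumptions made for illustration; only the methods shown on this page are defined here.

```python
import numpy as np

# Toy data: 1000 training / 200 validation examples, 50 features, 10 classes
np.random.seed(0)
X_train = np.random.randn(1000, 50)
y_train = np.random.randint(0, 10, size=1000)
X_val = np.random.randn(200, 50)
y_val = np.random.randint(0, 10, size=200)

# Assumed constructor signature (hypothetical)
net = TwoLayerNet(input_dim=50, hidden_dim=100, num_classes=10)

history = net.train(X_train, y_train, X_val, y_val,
                    learning_rate=1e-3, reg=1e-5,
                    num_epochs=20, batch_size=100)

print(f"final val accuracy: {history['val_acc_history'][-1]:.3f}")
```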

Gradient Checking

Always verify your backprop implementation with numerical gradients!

Numerical Gradient

Approximate gradient using finite differences:

$$\frac{\partial f}{\partial x} \approx \frac{f(x + h) - f(x - h)}{2h}$$

where $h$ is a small value (e.g., $10^{-5}$).
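
As a standalone illustration with a toy function (chosen only for this sketch, not part of the network code):

```python
# Centered-difference approximation for f(x) = x**3, whose derivative is 3 * x**2.
def f(x):
    return x ** 3

x, h = 2.0, 1e-5
numeric = (f(x + h) - f(x - h)) / (2 * h)
analytic = 3 * x ** 2

print(numeric, analytic)  # both approximately 12.0
```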

Implementation

```python
def gradient_check(self, X, y, reg=0.0, epsilon=1e-5):
    """
    Verify backpropagation gradients using numerical gradients.

    Args:
        X: Input data (small batch, e.g., 5 examples)
        y: Labels
        reg: Regularization strength
        epsilon: Small value for finite differences
    """
    # Compute analytical gradients
    loss = self.loss(X, y, reg)
    scores, cache = self.forward(X)
    grads = self.backward(X, y, cache, reg)

    # Check each parameter
    for param_name in ['W1', 'b1', 'W2', 'b2']:
        param = self.params[param_name]
        grad = grads[param_name]

        # Compute numerical gradient
        num_grad = np.zeros_like(param)
        it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index

            # f(x + epsilon)
            old_value = param[idx]
            param[idx] = old_value + epsilon
            loss_plus = self.loss(X, y, reg)

            # f(x - epsilon)
            param[idx] = old_value - epsilon
            loss_minus = self.loss(X, y, reg)

            # Numerical gradient
            num_grad[idx] = (loss_plus - loss_minus) / (2 * epsilon)

            # Restore original value
            param[idx] = old_value
            it.iternext()

        # Compare
        diff = np.linalg.norm(grad - num_grad) / np.linalg.norm(grad + num_grad)
        print(f'{param_name}: relative error = {diff:.2e}')

        if diff > 1e-5:
            print(f'  WARNING: Gradient check failed for {param_name}!')
        else:
            print(f'  ✓ Gradient check passed for {param_name}')
```
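
The relative error reported by this check compares the analytic and numerical gradients as

$$\text{relative error} = \frac{\lVert g_{\text{analytic}} - g_{\text{numerical}} \rVert}{\lVert g_{\text{analytic}} + g_{\text{numerical}} \rVert}$$

which is what the `np.linalg.norm` line above computes.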

Expected Results:

  • Relative error $< 10^{-7}$: Excellent! Gradients are correct
  • Relative error $< 10^{-5}$: Good, likely correct
  • Relative error $< 10^{-3}$: Suspicious, double-check
  • Relative error $> 10^{-3}$: ERROR! Bug in backpropagation

Critical Skill

Gradient checking is non-negotiable. Always verify your backprop implementation with numerical gradients before training. This technique will save you countless hours of debugging mysterious failures.

Debugging Tips

Common Issues:

| Symptom | Likely Cause | Solution |
| --- | --- | --- |
| Gradient check fails | Bug in backward pass | Check shapes, transpose operations |
| NaN values | Numerical instability | Add stability tricks (shift scores) |
| Loss increases | Gradients have wrong sign | Check minus signs in updates |
| Loss doesn't decrease | Gradients too small/large | Check gradient magnitudes, learning rate |

Debugging Strategy:

  1. Start with tiny network (2 inputs, 3 hidden, 2 outputs)
  2. Use small batch (5 examples)
  3. Turn off regularization initially
  4. Verify gradient check passes
  5. Visualize gradients (check for vanishing/exploding)
  6. Gradually increase complexity

Next Steps

  1. Master optimization algorithms (SGD, Adam, momentum)
  2. Learn about regularization (dropout, weight decay)
  3. Understand batch normalization
  4. Implement a complete network in the MNIST example