
Multi-Layer Perceptrons (MLPs)

Multi-layer perceptrons overcome the limitations of single perceptrons by stacking layers of units separated by nonlinearities. This creates powerful function approximators that can learn arbitrarily complex, non-linear decision boundaries.

Architecture

A two-layer network (one hidden layer):

$h = f(W_1 x + b_1) \quad \text{(hidden layer)}$
$y = W_2 h + b_2 \quad \text{(output layer)}$

Key Components:

  • Input layer: $x \in \mathbb{R}^D$ (raw features)
  • Hidden layer: $h \in \mathbb{R}^H$ (learned representations)
  • Output layer: $y \in \mathbb{R}^C$ (class scores or predictions)
  • Nonlinearity: $f$ (typically ReLU)

The nonlinearity $f$ between layers is essential: without it, multiple linear layers collapse into a single linear transformation.
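A quick numpy check (an addition to these notes) makes the collapse concrete: composing two linear layers gives exactly one linear layer with $W = W_1 W_2$ and $b = b_1 W_2 + b_2$.

import numpy as np

np.random.seed(0)
D, H, C = 4, 3, 2
x = np.random.randn(D)

# Two stacked linear layers with NO nonlinearity in between
W1, b1 = np.random.randn(D, H), np.random.randn(H)
W2, b2 = np.random.randn(H, C), np.random.randn(C)
y_stacked = (x @ W1 + b1) @ W2 + b2

# An equivalent single linear layer
W = W1 @ W2
b = b1 @ W2 + b2
y_single = x @ W + b

print(np.allclose(y_stacked, y_single))   # True: stacking adds no expressive power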

Why Depth Matters

Universal Approximation Theorem

A neural network with a single hidden layer and a suitable nonlinearity can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden neurons.

Then why use multiple layers?

Advantages of Depth

  1. Hierarchical Representations: Each layer learns increasingly abstract features

    • Layer 1: Edges and textures
    • Layer 2: Corners and simple shapes
    • Layer 3: Object parts
    • Layer 4: Complete objects
  2. Computational Efficiency: Deep narrow networks can represent functions more compactly than shallow wide networks

    • Some functions that a deep network represents compactly require exponentially many units in a shallow network
    • Fewer parameters for the same representational capacity
  3. Optimization Benefits: Sometimes easier to optimize (though not always!)

    • Better gradient flow (with modern techniques)
    • More structured learning
  4. Biological Plausibility: Mirrors hierarchical processing in visual cortex

Architecture Choices

How to Choose Hidden Layer Size?

Rules of Thumb:

  • Start with hidden size between input and output size
  • More neurons = more capacity (but risk of overfitting)
  • Fewer neurons = simpler model (but risk of underfitting)
  • Use a validation set to tune this hyperparameter (see the sketch after the capacity trade-off below)

Capacity Trade-off:

  • Too few neurons → Underfitting (can't learn patterns)
  • Right amount → Good generalization
  • Too many neurons → Overfitting (memorizes training data)
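To make the validation-based tuning concrete, here is a rough sketch (an illustration added here, not part of the original notes). It sweeps candidate widths using the TwoLayerNet class defined further down the page; train_and_validate, D, C, and the data arrays are hypothetical placeholders for your own training routine and dataset.

# Hypothetical hidden-size sweep. train_and_validate, D, C, X_train, y_train,
# X_val, y_val are placeholders; TwoLayerNet is defined in the next section.
best_acc, best_hidden = 0.0, None
for hidden_dim in [16, 32, 64, 128, 256]:
    net = TwoLayerNet(input_dim=D, hidden_dim=hidden_dim, output_dim=C)
    val_acc = train_and_validate(net, X_train, y_train, X_val, y_val)
    if val_acc > best_acc:
        best_acc, best_hidden = val_acc, hidden_dim
print("Best hidden size:", best_hidden, "val acc:", best_acc)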

How Many Layers?

Starting Point:

  • Begin with 1-2 hidden layers
  • Add more layers if underfitting on training data
  • Modern networks: 10s to 100s of layers (with skip connections, normalization)

Practical Guidance:

  • Simple problems: 1-2 layers sufficient
  • Image recognition: 10-100+ layers (CNNs)
  • Language modeling: 10-100+ layers (Transformers)
  • Always start simple and add complexity as needed

Forward Pass Implementation

import numpy as np

class TwoLayerNet:
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Initialize a two-layer fully-connected neural network.

        Args:
            input_dim: Size of input (D)
            hidden_dim: Size of hidden layer (H)
            output_dim: Number of classes (C)
        """
        # Initialize weights with small random values
        self.params = {}
        self.params['W1'] = np.random.randn(input_dim, hidden_dim) * 0.01
        self.params['b1'] = np.zeros(hidden_dim)
        self.params['W2'] = np.random.randn(hidden_dim, output_dim) * 0.01
        self.params['b2'] = np.zeros(output_dim)

    def forward(self, X):
        """
        Compute forward pass.

        Args:
            X: Input data (N, D)

        Returns:
            scores: Class scores (N, C)
            cache: Values needed for backward pass
        """
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']

        # Layer 1: Linear -> ReLU
        z1 = X @ W1 + b1
        h = np.maximum(0, z1)   # ReLU activation

        # Layer 2: Linear
        scores = h @ W2 + b2

        # Cache intermediate values for backward pass
        cache = (X, z1, h, W1, W2)
        return scores, cache
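A quick shape check (illustrative, not part of the original code) confirms that the forward pass maps an (N, D) batch to (N, C) class scores:

# Illustrative shape check on random data
np.random.seed(0)
net = TwoLayerNet(input_dim=10, hidden_dim=20, output_dim=3)
X_demo = np.random.randn(5, 10)        # N=5 examples, D=10 features
scores, cache = net.forward(X_demo)
print(scores.shape)                    # (5, 3): one row of class scores per example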

Loss Computation

For classification, we typically use softmax cross-entropy loss:
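For a batch of $N$ examples with class scores $s_{i,c}$ and correct labels $y_i$, the loss implemented below (including L2 regularization) is

$L = \frac{1}{N} \sum_{i=1}^{N} -\log\left( \frac{e^{s_{i,y_i}}}{\sum_{c} e^{s_{i,c}}} \right) + \frac{\lambda}{2} \left( \lVert W_1 \rVert_F^2 + \lVert W_2 \rVert_F^2 \right)$

where $\lambda$ is the regularization strength (reg in the code).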

    # Continues the TwoLayerNet class defined above
    def loss(self, X, y, reg=0.0):
        """
        Compute the softmax cross-entropy loss (gradients are covered in Backpropagation).

        Args:
            X: Input data (N, D)
            y: Labels (N,)
            reg: Regularization strength

        Returns:
            loss: Scalar loss value
        """
        # Forward pass
        scores, cache = self.forward(X)
        N = X.shape[0]

        # Compute softmax probabilities (shift scores for numerical stability)
        scores_shifted = scores - np.max(scores, axis=1, keepdims=True)
        exp_scores = np.exp(scores_shifted)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Cross-entropy loss
        correct_log_probs = -np.log(probs[range(N), y])
        data_loss = np.sum(correct_log_probs) / N

        # Add L2 regularization
        W1, W2 = self.params['W1'], self.params['W2']
        reg_loss = 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
        loss = data_loss + reg_loss
        return loss
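As a sanity check (an addition to these notes): with the small random initialization above and reg=0.0, the scores are near zero, the softmax is roughly uniform, and the untrained loss should come out close to ln(C):

# Untrained-loss sanity check: probabilities are ~uniform, so loss ≈ ln(C)
np.random.seed(1)
net = TwoLayerNet(input_dim=10, hidden_dim=20, output_dim=3)
X_demo = np.random.randn(5, 10)
y_demo = np.random.randint(0, 3, size=5)
print(net.loss(X_demo, y_demo, reg=0.0))   # roughly ln(3) ≈ 1.0986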

Solving the XOR Problem

Unlike a single perceptron, an MLP can solve XOR:

# XOR dataset
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
y = np.array([0, 1, 1, 0])

# Two-layer network with 2 hidden units
net = TwoLayerNet(input_dim=2, hidden_dim=2, output_dim=2)

# After training, the network learns non-linear decision boundary
# that correctly classifies all XOR examples

The hidden layer learns a representation that makes the data linearly separable in the hidden space.
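To make "after training" concrete, here is one possible training sketch (an addition to these notes, not the course's training code). It uses slow finite-difference gradients so it needs nothing beyond the class above, and it widens the hidden layer to 8 units because plain gradient descent on the minimal 2-unit network can get stuck; backpropagation (next page) is the efficient way to get these gradients.

def numerical_gradient(net, X, y, eps=1e-5):
    # Finite-difference gradient of the loss w.r.t. every parameter (slow, illustration only)
    grads = {}
    for name, param in net.params.items():
        grad = np.zeros_like(param)
        it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index
            old = param[idx]
            param[idx] = old + eps
            loss_plus = net.loss(X, y)
            param[idx] = old - eps
            loss_minus = net.loss(X, y)
            param[idx] = old
            grad[idx] = (loss_plus - loss_minus) / (2 * eps)
            it.iternext()
        grads[name] = grad
    return grads

np.random.seed(0)
net = TwoLayerNet(input_dim=2, hidden_dim=8, output_dim=2)
for step in range(2000):
    grads = numerical_gradient(net, X, y)
    for name in net.params:
        net.params[name] -= 1.0 * grads[name]   # vanilla gradient descent, lr = 1.0

scores, _ = net.forward(X)
print(scores.argmax(axis=1))   # should match y = [0, 1, 1, 0]; results vary with seed and learning rate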

Training MLPs

Training requires computing gradients for all parameters - see Backpropagation.

Key Steps:

  1. Forward pass: Compute predictions and loss
  2. Backward pass: Compute gradients via backpropagation
  3. Update: Adjust weights using gradient descent
  4. Repeat: Iterate until convergence
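A minimal mini-batch version of that loop might look like the sketch below; net.backward is a hypothetical method that returns a dict of gradients keyed like net.params (deriving it is exactly the subject of the Backpropagation page), and the learning rate, batch size, and data arrays are placeholders.

# Skeleton of mini-batch gradient descent (illustrative; backward() is assumed
# to exist, and lr, batch_size, X_train, y_train are placeholders)
lr, batch_size, reg = 1e-1, 64, 1e-3
for step in range(1000):
    # 1. Forward pass on a random mini-batch
    idx = np.random.choice(X_train.shape[0], batch_size)
    X_batch, y_batch = X_train[idx], y_train[idx]
    loss = net.loss(X_batch, y_batch, reg=reg)

    # 2. Backward pass (see Backpropagation) -> dict of gradients
    grads = net.backward(X_batch, y_batch, reg=reg)   # hypothetical method

    # 3. Update every parameter with a gradient descent step
    for name in net.params:
        net.params[name] -= lr * grads[name]

    # 4. Repeat; monitor the loss and validation accuracy for convergence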

Next Steps

  1. Master backpropagation to understand training
  2. Learn about optimization algorithms (SGD, Adam)
  3. Study regularization techniques (dropout, weight decay)
  4. Implement complete training in MNIST example