
Multi-Layer Perceptrons (MLPs)

Multi-layer perceptrons overcome the limitations of single perceptrons by stacking layers of units separated by nonlinearities. This creates powerful function approximators that can learn arbitrarily complex, non-linear decision boundaries.

Architecture

A two-layer network (one hidden layer):

$h = f(W_1 x + b_1) \quad \text{(hidden layer)}$
$y = W_2 h + b_2 \quad \text{(output layer)}$

Key Components:

  • Input layer: $x \in \mathbb{R}^D$ (raw features)
  • Hidden layer: $h \in \mathbb{R}^H$ (learned representations)
  • Output layer: $y \in \mathbb{R}^C$ (class scores or predictions)
  • Nonlinearity: $f$ (typically ReLU)

The nonlinearity $f$ between layers is essential: without it, multiple linear layers collapse into a single linear transformation.
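A quick numpy check (an addition to these notes) makes the collapse concrete: composing two linear layers gives exactly one linear layer with $W = W_1 W_2$ and $b = b_1 W_2 + b_2$.

import numpy as np

np.random.seed(0)
D, H, C = 4, 3, 2
x = np.random.randn(D)

# Two stacked linear layers with NO nonlinearity in between
W1, b1 = np.random.randn(D, H), np.random.randn(H)
W2, b2 = np.random.randn(H, C), np.random.randn(C)
y_stacked = (x @ W1 + b1) @ W2 + b2

# An equivalent single linear layer
W = W1 @ W2
b = b1 @ W2 + b2
y_single = x @ W + b

print(np.allclose(y_stacked, y_single))   # True: stacking adds no expressive power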

Why Depth Matters

Universal Approximation Theorem

A neural network with a single hidden layer and a suitable nonlinearity can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden neurons.

Then why use multiple layers?

Advantages of Depth

  1. Hierarchical Representations: Each layer learns increasingly abstract features

    • Layer 1: Edges and textures
    • Layer 2: Corners and simple shapes
    • Layer 3: Object parts
    • Layer 4: Complete objects
  2. Computational Efficiency: Deep narrow networks can represent functions more compactly than shallow wide networks

    • Some functions that a deep network represents compactly require exponentially many units in a shallow network
    • Fewer parameters for the same representational capacity
  3. Optimization Benefits: Sometimes easier to optimize (though not always!)

    • Better gradient flow (with modern techniques)
    • More structured learning
  4. Biological Plausibility: Mirrors hierarchical processing in visual cortex

Architecture Choices

How to Choose Hidden Layer Size?

Rules of Thumb:

  • Start with hidden size between input and output size
  • More neurons = more capacity (but risk of overfitting)
  • Fewer neurons = simpler model (but risk of underfitting)
  • Use a validation set to tune this hyperparameter (see the sketch after the capacity trade-off below)

Capacity Trade-off:

  • Too few neurons → Underfitting (can't learn patterns)
  • Right amount → Good generalization
  • Too many neurons → Overfitting (memorizes training data)
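To make the validation-based tuning concrete, here is a rough sketch (an illustration added here, not part of the original notes). It sweeps candidate widths using the TwoLayerNet class defined further down the page; train_and_validate, D, C, and the data arrays are hypothetical placeholders for your own training routine and dataset.

# Hypothetical hidden-size sweep. train_and_validate, D, C, X_train, y_train,
# X_val, y_val are placeholders; TwoLayerNet is defined in the next section.
best_acc, best_hidden = 0.0, None
for hidden_dim in [16, 32, 64, 128, 256]:
    net = TwoLayerNet(input_dim=D, hidden_dim=hidden_dim, output_dim=C)
    val_acc = train_and_validate(net, X_train, y_train, X_val, y_val)
    if val_acc > best_acc:
        best_acc, best_hidden = val_acc, hidden_dim
print("Best hidden size:", best_hidden, "val acc:", best_acc)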

How Many Layers?

Starting Point:

  • Begin with 1-2 hidden layers
  • Add more layers if underfitting on training data
  • Modern networks: 10s to 100s of layers (with skip connections, normalization)

Practical Guidance:

  • Simple problems: 1-2 layers sufficient
  • Image recognition: 10-100+ layers (CNNs)
  • Language modeling: 10-100+ layers (Transformers)
  • Always start simple and add complexity as needed

Forward Pass Implementation

import numpy as np

class TwoLayerNet:
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Initialize a two-layer fully-connected neural network.

        Args:
            input_dim: Size of input (D)
            hidden_dim: Size of hidden layer (H)
            output_dim: Number of classes (C)
        """
        # Initialize weights with small random values
        self.params = {}
        self.params['W1'] = np.random.randn(input_dim, hidden_dim) * 0.01
        self.params['b1'] = np.zeros(hidden_dim)
        self.params['W2'] = np.random.randn(hidden_dim, output_dim) * 0.01
        self.params['b2'] = np.zeros(output_dim)

    def forward(self, X):
        """
        Compute forward pass.

        Args:
            X: Input data (N, D)

        Returns:
            scores: Class scores (N, C)
            cache: Values needed for backward pass
        """
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']

        # Layer 1: Linear -> ReLU
        z1 = X @ W1 + b1
        h = np.maximum(0, z1)   # ReLU activation

        # Layer 2: Linear
        scores = h @ W2 + b2

        # Cache intermediate values for backward pass
        cache = (X, z1, h, W1, W2)
        return scores, cache
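A quick shape check (illustrative, not part of the original code) confirms that the forward pass maps an (N, D) batch to (N, C) class scores:

# Illustrative shape check on random data
np.random.seed(0)
net = TwoLayerNet(input_dim=10, hidden_dim=20, output_dim=3)
X_demo = np.random.randn(5, 10)        # N=5 examples, D=10 features
scores, cache = net.forward(X_demo)
print(scores.shape)                    # (5, 3): one row of class scores per example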

Loss Computation

For classification, we typically use softmax cross-entropy loss:
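For a batch of $N$ examples with class scores $s_{i,c}$ and correct labels $y_i$, the loss implemented below (including L2 regularization) is

$L = \frac{1}{N} \sum_{i=1}^{N} -\log\left( \frac{e^{s_{i,y_i}}}{\sum_{c} e^{s_{i,c}}} \right) + \frac{\lambda}{2} \left( \lVert W_1 \rVert_F^2 + \lVert W_2 \rVert_F^2 \right)$

where $\lambda$ is the regularization strength (reg in the code).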

    # Continues the TwoLayerNet class defined above
    def loss(self, X, y, reg=0.0):
        """
        Compute the softmax cross-entropy loss (gradients are covered in Backpropagation).

        Args:
            X: Input data (N, D)
            y: Labels (N,)
            reg: Regularization strength

        Returns:
            loss: Scalar loss value
        """
        # Forward pass
        scores, cache = self.forward(X)
        N = X.shape[0]

        # Compute softmax probabilities (shift scores for numerical stability)
        scores_shifted = scores - np.max(scores, axis=1, keepdims=True)
        exp_scores = np.exp(scores_shifted)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Cross-entropy loss
        correct_log_probs = -np.log(probs[range(N), y])
        data_loss = np.sum(correct_log_probs) / N

        # Add L2 regularization
        W1, W2 = self.params['W1'], self.params['W2']
        reg_loss = 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
        loss = data_loss + reg_loss
        return loss
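As a sanity check (an addition to these notes): with the small random initialization above and reg=0.0, the scores are near zero, the softmax is roughly uniform, and the untrained loss should come out close to ln(C):

# Untrained-loss sanity check: probabilities are ~uniform, so loss ≈ ln(C)
np.random.seed(1)
net = TwoLayerNet(input_dim=10, hidden_dim=20, output_dim=3)
X_demo = np.random.randn(5, 10)
y_demo = np.random.randint(0, 3, size=5)
print(net.loss(X_demo, y_demo, reg=0.0))   # roughly ln(3) ≈ 1.0986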

Solving the XOR Problem

Unlike a single perceptron, an MLP can solve XOR:

# XOR dataset
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
y = np.array([0, 1, 1, 0])

# Two-layer network with 2 hidden units
net = TwoLayerNet(input_dim=2, hidden_dim=2, output_dim=2)

# After training, the network learns non-linear decision boundary
# that correctly classifies all XOR examples

The hidden layer learns a representation that makes the data linearly separable in the hidden space.
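To make "after training" concrete, here is one possible training sketch (an addition to these notes, not the course's training code). It uses slow finite-difference gradients so it needs nothing beyond the class above, and it widens the hidden layer to 8 units because plain gradient descent on the minimal 2-unit network can get stuck; backpropagation (next page) is the efficient way to get these gradients.

def numerical_gradient(net, X, y, eps=1e-5):
    # Finite-difference gradient of the loss w.r.t. every parameter (slow, illustration only)
    grads = {}
    for name, param in net.params.items():
        grad = np.zeros_like(param)
        it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index
            old = param[idx]
            param[idx] = old + eps
            loss_plus = net.loss(X, y)
            param[idx] = old - eps
            loss_minus = net.loss(X, y)
            param[idx] = old
            grad[idx] = (loss_plus - loss_minus) / (2 * eps)
            it.iternext()
        grads[name] = grad
    return grads

np.random.seed(0)
net = TwoLayerNet(input_dim=2, hidden_dim=8, output_dim=2)
for step in range(2000):
    grads = numerical_gradient(net, X, y)
    for name in net.params:
        net.params[name] -= 1.0 * grads[name]   # vanilla gradient descent, lr = 1.0

scores, _ = net.forward(X)
print(scores.argmax(axis=1))   # should match y = [0, 1, 1, 0]; results vary with seed and learning rate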

Training MLPs

Training requires computing gradients for all parameters - see Backpropagation.

Key Steps:

  1. Forward pass: Compute predictions and loss
  2. Backward pass: Compute gradients via backpropagation
  3. Update: Adjust weights using gradient descent
  4. Repeat: Iterate until convergence
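A minimal mini-batch version of that loop might look like the sketch below; net.backward is a hypothetical method that returns a dict of gradients keyed like net.params (deriving it is exactly the subject of the Backpropagation page), and the learning rate, batch size, and data arrays are placeholders.

# Skeleton of mini-batch gradient descent (illustrative; backward() is assumed
# to exist, and lr, batch_size, X_train, y_train are placeholders)
lr, batch_size, reg = 1e-1, 64, 1e-3
for step in range(1000):
    # 1. Forward pass on a random mini-batch
    idx = np.random.choice(X_train.shape[0], batch_size)
    X_batch, y_batch = X_train[idx], y_train[idx]
    loss = net.loss(X_batch, y_batch, reg=reg)

    # 2. Backward pass (see Backpropagation) -> dict of gradients
    grads = net.backward(X_batch, y_batch, reg=reg)   # hypothetical method

    # 3. Update every parameter with a gradient descent step
    for name in net.params:
        net.params[name] -= lr * grads[name]

    # 4. Repeat; monitor the loss and validation accuracy for convergence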

Next Steps

  1. Master backpropagation to understand training
  2. Learn about optimization algorithms (SGD, Adam)
  3. Study regularization techniques (dropout, weight decay)
  4. Implement complete training in MNIST example