Multi-Layer Perceptrons (MLPs)
Multi-layer perceptrons overcome the limitations of a single perceptron by stacking layers of linear transformations separated by nonlinearities. This yields function approximators that can learn complex, non-linear decision boundaries.
Architecture
A two-layer network (one hidden layer) computes f(x) = W2 · max(0, W1·x + b1) + b2: a linear transformation, an elementwise ReLU, and a second linear transformation.
Key Components:
- Input layer: the raw features (dimension D)
- Hidden layer: learned intermediate representations (dimension H)
- Output layer: class scores or predictions (dimension C)
- Nonlinearity: applied between the linear layers (typically ReLU)
The nonlinearity between layers is essential: without it, multiple linear layers collapse into a single linear transformation.
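A quick numpy check makes this concrete (a minimal sketch with random weights; biases are omitted for brevity, but with them the stack still collapses into a single affine map):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))         # small batch: 4 examples, 3 features
W1 = rng.standard_normal((3, 5))        # first linear layer (no activation)
W2 = rng.standard_normal((5, 2))        # second linear layer

stacked = (X @ W1) @ W2                 # two linear layers back to back
collapsed = X @ (W1 @ W2)               # one linear layer with the product matrix
print(np.allclose(stacked, collapsed))  # True: depth without nonlinearity adds nothing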
Why Depth Matters
Universal Approximation Theorem
A neural network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden neurons.
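One common formal statement (for a bounded, nonconstant, continuous activation σ): for any continuous function f on a compact set K ⊂ R^d and any ε > 0, there exist a width N and parameters a_i, w_i, b_i such that

$$\sup_{x \in K}\;\Bigl|\,f(x) - \sum_{i=1}^{N} a_i\,\sigma(w_i^{\top}x + b_i)\Bigr| < \varepsilon$$

The theorem only guarantees existence; it says nothing about how many neurons are needed or whether gradient descent will find such weights.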
Then why use multiple layers?
Advantages of Depth
- Hierarchical Representations: Each layer learns increasingly abstract features
  - Layer 1: Edges and textures
  - Layer 2: Corners and simple shapes
  - Layer 3: Object parts
  - Layer 4: Complete objects
- Computational Efficiency: Deep narrow networks can represent some functions far more compactly than shallow wide networks (a rough parameter count appears after this list)
  - Exponentially more expressive with depth for certain function families
  - Fewer parameters for comparable capacity
- Optimization Benefits: Sometimes easier to optimize (though not always!)
  - Better gradient flow (with modern techniques)
  - More structured learning
- Biological Plausibility: Mirrors hierarchical processing in the visual cortex
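To make the computational-efficiency point concrete, here is a rough parameter count for two fully-connected architectures (the layer sizes are illustrative assumptions, and equal parameter count is only a crude proxy for capacity):

def num_params(layer_sizes):
    # Weights (d_in * d_out) plus biases (d_out) for each consecutive pair of layers
    return sum(d_in * d_out + d_out for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]))

deep_narrow = [784, 128, 128, 128, 10]   # three hidden layers of 128 units
shallow_wide = [784, 2048, 10]           # one hidden layer of 2048 units

print(num_params(deep_narrow))           # 134794
print(num_params(shallow_wide))          # 1628170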
Architecture Choices
How to Choose Hidden Layer Size?
Rules of Thumb:
- Start with hidden size between input and output size
- More neurons = more capacity (but risk of overfitting)
- Fewer neurons = simpler model (but risk of underfitting)
- Use a validation set to tune this hyperparameter (see the sweep sketch at the end of this section)
Capacity Trade-off:
Too few neurons → Underfitting (can't learn patterns)
Right amount → Good generalization
Too many neurons → Overfitting (memorizes training data)

How Many Layers?
Starting Point:
- Begin with 1-2 hidden layers
- Add more layers if underfitting on training data
- Modern networks: 10s to 100s of layers (with skip connections, normalization)
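Mechanically, adding layers just means repeating the Linear → ReLU pattern. A minimal sketch, assuming weights and biases are Python lists of per-layer numpy arrays (an illustration only; initialization and training are omitted):

import numpy as np

def forward_deep(X, weights, biases):
    # Forward pass through an arbitrary stack of fully-connected layers
    h = X
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0, h @ W + b)          # hidden layers: Linear -> ReLU
    return h @ weights[-1] + biases[-1]       # final layer: Linear (class scores)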
Practical Guidance:
- Simple problems: 1-2 layers sufficient
- Image recognition: 10-100+ layers (CNNs)
- Language modeling: 10-100+ layers (Transformers)
- Always start simple and add complexity as needed
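As a sketch of the validation-set advice above: sweep over hidden sizes, train each candidate, and keep the model with the best validation accuracy. The helpers train and evaluate_accuracy, the splits X_train/y_train/X_val/y_val, and the 10-class output are hypothetical placeholders, not defined in this note; TwoLayerNet is the class implemented below.

best_acc, best_net = 0.0, None
for hidden_dim in [16, 32, 64, 128, 256]:
    net = TwoLayerNet(input_dim=X_train.shape[1], hidden_dim=hidden_dim, output_dim=10)
    train(net, X_train, y_train)                    # hypothetical training routine
    acc = evaluate_accuracy(net, X_val, y_val)      # hypothetical evaluation helper
    if acc > best_acc:
        best_acc, best_net = acc, net
print(best_acc)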
Forward Pass Implementation
import numpy as np

class TwoLayerNet:
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Initialize a two-layer fully-connected neural network.

        Args:
            input_dim: Size of input (D)
            hidden_dim: Size of hidden layer (H)
            output_dim: Number of classes (C)
        """
        # Initialize weights with small random values and biases with zeros
        self.params = {}
        self.params['W1'] = np.random.randn(input_dim, hidden_dim) * 0.01
        self.params['b1'] = np.zeros(hidden_dim)
        self.params['W2'] = np.random.randn(hidden_dim, output_dim) * 0.01
        self.params['b2'] = np.zeros(output_dim)

    def forward(self, X):
        """
        Compute the forward pass.

        Args:
            X: Input data (N, D)

        Returns:
            scores: Class scores (N, C)
            cache: Values needed for the backward pass
        """
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']

        # Layer 1: Linear -> ReLU
        z1 = X @ W1 + b1
        h = np.maximum(0, z1)  # ReLU activation

        # Layer 2: Linear
        scores = h @ W2 + b2

        # Cache intermediate values for the backward pass
        cache = (X, z1, h, W1, W2)
        return scores, cache

Loss Computation
For classification, we typically use softmax cross-entropy loss:
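Concretely, if s is the score vector for example i and y_i its correct class, the per-example loss and the full regularized objective implemented below are

$$L_i = -\log\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}, \qquad L = \frac{1}{N}\sum_{i=1}^{N} L_i + \frac{\lambda}{2}\left(\lVert W_1\rVert_F^2 + \lVert W_2\rVert_F^2\right)$$

where λ corresponds to the reg argument.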
# Method of TwoLayerNet, continuing the class defined above
def loss(self, X, y, reg=0.0):
    """
    Compute the softmax cross-entropy loss with L2 regularization.

    Args:
        X: Input data (N, D)
        y: Integer class labels (N,)
        reg: L2 regularization strength

    Returns:
        loss: Scalar loss value
    """
    # Forward pass
    scores, cache = self.forward(X)
    N = X.shape[0]

    # Softmax probabilities (shift scores for numerical stability)
    scores_shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(scores_shifted)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    # Cross-entropy loss averaged over the batch
    correct_log_probs = -np.log(probs[np.arange(N), y])
    data_loss = np.sum(correct_log_probs) / N

    # Add L2 regularization on the weight matrices
    W1, W2 = self.params['W1'], self.params['W2']
    reg_loss = 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))

    loss = data_loss + reg_loss
    return loss

Solving the XOR Problem
Unlike a single perceptron, an MLP can solve XOR:
# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])
# Two-layer network with 2 hidden units
net = TwoLayerNet(input_dim=2, hidden_dim=2, output_dim=2)
# After training, the network learns non-linear decision boundary
# that correctly classifies all XOR examples

The hidden layer learns a representation that makes the data linearly separable in the hidden space.
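To see how the hidden layer does this, here is one hand-picked set of weights (an illustrative assumption, not necessarily what training finds; for simplicity the output is a single XOR value rather than two class scores):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])       # both hidden units sum the two inputs
b1 = np.array([0.0, -1.0])        # second unit fires only when both inputs are 1
W2 = np.array([[1.0], [-2.0]])    # output = h1 - 2 * h2
b2 = np.array([0.0])

h = np.maximum(0, X @ W1 + b1)    # hidden activations: [[0,0],[1,0],[1,0],[2,1]]
out = h @ W2 + b2                 # [[0],[1],[1],[0]] -- the XOR truth table
print(out.ravel())                # [0. 1. 1. 0.]

In the hidden space, both label-1 inputs map to [1, 0], while the label-0 inputs map to [0, 0] and [2, 1]; a single line separates them, which is exactly what the second linear layer provides.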
Training MLPs
Training requires computing gradients for all parameters; see Backpropagation. A minimal training-loop sketch follows the key steps below.
Key Steps:
- Forward pass: Compute predictions and loss
- Backward pass: Compute gradients via backpropagation
- Update: Adjust weights using gradient descent
- Repeat: Iterate until convergence
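A minimal gradient-descent loop, assuming a hypothetical method net.loss_and_grads(X, y, reg) that returns the scalar loss and a gradient dict keyed like net.params (computing those gradients is exactly what Backpropagation covers):

learning_rate = 1e-1
for step in range(1000):
    loss, grads = net.loss_and_grads(X, y, reg=1e-3)       # hypothetical: forward + backward pass
    for name in net.params:
        net.params[name] -= learning_rate * grads[name]    # vanilla gradient descent update
    if step % 100 == 0:
        print(f"step {step}: loss {loss:.4f}")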
Learning Resources
Videos
- 3Blue1Brown: But what is a neural network? (19 min) - Excellent visual explanation
- Welch Labs: Neural Networks Demystified Part 1-2 (15 min each)
- CS231n Lecture 2: Neural Networks (75 min)
Reading
Related Concepts
- Perceptron - Single layer building block
- Backpropagation - How to train MLPs
- Activation Functions - Enabling non-linearity
- Optimization - Training algorithms
- Regularization - Preventing overfitting
Next Steps
- Master backpropagation to understand training
- Learn about optimization algorithms (SGD, Adam)
- Study regularization techniques (dropout, weight decay)
- Implement complete training in MNIST example