Optimization Algorithms
Once we can compute gradients via backpropagation, we need algorithms that use those gradients to update the parameters efficiently. This section covers the evolution from basic gradient descent to modern adaptive optimizers.
Gradient Descent Fundamentals
The core idea: iteratively move in the direction that decreases the loss.
Batch Gradient Descent
Update parameters using the gradient over the entire dataset:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$$

where:
- $\theta$ = parameters (weights and biases)
- $\eta$ = learning rate
- $\nabla_\theta J(\theta)$ = gradient of the loss w.r.t. the parameters
Pros:
- Stable convergence
- Guaranteed to converge to the minimum for convex functions (given a suitable learning rate)
Cons:
- Very slow for large datasets
- Requires loading entire dataset into memory
- Can get stuck in local minima
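For contrast with the variants below, a minimal full-batch loop might look like this (a sketch, assuming a `params` dictionary and a `compute_loss_and_grads` helper like the ones used in the mini-batch example later in this section):

```python
for epoch in range(num_epochs):
    # One parameter update per full pass over the dataset
    loss, grads = compute_loss_and_grads(X, y)
    for key in params:
        params[key] -= learning_rate * grads[key]
```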
Stochastic Gradient Descent (SGD)
Update parameters using the gradient from a single example $(x^{(i)}, y^{(i)})$:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J\big(\theta; x^{(i)}, y^{(i)}\big)$$
Pros:
- Much faster updates
- Can escape shallow local minima (noise helps!)
- Works with online/streaming data
Cons:
- Noisy gradients → oscillating loss
- Requires careful learning rate tuning
- May not converge to exact minimum
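The per-example loop is the same idea with a batch of one (a sketch, assuming the same `params`, data arrays, and `compute_loss_and_grads` helper as the mini-batch example in the next subsection):

```python
import numpy as np

for epoch in range(num_epochs):
    # Visit examples one at a time, in a fresh random order each epoch
    for i in np.random.permutation(num_train):
        loss, grads = compute_loss_and_grads(X[i:i+1], y[i:i+1])
        for key in params:
            params[key] -= learning_rate * grads[key]
```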
Mini-Batch Gradient Descent
Best of both worlds: Use gradients from a small batch of $B$ examples (typically 32-256):

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J\big(\theta; x^{(i:i+B)}, y^{(i:i+B)}\big)$$
Why mini-batches?
- Reduces gradient noise (vs. single example)
- Faster than full batch (especially with GPUs)
- Batch size is a hyperparameter (typical: 32, 64, 128, 256)
```python
import numpy as np

# Mini-batch SGD implementation
for epoch in range(num_epochs):
    # Shuffle data so each epoch sees batches in a different order
    indices = np.random.permutation(num_train)
    for i in range(0, num_train, batch_size):
        # Get minibatch
        batch_indices = indices[i:i+batch_size]
        X_batch = X[batch_indices]
        y_batch = y[batch_indices]
        # Compute loss and gradients
        loss, grads = compute_loss_and_grads(X_batch, y_batch)
        # Update parameters in place
        for key in params:
            params[key] -= learning_rate * grads[key]
```

SGD with Momentum
Problem with vanilla SGD: Updates can oscillate, especially in directions with high curvature.
Solution: Accumulate a velocity vector that acts as a running mean of gradients:

$$v_t = \gamma \, v_{t-1} - \eta \, \nabla_\theta J(\theta)$$
$$\theta \leftarrow \theta + v_t$$

where $\gamma$ is the momentum coefficient (typically 0.9 or 0.99).
Intuition: Think of a ball rolling down a hill—it builds up speed in consistent directions and dampens oscillations.
Benefits:
- Accelerates convergence in consistent directions
- Dampens oscillations in high-curvature directions
- Can escape small local minima
```python
import numpy as np

class SGDMomentum:
    def __init__(self, learning_rate=1e-3, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}

    def update(self, params, grads):
        """Update parameters in place using momentum."""
        # Initialize velocity on first call
        if not self.velocity:
            for key in params:
                self.velocity[key] = np.zeros_like(params[key])
        # Momentum update: decay the old velocity, then step along the gradient
        for key in params:
            self.velocity[key] = (
                self.momentum * self.velocity[key]
                - self.learning_rate * grads[key]
            )
            params[key] += self.velocity[key]
```

Nesterov Momentum
A clever twist: evaluate the gradient at the “look-ahead” position:

$$v_t = \gamma \, v_{t-1} - \eta \, \nabla_\theta J\big(\theta + \gamma \, v_{t-1}\big)$$
$$\theta \leftarrow \theta + v_t$$
Why better? By looking ahead, we can “correct” the velocity before it’s too late. Often converges faster than standard momentum.
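A minimal sketch in the same style as the `SGDMomentum` class above. Because Nesterov needs the gradient at the look-ahead point rather than at the current parameters, this sketch assumes a hypothetical `grad_fn` callable that returns gradients for whatever parameter values it is given:

```python
import numpy as np

class NesterovMomentum:
    def __init__(self, learning_rate=1e-3, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}

    def update(self, params, grad_fn):
        """grad_fn(params_dict) -> dict of gradients evaluated at that point (assumed)."""
        if not self.velocity:
            for key in params:
                self.velocity[key] = np.zeros_like(params[key])
        # Evaluate the gradient at the look-ahead position theta + momentum * v
        lookahead = {key: params[key] + self.momentum * self.velocity[key]
                     for key in params}
        grads = grad_fn(lookahead)
        for key in params:
            self.velocity[key] = (
                self.momentum * self.velocity[key]
                - self.learning_rate * grads[key]
            )
            params[key] += self.velocity[key]
```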
Adaptive Learning Rate Methods
Problem: A single global learning rate is suboptimal. Different parameters may need different learning rates.
AdaGrad (Adaptive Gradient)
Adapt learning rate per parameter based on historical gradients:

$$\theta \leftarrow \theta - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t$$

where $g_t$ is the gradient at step $t$ and $G_t$ accumulates squared gradients (element-wise):

$$G_t = G_{t-1} + g_t^2$$
Intuition:
- Parameters with large gradients get smaller learning rates
- Parameters with small gradients get larger learning rates
- Good for sparse data (e.g., word embeddings)
Problem: Learning rate keeps decreasing (monotonically), may become too small.
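A minimal sketch of the AdaGrad accumulation and update, following the same `params`/`grads` dictionary convention as the optimizer classes above:

```python
import numpy as np

class AdaGrad:
    def __init__(self, learning_rate=1e-2, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.epsilon = epsilon
        self.G = {}  # Running sum of squared gradients

    def update(self, params, grads):
        if not self.G:
            for key in params:
                self.G[key] = np.zeros_like(params[key])
        for key in params:
            # Accumulate squared gradients (grows monotonically)
            self.G[key] += grads[key] ** 2
            # Per-parameter effective learning rate shrinks as G grows
            params[key] -= self.learning_rate * grads[key] / (np.sqrt(self.G[key]) + self.epsilon)
```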
RMSProp (Root Mean Square Propagation)
Fix AdaGrad’s monotonic decay by using a moving average of squared gradients:

$$E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2$$
$$\theta \leftarrow \theta - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \odot g_t$$

where $\rho$ is typically 0.9 or 0.99.
Benefits: Doesn’t suffer from monotonic decay, works well in practice.
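A minimal sketch, mirroring the AdaGrad class above but with the exponential moving average in place of the monotonic sum:

```python
import numpy as np

class RMSProp:
    def __init__(self, learning_rate=1e-3, rho=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.rho = rho
        self.epsilon = epsilon
        self.sq_avg = {}  # Moving average of squared gradients

    def update(self, params, grads):
        if not self.sq_avg:
            for key in params:
                self.sq_avg[key] = np.zeros_like(params[key])
        for key in params:
            # Exponential moving average instead of an ever-growing sum
            self.sq_avg[key] = self.rho * self.sq_avg[key] + (1 - self.rho) * grads[key] ** 2
            params[key] -= self.learning_rate * grads[key] / (np.sqrt(self.sq_avg[key]) + self.epsilon)
```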
Adam Optimizer
Adam (Adaptive Moment Estimation) combines momentum and adaptive learning rates. It’s the de facto default optimizer in deep learning.
Adam Update Rules
Adam maintains two moving averages:
First moment (mean of gradients, like momentum):

$$m_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \, g_t$$

Second moment (uncentered variance, like RMSProp):

$$v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, g_t^2$$

Bias correction (because $m_0 = v_0 = 0$, both estimates are biased toward zero early in training):

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Update:

$$\theta \leftarrow \theta - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
Default hyperparameters (work well in practice):
- $\eta = 10^{-3}$ (learning rate)
- $\beta_1 = 0.9$ (momentum)
- $\beta_2 = 0.999$ (RMSProp decay)
- $\epsilon = 10^{-8}$ (numerical stability)
```python
import numpy as np

class Adam:
    def __init__(self, learning_rate=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}  # First moment
        self.v = {}  # Second moment
        self.t = 0   # Time step

    def update(self, params, grads):
        """Update parameters in place using Adam."""
        # Initialize moments on first call
        if not self.m:
            for key in params:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])
        self.t += 1
        for key in params:
            # Update biased first moment estimate
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            # Update biased second moment estimate
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key] ** 2)
            # Compute bias-corrected moments
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            # Update parameters
            params[key] -= self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)
```

Why Adam Works
- Adaptive learning rates: Each parameter gets its own effective learning rate
- Momentum: Accelerates convergence
- Bias correction: Accounts for initialization at zero
- Robust: Works well across different architectures and problems
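To tie the pieces together, here is a toy usage of the `Adam` class above on a simple quadratic objective whose gradient is known in closed form (the quadratic is just an illustrative stand-in for a real loss):

```python
import numpy as np

# Toy objective: L(w) = 0.5 * ||w - w_star||^2, so grad = w - w_star
w_star = np.array([3.0, -2.0])
params = {"w": np.zeros(2)}

opt = Adam(learning_rate=1e-1)
for step in range(200):
    grads = {"w": params["w"] - w_star}
    opt.update(params, grads)

print(params["w"])  # Should end up close to [3.0, -2.0]
```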
Learning Rate Schedules
Even with adaptive optimizers, adjusting the learning rate during training can help.
Step Decay
Reduce learning rate by a factor every few epochs:
```python
learning_rate = initial_lr * (decay_rate ** (epoch // step_size))
```

Example: with `decay_rate = 0.1` and `step_size = 30`, the learning rate is multiplied by 0.1 every 30 epochs.
Cosine Annealing
Smoothly decrease learning rate following a cosine curve:
```python
learning_rate = min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(epoch / total_epochs * np.pi))
```

Popular in modern training, especially with restarts.
Warmup
Start with a very small learning rate and gradually increase:
```python
if epoch < warmup_epochs:
    learning_rate = target_lr * (epoch / warmup_epochs)
```

Useful for large-batch training and transformers.
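Warmup is often combined with cosine decay. A small sketch of one way to wire the two schedules from this section together, where `lr_at_epoch` and its arguments are illustrative names rather than anything from a specific library:

```python
import numpy as np

def lr_at_epoch(epoch, target_lr, warmup_epochs, total_epochs, min_lr=0.0):
    """Linear warmup to target_lr, then cosine decay down to min_lr."""
    if epoch < warmup_epochs:
        return target_lr * (epoch / warmup_epochs)
    # Progress through the cosine phase, in [0, 1]
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (target_lr - min_lr) * (1 + np.cos(np.pi * progress))
```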
Comparison and Best Practices
When to Use Which Optimizer
| Optimizer | Best For | Notes |
|---|---|---|
| SGD | Baselines, when you want control | Requires careful LR tuning |
| SGD + Momentum | CNNs, well-understood problems | Often reaches better final performance |
| Adam | Default choice, transformers, RNNs | Fast convergence, robust |
| RMSProp | RNNs (historically) | Less common now, Adam usually better |
Start with Adam ($\eta = 10^{-3}$, default parameters). It works well out of the box. If you need the absolute best performance and have time to tune, try SGD with momentum and a learning rate schedule.
Learning Resources
Reading
- CS231n: Optimization Notes
- Sebastian Ruder: “An Overview of Gradient Descent Optimization Algorithms” ⭐⭐ Best single reference
Papers
- Adam: A Method for Stochastic Optimization (Kingma & Ba, 2014)
Related Concepts
- Backpropagation - Computing gradients
- Learning Rate Schedules - Adaptive LR strategies
- Regularization - Preventing overfitting
- Batch Normalization - Speeding up training
Next Steps
- Learn about regularization techniques
- Study batch normalization
- Understand learning rate scheduling
- Practice in MNIST implementation