Optimization Algorithms

Once we can compute gradients via backpropagation, we need algorithms that use those gradients to update parameters efficiently. This page covers the evolution from basic gradient descent to modern adaptive optimizers.

Gradient Descent Fundamentals

The core idea: iteratively move the parameters in the direction that decreases the loss, i.e., along the negative gradient.

Batch Gradient Descent

Update parameters using the gradient over the entire dataset:

\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)

where:

  • \theta = parameters (weights and biases)
  • \alpha = learning rate
  • \nabla_\theta L = gradient of the loss w.r.t. the parameters

Pros:

  • Stable convergence
  • Guaranteed to converge to the global minimum for convex functions (with a suitably small learning rate)

Cons:

  • Very slow for large datasets
  • Requires loading entire dataset into memory
  • Can get stuck in local minima
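
For concreteness, a minimal sketch of one batch gradient descent loop. It assumes a hypothetical compute_loss_and_grads helper (the same name used in the mini-batch snippet further down) that returns the loss and a dict of gradients averaged over the entire training set, plus dict-style params:

# Batch gradient descent: one update per pass over the FULL dataset
for epoch in range(num_epochs):
    # Hypothetical helper: loss and gradients averaged over all examples
    loss, grads = compute_loss_and_grads(X, y)
    for key in params:
        params[key] -= learning_rate * grads[key]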

Stochastic Gradient Descent (SGD)

Update parameters using the gradient from a single example:

\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(x_i, y_i; \theta_t)

Pros:

  • Much faster updates
  • Can escape shallow local minima (noise helps!)
  • Works with online/streaming data

Cons:

  • Noisy gradients → oscillating loss
  • Requires careful learning rate tuning
  • May not converge to exact minimum
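
A minimal per-example loop under the same assumptions as the sketch above (hypothetical compute_loss_and_grads helper, dict-style params and grads):

# Stochastic gradient descent: one update per training example
for epoch in range(num_epochs):
    for i in np.random.permutation(num_train):  # visit examples in random order
        # Gradient from a single (x_i, y_i) pair; keep a length-1 batch dimension
        loss, grads = compute_loss_and_grads(X[i:i+1], y[i:i+1])
        for key in params:
            params[key] -= learning_rate * grads[key]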

Mini-Batch Gradient Descent

Best of both worlds: Use gradients from a small batch of examples (typically 32-256):

\theta_{t+1} = \theta_t - \alpha \frac{1}{B} \sum_{i \in \text{batch}} \nabla_\theta L(x_i, y_i; \theta_t)

Why mini-batches?

  • Reduces gradient noise (vs. single example)
  • Faster than full batch (especially with GPUs)
  • Batch size is a hyperparameter (typical: 32, 64, 128, 256)
# Mini-batch SGD implementation
for epoch in range(num_epochs):
    # Shuffle data
    indices = np.random.permutation(num_train)
    for i in range(0, num_train, batch_size):
        # Get mini-batch
        batch_indices = indices[i:i+batch_size]
        X_batch = X[batch_indices]
        y_batch = y[batch_indices]

        # Compute loss and gradients
        loss, grads = compute_loss_and_grads(X_batch, y_batch)

        # Update parameters in place (params and grads are dicts keyed by name)
        for key in params:
            params[key] -= learning_rate * grads[key]

SGD with Momentum

Problem with vanilla SGD: Updates can oscillate, especially in directions with high curvature.

Solution: Accumulate a velocity vector that acts as a running mean of gradients:

v_{t+1} = \rho v_t - \alpha \nabla_\theta L(\theta_t)
\theta_{t+1} = \theta_t + v_{t+1}

where \rho is the momentum coefficient (typically 0.9 or 0.99).

Intuition: Think of a ball rolling down a hill—it builds up speed in consistent directions and dampens oscillations.

Benefits:

  • Accelerates convergence in consistent directions
  • Dampens oscillations in high-curvature directions
  • Can escape small local minima
class SGDMomentum:
    def __init__(self, learning_rate=1e-3, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}

    def update(self, params, grads):
        """Update parameters using momentum."""
        # Initialize velocity on first call
        if not self.velocity:
            for key in params:
                self.velocity[key] = np.zeros_like(params[key])

        # Momentum update
        for key in params:
            self.velocity[key] = (
                self.momentum * self.velocity[key]
                - self.learning_rate * grads[key]
            )
            params[key] += self.velocity[key]

Nesterov Momentum

A clever twist: evaluate the gradient at the “look-ahead” position instead of the current position:

v_{t+1} = \rho v_t - \alpha \nabla_\theta L(\theta_t + \rho v_t)
\theta_{t+1} = \theta_t + v_{t+1}

Why better? By evaluating the gradient where the momentum is about to carry us, we can correct the velocity before overshooting. Nesterov momentum often converges faster than standard momentum.
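
A minimal sketch in the style of the SGDMomentum class above. The grad_fn argument (a callable that returns a dict of gradients evaluated at whatever parameter values it is given) is an assumption here, since the look-ahead requires evaluating gradients away from the current parameters:

class NesterovMomentum:
    def __init__(self, learning_rate=1e-3, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}

    def update(self, params, grad_fn):
        """grad_fn(params_dict) -> dict of gradients at those parameter values."""
        if not self.velocity:
            for key in params:
                self.velocity[key] = np.zeros_like(params[key])

        # Evaluate the gradient at the look-ahead point theta + rho * v
        lookahead = {key: params[key] + self.momentum * self.velocity[key]
                     for key in params}
        grads = grad_fn(lookahead)

        for key in params:
            self.velocity[key] = (self.momentum * self.velocity[key]
                                  - self.learning_rate * grads[key])
            params[key] += self.velocity[key]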

Adaptive Learning Rate Methods

Problem: A single global learning rate is suboptimal. Different parameters may need different learning rates.

AdaGrad (Adaptive Gradient)

Adapt learning rate per parameter based on historical gradients:

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \odot \nabla_\theta L(\theta_t)

where G_t accumulates the squared gradients: G_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau

Intuition:

  • Parameters with large gradients get smaller learning rates
  • Parameters with small gradients get larger learning rates
  • Good for sparse data (e.g., word embeddings)

Problem: Because G_t only ever grows, the effective learning rate decreases monotonically and may become too small before training finishes.
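
A minimal sketch of the update rule above, written in the same dict-based style as the other optimizer classes on this page (an illustration, not a reference implementation); the comment marks where the monotonic decay comes from:

class AdaGrad:
    def __init__(self, learning_rate=1e-2, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.epsilon = epsilon
        self.G = {}  # Accumulated sum of squared gradients

    def update(self, params, grads):
        if not self.G:
            for key in params:
                self.G[key] = np.zeros_like(params[key])

        for key in params:
            # G only ever grows, so the effective learning rate only shrinks
            self.G[key] += grads[key] ** 2
            params[key] -= (self.learning_rate * grads[key]
                            / np.sqrt(self.G[key] + self.epsilon))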

RMSProp (Root Mean Square Propagation)

Fix AdaGrad’s monotonic decay by replacing the growing sum with an exponential moving average of squared gradients:

G_t = \rho G_{t-1} + (1-\rho) g_t \odot g_t
\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \odot g_t

where \rho is typically 0.9 or 0.99.

Benefits: Doesn’t suffer from monotonic decay, works well in practice.
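
A minimal sketch in the same style as the classes above; the only change from the AdaGrad sketch is the exponential moving average of squared gradients:

class RMSProp:
    def __init__(self, learning_rate=1e-3, decay=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.decay = decay        # rho in the formula above
        self.epsilon = epsilon
        self.G = {}  # Moving average of squared gradients

    def update(self, params, grads):
        if not self.G:
            for key in params:
                self.G[key] = np.zeros_like(params[key])

        for key in params:
            # Exponential moving average instead of an ever-growing sum
            self.G[key] = (self.decay * self.G[key]
                           + (1 - self.decay) * grads[key] ** 2)
            params[key] -= (self.learning_rate * grads[key]
                            / np.sqrt(self.G[key] + self.epsilon))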

Adam Optimizer

Adam (Adaptive Moment Estimation) combines momentum and adaptive learning rates. It’s the de facto default optimizer in deep learning.

Adam Update Rules

Adam maintains two moving averages:

First moment (mean of gradients, like momentum):

m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t

Second moment (uncentered variance, like RMSProp):

v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2

Bias correction (because m_0 = v_0 = 0):

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

Update:

\theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

Default hyperparameters (work well in practice):

  • \alpha = 0.001 (learning rate)
  • \beta_1 = 0.9 (momentum)
  • \beta_2 = 0.999 (RMSProp decay)
  • \epsilon = 10^{-8} (numerical stability)
class Adam:
    def __init__(self, learning_rate=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}  # First moment
        self.v = {}  # Second moment
        self.t = 0   # Time step

    def update(self, params, grads):
        """Update parameters using Adam."""
        # Initialize moments on first call
        if not self.m:
            for key in params:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])

        self.t += 1
        for key in params:
            # Update biased first moment estimate
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            # Update biased second moment estimate
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key] ** 2)

            # Compute bias-corrected moments
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)

            # Update parameters
            params[key] -= self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)

Why Adam Works

  • Adaptive learning rates: Each parameter gets its own effective learning rate
  • Momentum: Accelerates convergence
  • Bias correction: Accounts for initialization at zero
  • Robust: Works well across different architectures and problems

Learning Rate Schedules

Even with adaptive optimizers, adjusting the learning rate during training can help.

Step Decay

Reduce learning rate by a factor every few epochs:

learning_rate = initial_lr * (decay_rate ** (epoch // step_size))

Example: Start with \alpha = 0.1 and multiply by 0.1 every 30 epochs.

Cosine Annealing

Smoothly decrease learning rate following a cosine curve:

learning_rate = min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(epoch / total_epochs * np.pi))

Popular in modern training, especially with restarts.

Warmup

Start with a very small learning rate and gradually increase:

if epoch < warmup_epochs:
    learning_rate = target_lr * (epoch / warmup_epochs)

Useful for large batch training and transformers.
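
Warmup is commonly combined with one of the decay schedules above; below is a sketch of a linear-warmup-then-cosine schedule (the function name and arguments are illustrative, not a specific library API):

import numpy as np

def lr_at_epoch(epoch, total_epochs, target_lr, warmup_epochs=5, min_lr=0.0):
    """Linear warmup to target_lr, then cosine decay down to min_lr."""
    if epoch < warmup_epochs:
        # Linear warmup phase (same form as the snippet above)
        return target_lr * (epoch / warmup_epochs)
    # Cosine decay over the remaining epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (target_lr - min_lr) * (1 + np.cos(np.pi * progress))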

Comparison and Best Practices

When to Use Which Optimizer

Optimizer      | Best For                           | Notes
SGD            | Baselines, when you want control   | Requires careful LR tuning
SGD + Momentum | CNNs, well-understood problems     | Often reaches better final performance
Adam           | Default choice, transformers, RNNs | Fast convergence, robust
RMSProp        | RNNs (historically)                | Less common now, Adam usually better

Practical Advice

Start with Adam (\alpha = 0.001, default parameters). It works well out of the box. If you need the absolute best performance and have time to tune, try SGD with momentum and a learning rate schedule.

Learning Resources

  • Backpropagation - Computing gradients
  • Learning Rate Schedules - Adaptive LR strategies
  • Regularization - Preventing overfitting
  • Batch Normalization - Speeding up training

Next Steps

  1. Learn about regularization techniques
  2. Study batch normalization
  3. Understand learning rate scheduling
  4. Practice in MNIST implementation