Optimization Algorithms

Once we can compute gradients via backpropagation, we need algorithms that use those gradients to update parameters efficiently. This page covers the evolution from basic gradient descent to modern adaptive optimizers.

Gradient Descent Fundamentals

The core idea: iteratively move the parameters in the direction that decreases the loss, i.e., along the negative gradient.

Batch Gradient Descent

Update parameters using the gradient over the entire dataset:

\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)

where:

  • \theta = parameters (weights and biases)
  • \alpha = learning rate
  • \nabla_\theta L = gradient of the loss w.r.t. the parameters

Pros:

  • Stable convergence
  • Guaranteed to converge to the global minimum for convex functions (with a suitably small learning rate)

Cons:

  • Very slow for large datasets
  • Requires loading entire dataset into memory
  • Can get stuck in local minima
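
For concreteness, a minimal sketch of one batch gradient descent loop. It assumes a hypothetical compute_loss_and_grads helper (the same name used in the mini-batch snippet further down) that returns the loss and a dict of gradients averaged over the entire training set, plus dict-style params:

# Batch gradient descent: one update per pass over the FULL dataset
for epoch in range(num_epochs):
    # Hypothetical helper: loss and gradients averaged over all examples
    loss, grads = compute_loss_and_grads(X, y)
    for key in params:
        params[key] -= learning_rate * grads[key]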

Stochastic Gradient Descent (SGD)

Update parameters using the gradient from a single example:

\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(x_i, y_i; \theta_t)

Pros:

  • Much faster updates
  • Can escape shallow local minima (noise helps!)
  • Works with online/streaming data

Cons:

  • Noisy gradients → oscillating loss
  • Requires careful learning rate tuning
  • May not converge to exact minimum
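
A minimal per-example loop under the same assumptions as the sketch above (hypothetical compute_loss_and_grads helper, dict-style params and grads):

# Stochastic gradient descent: one update per training example
for epoch in range(num_epochs):
    for i in np.random.permutation(num_train):  # visit examples in random order
        # Gradient from a single (x_i, y_i) pair; keep a length-1 batch dimension
        loss, grads = compute_loss_and_grads(X[i:i+1], y[i:i+1])
        for key in params:
            params[key] -= learning_rate * grads[key]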

Mini-Batch Gradient Descent

Best of both worlds: Use gradients from a small batch of examples (typically 32-256):

\theta_{t+1} = \theta_t - \alpha \frac{1}{B} \sum_{i \in \text{batch}} \nabla_\theta L(x_i, y_i; \theta_t)

Why mini-batches?

  • Reduces gradient noise (vs. single example)
  • Faster than full batch (especially with GPUs)
  • Batch size is a hyperparameter (typical: 32, 64, 128, 256)
# Mini-batch SGD implementation
for epoch in range(num_epochs):
    # Shuffle data
    indices = np.random.permutation(num_train)
    for i in range(0, num_train, batch_size):
        # Get mini-batch
        batch_indices = indices[i:i+batch_size]
        X_batch = X[batch_indices]
        y_batch = y[batch_indices]

        # Compute loss and gradients
        loss, grads = compute_loss_and_grads(X_batch, y_batch)

        # Update parameters in place (params and grads are dicts keyed by name)
        for key in params:
            params[key] -= learning_rate * grads[key]

SGD with Momentum

Problem with vanilla SGD: Updates can oscillate, especially in directions with high curvature.

Solution: Accumulate a velocity vector that acts as a running mean of gradients:

v_{t+1} = \rho v_t - \alpha \nabla_\theta L(\theta_t)
\theta_{t+1} = \theta_t + v_{t+1}

where \rho is the momentum coefficient (typically 0.9 or 0.99).

Intuition: Think of a ball rolling down a hill—it builds up speed in consistent directions and dampens oscillations.

Benefits:

  • Accelerates convergence in consistent directions
  • Dampens oscillations in high-curvature directions
  • Can escape small local minima
class SGDMomentum:
    def __init__(self, learning_rate=1e-3, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}

    def update(self, params, grads):
        """Update parameters using momentum."""
        # Initialize velocity on first call
        if not self.velocity:
            for key in params:
                self.velocity[key] = np.zeros_like(params[key])

        # Momentum update
        for key in params:
            self.velocity[key] = (
                self.momentum * self.velocity[key]
                - self.learning_rate * grads[key]
            )
            params[key] += self.velocity[key]

Nesterov Momentum

A clever twist: evaluate the gradient at the “look-ahead” position instead of the current position:

v_{t+1} = \rho v_t - \alpha \nabla_\theta L(\theta_t + \rho v_t)
\theta_{t+1} = \theta_t + v_{t+1}

Why better? By evaluating the gradient where the momentum is about to carry us, we can correct the velocity before overshooting. Nesterov momentum often converges faster than standard momentum.
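
A minimal sketch in the style of the SGDMomentum class above. The grad_fn argument (a callable that returns a dict of gradients evaluated at whatever parameter values it is given) is an assumption here, since the look-ahead requires evaluating gradients away from the current parameters:

class NesterovMomentum:
    def __init__(self, learning_rate=1e-3, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}

    def update(self, params, grad_fn):
        """grad_fn(params_dict) -> dict of gradients at those parameter values."""
        if not self.velocity:
            for key in params:
                self.velocity[key] = np.zeros_like(params[key])

        # Evaluate the gradient at the look-ahead point theta + rho * v
        lookahead = {key: params[key] + self.momentum * self.velocity[key]
                     for key in params}
        grads = grad_fn(lookahead)

        for key in params:
            self.velocity[key] = (self.momentum * self.velocity[key]
                                  - self.learning_rate * grads[key])
            params[key] += self.velocity[key]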

Adaptive Learning Rate Methods

Problem: A single global learning rate is suboptimal. Different parameters may need different learning rates.

AdaGrad (Adaptive Gradient)

Adapt learning rate per parameter based on historical gradients:

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \odot \nabla_\theta L(\theta_t)

where G_t accumulates the squared gradients: G_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau

Intuition:

  • Parameters with large gradients get smaller learning rates
  • Parameters with small gradients get larger learning rates
  • Good for sparse data (e.g., word embeddings)

Problem: Because G_t only ever grows, the effective learning rate decreases monotonically and may become too small before training finishes.
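
A minimal sketch of the update rule above, written in the same dict-based style as the other optimizer classes on this page (an illustration, not a reference implementation); the comment marks where the monotonic decay comes from:

class AdaGrad:
    def __init__(self, learning_rate=1e-2, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.epsilon = epsilon
        self.G = {}  # Accumulated sum of squared gradients

    def update(self, params, grads):
        if not self.G:
            for key in params:
                self.G[key] = np.zeros_like(params[key])

        for key in params:
            # G only ever grows, so the effective learning rate only shrinks
            self.G[key] += grads[key] ** 2
            params[key] -= (self.learning_rate * grads[key]
                            / np.sqrt(self.G[key] + self.epsilon))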

RMSProp (Root Mean Square Propagation)

Fix AdaGrad’s monotonic decay by replacing the growing sum with an exponential moving average of squared gradients:

G_t = \rho G_{t-1} + (1-\rho) g_t \odot g_t
\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \odot g_t

where \rho is typically 0.9 or 0.99.

Benefits: Doesn’t suffer from monotonic decay, works well in practice.
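
A minimal sketch in the same style as the classes above; the only change from the AdaGrad sketch is the exponential moving average of squared gradients:

class RMSProp:
    def __init__(self, learning_rate=1e-3, decay=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.decay = decay        # rho in the formula above
        self.epsilon = epsilon
        self.G = {}  # Moving average of squared gradients

    def update(self, params, grads):
        if not self.G:
            for key in params:
                self.G[key] = np.zeros_like(params[key])

        for key in params:
            # Exponential moving average instead of an ever-growing sum
            self.G[key] = (self.decay * self.G[key]
                           + (1 - self.decay) * grads[key] ** 2)
            params[key] -= (self.learning_rate * grads[key]
                            / np.sqrt(self.G[key] + self.epsilon))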

Adam Optimizer

Adam (Adaptive Moment Estimation) combines momentum and adaptive learning rates. It’s the de facto default optimizer in deep learning.

Adam Update Rules

Adam maintains two moving averages:

First moment (mean of gradients, like momentum):

m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t

Second moment (uncentered variance, like RMSProp):

v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2

Bias correction (because m_0 = v_0 = 0):

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

Update:

\theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

Default hyperparameters (work well in practice):

  • \alpha = 0.001 (learning rate)
  • \beta_1 = 0.9 (momentum)
  • \beta_2 = 0.999 (RMSProp decay)
  • \epsilon = 10^{-8} (numerical stability)
class Adam:
    def __init__(self, learning_rate=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}  # First moment
        self.v = {}  # Second moment
        self.t = 0   # Time step

    def update(self, params, grads):
        """Update parameters using Adam."""
        # Initialize moments on first call
        if not self.m:
            for key in params:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])

        self.t += 1
        for key in params:
            # Update biased first moment estimate
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            # Update biased second moment estimate
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key] ** 2)

            # Compute bias-corrected moments
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)

            # Update parameters
            params[key] -= self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon)

Why Adam Works

  • Adaptive learning rates: Each parameter gets its own effective learning rate
  • Momentum: Accelerates convergence
  • Bias correction: Accounts for initialization at zero
  • Robust: Works well across different architectures and problems

Learning Rate Schedules

Even with adaptive optimizers, adjusting the learning rate during training can help.

Step Decay

Reduce learning rate by a factor every few epochs:

learning_rate = initial_lr * (decay_rate ** (epoch // step_size))

Example: Start with \alpha = 0.1 and multiply by 0.1 every 30 epochs.

Cosine Annealing

Smoothly decrease learning rate following a cosine curve:

learning_rate = min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(epoch / total_epochs * np.pi))

Popular in modern training, especially with restarts.

Warmup

Start with a very small learning rate and gradually increase:

if epoch < warmup_epochs:
    learning_rate = target_lr * (epoch / warmup_epochs)

Useful for large batch training and transformers.
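
Warmup is commonly combined with one of the decay schedules above; below is a sketch of a linear-warmup-then-cosine schedule (the function name and arguments are illustrative, not a specific library API):

import numpy as np

def lr_at_epoch(epoch, total_epochs, target_lr, warmup_epochs=5, min_lr=0.0):
    """Linear warmup to target_lr, then cosine decay down to min_lr."""
    if epoch < warmup_epochs:
        # Linear warmup phase (same form as the snippet above)
        return target_lr * (epoch / warmup_epochs)
    # Cosine decay over the remaining epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (target_lr - min_lr) * (1 + np.cos(np.pi * progress))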

Comparison and Best Practices

When to Use Which Optimizer

Optimizer      | Best For                           | Notes
SGD            | Baselines, when you want control   | Requires careful LR tuning
SGD + Momentum | CNNs, well-understood problems     | Often reaches better final performance
Adam           | Default choice, transformers, RNNs | Fast convergence, robust
RMSProp        | RNNs (historically)                | Less common now, Adam usually better

Practical Advice

Start with Adam (\alpha = 0.001, default parameters). It works well out of the box. If you need the absolute best performance and have time to tune, try SGD with momentum and a learning rate schedule.

Learning Resources

  • Backpropagation - Computing gradients
  • Learning Rate Schedules - Adaptive LR strategies
  • Regularization - Preventing overfitting
  • Batch Normalization - Speeding up training

Next Steps

  1. Learn about regularization techniques
  2. Study batch normalization
  3. Understand learning rate scheduling
  4. Practice in MNIST implementation