MNIST Digit Classification from Scratch
This assignment is non-negotiable: completing it is what builds a deep understanding of neural networks. Everything you've learned comes together here: the forward pass, backpropagation, optimization, regularization, and debugging.
Assignment Overview
Goal: Build a two-layer neural network from scratch (in NumPy) and train it to classify MNIST handwritten digits with >95% test accuracy.
Time estimate: 5-8 hours for the Part 2 extension (2-3 hours if you've done similar work), in addition to Part 1, which has its own estimates below
Primary resource: CS231n Assignment 1
Learning Objectives
By completing this assignment, you will:
- ✓ Implement forward propagation through multiple layers
- ✓ Implement backpropagation and gradient computation
- ✓ Understand and apply different optimizers (SGD, Adam)
- ✓ Debug neural networks using gradient checking
- ✓ Apply regularization techniques (L2, dropout)
- ✓ Tune hyperparameters systematically
- ✓ Analyze training dynamics (loss curves, accuracy)
Part 1: CS231n Assignment 1
First, complete the official CS231n Assignment 1:
Required Sections
1. k-Nearest Neighbor classifier (warmup, 1-2 hours)
   - Understand the baseline
   - Implement vectorized distance computation
   - Cross-validation for k selection
2. Support Vector Machine (2-3 hours)
   - Implement the SVM loss function
   - Implement the gradient computation
   - Verify with a gradient check
   - Train with SGD
3. Softmax Classifier (2-3 hours)
   - Implement the softmax loss function
   - Implement the gradient computation
   - Apply numerical stability tricks (subtract the max score before exponentiating, as in the starter code below)
4. Two-Layer Neural Network ⭐⭐⭐ (4-6 hours)
   - Implement the forward pass
   - Implement the backward pass (backpropagation)
   - Gradient checking
   - Training loop with SGD
   - Hyperparameter tuning
Total time: roughly 9-14 hours
Visit the CS231n Assignment 1 page for:
- Setup instructions
- Starter code download
- Jupyter notebooks with guided instructions
- Autograder for verification
Follow the instructions carefully—the assignment is well-designed to guide you through the implementation.
Part 2: Extended MNIST Challenge
After completing CS231n Assignment 1, extend your implementation:
Requirements
- Load and Preprocess MNIST
- Build a Two-Layer Network
- Implement Two Different Optimizers
- Apply Regularization
- Achieve Performance Target
1. Load and Preprocess MNIST
import numpy as np
from tensorflow.keras.datasets import mnist
# Load MNIST data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Flatten 28x28 images into 784-dim vectors and scale pixels to [0, 1]
X_train = X_train.reshape(-1, 784).astype('float32') / 255
X_test = X_test.reshape(-1, 784).astype('float32') / 255
# Split train into train + validation
num_val = 10000
X_val = X_train[-num_val:]
y_val = y_train[-num_val:]
X_train = X_train[:-num_val]
y_train = y_train[:-num_val]
print(f"Training samples: {X_train.shape[0]}")
print(f"Validation samples: {X_val.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")2. Network Architecture
2. Network Architecture

Specifications:
- Input layer: 784 neurons (28×28 pixels)
- Hidden layer: 100-500 neurons (tune this!)
- ReLU activation
- Output layer: 10 neurons (digits 0-9)
- Softmax + cross-entropy loss
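As a sanity check on scale: with a 100-unit hidden layer the model has 784×100 + 100 = 78,500 parameters in the first layer and 100×10 + 10 = 1,010 in the second, about 79,510 in total; a 500-unit hidden layer has roughly five times as many.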
3. Starter Code Template
class TwoLayerMNIST:
def __init__(self, input_dim=784, hidden_dim=100, output_dim=10):
"""
Initialize a two-layer neural network for MNIST.
"""
# Initialize weights with He initialization
self.params = {
'W1': np.random.randn(input_dim, hidden_dim) * np.sqrt(2.0 / input_dim),
'b1': np.zeros(hidden_dim),
'W2': np.random.randn(hidden_dim, output_dim) * np.sqrt(2.0 / hidden_dim),
'b2': np.zeros(output_dim)
}
def forward(self, X):
"""
Forward pass.
Args:
X: Input data (N, 784)
Returns:
scores: Class scores (N, 10)
cache: Values needed for backward pass
"""
W1, b1 = self.params['W1'], self.params['b1']
W2, b2 = self.params['W2'], self.params['b2']
# Layer 1: Linear -> ReLU
z1 = X @ W1 + b1
h = np.maximum(0, z1)
# Layer 2: Linear
scores = h @ W2 + b2
cache = (X, z1, h, W1, W2)
return scores, cache
def loss(self, X, y, reg=0.0):
"""
Compute loss and gradients.
Args:
X: Input data (N, 784)
y: Labels (N,)
reg: Regularization strength
Returns:
loss: Scalar loss
grads: Dictionary of gradients
"""
scores, cache = self.forward(X)
X, z1, h, W1, W2 = cache
N = X.shape[0]
# Compute softmax loss
scores_shifted = scores - np.max(scores, axis=1, keepdims=True)
exp_scores = np.exp(scores_shifted)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
# Cross-entropy loss
correct_log_probs = -np.log(probs[range(N), y])
data_loss = np.sum(correct_log_probs) / N
# L2 regularization
reg_loss = 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
loss = data_loss + reg_loss
        # ========== BACKWARD PASS ==========
        # Backpropagate from the loss to every parameter.
# Gradient of loss w.r.t. scores
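        # For softmax + cross-entropy, dL/dscores = (probs - one_hot(y)) / N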
dscores = probs.copy()
dscores[range(N), y] -= 1
dscores /= N
# Backprop through layer 2
dW2 = h.T @ dscores
db2 = np.sum(dscores, axis=0)
dh = dscores @ W2.T
# Backprop through ReLU
dh[z1 <= 0] = 0
# Backprop through layer 1
dW1 = X.T @ dh
db1 = np.sum(dh, axis=0)
# Add regularization gradient
dW2 += reg * W2
dW1 += reg * W1
grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
return loss, grads
def predict(self, X):
"""
Make predictions.
Args:
X: Input data (N, 784)
Returns:
predictions: Predicted labels (N,)
"""
scores, _ = self.forward(X)
return np.argmax(scores, axis=1)
def train(self, X, y, X_val, y_val,
learning_rate=1e-3, reg=1e-5,
num_epochs=10, batch_size=200,
verbose=True):
"""
Train the network using mini-batch SGD.
Args:
X: Training data (N, 784)
y: Training labels (N,)
X_val: Validation data
y_val: Validation labels
learning_rate: Learning rate
reg: Regularization strength
num_epochs: Number of epochs
batch_size: Batch size
verbose: Print progress
Returns:
history: Dictionary containing training history
"""
num_train = X.shape[0]
iterations_per_epoch = max(num_train // batch_size, 1)
loss_history = []
train_acc_history = []
val_acc_history = []
for epoch in range(num_epochs):
# Shuffle training data
            indices = np.random.permutation(num_train)
for it in range(iterations_per_epoch):
# Sample minibatch
batch_indices = indices[it * batch_size:(it + 1) * batch_size]
X_batch = X[batch_indices]
y_batch = y[batch_indices]
# Compute loss and gradients
loss, grads = self.loss(X_batch, y_batch, reg)
loss_history.append(loss)
# Update parameters (SGD)
self.params['W1'] -= learning_rate * grads['W1']
self.params['b1'] -= learning_rate * grads['b1']
self.params['W2'] -= learning_rate * grads['W2']
self.params['b2'] -= learning_rate * grads['b2']
# Check accuracy
train_acc = np.mean(self.predict(X) == y)
val_acc = np.mean(self.predict(X_val) == y_val)
train_acc_history.append(train_acc)
val_acc_history.append(val_acc)
if verbose:
print(f'Epoch {epoch+1}/{num_epochs}: '
f'loss {loss:.4f}, '
f'train_acc {train_acc:.4f}, '
f'val_acc {val_acc:.4f}')
return {
'loss_history': loss_history,
'train_acc_history': train_acc_history,
'val_acc_history': val_acc_history
        }

Tasks
Task 1: Implement and Train
- Complete the forward pass
- Complete the backward pass (backpropagation)
- Train the network (a usage sketch follows this list)
- Achieve >95% test accuracy
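For orientation, here is a minimal end-to-end run, assuming X_train, y_train, X_val, y_val, X_test, and y_test from step 1; the hyperparameters are starting points, not tuned values:

# Minimal end-to-end run (a sketch; tune the hyperparameters in Task 4)
model = TwoLayerMNIST(hidden_dim=200)
history = model.train(X_train, y_train, X_val, y_val,
                      learning_rate=1e-3, reg=1e-4,
                      num_epochs=10, batch_size=200)
test_acc = np.mean(model.predict(X_test) == y_test)
print(f'Test accuracy: {test_acc:.4f}')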
Task 2: Implement Adam Optimizer
Add Adam optimizer support:
class AdamOptimizer:
def __init__(self, learning_rate=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
self.lr = learning_rate
self.beta1 = beta1
self.beta2 = beta2
self.epsilon = epsilon
self.m = {}
self.v = {}
self.t = 0
def update(self, params, grads):
if not self.m:
for key in params:
self.m[key] = np.zeros_like(params[key])
self.v[key] = np.zeros_like(params[key])
self.t += 1
for key in params:
self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key]**2)
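            # Bias-correct the running moments, which are initialized at zero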
m_hat = self.m[key] / (1 - self.beta1**self.t)
v_hat = self.v[key] / (1 - self.beta2**self.t)
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
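One way to wire this into training (a sketch; train_with_adam is a hypothetical helper, since the train() method above hard-codes the SGD update):

def train_with_adam(model, X, y, X_val, y_val,
                    learning_rate=1e-3, reg=1e-5,
                    num_epochs=10, batch_size=200):
    # Variant of train() that delegates parameter updates to AdamOptimizer.
    optimizer = AdamOptimizer(learning_rate=learning_rate)
    num_train = X.shape[0]
    for epoch in range(num_epochs):
        indices = np.random.permutation(num_train)
        for it in range(max(num_train // batch_size, 1)):
            batch = indices[it * batch_size:(it + 1) * batch_size]
            loss, grads = model.loss(X[batch], y[batch], reg)
            optimizer.update(model.params, grads)  # replaces the manual SGD step
        val_acc = np.mean(model.predict(X_val) == y_val)
        print(f'Epoch {epoch + 1}/{num_epochs}: loss {loss:.4f}, val_acc {val_acc:.4f}')

Compare its validation curve against plain SGD at the same learning rate; Adam typically needs fewer epochs on this problem.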
Task 3: Add Dropout

Implement dropout regularization:
    def forward_with_dropout(self, X, dropout_prob=0.5, train=True):
        # Implement dropout on the hidden layer (see the sketch below)
        pass
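For reference, a minimal inverted-dropout sketch, mirroring forward() above; it assumes dropout_prob is the probability of dropping a hidden unit:

    def forward_with_dropout(self, X, dropout_prob=0.5, train=True):
        # One possible inverted-dropout implementation (a sketch)
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        z1 = X @ W1 + b1
        h = np.maximum(0, z1)
        mask = None
        if train:
            # Inverted dropout: zero units with probability dropout_prob and
            # scale survivors by 1/(1 - dropout_prob), so the test-time
            # forward pass needs no change.
            mask = (np.random.rand(*h.shape) >= dropout_prob) / (1.0 - dropout_prob)
            h = h * mask
        scores = h @ W2 + b2
        cache = (X, z1, h, mask, W1, W2)
        return scores, cache

The matching backward pass must multiply the upstream gradient dh by the same mask before the ReLU step.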
Task 4: Hyperparameter Tuning

Find the best hyperparameters:
# Parameters to tune
hidden_dims = [100, 200, 500]
learning_rates = [1e-4, 5e-4, 1e-3, 5e-3]
regularizations = [0, 1e-5, 1e-4, 1e-3]
best_val_acc = 0
best_params = None
# Grid search or random search
for hidden_dim in hidden_dims:
for lr in learning_rates:
for reg in regularizations:
            # One possible completion: train a fresh model per setting and
            # keep whichever reaches the best validation accuracy
            # (num_epochs=5 is a hypothetical short budget for the search)
            model = TwoLayerMNIST(hidden_dim=hidden_dim)
            history = model.train(X_train, y_train, X_val, y_val,
                                  learning_rate=lr, reg=reg,
                                  num_epochs=5, verbose=False)
            val_acc = history['val_acc_history'][-1]
            if val_acc > best_val_acc:
                best_val_acc = val_acc
                best_params = {'hidden_dim': hidden_dim, 'lr': lr, 'reg': reg}

Performance Targets
- Minimum: 95% test accuracy
- Good: 97% test accuracy
- Excellent: 98% test accuracy
Debugging Checklist
- ✓ Gradient check passes (relative error < 1e-5; see the sketch after this list)
- ✓ Can overfit 10 training examples
- ✓ Initial loss ≈ ln(10) ≈ 2.30 (with random weights the softmax is near-uniform over 10 classes, so the cross-entropy is about -ln(1/10))
- ✓ Loss decreases over time
- ✓ Train/val accuracy curves look reasonable
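A minimal numerical gradient check, assuming the TwoLayerMNIST class above; grad_check is an illustrative helper, not part of the required API:

def grad_check(model, X, y, param_name, num_checks=5, h=1e-5):
    # Compare analytic gradients from loss() against centered finite
    # differences at a few randomly chosen entries of one parameter.
    _, grads = model.loss(X, y, reg=0.0)
    W = model.params[param_name]
    for _ in range(num_checks):
        ix = tuple(np.random.randint(n) for n in W.shape)
        old = W[ix]
        W[ix] = old + h
        loss_plus, _ = model.loss(X, y, reg=0.0)
        W[ix] = old - h
        loss_minus, _ = model.loss(X, y, reg=0.0)
        W[ix] = old  # restore the original value
        grad_numeric = (loss_plus - loss_minus) / (2 * h)
        grad_analytic = grads[param_name][ix]
        rel_error = (abs(grad_numeric - grad_analytic)
                     / max(abs(grad_numeric) + abs(grad_analytic), 1e-12))
        print(f'{param_name}{ix}: relative error {rel_error:.2e}')

model = TwoLayerMNIST(hidden_dim=10)
for name in ['W1', 'b1', 'W2', 'b2']:
    grad_check(model, X_train[:20], y_train[:20], name)

For the overfitting check, training on X_train[:10], y_train[:10] with reg=0 should reach 100% training accuracy within a few hundred iterations.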
Visualization
import matplotlib.pyplot as plt
# Plot training curves
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Loss curve
axes[0].plot(history['loss_history'])
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss')
# Accuracy curves
axes[1].plot(history['train_acc_history'], label='Train')
axes[1].plot(history['val_acc_history'], label='Validation')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Accuracy')
axes[1].legend()
plt.tight_layout()
plt.show()

Extensions (Optional)
After achieving good performance, try:
- Deeper network: Add more hidden layers
- Different activations: Try tanh, sigmoid, Leaky ReLU
- Batch normalization: Implement and compare
- Data augmentation: Random rotations, shifts
- Visualization: Visualize the learned first-layer weights (a sketch follows this list)
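A possible starting point for the visualization extension, assuming a trained model from Part 2 with hidden_dim >= 32 and the matplotlib import above:

# Each column of W1 holds one hidden unit's 784 input weights; reshaping a
# column to 28x28 shows the input pattern that unit responds to.
W1 = model.params['W1']  # shape (784, hidden_dim)
fig, axes = plt.subplots(4, 8, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(W1[:, i].reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.suptitle('First-layer weights (first 32 hidden units)')
plt.show()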
Related Concepts
- Multi-Layer Perceptrons - Network architecture
- Backpropagation - Training algorithm
- Optimization - SGD, Adam
- Regularization - Preventing overfitting
- Training Practices - Debugging and tuning
Next Steps
After completing MNIST:
- Move to CNNs for computer vision
- Study attention mechanisms for NLP
- Learn frameworks (PyTorch, TensorFlow) for production code