MNIST Digit Classification from Scratch

Critical Assignment

This assignment is non-negotiable. You must complete it to understand neural networks at a deep level: everything you’ve learned comes together here, including the forward pass, backpropagation, optimization, regularization, and debugging.

Assignment Overview

Goal: Build a two-layer neural network from scratch (in NumPy) and train it to classify MNIST handwritten digits with >95% test accuracy.

Time estimate: 5-8 hours (first time), 2-3 hours (if you’ve done similar work)

Primary resource: CS231n Assignment 1 

Learning Objectives

By completing this assignment, you will:

  • ✓ Implement forward propagation through multiple layers
  • ✓ Implement backpropagation and gradient computation
  • ✓ Understand and apply different optimizers (SGD, Adam)
  • ✓ Debug neural networks using gradient checking
  • ✓ Apply regularization techniques (L2, dropout)
  • ✓ Tune hyperparameters systematically
  • ✓ Analyze training dynamics (loss curves, accuracy)

Part 1: CS231n Assignment 1

First, complete the official CS231n Assignment 1:

Required Sections

  1. k-Nearest Neighbor classifier (Warmup, 1-2 hours)

    • Understand the baseline
    • Implement vectorized distance computation
    • Cross-validation for k-selection
  2. Support Vector Machine (2-3 hours)

    • Implement SVM loss function
    • Implement gradient computation
    • Verify with gradient check
    • Train with SGD
  3. Softmax Classifier (2-3 hours)

    • Implement softmax loss function
    • Implement gradient computation
    • Numerical stability tricks
  4. Two-Layer Neural Network ⭐⭐⭐ (4-6 hours)

    • Implement forward pass
    • Implement backward pass (backpropagation)
    • Gradient checking
    • Training loop with SGD
    • Hyperparameter tuning

Total time: 8-15 hours

Assignment Setup

Visit the CS231n Assignment 1 page for:

  • Setup instructions
  • Starter code download
  • Jupyter notebooks with guided instructions
  • Autograder for verification

Follow the instructions carefully—the assignment is well-designed to guide you through the implementation.

Part 2: Extended MNIST Challenge

After completing CS231n Assignment 1, extend your implementation:

Requirements

  1. Load and Preprocess MNIST
  2. Build a Two-Layer Network
  3. Implement Two Different Optimizers
  4. Apply Regularization
  5. Achieve Performance Target

1. Load and Preprocess MNIST

import numpy as np
from tensorflow.keras.datasets import mnist

# Load MNIST data
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Reshape and normalize
X_train = X_train.reshape(60000, 784).astype('float32') / 255
X_test = X_test.reshape(10000, 784).astype('float32') / 255

# Split train into train + validation
num_val = 10000
X_val = X_train[-num_val:]
y_val = y_train[-num_val:]
X_train = X_train[:-num_val]
y_train = y_train[:-num_val]

print(f"Training samples: {X_train.shape[0]}")
print(f"Validation samples: {X_val.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
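
If TensorFlow is not installed, the same split can be built with scikit-learn's fetch_openml instead (a sketch; the 'mnist_784' OpenML dataset ships pre-flattened, and its labels arrive as strings):

from sklearn.datasets import fetch_openml
import numpy as np

# Alternative loader: fetches the same 70,000 MNIST images from OpenML
data = fetch_openml('mnist_784', version=1, as_frame=False)
X = data.data.astype('float32') / 255   # (70000, 784), already flattened
y = data.target.astype(np.int64)        # convert string labels to ints

X_train, X_test = X[:60000], X[60000:]  # standard 60k/10k split
y_train, y_test = y[:60000], y[60000:]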

2. Network Architecture

Specifications:

  • Input layer: 784 neurons (28×28 pixels)
  • Hidden layer: 100-500 neurons (tune this!)
  • ReLU activation
  • Output layer: 10 neurons (digits 0-9)
  • Softmax + cross-entropy loss (see the toy example below)
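
For reference, a minimal sketch of the softmax + cross-entropy computation on a toy batch (values are illustrative; the starter code below does the same thing inside loss()):

import numpy as np

scores = np.random.randn(2, 10)   # raw class scores for 2 examples
y = np.array([3, 7])              # true labels

# Softmax with max-shift for numerical stability
shifted = scores - scores.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Cross-entropy: mean negative log-probability of the correct class
loss = -np.log(probs[np.arange(2), y]).mean()
print(loss)  # for random scores, typically near log(10) ≈ 2.3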

3. Starter Code Template

class TwoLayerMNIST:
    def __init__(self, input_dim=784, hidden_dim=100, output_dim=10):
        """
        Initialize a two-layer neural network for MNIST.
        """
        # Initialize weights with He initialization
        self.params = {
            'W1': np.random.randn(input_dim, hidden_dim) * np.sqrt(2.0 / input_dim),
            'b1': np.zeros(hidden_dim),
            'W2': np.random.randn(hidden_dim, output_dim) * np.sqrt(2.0 / hidden_dim),
            'b2': np.zeros(output_dim)
        }

    def forward(self, X):
        """
        Forward pass.

        Args:
            X: Input data (N, 784)

        Returns:
            scores: Class scores (N, 10)
            cache: Values needed for backward pass
        """
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']

        # Layer 1: Linear -> ReLU
        z1 = X @ W1 + b1
        h = np.maximum(0, z1)

        # Layer 2: Linear
        scores = h @ W2 + b2

        cache = (X, z1, h, W1, W2)
        return scores, cache

    def loss(self, X, y, reg=0.0):
        """
        Compute loss and gradients.

        Args:
            X: Input data (N, 784)
            y: Labels (N,)
            reg: Regularization strength

        Returns:
            loss: Scalar loss
            grads: Dictionary of gradients
        """
        scores, cache = self.forward(X)
        X, z1, h, W1, W2 = cache
        N = X.shape[0]

        # Compute softmax loss (shift by the max for numerical stability)
        scores_shifted = scores - np.max(scores, axis=1, keepdims=True)
        exp_scores = np.exp(scores_shifted)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Cross-entropy loss
        correct_log_probs = -np.log(probs[np.arange(N), y])
        data_loss = np.sum(correct_log_probs) / N

        # L2 regularization
        reg_loss = 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
        loss = data_loss + reg_loss

        # ========== BACKWARD PASS ==========
        # Gradient of loss w.r.t. scores
        dscores = probs.copy()
        dscores[np.arange(N), y] -= 1
        dscores /= N

        # Backprop through layer 2
        dW2 = h.T @ dscores
        db2 = np.sum(dscores, axis=0)
        dh = dscores @ W2.T

        # Backprop through ReLU
        dh[z1 <= 0] = 0

        # Backprop through layer 1
        dW1 = X.T @ dh
        db1 = np.sum(dh, axis=0)

        # Add regularization gradient
        dW2 += reg * W2
        dW1 += reg * W1

        grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
        return loss, grads

    def predict(self, X):
        """
        Make predictions.

        Args:
            X: Input data (N, 784)

        Returns:
            predictions: Predicted labels (N,)
        """
        scores, _ = self.forward(X)
        return np.argmax(scores, axis=1)

    def train(self, X, y, X_val, y_val,
              learning_rate=1e-3, reg=1e-5,
              num_epochs=10, batch_size=200, verbose=True):
        """
        Train the network using mini-batch SGD.

        Args:
            X: Training data (N, 784)
            y: Training labels (N,)
            X_val: Validation data
            y_val: Validation labels
            learning_rate: Learning rate
            reg: Regularization strength
            num_epochs: Number of epochs
            batch_size: Batch size
            verbose: Print progress

        Returns:
            history: Dictionary containing training history
        """
        num_train = X.shape[0]
        iterations_per_epoch = max(num_train // batch_size, 1)

        loss_history = []
        train_acc_history = []
        val_acc_history = []

        for epoch in range(num_epochs):
            # Shuffle training data
            indices = np.random.choice(num_train, num_train, replace=False)

            for it in range(iterations_per_epoch):
                # Sample minibatch
                batch_indices = indices[it * batch_size:(it + 1) * batch_size]
                X_batch = X[batch_indices]
                y_batch = y[batch_indices]

                # Compute loss and gradients
                loss, grads = self.loss(X_batch, y_batch, reg)
                loss_history.append(loss)

                # Update parameters (SGD)
                self.params['W1'] -= learning_rate * grads['W1']
                self.params['b1'] -= learning_rate * grads['b1']
                self.params['W2'] -= learning_rate * grads['W2']
                self.params['b2'] -= learning_rate * grads['b2']

            # Check accuracy
            train_acc = np.mean(self.predict(X) == y)
            val_acc = np.mean(self.predict(X_val) == y_val)
            train_acc_history.append(train_acc)
            val_acc_history.append(val_acc)

            if verbose:
                print(f'Epoch {epoch+1}/{num_epochs}: '
                      f'loss {loss:.4f}, '
                      f'train_acc {train_acc:.4f}, '
                      f'val_acc {val_acc:.4f}')

        return {
            'loss_history': loss_history,
            'train_acc_history': train_acc_history,
            'val_acc_history': val_acc_history
        }

Tasks

Task 1: Implement and Train

  1. Complete the forward pass
  2. Complete the backward pass (backpropagation)
  3. Train the network (see the end-to-end sketch below)
  4. Achieve >95% test accuracy
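
A minimal sketch of an end-to-end run, assuming the data variables from step 1 and the starter class above (the hyperparameters are illustrative, not tuned):

model = TwoLayerMNIST(hidden_dim=256)
history = model.train(X_train, y_train, X_val, y_val,
                      learning_rate=1e-3, reg=1e-4,
                      num_epochs=20, batch_size=200)

test_acc = np.mean(model.predict(X_test) == y_test)
print(f'Test accuracy: {test_acc:.4f}')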

Task 2: Implement Adam Optimizer

Add Adam optimizer support:

class AdamOptimizer:
    def __init__(self, learning_rate=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}
        self.v = {}
        self.t = 0

    def update(self, params, grads):
        # Lazily initialize first/second moment estimates
        if not self.m:
            for key in params:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])

        self.t += 1
        for key in params:
            # Update biased moment estimates
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key]**2)

            # Bias correction
            m_hat = self.m[key] / (1 - self.beta1**self.t)
            v_hat = self.v[key] / (1 - self.beta2**self.t)

            # Parameter update
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
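
One way to wire this in, sketched from outside the class for brevity (inside train() you would use self.params and the four SGD update lines would be replaced):

# Create once, before the training loop
optimizer = AdamOptimizer(learning_rate=1e-3)

# Inside the minibatch loop, instead of the four SGD update lines:
loss, grads = model.loss(X_batch, y_batch, reg)
optimizer.update(model.params, grads)   # updates W1, b1, W2, b2 in place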

Task 3: Add Dropout

Implement dropout regularization:

def forward_with_dropout(self, X, dropout_prob=0.5, train=True):
    # Implement dropout on hidden layer
    pass
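
One common way to fill this in is inverted dropout, which rescales activations at training time so the test-time forward pass needs no change. A sketch (the mask stored in the cache would also need handling in the backward pass):

def forward_with_dropout(self, X, dropout_prob=0.5, train=True):
    W1, b1 = self.params['W1'], self.params['b1']
    W2, b2 = self.params['W2'], self.params['b2']

    # Layer 1: Linear -> ReLU
    z1 = X @ W1 + b1
    h = np.maximum(0, z1)

    mask = None
    if train:
        # Drop each hidden unit with probability dropout_prob and rescale
        # the survivors, so expected activations match test time
        mask = (np.random.rand(*h.shape) >= dropout_prob) / (1.0 - dropout_prob)
        h = h * mask

    # Layer 2: Linear
    scores = h @ W2 + b2
    cache = (X, z1, h, W1, W2, mask)
    return scores, cache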

Task 4: Hyperparameter Tuning

Find the best hyperparameters:

# Parameters to tune
hidden_dims = [100, 200, 500]
learning_rates = [1e-4, 5e-4, 1e-3, 5e-3]
regularizations = [0, 1e-5, 1e-4, 1e-3]

best_val_acc = 0
best_params = None

# Grid search or random search
for hidden_dim in hidden_dims:
    for lr in learning_rates:
        for reg in regularizations:
            # Train model with these params
            # Track best validation accuracy
            pass
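
A sketch of a loop body to replace the pass, assuming the train() API above (search with few epochs to keep it fast, then retrain the best configuration for longer):

model = TwoLayerMNIST(hidden_dim=hidden_dim)
history = model.train(X_train, y_train, X_val, y_val,
                      learning_rate=lr, reg=reg,
                      num_epochs=5, verbose=False)

val_acc = history['val_acc_history'][-1]
if val_acc > best_val_acc:
    best_val_acc = val_acc
    best_params = {'hidden_dim': hidden_dim, 'lr': lr, 'reg': reg}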

Performance Targets

  • Minimum: 95% test accuracy
  • Good: 97% test accuracy
  • Excellent: 98% test accuracy

Debugging Checklist

  1. ✓ Gradient check passes (relative error < 1e-5; see the sketch below)
  2. ✓ Can overfit 10 training examples
  3. ✓ Initial loss ≈ log(10) ≈ 2.30
  4. ✓ Loss decreases over time
  5. ✓ Train/val accuracy curves look reasonable
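
For item 1, a minimal sketch of a centered-difference gradient check against the analytic gradients from loss(), in the style of the CS231n gradient-check utilities (a relative error below 1e-5 is a strong sign the backward pass is correct):

def rel_error(x, y):
    # Max relative error between two gradient arrays
    return np.max(np.abs(x - y) / np.maximum(1e-8, np.abs(x) + np.abs(y)))

def numeric_gradient(f, x, h=1e-5):
    # Centered finite differences, perturbing one entry of x at a time;
    # f must recompute the loss using the (modified) x in place
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)
        x[ix] = old - h
        fxmh = f(x)
        x[ix] = old
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

# Check on a tiny model and batch so the loop stays fast
model = TwoLayerMNIST(input_dim=784, hidden_dim=10)
X_small, y_small = X_train[:5], y_train[:5]
_, grads = model.loss(X_small, y_small, reg=0.1)

for name in ['W1', 'b1', 'W2', 'b2']:
    f = lambda _: model.loss(X_small, y_small, reg=0.1)[0]
    num_grad = numeric_gradient(f, model.params[name])
    print(name, 'relative error:', rel_error(grads[name], num_grad))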

Visualization

import matplotlib.pyplot as plt

# Plot training curves
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Loss curve
axes[0].plot(history['loss_history'])
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss')

# Accuracy curves
axes[1].plot(history['train_acc_history'], label='Train')
axes[1].plot(history['val_acc_history'], label='Validation')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Accuracy')
axes[1].legend()

plt.tight_layout()
plt.show()

Extensions (Optional)

After achieving good performance, try:

  1. Deeper network: Add more hidden layers
  2. Different activations: Try tanh, sigmoid, Leaky ReLU
  3. Batch normalization: Implement and compare
  4. Data augmentation: Random rotations, shifts
  5. Visualization: Visualize learned weights in the first layer (see the sketch below)
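
For extension 5, a minimal sketch that reshapes each first-layer weight column back into a 28×28 image (assumes matplotlib and a trained model; the 4×8 grid shows the first 32 hidden units, so adjust it to your hidden_dim):

import matplotlib.pyplot as plt

W1 = model.params['W1']  # (784, hidden_dim)
fig, axes = plt.subplots(4, 8, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(W1[:, i].reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.suptitle('First-layer weights')
plt.show()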

Next Steps

After completing MNIST:

  1. Move to CNNs for computer vision
  2. Study attention mechanisms for NLP
  3. Learn frameworks (PyTorch, TensorFlow) for production code