MNIST Digit Classification from Scratch
This assignment is non-negotiable: completing it is what builds a deep understanding of neural networks. Everything you've learned comes together here: the forward pass, backpropagation, optimization, regularization, and debugging.
Assignment Overview
Goal: Build a two-layer neural network from scratch (in NumPy) and train it to classify MNIST handwritten digits with >95% test accuracy.
Time estimate: 5-8 hours for the Part 2 extension (2-3 hours if you've done similar work), in addition to Part 1, which has its own estimates below
Primary resource: CS231n Assignment 1
Learning Objectives
By completing this assignment, you will:
- ✓ Implement forward propagation through multiple layers
- ✓ Implement backpropagation and gradient computation
- ✓ Understand and apply different optimizers (SGD, Adam)
- ✓ Debug neural networks using gradient checking
- ✓ Apply regularization techniques (L2, dropout)
- ✓ Tune hyperparameters systematically
- ✓ Analyze training dynamics (loss curves, accuracy)
Part 1: CS231n Assignment 1
First, complete the official CS231n Assignment 1:
Required Sections
1. k-Nearest Neighbor classifier (warmup, 1-2 hours)
   - Understand the baseline
   - Implement vectorized distance computation
   - Cross-validation for k selection
2. Support Vector Machine (2-3 hours)
   - Implement the SVM loss function
   - Implement the gradient computation
   - Verify with a gradient check
   - Train with SGD
3. Softmax Classifier (2-3 hours)
   - Implement the softmax loss function
   - Implement the gradient computation
   - Apply numerical stability tricks (subtract the max score before exponentiating, as in the starter code below)
4. Two-Layer Neural Network ⭐⭐⭐ (4-6 hours)
   - Implement the forward pass
   - Implement the backward pass (backpropagation)
   - Gradient checking
   - Training loop with SGD
   - Hyperparameter tuning
Total time: roughly 9-14 hours
Visit the CS231n Assignment 1 page for:
- Setup instructions
- Starter code download
- Jupyter notebooks with guided instructions
- Autograder for verification
Follow the instructions carefully—the assignment is well-designed to guide you through the implementation.
Part 2: Extended MNIST Challenge
After completing CS231n Assignment 1, extend your implementation:
Requirements
- Load and Preprocess MNIST
- Build a Two-Layer Network
- Implement Two Different Optimizers
- Apply Regularization
- Achieve Performance Target
1. Load and Preprocess MNIST
import numpy as np
from tensorflow.keras.datasets import mnist
# Load MNIST data
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Flatten 28x28 images into 784-dim vectors and scale pixels to [0, 1]
X_train = X_train.reshape(-1, 784).astype('float32') / 255
X_test = X_test.reshape(-1, 784).astype('float32') / 255
# Split train into train + validation
num_val = 10000
X_val = X_train[-num_val:]
y_val = y_train[-num_val:]
X_train = X_train[:-num_val]
y_train = y_train[:-num_val]
print(f"Training samples: {X_train.shape[0]}")
print(f"Validation samples: {X_val.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")2. Network Architecture
2. Network Architecture

Specifications:
- Input layer: 784 neurons (28×28 pixels)
- Hidden layer: 100-500 neurons (tune this!)
- ReLU activation
- Output layer: 10 neurons (digits 0-9)
- Softmax + cross-entropy loss
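As a sanity check on scale: with a 100-unit hidden layer the model has 784×100 + 100 = 78,500 parameters in the first layer and 100×10 + 10 = 1,010 in the second, about 79,510 in total; a 500-unit hidden layer has roughly five times as many.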
3. Starter Code Template
class TwoLayerMNIST:
def __init__(self, input_dim=784, hidden_dim=100, output_dim=10):
"""
Initialize a two-layer neural network for MNIST.
"""
# Initialize weights with He initialization
self.params = {
'W1': np.random.randn(input_dim, hidden_dim) * np.sqrt(2.0 / input_dim),
'b1': np.zeros(hidden_dim),
'W2': np.random.randn(hidden_dim, output_dim) * np.sqrt(2.0 / hidden_dim),
'b2': np.zeros(output_dim)
}
def forward(self, X):
"""
Forward pass.
Args:
X: Input data (N, 784)
Returns:
scores: Class scores (N, 10)
cache: Values needed for backward pass
"""
W1, b1 = self.params['W1'], self.params['b1']
W2, b2 = self.params['W2'], self.params['b2']
# Layer 1: Linear -> ReLU
z1 = X @ W1 + b1
h = np.maximum(0, z1)
# Layer 2: Linear
scores = h @ W2 + b2
cache = (X, z1, h, W1, W2)
return scores, cache
def loss(self, X, y, reg=0.0):
"""
Compute loss and gradients.
Args:
X: Input data (N, 784)
y: Labels (N,)
reg: Regularization strength
Returns:
loss: Scalar loss
grads: Dictionary of gradients
"""
scores, cache = self.forward(X)
X, z1, h, W1, W2 = cache
N = X.shape[0]
# Compute softmax loss
scores_shifted = scores - np.max(scores, axis=1, keepdims=True)
exp_scores = np.exp(scores_shifted)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
# Cross-entropy loss
correct_log_probs = -np.log(probs[range(N), y])
data_loss = np.sum(correct_log_probs) / N
# L2 regularization
reg_loss = 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
loss = data_loss + reg_loss
        # ========== BACKWARD PASS ==========
        # Backpropagate from the loss to every parameter.
# Gradient of loss w.r.t. scores
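        # For softmax + cross-entropy, dL/dscores = (probs - one_hot(y)) / N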
dscores = probs.copy()
dscores[range(N), y] -= 1
dscores /= N
# Backprop through layer 2
dW2 = h.T @ dscores
db2 = np.sum(dscores, axis=0)
dh = dscores @ W2.T
# Backprop through ReLU
dh[z1 <= 0] = 0
# Backprop through layer 1
dW1 = X.T @ dh
db1 = np.sum(dh, axis=0)
# Add regularization gradient
dW2 += reg * W2
dW1 += reg * W1
grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
return loss, grads
def predict(self, X):
"""
Make predictions.
Args:
X: Input data (N, 784)
Returns:
predictions: Predicted labels (N,)
"""
scores, _ = self.forward(X)
return np.argmax(scores, axis=1)
def train(self, X, y, X_val, y_val,
learning_rate=1e-3, reg=1e-5,
num_epochs=10, batch_size=200,
verbose=True):
"""
Train the network using mini-batch SGD.
Args:
X: Training data (N, 784)
y: Training labels (N,)
X_val: Validation data
y_val: Validation labels
learning_rate: Learning rate
reg: Regularization strength
num_epochs: Number of epochs
batch_size: Batch size
verbose: Print progress
Returns:
history: Dictionary containing training history
"""
num_train = X.shape[0]
iterations_per_epoch = max(num_train // batch_size, 1)
loss_history = []
train_acc_history = []
val_acc_history = []
for epoch in range(num_epochs):
# Shuffle training data
            indices = np.random.permutation(num_train)
for it in range(iterations_per_epoch):
# Sample minibatch
batch_indices = indices[it * batch_size:(it + 1) * batch_size]
X_batch = X[batch_indices]
y_batch = y[batch_indices]
# Compute loss and gradients
loss, grads = self.loss(X_batch, y_batch, reg)
loss_history.append(loss)
# Update parameters (SGD)
self.params['W1'] -= learning_rate * grads['W1']
self.params['b1'] -= learning_rate * grads['b1']
self.params['W2'] -= learning_rate * grads['W2']
self.params['b2'] -= learning_rate * grads['b2']
# Check accuracy
train_acc = np.mean(self.predict(X) == y)
val_acc = np.mean(self.predict(X_val) == y_val)
train_acc_history.append(train_acc)
val_acc_history.append(val_acc)
if verbose:
print(f'Epoch {epoch+1}/{num_epochs}: '
f'loss {loss:.4f}, '
f'train_acc {train_acc:.4f}, '
f'val_acc {val_acc:.4f}')
return {
'loss_history': loss_history,
'train_acc_history': train_acc_history,
'val_acc_history': val_acc_history
        }

Tasks
Task 1: Implement and Train
- Complete the forward pass
- Complete the backward pass (backpropagation)
- Train the network (a usage sketch follows this list)
- Achieve >95% test accuracy
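For orientation, here is a minimal end-to-end run, assuming X_train, y_train, X_val, y_val, X_test, and y_test from step 1; the hyperparameters are starting points, not tuned values:

# Minimal end-to-end run (a sketch; tune the hyperparameters in Task 4)
model = TwoLayerMNIST(hidden_dim=200)
history = model.train(X_train, y_train, X_val, y_val,
                      learning_rate=1e-3, reg=1e-4,
                      num_epochs=10, batch_size=200)
test_acc = np.mean(model.predict(X_test) == y_test)
print(f'Test accuracy: {test_acc:.4f}')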
Task 2: Implement Adam Optimizer
Add Adam optimizer support:
class AdamOptimizer:
def __init__(self, learning_rate=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
self.lr = learning_rate
self.beta1 = beta1
self.beta2 = beta2
self.epsilon = epsilon
self.m = {}
self.v = {}
self.t = 0
def update(self, params, grads):
if not self.m:
for key in params:
self.m[key] = np.zeros_like(params[key])
self.v[key] = np.zeros_like(params[key])
self.t += 1
for key in params:
self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key]**2)
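            # Bias-correct the running moments, which are initialized at zero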
m_hat = self.m[key] / (1 - self.beta1**self.t)
v_hat = self.v[key] / (1 - self.beta2**self.t)
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
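One way to wire this into training (a sketch; train_with_adam is a hypothetical helper, since the train() method above hard-codes the SGD update):

def train_with_adam(model, X, y, X_val, y_val,
                    learning_rate=1e-3, reg=1e-5,
                    num_epochs=10, batch_size=200):
    # Variant of train() that delegates parameter updates to AdamOptimizer.
    optimizer = AdamOptimizer(learning_rate=learning_rate)
    num_train = X.shape[0]
    for epoch in range(num_epochs):
        indices = np.random.permutation(num_train)
        for it in range(max(num_train // batch_size, 1)):
            batch = indices[it * batch_size:(it + 1) * batch_size]
            loss, grads = model.loss(X[batch], y[batch], reg)
            optimizer.update(model.params, grads)  # replaces the manual SGD step
        val_acc = np.mean(model.predict(X_val) == y_val)
        print(f'Epoch {epoch + 1}/{num_epochs}: loss {loss:.4f}, val_acc {val_acc:.4f}')

Compare its validation curve against plain SGD at the same learning rate; Adam typically needs fewer epochs on this problem.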
Task 3: Add Dropout

Implement dropout regularization:
    def forward_with_dropout(self, X, dropout_prob=0.5, train=True):
        # Implement dropout on the hidden layer (see the sketch below)
        pass
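For reference, a minimal inverted-dropout sketch, mirroring forward() above; it assumes dropout_prob is the probability of dropping a hidden unit:

    def forward_with_dropout(self, X, dropout_prob=0.5, train=True):
        # One possible inverted-dropout implementation (a sketch)
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        z1 = X @ W1 + b1
        h = np.maximum(0, z1)
        mask = None
        if train:
            # Inverted dropout: zero units with probability dropout_prob and
            # scale survivors by 1/(1 - dropout_prob), so the test-time
            # forward pass needs no change.
            mask = (np.random.rand(*h.shape) >= dropout_prob) / (1.0 - dropout_prob)
            h = h * mask
        scores = h @ W2 + b2
        cache = (X, z1, h, mask, W1, W2)
        return scores, cache

The matching backward pass must multiply the upstream gradient dh by the same mask before the ReLU step.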
Task 4: Hyperparameter Tuning

Find the best hyperparameters:
# Parameters to tune
hidden_dims = [100, 200, 500]
learning_rates = [1e-4, 5e-4, 1e-3, 5e-3]
regularizations = [0, 1e-5, 1e-4, 1e-3]
best_val_acc = 0
best_params = None
# Grid search or random search
for hidden_dim in hidden_dims:
for lr in learning_rates:
for reg in regularizations:
            # One possible completion: train a fresh model per setting and
            # keep whichever reaches the best validation accuracy
            # (num_epochs=5 is a hypothetical short budget for the search)
            model = TwoLayerMNIST(hidden_dim=hidden_dim)
            history = model.train(X_train, y_train, X_val, y_val,
                                  learning_rate=lr, reg=reg,
                                  num_epochs=5, verbose=False)
            val_acc = history['val_acc_history'][-1]
            if val_acc > best_val_acc:
                best_val_acc = val_acc
                best_params = {'hidden_dim': hidden_dim, 'lr': lr, 'reg': reg}

Performance Targets
- Minimum: 95% test accuracy
- Good: 97% test accuracy
- Excellent: 98% test accuracy
Debugging Checklist
- ✓ Gradient check passes (relative error < 1e-5; see the sketch after this list)
- ✓ Can overfit 10 training examples
- ✓ Initial loss ≈ ln(10) ≈ 2.30 (with random weights the softmax is near-uniform over 10 classes, so the cross-entropy is about -ln(1/10))
- ✓ Loss decreases over time
- ✓ Train/val accuracy curves look reasonable
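A minimal numerical gradient check, assuming the TwoLayerMNIST class above; grad_check is an illustrative helper, not part of the required API:

def grad_check(model, X, y, param_name, num_checks=5, h=1e-5):
    # Compare analytic gradients from loss() against centered finite
    # differences at a few randomly chosen entries of one parameter.
    _, grads = model.loss(X, y, reg=0.0)
    W = model.params[param_name]
    for _ in range(num_checks):
        ix = tuple(np.random.randint(n) for n in W.shape)
        old = W[ix]
        W[ix] = old + h
        loss_plus, _ = model.loss(X, y, reg=0.0)
        W[ix] = old - h
        loss_minus, _ = model.loss(X, y, reg=0.0)
        W[ix] = old  # restore the original value
        grad_numeric = (loss_plus - loss_minus) / (2 * h)
        grad_analytic = grads[param_name][ix]
        rel_error = (abs(grad_numeric - grad_analytic)
                     / max(abs(grad_numeric) + abs(grad_analytic), 1e-12))
        print(f'{param_name}{ix}: relative error {rel_error:.2e}')

model = TwoLayerMNIST(hidden_dim=10)
for name in ['W1', 'b1', 'W2', 'b2']:
    grad_check(model, X_train[:20], y_train[:20], name)

For the overfitting check, training on X_train[:10], y_train[:10] with reg=0 should reach 100% training accuracy within a few hundred iterations.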
Visualization
import matplotlib.pyplot as plt
# Plot training curves
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Loss curve
axes[0].plot(history['loss_history'])
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss')
# Accuracy curves
axes[1].plot(history['train_acc_history'], label='Train')
axes[1].plot(history['val_acc_history'], label='Validation')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Accuracy')
axes[1].legend()
plt.tight_layout()
plt.show()

Extensions (Optional)
After achieving good performance, try:
- Deeper network: Add more hidden layers
- Different activations: Try tanh, sigmoid, Leaky ReLU
- Batch normalization: Implement and compare
- Data augmentation: Random rotations, shifts
- Visualization: Visualize the learned first-layer weights (a sketch follows this list)
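A possible starting point for the visualization extension, assuming a trained model from Part 2 with hidden_dim >= 32 and the matplotlib import above:

# Each column of W1 holds one hidden unit's 784 input weights; reshaping a
# column to 28x28 shows the input pattern that unit responds to.
W1 = model.params['W1']  # shape (784, hidden_dim)
fig, axes = plt.subplots(4, 8, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(W1[:, i].reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.suptitle('First-layer weights (first 32 hidden units)')
plt.show()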
Related Concepts
- Multi-Layer Perceptrons - Network architecture
- Backpropagation - Training algorithm
- Optimization - SGD, Adam
- Regularization - Preventing overfitting
- Training Practices - Debugging and tuning
Next Steps
After completing MNIST:
- Move to CNNs for computer vision
- Study attention mechanisms for NLP
- Learn frameworks (PyTorch, TensorFlow) for production code