Dropout

Dropout is one of the most effective regularization techniques for neural networks. It prevents overfitting by randomly “dropping” (zeroing out) neurons during training.

Core Idea

During training, randomly drop neurons with probability p (typically 0.5):

  • Training: Each neuron is kept with probability (1 - p)
  • Testing: Use all neurons (no dropout)

Why it works:

  • Prevents co-adaptation: Neurons can’t rely on specific other neurons
  • Ensemble effect: Training many “thinned” networks, averaging at test time
  • Reduces overfitting: Forces distributed representations

Implementation: Inverted Dropout

The standard approach is inverted dropout, which scales during training rather than testing:

Training Phase

# Forward pass with dropout (training)
dropout_prob = 0.5
mask = (np.random.rand(*h.shape) > dropout_prob) / (1 - dropout_prob)
h = h * mask

The division by (1 - p) keeps the expected value of each activation the same as it would be without dropout. This is crucial: it means we don't need any scaling at test time.
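To see this numerically, the short check below (a standalone sketch with made-up activations, separate from the network code in this section) compares the mean activation with and without an inverted-dropout mask; over many samples the two means agree.

import numpy as np

np.random.seed(0)
dropout_prob = 0.5
h = np.random.rand(1000, 100)   # fake hidden activations in [0, 1)

# Inverted dropout: drop with probability p, scale survivors by 1 / (1 - p)
mask = (np.random.rand(*h.shape) > dropout_prob) / (1 - dropout_prob)
h_dropped = h * mask

print(h.mean())          # ~0.5
print(h_dropped.mean())  # also ~0.5: the expectation is preserved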

Testing Phase

# Forward pass without dropout (testing)
h = h  # Use full network, no scaling needed!

At test time, simply use all neurons with no modifications.

Complete Forward and Backward Pass

Forward Pass with Dropout

def forward_with_dropout(self, X, dropout_prob=0.5, train=True):
    """
    Forward pass with optional dropout.

    Args:
        X: Input data (N, D)
        dropout_prob: Probability of dropping a neuron
        train: If True, apply dropout; if False, use all neurons

    Returns:
        scores: Class scores (N, C)
        cache: Values needed for backward pass
    """
    W1, b1 = self.params['W1'], self.params['b1']
    W2, b2 = self.params['W2'], self.params['b2']

    # Layer 1
    z1 = X @ W1 + b1
    h = np.maximum(0, z1)  # ReLU

    # Dropout on hidden layer
    dropout_mask = None
    if train:
        # Inverted dropout: drop and scale during training
        dropout_mask = (np.random.rand(*h.shape) > dropout_prob) / (1 - dropout_prob)
        h = h * dropout_mask
    # At test time: use all neurons, no scaling needed

    # Layer 2
    scores = h @ W2 + b2

    cache = (X, z1, h, W1, W2, dropout_mask)
    return scores, cache

Backward Pass with Dropout

def backward_with_dropout(self, dscores, cache):
    """Backward pass through dropout."""
    X, z1, h, W1, W2, dropout_mask = cache

    # Backprop through layer 2
    dW2 = h.T @ dscores
    db2 = np.sum(dscores, axis=0)
    dh = dscores @ W2.T

    # Backprop through dropout
    if dropout_mask is not None:
        dh = dh * dropout_mask  # Zero out same neurons

    # Backprop through ReLU
    dh[z1 <= 0] = 0

    # Backprop through layer 1
    dW1 = X.T @ dh
    db1 = np.sum(dh, axis=0)

    grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
    return grads
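The dropout backward rule itself is easy to sanity-check in isolation. The snippet below is a standalone sketch (a fixed mask and a toy loss, not part of the class above): for loss = sum(h * mask), the analytic gradient with respect to h is just the mask, and a numerical gradient should match it almost exactly.

import numpy as np

np.random.seed(1)
p = 0.5
h = np.random.randn(4, 5)
mask = (np.random.rand(*h.shape) > p) / (1 - p)   # fixed inverted-dropout mask

def loss(h):
    return np.sum(h * mask)   # toy loss through the dropout layer

# Analytic gradient of the toy loss w.r.t. h is just the mask
analytic = mask

# Numerical gradient by central differences
eps = 1e-5
numerical = np.zeros_like(h)
for idx in np.ndindex(*h.shape):
    old = h[idx]
    h[idx] = old + eps; plus = loss(h)
    h[idx] = old - eps; minus = loss(h)
    h[idx] = old
    numerical[idx] = (plus - minus) / (2 * eps)

print(np.max(np.abs(analytic - numerical)))  # ~1e-10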

Where to Apply Dropout

Best Practices:

  • Fully-connected layers: Typical p = 0.5 (drop half the neurons)
  • Convolutional layers: Use sparingly, if at all (typically p = 0.1 if used)
  • Output layer: Never apply dropout
  • RNNs: Use special variants (variational dropout)

Why not in conv layers?

  • Convolutional layers have built-in regularization through weight sharing
  • Spatial dropout is used instead for CNNs (drops entire feature maps)

Tuning Dropout Rate

Guidelines:

  • Start with p = 0.5 for hidden layers
  • Increase p if still overfitting (more regularization)
  • Decrease p if underfitting (less regularization)
  • Use validation set to tune

Rule of thumb:

  • Small datasets or deep networks → higher dropout (p = 0.7)
  • Large datasets or shallow networks → lower dropout (p = 0.3)

# Experiment with different dropout rates
dropout_rates = [0.0, 0.3, 0.5, 0.7, 0.9]
results = {}

for p in dropout_rates:
    net = TwoLayerNet(input_dim, hidden_dim, output_dim)
    history = train(net, X_train, y_train, X_val, y_val,
                    dropout_prob=p, num_epochs=100)
    results[f'dropout={p}'] = history

# Plot validation accuracy
for name, history in results.items():
    plt.plot(history['val_acc_history'], label=name)
plt.xlabel('Epoch')
plt.ylabel('Validation Accuracy')
plt.legend()
plt.title('Effect of Dropout Rate')

Why Inverted Dropout?

Alternative: Standard Dropout

  • Drop during training, scale by (1 - p) during testing
  • Problem: Must remember to scale at test time
  • Risk of bugs in deployment

Inverted Dropout (preferred):

  • Scale during training, no modifications at test time
  • Test time code is simpler and faster
  • Less error-prone in production
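
As a side-by-side sketch (toy functions written for this comparison, not taken from the text above), the two schemes differ only in where the (1 - p) factor lives:

import numpy as np

def standard_dropout(h, p, train):
    if train:
        return h * (np.random.rand(*h.shape) > p)   # drop only
    return h * (1 - p)   # must remember to scale at test time

def inverted_dropout(h, p, train):
    if train:
        return h * (np.random.rand(*h.shape) > p) / (1 - p)   # drop and scale
    return h   # test time: identity, nothing to remember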

Dropout as Ensemble

Dropout can be interpreted as training an ensemble of exponentially many thinned networks:

  • A network with n neurons has 2^n possible dropout masks
  • Each mask defines a different network
  • Training samples different networks on each batch
  • Test time approximates averaging all these networks

This explains dropout’s effectiveness: it’s like training and averaging many networks!
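
A tiny enumeration makes the averaging claim concrete. For a single linear layer with p = 0.5 (so every mask is equally likely), averaging the outputs over all 2^n masks exactly equals scaling the activations by the keep probability (1 - p); with nonlinearities the equivalence is only approximate. This is a toy illustration with made-up shapes, separate from the network code above.

import itertools
import numpy as np

np.random.seed(2)
n = 4                        # tiny hidden layer: 2^4 = 16 masks
h = np.random.randn(1, n)    # activations
W = np.random.randn(n, 3)    # output weights
p = 0.5

# Average the "standard dropout" output over all 2^n binary masks
outputs = []
for bits in itertools.product([0, 1], repeat=n):
    mask = np.array(bits)
    outputs.append((h * mask) @ W)
ensemble_avg = np.mean(outputs, axis=0)

# Test-time approximation: scale activations by the keep probability (1 - p)
scaled = ((1 - p) * h) @ W

print(np.allclose(ensemble_avg, scaled))  # True for a linear layer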

Monte Carlo Dropout

For uncertainty estimation, you can use dropout at test time:

# Make predictions with dropout (multiple forward passes)
n_samples = 100
predictions = []

for _ in range(n_samples):
    # forward_with_dropout returns (scores, cache); keep only the scores
    scores, _ = net.forward_with_dropout(X_test, dropout_prob=0.5, train=True)
    predictions.append(scores)

# Average predictions
mean_pred = np.mean(predictions, axis=0)
std_pred = np.std(predictions, axis=0)   # Uncertainty estimate

This gives uncertainty estimates, useful for safety-critical applications.

Dropout Variants

Spatial Dropout (for CNNs):

  • Drop entire feature maps instead of individual neurons
  • Better for convolutional layers
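
A minimal NumPy sketch of the channel-wise masking idea (the function name and shapes are illustrative, not a particular library's API): one Bernoulli draw per (sample, channel), broadcast over the spatial dimensions.

import numpy as np

def spatial_dropout(x, p=0.1, train=True):
    """Inverted spatial dropout for feature maps of shape (N, C, H, W):
    drops whole channels rather than individual activations."""
    if not train:
        return x
    N, C, H, W = x.shape
    mask = (np.random.rand(N, C, 1, 1) > p) / (1 - p)
    return x * mask

x = np.random.randn(2, 8, 16, 16)   # made-up batch of feature maps
out = spatial_dropout(x, p=0.1, train=True)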

Variational Dropout (for RNNs):

  • Use same dropout mask across all time steps
  • Prevents network from learning to ignore dropout
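
A minimal sketch of the "one mask per sequence" idea in a toy vanilla RNN (names and shapes are illustrative, and only hidden-to-hidden dropout is shown):

import numpy as np

def rnn_forward_variational_dropout(x_seq, h0, Wx, Wh, p=0.3, train=True):
    """Toy RNN where the hidden-state dropout mask is sampled once per
    sequence and reused at every time step, instead of resampled each step."""
    T, D = x_seq.shape
    h = h0
    mask = (np.random.rand(*h.shape) > p) / (1 - p) if train else 1.0
    hs = []
    for t in range(T):
        h = np.tanh(x_seq[t] @ Wx + (h * mask) @ Wh)   # same mask every step
        hs.append(h)
    return np.stack(hs)

# Made-up shapes for illustration
T, D, H = 5, 4, 3
out = rnn_forward_variational_dropout(
    np.random.randn(T, D), np.zeros(H),
    np.random.randn(D, H), np.random.randn(H, H), p=0.3, train=True)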

DropConnect:

  • Drop weights instead of activations
  • Similar effect, less commonly used
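
A simplified sketch of weight-level masking (reusing the inverted-scaling convention from this page for convenience; the function name and shapes are illustrative):

import numpy as np

def dropconnect_linear(x, W, b, p=0.5, train=True):
    """Toy DropConnect layer: mask individual weights rather than activations."""
    if train:
        W = W * (np.random.rand(*W.shape) > p) / (1 - p)
    return x @ W + b

out = dropconnect_linear(np.random.randn(8, 10), np.random.randn(10, 5),
                         np.zeros(5), p=0.5, train=True)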

Combining with Other Regularization

Dropout works well with:

  • L2 regularization: Use both together
  • Batch normalization: Some debate, but generally compatible
  • Data augmentation: Complementary effects

Typical stack:

# Modern fully-connected layer
x = linear(x)
x = batch_norm(x)
x = relu(x)
x = dropout(x, p=0.5)

Next Steps

  1. Understand the bias-variance tradeoff
  2. Learn about batch normalization
  3. Implement in MNIST example
  4. Explore advanced regularization in training practices