Dropout
Dropout is one of the most effective regularization techniques for neural networks. It prevents overfitting by randomly “dropping” (zeroing out) neurons during training.
Core Idea
During training, randomly drop neurons with probability p (typically p = 0.5):
- Training: Each neuron is kept with probability 1 - p
- Testing: Use all neurons (no dropout)
Why it works:
- Prevents co-adaptation: Neurons can’t rely on specific other neurons
- Ensemble effect: Training many “thinned” networks, averaging at test time
- Reduces overfitting: Forces distributed representations
Implementation: Inverted Dropout
The standard approach is inverted dropout, which scales during training rather than testing:
Training Phase
# Forward pass with dropout (training)
dropout_prob = 0.5
mask = (np.random.rand(*h.shape) > dropout_prob) / (1 - dropout_prob)
h = h * mask

The division by (1 - dropout_prob) ensures expected values remain the same. This is crucial - it means we don't need scaling at test time.
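A quick numerical check (a minimal sketch, assuming NumPy is imported as np) shows that the scaling built into the mask preserves the expected activation:

import numpy as np

np.random.seed(0)
h = np.array([1.0, 2.0, 3.0])
dropout_prob = 0.5

# Apply the inverted-dropout mask many times and average the results
masked = [
    h * ((np.random.rand(*h.shape) > dropout_prob) / (1 - dropout_prob))
    for _ in range(100000)
]
print(np.mean(masked, axis=0))  # close to [1. 2. 3.]: the expectation is preserved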
Testing Phase
# Forward pass without dropout (testing)
h = h  # Use full network, no scaling needed!

At test time, simply use all neurons with no modifications.
Complete Forward and Backward Pass
Forward Pass with Dropout
def forward_with_dropout(self, X, dropout_prob=0.5, train=True):
    """
    Forward pass with optional dropout.

    Args:
        X: Input data (N, D)
        dropout_prob: Probability of dropping a neuron
        train: If True, apply dropout; if False, use all neurons

    Returns:
        scores: Class scores (N, C)
        cache: Values needed for backward pass
    """
    W1, b1 = self.params['W1'], self.params['b1']
    W2, b2 = self.params['W2'], self.params['b2']

    # Layer 1
    z1 = X @ W1 + b1
    h = np.maximum(0, z1)  # ReLU

    # Dropout on hidden layer
    dropout_mask = None
    if train:
        # Inverted dropout: drop and scale during training
        dropout_mask = (np.random.rand(*h.shape) > dropout_prob) / (1 - dropout_prob)
        h = h * dropout_mask
    # At test time: use all neurons, no scaling needed

    # Layer 2
    scores = h @ W2 + b2

    cache = (X, z1, h, W1, W2, dropout_mask)
    return scores, cache

Backward Pass with Dropout
def backward_with_dropout(self, dscores, cache):
    """Backward pass through dropout."""
    X, z1, h, W1, W2, dropout_mask = cache

    # Backprop through layer 2
    dW2 = h.T @ dscores
    db2 = np.sum(dscores, axis=0)
    dh = dscores @ W2.T

    # Backprop through dropout
    if dropout_mask is not None:
        dh = dh * dropout_mask  # Zero out same neurons

    # Backprop through ReLU
    dh[z1 <= 0] = 0

    # Backprop through layer 1
    dW1 = X.T @ dh
    db1 = np.sum(dh, axis=0)

    grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
    return grads

Where to Apply Dropout
Best Practices:
- Fully-connected layers: Typically p = 0.5 (drop half the neurons); see the sketch after this list
- Convolutional layers: Use sparingly, if at all (with a lower rate if used)
- Output layer: Never apply dropout
- RNNs: Use special variants (variational dropout)
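As a concrete illustration (a hedged sketch in PyTorch, which these notes do not otherwise use; the layer sizes are arbitrary), dropout sits after the hidden fully-connected activations and never on the output layer, and switching between train() and eval() handles the test-time behavior:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # dropout on first hidden layer
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # dropout on second hidden layer
    nn.Linear(256, 10),  # output layer: no dropout
)
model.train()  # dropout active (PyTorch uses inverted dropout: scaling happens here)
model.eval()   # dropout disabled at test time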
Why not in conv layers?
- Convolutional layers have built-in regularization through weight sharing
- Spatial dropout is used instead for CNNs (drops entire feature maps); a sketch follows below
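A rough NumPy sketch of spatial dropout (the helper name and rate are illustrative, not a library API):

import numpy as np

def spatial_dropout(x, drop_prob=0.2):
    """Inverted spatial dropout for conv activations shaped (N, C, H, W)."""
    # One keep/drop decision per (sample, channel), broadcast over H and W
    mask = np.random.rand(x.shape[0], x.shape[1], 1, 1) > drop_prob
    return x * mask / (1 - drop_prob)

x = np.random.randn(4, 8, 16, 16)        # batch of feature maps
out = spatial_dropout(x, drop_prob=0.2)  # whole channels are zeroed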
Tuning Dropout Rate
Guidelines:
- Start with p = 0.5 for hidden layers
- Increase if still overfitting (more regularization)
- Decrease if underfitting (less regularization)
- Use validation set to tune
Rule of thumb:
- Small datasets or deep networks → higher dropout rate
- Large datasets or shallow networks → lower dropout rate
# Experiment with different dropout rates
dropout_rates = [0.0, 0.3, 0.5, 0.7, 0.9]
results = {}
for p in dropout_rates:
    net = TwoLayerNet(input_dim, hidden_dim, output_dim)
    history = train(net, X_train, y_train, X_val, y_val,
                    dropout_prob=p, num_epochs=100)
    results[f'dropout={p}'] = history

# Plot validation accuracy
for name, history in results.items():
    plt.plot(history['val_acc_history'], label=name)
plt.xlabel('Epoch')
plt.ylabel('Validation Accuracy')
plt.legend()
plt.title('Effect of Dropout Rate')

Why Inverted Dropout?
Alternative: Standard Dropout
- Drop neurons with probability p during training, scale activations by the keep probability (1 - p) at test time
- Problem: Must remember to scale at test time
- Risk of bugs in deployment
Inverted Dropout (preferred):
- Scale during training, no modifications at test time
- Test time code is simpler and faster
- Less error-prone in production
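For contrast, a minimal sketch of the standard (non-inverted) formulation, with illustrative variable names:

import numpy as np

dropout_prob = 0.5
h = np.random.randn(4, 10)   # hidden activations (illustrative)

# Standard dropout, training: drop but do NOT scale
mask = np.random.rand(*h.shape) > dropout_prob
h_train = h * mask

# Standard dropout, testing: must remember to scale by the keep probability
h_test = h * (1 - dropout_prob)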
Dropout as Ensemble
Dropout can be interpreted as training an ensemble of exponentially many thinned networks:
- A network with n droppable neurons has 2^n possible dropout masks
- Each mask defines a different thinned network
- Each training batch samples and trains one of these networks
- Test time approximates averaging the predictions of all of them
This explains dropout’s effectiveness: it’s like training and averaging many networks!
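A small numerical sketch of this view (toy sizes, assuming NumPy): averaging the outputs of many thinned linear layers approaches a single pass with test-time scaling:

import numpy as np

np.random.seed(0)
h = np.random.randn(5)       # hidden activations of one example
W = np.random.randn(5, 3)    # output weights
p = 0.5                      # drop probability

# Sample many thinned networks (standard dropout: drop, no scaling)
masks = np.random.rand(200000, 5) > p
thinned_outputs = (masks * h) @ W      # one output row per thinned network
ensemble_avg = thinned_outputs.mean(axis=0)

# One pass with activations scaled by the keep probability (test-time behavior)
test_time = (h * (1 - p)) @ W

print(ensemble_avg)  # close to test_time
print(test_time)

For a purely linear layer the match is exact in expectation; with nonlinearities in between, the test-time network only approximates the ensemble average.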
Monte Carlo Dropout
For uncertainty estimation, you can use dropout at test time:
# Make predictions with dropout (multiple forward passes)
n_samples = 100
predictions = []
for _ in range(n_samples):
    scores, _ = net.forward_with_dropout(X_test, dropout_prob=0.5, train=True)
    predictions.append(scores)

# Average predictions
mean_pred = np.mean(predictions, axis=0)
std_pred = np.std(predictions, axis=0)  # Uncertainty estimate

This gives uncertainty estimates, useful for safety-critical applications.
Dropout Variants
Spatial Dropout (for CNNs):
- Drop entire feature maps instead of individual neurons
- Better for convolutional layers
Variational Dropout (for RNNs):
- Use same dropout mask across all time steps
- Prevents network from learning to ignore dropout
DropConnect:
- Drop weights instead of activations
- Similar effect, less commonly used; a sketch follows below
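As an example, a minimal DropConnect sketch (an illustrative helper, not a library API; the original method uses a more elaborate test-time approximation, while this sketch reuses the inverted-scaling idea):

import numpy as np

def dropconnect_linear(x, W, b, drop_prob=0.5, train=True):
    """Linear layer with DropConnect: individual weights are dropped."""
    if train:
        # Inverted scaling so expected pre-activations match test time
        W = W * (np.random.rand(*W.shape) > drop_prob) / (1 - drop_prob)
    return x @ W + b

x = np.random.randn(4, 10)
W = np.random.randn(10, 3)
b = np.zeros(3)
out = dropconnect_linear(x, W, b)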
Combining with Other Regularization
Dropout works well with:
- L2 regularization: Use both together
- Batch normalization: Some debate, but generally compatible
- Data augmentation: Complementary effects
Typical stack:
# Modern fully-connected layer
x = linear(x)
x = batch_norm(x)
x = relu(x)
x = dropout(x, p=0.5)

Learning Resources
Papers
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al., 2014) - Original paper
Related Concepts
- Regularization - General overfitting prevention
- Batch Normalization - Often used together
- Bias-Variance Tradeoff - Why regularization helps
Next Steps
- Understand the bias-variance tradeoff
- Learn about batch normalization
- Implement in MNIST example
- Explore advanced regularization in training practices