Dropout

Dropout is one of the most effective regularization techniques for neural networks. It prevents overfitting by randomly “dropping” (zeroing out) neurons during training.

Core Idea

During training, randomly drop neurons with probability p (typically 0.5):

  • Training: Each neuron is kept with probability (1 - p)
  • Testing: Use all neurons (no dropout)

Why it works:

  • Prevents co-adaptation: Neurons can’t rely on specific other neurons
  • Ensemble effect: Training many “thinned” networks, averaging at test time
  • Reduces overfitting: Forces distributed representations

Implementation: Inverted Dropout

The standard approach is inverted dropout, which scales during training rather than testing:

Training Phase

# Forward pass with dropout (training)
dropout_prob = 0.5
mask = (np.random.rand(*h.shape) > dropout_prob) / (1 - dropout_prob)
h = h * mask

The division by (1 - p) keeps the expected value of each activation the same as it would be without dropout. This is crucial: it means we don't need any scaling at test time.
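To see this numerically, the short check below (a standalone sketch with made-up activations, separate from the network code in this section) compares the mean activation with and without an inverted-dropout mask; over many samples the two means agree.

import numpy as np

np.random.seed(0)
dropout_prob = 0.5
h = np.random.rand(1000, 100)   # fake hidden activations in [0, 1)

# Inverted dropout: drop with probability p, scale survivors by 1 / (1 - p)
mask = (np.random.rand(*h.shape) > dropout_prob) / (1 - dropout_prob)
h_dropped = h * mask

print(h.mean())          # ~0.5
print(h_dropped.mean())  # also ~0.5: the expectation is preserved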

Testing Phase

# Forward pass without dropout (testing)
h = h  # Use full network, no scaling needed!

At test time, simply use all neurons with no modifications.

Complete Forward and Backward Pass

Forward Pass with Dropout

def forward_with_dropout(self, X, dropout_prob=0.5, train=True):
    """
    Forward pass with optional dropout.

    Args:
        X: Input data (N, D)
        dropout_prob: Probability of dropping a neuron
        train: If True, apply dropout; if False, use all neurons

    Returns:
        scores: Class scores (N, C)
        cache: Values needed for backward pass
    """
    W1, b1 = self.params['W1'], self.params['b1']
    W2, b2 = self.params['W2'], self.params['b2']

    # Layer 1
    z1 = X @ W1 + b1
    h = np.maximum(0, z1)  # ReLU

    # Dropout on hidden layer
    dropout_mask = None
    if train:
        # Inverted dropout: drop and scale during training
        dropout_mask = (np.random.rand(*h.shape) > dropout_prob) / (1 - dropout_prob)
        h = h * dropout_mask
    # At test time: use all neurons, no scaling needed

    # Layer 2
    scores = h @ W2 + b2

    cache = (X, z1, h, W1, W2, dropout_mask)
    return scores, cache

Backward Pass with Dropout

def backward_with_dropout(self, dscores, cache):
    """Backward pass through dropout."""
    X, z1, h, W1, W2, dropout_mask = cache

    # Backprop through layer 2
    dW2 = h.T @ dscores
    db2 = np.sum(dscores, axis=0)
    dh = dscores @ W2.T

    # Backprop through dropout
    if dropout_mask is not None:
        dh = dh * dropout_mask  # Zero out same neurons

    # Backprop through ReLU
    dh[z1 <= 0] = 0

    # Backprop through layer 1
    dW1 = X.T @ dh
    db1 = np.sum(dh, axis=0)

    grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
    return grads
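The dropout backward rule itself is easy to sanity-check in isolation. The snippet below is a standalone sketch (a fixed mask and a toy loss, not part of the class above): for loss = sum(h * mask), the analytic gradient with respect to h is just the mask, and a numerical gradient should match it almost exactly.

import numpy as np

np.random.seed(1)
p = 0.5
h = np.random.randn(4, 5)
mask = (np.random.rand(*h.shape) > p) / (1 - p)   # fixed inverted-dropout mask

def loss(h):
    return np.sum(h * mask)   # toy loss through the dropout layer

# Analytic gradient of the toy loss w.r.t. h is just the mask
analytic = mask

# Numerical gradient by central differences
eps = 1e-5
numerical = np.zeros_like(h)
for idx in np.ndindex(*h.shape):
    old = h[idx]
    h[idx] = old + eps; plus = loss(h)
    h[idx] = old - eps; minus = loss(h)
    h[idx] = old
    numerical[idx] = (plus - minus) / (2 * eps)

print(np.max(np.abs(analytic - numerical)))  # ~1e-10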

Where to Apply Dropout

Best Practices:

  • Fully-connected layers: Typical p = 0.5 (drop half the neurons)
  • Convolutional layers: Use sparingly, if at all (typically p = 0.1 if used)
  • Output layer: Never apply dropout
  • RNNs: Use special variants (variational dropout)

Why not in conv layers?

  • Convolutional layers have built-in regularization through weight sharing
  • Spatial dropout is used instead for CNNs (drops entire feature maps)

Tuning Dropout Rate

Guidelines:

  • Start with p = 0.5 for hidden layers
  • Increase p if still overfitting (more regularization)
  • Decrease p if underfitting (less regularization)
  • Use validation set to tune

Rule of thumb:

  • Small datasets or deep networks → higher dropout (p = 0.7)
  • Large datasets or shallow networks → lower dropout (p = 0.3)

# Experiment with different dropout rates
dropout_rates = [0.0, 0.3, 0.5, 0.7, 0.9]
results = {}

for p in dropout_rates:
    net = TwoLayerNet(input_dim, hidden_dim, output_dim)
    history = train(net, X_train, y_train, X_val, y_val,
                    dropout_prob=p, num_epochs=100)
    results[f'dropout={p}'] = history

# Plot validation accuracy
for name, history in results.items():
    plt.plot(history['val_acc_history'], label=name)
plt.xlabel('Epoch')
plt.ylabel('Validation Accuracy')
plt.legend()
plt.title('Effect of Dropout Rate')

Why Inverted Dropout?

Alternative: Standard Dropout

  • Drop during training, scale by (1 - p) during testing
  • Problem: Must remember to scale at test time
  • Risk of bugs in deployment

Inverted Dropout (preferred):

  • Scale during training, no modifications at test time
  • Test time code is simpler and faster
  • Less error-prone in production
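
As a side-by-side sketch (toy functions written for this comparison, not taken from the text above), the two schemes differ only in where the (1 - p) factor lives:

import numpy as np

def standard_dropout(h, p, train):
    if train:
        return h * (np.random.rand(*h.shape) > p)   # drop only
    return h * (1 - p)   # must remember to scale at test time

def inverted_dropout(h, p, train):
    if train:
        return h * (np.random.rand(*h.shape) > p) / (1 - p)   # drop and scale
    return h   # test time: identity, nothing to remember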

Dropout as Ensemble

Dropout can be interpreted as training an ensemble of exponentially many thinned networks:

  • A network with n neurons has 2^n possible dropout masks
  • Each mask defines a different network
  • Training samples different networks on each batch
  • Test time approximates averaging all these networks

This explains dropout’s effectiveness: it’s like training and averaging many networks!
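
A tiny enumeration makes the averaging claim concrete. For a single linear layer with p = 0.5 (so every mask is equally likely), averaging the outputs over all 2^n masks exactly equals scaling the activations by the keep probability (1 - p); with nonlinearities the equivalence is only approximate. This is a toy illustration with made-up shapes, separate from the network code above.

import itertools
import numpy as np

np.random.seed(2)
n = 4                        # tiny hidden layer: 2^4 = 16 masks
h = np.random.randn(1, n)    # activations
W = np.random.randn(n, 3)    # output weights
p = 0.5

# Average the "standard dropout" output over all 2^n binary masks
outputs = []
for bits in itertools.product([0, 1], repeat=n):
    mask = np.array(bits)
    outputs.append((h * mask) @ W)
ensemble_avg = np.mean(outputs, axis=0)

# Test-time approximation: scale activations by the keep probability (1 - p)
scaled = ((1 - p) * h) @ W

print(np.allclose(ensemble_avg, scaled))  # True for a linear layer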

Monte Carlo Dropout

For uncertainty estimation, you can use dropout at test time:

# Make predictions with dropout (multiple forward passes)
n_samples = 100
predictions = []

for _ in range(n_samples):
    # forward_with_dropout returns (scores, cache); keep only the scores
    scores, _ = net.forward_with_dropout(X_test, dropout_prob=0.5, train=True)
    predictions.append(scores)

# Average predictions
mean_pred = np.mean(predictions, axis=0)
std_pred = np.std(predictions, axis=0)   # Uncertainty estimate

This gives uncertainty estimates, useful for safety-critical applications.

Dropout Variants

Spatial Dropout (for CNNs):

  • Drop entire feature maps instead of individual neurons
  • Better for convolutional layers
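
A minimal NumPy sketch of the channel-wise masking idea (the function name and shapes are illustrative, not a particular library's API): one Bernoulli draw per (sample, channel), broadcast over the spatial dimensions.

import numpy as np

def spatial_dropout(x, p=0.1, train=True):
    """Inverted spatial dropout for feature maps of shape (N, C, H, W):
    drops whole channels rather than individual activations."""
    if not train:
        return x
    N, C, H, W = x.shape
    mask = (np.random.rand(N, C, 1, 1) > p) / (1 - p)
    return x * mask

x = np.random.randn(2, 8, 16, 16)   # made-up batch of feature maps
out = spatial_dropout(x, p=0.1, train=True)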

Variational Dropout (for RNNs):

  • Use same dropout mask across all time steps
  • Prevents network from learning to ignore dropout
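
A minimal sketch of the "one mask per sequence" idea in a toy vanilla RNN (names and shapes are illustrative, and only hidden-to-hidden dropout is shown):

import numpy as np

def rnn_forward_variational_dropout(x_seq, h0, Wx, Wh, p=0.3, train=True):
    """Toy RNN where the hidden-state dropout mask is sampled once per
    sequence and reused at every time step, instead of resampled each step."""
    T, D = x_seq.shape
    h = h0
    mask = (np.random.rand(*h.shape) > p) / (1 - p) if train else 1.0
    hs = []
    for t in range(T):
        h = np.tanh(x_seq[t] @ Wx + (h * mask) @ Wh)   # same mask every step
        hs.append(h)
    return np.stack(hs)

# Made-up shapes for illustration
T, D, H = 5, 4, 3
out = rnn_forward_variational_dropout(
    np.random.randn(T, D), np.zeros(H),
    np.random.randn(D, H), np.random.randn(H, H), p=0.3, train=True)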

DropConnect:

  • Drop weights instead of activations
  • Similar effect, less commonly used
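
A simplified sketch of weight-level masking (reusing the inverted-scaling convention from this page for convenience; the function name and shapes are illustrative):

import numpy as np

def dropconnect_linear(x, W, b, p=0.5, train=True):
    """Toy DropConnect layer: mask individual weights rather than activations."""
    if train:
        W = W * (np.random.rand(*W.shape) > p) / (1 - p)
    return x @ W + b

out = dropconnect_linear(np.random.randn(8, 10), np.random.randn(10, 5),
                         np.zeros(5), p=0.5, train=True)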

Combining with Other Regularization

Dropout works well with:

  • L2 regularization: Use both together
  • Batch normalization: Some debate, but generally compatible
  • Data augmentation: Complementary effects

Typical stack:

# Modern fully-connected layer
x = linear(x)
x = batch_norm(x)
x = relu(x)
x = dropout(x, p=0.5)

Next Steps

  1. Understand the bias-variance tradeoff
  2. Learn about batch normalization
  3. Implement in MNIST example
  4. Explore advanced regularization in training practices