Batch Normalization
Batch normalization (BatchNorm) is one of the most important innovations in deep learning, dramatically accelerating training and improving stability by normalizing layer inputs across mini-batches.
First introduced: Ioffe & Szegedy (2015) - “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”
The Problem: Internal Covariate Shift
Challenge: During training, layer input distributions change as parameters update
- Early layers update → later layer inputs shift
- Each layer must continuously adapt to changing inputs
- Training becomes slow and unstable
- Requires careful learning rate tuning
Example: Deep CNN training
Layer 5 input distribution at epoch 1: mean=0.5, std=1.2
Layer 5 input distribution at epoch 10: mean=2.1, std=0.3
→ Layer 5 must adapt to this shift instead of learning features (the sketch below shows one way to monitor a layer's input statistics with a forward hook)
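A minimal sketch of such monitoring, assuming a toy MLP and an arbitrary layer index (neither comes from the example above):

import torch
import torch.nn as nn

# Toy MLP; watch the input distribution of the Linear layer at index 2
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 10),
)

def log_input_stats(module, inputs, output):
    x = inputs[0]
    print(f"mean={x.mean().item():.2f}, std={x.std().item():.2f}")

model[2].register_forward_hook(log_input_stats)
model(torch.randn(32, 128))  # prints the input statistics for this mini-batch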
How Batch Normalization Works
For each mini-batch of examples, BatchNorm performs three steps:
1. Compute Batch Statistics
Calculate the mean and variance across the batch:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$$

where $m$ is the batch size.
2. Normalize
Normalize each value using the batch statistics:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

The $\epsilon$ (typically $10^{-5}$) prevents division by zero.
3. Scale and Shift
Apply learnable parameters $\gamma$ (scale) and $\beta$ (shift):

$$y_i = \gamma \hat{x}_i + \beta$$

Why learnable parameters? They let the network undo the normalization when the raw activations are the better representation (setting $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$ recovers the identity). A minimal sketch of all three steps follows.
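Putting the three steps together, a minimal sketch of the training-time forward pass for inputs of shape [batch, features]; the function name and shapes are illustrative:

import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Steps 1-3 for a [batch, features] tensor (training mode, no running stats)."""
    mu = x.mean(dim=0)                          # 1. batch mean per feature
    var = x.var(dim=0, unbiased=False)          # 1. batch variance per feature
    x_hat = (x - mu) / torch.sqrt(var + eps)    # 2. normalize
    return gamma * x_hat + beta                 # 3. scale and shift

x = torch.randn(32, 256)
y = batchnorm_forward(x, gamma=torch.ones(256), beta=torch.zeros(256))
print(y.mean().item(), y.std().item())          # ≈ 0 and ≈ 1 with default gamma/beta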
Implementation in PyTorch
For Convolutional Layers
import torch
import torch.nn as nn
# Standard conv block with BatchNorm
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
bn = nn.BatchNorm2d(num_features=128) # Normalize 128 feature maps
relu = nn.ReLU()
# Typical ordering: Conv → BatchNorm → Activation
x = torch.randn(32, 64, 28, 28) # batch_size=32, channels=64, H=W=28
x = conv(x) # [32, 128, 28, 28]
x = bn(x) # [32, 128, 28, 28] - normalized
x = relu(x) # [32, 128, 28, 28] - activated
For Fully-Connected Layers
# MLP block with BatchNorm
fc = nn.Linear(512, 256)
bn = nn.BatchNorm1d(num_features=256) # For 1D features
relu = nn.ReLU()
# Forward pass
x = torch.randn(32, 512) # batch_size=32, features=512
x = fc(x) # [32, 256]
x = bn(x) # [32, 256] - normalized
x = relu(x) # [32, 256] - activated
Complete CNN with BatchNorm
class ConvBlock(nn.Module):
def __init__(self, in_channels, out_channels):
super().__init__()
self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)
self.bn = nn.BatchNorm2d(out_channels)
self.relu = nn.ReLU()
def forward(self, x):
return self.relu(self.bn(self.conv(x)))
class ResidualBlock(nn.Module):
"""Modern residual block with BatchNorm"""
def __init__(self, channels):
super().__init__()
self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
self.bn1 = nn.BatchNorm2d(channels)
self.relu = nn.ReLU()
self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
self.bn2 = nn.BatchNorm2d(channels)
def forward(self, x):
identity = x
out = self.conv1(x)
out = self.bn1(out) # BatchNorm after conv1
out = self.relu(out)
out = self.conv2(out)
out = self.bn2(out) # BatchNorm after conv2
out += identity # Residual connection
out = self.relu(out)
        return out
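As a quick usage check (batch size and image size below are arbitrary), both blocks preserve spatial dimensions:

block = ConvBlock(in_channels=3, out_channels=64)
res = ResidualBlock(channels=64)

x = torch.randn(8, 3, 32, 32)
x = block(x)   # [8, 64, 32, 32]
x = res(x)     # [8, 64, 32, 32] - residual block preserves the shape
print(x.shape)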
Why Batch Normalization Works
Multiple complementary explanations:
1. Reduces Internal Covariate Shift (Original Paper)
- Stabilizes input distributions to each layer
- Each layer can learn independently
- Less sensitivity to parameter initialization
2. Smooths the Loss Landscape (Recent Research)
- Makes optimization easier (Santurkar et al., 2018)
- Loss surface becomes smoother
- Gradients more predictive
- Allows higher learning rates
3. Acts as Regularization
- Noise from batch statistics has regularizing effect
- Similar to dropout
- Reduces overfitting
- Sometimes can replace explicit regularization
4. Enables Higher Learning Rates
- Training can use learning rates 10-100x higher
- Faster convergence
- Models often converge in 1/3 to 1/10 as many epochs
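As an illustration of the last point, a hedged sketch of an optimizer setup for a BatchNorm network; the architecture and learning rates are illustrative, not tuned values:

import torch.nn as nn
import torch.optim as optim

# Hypothetical small CNN with BatchNorm after each convolution
model_with_bn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
)

optimizer = optim.SGD(model_with_bn.parameters(), lr=0.1, momentum=0.9)
# Without BatchNorm, a smaller rate (e.g. lr=0.01 or less) is usually needed for stable training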
Training vs Inference
Critical distinction: BatchNorm behaves differently during training and inference.
Training Mode
Uses batch statistics:
model.train() # Sets model to training mode
# Compute mean and variance from current batch
mu_batch = x.mean(dim=(0, 2, 3), keepdim=True)
var_batch = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
# Normalize using batch statistics
x_norm = (x - mu_batch) / torch.sqrt(var_batch + eps)
# Update running statistics (exponential moving average, PyTorch convention)
running_mean = (1 - momentum) * running_mean + momentum * mu_batch
running_var = (1 - momentum) * running_var + momentum * var_batch
Inference Mode
Uses running statistics (accumulated during training):
model.eval() # Sets model to evaluation mode
# Use running statistics (no batch dependency)
x_norm = (x - running_mean) / torch.sqrt(running_var + eps)
# Apply learned scale and shift
y = gamma * x_norm + beta
Why this matters:
- Inference can process single examples (batch_size=1)
- Results are deterministic (no batch dependency)
- Running statistics represent entire training dataset
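A minimal sketch of the practical consequence: in eval mode, a single example produces a deterministic output because only the running statistics are used (the toy model below is illustrative):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
model.eval()  # switch BatchNorm to running statistics

with torch.no_grad():
    single = torch.randn(1, 3, 28, 28)   # batch_size=1 is fine at inference
    out1 = model(single)
    out2 = model(single)
print(torch.allclose(out1, out2))        # True - no dependence on batch statistics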
BatchNorm Placement
Standard Pattern: Conv → BN → ReLU
# Most common: BatchNorm before activation
x = conv(x)
x = bn(x)
x = relu(x)
Rationale:
- Normalize before nonlinearity
- Prevents activation saturation
- Works well in practice
Alternative: Conv → ReLU → BN
Less common, but sometimes used in specific architectures.
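For completeness, a sketch of this alternative ordering, assuming conv, bn, and relu are defined as in the standard pattern above:

# Conv → ReLU → BN: normalize the activation outputs instead of the pre-activations
x = conv(x)
x = relu(x)
x = bn(x)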
With Residual Connections
See ResNet for detailed usage in skip connections:
# Pre-activation residual block (He et al., 2016)
def forward(self, x):
identity = x
out = self.bn1(x) # BN before conv
out = self.relu(out)
out = self.conv1(out)
out = self.bn2(out) # BN before conv
out = self.relu(out)
out = self.conv2(out)
out += identity
    return out
Hyperparameters
Momentum (for running statistics)
bn = nn.BatchNorm2d(channels, momentum=0.1) # Default in PyTorch
- Updates running mean/variance: running = (1 - momentum) * running + momentum * batch (PyTorch convention)
- Default: 0.1 in PyTorch; TensorFlow/Keras defines momentum the opposite way (it multiplies the running estimate, default 0.99), so values do not transfer directly
- Lower momentum (PyTorch): slower adaptation to recent batches
- Higher momentum (PyTorch): faster adaptation (more sensitive to recent data)
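A small sketch to confirm the PyTorch update rule: running_mean starts at zero, so after one training batch it should equal momentum times the batch mean (shapes below are arbitrary):

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(8, momentum=0.1)
x = torch.randn(32, 8, 16, 16)

bn.train()
bn(x)  # one training-mode forward pass updates the running statistics

batch_mean = x.mean(dim=(0, 2, 3))
expected = 0.1 * batch_mean  # (1 - 0.1) * 0 + 0.1 * batch_mean
print(torch.allclose(bn.running_mean, expected, atol=1e-6))  # True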
Epsilon (numerical stability)
bn = nn.BatchNorm2d(channels, eps=1e-5) # Default
- Prevents division by zero
- Rarely needs tuning
- Can increase (e.g., 1e-3) if numerical instability occurs
Batch Size Considerations
BatchNorm performance depends on batch size:
Large Batches (32-256)
- ✅ Accurate batch statistics
- ✅ BatchNorm works well
- ✅ Stable training
Small Batches (2-8)
- ⚠ Noisy batch statistics
- ⚠ BatchNorm may hurt performance
- ⚠ Consider alternatives (GroupNorm, LayerNorm)
Example: Medical imaging with large 3D volumes
# Small batch size due to GPU memory constraints
batch_size = 4 # Large 3D medical images
# GroupNorm instead of BatchNorm
gn = nn.GroupNorm(num_groups=8, num_channels=64)
# Normalizes within groups, independent of batch size
Alternatives to BatchNorm
When BatchNorm doesn’t fit (a short PyTorch sketch of each alternative follows this list):
Layer Normalization (LayerNorm)
- Normalizes across features (not batch)
- Used in transformers (see Transformer)
- Batch-size independent
- Better for NLP
Group Normalization (GroupNorm)
- Divides channels into groups
- Normalizes within each group
- Batch-size independent
- Good for small batches (medical imaging, video)
Instance Normalization (InstanceNorm)
- Normalizes each sample independently
- Used in style transfer
- Removes instance-specific contrast
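A minimal sketch of the corresponding PyTorch modules; the shapes and group count are illustrative (in transformers, LayerNorm is usually applied over the embedding dimension only):

import torch
import torch.nn as nn

x = torch.randn(4, 64, 32, 32)                    # small batch of feature maps

ln = nn.LayerNorm([64, 32, 32])                   # per-sample, over all feature dims
gn = nn.GroupNorm(num_groups=8, num_channels=64)  # per-sample, 8 groups of 8 channels
inorm = nn.InstanceNorm2d(64)                     # per-sample, per-channel

print(ln(x).shape, gn(x).shape, inorm(x).shape)   # all torch.Size([4, 64, 32, 32])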
Common Pitfalls
1. Forgetting train/eval Mode
# WRONG: Evaluate with training mode
model.train()
with torch.no_grad():
predictions = model(test_data) # ❌ Uses batch statistics!
# CORRECT: Set eval mode
model.eval()
with torch.no_grad():
    predictions = model(test_data) # ✓ Uses running statistics
2. Small Batch Sizes
# Problematic
batch_size = 2 # Too small for accurate statistics
train_loader = DataLoader(dataset, batch_size=2)
# Better: Use GroupNorm or LayerNorm
# Or increase batch size with gradient accumulation
3. BatchNorm with Dropout
# Generally avoid using both
x = conv(x)
x = bn(x)
x = relu(x)
x = dropout(x) # ⚠ Usually not needed, BN has regularization effect
# Modern approach: Use only BatchNorm
x = conv(x)
x = bn(x)
x = relu(x)
Impact on Modern Architectures
ResNet (2015): Enabled training 100+ layer networks
- BatchNorm after every convolution
- See ResNet paper
EfficientNet, MobileNet: Essential for mobile CNNs
- BatchNorm for efficiency
- Enables deployment on edge devices
Vision Transformers: Less critical (use LayerNorm)
- Transformers prefer LayerNorm for NLP compatibility
- BatchNorm can work but LayerNorm is standard
Practical Guidelines
When to Use BatchNorm
✅ Use BatchNorm when:
- Training CNNs
- Batch size ≥ 16
- Training on GPU (fast batch stats computation)
- Want faster training
- Need stability for deep networks
When to Consider Alternatives
⚠ Consider LayerNorm/GroupNorm when:
- Batch size < 16
- Recurrent models (RNNs, LSTMs)
- Transformers (LayerNorm is standard)
- Online learning (single examples)
- Federated learning (inconsistent batch sizes)
Training Tips
- Use BatchNorm after conv layers: Standard modern practice
- Place before activation: Conv → BN → ReLU
- Set eval mode for inference: model.eval()
- Larger learning rates: BN allows 10-100x higher LR
- May reduce need for dropout: BN has regularization effect
Key Takeaways
- BatchNorm normalizes layer inputs using batch statistics (training) or running statistics (inference)
- Enables faster training through higher learning rates and improved gradient flow
- Acts as regularization reducing overfitting (sometimes replacing dropout)
- Critical for deep networks especially CNNs with 50+ layers
- Training/eval modes differ - always set model.eval() for inference
- Batch size matters - works best with batches of 16-256
- Standard in modern CNNs - used in ResNet, EfficientNet, MobileNet, etc.
Related Concepts
- Training Practices - Weight init and LR selection
- Practical Training Techniques - Warmup, gradient clipping, mixed precision
- Regularization - Other techniques to prevent overfitting
- Dropout - Alternative regularization (often replaced by BatchNorm)
- ResNet - Heavy use of BatchNorm for 100+ layer networks
- Optimization - Learning rate and optimizer selection
Further Reading
Original Paper:
- “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” (Ioffe & Szegedy, 2015)
Understanding BatchNorm:
- “How Does Batch Normalization Help Optimization?” (Santurkar et al., 2018) - Loss landscape smoothing
- “Group Normalization” (Wu & He, 2018) - Alternative for small batches
- “Layer Normalization” (Ba et al., 2016) - Batch-independent alternative
Practical Guides:
- PyTorch BatchNorm documentation
- BatchNorm in ResNet (He et al., 2016)
- Understanding train/eval modes