Pooling Layers
Pooling operations reduce the spatial dimensions of feature maps while retaining the most important information. They provide a degree of translation invariance and improve computational efficiency in CNNs.
Max Pooling
Max pooling takes the maximum value within each pooling window, preserving the strongest activations:
import torch.nn as nn
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Reduces spatial dimensions by 2×
# Input: (batch, channels, H, W)
# Output: (batch, channels, H/2, W/2)
Example
For a 2×2 max pooling operation with stride 2:
Input (4×4):        Output (2×2):
[1 3 2 4]
[5 6 8 7]     →     [6 8]
[2 1 4 3]           [5 9]
[5 2 9 6]
Each 2×2 region is replaced by its maximum value.
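The same example can be checked directly in PyTorch (a small sketch; the tensor values are taken from the worked example above):
import torch
import torch.nn as nn
# The 4×4 input from the example, shaped (batch=1, channels=1, H=4, W=4)
x = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 8., 7.],
                  [2., 1., 4., 3.],
                  [5., 2., 9., 6.]]).reshape(1, 1, 4, 4)
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(max_pool(x).squeeze())
# tensor([[6., 8.],
#         [5., 9.]])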
Average Pooling
Average pooling computes the mean value in each window:
# Average pooling - smooths features
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
# Global average pooling - reduces each channel to a single value
global_avg_pool = nn.AdaptiveAvgPool2d((1, 1))
# Input: (batch, channels, H, W)
# Output: (batch, channels, 1, 1)
Global average pooling is commonly used before the final classification layer in modern architectures, replacing fully-connected layers.
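A sketch of that pattern (the channel count of 512 and the 10 classes are illustrative assumptions, not from the text): a classification head built on global average pooling looks like this.
import torch
import torch.nn as nn
# Hypothetical head: 512 feature channels, 10 output classes
head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),  # (batch, 512, H, W) -> (batch, 512, 1, 1)
    nn.Flatten(),                  # (batch, 512, 1, 1) -> (batch, 512)
    nn.Linear(512, 10),            # (batch, 512) -> (batch, 10) class scores
)
features = torch.randn(8, 512, 7, 7)   # e.g. the last conv feature map
logits = head(features)                # shape: (8, 10)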
Why Pooling?
Translation Invariance
Small shifts in the input produce minimal changes in the output. If a feature moves by a few pixels, the maximum within the pooling region often remains the same, making detection more robust.
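A minimal way to see this (the tensor values are chosen purely for illustration): shift an input by one pixel and compare the pooled outputs.
import torch
import torch.nn as nn
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.zeros(1, 1, 8, 8)
x[0, 0, 3, 2] = 1.0                           # a single strong activation
x_shifted = torch.roll(x, shifts=1, dims=3)   # shift the whole map right by one pixel
# The peak stays inside the same 2×2 window, so the pooled output is unchanged.
# (A shift that crosses a window boundary would change it, hence "often".)
print(torch.equal(max_pool(x), max_pool(x_shifted)))  # True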
Computational Efficiency
Reducing spatial dimensions decreases:
- Memory requirements: Smaller feature maps
- Computation: Fewer operations in subsequent layers
- Parameters: Fewer weights needed in any subsequent fully-connected layers (a rough count follows this list)
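To put a rough number on the parameter point (the feature-map and layer sizes here are illustrative assumptions): connecting a flattened 64-channel map to 256 units needs about 4× fewer weights after a single 2×2 pooling step.
import torch.nn as nn
# Hypothetical sizes: a 64-channel feature map before and after 2×2 pooling
fc_before = nn.Linear(64 * 28 * 28, 256)   # flattened 28×28 map -> 256 units
fc_after = nn.Linear(64 * 14 * 14, 256)    # flattened 14×14 map -> 256 units
print(sum(p.numel() for p in fc_before.parameters()))  # 12845312
print(sum(p.numel() for p in fc_after.parameters()))   # 3211520 (≈ 4× fewer)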
Larger Receptive Fields
Pooling accelerates the growth of receptive fields. After a 2× downsampling, each pixel in the next layer corresponds to a 2× larger region in the previous layer.
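A small calculation makes this concrete (a sketch using the standard receptive-field recursion; the layer choices are illustrative): each layer grows the receptive field by (kernel_size − 1) times the current output stride ("jump"), and a stride-2 pooling layer doubles that jump for every layer after it.
# Receptive-field recursion: rf += (k - 1) * jump; jump *= stride
def receptive_field(layers):
    rf, jump = 1, 1
    for k, s in layers:            # (kernel_size, stride) per layer
        rf += (k - 1) * jump
        jump *= s
    return rf
print(receptive_field([(3, 1), (3, 1)]))          # 5: two stacked 3×3 convs
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: a 2×2 pool between them widens the second conv's reach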
Max vs Average Pooling
Max Pooling
- When: Convolutional layers in the network body
- Why: Preserves the strongest activations, good for detecting the presence of features
- Effect: Captures sharp, salient features
Average Pooling
- When: Final layers before classification
- Why: Provides smoother, more distributed representation
- Effect: Reduces overfitting by avoiding over-reliance on the strongest activations (contrast shown below)
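The contrast is easy to see on a toy tensor (values chosen for illustration): max pooling keeps a lone strong activation intact, while average pooling dilutes it across the window.
import torch
import torch.nn as nn
x = torch.tensor([[0., 0.],
                  [0., 8.]]).reshape(1, 1, 2, 2)   # one sharp activation in a 2×2 window
print(nn.MaxPool2d(2)(x).item())   # 8.0 - the peak survives
print(nn.AvgPool2d(2)(x).item())   # 2.0 - the peak is averaged away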
Global Average Pooling
- When: Replacing fully-connected layers at network end
- Why: Eliminates position-specific parameters, reduces overfitting
- Effect: Forces feature maps to be semantically meaningful at each spatial location
Used in modern architectures like ResNet, MobileNet, and EfficientNet.
Modern Alternatives
Many recent architectures use strided convolutions instead of pooling for downsampling:
# Instead of conv + pool:
conv_layer = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
pool_layer = nn.MaxPool2d(kernel_size=2, stride=2)
# Use strided conv:
strided_conv = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
Advantages of strided convolutions:
- Network learns the optimal downsampling operation
- Fewer hyperparameters (no separate pooling design)
- Used in architectures like ResNet variants
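As a quick sanity check (a sketch reusing the layers defined above; the 32×32 input size is an assumption), both paths downsample to the same spatial resolution:
import torch
import torch.nn as nn
conv_layer = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
pool_layer = nn.MaxPool2d(kernel_size=2, stride=2)
strided_conv = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 64, 32, 32)
print(pool_layer(conv_layer(x)).shape)  # torch.Size([1, 128, 16, 16])
print(strided_conv(x).shape)            # torch.Size([1, 128, 16, 16])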
Pooling is still used when:
- Computational efficiency is critical
- Simple, interpretable downsampling is desired
- Global aggregation is needed (global average pooling)
Parameter-Free Operation
Pooling layers have no learnable parameters. They are fixed operations defined by:
- kernel_size: Size of pooling window
- stride: Step size when moving the window
- padding: Border handling (rarely used with pooling)
This makes them computationally cheap and adds no parameters to the model.
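This is easy to confirm with a minimal check (the convolution is included only for comparison):
import torch.nn as nn
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(sum(p.numel() for p in pool.parameters()))  # 0 - no learnable parameters
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))  # 36928 - weights + biases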
Output Dimension Calculation
For an input of size H × W, a pooling layer with a given kernel_size, stride, and padding produces an output of size:
H_out = ⌊(H + 2 × padding − kernel_size) / stride⌋ + 1
W_out = ⌊(W + 2 × padding − kernel_size) / stride⌋ + 1
Common configuration (2×2 pool, stride 2, no padding): H_out = ⌊H/2⌋ and W_out = ⌊W/2⌋, halving each spatial dimension.
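A quick check of the formula in PyTorch (the 7×7 input size is chosen arbitrarily to show the floor behaviour):
import torch
import torch.nn as nn
x = torch.randn(1, 3, 7, 7)                   # odd spatial size
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # no padding
print(pool(x).shape)   # torch.Size([1, 3, 3, 3]): (7 + 0 - 2) // 2 + 1 = 3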
Related Concepts
- Convolution Operation - Primary spatial feature extraction in CNNs
Key Papers
- Gradient-Based Learning Applied to Document Recognition (LeCun et al., 1998) - Introduced pooling in modern CNNs
- Network In Network (Lin et al., 2013) - Introduced global average pooling
Learning Resources
Articles
- CS231n Pooling Layers - Stanford course material on pooling