Pooling Layers
Pooling operations reduce the spatial dimensions of feature maps while retaining the most important information. They provide a degree of translation invariance and improve computational efficiency in CNNs.
Max Pooling
Max pooling takes the maximum value within each pooling window, preserving the strongest activations:
import torch.nn as nn
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Reduces spatial dimensions by 2×
# Input: (batch, channels, H, W)
# Output: (batch, channels, H/2, W/2)
Example
For a 2×2 max pooling operation with stride 2:
Input (4×4):        Output (2×2):
[1 3 2 4]
[5 6 8 7]     →     [6 8]
[2 1 4 3]           [5 9]
[5 2 9 6]
Each 2×2 region is replaced by its maximum value.
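The same example can be checked directly in PyTorch (a small sketch; the tensor values are taken from the worked example above):
import torch
import torch.nn as nn
# The 4×4 input from the example, shaped (batch=1, channels=1, H=4, W=4)
x = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 8., 7.],
                  [2., 1., 4., 3.],
                  [5., 2., 9., 6.]]).reshape(1, 1, 4, 4)
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(max_pool(x).squeeze())
# tensor([[6., 8.],
#         [5., 9.]])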
Average Pooling
Average pooling computes the mean value in each window:
# Average pooling - smooths features
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
# Global average pooling - reduces each channel to a single value
global_avg_pool = nn.AdaptiveAvgPool2d((1, 1))
# Input: (batch, channels, H, W)
# Output: (batch, channels, 1, 1)
Global average pooling is commonly used before the final classification layer in modern architectures, replacing fully-connected layers.
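A sketch of that pattern (the channel count of 512 and the 10 classes are illustrative assumptions, not from the text): a classification head built on global average pooling looks like this.
import torch
import torch.nn as nn
# Hypothetical head: 512 feature channels, 10 output classes
head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),  # (batch, 512, H, W) -> (batch, 512, 1, 1)
    nn.Flatten(),                  # (batch, 512, 1, 1) -> (batch, 512)
    nn.Linear(512, 10),            # (batch, 512) -> (batch, 10) class scores
)
features = torch.randn(8, 512, 7, 7)   # e.g. the last conv feature map
logits = head(features)                # shape: (8, 10)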
Why Pooling?
Translation Invariance
Small shifts in the input produce minimal changes in the output. If a feature moves by a few pixels, the maximum within the pooling region often remains the same, making detection more robust.
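A minimal way to see this (the tensor values are chosen purely for illustration): shift an input by one pixel and compare the pooled outputs.
import torch
import torch.nn as nn
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.zeros(1, 1, 8, 8)
x[0, 0, 3, 2] = 1.0                           # a single strong activation
x_shifted = torch.roll(x, shifts=1, dims=3)   # shift the whole map right by one pixel
# The peak stays inside the same 2×2 window, so the pooled output is unchanged.
# (A shift that crosses a window boundary would change it, hence "often".)
print(torch.equal(max_pool(x), max_pool(x_shifted)))  # True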
Computational Efficiency
Reducing spatial dimensions decreases:
- Memory requirements: Smaller feature maps
- Computation: Fewer operations in subsequent layers
- Parameters: Fewer weights needed in any subsequent fully-connected layers (a rough count follows this list)
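To put a rough number on the parameter point (the feature-map and layer sizes here are illustrative assumptions): connecting a flattened 64-channel map to 256 units needs about 4× fewer weights after a single 2×2 pooling step.
import torch.nn as nn
# Hypothetical sizes: a 64-channel feature map before and after 2×2 pooling
fc_before = nn.Linear(64 * 28 * 28, 256)   # flattened 28×28 map -> 256 units
fc_after = nn.Linear(64 * 14 * 14, 256)    # flattened 14×14 map -> 256 units
print(sum(p.numel() for p in fc_before.parameters()))  # 12845312
print(sum(p.numel() for p in fc_after.parameters()))   # 3211520 (≈ 4× fewer)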
Larger Receptive Fields
Pooling accelerates the growth of receptive fields. After a 2× downsampling, each pixel in the next layer corresponds to a 2× larger region in the previous layer.
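A small calculation makes this concrete (a sketch using the standard receptive-field recursion; the layer choices are illustrative): each layer grows the receptive field by (kernel_size − 1) times the current output stride ("jump"), and a stride-2 pooling layer doubles that jump for every layer after it.
# Receptive-field recursion: rf += (k - 1) * jump; jump *= stride
def receptive_field(layers):
    rf, jump = 1, 1
    for k, s in layers:            # (kernel_size, stride) per layer
        rf += (k - 1) * jump
        jump *= s
    return rf
print(receptive_field([(3, 1), (3, 1)]))          # 5: two stacked 3×3 convs
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: a 2×2 pool between them widens the second conv's reach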
Max vs Average Pooling
Max Pooling
- When: Convolutional layers in the network body
- Why: Preserves the strongest activations, good for detecting the presence of features
- Effect: Captures sharp, salient features
Average Pooling
- When: Final layers before classification
- Why: Provides smoother, more distributed representation
- Effect: Reduces overfitting by avoiding over-reliance on the strongest activations (contrast shown below)
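The contrast is easy to see on a toy tensor (values chosen for illustration): max pooling keeps a lone strong activation intact, while average pooling dilutes it across the window.
import torch
import torch.nn as nn
x = torch.tensor([[0., 0.],
                  [0., 8.]]).reshape(1, 1, 2, 2)   # one sharp activation in a 2×2 window
print(nn.MaxPool2d(2)(x).item())   # 8.0 - the peak survives
print(nn.AvgPool2d(2)(x).item())   # 2.0 - the peak is averaged away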
Global Average Pooling
- When: Replacing fully-connected layers at network end
- Why: Eliminates position-specific parameters, reduces overfitting
- Effect: Forces feature maps to be semantically meaningful at each spatial location
Used in modern architectures like ResNet, MobileNet, and EfficientNet.
Modern Alternatives
Many recent architectures use strided convolutions instead of pooling for downsampling:
# Instead of conv + pool:
conv_layer = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
pool_layer = nn.MaxPool2d(kernel_size=2, stride=2)
# Use strided conv:
strided_conv = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
Advantages of strided convolutions:
- Network learns the optimal downsampling operation
- Fewer hyperparameters (no separate pooling design)
- Used in architectures like ResNet variants
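As a quick sanity check (a sketch reusing the layers defined above; the 32×32 input size is an assumption), both paths downsample to the same spatial resolution:
import torch
import torch.nn as nn
conv_layer = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
pool_layer = nn.MaxPool2d(kernel_size=2, stride=2)
strided_conv = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 64, 32, 32)
print(pool_layer(conv_layer(x)).shape)  # torch.Size([1, 128, 16, 16])
print(strided_conv(x).shape)            # torch.Size([1, 128, 16, 16])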
Pooling is still used when:
- Computational efficiency is critical
- Simple, interpretable downsampling is desired
- Global aggregation is needed (global average pooling)
Parameter-Free Operation
Pooling layers have no learnable parameters. They are fixed operations defined by:
- kernel_size: Size of pooling window
- stride: Step size when moving the window
- padding: Border handling (rarely used with pooling)
This makes them computationally cheap and adds no parameters to the model.
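This is easy to confirm with a minimal check (the convolution is included only for comparison):
import torch.nn as nn
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(sum(p.numel() for p in pool.parameters()))  # 0 - no learnable parameters
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))  # 36928 - weights + biases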
Output Dimension Calculation
For an input of size H × W, a pooling layer with a given kernel_size, stride, and padding produces an output of size:
H_out = ⌊(H + 2 × padding − kernel_size) / stride⌋ + 1
W_out = ⌊(W + 2 × padding − kernel_size) / stride⌋ + 1
Common configuration (2×2 pool, stride 2, no padding): H_out = ⌊H/2⌋ and W_out = ⌊W/2⌋, halving each spatial dimension.
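A quick check of the formula in PyTorch (the 7×7 input size is chosen arbitrarily to show the floor behaviour):
import torch
import torch.nn as nn
x = torch.randn(1, 3, 7, 7)                   # odd spatial size
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # no padding
print(pool(x).shape)   # torch.Size([1, 3, 3, 3]): (7 + 0 - 2) // 2 + 1 = 3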
Related Concepts
- Convolution Operation - Primary spatial feature extraction in CNNs
Key Papers
- Gradient-Based Learning Applied to Document Recognition (LeCun et al., 1998) - Introduced pooling in modern CNNs
- Network In Network (Lin et al., 2013) - Introduced global average pooling
Learning Resources
Articles
- CS231n Pooling Layers - Stanford course material on pooling