
Pooling Layers

Pooling operations reduce the spatial dimensions of feature maps while retaining the most important information. They provide translation invariance and computational efficiency in CNNs.

Max Pooling

Max pooling takes the maximum value within each pooling window, preserving the strongest activations:

import torch.nn as nn

# Max pooling: reduces spatial dimensions by 2×
# Input:  (batch, channels, H, W)
# Output: (batch, channels, H/2, W/2)
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

Example

For a 2×2 max pooling operation with stride 2:

Input (4×4):          Output (2×2):

[1 3 2 4]
[5 6 8 7]      →      [6 8]
[2 1 4 3]             [5 9]
[5 2 9 6]

Each 2×2 region is replaced by its maximum value.
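
The same result can be reproduced directly in PyTorch (a minimal sketch; the tensor values are the ones from the example above):

import torch
import torch.nn as nn

# The 4×4 input from the example, with batch and channel dimensions added
x = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 8., 7.],
                  [2., 1., 4., 3.],
                  [5., 2., 9., 6.]]).reshape(1, 1, 4, 4)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(max_pool(x).squeeze())
# tensor([[6., 8.],
#         [5., 9.]])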

Average Pooling

Average pooling computes the mean value in each window:

# Average pooling - smooths features
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

# Global average pooling - reduces each channel to a single value
# Input:  (batch, channels, H, W)
# Output: (batch, channels, 1, 1)
global_avg_pool = nn.AdaptiveAvgPool2d((1, 1))

Global average pooling is commonly used before the final classification layer in modern architectures, replacing fully-connected layers.
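
As a rough sketch of this pattern (the channel count of 512 and the 1000 classes are illustrative values, not taken from a specific architecture), global average pooling can feed a single linear layer directly:

import torch
import torch.nn as nn

# Hypothetical classification head: global average pooling + one linear layer
head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),   # (batch, 512, H, W) -> (batch, 512, 1, 1)
    nn.Flatten(),                   # (batch, 512, 1, 1) -> (batch, 512)
    nn.Linear(512, 1000),           # (batch, 512)       -> (batch, 1000)
)

features = torch.randn(8, 512, 7, 7)   # dummy feature maps
logits = head(features)
print(logits.shape)  # torch.Size([8, 1000])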

Why Pooling?

Translation Invariance

Small shifts in the input produce minimal changes in the output. If a feature moves by a few pixels, the maximum within the pooling region often remains the same, making detection more robust.
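
A small experiment illustrates this (a sketch; the input and the one-pixel shift are made up for illustration):

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# A feature map with one strong activation, and the same map shifted right by one pixel
x = torch.zeros(1, 1, 4, 4)
x[0, 0, 1, 0] = 1.0                          # strong activation at row 1, col 0
x_shifted = torch.roll(x, shifts=1, dims=3)  # activation moves to row 1, col 1

print(pool(x).squeeze())
print(pool(x_shifted).squeeze())
# Both print tensor([[1., 0.], [0., 0.]]): the shifted activation
# stays inside the same 2×2 pooling window, so the output is unchanged.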

Computational Efficiency

Reducing spatial dimensions decreases:

  • Memory requirements: Smaller feature maps
  • Computation: Fewer operations in subsequent layers
  • Parameters: Fewer weights needed in any subsequent fully-connected layers (see the sketch after this list)
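
As a rough illustration (the feature-map and layer sizes below are made up for the example), halving height and width quarters both the number of activations and the weight count of a fully-connected layer applied to the flattened map:

channels, h, w = 64, 32, 32                      # illustrative feature-map size
before = channels * h * w                        # 65,536 activations
after = channels * (h // 2) * (w // 2)           # 16,384 activations after 2×2 pooling

hidden = 256                                     # hypothetical fully-connected layer width
print(before, after)                             # 65536 16384
print(before * hidden, after * hidden)           # 16,777,216 vs 4,194,304 weights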

Larger Receptive Fields

Pooling accelerates the growth of receptive fields. After a 2× downsampling, each pixel in the next layer corresponds to a 2× larger region in the previous layer.
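
This can be traced with the standard receptive-field recurrence (a small sketch; the layer stack below is a hypothetical example): each layer with kernel size k and stride s updates the receptive field r and cumulative stride j as r ← r + (k − 1)·j and j ← j·s.

# Receptive field of one output pixel, tracked through a stack of layers.
# Each layer is (kernel_size, stride); the stacks below are made-up examples.
def receptive_field(layers):
    r, j = 1, 1                # receptive field size, cumulative stride
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r

# Two 3×3 convolutions without pooling vs. with a 2×2 pool in between
print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8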

Max vs Average Pooling

Max Pooling

  • When: Convolutional layers in the network body
  • Why: Preserves strongest activations, good for detecting presence of features
  • Effect: Captures sharp, salient features

Average Pooling

  • When: Final layers before classification
  • Why: Provides smoother, more distributed representation
  • Effect: Reduces overfitting by avoiding over-reliance on strongest activations

Global Average Pooling

  • When: Replacing fully-connected layers at network end
  • Why: Eliminates position-specific parameters, reduces overfitting
  • Effect: Forces feature maps to be semantically meaningful at each spatial location

Used in modern architectures like ResNet, MobileNet, and EfficientNet.

Modern Alternatives

Many recent architectures use strided convolutions instead of pooling for downsampling:

# Instead of conv + pool:
conv_layer = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
pool_layer = nn.MaxPool2d(kernel_size=2, stride=2)

# Use strided conv:
strided_conv = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
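
Both paths produce feature maps of the same spatial size, which can be checked directly (a quick sketch using the layers defined above and a dummy input):

import torch

x = torch.randn(1, 64, 32, 32)        # dummy input feature map
out_pooled = pool_layer(conv_layer(x))
out_strided = strided_conv(x)
print(out_pooled.shape)   # torch.Size([1, 128, 16, 16])
print(out_strided.shape)  # torch.Size([1, 128, 16, 16])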

Advantages of strided convolutions:

  • Network learns the optimal downsampling operation
  • Fewer hyperparameters (no separate pooling design)
  • Used in architectures like ResNet variants

Pooling still used when:

  • Computational efficiency is critical
  • Simple, interpretable downsampling desired
  • Global aggregation needed (global average pooling)

Parameter-Free Operation

Pooling layers have no learnable parameters. They are fixed operations defined by:

  • kernel_size: Size of pooling window
  • stride: Step size when moving the window
  • padding: Border handling (rarely used with pooling)

This makes them computationally cheap and adds no parameters to the model.
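
This is easy to verify in PyTorch (a minimal check; the convolution is included only for contrast):

import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

print(sum(p.numel() for p in pool.parameters()))  # 0
print(sum(p.numel() for p in conv.parameters()))  # 36928 (64*64*3*3 weights + 64 biases)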

Output Dimension Calculation

For an input of size H \times W:

H_{out} = \left\lfloor \frac{H + 2 \times \text{padding} - \text{kernel\_size}}{\text{stride}} \right\rfloor + 1

Common configuration (2×2 pool, stride 2, no padding):

H_{out} = \left\lfloor \frac{H - 2}{2} \right\rfloor + 1 = \frac{H}{2}

For odd H, the output rounds down to \lfloor H/2 \rfloor.
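
The formula translates directly into a small helper (a sketch; PyTorch layers compute this internally):

def pool_output_size(size, kernel_size, stride, padding=0):
    """Spatial output size of a pooling (or conv) layer, per the formula above."""
    return (size + 2 * padding - kernel_size) // stride + 1

print(pool_output_size(32, kernel_size=2, stride=2))  # 16
print(pool_output_size(7, kernel_size=2, stride=2))   # 3 (odd sizes round down)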

Key Papers

  • Gradient-Based Learning Applied to Document Recognition (LeCun et al., 1998) - Introduced pooling in modern CNNs
  • Network In Network (Lin et al., 2013) - Introduced global average pooling

Learning Resources

Articles