Convolution Operation
The convolution operation is the fundamental building block of Convolutional Neural Networks (CNNs), enabling them to efficiently process spatial data like images.
Mathematical Definition
A convolution slides a small filter (kernel) over the input, computing dot products at each position:

$$(I * K)(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)$$

For an image $I$ and kernel $K$, this produces a feature map highlighting specific patterns wherever they appear in the input. (Deep learning libraries do not flip the kernel, so this is technically cross-correlation, but the term convolution is used by convention.)
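The definition is easy to verify directly. Below is a minimal sketch (assuming PyTorch; `conv2d_naive` is a hypothetical helper name) that computes the sliding-window dot product with explicit loops and checks it against `torch.nn.functional.conv2d`:

```python
import torch
import torch.nn.functional as F

def conv2d_naive(image, kernel):
    """Stride-1, no-padding cross-correlation of a 2D image with a 2D kernel."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = torch.zeros(H - kH + 1, W - kW + 1)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product between the kernel and the local patch at (i, j)
            out[i, j] = (image[i:i + kH, j:j + kW] * kernel).sum()
    return out

image, kernel = torch.randn(8, 8), torch.randn(3, 3)
reference = F.conv2d(image.view(1, 1, 8, 8), kernel.view(1, 1, 3, 3)).squeeze()
assert torch.allclose(conv2d_naive(image, kernel), reference, atol=1e-5)
```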
Why Convolution for Images?
Convolution operations are particularly well-suited for visual data because they exploit several key inductive biases:
Spatial Locality
Nearby pixels in images are typically related. Convolution operates on local neighborhoods, capturing these local correlations efficiently.
Translation Invariance
The same filter is applied across all spatial locations, so the network can detect features (edges, textures, patterns) regardless of where they appear in the image. Strictly, convolution is translation-equivariant (a shifted input produces a correspondingly shifted output); invariance typically comes from adding pooling and global aggregation.
Parameter Sharing
Unlike fully-connected layers, which require a separate weight for each input-output connection, convolution reuses the same filter weights at every spatial position. This dramatically reduces the number of parameters:
- Fully-connected (mapping an $H \times W \times C_{\text{in}}$ input to $C_{\text{out}}$ units): $H \times W \times C_{\text{in}} \times C_{\text{out}}$ parameters
- Convolutional: $K \times K \times C_{\text{in}} \times C_{\text{out}}$ parameters (where $K$ is the kernel size)
For a 224×224 RGB image and 64 outputs (ignoring biases):
- Fully-connected: 224 × 224 × 3 × 64 ≈ 9.6 million parameters
- 3×3 convolution: 3 × 3 × 3 × 64 = 1,728 parameters (a ~5,500× reduction)
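These counts are easy to check; here is a quick sketch with PyTorch (bias terms disabled to match the arithmetic above):

```python
import torch.nn as nn

fc = nn.Linear(224 * 224 * 3, 64, bias=False)       # fully-connected: 9,633,792 weights
conv = nn.Conv2d(3, 64, kernel_size=3, bias=False)  # 3×3 convolution: 1,728 weights

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc), count(conv), round(count(fc) / count(conv)))  # 9633792 1728 5575
```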
Hierarchical Feature Learning
Stacking convolution layers allows CNNs to build hierarchical representations:
- Early layers: Edges, colors, simple textures
- Middle layers: Patterns, shapes, parts
- Late layers: High-level, task-specific concepts (faces, objects, scenes)
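An illustrative sketch of such a stack (the channel counts and pooling choices here are arbitrary, for demonstration only):

```python
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),    # early: edges, colors, simple textures
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),   # middle: patterns, shapes, parts
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),  # late: higher-level, task-specific features
)

print(features(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 128, 56, 56])
```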
Convolution Parameters
import torch.nn as nn
conv = nn.Conv2d(
    in_channels=3,    # Number of input channels (e.g., RGB = 3)
    out_channels=64,  # Number of filters (output channels)
    kernel_size=3,    # Spatial size of filters (3×3)
    stride=1,         # Step size when sliding filter
    padding=1         # Zero-padding around input
)
# Input shape: (batch_size, 3, H, W)
# Output shape: (batch_size, 64, H, W)
Stride
Controls how many pixels the filter moves at each step. stride=1 with appropriate padding preserves the input's spatial dimensions; stride=2 roughly halves them.
Padding
Adds zeros around the input border. padding=1 with a 3×3 kernel preserves spatial dimensions. Without padding, each spatial dimension shrinks by kernel_size - 1.
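A short sketch showing how stride and padding affect the output shape (expected shapes given in the comments):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)

same = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
valid = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=0)
strided = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)

print(same(x).shape)     # torch.Size([1, 64, 224, 224])  -- dimensions preserved
print(valid(x).shape)    # torch.Size([1, 64, 222, 222])  -- shrinks by kernel_size - 1
print(strided(x).shape)  # torch.Size([1, 64, 112, 112])  -- roughly halved
```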
Receptive Field
The receptive field of a neuron is the region of the input that affects its activation. Stacking convolution layers increases the effective receptive field, allowing deeper networks to capture more global context.
Growth with depth:
- Single 3×3 conv: 3×3 receptive field
- Two 3×3 convs: 5×5 receptive field
- Three 3×3 convs: 7×7 receptive field
- $n$ layers of 3×3 convs: $(2n + 1) \times (2n + 1)$ receptive field
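This growth is easy to compute with a small sketch (pure Python; the stride parameter is included only to show how larger strides would compound):

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump  # each layer adds (k - 1) * current input stride
        jump *= stride                  # stride > 1 would enlarge later layers' contributions
    return rf

print([receptive_field(n) for n in (1, 2, 3, 4)])  # [3, 5, 7, 9]
```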
This is one reason why deeper networks can be more powerful than wider shallow networks—they can integrate information from larger spatial regions while maintaining computational efficiency.
Output Dimension Calculation
For an input of size $H \times W$ with kernel size $K$, stride $S$, and padding $P$:

$$H_{\text{out}} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1, \qquad W_{\text{out}} = \left\lfloor \frac{W + 2P - K}{S} \right\rfloor + 1$$

Example: 224×224 input, 3×3 kernel, stride=1, padding=1 gives (224 + 2·1 − 3)/1 + 1 = 224, so the 224×224 spatial dimensions are preserved.
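The formula translates directly into a small helper (`conv_output_size` is a hypothetical name) that can be checked against an actual nn.Conv2d:

```python
import torch
import torch.nn as nn

def conv_output_size(size, kernel_size, stride=1, padding=0):
    return (size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(224, 3, stride=1, padding=1))  # 224
print(conv_output_size(224, 3, stride=2, padding=1))  # 112

conv = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
print(conv(torch.randn(1, 3, 224, 224)).shape)        # torch.Size([1, 64, 112, 112])
```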
Related Concepts
- Pooling Layers - Spatial downsampling operations often used with convolution
Key Papers
- Gradient-Based Learning Applied to Document Recognition (LeCun et al., 1998) - Introduced LeNet-5, the template for modern CNN architectures
Learning Resources
Videos
- 3Blue1Brown - But what is a convolution? - Visual explanation of convolution
Articles
- CS231n Convolutional Networks - Comprehensive CNN guide from Stanford