
Convolution Operation

The convolution operation is the fundamental building block of Convolutional Neural Networks (CNNs), enabling them to efficiently process spatial data like images.

Mathematical Definition

A convolution slides a small filter (kernel) K over the input I, computing dot products at each position:

(I * K)[i,j] = \sum_m \sum_n I[i+m, j+n] \cdot K[m,n]

For an image I and kernel K, this produces a feature map highlighting specific patterns wherever they appear in the input.
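
As a concrete illustration of this formula, here is a minimal NumPy sketch (a direct, unoptimized loop with no padding and stride 1; the function name conv2d_valid and the edge-detection kernel are choices made for this example, not part of the text):

import numpy as np

def conv2d_valid(I, K):
    """Slide kernel K over image I (no padding, stride 1), computing dot products."""
    H, W = I.shape
    kh, kw = K.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # (I * K)[i, j] = sum_m sum_n I[i + m, j + n] * K[m, n]
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return out

image = np.random.rand(6, 6)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])   # a simple vertical-edge filter
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (4, 4)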

Why Convolution for Images?

Convolution operations are particularly well-suited for visual data because they exploit several key inductive biases:

Spatial Locality

Nearby pixels in images are typically related. Convolution operates on local neighborhoods, capturing these local correlations efficiently.

Translation Invariance

The same filter is applied across all spatial locations, allowing the network to detect features (edges, textures, patterns) regardless of where they appear in the image.

Parameter Sharing

Unlike fully-connected layers which require separate weights for each input-output connection, convolution uses the same filter weights across all spatial positions. This dramatically reduces the number of parameters:

  • Fully-connected: H × W × C_in × C_out parameters
  • Convolutional: k × k × C_in × C_out parameters (where k is the kernel size)

For a 224×224 RGB image with 64 output channels (checked in the sketch after this list):

  • Fully-connected: 224 × 224 × 3 × 64 ≈ 9.6 million parameters
  • 3×3 convolution: 3 × 3 × 3 × 64 = 1,728 parameters (a reduction of more than 5,000×)
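
The parameter counts above can be verified directly (a small sketch; the fully-connected layer here maps the flattened 224×224×3 input to 64 outputs, matching the comparison in the list):

import torch.nn as nn

fc = nn.Linear(224 * 224 * 3, 64, bias=False)       # one weight per input-output connection
conv = nn.Conv2d(3, 64, kernel_size=3, bias=False)  # one 3×3×3 filter per output channel

print(sum(p.numel() for p in fc.parameters()))    # 9633792  (~9.6 million)
print(sum(p.numel() for p in conv.parameters()))  # 1728     (~1,700)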

Hierarchical Feature Learning

Stacking convolution layers allows CNNs to build hierarchical representations (a minimal example of such a stack follows the list below):

  • Early layers: Edges, colors, simple textures
  • Middle layers: Patterns, shapes, parts
  • Late layers: High-level, task-specific concepts (faces, objects, scenes)
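
As a rough architectural sketch (the layer widths and pooling choices here are illustrative, not prescribed by the text), a stack of convolution blocks progressively trades spatial resolution for richer channel representations:

import torch
import torch.nn as nn

# early block -> edges/colors, middle block -> parts, late block -> higher-level features
stack = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # early
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # middle
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # late
)

x = torch.randn(1, 3, 224, 224)
print(stack(x).shape)  # torch.Size([1, 128, 28, 28])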

Convolution Parameters

import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,    # Number of input channels (e.g., RGB = 3)
    out_channels=64,  # Number of filters (output channels)
    kernel_size=3,    # Spatial size of filters (3×3)
    stride=1,         # Step size when sliding filter
    padding=1         # Zero-padding around input
)

# Input shape:  (batch_size, 3, H, W)
# Output shape: (batch_size, 64, H, W)
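
A quick shape check with a dummy batch (the batch size of 8 is arbitrary) confirms the comment above:

import torch

x = torch.randn(8, 3, 224, 224)   # a batch of 8 RGB images
y = conv(x)                       # the layer defined above
print(y.shape)                    # torch.Size([8, 64, 224, 224])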

Stride

Controls how many pixels the filter moves at each step. With appropriate padding, stride=1 preserves the input's spatial dimensions, while stride=2 roughly halves them.
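
For example, a minimal sketch (layer and input sizes chosen only for illustration):

import torch
import torch.nn as nn

strided = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 224, 224)
print(strided(x).shape)  # torch.Size([1, 64, 112, 112]) -- spatial dimensions halved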

Padding

Adds zeros around the input border. padding=1 with a 3×3 kernel preserves spatial dimensions. Without padding, each spatial dimension shrinks by kernel_size - 1 in total (a 3×3 kernel removes one pixel from each side).
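
The effect of padding on output size, in a short sketch (input size is illustrative):

import torch
import torch.nn as nn

unpadded = nn.Conv2d(3, 64, kernel_size=3, padding=0)
padded = nn.Conv2d(3, 64, kernel_size=3, padding=1)
x = torch.randn(1, 3, 224, 224)
print(unpadded(x).shape)  # torch.Size([1, 64, 222, 222]) -- shrinks by kernel_size - 1
print(padded(x).shape)    # torch.Size([1, 64, 224, 224]) -- spatial size preserved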

Receptive Field

The receptive field of a neuron is the region of the input that affects its activation. Stacking convolution layers increases the effective receptive field, allowing deeper networks to capture more global context.

Growth with depth (see the sketch after this list):

  • Single 3×3 conv: 3×3 receptive field
  • Two 3×3 convs: 5×5 receptive field
  • Three 3×3 convs: 7×7 receptive field
  • n layers of 3×3 convs: (2n + 1) × (2n + 1) receptive field
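
A short sketch that reproduces the numbers above (the helper function and its recurrence are a standard receptive-field calculation, not something defined on this page):

def stacked_receptive_field(layers):
    """Receptive field of stacked convolutions, given (kernel_size, stride) per layer.
    Standard recurrence: rf grows by (k - 1) * jump, where jump is the product
    of the strides of all earlier layers."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(stacked_receptive_field([(3, 1)] * 1))  # 3
print(stacked_receptive_field([(3, 1)] * 2))  # 5
print(stacked_receptive_field([(3, 1)] * 3))  # 7
print(stacked_receptive_field([(3, 1)] * 5))  # 11  (2n + 1 with n = 5)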

This is one reason why deeper networks can be more powerful than wider shallow networks—they can integrate information from larger spatial regions while maintaining computational efficiency.

Output Dimension Calculation

For an input of size H × W:

H_{out} = \left\lfloor \frac{H + 2 \times \text{padding} - \text{kernel\_size}}{\text{stride}} \right\rfloor + 1

Example: 224×224 input, 3×3 kernel, stride=1, padding=1

H_{out} = \left\lfloor \frac{224 + 2 \times 1 - 3}{1} \right\rfloor + 1 = 224
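
The same calculation as a small Python helper (the function name conv_output_size is just for illustration, not a library function):

def conv_output_size(size, kernel_size, stride=1, padding=0):
    """floor((size + 2*padding - kernel_size) / stride) + 1"""
    return (size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(224, kernel_size=3, stride=1, padding=1))  # 224
print(conv_output_size(224, kernel_size=3, stride=2, padding=1))  # 112
print(conv_output_size(224, kernel_size=3, stride=1, padding=0))  # 222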

Related Concepts

  • Pooling Layers - Spatial downsampling operations often used with convolution

Key Papers

  • Gradient-Based Learning Applied to Document Recognition (LeCun et al., 1998) - Introduced LeNet and modern CNN architectures
