
Convolution Operation

The convolution operation is the fundamental building block of Convolutional Neural Networks (CNNs), enabling them to efficiently process spatial data like images.

Mathematical Definition

A convolution slides a small filter (kernel) K over the input I, computing dot products at each position:

(I * K)[i,j] = \sum_m \sum_n I[i+m, j+n] \cdot K[m,n]

For an image I and kernel K, this produces a feature map highlighting specific patterns wherever they appear in the input.
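
As a concrete illustration of this formula, here is a minimal NumPy sketch (a direct, unoptimized loop with no padding and stride 1; the function name conv2d_valid and the edge-detection kernel are choices made for this example, not part of the text):

import numpy as np

def conv2d_valid(I, K):
    """Slide kernel K over image I (no padding, stride 1), computing dot products."""
    H, W = I.shape
    kh, kw = K.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # (I * K)[i, j] = sum_m sum_n I[i + m, j + n] * K[m, n]
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return out

image = np.random.rand(6, 6)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])   # a simple vertical-edge filter
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (4, 4)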

Why Convolution for Images?

Convolution operations are particularly well-suited for visual data because they exploit several key inductive biases:

Spatial Locality

Nearby pixels in images are typically related. Convolution operates on local neighborhoods, capturing these local correlations efficiently.

Translation Invariance

The same filter is applied across all spatial locations, allowing the network to detect features (edges, textures, patterns) regardless of where they appear in the image.

Parameter Sharing

Unlike fully-connected layers which require separate weights for each input-output connection, convolution uses the same filter weights across all spatial positions. This dramatically reduces the number of parameters:

  • Fully-connected: H × W × C_in × C_out parameters
  • Convolutional: k × k × C_in × C_out parameters (where k is the kernel size)

For a 224×224 RGB image with 64 output channels (checked in the sketch after this list):

  • Fully-connected: 224 × 224 × 3 × 64 ≈ 9.6 million parameters
  • 3×3 convolution: 3 × 3 × 3 × 64 = 1,728 parameters (a reduction of more than 5,000×)
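
The parameter counts above can be verified directly (a small sketch; the fully-connected layer here maps the flattened 224×224×3 input to 64 outputs, matching the comparison in the list):

import torch.nn as nn

fc = nn.Linear(224 * 224 * 3, 64, bias=False)       # one weight per input-output connection
conv = nn.Conv2d(3, 64, kernel_size=3, bias=False)  # one 3×3×3 filter per output channel

print(sum(p.numel() for p in fc.parameters()))    # 9633792  (~9.6 million)
print(sum(p.numel() for p in conv.parameters()))  # 1728     (~1,700)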

Hierarchical Feature Learning

Stacking convolution layers allows CNNs to build hierarchical representations (a minimal example of such a stack follows the list below):

  • Early layers: Edges, colors, simple textures
  • Middle layers: Patterns, shapes, parts
  • Late layers: High-level, task-specific concepts (faces, objects, scenes)
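
As a rough architectural sketch (the layer widths and pooling choices here are illustrative, not prescribed by the text), a stack of convolution blocks progressively trades spatial resolution for richer channel representations:

import torch
import torch.nn as nn

# early block -> edges/colors, middle block -> parts, late block -> higher-level features
stack = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # early
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # middle
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # late
)

x = torch.randn(1, 3, 224, 224)
print(stack(x).shape)  # torch.Size([1, 128, 28, 28])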

Convolution Parameters

import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,    # Number of input channels (e.g., RGB = 3)
    out_channels=64,  # Number of filters (output channels)
    kernel_size=3,    # Spatial size of filters (3×3)
    stride=1,         # Step size when sliding filter
    padding=1         # Zero-padding around input
)

# Input shape:  (batch_size, 3, H, W)
# Output shape: (batch_size, 64, H, W)
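
A quick shape check with a dummy batch (the batch size of 8 is arbitrary) confirms the comment above:

import torch

x = torch.randn(8, 3, 224, 224)   # a batch of 8 RGB images
y = conv(x)                       # the layer defined above
print(y.shape)                    # torch.Size([8, 64, 224, 224])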

Stride

Controls how many pixels the filter moves at each step. With appropriate padding, stride=1 preserves the input's spatial dimensions, while stride=2 roughly halves them.
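
For example, a minimal sketch (layer and input sizes chosen only for illustration):

import torch
import torch.nn as nn

strided = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 224, 224)
print(strided(x).shape)  # torch.Size([1, 64, 112, 112]) -- spatial dimensions halved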

Padding

Adds zeros around the input border. padding=1 with a 3×3 kernel preserves spatial dimensions. Without padding, each spatial dimension shrinks by kernel_size - 1 in total (a 3×3 kernel removes one pixel from each side).
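
The effect of padding on output size, in a short sketch (input size is illustrative):

import torch
import torch.nn as nn

unpadded = nn.Conv2d(3, 64, kernel_size=3, padding=0)
padded = nn.Conv2d(3, 64, kernel_size=3, padding=1)
x = torch.randn(1, 3, 224, 224)
print(unpadded(x).shape)  # torch.Size([1, 64, 222, 222]) -- shrinks by kernel_size - 1
print(padded(x).shape)    # torch.Size([1, 64, 224, 224]) -- spatial size preserved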

Receptive Field

The receptive field of a neuron is the region of the input that affects its activation. Stacking convolution layers increases the effective receptive field, allowing deeper networks to capture more global context.

Growth with depth (see the sketch after this list):

  • Single 3×3 conv: 3×3 receptive field
  • Two 3×3 convs: 5×5 receptive field
  • Three 3×3 convs: 7×7 receptive field
  • n layers of 3×3 convs: (2n + 1) × (2n + 1) receptive field
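
A short sketch that reproduces the numbers above (the helper function and its recurrence are a standard receptive-field calculation, not something defined on this page):

def stacked_receptive_field(layers):
    """Receptive field of stacked convolutions, given (kernel_size, stride) per layer.
    Standard recurrence: rf grows by (k - 1) * jump, where jump is the product
    of the strides of all earlier layers."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(stacked_receptive_field([(3, 1)] * 1))  # 3
print(stacked_receptive_field([(3, 1)] * 2))  # 5
print(stacked_receptive_field([(3, 1)] * 3))  # 7
print(stacked_receptive_field([(3, 1)] * 5))  # 11  (2n + 1 with n = 5)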

This is one reason why deeper networks can be more powerful than wider shallow networks—they can integrate information from larger spatial regions while maintaining computational efficiency.

Output Dimension Calculation

For an input of size H × W:

H_{out} = \left\lfloor \frac{H + 2 \times \text{padding} - \text{kernel\_size}}{\text{stride}} \right\rfloor + 1

Example: 224×224 input, 3×3 kernel, stride=1, padding=1

H_{out} = \left\lfloor \frac{224 + 2 \times 1 - 3}{1} \right\rfloor + 1 = 224
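
The same calculation as a small Python helper (the function name conv_output_size is just for illustration, not a library function):

def conv_output_size(size, kernel_size, stride=1, padding=0):
    """floor((size + 2*padding - kernel_size) / stride) + 1"""
    return (size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(224, kernel_size=3, stride=1, padding=1))  # 224
print(conv_output_size(224, kernel_size=3, stride=2, padding=1))  # 112
print(conv_output_size(224, kernel_size=3, stride=1, padding=0))  # 222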

Related Concepts

  • Pooling Layers - Spatial downsampling operations often used with convolution

Key Papers

  • Gradient-Based Learning Applied to Document Recognition (LeCun et al., 1998) - Introduced LeNet and modern CNN architectures
