
ResNet: Deep Residual Learning for Image Recognition

Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (Microsoft Research)
Year: 2015 (arXiv preprint; published at CVPR 2016)
Venue: CVPR (IEEE Conference on Computer Vision and Pattern Recognition)
Impact: ⭐⭐⭐ - Revolutionary architecture that became the backbone of modern computer vision

Overview

ResNet introduced residual connections (skip connections), solving the degradation problem that prevented training very deep neural networks. This breakthrough enabled networks with 100+ layers and achieved superhuman performance on ImageNet (3.57% top-5 error vs 5.1% human error).

ResNet is now ubiquitous - it is one of the most commonly used architectures for transfer learning and forms the backbone of countless computer vision applications.

The Degradation Problem

Before ResNet, a puzzling phenomenon occurred: deeper networks performed worse than shallower ones, even on training data.

Expected vs Observed

Intuition suggests: A deeper network should at least match a shallower one by learning identity mappings in extra layers.

Reality: Training accuracy degraded with depth:

  • 20-layer network: 91% training accuracy
  • 56-layer network: 88% training accuracy (worse!)

This wasn’t overfitting (the training error itself was higher), nor was it vanishing gradients alone (batch normalization helped but didn’t solve it). Something fundamental was wrong.

The Core Insight

Problem: Learning an identity mapping H(x) = x is hard for stacked nonlinear layers.

Solution: Learn the residual mapping F(x) = H(x) - x instead, then add the input back:

H(x) = F(x) + x

If the identity is optimal, it is easier to learn F(x) = 0 than to learn H(x) = x directly.
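To see why this reformulation makes the identity easy, here is a minimal sketch (a toy linear block, not the paper's convolutional one): when the learned residual branch outputs zero, the block reduces to an exact identity mapping.

import torch
import torch.nn as nn

# Toy residual block: F is a single linear layer initialized to zero,
# so F(x) = 0 and the block computes H(x) = F(x) + x = x exactly.
class ToyResidual(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Linear(dim, dim)   # stands in for the residual function F
        nn.init.zeros_(self.f.weight)
        nn.init.zeros_(self.f.bias)

    def forward(self, x):
        return self.f(x) + x           # H(x) = F(x) + x

x = torch.randn(2, 4)
block = ToyResidual(4)
print(torch.allclose(block(x), x))     # True: zero residual gives the identity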

Residual Connections

Mathematical Formulation

A residual block computes:

y = F(x, \{W_i\}) + x

Where:

  • x is the input
  • F(x, \{W_i\}) is the residual function (the learned transformation)
  • y is the output

The +x term is the skip connection, also called a shortcut connection.

Implementation

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x  # Save input for skip connection

        # Residual function F(x)
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)

        # Skip connection: add input
        out += identity

        # Final activation
        out = self.relu(out)
        return out

Projection Shortcuts

When dimensions change (e.g., spatial downsampling or channel increase), use a projection:

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Projection shortcut when dimensions change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = self.shortcut(x)

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        out += identity
        out = self.relu(out)
        return out

Why Skip Connections Work

1. Gradient Flow

Gradients can flow directly backward through the identity connection:

\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \left(1 + \frac{\partial F}{\partial x}\right)

The +1 term ensures gradients always have a direct path back to earlier layers, preventing them from vanishing.
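A scalar example (illustrative, not from the paper) makes the +1 visible with autograd: take F(x) = w·x with a tiny w, so the block computes y = w·x + x and dy/dx = w + 1.

import torch

# Residual block with a nearly-dead residual branch: F(x) = w * x, w ≈ 0.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(1e-6)
y = w * x + x          # y = F(x) + x
y.backward()
print(x.grad)          # ≈ 1.000001: the shortcut's +1 keeps the gradient alive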

2. Easier Optimization

Learning F(x) = 0 (do nothing) is easier than learning H(x) = x (a perfect identity) from scratch with nonlinear layers.

3. Feature Reuse

Early layer features can be used directly by later layers without transformation, enabling efficient information flow.

4. Ensemble Effect

A ResNet implicitly creates an ensemble of exponentially many paths:

  • A network with n residual blocks has 2^n possible paths (see the small counting sketch after this list)
  • During backpropagation, shorter paths dominate early training
  • Deeper paths contribute as training progresses
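A tiny enumeration (an illustration of the counting argument, not part of the original paper) shows where the 2^n comes from: each block either contributes its residual branch or is bypassed via the shortcut.

from itertools import product

# Each residual block is either "taken" (1) or "skipped" (0); every binary
# choice vector corresponds to one path through the network.
n = 3
paths = list(product([0, 1], repeat=n))
print(len(paths))      # 8 = 2**3 paths for three residual blocks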

Architecture Variants

ResNet-34 Configuration

Input: 224×224×3
Conv1: 64 filters, 7×7, stride 2 (output: 112×112×64)
MaxPool: 3×3, stride 2 (output: 56×56×64)
Layer1: [64, 64] × 3 blocks (output: 56×56×64)
Layer2: [128, 128] × 4 blocks, stride 2 in first block (output: 28×28×128)
Layer3: [256, 256] × 6 blocks, stride 2 in first block (output: 14×14×256)
Layer4: [512, 512] × 3 blocks, stride 2 in first block (output: 7×7×512)
Global Average Pool: 7×7 → 1×1 (output: 1×1×512)
FC: 1000 classes
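The configuration above can be wired up directly from the projection-shortcut ResidualBlock defined earlier. The sketch below assumes that class is in scope in the same file; it is a rough reconstruction for checking shapes, not the reference implementation.

import torch
import torch.nn as nn

def make_stage(in_ch, out_ch, num_blocks, stride):
    # First block may downsample/project; the remaining blocks keep dimensions.
    layers = [ResidualBlock(in_ch, out_ch, stride)]
    layers += [ResidualBlock(out_ch, out_ch) for _ in range(num_blocks - 1)]
    return nn.Sequential(*layers)

class ResNet34(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),    # 224 -> 112
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),        # 112 -> 56
        )
        self.layer1 = make_stage(64, 64, 3, stride=1)    # 56×56×64
        self.layer2 = make_stage(64, 128, 4, stride=2)   # 28×28×128
        self.layer3 = make_stage(128, 256, 6, stride=2)  # 14×14×256
        self.layer4 = make_stage(256, 512, 3, stride=2)  # 7×7×512
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.stem(x)
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        x = self.pool(x).flatten(1)
        return self.fc(x)

model = ResNet34()
print(model(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])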

ResNet-50 and Deeper: Bottleneck Blocks

For ResNet-50/101/152, use bottleneck design to reduce computation:

class BottleneckBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        bottleneck_channels = out_channels // 4

        # 1×1 conv: reduce dimensions
        self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, 1)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)

        # 3×3 conv: main computation
        self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, 3,
                               stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)

        # 1×1 conv: restore dimensions
        self.conv3 = nn.Conv2d(bottleneck_channels, out_channels, 1)
        self.bn3 = nn.BatchNorm2d(out_channels)

        # Projection shortcut
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = self.shortcut(x)

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))

        out += identity
        out = self.relu(out)
        return out

Bottleneck benefits (see the parameter-count sketch after this list):

  • Reduces computation: 1×1 convs are cheap
  • Maintains expressiveness: 3×3 operates on reduced dimensions
  • Enables much deeper networks (50-152 layers)
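A quick parameter count makes the saving concrete. This sketch assumes the ResidualBlock (projection version) and BottleneckBlock classes defined above are in scope; the exact numbers depend on those definitions.

def count_params(module):
    return sum(p.numel() for p in module.parameters())

plain = ResidualBlock(256, 256)          # two 3×3 convs at full width
bottleneck = BottleneckBlock(256, 256)   # 1×1 reduce, 3×3 at 64 channels, 1×1 expand
print(count_params(plain), count_params(bottleneck))
# At the same 256-channel output width, the bottleneck block has roughly an
# order of magnitude fewer parameters.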

Common ResNet Variants

Model      | Layers | Parameters | FLOPs | Top-5 Error
ResNet-18  | 18     | 11.7M      | 1.8G  | 10.2%
ResNet-34  | 34     | 21.8M      | 3.6G  | 7.4%
ResNet-50  | 50     | 25.6M      | 4.1G  | 6.7%
ResNet-101 | 101    | 44.5M      | 7.8G  | 6.2%
ResNet-152 | 152    | 60.2M      | 11.3G | 5.7%
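The parameter column can be roughly cross-checked against torchvision's implementations; the short sketch below constructs each variant with random weights (nothing is downloaded) and prints its parameter count.

import torchvision.models as models

for name in ["resnet18", "resnet34", "resnet50", "resnet101", "resnet152"]:
    net = getattr(models, name)()                      # random init, no download
    total = sum(p.numel() for p in net.parameters())
    print(f"{name}: {total / 1e6:.1f}M parameters")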

Training Details

  • Dataset: ImageNet (ILSVRC) classification set (1.28M training images, 1000 classes)
  • Optimization: SGD with momentum (0.9)
  • Learning rate: 0.1, divided by 10 when the error plateaus (see the optimizer sketch after this list)
  • Weight decay: 0.0001
  • Batch size: 256
  • Data augmentation: Random crop/flip, color jitter, scale augmentation
  • Batch normalization: After every convolution, before activation
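A minimal sketch of this recipe in PyTorch is shown below; ReduceLROnPlateau stands in for the paper's "divide by 10 when the error plateaus" schedule, and the model and validation loop are placeholders.

import torch.optim as optim
import torchvision.models as models

model = models.resnet34()   # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)

# In the training loop, after each validation pass:
#   scheduler.step(val_error)   # val_error: the tracked validation error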

Results

ILSVRC-2015 Performance:

  • ResNet (ensemble of models) top-5 error: 3.57%
  • Estimated human top-5 error: ~5.1%
  • Won: ImageNet classification, detection, and localization

One of the first models to surpass the estimated human error rate on ImageNet classification.

Impact and Legacy

Immediate Impact

  1. Enabled depth: Made 100+ layer networks trainable
  2. Superhuman performance: Beat human baseline on ImageNet
  3. Transfer learning standard: Became default backbone for most vision tasks

Architectural Influence

Skip connections became standard in:

  • Image segmentation: U-Net (encoder-decoder with skip connections)
  • Object detection: Feature Pyramid Networks (FPN)
  • NLP: Transformers use residual connections around attention blocks (sketched after this list)
  • Generative models: Diffusion models, GANs use skip connections
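For illustration, the same additive shortcut appears in a (pre-norm) Transformer block; the sketch below is a generic example with arbitrary sizes, not taken from any specific paper.

import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual around attention
        x = x + self.mlp(self.norm2(x))                      # residual around the MLP
        return x

print(TransformerBlockSketch()(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])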

Modern Variants

ResNet inspired many improvements:

  • ResNeXt: Grouped convolutions for better capacity/efficiency
  • Wide ResNet: Wider (more channels) but shallower
  • DenseNet: Dense connections (each layer receives the features of all preceding layers within a block)
  • EfficientNet: Compound scaling of depth/width/resolution

Despite newer architectures (EfficientNet, Vision Transformers), ResNet remains widely used:

1. Transfer Learning

Pre-trained ResNets provide excellent features for most vision tasks:

import torchvision.models as models

# Most common choice for transfer learning
resnet50 = models.resnet50(pretrained=True)

2. Proven Performance

Extensive validation across countless tasks and domains makes it reliable.

3. Computational Efficiency

Balanced accuracy/efficiency trade-off, especially ResNet-50.

4. Easy to Understand and Modify

Clean architecture makes experimentation straightforward.

Key Insights

Residual Learning Principle

Learning perturbations (F(x)F(x)) around identity (xx) is easier than learning mappings from scratch.

Skip Connections Enable Depth

Direct gradient paths prevent vanishing gradients, making very deep networks trainable.

Simplicity Matters

ResNet’s design is simpler than earlier architectures (Inception) yet more effective.

Practical Usage

import torch
import torch.nn as nn
import torchvision.models as models

# Load pre-trained ResNet-50
model = models.resnet50(pretrained=True)

# For transfer learning: replace the final layer
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)

# For feature extraction: remove the final layer
feature_extractor = nn.Sequential(*list(model.children())[:-1])

Related Concepts

  • Skip Connections - Detailed explanation of skip connections
  • Vanishing Gradient Problem - Problem ResNet solves
  • AlexNet - Started the deep learning era
  • VGG Networks - Showed depth matters but couldn’t go as deep
