ResNet: Deep Residual Learning for Image Recognition
Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (Microsoft Research)
Year: 2015 (arXiv preprint); published at CVPR 2016
Venue: CVPR (IEEE Conference on Computer Vision and Pattern Recognition)
Impact: ⭐⭐⭐ - Revolutionary architecture that became the backbone of modern computer vision
Overview
ResNet introduced residual connections (skip connections), solving the degradation problem that prevented training very deep neural networks. This breakthrough enabled networks with 100+ layers and achieved superhuman performance on ImageNet (3.57% top-5 error for the winning ensemble vs. the ~5.1% estimated human error).
ResNet is now ubiquitous - it’s the most commonly used architecture for transfer learning and forms the backbone of countless computer vision applications.
The Degradation Problem
Before ResNet, a puzzling phenomenon occurred: deeper networks performed worse than shallower ones, even on training data.
Expected vs Observed
Intuition suggests: A deeper network should at least match a shallower one by learning identity mappings in extra layers.
Reality: Training accuracy degraded with depth:
- 20-layer network: 91% training accuracy
- 56-layer network: 88% training accuracy (worse!)
This wasn’t overfitting (training error was higher), nor vanishing gradients alone (batch normalization helped but didn’t solve it). Something fundamental was wrong.
The Core Insight
Problem: Learning identity mappings is hard for stacked nonlinear layers.
Solution: Learn a residual mapping $F(x) = H(x) - x$ instead, then add back the input: $H(x) = F(x) + x$.
If identity is optimal, it's easier to push $F(x)$ toward $0$ than to learn $H(x) = x$ directly.
Residual Connections
Mathematical Formulation
A residual block computes:

$$y = F(x, \{W_i\}) + x$$

Where:
- $x$ is the input
- $F(x, \{W_i\})$ is the residual function (the learned transformation)
- $y$ is the output

The $+\,x$ term is the skip connection or shortcut connection.
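When the shapes of $F(x)$ and $x$ differ, the paper adds a learned linear projection $W_s$ on the shortcut (Equations 1 and 2 of the paper); this is what the projection shortcuts below implement with a 1×1 convolution:

$$y = F(x, \{W_i\}) + x \qquad \text{(identity shortcut)}$$

$$y = F(x, \{W_i\}) + W_s x \qquad \text{(projection shortcut, when dimensions change)}$$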
Implementation
```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x  # Save input for skip connection

        # Residual function F(x)
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)

        # Skip connection: add input
        out += identity

        # Final activation
        out = self.relu(out)
        return out
```
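A quick shape check of this block (a minimal sketch; the tensor size is illustrative):

```python
import torch

block = ResidualBlock(channels=64)
x = torch.randn(1, 64, 56, 56)  # one 64-channel 56×56 feature map
y = block(x)
print(y.shape)  # torch.Size([1, 64, 56, 56]) -- same shape as the input, so the addition is well-defined
```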
Projection Shortcuts
When dimensions change (e.g., spatial downsampling or channel increase), use a projection:
```python
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Projection shortcut when dimensions change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = self.shortcut(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += identity
        out = self.relu(out)
        return out
```
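A sketch contrasting the identity and projection cases (tensor sizes are illustrative):

```python
import torch

same = ResidualBlock(64, 64)             # identity shortcut (dimensions unchanged)
down = ResidualBlock(64, 128, stride=2)  # projection shortcut: more channels, halved resolution

x = torch.randn(1, 64, 56, 56)
print(same(x).shape)  # torch.Size([1, 64, 56, 56])
print(down(x).shape)  # torch.Size([1, 128, 28, 28])
```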
Why Skip Connections Work
1. Gradient Flow
Gradients can flow directly backward through the identity connection. For $y = F(x) + x$:

$$\frac{\partial y}{\partial x} = \frac{\partial F(x)}{\partial x} + I$$

The $+\,I$ (identity) term ensures gradients always have a direct path, preventing vanishing gradients.
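Unrolling this across many blocks (under the simplifying assumption of pure identity shortcuts with no post-addition ReLU, as analyzed in the Identity Mappings follow-up paper listed under Key Papers):

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\right)$$

The leading $1$ means part of the gradient reaches layer $l$ directly from the loss, no matter how many layers sit in between.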
2. Easier Optimization
Learning $F(x) = 0$ (do nothing) is easier than learning $H(x) = x$ (a perfect identity) from scratch with nonlinear layers.
3. Feature Reuse
Early layer features can be used directly by later layers without transformation, enabling efficient information flow.
4. Ensemble Effect
A ResNet implicitly creates an ensemble of exponentially many paths:
- A network with $n$ residual blocks has $2^n$ possible paths
- During backpropagation, shorter paths dominate early training
- Deeper paths contribute as training progresses
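A sketch of this "unraveled view" for two blocks (notation is illustrative, following the ensemble interpretation above):

$$x_2 = x_1 + F_2(x_1),\quad x_1 = x_0 + F_1(x_0)
\;\Longrightarrow\;
x_2 = x_0 + F_1(x_0) + F_2\bigl(x_0 + F_1(x_0)\bigr)$$

Recursively expanding the arguments in this way yields the $2^n$ additive paths counted above.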
Architecture Variants
ResNet-34 Configuration
```
Input: 224×224×3

Conv1: 64 filters, 7×7, stride 2
    Output: 112×112×64
MaxPool: 3×3, stride 2
    Output: 56×56×64
Layer1: [64, 64] × 3 blocks
    Output: 56×56×64
Layer2: [128, 128] × 4 blocks (stride 2 in first block)
    Output: 28×28×128
Layer3: [256, 256] × 6 blocks (stride 2 in first block)
    Output: 14×14×256
Layer4: [512, 512] × 3 blocks (stride 2 in first block)
    Output: 7×7×512
Global Average Pool: 7×7 → 1×1
    Output: 1×1×512
FC: 1000 classes
```
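A minimal sketch of how these stages could be assembled from the projection-capable ResidualBlock defined above (a simplified illustration of the layout, not the torchvision implementation):

```python
import torch
import torch.nn as nn

class ResNet34(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),  # 224 -> 112
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),       # 112 -> 56
        )
        self.layer1 = self._make_layer(64, 64, blocks=3, stride=1)    # 56×56
        self.layer2 = self._make_layer(64, 128, blocks=4, stride=2)   # 28×28
        self.layer3 = self._make_layer(128, 256, blocks=6, stride=2)  # 14×14
        self.layer4 = self._make_layer(256, 512, blocks=3, stride=2)  # 7×7
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, num_classes)

    def _make_layer(self, in_ch, out_ch, blocks, stride):
        # First block may downsample/project; the rest keep dimensions
        layers = [ResidualBlock(in_ch, out_ch, stride=stride)]
        layers += [ResidualBlock(out_ch, out_ch) for _ in range(blocks - 1)]
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        x = self.pool(x).flatten(1)
        return self.fc(x)

model = ResNet34()
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```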
ResNet-50 and Deeper: Bottleneck Blocks
For ResNet-50/101/152, use a bottleneck design to reduce computation:
```python
class BottleneckBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        bottleneck_channels = out_channels // 4

        # 1×1 conv: reduce dimensions
        self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, 1)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)

        # 3×3 conv: main computation
        self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, 3,
                               stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)

        # 1×1 conv: restore dimensions
        self.conv3 = nn.Conv2d(bottleneck_channels, out_channels, 1)
        self.bn3 = nn.BatchNorm2d(out_channels)

        # Projection shortcut
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = self.shortcut(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += identity
        out = self.relu(out)
        return out
```
Bottleneck benefits:
- Reduces computation: 1×1 convs are cheap
- Maintains expressiveness: 3×3 operates on reduced dimensions
- Enables much deeper networks (50-152 layers)
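A rough check of the computation claim above, comparing parameter counts of the two block types defined earlier at a 256-channel stage (a sketch; exact numbers depend on the shortcut configuration):

```python
import torch.nn as nn

def param_count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

basic = ResidualBlock(256, 256)         # two 3×3 convs at full width
bottleneck = BottleneckBlock(256, 256)  # 1×1 reduce, 3×3 at width 64, 1×1 expand
print(param_count(basic), param_count(bottleneck))
# The bottleneck block has far fewer parameters despite having one more layer.
```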
Common ResNet Variants
| Model | Layers | Parameters | FLOPs | Top-5 Error |
|---|---|---|---|---|
| ResNet-18 | 18 | 11.7M | 1.8G | 10.2% |
| ResNet-34 | 34 | 21.8M | 3.6G | 7.4% |
| ResNet-50 | 50 | 25.6M | 4.1G | 6.7% |
| ResNet-101 | 101 | 44.5M | 7.8G | 6.2% |
| ResNet-152 | 152 | 60.2M | 11.3G | 5.7% |
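All of these variants ship with torchvision; a quick sketch (using the newer torchvision weights API) for instantiating one and checking its parameter count against the table:

```python
import torchvision.models as models

resnet18 = models.resnet18(weights=None)  # architecture only, no pre-trained weights
n_params = sum(p.numel() for p in resnet18.parameters())
print(f"ResNet-18 parameters: {n_params / 1e6:.1f}M")  # ≈ 11.7M
```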
Training Details
- Dataset: ImageNet (ILSVRC 2012 classification set: ~1.28M training images, 1000 classes)
- Optimization: SGD with momentum (0.9)
- Learning rate: 0.1, divided by 10 when error plateaus
- Weight decay: 0.0001
- Batch size: 256
- Data augmentation: Random crop/flip, color jitter, scale augmentation
- Batch normalization: After every convolution, before activation
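A minimal sketch of this optimization setup in PyTorch, assuming a standard training loop; the patience value is illustrative, and the model can be the ResNet34 sketch above or a torchvision ResNet:

```python
import torch

model = ResNet34()  # or e.g. torchvision's models.resnet34()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 when validation error plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5)

criterion = torch.nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# After each validation epoch: scheduler.step(val_error)
```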
Results
ILSVRC-2015 Performance:
- ResNet ensemble (ILSVRC-2015 winning entry, built on ResNet-152) top-5 error: 3.57%
- Human performance: ~5.1%
- Won: ImageNet classification, detection, and localization
Among the first models to surpass the estimated ~5.1% human top-5 error on ImageNet classification.
Impact and Legacy
Immediate Impact
- Enabled depth: Made 100+ layer networks trainable
- Superhuman performance: Beat human baseline on ImageNet
- Transfer learning standard: Became default backbone for most vision tasks
Architectural Influence
Skip connections became standard in:
- Image segmentation: U-Net (encoder-decoder with skip connections)
- Object detection: Feature Pyramid Networks (FPN)
- NLP: Transformers use residual connections around attention blocks
- Generative models: Diffusion models, GANs use skip connections
Modern Variants
ResNet inspired many improvements:
- ResNeXt: Grouped convolutions for better capacity/efficiency
- Wide ResNet: Wider (more channels) but shallower
- DenseNet: Dense connections (within a block, each layer receives the feature maps of all preceding layers)
- EfficientNet: Compound scaling of depth/width/resolution
Why ResNet Remains Popular
Despite newer architectures (EfficientNet, Vision Transformers), ResNet remains widely used:
1. Transfer Learning
Pre-trained ResNets provide excellent features for most vision tasks:
```python
import torchvision.models as models

# Most common choice for transfer learning
resnet50 = models.resnet50(pretrained=True)
```
2. Proven Performance
Extensive validation across countless tasks and domains makes it reliable.
3. Computational Efficiency
Balanced accuracy/efficiency trade-off, especially ResNet-50.
4. Easy to Understand and Modify
Clean architecture makes experimentation straightforward.
Key Insights
Residual Learning Principle
Learning perturbations ($F(x)$) around the identity ($x$) is easier than learning full mappings ($H(x)$) from scratch.
Skip Connections Enable Depth
Direct gradient paths prevent vanishing gradients, making very deep networks trainable.
Simplicity Matters
ResNet’s design is simpler than earlier architectures (Inception) yet more effective.
Practical Usage
```python
import torch
import torch.nn as nn
import torchvision.models as models

# Load pre-trained ResNet-50
model = models.resnet50(pretrained=True)  # newer torchvision: weights=models.ResNet50_Weights.DEFAULT

# For transfer learning: replace final layer
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)

# For feature extraction: remove final layer
feature_extractor = nn.Sequential(*list(model.children())[:-1])
```
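A short usage check for the feature extractor (assuming the standard 224×224 input; ResNet-50's pooled features are 2048-dimensional):

```python
model.eval()  # the modules are shared, so this also puts feature_extractor in eval mode
with torch.no_grad():
    feats = feature_extractor(torch.randn(1, 3, 224, 224))
print(feats.shape)             # torch.Size([1, 2048, 1, 1])
print(feats.flatten(1).shape)  # torch.Size([1, 2048]) -- ready for a custom classifier
```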
Related Concepts
- Skip Connections - Detailed explanation of skip connections
- Vanishing Gradient Problem - Problem ResNet solves
- AlexNet - Started the deep learning era
- VGG Networks - Showed depth matters but couldn’t go as deep
Key Papers
- Original paper: Deep Residual Learning for Image Recognition (He et al., CVPR 2016)
- Identity Mappings: Identity Mappings in Deep Residual Networks (He et al., 2016) - Improved residual block design
Learning Resources
Videos
- Yannic Kilcher - ResNet Explained - Paper walkthrough
- Andrew Ng - Residual Networks - Deep learning course lecture
Articles
- CS231n ResNet - Stanford course coverage
- Understanding ResNets - Visual explanation