
ResNet: Deep Residual Learning for Image Recognition

Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (Microsoft Research)
Year: 2015 (arXiv preprint; published at CVPR 2016)
Venue: CVPR (IEEE Conference on Computer Vision and Pattern Recognition)
Impact: ⭐⭐⭐ - Revolutionary architecture that became the backbone of modern computer vision

Overview

ResNet introduced residual connections (skip connections), solving the degradation problem that prevented training very deep neural networks. This breakthrough enabled networks with 100+ layers and achieved superhuman performance on ImageNet (3.57% top-5 error vs 5.1% human error).

ResNet is now ubiquitous - it is one of the most commonly used architectures for transfer learning and forms the backbone of countless computer vision applications.

The Degradation Problem

Before ResNet, a puzzling phenomenon occurred: deeper networks performed worse than shallower ones, even on training data.

Expected vs Observed

Intuition suggests: A deeper network should at least match a shallower one by learning identity mappings in extra layers.

Reality: Training accuracy degraded with depth:

  • 20-layer network: 91% training accuracy
  • 56-layer network: 88% training accuracy (worse!)

This wasn’t overfitting (the training error itself was higher), nor was it vanishing gradients alone (batch normalization helped but didn’t solve it). Something fundamental was wrong.

The Core Insight

Problem: Learning an identity mapping H(x) = x is hard for stacked nonlinear layers.

Solution: Learn the residual mapping F(x) = H(x) - x instead, then add the input back:

H(x) = F(x) + x

If the identity is optimal, it is easier to learn F(x) = 0 than to learn H(x) = x directly.
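To see why this reformulation makes the identity easy, here is a minimal sketch (a toy linear block, not the paper's convolutional one): when the learned residual branch outputs zero, the block reduces to an exact identity mapping.

import torch
import torch.nn as nn

# Toy residual block: F is a single linear layer initialized to zero,
# so F(x) = 0 and the block computes H(x) = F(x) + x = x exactly.
class ToyResidual(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Linear(dim, dim)   # stands in for the residual function F
        nn.init.zeros_(self.f.weight)
        nn.init.zeros_(self.f.bias)

    def forward(self, x):
        return self.f(x) + x           # H(x) = F(x) + x

x = torch.randn(2, 4)
block = ToyResidual(4)
print(torch.allclose(block(x), x))     # True: zero residual gives the identity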

Residual Connections

Mathematical Formulation

A residual block computes:

y = F(x, \{W_i\}) + x

Where:

  • x is the input
  • F(x, \{W_i\}) is the residual function (the learned transformation)
  • y is the output

The +x term is the skip connection, also called a shortcut connection.

Implementation

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x  # Save input for skip connection

        # Residual function F(x)
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)

        # Skip connection: add input
        out += identity

        # Final activation
        out = self.relu(out)
        return out

Projection Shortcuts

When dimensions change (e.g., spatial downsampling or channel increase), use a projection:

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Projection shortcut when dimensions change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = self.shortcut(x)

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        out += identity
        out = self.relu(out)
        return out

Why Skip Connections Work

1. Gradient Flow

Gradients can flow directly backward through the identity connection:

\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \left(1 + \frac{\partial F}{\partial x}\right)

The +1 term ensures gradients always have a direct path back to earlier layers, preventing them from vanishing.
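A scalar example (illustrative, not from the paper) makes the +1 visible with autograd: take F(x) = w·x with a tiny w, so the block computes y = w·x + x and dy/dx = w + 1.

import torch

# Residual block with a nearly-dead residual branch: F(x) = w * x, w ≈ 0.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(1e-6)
y = w * x + x          # y = F(x) + x
y.backward()
print(x.grad)          # ≈ 1.000001: the shortcut's +1 keeps the gradient alive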

2. Easier Optimization

Learning F(x) = 0 (do nothing) is easier than learning H(x) = x (a perfect identity) from scratch with nonlinear layers.

3. Feature Reuse

Early layer features can be used directly by later layers without transformation, enabling efficient information flow.

4. Ensemble Effect

A ResNet implicitly creates an ensemble of exponentially many paths:

  • A network with n residual blocks has 2^n possible paths (see the small counting sketch after this list)
  • During backpropagation, shorter paths dominate early training
  • Deeper paths contribute as training progresses
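A tiny enumeration (an illustration of the counting argument, not part of the original paper) shows where the 2^n comes from: each block either contributes its residual branch or is bypassed via the shortcut.

from itertools import product

# Each residual block is either "taken" (1) or "skipped" (0); every binary
# choice vector corresponds to one path through the network.
n = 3
paths = list(product([0, 1], repeat=n))
print(len(paths))      # 8 = 2**3 paths for three residual blocks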

Architecture Variants

ResNet-34 Configuration

Input: 224×224×3
Conv1: 64 filters, 7×7, stride 2 (output: 112×112×64)
MaxPool: 3×3, stride 2 (output: 56×56×64)
Layer1: [64, 64] × 3 blocks (output: 56×56×64)
Layer2: [128, 128] × 4 blocks, stride 2 in first block (output: 28×28×128)
Layer3: [256, 256] × 6 blocks, stride 2 in first block (output: 14×14×256)
Layer4: [512, 512] × 3 blocks, stride 2 in first block (output: 7×7×512)
Global Average Pool: 7×7 → 1×1 (output: 1×1×512)
FC: 1000 classes
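The configuration above can be wired up directly from the projection-shortcut ResidualBlock defined earlier. The sketch below assumes that class is in scope in the same file; it is a rough reconstruction for checking shapes, not the reference implementation.

import torch
import torch.nn as nn

def make_stage(in_ch, out_ch, num_blocks, stride):
    # First block may downsample/project; the remaining blocks keep dimensions.
    layers = [ResidualBlock(in_ch, out_ch, stride)]
    layers += [ResidualBlock(out_ch, out_ch) for _ in range(num_blocks - 1)]
    return nn.Sequential(*layers)

class ResNet34(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),    # 224 -> 112
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),        # 112 -> 56
        )
        self.layer1 = make_stage(64, 64, 3, stride=1)    # 56×56×64
        self.layer2 = make_stage(64, 128, 4, stride=2)   # 28×28×128
        self.layer3 = make_stage(128, 256, 6, stride=2)  # 14×14×256
        self.layer4 = make_stage(256, 512, 3, stride=2)  # 7×7×512
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.stem(x)
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        x = self.pool(x).flatten(1)
        return self.fc(x)

model = ResNet34()
print(model(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])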

ResNet-50 and Deeper: Bottleneck Blocks

For ResNet-50/101/152, use bottleneck design to reduce computation:

class BottleneckBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        bottleneck_channels = out_channels // 4

        # 1×1 conv: reduce dimensions
        self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, 1)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)

        # 3×3 conv: main computation
        self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, 3,
                               stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)

        # 1×1 conv: restore dimensions
        self.conv3 = nn.Conv2d(bottleneck_channels, out_channels, 1)
        self.bn3 = nn.BatchNorm2d(out_channels)

        # Projection shortcut
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = self.shortcut(x)

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))

        out += identity
        out = self.relu(out)
        return out

Bottleneck benefits (see the parameter-count sketch after this list):

  • Reduces computation: 1×1 convs are cheap
  • Maintains expressiveness: 3×3 operates on reduced dimensions
  • Enables much deeper networks (50-152 layers)
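A quick parameter count makes the saving concrete. This sketch assumes the ResidualBlock (projection version) and BottleneckBlock classes defined above are in scope; the exact numbers depend on those definitions.

def count_params(module):
    return sum(p.numel() for p in module.parameters())

plain = ResidualBlock(256, 256)          # two 3×3 convs at full width
bottleneck = BottleneckBlock(256, 256)   # 1×1 reduce, 3×3 at 64 channels, 1×1 expand
print(count_params(plain), count_params(bottleneck))
# At the same 256-channel output width, the bottleneck block has roughly an
# order of magnitude fewer parameters.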

Common ResNet Variants

Model      | Layers | Parameters | FLOPs | Top-5 Error
ResNet-18  | 18     | 11.7M      | 1.8G  | 10.2%
ResNet-34  | 34     | 21.8M      | 3.6G  | 7.4%
ResNet-50  | 50     | 25.6M      | 4.1G  | 6.7%
ResNet-101 | 101    | 44.5M      | 7.8G  | 6.2%
ResNet-152 | 152    | 60.2M      | 11.3G | 5.7%
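The parameter column can be roughly cross-checked against torchvision's implementations; the short sketch below constructs each variant with random weights (nothing is downloaded) and prints its parameter count.

import torchvision.models as models

for name in ["resnet18", "resnet34", "resnet50", "resnet101", "resnet152"]:
    net = getattr(models, name)()                      # random init, no download
    total = sum(p.numel() for p in net.parameters())
    print(f"{name}: {total / 1e6:.1f}M parameters")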

Training Details

  • Dataset: ImageNet (ILSVRC) classification set (1.28M training images, 1000 classes)
  • Optimization: SGD with momentum (0.9)
  • Learning rate: 0.1, divided by 10 when the error plateaus (see the optimizer sketch after this list)
  • Weight decay: 0.0001
  • Batch size: 256
  • Data augmentation: Random crop/flip, color jitter, scale augmentation
  • Batch normalization: After every convolution, before activation
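A minimal sketch of this recipe in PyTorch is shown below; ReduceLROnPlateau stands in for the paper's "divide by 10 when the error plateaus" schedule, and the model and validation loop are placeholders.

import torch.optim as optim
import torchvision.models as models

model = models.resnet34()   # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)

# In the training loop, after each validation pass:
#   scheduler.step(val_error)   # val_error: the tracked validation error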

Results

ILSVRC-2015 Performance:

  • ResNet (ensemble of models) top-5 error: 3.57%
  • Estimated human top-5 error: ~5.1%
  • Won: ImageNet classification, detection, and localization

One of the first models to surpass the estimated human error rate on ImageNet classification.

Impact and Legacy

Immediate Impact

  1. Enabled depth: Made 100+ layer networks trainable
  2. Superhuman performance: Beat human baseline on ImageNet
  3. Transfer learning standard: Became default backbone for most vision tasks

Architectural Influence

Skip connections became standard in:

  • Image segmentation: U-Net (encoder-decoder with skip connections)
  • Object detection: Feature Pyramid Networks (FPN)
  • NLP: Transformers use residual connections around attention blocks (sketched after this list)
  • Generative models: Diffusion models, GANs use skip connections
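For illustration, the same additive shortcut appears in a (pre-norm) Transformer block; the sketch below is a generic example with arbitrary sizes, not taken from any specific paper.

import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual around attention
        x = x + self.mlp(self.norm2(x))                      # residual around the MLP
        return x

print(TransformerBlockSketch()(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])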

Modern Variants

ResNet inspired many improvements:

  • ResNeXt: Grouped convolutions for better capacity/efficiency
  • Wide ResNet: Wider (more channels) but shallower
  • DenseNet: Dense connections (each layer receives the features of all preceding layers within a block)
  • EfficientNet: Compound scaling of depth/width/resolution

Despite newer architectures (EfficientNet, Vision Transformers), ResNet remains widely used:

1. Transfer Learning

Pre-trained ResNets provide excellent features for most vision tasks:

import torchvision.models as models

# Most common choice for transfer learning
resnet50 = models.resnet50(pretrained=True)

2. Proven Performance

Extensive validation across countless tasks and domains makes it reliable.

3. Computational Efficiency

Balanced accuracy/efficiency trade-off, especially ResNet-50.

4. Easy to Understand and Modify

Clean architecture makes experimentation straightforward.

Key Insights

Residual Learning Principle

Learning perturbations (F(x)F(x)) around identity (xx) is easier than learning mappings from scratch.

Skip Connections Enable Depth

Direct gradient paths prevent vanishing gradients, making very deep networks trainable.

Simplicity Matters

ResNet’s design is simpler than earlier architectures (Inception) yet more effective.

Practical Usage

import torch
import torch.nn as nn
import torchvision.models as models

# Load pre-trained ResNet-50
model = models.resnet50(pretrained=True)

# For transfer learning: replace the final layer
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)

# For feature extraction: remove the final layer
feature_extractor = nn.Sequential(*list(model.children())[:-1])

Related Concepts

  • Skip Connections - Detailed explanation of skip connections
  • Vanishing Gradient Problem - Problem ResNet solves
  • AlexNet - Started the deep learning era
  • VGG Networks - Showed depth matters but couldn’t go as deep
