
VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition

Authors: Karen Simonyan, Andrew Zisserman
Year: 2014
Venue: ICLR
Impact: ⭐⭐ - Established importance of depth and simplicity in CNN design

Overview

VGG networks demonstrated that depth matters in CNNs and that a simple, uniform architecture using only 3×3 convolutions could achieve excellent performance. VGG’s clean design made it highly influential for understanding CNNs and remains widely used for feature extraction despite being overtaken in raw performance by later architectures.

Core Philosophy

VGG explored a simple question: How deep can we go with a very simple, uniform architecture?

The answer: Very deep (16-19 layers), and deeper is better (up to a point).

Key Innovation: Small Filters Throughout

VGG uses 3×3 convolutions throughout the entire network (one configuration in the paper also mixes in a few 1×1 convolutions, but the standard VGG-16 and VGG-19 use 3×3 only).

Why 3×3 Filters?

Receptive field equivalence: Two stacked 3×3 convolutions have the same receptive field as one 5×5 convolution:

Layer 1: 3×3 conv → output neuron sees a 3×3 region
Layer 2: 3×3 conv → output neuron now sees a 5×5 region of the original input
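As a quick sanity check (a small sketch, not from the paper), each extra 3×3, stride-1 layer grows the receptive field by 2 pixels per side, so k stacked layers see a (1 + 2k) × (1 + 2k) region of the input:

# Receptive field (per side) of k stacked 3x3, stride-1 convolutions
def receptive_field(k: int) -> int:
    return 1 + 2 * k

print(receptive_field(2), receptive_field(3))  # 5 7 -> matches one 5x5 / one 7x7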

Parameter efficiency: Stacking small filters uses fewer parameters:

  • Two 3×3 convs: 2 × (3 × 3) = 18 parameters per input/output channel
  • One 5×5 conv: 5 × 5 = 25 parameters per input/output channel

More non-linearity: Each convolutional layer is followed by ReLU, so stacking adds more non-linear transformations, making the network more expressive.

Three 3×3 convs have the same receptive field as one 7×7 conv but with:

  • Fewer parameters: 3 × 9 = 27 vs. 49 (a ~45% reduction)
  • More non-linearity: 3 ReLUs instead of 1
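A minimal PyTorch sketch of both comparisons (the channel count C = 64 is an arbitrary example, not a value from the paper):

import torch.nn as nn

C = 64  # example channel count; any value gives the same ratios

def stack(n):  # n stacked 3x3 convs, each followed by ReLU as in VGG
    layers = []
    for _ in range(n):
        layers += [nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

count = lambda m: sum(p.numel() for p in m.parameters())

# Two 3x3 convs vs one 5x5 (same 5x5 receptive field)
print(count(stack(2)), count(nn.Conv2d(C, C, 5, padding=2, bias=False)))  # 73728 vs 102400
# Three 3x3 convs vs one 7x7 (same 7x7 receptive field)
print(count(stack(3)), count(nn.Conv2d(C, C, 7, padding=3, bias=False)))  # 110592 vs 200704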

Architecture Variants

VGG came in multiple depths, with VGG-16 and VGG-19 being the most common:

VGG-16 Configuration

Input: 224×224×3

Block 1: Conv3-64 → Conv3-64 → MaxPool(2×2, stride 2) → Output: 112×112×64
Block 2: Conv3-128 → Conv3-128 → MaxPool(2×2, stride 2) → Output: 56×56×128
Block 3: Conv3-256 → Conv3-256 → Conv3-256 → MaxPool(2×2, stride 2) → Output: 28×28×256
Block 4: Conv3-512 → Conv3-512 → Conv3-512 → MaxPool(2×2, stride 2) → Output: 14×14×512
Block 5: Conv3-512 → Conv3-512 → Conv3-512 → MaxPool(2×2, stride 2) → Output: 7×7×512

Classifier: FC-4096 → ReLU → Dropout(0.5) → FC-4096 → ReLU → Dropout(0.5) → FC-1000 → Softmax
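A minimal PyTorch sketch of the convolutional stack above, written as a torchvision-style configuration list ('M' marks a 2×2 max-pool); the helper name make_features is illustrative, not from the original code:

import torch
import torch.nn as nn

# VGG-16 conv configuration ("D" in the paper): numbers are 3x3 conv output channels
cfg_vgg16 = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg, in_channels=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_channels, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)

features = make_features(cfg_vgg16)
print(features(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])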

Design Patterns

  1. Doubling channels: After each pooling, double the number of filters (64 → 128 → 256 → 512)
  2. Fixed filter size: 3×3 everywhere (with stride 1, padding 1)
  3. Spatial reduction: 2×2 max pooling with stride 2 halves spatial dimensions
  4. Regular structure: Easy to understand and modify

VGG-19

VGG-19 has 3 additional convolutional layers:

  • Blocks 3, 4, 5 each have one more 3×3 conv (4 convs instead of 3)
  • Marginal performance improvement over VGG-16
  • Significantly more computation
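Expressed with the same configuration-list sketch (reusing the illustrative make_features helper from above), the extra layers are the fourth conv in each of blocks 3-5:

# VGG-19 conv configuration ("E" in the paper): blocks 3-5 gain one extra 3x3 conv
cfg_vgg19 = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
             512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']
features_vgg19 = make_features(cfg_vgg19)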

Parameters

VGG-16 parameters: ~138 million

  • Convolutional layers: ~15 million
  • Fully-connected layers: ~123 million (~90% of total)

The large FC layers made VGG memory-intensive. Modern architectures replace them with global average pooling.
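The FC tally follows directly from the layer sizes (the first FC layer takes the flattened 7×7×512 feature map); a quick check:

# Parameters (weights + biases) of VGG-16's three fully-connected layers
fc1 = 7 * 7 * 512 * 4096 + 4096   # 102,764,544
fc2 = 4096 * 4096 + 4096          #  16,781,312
fc3 = 4096 * 1000 + 1000          #   4,097,000
print((fc1 + fc2 + fc3) / 1e6)    # ~123.6 million of the ~138M total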

Training Details

  • Dataset: ImageNet ILSVRC-2014 (1.2M training images, 1000 classes)
  • Optimization: SGD with momentum (0.9)
  • Learning rate: 0.01, reduced by 10× when validation plateaus
  • Weight decay: 0.0005
  • Batch size: 256
  • Data augmentation: Random crops, flips, color jittering
  • Multi-scale training: Trained on multiple scales for better generalization
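These hyperparameters map onto a standard PyTorch optimizer setup; a sketch, not the original training code (the paper reduced the learning rate manually when validation accuracy stopped improving; ReduceLROnPlateau approximates that schedule):

import torch
import torchvision.models as models

model = models.vgg16()  # untrained VGG-16 as a stand-in
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# Drop the LR by 10x when the monitored validation metric plateaus
# (call scheduler.step(val_metric) once per epoch)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)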

Results

ILSVRC-2014 Performance:

  • VGG-16 top-5 error: 7.4%
  • VGG-19 top-5 error: 7.3%
  • Localization task: 1st place

These results improved substantially on AlexNet's 15.3% top-5 error while using a simpler, more principled design.

Modern Usage

Despite being overtaken by ResNet and other architectures, VGG remains widely used:

1. Feature Extraction

VGG features are semantically rich and transfer well:

import torchvision.models as models
import torch.nn as nn

vgg16 = models.vgg16(pretrained=True)
# Remove the classifier and use the convolutional stack as a feature extractor
features = nn.Sequential(*list(vgg16.features.children()))
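For example, running a dummy 224×224 batch through the extractor yields the 7×7×512 conv feature map (a usage sketch; `features` comes from the snippet above):

import torch

x = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image batch
with torch.no_grad():
    feat = features(x)
print(feat.shape)                 # torch.Size([1, 512, 7, 7])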

2. Style Transfer

VGG’s feature representations are excellent for neural style transfer (Gatys et al., 2015). The intermediate layers capture style information effectively.

3. Perceptual Loss

VGG features are commonly used as perceptual loss for image generation:

import torch.nn.functional as F

# Use VGG features to measure perceptual similarity
vgg_features = vgg_model(generated_image)
target_features = vgg_model(target_image)
perceptual_loss = F.mse_loss(vgg_features, target_features)

4. Simplicity

The uniform architecture makes VGG easy to:

  • Understand and teach
  • Modify for experiments
  • Use as a baseline

Limitations

1. Parameter Inefficiency

138M parameters (mostly in FC layers) make VGG:

  • Memory-intensive
  • Slow for both training and inference
  • Difficult to deploy on resource-constrained devices

2. No Skip Connections

Without residual connections, VGG struggles with:

  • Very deep variants (>19 layers)
  • Vanishing gradients
  • Optimization difficulty

3. Computational Cost

Many sequential convolutions without shortcuts make VGG slower than modern efficient architectures.

Key Insights

Depth Matters

VGG conclusively demonstrated that deeper networks outperform shallower ones when properly designed. This motivated research into even deeper architectures (ResNet, DenseNet).

Simple Can Be Effective

The uniform 3×3 conv design showed that architectural complexity isn't necessary: simple, principled designs can achieve excellent results.

Small Filters Are Better

The 3×3 filter strategy (better parameter efficiency + more non-linearity) became standard in modern CNNs.

Legacy

VGG’s impact:

  1. Established design principles: Small filters, depth, regular structure
  2. Transfer learning backbone: Still commonly used for feature extraction
  3. Perceptual metrics: VGG features became standard for measuring perceptual similarity
  4. Pedagogical value: Clean architecture makes it excellent for teaching CNNs

Modern Alternatives

For new projects, consider:

  • ResNet: Better accuracy with fewer parameters (residual connections)
  • EfficientNet: Better accuracy/efficiency trade-off
  • Vision Transformers: State-of-the-art on many benchmarks

But VGG remains valuable for feature extraction and as a perceptual backbone.

Related Concepts

  • AlexNet - Preceded VGG, used larger filters
  • ResNet - Solved VGG’s depth limitations with skip connections
  • Receptive Field - Understanding how stacked 3×3 convs build receptive fields
