VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition
Authors: Karen Simonyan, Andrew Zisserman Year: 2014 Venue: ICLR Impact: ⭐⭐ - Established importance of depth and simplicity in CNN design
Overview
VGG networks demonstrated that depth matters in CNNs and that a simple, uniform architecture using only 3×3 convolutions could achieve excellent performance. VGG’s clean design made it highly influential for understanding CNNs and remains widely used for feature extraction despite being overtaken in raw performance by later architectures.
Core Philosophy
VGG explored a simple question: How deep can we go with a very simple, uniform architecture?
The answer: Very deep (16-19 layers), and deeper is better (up to a point).
Key Innovation: Small Filters Throughout
VGG uses 3×3 convolutions throughout the entire network (one configuration in the paper also mixes in a few 1×1 convolutions).
Why 3×3 Filters?
Receptive field equivalence: Two stacked 3×3 convolutions have the same receptive field as one 5×5 convolution:
Layer 1: 3×3 conv → output neuron sees 3×3 region
Layer 2: 3×3 conv → output neuron now sees 5×5 region of original input
Parameter efficiency: Stacking small filters uses fewer parameters:
- Two 3×3 convs: 2 × (3 × 3) = 18 weights per input/output channel pair
- One 5×5 conv: 5 × 5 = 25 weights per input/output channel pair
More non-linearity: Each convolutional layer is followed by ReLU, so stacking adds more non-linear transformations, making the network more expressive.
Three 3×3 convs have the same receptive field as one 7×7 conv but with:
- Fewer parameters: 3 × (3 × 3) = 27 weights vs 7 × 7 = 49 weights per input/output channel pair (~45% reduction)
- More non-linearity: 3 ReLUs instead of 1
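A quick sanity check of those counts, sketched in PyTorch (the channel count C is arbitrary and only the convolution weights, not biases, are counted):

```python
import torch.nn as nn

def weight_count(*layers):
    """Total convolution weights (biases ignored) in the given layers."""
    return sum(layer.weight.numel() for layer in layers)

C = 64  # arbitrary channel count

two_3x3   = weight_count(nn.Conv2d(C, C, 3), nn.Conv2d(C, C, 3))    # 18·C²
one_5x5   = weight_count(nn.Conv2d(C, C, 5))                        # 25·C²
three_3x3 = weight_count(nn.Conv2d(C, C, 3), nn.Conv2d(C, C, 3),
                         nn.Conv2d(C, C, 3))                        # 27·C²
one_7x7   = weight_count(nn.Conv2d(C, C, 7))                        # 49·C²

print(two_3x3, one_5x5)      # 73728 vs 102400
print(three_3x3, one_7x7)    # 110592 vs 200704 -> ~45% fewer
```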
Architecture Variants
VGG came in multiple depths, with VGG-16 and VGG-19 being most common:
VGG-16 Configuration
Input: 224×224×3
Block 1:
Conv3-64 → Conv3-64 → MaxPool(2×2, stride 2)
Output: 112×112×64
Block 2:
Conv3-128 → Conv3-128 → MaxPool(2×2, stride 2)
Output: 56×56×128
Block 3:
Conv3-256 → Conv3-256 → Conv3-256 → MaxPool(2×2, stride 2)
Output: 28×28×256
Block 4:
Conv3-512 → Conv3-512 → Conv3-512 → MaxPool(2×2, stride 2)
Output: 14×14×512
Block 5:
Conv3-512 → Conv3-512 → Conv3-512 → MaxPool(2×2, stride 2)
Output: 7×7×512
Classifier:
FC-4096 → ReLU → Dropout(0.5)
FC-4096 → ReLU → Dropout(0.5)
FC-1000 → Softmax
Design Patterns
- Doubling channels: After each pooling, double the number of filters (64 → 128 → 256 → 512)
- Fixed filter size: 3×3 everywhere (with stride 1, padding 1)
- Spatial reduction: 2×2 max pooling with stride 2 halves spatial dimensions
- Regular structure: Easy to understand and modify
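Because the structure is so regular, the whole convolutional stack can be generated from a short configuration list, similar in spirit to how torchvision builds its VGG models. A minimal PyTorch sketch (the cfg list and function name here are illustrative, not the library's API):

```python
import torch
import torch.nn as nn

# 'M' marks a 2×2 max-pool; numbers are output channels of a 3×3 conv
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg=VGG16_CFG, in_channels=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_channels, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)

backbone = make_features()
print(backbone(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])
```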
VGG-19
VGG-19 has 3 additional convolutional layers:
- Blocks 3, 4, 5 each have one more 3×3 conv (4 convs instead of 3)
- Marginal performance improvement over VGG-16
- Significantly more computation
Parameters
VGG-16 parameters: ~138 million
- Convolutional layers: ~15 million
- Fully-connected layers: ~123 million (~90% of total)
The large FC layers made VGG memory-intensive. Modern architectures replace them with global average pooling.
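A quick way to verify that split, assuming a recent torchvision (the model is instantiated without downloading weights, since only the parameter shapes matter):

```python
import torchvision.models as models

vgg16 = models.vgg16(weights=None)  # architecture only; no weights downloaded

conv_params = sum(p.numel() for p in vgg16.features.parameters())
fc_params   = sum(p.numel() for p in vgg16.classifier.parameters())

print(f"conv: {conv_params / 1e6:.1f}M, fc: {fc_params / 1e6:.1f}M")
# conv: 14.7M, fc: 123.6M -> the FC layers dominate the ~138M total
```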
Training Details
- Dataset: ImageNet ILSVRC-2014 (1.2M training images, 1000 classes)
- Optimization: SGD with momentum (0.9)
- Learning rate: 0.01, reduced by 10× when validation accuracy plateaus
- Weight decay: 0.0005
- Batch size: 256
- Data augmentation: Random crops, flips, color jittering
- Multi-scale training: Training images rescaled with the shorter side sampled from [256, 512] for better generalization
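A minimal PyTorch sketch of the corresponding optimizer setup (the model, scheduler patience, and step call are illustrative; the paper reduced the rate when validation accuracy stopped improving):

```python
import torch
import torchvision.models as models

model = models.vgg16()  # placeholder model for illustration

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)

# Divide the learning rate by 10 when validation accuracy stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max',
                                                       factor=0.1, patience=3)

# at the end of each epoch:
#   scheduler.step(val_accuracy)
```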
Results
ILSVRC-2014 Performance:
- VGG-16 top-5 error: 7.4%
- VGG-19 top-5 error: 7.3%
- Localization task: 1st place
This was a significant improvement over AlexNet's 15.3% top-5 error, achieved with a simpler, more principled design.
Why VGG Remains Popular
Despite being overtaken by ResNet and other architectures, VGG remains widely used:
1. Feature Extraction
VGG features are semantically rich and transfer well:
```python
import torch.nn as nn
import torchvision.models as models

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)  # pretrained=True on older torchvision

# Remove the classifier; use the convolutional stack as a feature extractor
features = nn.Sequential(*list(vgg16.features.children())).eval()
```
2. Style Transfer
VGG’s feature representations are excellent for neural style transfer (Gatys et al., 2015). The intermediate layers capture style information effectively.
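For instance, style and content activations can be collected by running an image through the truncated `features` module from the snippet above and keeping selected intermediate outputs (a sketch; the layer indices are illustrative choices, roughly relu1_2 through relu4_3, not necessarily those used by Gatys et al.):

```python
import torch

def extract_intermediate(x, layer_indices=(3, 8, 15, 22)):
    """Collect activations after selected layers of the truncated VGG
    (`features` from the snippet above); indices ≈ relu1_2..relu4_3."""
    outputs = []
    for i, layer in enumerate(features):
        x = layer(x)
        if i in layer_indices:
            outputs.append(x)
    return outputs

with torch.no_grad():
    style_feats = extract_intermediate(torch.randn(1, 3, 224, 224))
print([f.shape for f in style_feats])
```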
3. Perceptual Loss
VGG features are commonly used as perceptual loss for image generation:
```python
import torch.nn.functional as F

# Use VGG features to measure perceptual similarity
# (generated_image / target_image are placeholder image batches;
#  `features` is the truncated VGG defined in the snippet above)
vgg_features = features(generated_image)
target_features = features(target_image)
perceptual_loss = F.mse_loss(vgg_features, target_features)
```
4. Simplicity
The uniform architecture makes VGG easy to:
- Understand and teach
- Modify for experiments
- Use as a baseline
Limitations
1. Parameter Inefficiency
138M parameters (mostly in FC layers) make VGG:
- Memory-intensive
- Slow at both training and inference
- Difficult to deploy on resource-constrained devices
2. No Skip Connections
Without residual connections, VGG struggles with:
- Very deep variants (>19 layers)
- Vanishing gradients
- Optimization difficulty
3. Computational Cost
Many sequential convolutions without shortcuts make VGG slower than modern efficient architectures.
Key Insights
Depth Matters
VGG conclusively demonstrated that deeper networks outperform shallower ones when properly designed. This motivated research into even deeper architectures (ResNet, DenseNet).
Simple Can Be Effective
The uniform 3×3 conv design showed that architectural complexity isn’t necessary - simple, principled designs can achieve excellent results.
Small Filters Are Better
The 3×3 filter strategy (better parameter efficiency + more non-linearity) became standard in modern CNNs.
Legacy
VGG’s impact:
- Established design principles: Small filters, depth, regular structure
- Transfer learning backbone: Still commonly used for feature extraction
- Perceptual metrics: VGG features became standard for measuring perceptual similarity
- Pedagogical value: Clean architecture makes it excellent for teaching CNNs
Modern Alternatives
For new projects, consider:
- ResNet: Better accuracy with fewer parameters (residual connections)
- EfficientNet: Better accuracy/efficiency trade-off
- Vision Transformers: State-of-the-art on many benchmarks
But VGG remains valuable for feature extraction and as a perceptual backbone.
Related Concepts
- AlexNet - Preceded VGG, used larger filters
- ResNet - Solved VGG’s depth limitations with skip connections
- Receptive Field - Understanding how stacked 3×3 convs build receptive fields
Key Papers
- Original paper: Very Deep Convolutional Networks for Large-Scale Image Recognition (Simonyan & Zisserman, ICLR 2015)
- Style transfer: A Neural Algorithm of Artistic Style (Gatys et al., 2015) - Uses VGG for style transfer
Learning Resources
Articles
- CS231n VGG Architecture - Stanford course coverage