
VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition

Authors: Karen Simonyan, Andrew Zisserman
Year: 2014
Venue: ICLR
Impact: ⭐⭐ - Established importance of depth and simplicity in CNN design

Overview

VGG networks demonstrated that depth matters in CNNs and that a simple, uniform architecture using only 3×3 convolutions could achieve excellent performance. VGG’s clean design made it highly influential for understanding CNNs and remains widely used for feature extraction despite being overtaken in raw performance by later architectures.

Core Philosophy

VGG explored a simple question: How deep can we go with a very simple, uniform architecture?

The answer: Very deep (16-19 layers), and deeper is better (up to a point).

Key Innovation: Small Filters Throughout

VGG uses 3×3 convolutions throughout the entire network (one configuration in the paper also mixes in a few 1×1 convolutions, but the standard VGG-16 and VGG-19 use 3×3 only).

Why 3×3 Filters?

Receptive field equivalence: Two stacked 3×3 convolutions have the same receptive field as one 5×5 convolution:

Layer 1: 3×3 conv → output neuron sees a 3×3 region
Layer 2: 3×3 conv → output neuron now sees a 5×5 region of the original input
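As a quick sanity check (a small sketch, not from the paper), each extra 3×3, stride-1 layer grows the receptive field by 2 pixels per side, so k stacked layers see a (1 + 2k) × (1 + 2k) region of the input:

# Receptive field (per side) of k stacked 3x3, stride-1 convolutions
def receptive_field(k: int) -> int:
    return 1 + 2 * k

print(receptive_field(2), receptive_field(3))  # 5 7 -> matches one 5x5 / one 7x7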

Parameter efficiency: Stacking small filters uses fewer parameters:

  • Two 3×3 convs: 2 × (3 × 3) = 18 parameters per input/output channel
  • One 5×5 conv: 5 × 5 = 25 parameters per input/output channel

More non-linearity: Each convolutional layer is followed by ReLU, so stacking adds more non-linear transformations, making the network more expressive.

Three 3×3 convs have the same receptive field as one 7×7 conv but with:

  • Fewer parameters: 3 × 9 = 27 vs. 49 (a ~45% reduction)
  • More non-linearity: 3 ReLUs instead of 1
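A minimal PyTorch sketch of both comparisons (the channel count C = 64 is an arbitrary example, not a value from the paper):

import torch.nn as nn

C = 64  # example channel count; any value gives the same ratios

def stack(n):  # n stacked 3x3 convs, each followed by ReLU as in VGG
    layers = []
    for _ in range(n):
        layers += [nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

count = lambda m: sum(p.numel() for p in m.parameters())

# Two 3x3 convs vs one 5x5 (same 5x5 receptive field)
print(count(stack(2)), count(nn.Conv2d(C, C, 5, padding=2, bias=False)))  # 73728 vs 102400
# Three 3x3 convs vs one 7x7 (same 7x7 receptive field)
print(count(stack(3)), count(nn.Conv2d(C, C, 7, padding=3, bias=False)))  # 110592 vs 200704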

Architecture Variants

VGG came in multiple depths, with VGG-16 and VGG-19 being the most common:

VGG-16 Configuration

Input: 224×224×3

Block 1: Conv3-64 → Conv3-64 → MaxPool(2×2, stride 2) → Output: 112×112×64
Block 2: Conv3-128 → Conv3-128 → MaxPool(2×2, stride 2) → Output: 56×56×128
Block 3: Conv3-256 → Conv3-256 → Conv3-256 → MaxPool(2×2, stride 2) → Output: 28×28×256
Block 4: Conv3-512 → Conv3-512 → Conv3-512 → MaxPool(2×2, stride 2) → Output: 14×14×512
Block 5: Conv3-512 → Conv3-512 → Conv3-512 → MaxPool(2×2, stride 2) → Output: 7×7×512

Classifier: FC-4096 → ReLU → Dropout(0.5) → FC-4096 → ReLU → Dropout(0.5) → FC-1000 → Softmax
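A minimal PyTorch sketch of the convolutional stack above, written as a torchvision-style configuration list ('M' marks a 2×2 max-pool); the helper name make_features is illustrative, not from the original code:

import torch
import torch.nn as nn

# VGG-16 conv configuration ("D" in the paper): numbers are 3x3 conv output channels
cfg_vgg16 = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg, in_channels=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_channels, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)

features = make_features(cfg_vgg16)
print(features(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])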

Design Patterns

  1. Doubling channels: After each pooling, double the number of filters (64 → 128 → 256 → 512)
  2. Fixed filter size: 3×3 everywhere (with stride 1, padding 1)
  3. Spatial reduction: 2×2 max pooling with stride 2 halves spatial dimensions
  4. Regular structure: Easy to understand and modify

VGG-19

VGG-19 has 3 additional convolutional layers:

  • Blocks 3, 4, 5 each have one more 3×3 conv (4 convs instead of 3)
  • Marginal performance improvement over VGG-16
  • Significantly more computation
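Expressed with the same configuration-list sketch (reusing the illustrative make_features helper from above), the extra layers are the fourth conv in each of blocks 3-5:

# VGG-19 conv configuration ("E" in the paper): blocks 3-5 gain one extra 3x3 conv
cfg_vgg19 = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
             512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']
features_vgg19 = make_features(cfg_vgg19)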

Parameters

VGG-16 parameters: ~138 million

  • Convolutional layers: ~15 million
  • Fully-connected layers: ~123 million (~90% of total)

The large FC layers made VGG memory-intensive. Modern architectures replace them with global average pooling.
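The FC tally follows directly from the layer sizes (the first FC layer takes the flattened 7×7×512 feature map); a quick check:

# Parameters (weights + biases) of VGG-16's three fully-connected layers
fc1 = 7 * 7 * 512 * 4096 + 4096   # 102,764,544
fc2 = 4096 * 4096 + 4096          #  16,781,312
fc3 = 4096 * 1000 + 1000          #   4,097,000
print((fc1 + fc2 + fc3) / 1e6)    # ~123.6 million of the ~138M total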

Training Details

  • Dataset: ImageNet ILSVRC-2014 (1.2M training images, 1000 classes)
  • Optimization: SGD with momentum (0.9)
  • Learning rate: 0.01, reduced by 10× when validation plateaus
  • Weight decay: 0.0005
  • Batch size: 256
  • Data augmentation: Random crops, flips, color jittering
  • Multi-scale training: Trained on multiple scales for better generalization
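These hyperparameters map onto a standard PyTorch optimizer setup; a sketch, not the original training code (the paper reduced the learning rate manually when validation accuracy stopped improving; ReduceLROnPlateau approximates that schedule):

import torch
import torchvision.models as models

model = models.vgg16()  # untrained VGG-16 as a stand-in
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# Drop the LR by 10x when the monitored validation metric plateaus
# (call scheduler.step(val_metric) once per epoch)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)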

Results

ILSVRC-2014 Performance:

  • VGG-16 top-5 error: 7.4%
  • VGG-19 top-5 error: 7.3%
  • Localization task: 1st place

These results improved substantially on AlexNet's 15.3% top-5 error while using a simpler, more principled design.

Modern Usage

Despite being overtaken by ResNet and other architectures, VGG remains widely used:

1. Feature Extraction

VGG features are semantically rich and transfer well:

import torchvision.models as models
import torch.nn as nn

vgg16 = models.vgg16(pretrained=True)
# Remove the classifier and use the convolutional stack as a feature extractor
features = nn.Sequential(*list(vgg16.features.children()))
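For example, running a dummy 224×224 batch through the extractor yields the 7×7×512 conv feature map (a usage sketch; `features` comes from the snippet above):

import torch

x = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image batch
with torch.no_grad():
    feat = features(x)
print(feat.shape)                 # torch.Size([1, 512, 7, 7])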

2. Style Transfer

VGG’s feature representations are excellent for neural style transfer (Gatys et al., 2015). The intermediate layers capture style information effectively.

3. Perceptual Loss

VGG features are commonly used as perceptual loss for image generation:

import torch.nn.functional as F

# Use VGG features to measure perceptual similarity
vgg_features = vgg_model(generated_image)
target_features = vgg_model(target_image)
perceptual_loss = F.mse_loss(vgg_features, target_features)

4. Simplicity

The uniform architecture makes VGG easy to:

  • Understand and teach
  • Modify for experiments
  • Use as a baseline

Limitations

1. Parameter Inefficiency

138M parameters (mostly in FC layers) make VGG:

  • Memory-intensive
  • Slow for both training and inference
  • Difficult to deploy on resource-constrained devices

2. No Skip Connections

Without residual connections, VGG struggles with:

  • Very deep variants (>19 layers)
  • Vanishing gradients
  • Optimization difficulty

3. Computational Cost

Many sequential convolutions without shortcuts make VGG slower than modern efficient architectures.

Key Insights

Depth Matters

VGG conclusively demonstrated that deeper networks outperform shallower ones when properly designed. This motivated research into even deeper architectures (ResNet, DenseNet).

Simple Can Be Effective

The uniform 3×3 conv design showed that architectural complexity isn't necessary: simple, principled designs can achieve excellent results.

Small Filters Are Better

The 3×3 filter strategy (better parameter efficiency + more non-linearity) became standard in modern CNNs.

Legacy

VGG’s impact:

  1. Established design principles: Small filters, depth, regular structure
  2. Transfer learning backbone: Still commonly used for feature extraction
  3. Perceptual metrics: VGG features became standard for measuring perceptual similarity
  4. Pedagogical value: Clean architecture makes it excellent for teaching CNNs

Modern Alternatives

For new projects, consider:

  • ResNet: Better accuracy with fewer parameters (residual connections)
  • EfficientNet: Better accuracy/efficiency trade-off
  • Vision Transformers: State-of-the-art on many benchmarks

But VGG remains valuable for feature extraction and as a perceptual backbone.

Related Concepts

  • AlexNet - Preceded VGG, used larger filters
  • ResNet - Solved VGG’s depth limitations with skip connections
  • Receptive Field - Understanding how stacked 3×3 convs build receptive fields
