Computer Vision with Convolutional Neural Networks
Convolutional Neural Networks (CNNs) revolutionized computer vision by learning spatial hierarchies of features automatically. This module teaches you the architectural principles that make CNNs effective for image data, from convolution operations to modern architectures like ResNet.
Why CNNs Matter
CNNs were deep learning's first major architectural innovation, building the structure of images directly into the network:
- Spatial inductive bias: Built-in assumptions about images (locality, translation equivariance)
- Parameter efficiency: Weight sharing dramatically reduces parameters
- Hierarchical features: Automatically learn edge → texture → pattern → object hierarchy
- Foundation for VLMs: Vision transformers build on CNN principles
Learning Objectives
After completing this module, you will gain:
- CNN Architecture Mastery: Understand convolution and pooling operations, receptive fields, and why CNNs excel at processing spatial data
- Modern Architecture Knowledge: Learn the evolution from AlexNet to ResNet, understanding key innovations like skip connections and batch normalization
- Transfer Learning Skills: Apply pre-trained models to new tasks, a crucial technique for limited datasets (especially in medical imaging)
- Vision Foundation for VLMs: Build the vision encoder understanding necessary for multimodal models that combine images with text
Prerequisites
Before starting this module, ensure you have completed:
- Neural Network Foundations
- Strong understanding of backpropagation and optimization
- PyTorch Basics: Ready to transition from NumPy to modern frameworks
Week 1: CNN Building Blocks
Day 1-2: Convolution Operations
Core Concept: Convolution Operations
What You’ll Learn:
- Convolution as local connectivity + weight sharing
- Filters, kernels, and feature maps
- Receptive fields and how they grow with depth
- Padding (same vs valid) and stride
- Why convolution works for images
Key Intuitions:
- Early layers detect edges and simple patterns
- Deeper layers detect textures and objects
- Hierarchical feature learning
Practical Skills:
- Compute output dimensions given input, kernel, stride, padding (see the formula sketch after this list)
- Visualize what different kernels detect (edges, blobs, etc.)
- Understand parameter count for conv layers
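To make these calculations concrete, here is a minimal sketch in plain Python of the standard output-size and parameter-count formulas (function names are illustrative):

```python
def conv2d_output_size(h_in, w_in, kernel, stride=1, padding=0):
    """Per spatial dimension: floor((n + 2*padding - kernel) / stride) + 1."""
    h_out = (h_in + 2 * padding - kernel) // stride + 1
    w_out = (w_in + 2 * padding - kernel) // stride + 1
    return h_out, w_out

def conv2d_param_count(c_in, c_out, kernel, bias=True):
    """Each output channel owns one (c_in x kernel x kernel) filter plus an optional bias."""
    return c_out * (c_in * kernel * kernel + (1 if bias else 0))

# Example: 32x32 RGB input, 16 filters of size 3x3, stride 1, padding 1 ("same")
print(conv2d_output_size(32, 32, kernel=3, stride=1, padding=1))  # (32, 32)
print(conv2d_param_count(3, 16, kernel=3))  # 448 = 16 * (3*3*3 + 1)
```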
Learning Resources:
- Videos: CS231n Lecture 5 (Convolutional Neural Networks)
- Reading: CS231n CNN notes
- Interactive: Visualize convolution operations (online tools)
Exercises:
- Manually compute convolution output for small example
- Implement conv2d forward pass in NumPy (a starting-point sketch follows this list)
- Visualize learned filters from trained CNN
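As a starting point for the NumPy exercise, a naive sketch of the conv2d forward pass (single image, stride 1, no padding; loop-based for clarity rather than speed):

```python
import numpy as np

def conv2d_forward(x, w, b):
    """Naive 2-D convolution (cross-correlation, per deep learning convention).

    x: input, shape (C_in, H, W)
    w: filters, shape (C_out, C_in, K, K)
    b: biases, shape (C_out,)
    Returns feature maps of shape (C_out, H - K + 1, W - K + 1).
    """
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    h_out, w_out = h - k + 1, wd - k + 1
    out = np.zeros((c_out, h_out, w_out))
    for f in range(c_out):                      # one filter per output channel
        for i in range(h_out):
            for j in range(w_out):
                patch = x[:, i:i + k, j:j + k]  # local receptive field
                out[f, i, j] = np.sum(patch * w[f]) + b[f]
    return out

# Shape check: 3-channel 8x8 input, four 3x3 filters -> (4, 6, 6)
x = np.random.randn(3, 8, 8)
w = np.random.randn(4, 3, 3, 3)
assert conv2d_forward(x, w, np.zeros(4)).shape == (4, 6, 6)
```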
Checkpoint: Can you explain why convolution is more parameter-efficient than fully connected layers for images?
Day 3-4: Pooling and CNN Architectures
Core Concept: Pooling Layers
What You’ll Learn:
- Max pooling vs average pooling vs global pooling
- Why pooling: translation invariance and dimension reduction
- Spatial pyramid pooling
- Modern alternatives (strided convolutions)
Practical Skills:
- Compute pooling output dimensions
- Understand when to use different pooling types
- Know pooling’s effect on receptive field
Learning Resources:
- Videos: CS231n Lecture 5 (continued)
- Reading: CS231n CNN architecture notes
- Code: Build a simple CNN in PyTorch (Conv → ReLU → Pool → FC)
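One way the Conv → ReLU → Pool → FC stack from the code item above might look in PyTorch (a minimal sketch sized for 32×32 CIFAR-10 images; the channel counts are arbitrary choices, not prescriptions):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Conv -> ReLU -> Pool repeated twice, then a fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3x32x32 -> 16x32x32
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 16x16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32x16x16
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32x8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = SimpleCNN()
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```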
Exercises:
- Implement max pooling forward and backward pass (a NumPy sketch follows this list)
- Build and train a simple CNN on CIFAR-10
- Visualize feature maps at different layers
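For the pooling exercise, a minimal NumPy sketch of 2×2 max pooling; the key idea in the backward pass is that gradient flows only to the position that won each window:

```python
import numpy as np

def maxpool_forward(x, k=2):
    """x: (H, W) with H and W divisible by k. Returns pooled map plus an argmax mask."""
    h, w = x.shape
    out = np.zeros((h // k, w // k))
    mask = np.zeros_like(x, dtype=bool)  # True where the max of each window sits
    for i in range(0, h, k):
        for j in range(0, w, k):
            win = x[i:i + k, j:j + k]
            out[i // k, j // k] = win.max()
            r, c = np.unravel_index(win.argmax(), win.shape)
            mask[i + r, j + c] = True
    return out, mask

def maxpool_backward(dout, mask, k=2):
    """Route each upstream gradient to its window's argmax; all other inputs get 0."""
    dx = np.zeros(mask.shape)
    h, w = mask.shape
    for i in range(0, h, k):
        for j in range(0, w, k):
            dx[i:i + k, j:j + k][mask[i:i + k, j:j + k]] = dout[i // k, j // k]
    return dx
```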
Checkpoint: Can you build a complete CNN architecture and explain each design choice?
Day 5-7: The Evolution of CNN Architectures
Architecture Papers:
- AlexNet (2012) - The Deep Learning Breakthrough
- 8-layer CNN that won ImageNet 2012
- ReLU activation, dropout, data augmentation
- GPU training and local response normalization
- Impact: Proved deep learning works at scale, started the deep learning revolution
- VGG (2014) - The Power of Depth
- 16-19 layer networks with simple 3×3 convolutions
- Demonstrated importance of network depth
- Two stacked 3×3 convs match a 5×5 receptive field (three match 7×7), with fewer parameters and more nonlinearity
- Impact: Showed that depth matters more than kernel size
- ResNet (2015) - Skip Connections Revolution
- 50, 101, 152 layer networks (superhuman ImageNet performance)
- Skip connections solve vanishing gradient problem
- Identity mappings make optimization easier
- Residual learning: learn F(x) = H(x) - x instead of H(x)
- Impact: Enabled training of very deep networks, skip connections now ubiquitous
Key Innovations Timeline:
- AlexNet (2012): Depth + ReLU + Dropout + GPU training → ~15.3% top-5 error (revolutionary at the time)
- VGG (2014): Deeper networks + 3×3 convs → 7.3% top-5 error
- ResNet (2015): Very deep + skip connections → 3.57% top-5 error (surpassing estimated human performance)
Learning Resources:
- Papers: Read all three papers (AlexNet, VGG, ResNet) in full
- Videos: CS231n Lecture 9 (CNN Architectures)
- Code: Implement ResNet blocks in PyTorch
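A sketch of the basic residual block from the ResNet paper (the two-conv variant used in ResNet-18/34; a 1×1 projection on the skip path handles shape changes):

```python
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Computes ReLU(F(x) + x), where F is two 3x3 conv + batch norm pairs."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        # If the residual branch changes shape, project the identity to match
        self.shortcut = nn.Identity()
        if stride != 1 or c_in != c_out:
            self.shortcut = nn.Sequential(
                nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
                nn.BatchNorm2d(c_out),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))  # the skip connection
```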
Exercises:
- Compare AlexNet vs VGG vs ResNet parameter counts
- Implement a residual block from scratch
- Visualize what happens with/without skip connections during training
- Train a small ResNet on CIFAR-10
Checkpoint:
- Can you explain the key innovation of each architecture?
- Can you implement a ResNet block from memory?
- Do you understand why skip connections enable very deep networks?
Week 2: Transfer Learning and Applications
Day 8-10: Transfer Learning
Core Concept: Transfer Learning
What You’ll Learn:
- Why transfer learning works (feature universality)
- Pre-training on ImageNet, fine-tuning on target task
- Feature extraction vs fine-tuning strategies
- When to freeze layers vs train end-to-end
- Domain adaptation challenges
Transfer Learning Strategies:
- Feature Extraction: Freeze pre-trained layers, train only final classifier
- Use when: Very small dataset (<1000 examples), similar domain
- Fast training, prevents overfitting
- Fine-Tuning: Unfreeze some/all layers, train with small learning rate
- Use when: Medium dataset (1000-100k examples)
- Better performance but requires more data
- Train from Scratch: Random initialization
- Use when: Very large dataset (>1M examples) or very different domain
- Requires significant compute and data
Practical Guidelines:
- Tiny dataset (<1000): Feature extraction only
- Small dataset (1000-10k): Fine-tune top layers
- Medium dataset (10k-100k): Fine-tune most/all layers
- Large dataset (>100k): Consider training from scratch or fine-tuning everything
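A sketch of how the first two strategies might be set up with a torchvision ResNet (the `weights=` argument assumes a recent torchvision; the class count is a placeholder):

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 5  # placeholder for your target task

# Strategy 1 - feature extraction: freeze the backbone, train only a new head
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new layer is trainable by default

# Strategy 2 - fine-tuning: keep everything trainable, but use a small learning rate
model_ft = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model_ft.fc = nn.Linear(model_ft.fc.in_features, num_classes)
optimizer = torch.optim.SGD(model_ft.parameters(), lr=1e-3, momentum=0.9)
```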
Medical Imaging Application:
- ImageNet pre-training typically gives a substantial accuracy boost on X-ray tasks compared with training from scratch
- Critical technique for limited medical datasets
- Often fine-tune with domain-specific augmentation
Learning Resources:
- Reading: Transfer learning best practices, domain adaptation papers
- Videos: CS231n Transfer Learning lecture
- Code: Fine-tune ResNet on a small custom dataset
Exercises:
- Download a pre-trained ResNet from torchvision
- Fine-tune on a custom small dataset (<1000 images)
- Compare feature extraction vs fine-tuning vs training from scratch
- Experiment with freezing different numbers of layers
Checkpoint:
- Can you explain when to use feature extraction vs fine-tuning?
- Can you implement transfer learning in PyTorch?
- Do you understand why transfer learning is critical for medical imaging?
Day 11-12: Modern CNN Techniques
Additional Important Concepts:
Batch Normalization:
- Normalizes layer inputs during training
- Allows higher learning rates
- Acts as regularization
- Now standard in most architectures
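At its core, batch norm normalizes each channel to zero mean and unit variance, then rescales with learned parameters. A minimal training-mode sketch (running statistics and the inference path omitted):

```python
import torch

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: (C,). Normalize each channel over batch + space."""
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)  # zero mean, unit variance per channel
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)
```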
Data Augmentation:
- Random crops, flips, rotations
- Color jittering, cutout, mixup
- Domain-specific augmentations for medical images
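A typical torchvision pipeline covering the augmentations above (the jitter strengths and ImageNet normalization constants are common defaults, not requirements):

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),  # random crop, rescaled to the input size
    transforms.RandomHorizontalFlip(),  # valid for most natural images
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet channel statistics
                         std=[0.229, 0.224, 0.225]),
])
```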
Regularization for CNNs:
- Dropout (but less common in modern architectures)
- L2 weight decay
- Data augmentation as regularization
- Early stopping
Learning Resources:
- Papers: Batch Normalization (Ioffe & Szegedy, 2015)
- Reading: CS231n training notes
- Code: Add batch norm and augmentation to your CNN
Checkpoint: Can you explain how batch normalization helps training?
Day 13-14: Hands-On Project
Project: Image Classification with Transfer Learning
Requirements:
- Choose a dataset:
- Stanford Dogs - Fine-grained breed classification
- Caltech-101 - Object recognition across diverse categories
- Planet Amazon Rainforest - Multi-label satellite imagery
- Steel Defect Detection - Industrial quality control
- Plant Disease Recognition - Agricultural disease detection
- Custom dataset - Your own domain
- Implement three approaches:
- Train simple CNN from scratch
- Feature extraction with pre-trained ResNet
- Fine-tuning pre-trained ResNet
- Compare performance, training time, and convergence
- Use proper augmentation and regularization
- Achieve competitive performance (>80% accuracy)
Deliverables:
- Working PyTorch implementation
- Training curves comparing all three approaches
- Analysis of results and learned features
- Visualization of learned filters and feature maps
- Class activation maps (CAM) for interpretability
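For the CAM deliverable, a sketch of the original class activation mapping computation (it applies directly to ResNet-style models, where global average pooling feeds the final linear layer; `model` and `inputs` are assumed from your training code):

```python
import torch

feats = {}
def save_maps(module, inp, out):
    feats["maps"] = out.detach()  # (N, C, H, W) activations before global avg pooling

model.layer4.register_forward_hook(save_maps)  # last conv stage of a torchvision ResNet
logits = model(inputs)
c = logits[0].argmax()  # class to explain, for the first image in the batch

# CAM: weight the final feature maps by the classifier weights for class c, then sum
w = model.fc.weight[c]                              # shape (C,)
cam = (w.view(-1, 1, 1) * feats["maps"][0]).sum(0)  # (H, W) coarse heatmap
cam = torch.relu(cam)
cam = cam / cam.max().clamp(min=1e-8)               # scale to [0, 1] for overlay
```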
Time Estimate: 6-10 hours
Module Completion Criteria
You have completed this module when you can:
- ✅ Explain convolution operation and receptive fields
- ✅ Implement basic CNN architectures in PyTorch
- ✅ Understand the key innovation of AlexNet, VGG, and ResNet
- ✅ Explain why skip connections enable very deep networks
- ✅ Apply transfer learning to new image classification tasks
- ✅ Choose appropriate fine-tuning strategy for a given dataset size
- ✅ Use data augmentation and batch normalization effectively
- ✅ Visualize and interpret CNN learned features
Key Resources
Essential Papers (Must Read)
- ImageNet Classification with Deep CNNs (Krizhevsky et al., 2012) - AlexNet
- Very Deep Convolutional Networks (Simonyan & Zisserman, 2014) - VGG
- Deep Residual Learning for Image Recognition (He et al., 2015) - ResNet
Videos
- CS231n Lectures 5, 7, 9: CNNs, Training, Architectures (~4 hours)
Code Resources
- PyTorch torchvision models (pre-trained weights)
- CS231n Assignment 2 (TensorFlow/PyTorch)
Connection to Advanced Topics
CNNs are the foundation for:
- Vision Transformers (ViT): Patches treated like tokens, but convolution principles still apply
- Medical Imaging AI: Transfer learning from ImageNet to medical images
- Multimodal Models (CLIP): CNN or ViT as vision encoder
- Object Detection: Faster R-CNN, YOLO (built on CNN backbones)
- Semantic Segmentation: U-Net, FCN (encoder-decoder CNNs)
Common Pitfalls
1. Not Using Pre-Trained Models
Problem: Training from scratch on a small dataset
Solution: Always try transfer learning first, especially with <10k examples
2. Wrong Augmentation
Problem: Using augmentation that changes semantic meaning
Solution: Only use augmentations valid for your domain (e.g., no rotations for text)
3. Forgetting to Switch to Eval Mode
Problem: BatchNorm and Dropout behave differently at test time
Solution: Always call model.eval() before inference
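A minimal inference pattern (assuming a trained `model` and a batch `inputs`):

```python
import torch

model.eval()           # BatchNorm uses running stats; Dropout is disabled
with torch.no_grad():  # no gradients needed at inference time
    preds = model(inputs).argmax(dim=1)
model.train()          # switch back before resuming training
```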
4. Over-Fine-Tuning
Problem: Fine-tuning with too high a learning rate destroys pre-trained features
Solution: Use a 10-100x smaller learning rate for fine-tuning than for training from scratch
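One way to express this is with per-parameter-group learning rates (a sketch; the layer names assume a torchvision ResNet and the values are illustrative):

```python
import torch

optimizer = torch.optim.SGD(
    [
        {"params": model.fc.parameters(), "lr": 1e-2},      # fresh head: full rate
        {"params": model.layer4.parameters(), "lr": 1e-4},  # pre-trained: ~100x smaller
    ],
    momentum=0.9,
)
```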
Success Tips
- Visualize Everything
- Visualize learned filters
- Visualize feature maps
- Use Class Activation Maps (CAM) to understand predictions
- Start with Pre-Trained Models
- Don’t train from scratch unless you have massive data
- Transfer learning almost always helps
- Understand Receptive Fields
- Track how receptive field grows with depth
- Ensure final receptive field covers entire object
- Monitor Training Carefully
- Watch train/val curves
- Use TensorBoard or similar
- Catch overfitting early
Next Steps
After completing this module, proceed to:
- Next Module: Module 3: Attention and Transformers
- Advanced Vision: Vision Transformers (ViT), object detection, segmentation
- Domain Application: Medical imaging with CNNs, apply to healthcare data
Time Investment
Total estimated time: 12-18 hours over 2 weeks
- Papers: 3-4 hours (AlexNet, VGG, ResNet)
- Videos: 3-4 hours (CS231n lectures)
- Coding: 6-10 hours (exercises + project)
Key Takeaway
“CNNs work because they encode the right inductive biases for images: locality and translation invariance.”
Understanding why CNNs work is as important as knowing how to use them. The architectural principles (convolution, pooling, skip connections) generalize beyond computer vision to any data with spatial or sequential structure.
Ready to begin? Start with Convolution Operations.