Computer Vision with Convolutional Neural Networks
Convolutional Neural Networks (CNNs) revolutionized computer vision by learning spatial hierarchies of features automatically. This module teaches you the architectural principles that make CNNs effective for image data, from convolution operations to modern architectures like ResNet.
Why CNNs Matter
CNNs were deep learning's first major architectural innovation, building the structure of images directly into the network:
- Spatial inductive bias: Built-in assumptions about images (locality, translation equivariance)
- Parameter efficiency: Weight sharing dramatically reduces parameters
- Hierarchical features: Automatically learn edge → texture → pattern → object hierarchy
- Foundation for VLMs: Vision transformers build on CNN principles
Learning Objectives
After completing this module, you will gain:
- CNN Architecture Mastery: Understand convolution and pooling operations, receptive fields, and why CNNs excel at processing spatial data
- Modern Architecture Knowledge: Learn the evolution from AlexNet to ResNet, understanding key innovations like skip connections and batch normalization
- Transfer Learning Skills: Apply pre-trained models to new tasks, a crucial technique for limited datasets (especially in medical imaging)
- Vision Foundation for VLMs: Build the vision encoder understanding necessary for multimodal models that combine images with text
Prerequisites
Before starting this module, ensure you have completed:
- Neural Network Foundations
- Strong understanding of backpropagation and optimization
- PyTorch Basics: Ready to transition from NumPy to modern frameworks
Week 1: CNN Building Blocks
Day 1-2: Convolution Operations
Core Concept: Convolution Operations
What You’ll Learn:
- Convolution as local connectivity + weight sharing
- Filters, kernels, and feature maps
- Receptive fields and how they grow with depth
- Padding (same vs valid) and stride
- Why convolution works for images
Key Intuitions:
- Early layers detect edges and simple patterns
- Deeper layers detect textures and objects
- Hierarchical feature learning
Practical Skills:
- Compute output dimensions given input, kernel, stride, padding (see the formula sketch after this list)
- Visualize what different kernels detect (edges, blobs, etc.)
- Understand parameter count for conv layers
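To make these calculations concrete, here is a minimal sketch in plain Python of the standard output-size and parameter-count formulas (function names are illustrative):

```python
def conv2d_output_size(h_in, w_in, kernel, stride=1, padding=0):
    """Per spatial dimension: floor((n + 2*padding - kernel) / stride) + 1."""
    h_out = (h_in + 2 * padding - kernel) // stride + 1
    w_out = (w_in + 2 * padding - kernel) // stride + 1
    return h_out, w_out

def conv2d_param_count(c_in, c_out, kernel, bias=True):
    """Each output channel owns one (c_in x kernel x kernel) filter plus an optional bias."""
    return c_out * (c_in * kernel * kernel + (1 if bias else 0))

# Example: 32x32 RGB input, 16 filters of size 3x3, stride 1, padding 1 ("same")
print(conv2d_output_size(32, 32, kernel=3, stride=1, padding=1))  # (32, 32)
print(conv2d_param_count(3, 16, kernel=3))  # 448 = 16 * (3*3*3 + 1)
```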
Learning Resources:
- Videos: CS231n Lecture 5 (Convolutional Neural Networks)
- Reading: CS231n CNN notes
- Interactive: Visualize convolution operations (online tools)
Exercises:
- Manually compute convolution output for small example
- Implement conv2d forward pass in NumPy (a starting-point sketch follows this list)
- Visualize learned filters from trained CNN
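As a starting point for the NumPy exercise, a naive sketch of the conv2d forward pass (single image, stride 1, no padding; loop-based for clarity rather than speed):

```python
import numpy as np

def conv2d_forward(x, w, b):
    """Naive 2-D convolution (cross-correlation, per deep learning convention).

    x: input, shape (C_in, H, W)
    w: filters, shape (C_out, C_in, K, K)
    b: biases, shape (C_out,)
    Returns feature maps of shape (C_out, H - K + 1, W - K + 1).
    """
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    h_out, w_out = h - k + 1, wd - k + 1
    out = np.zeros((c_out, h_out, w_out))
    for f in range(c_out):                      # one filter per output channel
        for i in range(h_out):
            for j in range(w_out):
                patch = x[:, i:i + k, j:j + k]  # local receptive field
                out[f, i, j] = np.sum(patch * w[f]) + b[f]
    return out

# Shape check: 3-channel 8x8 input, four 3x3 filters -> (4, 6, 6)
x = np.random.randn(3, 8, 8)
w = np.random.randn(4, 3, 3, 3)
assert conv2d_forward(x, w, np.zeros(4)).shape == (4, 6, 6)
```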
Checkpoint: Can you explain why convolution is more parameter-efficient than fully connected layers for images?
Day 3-4: Pooling and CNN Architectures
Core Concept: Pooling Layers
What You’ll Learn:
- Max pooling vs average pooling vs global pooling
- Why pooling: translation invariance and dimension reduction
- Spatial pyramid pooling
- Modern alternatives (strided convolutions)
Practical Skills:
- Compute pooling output dimensions
- Understand when to use different pooling types
- Know pooling’s effect on receptive field
Learning Resources:
- Videos: CS231n Lecture 5 (continued)
- Reading: CS231n CNN architecture notes
- Code: Build a simple CNN in PyTorch (Conv → ReLU → Pool → FC)
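One way the Conv → ReLU → Pool → FC stack from the code item above might look in PyTorch (a minimal sketch sized for 32×32 CIFAR-10 images; the channel counts are arbitrary choices, not prescriptions):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Conv -> ReLU -> Pool repeated twice, then a fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3x32x32 -> 16x32x32
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 16x16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32x16x16
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32x8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = SimpleCNN()
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```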
Exercises:
- Implement max pooling forward and backward pass (a NumPy sketch follows this list)
- Build and train a simple CNN on CIFAR-10
- Visualize feature maps at different layers
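For the pooling exercise, a minimal NumPy sketch of 2×2 max pooling; the key idea in the backward pass is that gradient flows only to the position that won each window:

```python
import numpy as np

def maxpool_forward(x, k=2):
    """x: (H, W) with H and W divisible by k. Returns pooled map plus an argmax mask."""
    h, w = x.shape
    out = np.zeros((h // k, w // k))
    mask = np.zeros_like(x, dtype=bool)  # True where the max of each window sits
    for i in range(0, h, k):
        for j in range(0, w, k):
            win = x[i:i + k, j:j + k]
            out[i // k, j // k] = win.max()
            r, c = np.unravel_index(win.argmax(), win.shape)
            mask[i + r, j + c] = True
    return out, mask

def maxpool_backward(dout, mask, k=2):
    """Route each upstream gradient to its window's argmax; all other inputs get 0."""
    dx = np.zeros(mask.shape)
    h, w = mask.shape
    for i in range(0, h, k):
        for j in range(0, w, k):
            dx[i:i + k, j:j + k][mask[i:i + k, j:j + k]] = dout[i // k, j // k]
    return dx
```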
Checkpoint: Can you build a complete CNN architecture and explain each design choice?
Day 5-7: The Evolution of CNN Architectures
Architecture Papers:
- AlexNet (2012) - The Deep Learning Breakthrough
- 8-layer CNN that won ImageNet 2012
- ReLU activation, dropout, data augmentation
- GPU training and local response normalization
- Impact: Proved deep learning works at scale, started the deep learning revolution
- VGG (2014) - The Power of Depth
- 16-19 layer networks with simple 3×3 convolutions
- Demonstrated importance of network depth
- Two stacked 3×3 convs match a 5×5 receptive field (three match 7×7), with fewer parameters and more nonlinearity
- Impact: Showed that depth matters more than kernel size
- ResNet (2015) - Skip Connections Revolution
- 50, 101, 152 layer networks (superhuman ImageNet performance)
- Skip connections solve vanishing gradient problem
- Identity mappings make optimization easier
- Residual learning: learn F(x) = H(x) - x instead of H(x)
- Impact: Enabled training of very deep networks, skip connections now ubiquitous
Key Innovations Timeline:
- AlexNet (2012): Depth + ReLU + Dropout + GPU training → ~15.3% top-5 error (revolutionary at the time)
- VGG (2014): Deeper networks + 3×3 convs → 7.3% top-5 error
- ResNet (2015): Very deep + skip connections → 3.57% top-5 error (surpassing estimated human performance)
Learning Resources:
- Papers: Read all three papers (AlexNet, VGG, ResNet) in full
- Videos: CS231n Lecture 9 (CNN Architectures)
- Code: Implement ResNet blocks in PyTorch
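A sketch of the basic residual block from the ResNet paper (the two-conv variant used in ResNet-18/34; a 1×1 projection on the skip path handles shape changes):

```python
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Computes ReLU(F(x) + x), where F is two 3x3 conv + batch norm pairs."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        # If the residual branch changes shape, project the identity to match
        self.shortcut = nn.Identity()
        if stride != 1 or c_in != c_out:
            self.shortcut = nn.Sequential(
                nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
                nn.BatchNorm2d(c_out),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))  # the skip connection
```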
Exercises:
- Compare AlexNet vs VGG vs ResNet parameter counts
- Implement a residual block from scratch
- Visualize what happens with/without skip connections during training
- Train a small ResNet on CIFAR-10
Checkpoint:
- Can you explain the key innovation of each architecture?
- Can you implement a ResNet block from memory?
- Do you understand why skip connections enable very deep networks?
Week 2: Transfer Learning and Applications
Day 8-10: Transfer Learning
Core Concept: Transfer Learning
What You’ll Learn:
- Why transfer learning works (feature universality)
- Pre-training on ImageNet, fine-tuning on target task
- Feature extraction vs fine-tuning strategies
- When to freeze layers vs train end-to-end
- Domain adaptation challenges
Transfer Learning Strategies:
- Feature Extraction: Freeze pre-trained layers, train only final classifier
- Use when: Very small dataset (<1000 examples), similar domain
- Fast training, prevents overfitting
- Fine-Tuning: Unfreeze some/all layers, train with small learning rate
- Use when: Medium dataset (1000-100k examples)
- Better performance but requires more data
- Train from Scratch: Random initialization
- Use when: Very large dataset (>1M examples) or very different domain
- Requires significant compute and data
Practical Guidelines:
- Tiny dataset (<1000): Feature extraction only
- Small dataset (1000-10k): Fine-tune top layers
- Medium dataset (10k-100k): Fine-tune most/all layers
- Large dataset (>100k): Consider training from scratch or fine-tuning everything
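A sketch of how the first two strategies might be set up with a torchvision ResNet (the `weights=` argument assumes a recent torchvision; the class count is a placeholder):

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 5  # placeholder for your target task

# Strategy 1 - feature extraction: freeze the backbone, train only a new head
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new layer is trainable by default

# Strategy 2 - fine-tuning: keep everything trainable, but use a small learning rate
model_ft = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model_ft.fc = nn.Linear(model_ft.fc.in_features, num_classes)
optimizer = torch.optim.SGD(model_ft.parameters(), lr=1e-3, momentum=0.9)
```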
Medical Imaging Application:
- ImageNet pre-training typically gives a substantial accuracy boost on X-ray tasks compared with training from scratch
- Critical technique for limited medical datasets
- Often fine-tune with domain-specific augmentation
Learning Resources:
- Reading: Transfer learning best practices, domain adaptation papers
- Videos: CS231n Transfer Learning lecture
- Code: Fine-tune ResNet on a small custom dataset
Exercises:
- Download a pre-trained ResNet from torchvision
- Fine-tune on a custom small dataset (<1000 images)
- Compare feature extraction vs fine-tuning vs training from scratch
- Experiment with freezing different numbers of layers
Checkpoint:
- Can you explain when to use feature extraction vs fine-tuning?
- Can you implement transfer learning in PyTorch?
- Do you understand why transfer learning is critical for medical imaging?
Day 11-12: Modern CNN Techniques
Additional Important Concepts:
Batch Normalization:
- Normalizes layer inputs during training
- Allows higher learning rates
- Acts as regularization
- Now standard in most architectures
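At its core, batch norm normalizes each channel to zero mean and unit variance, then rescales with learned parameters. A minimal training-mode sketch (running statistics and the inference path omitted):

```python
import torch

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: (C,). Normalize each channel over batch + space."""
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)  # zero mean, unit variance per channel
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)
```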
Data Augmentation:
- Random crops, flips, rotations
- Color jittering, cutout, mixup
- Domain-specific augmentations for medical images
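A typical torchvision pipeline covering the augmentations above (the jitter strengths and ImageNet normalization constants are common defaults, not requirements):

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),  # random crop, rescaled to the input size
    transforms.RandomHorizontalFlip(),  # valid for most natural images
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet channel statistics
                         std=[0.229, 0.224, 0.225]),
])
```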
Regularization for CNNs:
- Dropout (but less common in modern architectures)
- L2 weight decay
- Data augmentation as regularization
- Early stopping
Learning Resources:
- Papers: Batch Normalization (Ioffe & Szegedy, 2015)
- Reading: CS231n training notes
- Code: Add batch norm and augmentation to your CNN
Checkpoint: Can you explain how batch normalization helps training?
Day 13-14: Hands-On Project
Project: Image Classification with Transfer Learning
Requirements:
- Choose a dataset:
- Stanford Dogs - Fine-grained breed classification
- Caltech-101 - Object recognition across diverse categories
- Planet Amazon Rainforest - Multi-label satellite imagery
- Steel Defect Detection - Industrial quality control
- Plant Disease Recognition - Agricultural disease detection
- Custom dataset - Your own domain
- Implement three approaches:
- Train simple CNN from scratch
- Feature extraction with pre-trained ResNet
- Fine-tuning pre-trained ResNet
- Compare performance, training time, and convergence
- Use proper augmentation and regularization
- Achieve competitive performance (>80% accuracy)
Deliverables:
- Working PyTorch implementation
- Training curves comparing all three approaches
- Analysis of results and learned features
- Visualization of learned filters and feature maps
- Class activation maps (CAM) for interpretability
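For the CAM deliverable, a sketch of the original class activation mapping computation (it applies directly to ResNet-style models, where global average pooling feeds the final linear layer; `model` and `inputs` are assumed from your training code):

```python
import torch

feats = {}
def save_maps(module, inp, out):
    feats["maps"] = out.detach()  # (N, C, H, W) activations before global avg pooling

model.layer4.register_forward_hook(save_maps)  # last conv stage of a torchvision ResNet
logits = model(inputs)
c = logits[0].argmax()  # class to explain, for the first image in the batch

# CAM: weight the final feature maps by the classifier weights for class c, then sum
w = model.fc.weight[c]                              # shape (C,)
cam = (w.view(-1, 1, 1) * feats["maps"][0]).sum(0)  # (H, W) coarse heatmap
cam = torch.relu(cam)
cam = cam / cam.max().clamp(min=1e-8)               # scale to [0, 1] for overlay
```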
Time Estimate: 6-10 hours
Module Completion Criteria
You have completed this module when you can:
- ✅ Explain convolution operation and receptive fields
- ✅ Implement basic CNN architectures in PyTorch
- ✅ Understand the key innovation of AlexNet, VGG, and ResNet
- ✅ Explain why skip connections enable very deep networks
- ✅ Apply transfer learning to new image classification tasks
- ✅ Choose appropriate fine-tuning strategy for a given dataset size
- ✅ Use data augmentation and batch normalization effectively
- ✅ Visualize and interpret CNN learned features
Key Resources
Essential Papers (Must Read)
- ImageNet Classification with Deep CNNs (Krizhevsky et al., 2012) - AlexNet
- Very Deep Convolutional Networks (Simonyan & Zisserman, 2014) - VGG
- Deep Residual Learning for Image Recognition (He et al., 2015) - ResNet
Videos
- CS231n Lectures 5, 7, 9: CNNs, Training, Architectures (~4 hours)
Code Resources
- PyTorch torchvision models (pre-trained weights)
- CS231n Assignment 2 (TensorFlow/PyTorch)
Connection to Advanced Topics
CNNs are the foundation for:
- Vision Transformers (ViT): Patches treated like tokens, but convolution principles still apply
- Medical Imaging AI: Transfer learning from ImageNet to medical images
- Multimodal Models (CLIP): CNN or ViT as vision encoder
- Object Detection: Faster R-CNN, YOLO (built on CNN backbones)
- Semantic Segmentation: U-Net, FCN (encoder-decoder CNNs)
Common Pitfalls
1. Not Using Pre-Trained Models
Problem: Training from scratch on a small dataset
Solution: Always try transfer learning first, especially with <10k examples
2. Wrong Augmentation
Problem: Using augmentation that changes semantic meaning
Solution: Only use augmentations valid for your domain (e.g., no rotations for text)
3. Forgetting to Switch to Eval Mode
Problem: BatchNorm and Dropout behave differently at test time
Solution: Always call model.eval() before inference
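A minimal inference pattern (assuming a trained `model` and a batch `inputs`):

```python
import torch

model.eval()           # BatchNorm uses running stats; Dropout is disabled
with torch.no_grad():  # no gradients needed at inference time
    preds = model(inputs).argmax(dim=1)
model.train()          # switch back before resuming training
```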
4. Over-Fine-Tuning
Problem: Fine-tuning with too high a learning rate destroys pre-trained features
Solution: Use a 10-100x smaller learning rate for fine-tuning than for training from scratch
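One way to express this is with per-parameter-group learning rates (a sketch; the layer names assume a torchvision ResNet and the values are illustrative):

```python
import torch

optimizer = torch.optim.SGD(
    [
        {"params": model.fc.parameters(), "lr": 1e-2},      # fresh head: full rate
        {"params": model.layer4.parameters(), "lr": 1e-4},  # pre-trained: ~100x smaller
    ],
    momentum=0.9,
)
```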
Success Tips
- Visualize Everything
- Visualize learned filters
- Visualize feature maps
- Use Class Activation Maps (CAM) to understand predictions
- Start with Pre-Trained Models
- Don’t train from scratch unless you have massive data
- Transfer learning almost always helps
- Understand Receptive Fields
- Track how receptive field grows with depth
- Ensure final receptive field covers entire object
- Monitor Training Carefully
- Watch train/val curves
- Use TensorBoard or similar
- Catch overfitting early
Next Steps
After completing this module, proceed to:
- Next Module: Module 3: Attention and Transformers
- Advanced Vision: Vision Transformers (ViT), object detection, segmentation
- Domain Application: Medical imaging with CNNs, apply to healthcare data
Time Investment
Total estimated time: 12-18 hours over 2 weeks
- Papers: 3-4 hours (AlexNet, VGG, ResNet)
- Videos: 3-4 hours (CS231n lectures)
- Coding: 6-10 hours (exercises + project)
Key Takeaway
“CNNs work because they encode the right inductive biases for images: locality and translation invariance.”
Understanding why CNNs work is as important as knowing how to use them. The architectural principles (convolution, pooling, skip connections) generalize beyond computer vision to any data with spatial or sequential structure.
Ready to begin? Start with Convolution Operations.