Computer Vision with Convolutional Neural Networks

Convolutional Neural Networks (CNNs) revolutionized computer vision by learning spatial hierarchies of features automatically. This module teaches you the architectural principles that make CNNs effective for image data, from convolution operations to modern architectures like ResNet.

Why CNNs Matter

CNNs were among the first major architectural innovations in deep learning:

  • Spatial inductive bias: Built-in assumptions about images (locality, translation invariance)
  • Parameter efficiency: Weight sharing dramatically reduces parameters
  • Hierarchical features: Automatically learn edge → texture → pattern → object hierarchy
  • Foundation for VLMs: Vision transformers build on CNN principles

Learning Objectives

After completing this module, you will:

  • CNN Architecture Mastery: Understand convolution and pooling operations, receptive fields, and why CNNs excel at processing spatial data
  • Modern Architecture Knowledge: Learn the evolution from AlexNet to ResNet, understanding key innovations like skip connections and batch normalization
  • Transfer Learning Skills: Apply pre-trained models to new tasks, a crucial technique for limited datasets (especially in medical imaging)
  • Vision Foundation for VLMs: Build the vision encoder understanding necessary for multimodal models that combine images with text

Prerequisites

Before starting this module, ensure you have completed:

  • Neural Network Foundations
  • Strong understanding of backpropagation and optimization
  • PyTorch Basics: Ready to transition from NumPy to modern frameworks

Week 1: CNN Building Blocks

Day 1-2: Convolution Operations

Core Concept:

Convolution Operations

What You’ll Learn:

  • Convolution as local connectivity + weight sharing
  • Filters, kernels, and feature maps
  • Receptive fields and how they grow with depth
  • Padding (same vs valid) and stride
  • Why convolution works for images

Key Intuitions:

  • Early layers detect edges and simple patterns
  • Deeper layers detect textures and objects
  • Hierarchical feature learning

Practical Skills:

  • Compute output dimensions given input, kernel, stride, padding
  • Visualize what different kernels detect (edges, blobs, etc.)
  • Understand parameter count for conv layers

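As a quick reference for the first and third skills, the output size is ⌊(n + 2p − k)/s⌋ + 1 and a conv layer has k·k·C_in·C_out + C_out parameters. A minimal sketch (function names are my own):

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

def conv_param_count(c_in, c_out, k, bias=True):
    """Weights in a conv layer: one k x k x c_in filter per output channel, plus biases."""
    return k * k * c_in * c_out + (c_out if bias else 0)

# Example: 32x32 input, 3x3 kernel, stride 1, "same" padding of 1
print(conv_output_size(32, 3, stride=1, padding=1))  # 32
# 16 filters over an RGB image: 3*3*3*16 + 16 = 448 parameters
print(conv_param_count(3, 16, 3))  # 448
```
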
Learning Resources:

  • Videos: CS231n Lecture 5 (Convolutional Neural Networks)
  • Reading: CS231n CNN notes
  • Interactive: Visualize convolution operations (online tools)

Exercises:

  • Manually compute convolution output for small example
  • Implement conv2d forward pass in NumPy
  • Visualize learned filters from trained CNN

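For the second exercise, one minimal (deliberately unvectorized) sketch of the forward pass, assuming a single image and no padding:

```python
import numpy as np

def conv2d_forward(x, w, b, stride=1):
    """Naive conv2d forward pass.
    x: input of shape (C_in, H, W)
    w: filters of shape (C_out, C_in, K, K)
    b: biases of shape (C_out,)
    """
    c_out, _, k, _ = w.shape
    h_out = (x.shape[1] - k) // stride + 1
    w_out = (x.shape[2] - k) // stride + 1
    out = np.zeros((c_out, h_out, w_out))
    for f in range(c_out):          # each filter
        for i in range(h_out):      # each output row
            for j in range(w_out):  # each output column
                patch = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
                out[f, i, j] = np.sum(patch * w[f]) + b[f]
    return out

# Sanity check: 3-channel 5x5 input, two 3x3 filters -> (2, 3, 3) output
print(conv2d_forward(np.random.randn(3, 5, 5),
                     np.random.randn(2, 3, 3, 3), np.zeros(2)).shape)
```
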
Checkpoint: Can you explain why convolution is more parameter-efficient than fully connected layers for images?

Day 3-4: Pooling and CNN Architectures

Core Concept:

Pooling Layers

What You’ll Learn:

  • Max pooling vs average pooling vs global pooling
  • Why pooling: translation invariance and dimension reduction
  • Spatial pyramid pooling
  • Modern alternatives (strided convolutions)

Practical Skills:

  • Compute pooling output dimensions
  • Understand when to use different pooling types
  • Know pooling’s effect on receptive field

Learning Resources:

  • Videos: CS231n Lecture 5 (continued)
  • Reading: CS231n CNN architecture notes
  • Code: Build a simple CNN in PyTorch (Conv → ReLU → Pool → FC)

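For the code resource above, a minimal sketch of the Conv → ReLU → Pool → FC pattern (layer widths are illustrative, sized for 32×32 CIFAR-10 inputs):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Two Conv -> ReLU -> Pool stages followed by a linear classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 32x32 -> 32x32
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 16x16 -> 16x16
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = SimpleCNN()
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```
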
Exercises:

  • Implement max pooling forward and backward pass
  • Build and train a simple CNN on CIFAR-10
  • Visualize feature maps at different layers

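For the first exercise, the key idea is to cache which element of each window was the max in the forward pass, then route the upstream gradient only to that element in the backward pass. A minimal single-channel sketch with stride equal to the window size:

```python
import numpy as np

def maxpool_forward(x, k=2):
    """Non-overlapping k x k max pooling; caches argmax positions for backward."""
    h, w = x.shape
    out = np.zeros((h // k, w // k))
    mask = np.zeros_like(x, dtype=bool)  # True where each window's max lives
    for i in range(h // k):
        for j in range(w // k):
            window = x[i*k:(i+1)*k, j*k:(j+1)*k]
            out[i, j] = window.max()
            r, c = np.unravel_index(window.argmax(), window.shape)
            mask[i*k + r, j*k + c] = True
    return out, mask

def maxpool_backward(dout, mask, k=2):
    """Gradient flows only through the max element of each window."""
    upsampled = np.repeat(np.repeat(dout, k, axis=0), k, axis=1)
    return np.where(mask, upsampled, 0.0)
```
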
Checkpoint: Can you build a complete CNN architecture and explain each design choice?

Day 5-7: The Evolution of CNN Architectures

Architecture Papers:

  1. AlexNet (2012) - The Deep Learning Breakthrough

    • 8-layer CNN that won ImageNet 2012
    • ReLU activation, dropout, data augmentation
    • GPU training and local response normalization
    • Impact: Proved deep learning works at scale, started the deep learning revolution
  2. VGG (2014) - The Power of Depth

    • 16-19 layer networks with simple 3×3 convolutions
    • Demonstrated importance of network depth
    • Two stacked 3×3 convs match the receptive field of one 5×5 conv (three match a 7×7) with fewer parameters and more nonlinearity (e.g., 2·9·C² = 18C² weights vs 25C² for one 5×5)
    • Impact: Showed that depth matters more than kernel size
  3. ResNet (2015) - Skip Connections Revolution

    • 50-, 101-, and 152-layer networks (ImageNet top-5 error below the estimated ~5% human level)
    • Skip connections solve vanishing gradient problem
    • Identity mappings make optimization easier
    • Residual learning: learn F(x) = H(x) - x instead of H(x)
    • Impact: Enabled training of very deep networks, skip connections now ubiquitous

Key Innovations Timeline:

  • AlexNet (2012): Depth + ReLU + Dropout + GPU → 15.3% top-5 error (revolutionary at the time)
  • VGG (2014): Deeper + 3×3 convs → 7.3% top-5 error
  • ResNet (2015): Very deep + skip connections → 3.57% top-5 error (below the ~5% estimated human level)

Learning Resources:

  • Papers: Read all three papers (AlexNet, VGG, ResNet) in full
  • Videos: CS231n Lecture 9 (CNN Architectures)
  • Code: Implement ResNet blocks in PyTorch

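A minimal sketch of the basic block used in ResNet-18/34 (two 3×3 convs; a 1×1 projection on the skip path when the shape changes):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic ResNet block: out = ReLU(F(x) + x), where F is two 3x3 convs."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 projection so the identity path matches F(x)'s shape when needed
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))  # the skip connection

block = ResidualBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 16, 16])
```
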
Exercises:

  • Compare AlexNet vs VGG vs ResNet parameter counts
  • Implement a residual block from scratch
  • Visualize what happens with/without skip connections during training
  • Train a small ResNet on CIFAR-10

Checkpoint:

  • Can you explain the key innovation of each architecture?
  • Can you implement a ResNet block from memory?
  • Do you understand why skip connections enable very deep networks?

Week 2: Transfer Learning and Applications

Day 8-10: Transfer Learning

Core Concept:

Transfer Learning

What You’ll Learn:

  • Why transfer learning works (feature universality)
  • Pre-training on ImageNet, fine-tuning on target task
  • Feature extraction vs fine-tuning strategies
  • When to freeze layers vs train end-to-end
  • Domain adaptation challenges

Transfer Learning Strategies:

  1. Feature Extraction: Freeze pre-trained layers, train only final classifier

    • Use when: Very small dataset (<1000 examples), similar domain
    • Fast training, prevents overfitting
  2. Fine-Tuning: Unfreeze some/all layers, train with small learning rate

    • Use when: Medium dataset (1000-100k examples)
    • Better performance but requires more data
  3. Train from Scratch: Random initialization

    • Use when: Very large dataset (>1M examples) or very different domain
    • Requires significant compute and data

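A minimal sketch of strategies 1 and 2 with torchvision (the 5-class head is a placeholder; the `weights=` API assumes torchvision ≥ 0.13, older versions use `pretrained=True`):

```python
import torch
import torch.nn as nn
from torchvision import models

# Strategy 1: feature extraction -- freeze the backbone, train only a new head
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                    # freeze pre-trained weights
model.fc = nn.Linear(model.fc.in_features, 5)  # new head (trainable by default)

# Strategy 2: fine-tuning -- unfreeze everything, train with a small learning rate
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # ~10-100x below from-scratch rates
```
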
Practical Guidelines:

  • Tiny dataset (<1000): Feature extraction only
  • Small dataset (1000-10k): Fine-tune top layers
  • Medium dataset (10k-100k): Fine-tune most/all layers
  • Large dataset (>100k): Consider training from scratch or fine-tuning everything

Medical Imaging Application:

  • Transfer from ImageNet to X-rays: 30-40% performance boost
  • Critical technique for limited medical datasets
  • Often fine-tune with domain-specific augmentation

Learning Resources:

  • Reading: Transfer learning best practices, domain adaptation papers
  • Videos: CS231n Transfer Learning lecture
  • Code: Fine-tune ResNet on a small custom dataset

Exercises:

  • Download a pre-trained ResNet from torchvision
  • Fine-tune on a custom small dataset (<1000 images)
  • Compare feature extraction vs fine-tuning vs training from scratch
  • Experiment with freezing different numbers of layers

Checkpoint:

  • Can you explain when to use feature extraction vs fine-tuning?
  • Can you implement transfer learning in PyTorch?
  • Do you understand why transfer learning is critical for medical imaging?

Day 11-12: Modern CNN Techniques

Additional Important Concepts:

Batch Normalization:

  • Normalizes layer inputs during training
  • Allows higher learning rates
  • Acts as regularization
  • Now standard in most architectures

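In PyTorch this is a single layer; a minimal sketch of the common Conv → BN → ReLU ordering (the conv bias is dropped because BN re-centers its input anyway):

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),  # normalize each channel over the batch, then scale and shift
    nn.ReLU(),
)

x = torch.randn(8, 3, 32, 32)
block.train()       # training mode: use batch statistics, update running estimates
y_train = block(x)
block.eval()        # eval mode: use the stored running statistics instead
y_eval = block(x)
```
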
Data Augmentation:

  • Random crops, flips, rotations
  • Color jittering, cutout, mixup
  • Domain-specific augmentations for medical images

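A typical torchvision training pipeline as a sketch (transform choices and magnitudes are illustrative; the normalization constants are commonly cited CIFAR-10 statistics):

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),      # random crops
    transforms.RandomHorizontalFlip(),         # flips
    transforms.ColorJitter(0.2, 0.2, 0.2),     # color jittering
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # per-channel CIFAR-10 means
                         (0.2470, 0.2435, 0.2616)),  # and standard deviations
])
# Apply these only to training data; use deterministic transforms for val/test.
```
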
Regularization for CNNs:

  • Dropout (but less common in modern architectures)
  • L2 weight decay
  • Data augmentation as regularization
  • Early stopping

Learning Resources:

  • Papers: Batch Normalization (Ioffe & Szegedy, 2015)
  • Reading: CS231n training notes
  • Code: Add batch norm and augmentation to your CNN

Checkpoint: Can you explain how batch normalization helps training?

Day 13-14: Hands-On Project

Project: Image Classification with Transfer Learning

Requirements:

  1. Choose a dataset (e.g., CIFAR-10 or a small custom dataset of <1000 images, as in the earlier exercises)
  2. Implement three approaches:
    • Train simple CNN from scratch
    • Feature extraction with pre-trained ResNet
    • Fine-tuning pre-trained ResNet
  3. Compare performance, training time, and convergence
  4. Use proper augmentation and regularization
  5. Achieve competitive performance (>80% accuracy)

Deliverables:

  • Working PyTorch implementation
  • Training curves comparing all three approaches
  • Analysis of results and learned features
  • Visualization of learned filters and feature maps
  • Class activation maps (CAM) for interpretability

Time Estimate: 6-10 hours

Module Completion Criteria

You have completed this module when you can:

  • ✅ Explain convolution operation and receptive fields
  • ✅ Implement basic CNN architectures in PyTorch
  • ✅ Understand the key innovation of AlexNet, VGG, and ResNet
  • ✅ Explain why skip connections enable very deep networks
  • ✅ Apply transfer learning to new image classification tasks
  • ✅ Choose appropriate fine-tuning strategy for a given dataset size
  • ✅ Use data augmentation and batch normalization effectively
  • ✅ Visualize and interpret CNN learned features

Key Resources

Essential Papers (Must Read)

  1. ImageNet Classification with Deep CNNs (Krizhevsky et al., 2012) - AlexNet
  2. Very Deep Convolutional Networks (Simonyan & Zisserman, 2014) - VGG
  3. Deep Residual Learning for Image Recognition (He et al., 2015) - ResNet

Videos

  • CS231n Lectures 5, 7, 9: CNNs, Training, Architectures (~4 hours)

Code Resources

  • PyTorch torchvision models (pre-trained weights)
  • CS231n Assignment 2 (TensorFlow/PyTorch)

Connection to Advanced Topics

CNNs are the foundation for:

  • Vision Transformers (ViT): Patches treated like tokens, but convolution principles still apply
  • Medical Imaging AI: Transfer learning from ImageNet to medical images
  • Multimodal Models (CLIP): CNN or ViT as vision encoder
  • Object Detection: Faster R-CNN, YOLO (built on CNN backbones)
  • Semantic Segmentation: U-Net, FCN (encoder-decoder CNNs)

Common Pitfalls

1. Not Using Pre-Trained Models

Problem: Training from scratch on a small dataset.
Solution: Always try transfer learning first, especially with <10k examples.

2. Wrong Augmentation

Problem: Using augmentation that changes semantic meaning.
Solution: Only use augmentations valid for your domain (e.g., no rotations for text).

3. Forgetting to Switch to Eval Mode

Problem: BatchNorm and Dropout behave differently at test time.
Solution: Always call model.eval() before inference.
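
A minimal sketch of the pattern (the model and inputs are assumed to exist):

```python
import torch

@torch.no_grad()                # also skip gradient tracking during inference
def predict(model, inputs):
    model.eval()                # BatchNorm uses running stats; Dropout is disabled
    preds = model(inputs).argmax(dim=1)
    model.train()               # switch back before resuming training
    return preds
```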

4. Over-Fine-Tuning

Problem: Fine-tuning with too high a learning rate destroys pre-trained features.
Solution: Use a 10-100x smaller learning rate for fine-tuning than for training from scratch.
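
One common way to enforce this is per-parameter-group learning rates (a sketch; the split point, class count, and rates are illustrative):

```python
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Linear(model.fc.in_features, 5)  # new 5-class head (assumed)

optimizer = torch.optim.SGD([
    # pre-trained backbone: small learning rate to preserve learned features
    {"params": [p for n, p in model.named_parameters() if not n.startswith("fc")],
     "lr": 1e-4},
    # freshly initialized head: larger learning rate
    {"params": model.fc.parameters(), "lr": 1e-2},
], momentum=0.9)
```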

Success Tips

  1. Visualize Everything

    • Visualize learned filters
    • Visualize feature maps
    • Use Class Activation Maps (CAM) to understand predictions
  2. Start with Pre-Trained Models

    • Don’t train from scratch unless you have massive data
    • Transfer learning almost always helps
  3. Understand Receptive Fields

    • Track how the receptive field grows with depth (see the sketch after this list)
    • Ensure final receptive field covers entire object
  4. Monitor Training Carefully

    • Watch train/val curves
    • Use TensorBoard or similar
    • Catch overfitting early

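For tip 3, the receptive field follows a simple recurrence: each layer adds (k − 1) · j to the field, where j is the cumulative stride (“jump”), and multiplies j by its own stride. A minimal sketch:

```python
def receptive_field(layers):
    """layers: (kernel, stride) pairs from input to output.
    Returns one output unit's receptive field in input pixels."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump  # each layer widens the field in input-space steps
        jump *= s            # strides compound across layers
    return r

# Conv3x3 -> Pool2x2 -> Conv3x3 -> Pool2x2 -> Conv3x3
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]))  # 18
```
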
Next Steps

After completing this module, proceed to:

  1. Next Module: Module 3: Attention and Transformers
  2. Advanced Vision: Vision Transformers (ViT), object detection, segmentation
  3. Domain Application: Medical imaging with CNNs, apply to healthcare data

Time Investment

Total estimated time: 12-18 hours over 2 weeks

  • Papers: 3-4 hours (AlexNet, VGG, ResNet)
  • Videos: 3-4 hours (CS231n lectures)
  • Coding: 6-10 hours (exercises + project)

Key Takeaway

“CNNs work because they encode the right inductive biases for images: locality and translation invariance.”

Understanding why CNNs work is as important as knowing how to use them. The architectural principles (convolution, pooling, skip connections) generalize beyond computer vision to any data with spatial or sequential structure.


Ready to begin? Start with Convolution Operations.