
Module 2 Overview: Computer Vision with CNNs

Foundation Module

Time: 12-18 hours over 1-2 weeks

Learning Objectives

After completing this module, you will be able to:

  • CNN Architecture Mastery: Understand convolution and pooling operations, receptive fields, and why CNNs excel at processing spatial data
  • Modern Architecture Knowledge: Learn the evolution from AlexNet to ResNet, understanding key innovations like skip connections and batch normalization
  • Transfer Learning Skills: Apply pre-trained models to new tasks, a crucial technique for limited medical imaging datasets
  • Vision Foundation for VLMs: Build the vision encoder understanding necessary for multimodal models that combine images with text

Why This Module Matters

Convolutional Neural Networks (CNNs) revolutionized computer vision by learning spatial hierarchies of features automatically. This module teaches you the architectural principles that make CNNs effective for image data, from convolution operations to modern architectures like ResNet.

What makes this essential:

  • CNNs laid the foundation for modern visual understanding
  • Transfer learning enables working with small datasets (critical for medical imaging)
  • Architectural innovations (skip connections, batch norm) apply broadly
  • Vision encoders in multimodal models (CLIP, VLMs) are CNNs or vision transformers

Connection to Healthcare AI

CNNs are fundamental for healthcare applications:

  • Medical Imaging: X-rays, CT scans, MRI, pathology slides
  • Transfer Learning: Fine-tuning ImageNet pre-trained models can improve accuracy by 30-40% over training from scratch on small medical datasets (a minimal sketch follows this list)
  • Interpretability: Techniques like Grad-CAM help clinicians trust predictions
  • Multimodal Fusion: CNNs encode visual information to combine with EHR data
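
As a concrete illustration of the freeze strategy, here is a minimal sketch using torchvision's ImageNet-pretrained ResNet-18. The backbone choice and NUM_CLASSES are placeholder assumptions for illustration, not part of the module materials:

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 with ImageNet weights (torchvision >= 0.13 weights API).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze strategy: stop gradients for every backbone parameter.
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-class ImageNet head with a task-specific one.
# NUM_CLASSES is a placeholder (e.g. 2 for a binary chest X-ray task).
NUM_CLASSES = 2
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
# The new layer is created with requires_grad=True, so only it trains.
```

The fine-tuning alternative unfreezes part or all of the backbone at a lower learning rate; with very small datasets, freezing more layers is usually the safer default.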

Prerequisites

Before starting this module:

  • Module 1: Strong understanding of neural networks, backpropagation, and optimization
  • Linear Algebra: Matrix operations, convolutions
  • PyTorch Basics: Ready to transition from NumPy to modern frameworks

Module Path

Follow the Computer Vision with CNNs Learning Path for the complete curriculum.

Key concepts covered:

  1. Convolution Operations - Spatial feature extraction
  2. Pooling Layers - Downsampling strategies (both sketched in code after this list)
  3. AlexNet - The deep learning breakthrough (2012)
  4. VGG - Depth and simplicity (2014)
  5. ResNet - Skip connections revolution (2015)
  6. Batch Normalization - Training stability
  7. Transfer Learning - Leveraging pre-trained models
  8. CNN Applications - Real-world deployment
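
A minimal PyTorch sketch of the first two concepts; the input size and channel counts below are arbitrary:

```python
import torch
import torch.nn as nn

# A toy 3-channel 32x32 "image" batch.
x = torch.randn(1, 3, 32, 32)

# 3x3 convolution: slides 16 learned filters over the image;
# padding=1 keeps the spatial size at 32x32.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# 2x2 max pooling: keeps the strongest activation in each 2x2 window,
# halving the spatial resolution.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

features = pool(torch.relu(conv(x)))
print(features.shape)  # torch.Size([1, 16, 16, 16])
```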

Critical Checkpoints

Must complete before proceeding to Module 3:

  • ✅ Understand convolution as spatial filtering
  • ✅ Can calculate receptive field sizes (see the calculator sketched after this list)
  • ✅ Understand why pooling reduces spatial dimensions
  • ✅ Know the evolution: AlexNet → VGG → ResNet
  • ✅ Deeply understand skip connections and why they enable deep networks
  • ✅ Can implement batch normalization (a minimal forward pass follows this list)
  • ✅ Understand transfer learning strategies (freeze vs. fine-tune)
  • ✅ Completed a CNN project (image classification or detection)
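
Two of these checkpoints lend themselves to short sketches. Receptive field sizes follow a standard recurrence: the field grows by (k - 1) × j per layer, where j is the cumulative stride. The layer stack below is a made-up example:

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel_size, stride) layers.

    Standard recurrence: r_out = r_in + (k - 1) * j_in and j_out = j_in * s,
    starting from r = 1, j = 1 at the input pixel grid.
    """
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r

# Hypothetical stack: conv 3x3 (stride 1) -> max-pool 2x2 (stride 2) -> conv 3x3
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # -> 8
```

And a minimal training-mode batch normalization forward pass, with running statistics and parameter registration omitted for clarity:

```python
import torch

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """Normalize an NCHW tensor per channel, then scale and shift."""
    # Statistics are taken over the batch and both spatial dimensions.
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # gamma and beta are the learned per-channel parameters.
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

x = torch.randn(4, 16, 8, 8)
y = batch_norm_2d(x, gamma=torch.ones(16), beta=torch.zeros(16))
```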

Time Breakdown

Total: 12-18 hours over 1-2 weeks

  • Videos: 3-4 hours (CS231n lectures on CNNs)
  • Reading: 3-4 hours (papers: AlexNet, VGG, ResNet)
  • Implementation: 6-10 hours (CNN from scratch, transfer learning project)
  • Exercises: 2-3 hours

Key Architecture Evolution

Understanding the evolution of CNN architectures is crucial:

  1. AlexNet (2012): Proved deep learning works at scale, won ImageNet by huge margin
  2. VGG (2014): Showed that depth matters, using simple 3×3 convolutions throughout (see the parameter comparison below)
  3. ResNet (2015): Skip connections solved the degradation problem, enabling networks with 100+ layers

Each architecture contributed key insights that remain relevant today.
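
To make the VGG insight concrete: two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution, but with fewer parameters and an extra non-linearity between them. A quick count for C input and output channels, ignoring biases:

```python
C = 64  # hypothetical channel count, same in and out

params_one_5x5 = 5 * 5 * C * C        # single 5x5 convolution
params_two_3x3 = 2 * (3 * 3 * C * C)  # two stacked 3x3 convolutions

print(params_one_5x5, params_two_3x3)  # 102400 vs. 73728, same 5x5 receptive field
```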

Key Takeaway

Skip connections changed everything.

ResNet’s skip connections solved the degradation problem and enabled truly deep networks. This innovation applies far beyond CNNs—skip connections appear in transformers, diffusion models, and virtually all modern architectures. Understanding why they work is essential.
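
Here is a minimal sketch of a basic residual block (identity shortcut, no downsampling), assuming PyTorch; the crucial line is the addition `out + x`:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Identity-shortcut residual block: y = relu(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The skip connection: if the conv path learns nothing useful,
        # the block can still pass x through, so extra depth cannot hurt.
        return torch.relu(out + x)

block = BasicBlock(64)
y = block(torch.randn(1, 64, 8, 8))  # output shape: (1, 64, 8, 8)
```

Because the shortcut adds x back unchanged, gradients reach earlier layers through an identity path, which is the usual explanation for why very deep stacks of such blocks remain trainable.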

Next Steps

After completing this module:

  1. Module 3: Attention and Transformers
  2. Module 4: Language Models with NanoGPT
  3. Advanced: Vision-Language Models
  4. Healthcare: Medical Imaging with CNNs

Ready to start? Begin with the Computer Vision with CNNs Learning Path.