Multimodal Vision-Language Models (VLMs)
Learn how to combine vision and language representations into shared embedding spaces, understand contrastive learning with CLIP, and explore state-of-the-art VLM architectures. These are the techniques powering modern multimodal AI - from zero-shot image classification to medical image captioning.
Why This Module Is Essential
Real-world data is inherently multimodal. In healthcare, you have medical images, clinical notes, EHR data, patient-reported symptoms, and body location sketches. Learning from multiple modalities provides richer understanding than any single modality alone.
This module is the direct precursor to multimodal healthcare AI:
- Combining medical images with radiology reports
- Fusing EHR data with symptom text and body sketches
- Zero-shot diagnosis for rare conditions
- Cross-modal clinical reasoning
Learning Objectives
After completing this module, you will:
- Multimodal Fusion Understanding: Master different strategies for combining vision and language (early fusion, late fusion, cross-modal attention)
- Contrastive Learning Mastery: Deeply understand InfoNCE loss and how it creates aligned embedding spaces
- CLIP Architecture: Know how CLIP trains on 400M pairs and enables zero-shot transfer
- Vision Transformers: Understand how transformers apply to images through patch-based representations
- Healthcare Application: Apply VLM techniques to medical imaging and clinical text
Prerequisites
Before starting this module, ensure you have:
- Strong CNN understanding (vision encoders)
- Deep transformer knowledge (text encoders)
- Language model understanding
- Linear Algebra: Vector spaces, cosine similarity, projections
Week 1: Foundations and Contrastive Learning
Day 1-3: Multimodal Foundations
Core Concept:
Multimodal Foundations
What You’ll Learn:
Fusion Strategies:
- Early Fusion (Feature-Level)
  - Concatenate raw features from different modalities
  - Single model processes combined features
  - Pros: Maximum interaction between modalities
  - Cons: Expensive, needs aligned data
- Late Fusion (Decision-Level)
  - Separate models for each modality
  - Combine predictions at output
  - Pros: Modality-specific processing, flexible
  - Cons: Limited cross-modal interaction
- Intermediate Fusion (Hybrid)
  - Process modalities separately initially
  - Fuse at intermediate layers
  - Cross-attention between modality representations
  - Pros: Balanced interaction and efficiency
  - Most common in modern VLMs
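To make the contrast between early and late fusion concrete, here is a minimal PyTorch sketch operating on pre-extracted image and text feature vectors. The module names, dimensions, and class count are illustrative assumptions, not part of any specific VLM:

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modality features, then process them with a single model."""
    def __init__(self, img_dim=512, txt_dim=512, hidden=256, num_classes=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, img_feat, txt_feat):
        return self.classifier(torch.cat([img_feat, txt_feat], dim=-1))

class LateFusion(nn.Module):
    """Separate head per modality; combine at the decision level by averaging logits."""
    def __init__(self, img_dim=512, txt_dim=512, num_classes=2):
        super().__init__()
        self.img_head = nn.Linear(img_dim, num_classes)
        self.txt_head = nn.Linear(txt_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        return 0.5 * (self.img_head(img_feat) + self.txt_head(txt_feat))

img_feat, txt_feat = torch.randn(4, 512), torch.randn(4, 512)
print(EarlyFusion()(img_feat, txt_feat).shape)  # torch.Size([4, 2])
print(LateFusion()(img_feat, txt_feat).shape)   # torch.Size([4, 2])
```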
Modality Alignment:
- Mapping different modalities to shared embedding space
- Cosine similarity for measuring alignment
- Contrastive objectives for learning alignment
Cross-Modal Attention:
- Query from one modality, Key+Value from another
- Enables deep interaction between modalities
- Used in encoder-decoder VLMs (BLIP, Flamingo)
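A minimal sketch of a cross-modal attention layer, assuming pre-computed text tokens and image patch tokens of the same embedding dimension (the class name and shapes are ours, chosen for illustration):

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Queries come from one modality; keys and values come from the other."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        # Text attends over image patches: Q = text, K = V = image
        attended, _ = self.attn(query=text_tokens, key=image_tokens, value=image_tokens)
        return self.norm(text_tokens + attended)  # residual connection + layer norm

text_tokens = torch.randn(2, 20, 512)    # [batch, text_len, dim]
image_tokens = torch.randn(2, 196, 512)  # [batch, num_patches, dim]
fused = CrossModalAttention()(text_tokens, image_tokens)
print(fused.shape)  # torch.Size([2, 20, 512])
```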
Challenges:
- Modality imbalance (one dominates)
- Alignment noise (imperfect pairs)
- Computational cost of joint processing
Learning Resources:
- Papers: “Multimodal Learning with Transformers: A Survey”
- Reading: Multimodal fusion taxonomy
- Code: Implement simple early and late fusion
Exercises:
- Compare fusion strategies on toy multimodal dataset
- Visualize learned embedding spaces
- Implement cross-modal attention layer
Checkpoint: Can you explain the trade-offs between early, late, and intermediate fusion?
Day 4-7: Contrastive Learning
Core Concept:
Contrastive Learning
What You’ll Learn:
Contrastive Learning Intuition:
- Pull positive pairs together in embedding space
- Push negative pairs apart
- Self-supervised learning from natural pairing (images + captions)
InfoNCE Loss (The Core Loss Function):

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\!\left(\text{sim}(q, k^{+}) / \tau\right)}{\sum_{i=0}^{N} \exp\!\left(\text{sim}(q, k_{i}) / \tau\right)}$$

Where:
- $q$ is the anchor (e.g., image embedding)
- $k^{+}$ is the positive (e.g., matching text embedding)
- $k_{i}$ are negatives (all other samples in batch)
- $\tau$ is temperature (controls distribution sharpness)
- $\text{sim}$ is cosine similarity
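As a minimal sketch (assuming paired image and text embeddings of the same dimension, with every other text in the batch acting as a negative), the loss can be written in a few lines of PyTorch; cross-entropy over the similarity matrix is exactly the softmax ratio in the formula above:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """InfoNCE over a batch: the matching (image, text) pair is the positive."""
    image_emb = F.normalize(image_emb, dim=-1)  # unit vectors, so dot product = cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature                  # [batch, batch] similarity matrix
    targets = torch.arange(len(logits), device=logits.device)     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce_loss(torch.randn(8, 512), torch.randn(8, 512))
```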
Why InfoNCE Works:
- Implicitly performs hard negative mining
- Scales with batch size (more negatives = better)
- Symmetric loss (both directions)
- No explicit negative sampling needed
Temperature Parameter ($\tau$):
- Low temperature (0.01-0.1): Sharp distribution, confident predictions
- High temperature (1.0+): Smooth distribution, uncertain predictions
- Typical: 0.07 (CLIP uses learnable temperature)
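A tiny numerical illustration (the similarity values are made up) shows how temperature controls the sharpness of the softmax over similarities:

```python
import torch

sims = torch.tensor([0.30, 0.25, 0.10])  # cosine similarities to three candidates
for tau in (0.07, 1.0):
    probs = torch.softmax(sims / tau, dim=0)
    print(tau, probs)
# tau=0.07 → sharp distribution, roughly [0.65, 0.32, 0.04]
# tau=1.0  → nearly uniform,    roughly [0.36, 0.34, 0.30]
```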
Batch Size Importance:
- Larger batch = more negative pairs
- CLIP trains with a batch size of 32,768!
- Gradient accumulation or distributed training required
Variants:
- SimCLR: Image-image contrastive learning
- MoCo: Momentum encoders for consistency
- CLIP: Image-text contrastive learning at scale
False Negatives Problem:
- Some “negatives” might actually be semantically similar
- E.g., “dog” and “puppy” treated as negatives
- Solutions: Hard negative mining, multi-positive contrastive learning
Learning Resources:
- Papers:
- “A Simple Framework for Contrastive Learning of Visual Representations” (SimCLR)
- “Momentum Contrast for Unsupervised Visual Representation Learning” (MoCo)
- “Representation Learning with Contrastive Predictive Coding” (InfoNCE origin)
- Videos: Contrastive learning explained
- Code: Implement InfoNCE loss from scratch
Exercises:
- Implement contrastive loss in PyTorch
- Experiment with different temperatures
- Visualize embedding space with t-SNE
- Compare different negative sampling strategies
- Train simple contrastive model on image pairs
Checkpoint: Can you implement InfoNCE loss and explain why batch size matters?
Week 2: Vision Transformers and CLIP
Day 8-10: Vision Transformers (ViT)
Architecture Paper:
Vision Transformer (ViT) - “An Image is Worth 16x16 Words”
What You’ll Learn:
Patch-Based Image Tokenization:
```
Image (224×224×3)
  ↓ Split into 16×16-pixel patches (14×14 = 196 patches)
  ↓ Flatten each patch (16×16×3 = 768-dim vector)
  ↓ Linear projection to embedding dim
  ↓ Add [CLS] token + positional embeddings
  ↓ 197 tokens (196 patches + 1 [CLS]) → Transformer
```

ViT Architecture:
- Patch Embedding: Split image into fixed-size patches
- Position Embeddings: Learned or sinusoidal (like in NLP transformers)
- [CLS] Token: Classification token (like BERT)
- Transformer Encoder: Standard transformer blocks (no modification!)
- Classification Head: MLP on [CLS] output
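A minimal sketch of the patch embedding stage (hyperparameters follow the ViT-Base defaults; the class name is ours). A strided convolution with kernel size equal to the patch size is the standard trick for splitting and projecting patches in one step:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly project each one."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                                   # x: [B, 3, 224, 224]
        x = self.proj(x).flatten(2).transpose(1, 2)         # [B, 196, 768]
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # prepend [CLS] token
        return torch.cat([cls, x], dim=1) + self.pos_embed  # [B, 197, 768]

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```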
Key Insights:
- Transformers work for vision with sufficient data (JFT-300M pre-training)
- Attention provides interpretability (visualize attention maps)
- ViT < ResNet on small datasets, ViT > ResNet on large datasets
- Inductive biases vs data scale trade-off
Pre-Training Strategies:
- Supervised pre-training on ImageNet or JFT-300M
- Self-supervised (MAE - Masked Autoencoder)
- Contrastive (CLIP-style)
ViT vs CNNs:
| Aspect | CNN | ViT |
|---|---|---|
| Inductive bias | Strong (locality, translation equivariance) | Weak |
| Data efficiency | Better on small data | Needs large-scale pre-training |
| Interpretability | Limited (activations) | High (attention maps) |
| Scalability | Diminishing returns | Scales well with data/compute |
| Flexibility | Specialized for images | General architecture |
Learning Resources:
- Papers: “An Image is Worth 16x16 Words” (ViT paper)
- Code: timm library (ViT implementations), HuggingFace transformers
- Videos: ViT explained
Exercises:
- Implement ViT patch embedding
- Visualize attention maps
- Compare ViT vs ResNet on ImageNet
- Fine-tune pre-trained ViT on custom dataset
Checkpoint: Can you implement ViT patch embedding and explain when to use ViT vs CNNs?
Day 11-14: CLIP Architecture and Training
Architecture Paper:
CLIP - “Learning Transferable Visual Models from Natural Language Supervision”
What You’ll Learn:
CLIP Core Idea:
- Train vision and text encoders jointly
- Optimize contrastive objective: match correct image-text pairs
- Learn from natural language supervision (400M pairs from internet)
- Enable zero-shot transfer to downstream tasks
CLIP Architecture:
```
Image Encoder (Vision Transformer or ResNet)
        ↓
Image Embedding (512-dim)
        ↓
  Normalize → Cosine Similarity Matrix ← Normalize
                                             ↑
                              Text Embedding (512-dim)
                                             ↑
                              Text Encoder (Transformer)
```

Dual-Encoder Design:
- Image Encoder: ViT-L/14 or ResNet-50 (various sizes)
- Text Encoder: Transformer (12 layers, 512-dim)
- Projection Heads: Linear layers to shared 512-dim space
- Contrastive Loss: Symmetric InfoNCE over batch
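Building on the InfoNCE sketch from Week 1, the symmetric loss can be sketched as below. In the released model the temperature is a learnable parameter stored on a log scale; that detail is omitted here for clarity:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: average the image→text and text→image directions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature             # [batch, batch]
    targets = torch.arange(len(logits), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)               # rows: each image vs. all texts
    loss_t2i = F.cross_entropy(logits.T, targets)             # columns: each text vs. all images
    return (loss_i2t + loss_t2i) / 2
```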
Training Details:
- Scale: 400M image-text pairs from internet
- Batch Size: 32,768 (massive!)
- Epochs: 32 epochs
- Optimization: AdamW with cosine LR schedule
- Augmentation: Random crop, no color jittering (text describes colors)
- Temperature: Learnable, initialized to 0.07
Zero-Shot Transfer:
```python
# At inference (pseudocode): both embeddings are L2-normalized before the dot product
image_features = image_encoder(image)                                  # [512]
text_features = text_encoder([f"a photo of a {c}" for c in classes])   # [num_classes, 512]
similarities = image_features @ text_features.T                        # [num_classes]
probs = softmax(similarities / temperature)
```

Prompt Engineering for Vision:
- “a photo of a {class}” works well
- Ensemble of prompts improves performance:
- “a photo of a {class}”
- “a cropped photo of a {class}”
- “a photo of the {class}”
- Domain-specific prompts for medical imaging
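A sketch of prompt ensembling, assuming the official openai/CLIP package; the templates and class names here are placeholders, and the same pattern applies to domain-specific medical prompts:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

templates = ["a photo of a {}.", "a cropped photo of a {}.", "a photo of the {}."]
classes = ["dog", "cat", "bird"]  # illustrative classes

# Ensemble: encode every template per class, then average the normalized embeddings
with torch.no_grad():
    class_embs = []
    for c in classes:
        tokens = clip.tokenize([t.format(c) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embs.append(emb.mean(dim=0))
    text_features = torch.stack(class_embs)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
```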
CLIP’s Key Innovations:
- Natural Language Supervision: 400M web pairs vs. ImageNet’s ~1.3M labeled images
- Zero-Shot Transfer: No task-specific training needed
- Prompt Engineering: Natural language as flexible interface
- Scale: Contrastive learning at unprecedented scale
Performance:
- Matches supervised ResNet-50 on ImageNet in zero-shot
- Better distribution shift robustness
- Enables powerful image-text applications
Learning Resources:
- Papers:
- “Learning Transferable Visual Models from Natural Language Supervision” (CLIP paper)
- OpenAI CLIP blog post
- Code:
- Official CLIP repository (PyTorch)
- OpenCLIP (community reimplementation)
- Videos: CLIP explained, prompt engineering tutorials
Exercises:
- Implement CLIP contrastive loss (symmetric InfoNCE)
- Use pre-trained CLIP for zero-shot classification
- Experiment with prompt engineering
- Fine-tune CLIP on domain-specific data
- Visualize learned embedding space
Checkpoint:
- Can you explain how CLIP enables zero-shot transfer?
- Can you implement CLIP’s contrastive loss?
- Do you understand why batch size is critical?
Healthcare Applications
Medical Image Captioning
Chest X-ray → CLIP Image Encoder → Medical Report Generation

Approach:
- Fine-tune CLIP image encoder on medical images
- Pair with medical report decoder (GPT-style)
- Train on (image, report) pairs from MIMIC-CXR
Zero-Shot Medical Image Classification
```python
# Define medical classes via prompts
prompts = [
    "a chest x-ray showing pneumonia",
    "a chest x-ray showing normal lungs",
    "a chest x-ray showing cardiomegaly",
]

# Zero-shot prediction (assumes the x-ray is preprocessed and prompts are tokenized)
image_features = clip_model.encode_image(xray)          # [1, 512]
text_features = clip_model.encode_text(prompt_tokens)   # [3, 512]

# Normalize so the dot product is a cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (image_features @ text_features.T).softmax(dim=-1)
```

Benefits:
- No labeled data needed for new conditions
- Rapidly adapt to new classifications
- Interpret via natural language
Multimodal Clinical AI
EmergAI Thesis Application:
- Symptom Text Encoder: CLIP text encoder fine-tuned on clinical text
- Body Sketch Encoder: CLIP image encoder adapted for body location drawings
- EHR Encoder: Transformer for structured data
- Fusion: Cross-attention between modalities
- Prediction: Emergency department outcomes
Implementation Strategy:
- Pre-train on medical data (image-report pairs)
- Fine-tune encoders on EmergAI data
- Add cross-attention fusion layers
- Train end-to-end for outcome prediction
Cross-Modal Retrieval
Find images matching clinical description:
```python
query = "patient with shortness of breath and fever"
query_embedding = text_encoder(query)

# Retrieve similar images from a database of pre-computed image embeddings
image_embeddings = database.get_all_image_embeddings()
similarities = query_embedding @ image_embeddings.T
top_images = database.get_images(similarities.topk(k=10).indices)
```

Module Completion Criteria
You have completed this module when you can:
- ✅ Explain multimodal fusion strategies and trade-offs
- ✅ Implement contrastive learning (InfoNCE loss)
- ✅ Understand and use Vision Transformers
- ✅ Deeply understand CLIP architecture and training
- ✅ Perform zero-shot image classification with CLIP
- ✅ Engineer prompts for vision tasks
- ✅ Apply VLM techniques to healthcare problems
Key Resources
Essential Papers (Must Read)
- “Learning Transferable Visual Models from Natural Language Supervision” (Radford et al., 2021) - CLIP
- “An Image is Worth 16x16 Words” (Dosovitskiy et al., 2020) - ViT
- “A Simple Framework for Contrastive Learning” (Chen et al., 2020) - SimCLR
Code Resources
- OpenAI CLIP repository (PyTorch)
- OpenCLIP (community implementation)
- HuggingFace transformers (ViT, CLIP models)
- timm library (ViT implementations)
Healthcare-Specific Resources
- MIMIC-CXR dataset (chest X-rays + reports)
- BioViL (medical vision-language model)
- MedCLIP (medical adaptation of CLIP)
Common Pitfalls
1. Insufficient Batch Size
Problem: Contrastive learning needs many negatives.
Solution: Use gradient accumulation or distributed training.
2. Temperature Too High/Low
Problem: Loss doesn’t converge or collapses.
Solution: Start with 0.07 and make the temperature learnable.
3. Modality Imbalance
Problem: One modality dominates while the other is ignored.
Solution: Balance gradients, use modality-specific LR, or apply modality dropout.
4. Poor Prompt Engineering
Problem: Zero-shot performance is much worse than expected.
Solution: Try an ensemble of prompts and domain-specific templates.
5. Overfitting to Small Medical Datasets
Problem: Fine-tuning CLIP on a small medical dataset destroys pre-trained knowledge.
Solution: Use a very small LR (e.g., 1e-5), freeze most layers, or add lightweight adapter layers.
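One way to act on this last pitfall, sketched with the HuggingFace CLIPModel; the choice of unfreezing only the projection heads is an illustrative assumption, and adapter methods (e.g., LoRA) are another option:

```python
import torch
from transformers import CLIPModel

# Conservative recipe: freeze both encoders, train only the projection heads at a tiny LR
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

for param in model.parameters():
    param.requires_grad = False
for param in model.visual_projection.parameters():
    param.requires_grad = True
for param in model.text_projection.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)
```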
Success Tips
- Start with Pre-Trained Models
  - Don’t train CLIP from scratch (it required 400M pairs!)
  - Fine-tune on domain data
- Experiment with Prompts
  - Try multiple prompt templates
  - Ensemble prompts for better performance
  - Use domain-specific prompts (medical terminology)
- Visualize Embeddings
  - Use t-SNE or UMAP
  - Verify modality alignment
  - Identify failure modes
- Monitor Both Modalities
  - Check gradients for both encoders
  - Ensure neither modality is ignored
  - Balance learning rates if needed
Next Steps
After completing this module:
- Immediate: Diffusion Models (text-conditioned generation uses CLIP!)
- Healthcare: Apply to EHR Analysis with multimodal fusion
- Advanced: Explore recent VLMs (BLIP-2, Flamingo, GPT-4V)
Time Investment
Total estimated time: 12-18 hours over 2 weeks
- Multimodal foundations: 3-4 hours
- Contrastive learning: 4-5 hours
- Vision Transformers: 2-3 hours
- CLIP architecture and training: 3-4 hours
- Healthcare applications: 2-3 hours
Key Takeaway
“Natural language is a flexible interface to visual models.”
CLIP revolutionized computer vision by showing that language supervision scales better than fixed class labels. This enables zero-shot transfer, flexible task specification via prompts, and powerful multimodal reasoning. For healthcare AI, this means we can leverage medical text (reports, notes) to improve image understanding without expensive pixel-level annotations.
Understand contrastive learning deeply. Master CLIP. Apply to multimodal healthcare problems.
Ready to begin? Start with Multimodal Foundations.