Multimodal Vision-Language Models (VLMs)
Learn how to combine vision and language representations into shared embedding spaces, understand contrastive learning with CLIP, and explore state-of-the-art VLM architectures. These are the techniques powering modern multimodal AI - from zero-shot image classification to medical image captioning.
Why This Module Is Essential
Real-world data is inherently multimodal. In healthcare, you have medical images, clinical notes, EHR data, patient-reported symptoms, and body location sketches. Learning from multiple modalities provides richer understanding than any single modality alone.
This module is the direct precursor to multimodal healthcare AI:
- Combining medical images with radiology reports
- Fusing EHR data with symptom text and body sketches
- Zero-shot diagnosis for rare conditions
- Cross-modal clinical reasoning
Learning Objectives
After completing this module, you will:
- Multimodal Fusion Understanding: Master different strategies for combining vision and language (early fusion, late fusion, cross-modal attention)
- Contrastive Learning Mastery: Deeply understand InfoNCE loss and how it creates aligned embedding spaces
- CLIP Architecture: Know how CLIP trains on 400M pairs and enables zero-shot transfer
- Vision Transformers: Understand how transformers apply to images through patch-based representations
- Healthcare Application: Apply VLM techniques to medical imaging and clinical text
Prerequisites
Before starting this module, ensure you have:
- Strong CNN understanding (vision encoders)
- Deep transformer knowledge (text encoders)
- Language model understanding
- Linear Algebra: Vector spaces, cosine similarity, projections
Week 1: Foundations and Contrastive Learning
Day 1-3: Multimodal Foundations
Core Concept:
Multimodal Foundations
What You’ll Learn:
Fusion Strategies:
- Early Fusion (Feature-Level)
  - Concatenate raw features from different modalities
  - Single model processes combined features
  - Pros: Maximum interaction between modalities
  - Cons: Expensive, needs aligned data
- Late Fusion (Decision-Level)
  - Separate models for each modality
  - Combine predictions at output
  - Pros: Modality-specific processing, flexible
  - Cons: Limited cross-modal interaction
- Intermediate Fusion (Hybrid)
  - Process modalities separately initially
  - Fuse at intermediate layers
  - Cross-attention between modality representations
  - Pros: Balanced interaction and efficiency
  - Most common in modern VLMs
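To make the contrast between early and late fusion concrete, here is a minimal PyTorch sketch operating on pre-extracted image and text feature vectors. The module names, dimensions, and class count are illustrative assumptions, not part of any specific VLM:

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modality features, then process them with a single model."""
    def __init__(self, img_dim=512, txt_dim=512, hidden=256, num_classes=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, img_feat, txt_feat):
        return self.classifier(torch.cat([img_feat, txt_feat], dim=-1))

class LateFusion(nn.Module):
    """Separate head per modality; combine at the decision level by averaging logits."""
    def __init__(self, img_dim=512, txt_dim=512, num_classes=2):
        super().__init__()
        self.img_head = nn.Linear(img_dim, num_classes)
        self.txt_head = nn.Linear(txt_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        return 0.5 * (self.img_head(img_feat) + self.txt_head(txt_feat))

img_feat, txt_feat = torch.randn(4, 512), torch.randn(4, 512)
print(EarlyFusion()(img_feat, txt_feat).shape)  # torch.Size([4, 2])
print(LateFusion()(img_feat, txt_feat).shape)   # torch.Size([4, 2])
```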
Modality Alignment:
- Mapping different modalities to shared embedding space
- Cosine similarity for measuring alignment
- Contrastive objectives for learning alignment
Cross-Modal Attention:
- Query from one modality, Key+Value from another
- Enables deep interaction between modalities
- Used in encoder-decoder VLMs (BLIP, Flamingo)
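A minimal sketch of a cross-modal attention layer, assuming pre-computed text tokens and image patch tokens of the same embedding dimension (the class name and shapes are ours, chosen for illustration):

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Queries come from one modality; keys and values come from the other."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        # Text attends over image patches: Q = text, K = V = image
        attended, _ = self.attn(query=text_tokens, key=image_tokens, value=image_tokens)
        return self.norm(text_tokens + attended)  # residual connection + layer norm

text_tokens = torch.randn(2, 20, 512)    # [batch, text_len, dim]
image_tokens = torch.randn(2, 196, 512)  # [batch, num_patches, dim]
fused = CrossModalAttention()(text_tokens, image_tokens)
print(fused.shape)  # torch.Size([2, 20, 512])
```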
Challenges:
- Modality imbalance (one dominates)
- Alignment noise (imperfect pairs)
- Computational cost of joint processing
Learning Resources:
- Papers: “Multimodal Learning with Transformers: A Survey”
- Reading: Multimodal fusion taxonomy
- Code: Implement simple early and late fusion
Exercises:
- Compare fusion strategies on toy multimodal dataset
- Visualize learned embedding spaces
- Implement cross-modal attention layer
Checkpoint: Can you explain the trade-offs between early, late, and intermediate fusion?
Day 4-7: Contrastive Learning
Core Concept:
Contrastive Learning
What You’ll Learn:
Contrastive Learning Intuition:
- Pull positive pairs together in embedding space
- Push negative pairs apart
- Self-supervised learning from natural pairing (images + captions)
InfoNCE Loss (The Core Loss Function):

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\!\left(\text{sim}(q, k^{+}) / \tau\right)}{\sum_{i=0}^{N} \exp\!\left(\text{sim}(q, k_{i}) / \tau\right)}$$

Where:
- $q$ is the anchor (e.g., image embedding)
- $k^{+}$ is the positive (e.g., matching text embedding)
- $k_{i}$ are negatives (all other samples in batch)
- $\tau$ is temperature (controls distribution sharpness)
- $\text{sim}$ is cosine similarity
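As a minimal sketch (assuming paired image and text embeddings of the same dimension, with every other text in the batch acting as a negative), the loss can be written in a few lines of PyTorch; cross-entropy over the similarity matrix is exactly the softmax ratio in the formula above:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """InfoNCE over a batch: the matching (image, text) pair is the positive."""
    image_emb = F.normalize(image_emb, dim=-1)  # unit vectors, so dot product = cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature                  # [batch, batch] similarity matrix
    targets = torch.arange(len(logits), device=logits.device)     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce_loss(torch.randn(8, 512), torch.randn(8, 512))
```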
Why InfoNCE Works:
- Implicitly performs hard negative mining
- Scales with batch size (more negatives = better)
- Symmetric loss (both directions)
- No explicit negative sampling needed
Temperature Parameter ($\tau$):
- Low temperature (0.01-0.1): Sharp distribution, confident predictions
- High temperature (1.0+): Smooth distribution, uncertain predictions
- Typical: 0.07 (CLIP uses learnable temperature)
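A tiny numerical illustration (the similarity values are made up) shows how temperature controls the sharpness of the softmax over similarities:

```python
import torch

sims = torch.tensor([0.30, 0.25, 0.10])  # cosine similarities to three candidates
for tau in (0.07, 1.0):
    probs = torch.softmax(sims / tau, dim=0)
    print(tau, probs)
# tau=0.07 → sharp distribution, roughly [0.65, 0.32, 0.04]
# tau=1.0  → nearly uniform,    roughly [0.36, 0.34, 0.30]
```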
Batch Size Importance:
- Larger batch = more negative pairs
- CLIP trains with a batch size of 32,768!
- Gradient accumulation or distributed training required
Variants:
- SimCLR: Image-image contrastive learning
- MoCo: Momentum encoders for consistency
- CLIP: Image-text contrastive learning at scale
False Negatives Problem:
- Some “negatives” might actually be semantically similar
- E.g., “dog” and “puppy” treated as negatives
- Solutions: Hard negative mining, multi-positive contrastive learning
Learning Resources:
- Papers:
- “A Simple Framework for Contrastive Learning of Visual Representations” (SimCLR)
- “Momentum Contrast for Unsupervised Visual Representation Learning” (MoCo)
- “Representation Learning with Contrastive Predictive Coding” (InfoNCE origin)
- Videos: Contrastive learning explained
- Code: Implement InfoNCE loss from scratch
Exercises:
- Implement contrastive loss in PyTorch
- Experiment with different temperatures
- Visualize embedding space with t-SNE
- Compare different negative sampling strategies
- Train simple contrastive model on image pairs
Checkpoint: Can you implement InfoNCE loss and explain why batch size matters?
Week 2: Vision Transformers and CLIP
Day 8-10: Vision Transformers (ViT)
Architecture Paper:
Vision Transformer (ViT) - “An Image is Worth 16x16 Words”
What You’ll Learn:
Patch-Based Image Tokenization:
```
Image (224×224×3)
  ↓ Split into 16×16-pixel patches (14×14 = 196 patches)
  ↓ Flatten each patch (16×16×3 = 768-dim vector)
  ↓ Linear projection to embedding dim
  ↓ Add [CLS] token + positional embeddings
  ↓ 197 tokens (196 patches + 1 [CLS]) → Transformer
```

ViT Architecture:
- Patch Embedding: Split image into fixed-size patches
- Position Embeddings: Learned or sinusoidal (like in NLP transformers)
- [CLS] Token: Classification token (like BERT)
- Transformer Encoder: Standard transformer blocks (no modification!)
- Classification Head: MLP on [CLS] output
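A minimal sketch of the patch embedding stage (hyperparameters follow the ViT-Base defaults; the class name is ours). A strided convolution with kernel size equal to the patch size is the standard trick for splitting and projecting patches in one step:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly project each one."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                                   # x: [B, 3, 224, 224]
        x = self.proj(x).flatten(2).transpose(1, 2)         # [B, 196, 768]
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # prepend [CLS] token
        return torch.cat([cls, x], dim=1) + self.pos_embed  # [B, 197, 768]

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```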
Key Insights:
- Transformers work for vision with sufficient data (JFT-300M pre-training)
- Attention provides interpretability (visualize attention maps)
- ViT < ResNet on small datasets, ViT > ResNet on large datasets
- Inductive biases vs data scale trade-off
Pre-Training Strategies:
- Supervised pre-training on ImageNet or JFT-300M
- Self-supervised (MAE - Masked Autoencoder)
- Contrastive (CLIP-style)
ViT vs CNNs:
| Aspect | CNN | ViT |
|---|---|---|
| Inductive bias | Strong (locality, translation equivariance) | Weak |
| Data efficiency | Better on small data | Needs large-scale pre-training |
| Interpretability | Limited (activations) | High (attention maps) |
| Scalability | Diminishing returns | Scales well with data/compute |
| Flexibility | Specialized for images | General architecture |
Learning Resources:
- Papers: “An Image is Worth 16x16 Words” (ViT paper)
- Code: timm library (ViT implementations), HuggingFace transformers
- Videos: ViT explained
Exercises:
- Implement ViT patch embedding
- Visualize attention maps
- Compare ViT vs ResNet on ImageNet
- Fine-tune pre-trained ViT on custom dataset
Checkpoint: Can you implement ViT patch embedding and explain when to use ViT vs CNNs?
Day 11-14: CLIP Architecture and Training
Architecture Paper:
CLIP - “Learning Transferable Visual Models from Natural Language Supervision”
What You’ll Learn:
CLIP Core Idea:
- Train vision and text encoders jointly
- Optimize contrastive objective: match correct image-text pairs
- Learn from natural language supervision (400M pairs from internet)
- Enable zero-shot transfer to downstream tasks
CLIP Architecture:
```
Image Encoder (Vision Transformer or ResNet)
        ↓
Image Embedding (512-dim)
        ↓
  Normalize → Cosine Similarity Matrix ← Normalize
                                             ↑
                              Text Embedding (512-dim)
                                             ↑
                              Text Encoder (Transformer)
```

Dual-Encoder Design:
- Image Encoder: ViT-L/14 or ResNet-50 (various sizes)
- Text Encoder: Transformer (12 layers, 512-dim)
- Projection Heads: Linear layers to shared 512-dim space
- Contrastive Loss: Symmetric InfoNCE over batch
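Building on the InfoNCE sketch from Week 1, the symmetric loss can be sketched as below. In the released model the temperature is a learnable parameter stored on a log scale; that detail is omitted here for clarity:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: average the image→text and text→image directions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature             # [batch, batch]
    targets = torch.arange(len(logits), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)               # rows: each image vs. all texts
    loss_t2i = F.cross_entropy(logits.T, targets)             # columns: each text vs. all images
    return (loss_i2t + loss_t2i) / 2
```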
Training Details:
- Scale: 400M image-text pairs from internet
- Batch Size: 32,768 (massive!)
- Epochs: 32 epochs
- Optimization: AdamW with cosine LR schedule
- Augmentation: Random crop, no color jittering (text describes colors)
- Temperature: Learnable, initialized to 0.07
Zero-Shot Transfer:
```python
# At inference (pseudocode): both embeddings are L2-normalized before the dot product
image_features = image_encoder(image)                                  # [512]
text_features = text_encoder([f"a photo of a {c}" for c in classes])   # [num_classes, 512]
similarities = image_features @ text_features.T                        # [num_classes]
probs = softmax(similarities / temperature)
```

Prompt Engineering for Vision:
- “a photo of a {class}” works well
- Ensemble of prompts improves performance:
- “a photo of a {class}”
- “a cropped photo of a {class}”
- “a photo of the {class}”
- Domain-specific prompts for medical imaging
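A sketch of prompt ensembling, assuming the official openai/CLIP package; the templates and class names here are placeholders, and the same pattern applies to domain-specific medical prompts:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

templates = ["a photo of a {}.", "a cropped photo of a {}.", "a photo of the {}."]
classes = ["dog", "cat", "bird"]  # illustrative classes

# Ensemble: encode every template per class, then average the normalized embeddings
with torch.no_grad():
    class_embs = []
    for c in classes:
        tokens = clip.tokenize([t.format(c) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embs.append(emb.mean(dim=0))
    text_features = torch.stack(class_embs)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
```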
CLIP’s Key Innovations:
- Natural Language Supervision: 400M web pairs vs. ImageNet’s ~1.3M labeled images
- Zero-Shot Transfer: No task-specific training needed
- Prompt Engineering: Natural language as flexible interface
- Scale: Contrastive learning at unprecedented scale
Performance:
- Matches supervised ResNet-50 on ImageNet in zero-shot
- Better distribution shift robustness
- Enables powerful image-text applications
Learning Resources:
- Papers:
- “Learning Transferable Visual Models from Natural Language Supervision” (CLIP paper)
- OpenAI CLIP blog post
- Code:
- Official CLIP repository (PyTorch)
- OpenCLIP (community reimplementation)
- Videos: CLIP explained, prompt engineering tutorials
Exercises:
- Implement CLIP contrastive loss (symmetric InfoNCE)
- Use pre-trained CLIP for zero-shot classification
- Experiment with prompt engineering
- Fine-tune CLIP on domain-specific data
- Visualize learned embedding space
Checkpoint:
- Can you explain how CLIP enables zero-shot transfer?
- Can you implement CLIP’s contrastive loss?
- Do you understand why batch size is critical?
Healthcare Applications
Medical Image Captioning
Chest X-ray → CLIP Image Encoder → Medical Report Generation

Approach:
- Fine-tune CLIP image encoder on medical images
- Pair with medical report decoder (GPT-style)
- Train on (image, report) pairs from MIMIC-CXR
Zero-Shot Medical Image Classification
```python
# Define medical classes via prompts
prompts = [
    "a chest x-ray showing pneumonia",
    "a chest x-ray showing normal lungs",
    "a chest x-ray showing cardiomegaly",
]

# Zero-shot prediction (assumes the x-ray is preprocessed and prompts are tokenized)
image_features = clip_model.encode_image(xray)          # [1, 512]
text_features = clip_model.encode_text(prompt_tokens)   # [3, 512]

# Normalize so the dot product is a cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (image_features @ text_features.T).softmax(dim=-1)
```

Benefits:
- No labeled data needed for new conditions
- Rapidly adapt to new classifications
- Interpret via natural language
Multimodal Clinical AI
EmergAI Thesis Application:
- Symptom Text Encoder: CLIP text encoder fine-tuned on clinical text
- Body Sketch Encoder: CLIP image encoder adapted for body location drawings
- EHR Encoder: Transformer for structured data
- Fusion: Cross-attention between modalities
- Prediction: Emergency department outcomes
Implementation Strategy:
- Pre-train on medical data (image-report pairs)
- Fine-tune encoders on EmergAI data
- Add cross-attention fusion layers
- Train end-to-end for outcome prediction
Cross-Modal Retrieval
Find images matching clinical description:
```python
query = "patient with shortness of breath and fever"
query_embedding = text_encoder(query)

# Retrieve similar images from a database of pre-computed image embeddings
image_embeddings = database.get_all_image_embeddings()
similarities = query_embedding @ image_embeddings.T
top_images = database.get_images(similarities.topk(k=10).indices)
```

Module Completion Criteria
You have completed this module when you can:
- ✅ Explain multimodal fusion strategies and trade-offs
- ✅ Implement contrastive learning (InfoNCE loss)
- ✅ Understand and use Vision Transformers
- ✅ Deeply understand CLIP architecture and training
- ✅ Perform zero-shot image classification with CLIP
- ✅ Engineer prompts for vision tasks
- ✅ Apply VLM techniques to healthcare problems
Key Resources
Essential Papers (Must Read)
- “Learning Transferable Visual Models from Natural Language Supervision” (Radford et al., 2021) - CLIP
- “An Image is Worth 16x16 Words” (Dosovitskiy et al., 2020) - ViT
- “A Simple Framework for Contrastive Learning” (Chen et al., 2020) - SimCLR
Code Resources
- OpenAI CLIP repository (PyTorch)
- OpenCLIP (community implementation)
- HuggingFace transformers (ViT, CLIP models)
- timm library (ViT implementations)
Healthcare-Specific Resources
- MIMIC-CXR dataset (chest X-rays + reports)
- BioViL (medical vision-language model)
- MedCLIP (medical adaptation of CLIP)
Common Pitfalls
1. Insufficient Batch Size
Problem: Contrastive learning needs many negatives.
Solution: Use gradient accumulation or distributed training.
2. Temperature Too High/Low
Problem: Loss doesn’t converge or collapses.
Solution: Start with 0.07 and make the temperature learnable.
3. Modality Imbalance
Problem: One modality dominates while the other is ignored.
Solution: Balance gradients, use modality-specific LR, or apply modality dropout.
4. Poor Prompt Engineering
Problem: Zero-shot performance is much worse than expected.
Solution: Try an ensemble of prompts and domain-specific templates.
5. Overfitting to Small Medical Datasets
Problem: Fine-tuning CLIP on a small medical dataset destroys pre-trained knowledge.
Solution: Use a very small LR (e.g., 1e-5), freeze most layers, or add lightweight adapter layers.
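One way to act on this last pitfall, sketched with the HuggingFace CLIPModel; the choice of unfreezing only the projection heads is an illustrative assumption, and adapter methods (e.g., LoRA) are another option:

```python
import torch
from transformers import CLIPModel

# Conservative recipe: freeze both encoders, train only the projection heads at a tiny LR
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

for param in model.parameters():
    param.requires_grad = False
for param in model.visual_projection.parameters():
    param.requires_grad = True
for param in model.text_projection.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)
```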
Success Tips
- Start with Pre-Trained Models
  - Don’t train CLIP from scratch (it required 400M pairs!)
  - Fine-tune on domain data
- Experiment with Prompts
  - Try multiple prompt templates
  - Ensemble prompts for better performance
  - Use domain-specific prompts (medical terminology)
- Visualize Embeddings
  - Use t-SNE or UMAP
  - Verify modality alignment
  - Identify failure modes
- Monitor Both Modalities
  - Check gradients for both encoders
  - Ensure neither modality is ignored
  - Balance learning rates if needed
Next Steps
After completing this module:
- Immediate: Diffusion Models (text-conditioned generation uses CLIP!)
- Healthcare: Apply to EHR Analysis with multimodal fusion
- Advanced: Explore recent VLMs (BLIP-2, Flamingo, GPT-4V)
Time Investment
Total estimated time: 12-18 hours over 2 weeks
- Multimodal foundations: 3-4 hours
- Contrastive learning: 4-5 hours
- Vision Transformers: 2-3 hours
- CLIP architecture and training: 3-4 hours
- Healthcare applications: 2-3 hours
Key Takeaway
“Natural language is a flexible interface to visual models.”
CLIP revolutionized computer vision by showing that language supervision scales better than fixed class labels. This enables zero-shot transfer, flexible task specification via prompts, and powerful multimodal reasoning. For healthcare AI, this means we can leverage medical text (reports, notes) to improve image understanding without expensive pixel-level annotations.
Understand contrastive learning deeply. Master CLIP. Apply to multimodal healthcare problems.
Ready to begin? Start with Multimodal Foundations.