
Multimodal Vision-Language Models (VLMs)

Learn how to combine vision and language representations into shared embedding spaces, understand contrastive learning with CLIP, and explore state-of-the-art VLM architectures. These techniques power modern multimodal AI, from zero-shot image classification to medical image captioning.

Why This Module Is Essential

Real-world data is inherently multimodal. In healthcare, you have medical images, clinical notes, EHR data, patient-reported symptoms, and body location sketches. Learning from multiple modalities provides richer understanding than any single modality alone.

This module is the direct precursor to multimodal healthcare AI:

  • Combining medical images with radiology reports
  • Fusing EHR data with symptom text and body sketches
  • Zero-shot diagnosis for rare conditions
  • Cross-modal clinical reasoning

Learning Objectives

After completing this module, you will:

  • Multimodal Fusion Understanding: Master different strategies for combining vision and language (early fusion, late fusion, cross-modal attention)
  • Contrastive Learning Mastery: Deeply understand InfoNCE loss and how it creates aligned embedding spaces
  • CLIP Architecture: Know how CLIP trains on 400M pairs and enables zero-shot transfer
  • Vision Transformers: Understand how transformers apply to images through patch-based representations
  • Healthcare Application: Apply VLM techniques to medical imaging and clinical text

Prerequisites

Before starting this module, ensure you have:

Week 1: Foundations and Contrastive Learning

Day 1-3: Multimodal Foundations

Core Concept:

Multimodal Foundations

What You’ll Learn:

Fusion Strategies:

  1. Early Fusion (Feature-Level)

    • Concatenate raw features from different modalities
    • Single model processes combined features
    • Pros: Maximum interaction between modalities
    • Cons: Expensive, needs aligned data
  2. Late Fusion (Decision-Level)

    • Separate models for each modality
    • Combine predictions at output
    • Pros: Modality-specific processing, flexible
    • Cons: Limited cross-modal interaction
  3. Intermediate Fusion (Hybrid)

    • Process modalities separately initially
    • Fuse at intermediate layers
    • Cross-attention between modality representations
    • Pros: Balanced interaction and efficiency
    • Most common in modern VLMs
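
To make the contrast concrete, here is a minimal PyTorch sketch of early and late fusion for a toy two-modality setup; the module names (EarlyFusion, LateFusion) and feature dimensions are illustrative assumptions, not taken from a specific paper.

import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate raw modality features, then process them jointly."""
    def __init__(self, img_dim=512, txt_dim=256, hidden=256, n_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_feat, txt_feat):
        return self.mlp(torch.cat([img_feat, txt_feat], dim=-1))

class LateFusion(nn.Module):
    """Separate per-modality heads; combine predictions at the output."""
    def __init__(self, img_dim=512, txt_dim=256, n_classes=10):
        super().__init__()
        self.img_head = nn.Linear(img_dim, n_classes)
        self.txt_head = nn.Linear(txt_dim, n_classes)

    def forward(self, img_feat, txt_feat):
        # Average the two decision-level outputs (logits).
        return 0.5 * (self.img_head(img_feat) + self.txt_head(txt_feat))

img, txt = torch.randn(8, 512), torch.randn(8, 256)
print(EarlyFusion()(img, txt).shape, LateFusion()(img, txt).shape)  # both: [8, 10]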

Modality Alignment:

  • Mapping different modalities to shared embedding space
  • Cosine similarity for measuring alignment
  • Contrastive objectives for learning alignment

Cross-Modal Attention:

  • Query from one modality, Key+Value from another
  • Enables deep interaction between modalities
  • Used in encoder-decoder VLMs (BLIP, Flamingo)
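
A minimal cross-attention sketch in PyTorch, assuming text tokens attend to image patch tokens; the batch size, sequence lengths, and embedding dimension are illustrative.

import torch
import torch.nn as nn

# Cross-modal attention: queries come from one modality, keys/values from another.
embed_dim, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

text_tokens = torch.randn(4, 32, embed_dim)    # queries: text sequence
image_tokens = torch.randn(4, 196, embed_dim)  # keys/values: ViT patch tokens

# Each text token gathers information from all image patches.
fused, attn_weights = cross_attn(query=text_tokens, key=image_tokens, value=image_tokens)
print(fused.shape)         # [4, 32, 512]
print(attn_weights.shape)  # [4, 32, 196]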

Challenges:

  • Modality imbalance (one dominates)
  • Alignment noise (imperfect pairs)
  • Computational cost of joint processing

Learning Resources:

  • Papers: “Multimodal Learning with Transformers: A Survey”
  • Reading: Multimodal fusion taxonomy
  • Code: Implement simple early and late fusion

Exercises:

  • Compare fusion strategies on toy multimodal dataset
  • Visualize learned embedding spaces
  • Implement cross-modal attention layer

Checkpoint: Can you explain the trade-offs between early, late, and intermediate fusion?

Day 4-7: Contrastive Learning

Core Concept:

Contrastive Learning

What You’ll Learn:

Contrastive Learning Intuition:

  • Pull positive pairs together in embedding space
  • Push negative pairs apart
  • Self-supervised learning from natural pairing (images + captions)

InfoNCE Loss (The Core Loss Function):

\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_i^{+}) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i, z_j) / \tau)}

Where:

  • z_i is the anchor (e.g., image embedding)
  • z_i^+ is the positive (e.g., matching text embedding)
  • z_j are the negatives (all other samples in the batch)
  • τ is the temperature (controls distribution sharpness)
  • sim is cosine similarity
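
A minimal from-scratch PyTorch sketch of this loss for the image → text direction; batch indices serve as the positive-pair labels, and the fixed τ = 0.07 and shapes are illustrative.

import torch
import torch.nn.functional as F

def info_nce(image_emb, text_emb, temperature=0.07):
    """InfoNCE over a batch: the i-th image's positive is the i-th text."""
    image_emb = F.normalize(image_emb, dim=-1)  # unit vectors -> dot product = cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # [N, N] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)        # -log softmax of the diagonal (positive) entries

loss = info_nce(torch.randn(256, 512), torch.randn(256, 512))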

Why InfoNCE Works:

  • Implicitly performs hard negative mining
  • Scales with batch size (more negatives = better)
  • Symmetric loss (both directions)
  • No explicit negative sampling needed

Temperature Parameter (τ):

  • Low temperature (0.01-0.1): Sharp distribution, confident predictions
  • High temperature (1.0+): Smooth distribution, uncertain predictions
  • Typical: 0.07 (CLIP uses learnable temperature)
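
A quick numeric illustration of how τ changes the softmax sharpness over the same similarity scores (the values are made up):

import torch

sims = torch.tensor([0.9, 0.3, 0.1])  # cosine similarities: one positive, two negatives
for tau in (0.07, 1.0):
    probs = torch.softmax(sims / tau, dim=0)
    print(tau, probs)
# τ = 0.07 → ~[0.9998, 0.0002, 0.0000]  (sharp, confident)
# τ = 1.0  → ~[0.50, 0.27, 0.22]        (smooth, uncertain)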

Batch Size Importance:

  • Larger batch = more negative pairs
  • CLIP trains with a batch size of 32,768!
  • Gradient accumulation or distributed training required

Variants:

  • SimCLR: Image-image contrastive learning
  • MoCo: Momentum encoders for consistency
  • CLIP: Image-text contrastive learning at scale

False Negatives Problem:

  • Some “negatives” might actually be semantically similar
  • E.g., “dog” and “puppy” treated as negatives
  • Solutions: Hard negative mining, multi-positive contrastive learning

Learning Resources:

  • Papers:
    • “A Simple Framework for Contrastive Learning of Visual Representations” (SimCLR)
    • “Momentum Contrast for Unsupervised Visual Representation Learning” (MoCo)
    • “Representation Learning with Contrastive Predictive Coding” (InfoNCE origin)
  • Videos: Contrastive learning explained
  • Code: Implement InfoNCE loss from scratch

Exercises:

  • Implement contrastive loss in PyTorch
  • Experiment with different temperatures
  • Visualize embedding space with t-SNE
  • Compare different negative sampling strategies
  • Train simple contrastive model on image pairs

Checkpoint: Can you implement InfoNCE loss and explain why batch size matters?

Week 2: Vision Transformers and CLIP

Day 8-10: Vision Transformers (ViT)

Architecture Paper:

Vision Transformer (ViT) - “An Image is Worth 16x16 Words”

What You’ll Learn:

Patch-Based Image Tokenization:

Image (224×224×3)
  ↓ Split into 16×16 patches (14×14 = 196 patches)
  ↓ Flatten each patch (16×16×3 = 768-dim vector)
  ↓ Linear projection to embedding dim
  ↓ Add [CLS] token + positional embeddings
  → 197 tokens (196 patches + 1 [CLS]) → Transformer
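
A minimal patch-embedding sketch following the pipeline above; it uses the standard trick that a Conv2d with kernel size equal to stride equal to the patch size is equivalent to flatten-and-project. The zero initialization of the positional embeddings is a simplification for brevity.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                               # x: [B, 3, 224, 224]
        x = self.proj(x)                                # [B, 768, 14, 14]
        x = x.flatten(2).transpose(1, 2)                # [B, 196, 768]
        cls = self.cls_token.expand(x.size(0), -1, -1)  # [B, 1, 768]
        return torch.cat([cls, x], dim=1) + self.pos_embed  # [B, 197, 768]

print(PatchEmbed()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 197, 768])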

ViT Architecture:

  1. Patch Embedding: Split image into fixed-size patches
  2. Position Embeddings: Learned or sinusoidal (like in NLP transformers)
  3. [CLS] Token: Classification token (like BERT)
  4. Transformer Encoder: Standard transformer blocks (no modification!)
  5. Classification Head: MLP on [CLS] output

Key Insights:

  • Transformers work for vision with sufficient data (JFT-300M pre-training)
  • Attention provides interpretability (visualize attention maps)
  • ViT underperforms ResNets on small datasets but outperforms them when pre-trained at large scale
  • Inductive biases vs data scale trade-off

Pre-Training Strategies:

  • Supervised pre-training on ImageNet or JFT-300M
  • Self-supervised (MAE - Masked Autoencoder)
  • Contrastive (CLIP-style)

ViT vs CNNs:

Aspect           | CNN                                          | ViT
Inductive bias   | Strong (locality, translation equivariance)  | Weak
Data efficiency  | Better on small data                         | Needs large-scale pre-training
Interpretability | Limited (activations)                        | High (attention maps)
Scalability      | Diminishing returns                          | Scales well with data/compute
Flexibility      | Specialized for images                       | General architecture

Learning Resources:

  • Papers: “An Image is Worth 16x16 Words” (ViT paper)
  • Code: timm library (ViT implementations), HuggingFace transformers
  • Videos: ViT explained

Exercises:

  • Implement ViT patch embedding
  • Visualize attention maps
  • Compare ViT vs ResNet on ImageNet
  • Fine-tune pre-trained ViT on custom dataset

Checkpoint: Can you implement ViT patch embedding and explain when to use ViT vs CNNs?

Day 11-14: CLIP Architecture and Training

Architecture Paper:

CLIP - “Learning Transferable Visual Models from Natural Language Supervision”

What You’ll Learn:

CLIP Core Idea:

  • Train vision and text encoders jointly
  • Optimize contrastive objective: match correct image-text pairs
  • Learn from natural language supervision (400M pairs from internet)
  • Enable zero-shot transfer to downstream tasks

CLIP Architecture:

Image → Image Encoder (ViT or ResNet) → Image Embedding (512-dim) → Normalize ─┐
                                                                               ├→ Cosine Similarity Matrix
Text  → Text Encoder (Transformer)    → Text Embedding (512-dim)  → Normalize ─┘

Dual-Encoder Design:

  • Image Encoder: ViT-L/14 or ResNet-50 (various sizes)
  • Text Encoder: Transformer (12 layers, 512-dim)
  • Projection Heads: Linear layers to shared 512-dim space
  • Contrastive Loss: Symmetric InfoNCE over batch
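
A minimal sketch of the symmetric InfoNCE objective over a batch, simplified from what the CLIP paper describes: projection heads are omitted and τ is fixed for clarity, whereas CLIP itself learns the temperature.

import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: image→text and text→image directions averaged."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # [N, N]
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # each image should match its own caption
    loss_t2i = F.cross_entropy(logits.T, targets)  # each caption should match its own image
    return 0.5 * (loss_i2t + loss_t2i)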

Training Details:

  • Scale: 400M image-text pairs from internet
  • Batch Size: 32,768 (massive!)
  • Epochs: 32 epochs
  • Optimization: AdamW with cosine LR schedule
  • Augmentation: Random crop, no color jittering (text describes colors)
  • Temperature: Learnable, initialized to 0.07
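
The learnable temperature is typically stored as a log-scale multiplier on the logits; a short sketch following the open-source CLIP implementation, with the scale clamped at 100 to keep training stable.

import math
import torch
import torch.nn as nn

# Learnable temperature in log space, initialized so that exp(logit_scale) = 1 / 0.07.
logit_scale = nn.Parameter(torch.ones([]) * math.log(1 / 0.07))

def scaled_logits(image_emb, text_emb):
    scale = logit_scale.exp().clamp(max=100.0)  # equivalent to dividing by a temperature >= 0.01
    return scale * image_emb @ text_emb.T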

Zero-Shot Transfer:

# At inference:
image_features = image_encoder(image)                  # [512]
text_features = text_encoder("a photo of a {class}")   # [num_classes, 512]

# Compute similarities
similarities = image_features @ text_features.T        # [num_classes]
probs = softmax(similarities / temperature)

Prompt Engineering for Vision:

  • “a photo of a {class}” works well
  • Ensemble of prompts improves performance:
    • “a photo of a {class}”
    • “a cropped photo of a {class}”
    • “a photo of the {class}”
  • Domain-specific prompts for medical imaging
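
A sketch of prompt ensembling, assuming the OpenAI clip package (clip.load, clip.tokenize, model.encode_text); the templates and class names are illustrative.

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

templates = ["a photo of a {}", "a cropped photo of a {}", "a photo of the {}"]
classes = ["dog", "cat", "bird"]

with torch.no_grad():
    class_embeddings = []
    for name in classes:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(tokens)                  # [num_templates, 512]
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embeddings.append(emb.mean(dim=0))         # average over prompt templates
    text_features = torch.stack(class_embeddings)        # [num_classes, 512]
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)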

CLIP’s Key Innovations:

  1. Natural Language Supervision: 400M web image-text pairs dwarf ImageNet's ~1.3M labeled images
  2. Zero-Shot Transfer: No task-specific training needed
  3. Prompt Engineering: Natural language as flexible interface
  4. Scale: Contrastive learning at unprecedented scale

Performance:

  • Zero-shot CLIP matches a supervised ResNet-50 on ImageNet accuracy
  • Better distribution shift robustness
  • Enables powerful image-text applications

Learning Resources:

  • Papers:
    • “Learning Transferable Visual Models from Natural Language Supervision” (CLIP paper)
    • OpenAI CLIP blog post
  • Code:
    • Official CLIP repository (PyTorch)
    • OpenCLIP (community reimplementation)
  • Videos: CLIP explained, prompt engineering tutorials

Exercises:

  • Implement CLIP contrastive loss (symmetric InfoNCE)
  • Use pre-trained CLIP for zero-shot classification
  • Experiment with prompt engineering
  • Fine-tune CLIP on domain-specific data
  • Visualize learned embedding space

Checkpoint:

  • Can you explain how CLIP enables zero-shot transfer?
  • Can you implement CLIP’s contrastive loss?
  • Do you understand why batch size is critical?

Healthcare Applications

Medical Image Captioning

Chest X-ray → CLIP Image Encoder → Medical Report Generation

Approach:

  • Fine-tune CLIP image encoder on medical images
  • Pair with medical report decoder (GPT-style)
  • Train on (image, report) pairs from MIMIC-CXR

Zero-Shot Medical Image Classification

# Define medical classes via prompts
prompts = [
    "a chest x-ray showing pneumonia",
    "a chest x-ray showing normal lungs",
    "a chest x-ray showing cardiomegaly",
]

# Zero-shot prediction
image_features = clip.encode_image(xray)
text_features = clip.encode_text(prompts)
probs = (image_features @ text_features.T).softmax(dim=-1)

Benefits:

  • No labeled data needed for new conditions
  • Rapidly adapt to new classifications
  • Interpret via natural language

Multimodal Clinical AI

EmergAI Thesis Application:

  • Symptom Text Encoder: CLIP text encoder fine-tuned on clinical text
  • Body Sketch Encoder: CLIP image encoder adapted for body location drawings
  • EHR Encoder: Transformer for structured data
  • Fusion: Cross-attention between modalities
  • Prediction: Emergency department outcomes

Implementation Strategy:

  1. Pre-train on medical data (image-report pairs)
  2. Fine-tune encoders on EmergAI data
  3. Add cross-attention fusion layers
  4. Train end-to-end for outcome prediction

Cross-Modal Retrieval

Find images matching clinical description:

query = "patient with shortness of breath and fever" query_embedding = text_encoder(query) # Retrieve similar images from database image_embeddings = database.get_all_image_embeddings() similarities = query_embedding @ image_embeddings.T top_images = database.get_images(similarities.topk(k=10))

Module Completion Criteria

You have completed this module when you can:

  • ✅ Explain multimodal fusion strategies and trade-offs
  • ✅ Implement contrastive learning (InfoNCE loss)
  • ✅ Understand and use Vision Transformers
  • ✅ Deeply understand CLIP architecture and training
  • ✅ Perform zero-shot image classification with CLIP
  • ✅ Engineer prompts for vision tasks
  • ✅ Apply VLM techniques to healthcare problems

Key Resources

Essential Papers (Must Read)

  1. “Learning Transferable Visual Models from Natural Language Supervision” (Radford et al., 2021) - CLIP
  2. “An Image is Worth 16x16 Words” (Dosovitskiy et al., 2020) - ViT
  3. “A Simple Framework for Contrastive Learning” (Chen et al., 2020) - SimCLR

Code Resources

  • OpenAI CLIP repository (PyTorch)
  • OpenCLIP (community implementation)
  • HuggingFace transformers (ViT, CLIP models)
  • timm library (ViT implementations)

Healthcare-Specific Resources

  • MIMIC-CXR dataset (chest X-rays + reports)
  • BioViL (medical vision-language model)
  • MedCLIP (medical adaptation of CLIP)

Common Pitfalls

1. Insufficient Batch Size

Problem: Contrastive learning needs many negatives.
Solution: Use gradient accumulation or distributed training.

2. Temperature Too High/Low

Problem: The loss doesn't converge or collapses.
Solution: Start with 0.07 and make the temperature learnable.

3. Modality Imbalance

Problem: One modality dominates while the other is ignored.
Solution: Balance gradients, use modality-specific learning rates, or apply modality dropout.

4. Poor Prompt Engineering

Problem: Zero-shot performance is much worse than expected.
Solution: Try an ensemble of prompts and domain-specific templates.

5. Overfitting to Small Medical Datasets

Problem: Fine-tuning CLIP on a small medical dataset destroys the pre-trained knowledge.
Solution: Use a very small learning rate (1e-5), freeze most layers, and add lightweight adapter layers (see the sketch below).
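
One hedged way to do this with the HuggingFace CLIPModel: freeze the backbone encoders and train only the small projection layers with a low learning rate. The model checkpoint and the choice of trainable layers are illustrative; adapters would be added similarly as extra small modules.

import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Freeze everything, then unfreeze only the lightweight projection heads.
for p in model.parameters():
    p.requires_grad = False
for p in list(model.visual_projection.parameters()) + list(model.text_projection.parameters()):
    p.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5, weight_decay=0.01
)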

Success Tips

  1. Start with Pre-Trained Models

    • Don’t train CLIP from scratch (needs 400M pairs!)
    • Fine-tune on domain data
  2. Experiment with Prompts

    • Try multiple prompt templates
    • Ensemble prompts for better performance
    • Domain-specific prompts (medical terminology)
  3. Visualize Embeddings

    • Use t-SNE or UMAP
    • Verify modality alignment
    • Identify failure modes
  4. Monitor Both Modalities

    • Check gradients for both encoders
    • Ensure neither modality is ignored
    • Balance learning rates if needed

Next Steps

After completing this module:

  1. Immediate: Diffusion Models (text-conditioned generation uses CLIP!)
  2. Healthcare: Apply to EHR Analysis with multimodal fusion
  3. Advanced: Explore recent VLMs (BLIP-2, Flamingo, GPT-4V)

Time Investment

Total estimated time: 12-18 hours over 2 weeks

  • Multimodal foundations: 3-4 hours
  • Contrastive learning: 4-5 hours
  • Vision Transformers: 2-3 hours
  • CLIP architecture and training: 3-4 hours
  • Healthcare applications: 2-3 hours

Key Takeaway

“Natural language is a flexible interface to visual models.”

CLIP revolutionized computer vision by showing that language supervision scales better than fixed class labels. This enables zero-shot transfer, flexible task specification via prompts, and powerful multimodal reasoning. For healthcare AI, this means we can leverage medical text (reports, notes) to improve image understanding without expensive pixel-level annotations.

Understand contrastive learning deeply. Master CLIP. Apply to multimodal healthcare problems.


Ready to begin? Start with Multimodal Foundations.