Advanced VLM Architectures
Beyond CLIP, several advanced architectures have pushed the boundaries of vision-language modeling. These models achieve stronger performance through innovative fusion strategies, frozen pre-trained components, and instruction tuning. This guide covers three key architectures: Flamingo (few-shot learning), BLIP-2 (efficient fusion), and LLaVA (instruction following).
Flamingo (DeepMind, 2022)
Flamingo interleaves vision and language for powerful few-shot learning on vision-language tasks.
Architecture
Key Features:
- Frozen pre-trained vision encoder (Normalizer-Free ResNet)
- Frozen pre-trained language model (Chinchilla, 70B parameters)
- Gated cross-attention layers that bridge vision and language
- Total: ~80B parameters (~10B trainable: Perceiver Resampler + gated cross-attention layers)
Gated Cross-Attention
The key innovation: Perceiver Resampler + Gated Cross-Attention
The gated cross-attention mechanism allows the language model to control how much visual information to incorporate at each layer:
```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        """
        Gated cross-attention that allows the LM to attend to visual features.

        The gating mechanism allows the model to control how much visual
        information to incorporate at each layer.
        """
        super().__init__()
        self.cross_attention = nn.MultiheadAttention(dim, num_heads)
        self.gate = nn.Parameter(torch.zeros(1))  # Starts at 0 (no visual input)

    def forward(self, text_features, visual_features):
        """
        Args:
            text_features: (seq_len, batch, dim) - from language model
            visual_features: (num_tokens, batch, dim) - from vision encoder
        Returns:
            gated_features: (seq_len, batch, dim)
        """
        # Cross-attention: text queries, visual keys/values
        attended, _ = self.cross_attention(
            query=text_features,
            key=visual_features,
            value=visual_features
        )
        # Gated residual connection; tanh(gate) ranges from -1 to 1
        return text_features + torch.tanh(self.gate) * attended
```

Why gating?
- Starts at 0: model initially behaves like text-only LM
- Gradually learns to incorporate visual information
- Prevents destabilizing pre-trained LM
- Each layer can control visual contribution independently
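Because the gate starts at zero, a freshly initialized layer passes text features through unchanged. A quick check, using the module and imports defined above:

```python
layer = GatedCrossAttention(dim=512, num_heads=8)

text = torch.randn(16, 2, 512)    # (seq_len, batch, dim)
visual = torch.randn(64, 2, 512)  # (num_visual_tokens, batch, dim)

out = layer(text, visual)
print(out.shape)                  # torch.Size([16, 2, 512])
print(torch.allclose(out, text))  # True: tanh(0) = 0, so no visual contribution yet
```

As training progresses, each layer's gate can open independently, gradually blending in visual information without disturbing the pre-trained language model at initialization.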
Strengths
- Few-shot learning: Can learn from just a few examples in context (0-shot, 4-shot, 8-shot)
- Interleaved inputs: Handles sequences like “image, text, image, text” naturally
- Strong zero-shot: Competes with fine-tuned models without task-specific training
- Frozen components: Leverages powerful pre-trained vision and language models
Limitations
- Huge scale: 80B parameters (expensive to train and deploy)
- Frozen encoders: Can’t adapt encoder features to downstream tasks
- Compute intensive: Requires significant inference resources
- Limited public access: Model not publicly released
BLIP-2 (Salesforce, 2023)
BLIP-2 provides a more efficient approach to VLMs through a two-stage training process with a lightweight Query Transformer (Q-Former).
Architecture
Key Innovation: Q-Former (Query Transformer)
The Q-Former bridges frozen vision and language models using a small set of learnable query tokens:
```python
import torch
import torch.nn as nn

class BLIP2(nn.Module):
    def __init__(self, vision_encoder, text_encoder, llm, num_queries=32):
        """
        BLIP-2 with Q-Former architecture.

        Args:
            vision_encoder: Frozen vision model (e.g., ViT)
            text_encoder: Frozen text encoder (e.g., BERT)
            llm: Frozen large language model
            num_queries: Number of learnable query tokens
        """
        super().__init__()
        # Frozen components
        self.vision_encoder = vision_encoder
        for param in self.vision_encoder.parameters():
            param.requires_grad = False

        self.llm = llm
        for param in self.llm.parameters():
            param.requires_grad = False

        # Learnable Q-Former components
        self.query_tokens = nn.Parameter(torch.randn(1, num_queries, 768))
        self.q_former = BertModel(...)  # Placeholder: trainable BERT-like model with cross-attention (~200M params)

        # Projection to LLM dimension
        self.proj = nn.Linear(768, llm.config.hidden_size)

    def forward(self, images, texts):
        """
        Two-stage forward pass (texts are token IDs).

        Stage 1: Q-Former extracts visual features via cross-attention
        Stage 2: Project to LLM embedding space and generate
        """
        batch_size = images.shape[0]

        # Extract frozen visual features
        with torch.no_grad():
            image_features = self.vision_encoder(images)
            # (batch, num_patches, vision_dim) - e.g., (batch, 257, 1408)

        # Q-Former: learnable queries extract relevant info
        query_tokens = self.query_tokens.expand(batch_size, -1, -1)

        # Q-Former has self-attention over queries + cross-attention to vision
        query_output = self.q_former(
            query_embeds=query_tokens,
            encoder_hidden_states=image_features,
            encoder_attention_mask=None
        )
        # (batch, num_queries, 768) - e.g., (batch, 32, 768)

        # Project to LLM dimension
        visual_embeddings = self.proj(query_output)
        # (batch, num_queries, llm_dim) - e.g., (batch, 32, 4096)

        # Embed text tokens with the frozen LLM's input embedding layer
        text_embeddings = self.llm.get_input_embeddings()(texts)

        # Feed to the frozen LLM. Its weights are frozen via requires_grad=False above;
        # we do not wrap this call in no_grad, so gradients can still reach the
        # Q-Former and projection during training.
        outputs = self.llm(
            inputs_embeds=torch.cat([visual_embeddings, text_embeddings], dim=1)
        )
        return outputs
```

Two-Stage Training
Stage 1: Vision-Language Representation Learning
- Train Q-Former with frozen vision encoder
- Three objectives:
- Image-text contrastive learning: Align query outputs with text
- Image-text matching: Binary classification (match/no match)
- Image-grounded text generation: Generate text from queries
Stage 2: Vision-to-Language Generative Learning
- Connect Q-Former to frozen LLM
- Train only projection layer
- Generate text conditioned on images
- LLM learns to interpret visual embeddings as soft prompts
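To make the stage-1 objective concrete, here is a minimal, hedged sketch of the image-text contrastive (ITC) term, using random tensors in place of real Q-Former and text-encoder outputs. The max-over-queries similarity follows the paper's description; the ITM and ITG terms are left as comments.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins for Q-Former query outputs and text [CLS] features
batch, num_queries, dim = 8, 32, 256
query_feats = F.normalize(torch.randn(batch, num_queries, dim), dim=-1)
text_feats = F.normalize(torch.randn(batch, dim), dim=-1)
temperature = 0.07

# ITC: for each image-text pair, score = max over the 32 query-text similarities
sim = torch.einsum('bqd,cd->bcq', query_feats, text_feats).max(dim=-1).values / temperature
targets = torch.arange(batch)
itc_loss = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

# Image-text matching (ITM) and image-grounded text generation (ITG) would be
# computed similarly from the Q-Former; the stage-1 objective sums all three:
# loss = itc_loss + itm_loss + itg_loss
print(itc_loss.item())
```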
Why Q-Former Works
The Q-Former acts as an information bottleneck:
- Input: High-dimensional visual features (e.g., 257 tokens × 1408 dim from ViT)
- Processing: Learnable queries (32 tokens) extract relevant information via cross-attention
- Output: Fixed number of tokens (32 tokens × 768 dim)
- Benefit:
- Reduces sequence length for LLM (257 → 32 tokens = 8× reduction)
- Extracts only task-relevant visual information
- Enables efficient frozen LLM usage
Analogy: The Q-Former is like a “compression layer” that asks the vision encoder specific questions (queries) and summarizes the answers for the LLM.
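The compression effect is easy to see from shapes alone. The sketch below uses a single nn.MultiheadAttention layer as a stand-in for the Q-Former's cross-attention (the real Q-Former also interleaves self-attention and feed-forward blocks); the dimensions match the example above.

```python
import torch
import torch.nn as nn

vision_dim, qformer_dim, num_queries = 1408, 768, 32

# Frozen ViT output: 257 patch tokens per image
image_features = torch.randn(4, 257, vision_dim)             # (batch, 257, 1408)

# Learnable queries + one cross-attention layer as a stand-in for the Q-Former
queries = nn.Parameter(torch.randn(1, num_queries, qformer_dim))
cross_attn = nn.MultiheadAttention(qformer_dim, num_heads=8,
                                   kdim=vision_dim, vdim=vision_dim,
                                   batch_first=True)

q = queries.expand(4, -1, -1)                                 # (batch, 32, 768)
compressed, _ = cross_attn(query=q, key=image_features, value=image_features)
print(compressed.shape)                                       # torch.Size([4, 32, 768]): 257 tokens -> 32
```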
Strengths
- Efficient: Doesn’t fine-tune huge encoders or LLMs (only ~200M trainable params)
- Flexible: Can plug in different vision encoders and LLMs
- Data-efficient: Leverages pre-trained models, needs less paired data
- Strong performance: Matches or beats much larger models
- Public release: Code and models available
LLaVA (Microsoft, 2023)
LLaVA takes a simpler approach: CLIP vision encoder + linear projection + LLM with instruction tuning.
Architecture
```python
import torch
import torch.nn as nn

class LLaVA(nn.Module):
    def __init__(self, clip_vision, llm):
        """
        LLaVA: Large Language and Vision Assistant.
        Simple but effective architecture.
        """
        super().__init__()
        self.vision_encoder = clip_vision  # CLIP ViT-L/14
        self.projection = nn.Linear(
            clip_vision.output_dim,   # 1024
            llm.config.hidden_size    # 4096 (for LLaMA-7B)
        )
        self.llm = llm  # Vicuna or LLaMA

    def forward(self, images, instruction_ids):
        # Encode image with CLIP
        visual_features = self.vision_encoder(images)
        # (batch, 257, 1024) - 256 patches + 1 CLS token

        # Project to LLM space (simple linear layer!)
        visual_embeddings = self.projection(visual_features)
        # (batch, 257, 4096)

        # Embed instruction token IDs with the LLM's input embedding layer
        instruction_embeddings = self.llm.get_input_embeddings()(instruction_ids)

        # Concatenate visual and text embeddings
        inputs_embeds = torch.cat([visual_embeddings, instruction_embeddings], dim=1)

        # Generate response
        return self.llm(inputs_embeds=inputs_embeds)
```

Key insight: A simple linear projection is sufficient to align CLIP visual features with the LLM's text space. No complex cross-attention or Q-Former is needed.
Training Strategy
Instruction Tuning with GPT-4 Generated Data
LLaVA’s secret sauce is high-quality instruction-following data:
- Data generation: Use GPT-4 to create diverse image-instruction pairs
  - “Describe this image in detail”
  - “What is unusual about this image?”
  - “Answer the question about this image: [question]”
  - “What are the main objects and their relationships?”
- Two-stage training (see the sketch after this list):
  - Stage 1 (pre-training): Train projection layer only (frozen CLIP + frozen LLM)
    - Task: Image captioning
    - Data: 595K image-caption pairs from CC3M
    - Duration: 1 epoch
  - Stage 2 (instruction tuning): Fine-tune LLM + projection (CLIP stays frozen)
    - Task: Instruction following
    - Data: 158K GPT-4 generated instruction-response pairs
    - Duration: 3 epochs
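A minimal sketch of the two stages' freezing schedule, assuming the LLaVA module defined above and the same clip_vision/llm components (optimizer, data loading, and the language-modeling loss are omitted):

```python
def set_requires_grad(module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

model = LLaVA(clip_vision, llm)  # components as in the constructor above

# Stage 1 (pre-training): only the projection layer is trainable
set_requires_grad(model.vision_encoder, False)
set_requires_grad(model.llm, False)
set_requires_grad(model.projection, True)
# ... train on the 595K image-caption pairs for 1 epoch ...

# Stage 2 (instruction tuning): unfreeze the LLM; CLIP stays frozen
set_requires_grad(model.llm, True)
# ... train on the 158K instruction-response pairs for 3 epochs ...
```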
Strengths
- Simplicity: Easiest to understand and implement among advanced VLMs
- Effective: Strong instruction-following capabilities
- Open-source: Code, models, and data publicly available
- Flexible: Easy to adapt to new tasks and domains
- Efficient fine-tuning: Only projection + LLM parameters updated
- Strong community: Active development and improvements (LLaVA-1.5, LLaVA-NeXT)
Comparing VLM Architectures
| Architecture | Trainable Params | Total Params | Efficiency | Performance | Complexity |
|---|---|---|---|---|---|
| CLIP | All (~400M) | ~400M | Moderate | Good | Low |
| Flamingo | ~10B (resampler + gated x-attn) | ~80B | Low | Excellent | High |
| BLIP-2 | ~200M (Q-Former) | ~10B | High | Excellent | Medium |
| LLaVA | ~7B (LLM + proj) | ~7B | Moderate | Very Good | Low |
Key takeaways:
- BLIP-2 is most parameter-efficient (frozen components)
- LLaVA is simplest and most accessible
- Flamingo has best performance but hardest to reproduce
- CLIP is the foundation all others build on
Key Design Choices
When building a VLM, consider these design decisions:
1. Vision Encoder
| Option | Pros | Cons | Best For |
|---|---|---|---|
| CNN (ResNet) | Fast, efficient, works with small data | Limited long-range modeling | Medical imaging, small datasets |
| ViT | Strong performance with enough data | Requires large datasets, slower | General-purpose, large-scale |
| CLIP Vision | Pre-trained on image-text pairs, zero-shot | Fixed to CLIP’s resolution (224px) | When you need alignment |
2. Text Encoder/LLM
| Option | Pros | Cons | Best For |
|---|---|---|---|
| BERT-style | Good for alignment tasks, efficient | Not generative | Retrieval, classification |
| GPT-style | Generative, instruction-following | Large, expensive | Captioning, VQA, chat |
| T5-style | Flexible encoder-decoder | Complex architecture | Translation, summarization |
3. Fusion Strategy
| Strategy | Used In | Trainable Params | Best For |
|---|---|---|---|
| Contrastive | CLIP | All | Alignment, retrieval, zero-shot |
| Cross-attention | Flamingo | Attention layers | Few-shot, interleaved inputs |
| Q-Former | BLIP-2 | Q-Former only | Efficient frozen model integration |
| Simple projection | LLaVA | Projection + LLM | Simplicity, instruction following |
4. Training Approach
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Joint training | End-to-end learning | Expensive, needs lots of data | Unlimited resources |
| Two-stage | More stable, data-efficient | May miss end-to-end optimization | Most practical scenarios |
| Frozen components | Very efficient | Limited adaptability | Limited compute/data |
Recommended Approach for Healthcare VLMs
For building a healthcare VLM (e.g., symptom sketches + clinical notes):
- Vision Encoder: ResNet or hybrid CNN-Transformer
  - Pre-train on ImageNet or RadImageNet
  - Fine-tune on medical images or sketches
  - Why: Medical images differ from natural images, so fine-tuning helps
- Text Encoder: ClinicalBERT or BioClinicalBERT
  - Pre-trained on clinical text (MIMIC notes)
  - Understands medical terminology and abbreviations
  - Why: Domain vocabulary is critical
- Fusion Strategy: Multi-stage approach (a contrastive stage-1 sketch follows this list)
  - Stage 1: CLIP-style contrastive learning on (sketch, text) pairs
  - Stage 2: Add a Q-Former or simple projection for structured data
  - Stage 3: Fine-tune for outcome prediction
  - Why: Leverage contrastive pre-training, then adapt to the downstream task
- Training: Multi-stage with frozen components
  - Start with frozen pre-trained models
  - Efficient with limited data (~2,000 pairs)
  - Fine-tune selectively based on performance
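To ground the fusion strategy, here is a minimal, hedged sketch of the stage-1 CLIP-style contrastive objective on (sketch, text) pairs; `sketch_encoder` and `text_encoder` are hypothetical stand-ins for the pre-trained medical encoders listed above and are assumed to return one embedding per input.

```python
import torch
import torch.nn.functional as F

def contrastive_step(sketch_encoder, text_encoder, sketches, note_ids, temperature=0.07):
    """One stage-1 step: align sketch and clinical-note embeddings (symmetric InfoNCE)."""
    # Encode each modality and L2-normalize (assumes encoders return (batch, dim))
    img = F.normalize(sketch_encoder(sketches), dim=-1)
    txt = F.normalize(text_encoder(note_ids), dim=-1)

    # Cosine-similarity logits for all sketch-note pairs in the batch
    logits = img @ txt.t() / temperature
    targets = torch.arange(len(img), device=img.device)

    # Average of sketch-to-text and text-to-sketch cross-entropy
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```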
Example architecture:
```python
import torch.nn as nn

class HealthcareVLM(nn.Module):
    def __init__(self, num_outcomes):
        super().__init__()
        # Pre-trained on medical images (ResNet50, ClinicalBERT, and QFormer
        # are illustrative placeholder classes)
        self.vision_encoder = ResNet50(pretrained_on='radiology')
        # Pre-trained on clinical notes
        self.text_encoder = ClinicalBERT()
        # Lightweight fusion
        self.q_former = QFormer(num_queries=16, dim=768)
        # Outcome prediction head
        self.classifier = nn.Linear(768, num_outcomes)

# Multi-stage training:
# 1. Contrastive pre-training (sketch-text alignment)
# 2. Q-Former training (information extraction)
# 3. Classifier fine-tuning (outcome prediction)
```

Emerging Trends
1. Unified Multimodal Models
- Single model handling multiple modalities (vision, language, audio, video)
- Examples: ImageBind (Meta, 6 modalities), Unified-IO (AllenAI)
- Enables cross-modal reasoning and generation
2. Efficient Architectures
- Smaller models with strong performance
- Mobile-friendly VLMs (e.g., MobileVLM, TinyLLaVA)
- Quantization (4-bit, 8-bit) and distillation
- Inference optimization (FlashAttention, KV caching)
3. Instruction-Following VLMs
- Better at following complex instructions
- Conversational interfaces (ChatGPT-style for images)
- Chain-of-thought reasoning with images
- Example: LLaVA-1.5 improves multi-turn dialogue
4. Multimodal Agents
- VLMs that can take actions (click, type, navigate)
- Planning and tool use (search, calculate, run code)
- Integration with robotics (visual grounding for manipulation)
- Example: Visual-ChatGPT, HuggingGPT
5. Higher Resolution and Detail
- Moving beyond 224×224 and 336×336 images
- Patch-level detail understanding
- Multiple resolution inputs
- Example: LLaVA-NeXT supports up to 672×672
Related Concepts
- CLIP - Foundation for modern VLMs
- Multimodal Foundations - Core fusion strategies
- Contrastive Learning - Training objective for alignment
- Vision Transformers - Vision encoder architecture
- Clinical VLMs - Healthcare-specific applications
- VLM Applications - Real-world use cases
Further Reading
- Flamingo Paper - Few-shot learning with interleaved vision-language
- BLIP-2 Paper - Efficient frozen model integration with Q-Former
- LLaVA Paper - Instruction tuning for vision-language assistants
- BLIP-2 Code - Official implementation
- LLaVA Code - Official implementation with demo