
Advanced VLM Architectures

Beyond CLIP, several advanced architectures have pushed the boundaries of vision-language modeling. These models achieve stronger performance through innovative fusion strategies, frozen pre-trained components, and instruction tuning. This guide covers three key architectures: Flamingo (few-shot learning), BLIP-2 (efficient fusion), and LLaVA (instruction following).

Flamingo (DeepMind, 2022)

Flamingo interleaves vision and language for powerful few-shot learning on vision-language tasks.

Architecture

Key Features:

  • Frozen pre-trained vision encoder (Normalizer-Free ResNet)
  • Frozen pre-trained language model (Chinchilla, 70B parameters)
  • Gated cross-attention layers that bridge vision and language
  • Total: 80B parameters; only the Perceiver Resampler and gated cross-attention layers (roughly 10B parameters) are trained, while the 70B LM and the vision encoder stay frozen

Gated Cross-Attention

The key innovation: Perceiver Resampler + Gated Cross-Attention

The Perceiver Resampler compresses a variable number of visual features (from one or more images or video frames) into a small, fixed set of visual tokens. Gated cross-attention layers inserted between the frozen LM blocks then allow the language model to control how much of that visual information to incorporate at each layer:

import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        """
        Gated cross-attention that allows the LM to attend to visual features.

        The gating mechanism allows the model to control how much visual
        information to incorporate at each layer.
        """
        super().__init__()
        self.cross_attention = nn.MultiheadAttention(dim, num_heads)
        self.gate = nn.Parameter(torch.zeros(1))  # Starts at 0 (no visual input)

    def forward(self, text_features, visual_features):
        """
        Args:
            text_features: (seq_len, batch, dim) - from language model
            visual_features: (num_tokens, batch, dim) - from vision encoder
        Returns:
            gated_features: (seq_len, batch, dim)
        """
        # Cross-attention: text queries, visual keys/values
        attended, _ = self.cross_attention(
            query=text_features,
            key=visual_features,
            value=visual_features
        )

        # Gated residual connection
        # tanh(gate) ranges from -1 to 1
        return text_features + torch.tanh(self.gate) * attended

Why gating?

  • Starts at 0: model initially behaves like text-only LM
  • Gradually learns to incorporate visual information
  • Prevents destabilizing pre-trained LM
  • Each layer can control visual contribution independently
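
A quick usage sketch of the GatedCrossAttention module above (tensor sizes are arbitrary, chosen only for illustration): because the gate starts at zero, the layer initially acts as an identity over the text stream, so the frozen LM's behavior is untouched at the start of training.

import torch

layer = GatedCrossAttention(dim=512, num_heads=8)

text = torch.randn(16, 2, 512)     # (seq_len, batch, dim) text features
vision = torch.randn(64, 2, 512)   # (num_tokens, batch, dim) visual features

out = layer(text, vision)
print(out.shape)                   # torch.Size([16, 2, 512])

# tanh(0) = 0, so before any training the output equals the text features
print(torch.allclose(out, text))   # True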

Strengths

  • Few-shot learning: Adapts to new tasks from a handful of in-context examples (e.g., 4-shot or 8-shot), in addition to zero-shot use
  • Interleaved inputs: Handles sequences like “image, text, image, text” naturally
  • Strong zero-shot: Competes with fine-tuned models without task-specific training
  • Frozen components: Leverages powerful pre-trained vision and language models

Limitations

  • Huge scale: 80B parameters (expensive to train and deploy)
  • Frozen encoders: Can’t adapt encoder features to downstream tasks
  • Compute intensive: Requires significant inference resources
  • Limited public access: Model not publicly released

BLIP-2 (Salesforce, 2023)

BLIP-2 provides a more efficient approach to VLMs through a two-stage training process with a lightweight Query Transformer (Q-Former).

Architecture

Key Innovation: Q-Former (Query Transformer)

The Q-Former bridges frozen vision and language models using a small set of learnable query tokens:

import torch
import torch.nn as nn

class BLIP2(nn.Module):
    def __init__(self, vision_encoder, text_encoder, llm, num_queries=32):
        """
        BLIP-2 with Q-Former architecture.

        Args:
            vision_encoder: Frozen vision model (e.g., ViT)
            text_encoder: Frozen text encoder (e.g., BERT); not used in this
                simplified forward pass
            llm: Frozen large language model
            num_queries: Number of learnable query tokens
        """
        super().__init__()

        # Frozen components
        self.vision_encoder = vision_encoder
        for param in self.vision_encoder.parameters():
            param.requires_grad = False

        self.llm = llm
        for param in self.llm.parameters():
            param.requires_grad = False

        # Learnable Q-Former components
        self.query_tokens = nn.Parameter(torch.randn(1, num_queries, 768))
        self.q_former = BertModel(...)  # Trainable BERT-like model (~200M params)

        # Projection to LLM dimension
        self.proj = nn.Linear(768, llm.config.hidden_size)

    def forward(self, images, texts):
        """
        Two-stage forward pass.

        Stage 1: Q-Former extracts visual features via cross-attention
        Stage 2: Project to LLM and generate
        """
        batch_size = images.shape[0]

        # Extract frozen visual features
        with torch.no_grad():
            image_features = self.vision_encoder(images)
            # (batch, num_patches, vision_dim) - e.g., (batch, 257, 1408)

        # Q-Former: learnable queries extract relevant info
        query_tokens = self.query_tokens.expand(batch_size, -1, -1)

        # Q-Former has self-attention on queries + cross-attention to vision
        query_output = self.q_former(
            query_embeds=query_tokens,
            encoder_hidden_states=image_features,
            encoder_attention_mask=None
        )
        # (batch, num_queries, 768) - e.g., (batch, 32, 768)

        # Project to LLM dimension
        visual_embeddings = self.proj(query_output)
        # (batch, num_queries, llm_dim) - e.g., (batch, 32, 4096)

        # Embed the text prompt with the LLM's (frozen) token embeddings
        text_embeddings = self.llm.embed_tokens(texts)

        # Feed to the LLM. Its weights are frozen via requires_grad=False, but the
        # call is not wrapped in torch.no_grad(): gradients must still flow through
        # the LLM back to the projection and Q-Former during training.
        outputs = self.llm(
            inputs_embeds=torch.cat([visual_embeddings, text_embeddings], dim=1)
        )

        return outputs

Two-Stage Training

Stage 1: Vision-Language Representation Learning

  • Train Q-Former with frozen vision encoder
  • Three objectives:
    1. Image-text contrastive learning: Align query outputs with text
    2. Image-text matching: Binary classification (match/no match)
    3. Image-grounded text generation: Generate text from queries
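
As a rough sketch of the first objective, here is a minimal InfoNCE-style image-text contrastive loss over pooled Q-Former query outputs and pooled text features (a simplification: BLIP-2 scores each query against the text and takes the highest similarity, rather than mean-pooling the queries):

import torch
import torch.nn.functional as F

def image_text_contrastive_loss(query_feats, text_feats, temperature=0.07):
    """Sketch of an image-text contrastive objective.

    query_feats: (batch, num_queries, dim) Q-Former query outputs
    text_feats:  (batch, dim) pooled text features (e.g., [CLS])
    """
    # Mean-pool the queries for brevity (an assumption made for this sketch)
    image_emb = F.normalize(query_feats.mean(dim=1), dim=-1)   # (batch, dim)
    text_emb = F.normalize(text_feats, dim=-1)                 # (batch, dim)

    logits = image_emb @ text_emb.t() / temperature            # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2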

Stage 2: Vision-to-Language Generative Learning

  • Connect Q-Former to frozen LLM
  • Train the Q-Former plus a lightweight projection layer; the vision encoder and LLM stay frozen
  • Generate text conditioned on images
  • LLM learns to interpret visual embeddings as soft prompts

Why Q-Former Works

The Q-Former acts as an information bottleneck:

  • Input: High-dimensional visual features (e.g., 257 tokens × 1408 dim from ViT)
  • Processing: Learnable queries (32 tokens) extract relevant information via cross-attention
  • Output: Fixed number of tokens (32 tokens × 768 dim)
  • Benefit:
    • Reduces sequence length for LLM (257 → 32 tokens = 8× reduction)
    • Extracts only task-relevant visual information
    • Enables efficient frozen LLM usage

Analogy: The Q-Former is like a “compression layer” that asks the vision encoder specific questions (queries) and summarizes the answers for the LLM.
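
A minimal sketch of this bottleneck idea, stripped down to a single cross-attention layer (the real Q-Former is a multi-layer BERT-style transformer with self-attention over the queries plus cross-attention to the frozen image features; the class name and dimensions here are illustrative):

import torch
import torch.nn as nn

class QueryBottleneck(nn.Module):
    """32 learnable queries compress 257 visual tokens into a fixed-size summary."""

    def __init__(self, num_queries=32, query_dim=768, vision_dim=1408, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, query_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=query_dim, kdim=vision_dim, vdim=vision_dim,
            num_heads=num_heads, batch_first=True,
        )

    def forward(self, image_features):
        # image_features: (batch, 257, 1408) from a frozen ViT
        queries = self.queries.expand(image_features.size(0), -1, -1)
        summary, _ = self.cross_attn(query=queries, key=image_features, value=image_features)
        return summary  # (batch, 32, 768): fixed-length summary for the LLM

bottleneck = QueryBottleneck()
print(bottleneck(torch.randn(2, 257, 1408)).shape)  # torch.Size([2, 32, 768])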

Strengths

  • Efficient: Doesn’t fine-tune huge encoders or LLMs (only ~200M trainable params)
  • Flexible: Can plug in different vision encoders and LLMs
  • Data-efficient: Leverages pre-trained models, needs less paired data
  • Strong performance: Matches or beats much larger models
  • Public release: Code and models available

LLaVA (Microsoft, 2023)

LLaVA takes a simpler approach: CLIP vision encoder + linear projection + LLM with instruction tuning.

Architecture

import torch
import torch.nn as nn

class LLaVA(nn.Module):
    def __init__(self, clip_vision, llm):
        """
        LLaVA: Large Language and Vision Assistant
        Simple but effective architecture.
        """
        super().__init__()
        self.vision_encoder = clip_vision  # CLIP ViT-L/14
        self.projection = nn.Linear(
            clip_vision.output_dim,   # 1024
            llm.config.hidden_size    # 4096 (for LLaMA-7B)
        )
        self.llm = llm  # Vicuna or LLaMA

    def forward(self, images, instruction):
        # Encode image with CLIP
        visual_features = self.vision_encoder(images)
        # (batch, 257, 1024) - 256 patches + 1 CLS token

        # Project to LLM space (simple linear layer!)
        visual_embeddings = self.projection(visual_features)
        # (batch, 257, 4096)

        # Encode instruction text
        instruction_embeddings = self.llm.embed_tokens(instruction)

        # Concatenate visual and text embeddings
        inputs_embeds = torch.cat([visual_embeddings, instruction_embeddings], dim=1)

        # Generate response
        return self.llm(inputs_embeds=inputs_embeds)

Key insight: A simple linear projection is sufficient to align CLIP visual features with LLM text space. No need for complex cross-attention or Q-Former.

Training Strategy

Instruction Tuning with GPT-4 Generated Data

LLaVA’s secret sauce is high-quality instruction-following data:

  1. Data generation: Use GPT-4 to create diverse image-instruction pairs

    • “Describe this image in detail”
    • “What is unusual about this image?”
    • “Answer the question about this image: [question]”
    • “What are the main objects and their relationships?”
  2. Two-stage training:

    • Stage 1 (pre-training): Train projection layer only (frozen CLIP + frozen LLM)

      • Task: Image captioning
      • Data: 595K image-caption pairs from CC3M
      • Duration: 1 epoch
    • Stage 2 (instruction tuning): Fine-tune LLM + projection (CLIP stays frozen)

      • Task: Instruction following
      • Data: 158K GPT-4 generated instruction-response pairs
      • Duration: 3 epochs
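
A rough sketch of what the two stages mean in terms of which parameters receive gradients, assuming the LLaVA module defined above (optimizer settings are placeholders, not the official recipe):

import torch

def configure_stage(model, stage):
    """Select trainable parameters for LLaVA-style staged training."""
    # The CLIP vision encoder stays frozen in both stages
    for p in model.vision_encoder.parameters():
        p.requires_grad = False

    if stage == 1:
        # Stage 1 (pre-training): only the linear projection learns
        for p in model.llm.parameters():
            p.requires_grad = False
        trainable = list(model.projection.parameters())
    else:
        # Stage 2 (instruction tuning): projection + LLM are fine-tuned
        for p in model.llm.parameters():
            p.requires_grad = True
        trainable = list(model.projection.parameters()) + list(model.llm.parameters())

    return torch.optim.AdamW(trainable, lr=2e-5)  # placeholder learning rate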

Strengths

  • Simplicity: Easiest to understand and implement among advanced VLMs
  • Effective: Strong instruction-following capabilities
  • Open-source: Code, models, and data publicly available
  • Flexible: Easy to adapt to new tasks and domains
  • Efficient fine-tuning: Only projection + LLM parameters updated
  • Strong community: Active development and improvements (LLaVA-1.5, LLaVA-NeXT)

Comparing VLM Architectures

Architecture | Trainable Params               | Total Params | Efficiency | Performance | Complexity
CLIP         | All (~400M)                    | ~400M        | Moderate   | Good        | Low
Flamingo     | ~10B (resampler + cross-attn)  | 80B          | Low        | Excellent   | High
BLIP-2       | ~200M (Q-Former)               | ~10B         | High       | Excellent   | Medium
LLaVA        | ~7B (LLM + proj)               | ~7B          | Moderate   | Very Good   | Low

Key takeaways:

  • BLIP-2 is most parameter-efficient (frozen components)
  • LLaVA is simplest and most accessible
  • Flamingo has best performance but hardest to reproduce
  • CLIP-style contrastive pre-training underpins the vision encoders that the others build on

Key Design Choices

When building a VLM, consider these design decisions:

1. Vision Encoder

Option       | Pros                                        | Cons                                | Best For
CNN (ResNet) | Fast, efficient, works with small data      | Limited long-range modeling         | Medical imaging, small datasets
ViT          | Strong performance with enough data         | Requires large datasets, slower     | General-purpose, large-scale
CLIP Vision  | Pre-trained on image-text pairs, zero-shot  | Fixed to CLIP’s resolution (224px)  | When you need alignment

2. Text Encoder/LLM

Option     | Pros                                 | Cons                  | Best For
BERT-style | Good for alignment tasks, efficient  | Not generative        | Retrieval, classification
GPT-style  | Generative, instruction-following    | Large, expensive      | Captioning, VQA, chat
T5-style   | Flexible encoder-decoder             | Complex architecture  | Translation, summarization

3. Fusion Strategy

Strategy          | Used In  | Trainable Params  | Best For
Contrastive       | CLIP     | All               | Alignment, retrieval, zero-shot
Cross-attention   | Flamingo | Attention layers  | Few-shot, interleaved inputs
Q-Former          | BLIP-2   | Q-Former only     | Efficient frozen model integration
Simple projection | LLaVA    | Projection + LLM  | Simplicity, instruction following

4. Training Approach

Approach          | Pros                         | Cons                              | Best For
Joint training    | End-to-end learning          | Expensive, needs lots of data     | Unlimited resources
Two-stage         | More stable, data-efficient  | May miss end-to-end optimization  | Most practical scenarios
Frozen components | Very efficient               | Limited adaptability              | Limited compute/data
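
Whichever approach you choose, a quick sanity check is to count how many parameters actually receive gradients. A small helper along these lines (illustrative, not tied to any particular library API):

import torch.nn as nn

def count_parameters(model: nn.Module):
    """Return (total, trainable) parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# For a BLIP-2-style setup, expect trainable to be a small fraction of total:
# total, trainable = count_parameters(model)
# print(f"{trainable / 1e6:.0f}M trainable of {total / 1e6:.0f}M total")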

Example: Designing a Healthcare VLM

For building a healthcare VLM (e.g., symptom sketches + clinical notes):

  1. Vision Encoder: ResNet or hybrid CNN-Transformer

    • Pre-train on ImageNet or RadImageNet
    • Fine-tune on medical images or sketches
    • Why: Medical images differ from natural images, fine-tuning helps
  2. Text Encoder: ClinicalBERT or BioClinicalBERT

    • Pre-trained on clinical text (MIMIC notes)
    • Understands medical terminology and abbreviations
    • Why: Domain vocabulary is critical
  3. Fusion Strategy: Multi-stage approach

    • Stage 1: CLIP-style contrastive learning on (sketch, text) pairs
    • Stage 2: Add Q-Former or simple projection for structured data
    • Stage 3: Fine-tune for outcome prediction
    • Why: Leverage contrastive pre-training, then adapt to downstream task
  4. Training: Multi-stage with frozen components

    • Start with frozen pre-trained models
    • Efficient with limited data (~2,000 pairs)
    • Fine-tune selectively based on performance

Example architecture:

class HealthcareVLM(nn.Module):
    def __init__(self):
        super().__init__()

        # Pre-trained on medical images
        self.vision_encoder = ResNet50(pretrained_on='radiology')

        # Pre-trained on clinical notes
        self.text_encoder = ClinicalBERT()

        # Lightweight fusion
        self.q_former = QFormer(num_queries=16, dim=768)

        # Outcome prediction head
        self.classifier = nn.Linear(768, num_outcomes)

    # Multi-stage training:
    # 1. Contrastive pre-training (sketch-text alignment)
    # 2. Q-Former training (information extraction)
    # 3. Classifier fine-tuning (outcome prediction)

Future Directions

1. Unified Multimodal Models

  • Single model handling multiple modalities (vision, language, audio, video)
  • Examples: ImageBind (Meta, 6 modalities), Unified-IO (AllenAI)
  • Enables cross-modal reasoning and generation

2. Efficient Architectures

  • Smaller models with strong performance
  • Mobile-friendly VLMs (e.g., MobileVLM, TinyLLaVA)
  • Quantization (4-bit, 8-bit) and distillation
  • Inference optimization (FlashAttention, KV caching)

3. Instruction-Following VLMs

  • Better at following complex instructions
  • Conversational interfaces (ChatGPT-style for images)
  • Chain-of-thought reasoning with images
  • Example: LLaVA-1.5 improves multi-turn dialogue

4. Multimodal Agents

  • VLMs that can take actions (click, type, navigate)
  • Planning and tool use (search, calculate, run code)
  • Integration with robotics (visual grounding for manipulation)
  • Example: Visual-ChatGPT, HuggingGPT

5. Higher Resolution and Detail

  • Moving beyond 224×224 and 336×336 images
  • Patch-level detail understanding
  • Multiple resolution inputs
  • Example: LLaVA-NeXT supports up to 672×672

Further Reading