
Advanced VLM Architectures

Beyond CLIP, several advanced architectures have pushed the boundaries of vision-language modeling. These models achieve stronger performance through innovative fusion strategies, frozen pre-trained components, and instruction tuning. This guide covers three key architectures: Flamingo (few-shot learning), BLIP-2 (efficient fusion), and LLaVA (instruction following).

Flamingo (DeepMind, 2022)

Flamingo interleaves vision and language for powerful few-shot learning on vision-language tasks.

Architecture

Key Features:

  • Frozen pre-trained vision encoder (Normalizer-Free ResNet)
  • Frozen pre-trained language model (Chinchilla, 70B parameters)
  • Gated cross-attention layers that bridge vision and language
  • Total: 80B parameters; only the Perceiver Resampler and gated cross-attention layers (roughly 10B parameters) are trained, while the 70B LM and the vision encoder stay frozen

Gated Cross-Attention

The key innovation: Perceiver Resampler + Gated Cross-Attention

The Perceiver Resampler compresses a variable number of visual features (from one or more images or video frames) into a small, fixed set of visual tokens. Gated cross-attention layers inserted between the frozen LM blocks then allow the language model to control how much of that visual information to incorporate at each layer:

import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        """
        Gated cross-attention that allows the LM to attend to visual features.

        The gating mechanism allows the model to control how much visual
        information to incorporate at each layer.
        """
        super().__init__()
        self.cross_attention = nn.MultiheadAttention(dim, num_heads)
        self.gate = nn.Parameter(torch.zeros(1))  # Starts at 0 (no visual input)

    def forward(self, text_features, visual_features):
        """
        Args:
            text_features: (seq_len, batch, dim) - from language model
            visual_features: (num_tokens, batch, dim) - from vision encoder
        Returns:
            gated_features: (seq_len, batch, dim)
        """
        # Cross-attention: text queries, visual keys/values
        attended, _ = self.cross_attention(
            query=text_features,
            key=visual_features,
            value=visual_features
        )

        # Gated residual connection
        # tanh(gate) ranges from -1 to 1
        return text_features + torch.tanh(self.gate) * attended

Why gating?

  • Starts at 0: model initially behaves like text-only LM
  • Gradually learns to incorporate visual information
  • Prevents destabilizing pre-trained LM
  • Each layer can control visual contribution independently
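
A quick usage sketch of the GatedCrossAttention module above (tensor sizes are arbitrary, chosen only for illustration): because the gate starts at zero, the layer initially acts as an identity over the text stream, so the frozen LM's behavior is untouched at the start of training.

import torch

layer = GatedCrossAttention(dim=512, num_heads=8)

text = torch.randn(16, 2, 512)     # (seq_len, batch, dim) text features
vision = torch.randn(64, 2, 512)   # (num_tokens, batch, dim) visual features

out = layer(text, vision)
print(out.shape)                   # torch.Size([16, 2, 512])

# tanh(0) = 0, so before any training the output equals the text features
print(torch.allclose(out, text))   # True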

Strengths

  • Few-shot learning: Adapts to new tasks from a handful of in-context examples (e.g., 4-shot or 8-shot), in addition to zero-shot use
  • Interleaved inputs: Handles sequences like “image, text, image, text” naturally
  • Strong zero-shot: Competes with fine-tuned models without task-specific training
  • Frozen components: Leverages powerful pre-trained vision and language models

Limitations

  • Huge scale: 80B parameters (expensive to train and deploy)
  • Frozen encoders: Can’t adapt encoder features to downstream tasks
  • Compute intensive: Requires significant inference resources
  • Limited public access: Model not publicly released

BLIP-2 (Salesforce, 2023)

BLIP-2 provides a more efficient approach to VLMs through a two-stage training process with a lightweight Query Transformer (Q-Former).

Architecture

Key Innovation: Q-Former (Query Transformer)

The Q-Former bridges frozen vision and language models using a small set of learnable query tokens:

import torch
import torch.nn as nn

class BLIP2(nn.Module):
    def __init__(self, vision_encoder, text_encoder, llm, num_queries=32):
        """
        BLIP-2 with Q-Former architecture.

        Args:
            vision_encoder: Frozen vision model (e.g., ViT)
            text_encoder: Frozen text encoder (e.g., BERT); not used in this
                simplified forward pass
            llm: Frozen large language model
            num_queries: Number of learnable query tokens
        """
        super().__init__()

        # Frozen components
        self.vision_encoder = vision_encoder
        for param in self.vision_encoder.parameters():
            param.requires_grad = False

        self.llm = llm
        for param in self.llm.parameters():
            param.requires_grad = False

        # Learnable Q-Former components
        self.query_tokens = nn.Parameter(torch.randn(1, num_queries, 768))
        self.q_former = BertModel(...)  # Trainable BERT-like model (~200M params)

        # Projection to LLM dimension
        self.proj = nn.Linear(768, llm.config.hidden_size)

    def forward(self, images, texts):
        """
        Two-stage forward pass.

        Stage 1: Q-Former extracts visual features via cross-attention
        Stage 2: Project to LLM and generate
        """
        batch_size = images.shape[0]

        # Extract frozen visual features
        with torch.no_grad():
            image_features = self.vision_encoder(images)
            # (batch, num_patches, vision_dim) - e.g., (batch, 257, 1408)

        # Q-Former: learnable queries extract relevant info
        query_tokens = self.query_tokens.expand(batch_size, -1, -1)

        # Q-Former has self-attention on queries + cross-attention to vision
        query_output = self.q_former(
            query_embeds=query_tokens,
            encoder_hidden_states=image_features,
            encoder_attention_mask=None
        )
        # (batch, num_queries, 768) - e.g., (batch, 32, 768)

        # Project to LLM dimension
        visual_embeddings = self.proj(query_output)
        # (batch, num_queries, llm_dim) - e.g., (batch, 32, 4096)

        # Embed the text prompt with the LLM's (frozen) token embeddings
        text_embeddings = self.llm.embed_tokens(texts)

        # Feed to the LLM. Its weights are frozen via requires_grad=False, but the
        # call is not wrapped in torch.no_grad(): gradients must still flow through
        # the LLM back to the projection and Q-Former during training.
        outputs = self.llm(
            inputs_embeds=torch.cat([visual_embeddings, text_embeddings], dim=1)
        )

        return outputs

Two-Stage Training

Stage 1: Vision-Language Representation Learning

  • Train Q-Former with frozen vision encoder
  • Three objectives:
    1. Image-text contrastive learning: Align query outputs with text
    2. Image-text matching: Binary classification (match/no match)
    3. Image-grounded text generation: Generate text from queries
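
As a rough sketch of the first objective, here is a minimal InfoNCE-style image-text contrastive loss over pooled Q-Former query outputs and pooled text features (a simplification: BLIP-2 scores each query against the text and takes the highest similarity, rather than mean-pooling the queries):

import torch
import torch.nn.functional as F

def image_text_contrastive_loss(query_feats, text_feats, temperature=0.07):
    """Sketch of an image-text contrastive objective.

    query_feats: (batch, num_queries, dim) Q-Former query outputs
    text_feats:  (batch, dim) pooled text features (e.g., [CLS])
    """
    # Mean-pool the queries for brevity (an assumption made for this sketch)
    image_emb = F.normalize(query_feats.mean(dim=1), dim=-1)   # (batch, dim)
    text_emb = F.normalize(text_feats, dim=-1)                 # (batch, dim)

    logits = image_emb @ text_emb.t() / temperature            # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2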

Stage 2: Vision-to-Language Generative Learning

  • Connect Q-Former to frozen LLM
  • Train the Q-Former plus a lightweight projection layer; the vision encoder and LLM stay frozen
  • Generate text conditioned on images
  • LLM learns to interpret visual embeddings as soft prompts

Why Q-Former Works

The Q-Former acts as an information bottleneck:

  • Input: High-dimensional visual features (e.g., 257 tokens × 1408 dim from ViT)
  • Processing: Learnable queries (32 tokens) extract relevant information via cross-attention
  • Output: Fixed number of tokens (32 tokens × 768 dim)
  • Benefit:
    • Reduces sequence length for LLM (257 → 32 tokens = 8× reduction)
    • Extracts only task-relevant visual information
    • Enables efficient frozen LLM usage

Analogy: The Q-Former is like a “compression layer” that asks the vision encoder specific questions (queries) and summarizes the answers for the LLM.
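
A minimal sketch of this bottleneck idea, stripped down to a single cross-attention layer (the real Q-Former is a multi-layer BERT-style transformer with self-attention over the queries plus cross-attention to the frozen image features; the class name and dimensions here are illustrative):

import torch
import torch.nn as nn

class QueryBottleneck(nn.Module):
    """32 learnable queries compress 257 visual tokens into a fixed-size summary."""

    def __init__(self, num_queries=32, query_dim=768, vision_dim=1408, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, query_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=query_dim, kdim=vision_dim, vdim=vision_dim,
            num_heads=num_heads, batch_first=True,
        )

    def forward(self, image_features):
        # image_features: (batch, 257, 1408) from a frozen ViT
        queries = self.queries.expand(image_features.size(0), -1, -1)
        summary, _ = self.cross_attn(query=queries, key=image_features, value=image_features)
        return summary  # (batch, 32, 768): fixed-length summary for the LLM

bottleneck = QueryBottleneck()
print(bottleneck(torch.randn(2, 257, 1408)).shape)  # torch.Size([2, 32, 768])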

Strengths

  • Efficient: Doesn’t fine-tune huge encoders or LLMs (only ~200M trainable params)
  • Flexible: Can plug in different vision encoders and LLMs
  • Data-efficient: Leverages pre-trained models, needs less paired data
  • Strong performance: Matches or beats much larger models
  • Public release: Code and models available

LLaVA (Microsoft, 2023)

LLaVA takes a simpler approach: CLIP vision encoder + linear projection + LLM with instruction tuning.

Architecture

import torch
import torch.nn as nn

class LLaVA(nn.Module):
    def __init__(self, clip_vision, llm):
        """
        LLaVA: Large Language and Vision Assistant
        Simple but effective architecture.
        """
        super().__init__()
        self.vision_encoder = clip_vision  # CLIP ViT-L/14
        self.projection = nn.Linear(
            clip_vision.output_dim,   # 1024
            llm.config.hidden_size    # 4096 (for LLaMA-7B)
        )
        self.llm = llm  # Vicuna or LLaMA

    def forward(self, images, instruction):
        # Encode image with CLIP
        visual_features = self.vision_encoder(images)
        # (batch, 257, 1024) - 256 patches + 1 CLS token

        # Project to LLM space (simple linear layer!)
        visual_embeddings = self.projection(visual_features)
        # (batch, 257, 4096)

        # Encode instruction text
        instruction_embeddings = self.llm.embed_tokens(instruction)

        # Concatenate visual and text embeddings
        inputs_embeds = torch.cat([visual_embeddings, instruction_embeddings], dim=1)

        # Generate response
        return self.llm(inputs_embeds=inputs_embeds)

Key insight: A simple linear projection is sufficient to align CLIP visual features with LLM text space. No need for complex cross-attention or Q-Former.

Training Strategy

Instruction Tuning with GPT-4 Generated Data

LLaVA’s secret sauce is high-quality instruction-following data:

  1. Data generation: Use GPT-4 to create diverse image-instruction pairs

    • “Describe this image in detail”
    • “What is unusual about this image?”
    • “Answer the question about this image: [question]”
    • “What are the main objects and their relationships?”
  2. Two-stage training:

    • Stage 1 (pre-training): Train projection layer only (frozen CLIP + frozen LLM)

      • Task: Image captioning
      • Data: 595K image-caption pairs from CC3M
      • Duration: 1 epoch
    • Stage 2 (instruction tuning): Fine-tune LLM + projection (CLIP stays frozen)

      • Task: Instruction following
      • Data: 158K GPT-4 generated instruction-response pairs
      • Duration: 3 epochs
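
A rough sketch of what the two stages mean in terms of which parameters receive gradients, assuming the LLaVA module defined above (optimizer settings are placeholders, not the official recipe):

import torch

def configure_stage(model, stage):
    """Select trainable parameters for LLaVA-style staged training."""
    # The CLIP vision encoder stays frozen in both stages
    for p in model.vision_encoder.parameters():
        p.requires_grad = False

    if stage == 1:
        # Stage 1 (pre-training): only the linear projection learns
        for p in model.llm.parameters():
            p.requires_grad = False
        trainable = list(model.projection.parameters())
    else:
        # Stage 2 (instruction tuning): projection + LLM are fine-tuned
        for p in model.llm.parameters():
            p.requires_grad = True
        trainable = list(model.projection.parameters()) + list(model.llm.parameters())

    return torch.optim.AdamW(trainable, lr=2e-5)  # placeholder learning rate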

Strengths

  • Simplicity: Easiest to understand and implement among advanced VLMs
  • Effective: Strong instruction-following capabilities
  • Open-source: Code, models, and data publicly available
  • Flexible: Easy to adapt to new tasks and domains
  • Efficient fine-tuning: Only projection + LLM parameters updated
  • Strong community: Active development and improvements (LLaVA-1.5, LLaVA-NeXT)

Comparing VLM Architectures

Architecture | Trainable Params               | Total Params | Efficiency | Performance | Complexity
CLIP         | All (~400M)                    | ~400M        | Moderate   | Good        | Low
Flamingo     | ~10B (resampler + cross-attn)  | 80B          | Low        | Excellent   | High
BLIP-2       | ~200M (Q-Former)               | ~10B         | High       | Excellent   | Medium
LLaVA        | ~7B (LLM + proj)               | ~7B          | Moderate   | Very Good   | Low

Key takeaways:

  • BLIP-2 is most parameter-efficient (frozen components)
  • LLaVA is simplest and most accessible
  • Flamingo has best performance but hardest to reproduce
  • CLIP-style contrastive pre-training underpins the vision encoders that the others build on

Key Design Choices

When building a VLM, consider these design decisions:

1. Vision Encoder

Option       | Pros                                        | Cons                                | Best For
CNN (ResNet) | Fast, efficient, works with small data      | Limited long-range modeling         | Medical imaging, small datasets
ViT          | Strong performance with enough data         | Requires large datasets, slower     | General-purpose, large-scale
CLIP Vision  | Pre-trained on image-text pairs, zero-shot  | Fixed to CLIP’s resolution (224px)  | When you need alignment

2. Text Encoder/LLM

Option     | Pros                                 | Cons                  | Best For
BERT-style | Good for alignment tasks, efficient  | Not generative        | Retrieval, classification
GPT-style  | Generative, instruction-following    | Large, expensive      | Captioning, VQA, chat
T5-style   | Flexible encoder-decoder             | Complex architecture  | Translation, summarization

3. Fusion Strategy

Strategy          | Used In  | Trainable Params  | Best For
Contrastive       | CLIP     | All               | Alignment, retrieval, zero-shot
Cross-attention   | Flamingo | Attention layers  | Few-shot, interleaved inputs
Q-Former          | BLIP-2   | Q-Former only     | Efficient frozen model integration
Simple projection | LLaVA    | Projection + LLM  | Simplicity, instruction following

4. Training Approach

Approach          | Pros                         | Cons                              | Best For
Joint training    | End-to-end learning          | Expensive, needs lots of data     | Unlimited resources
Two-stage         | More stable, data-efficient  | May miss end-to-end optimization  | Most practical scenarios
Frozen components | Very efficient               | Limited adaptability              | Limited compute/data
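
Whichever approach you choose, a quick sanity check is to count how many parameters actually receive gradients. A small helper along these lines (illustrative, not tied to any particular library API):

import torch.nn as nn

def count_parameters(model: nn.Module):
    """Return (total, trainable) parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# For a BLIP-2-style setup, expect trainable to be a small fraction of total:
# total, trainable = count_parameters(model)
# print(f"{trainable / 1e6:.0f}M trainable of {total / 1e6:.0f}M total")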

Example: Designing a Healthcare VLM

For building a healthcare VLM (e.g., symptom sketches + clinical notes):

  1. Vision Encoder: ResNet or hybrid CNN-Transformer

    • Pre-train on ImageNet or RadImageNet
    • Fine-tune on medical images or sketches
    • Why: Medical images differ from natural images, fine-tuning helps
  2. Text Encoder: ClinicalBERT or BioClinicalBERT

    • Pre-trained on clinical text (MIMIC notes)
    • Understands medical terminology and abbreviations
    • Why: Domain vocabulary is critical
  3. Fusion Strategy: Multi-stage approach

    • Stage 1: CLIP-style contrastive learning on (sketch, text) pairs
    • Stage 2: Add Q-Former or simple projection for structured data
    • Stage 3: Fine-tune for outcome prediction
    • Why: Leverage contrastive pre-training, then adapt to downstream task
  4. Training: Multi-stage with frozen components

    • Start with frozen pre-trained models
    • Efficient with limited data (~2,000 pairs)
    • Fine-tune selectively based on performance

Example architecture:

class HealthcareVLM(nn.Module):
    def __init__(self):
        super().__init__()

        # Pre-trained on medical images
        self.vision_encoder = ResNet50(pretrained_on='radiology')

        # Pre-trained on clinical notes
        self.text_encoder = ClinicalBERT()

        # Lightweight fusion
        self.q_former = QFormer(num_queries=16, dim=768)

        # Outcome prediction head
        self.classifier = nn.Linear(768, num_outcomes)

    # Multi-stage training:
    # 1. Contrastive pre-training (sketch-text alignment)
    # 2. Q-Former training (information extraction)
    # 3. Classifier fine-tuning (outcome prediction)

Future Directions

1. Unified Multimodal Models

  • Single model handling multiple modalities (vision, language, audio, video)
  • Examples: ImageBind (Meta, 6 modalities), Unified-IO (AllenAI)
  • Enables cross-modal reasoning and generation

2. Efficient Architectures

  • Smaller models with strong performance
  • Mobile-friendly VLMs (e.g., MobileVLM, TinyLLaVA)
  • Quantization (4-bit, 8-bit) and distillation
  • Inference optimization (FlashAttention, KV caching)

3. Instruction-Following VLMs

  • Better at following complex instructions
  • Conversational interfaces (ChatGPT-style for images)
  • Chain-of-thought reasoning with images
  • Example: LLaVA-1.5 improves multi-turn dialogue

4. Multimodal Agents

  • VLMs that can take actions (click, type, navigate)
  • Planning and tool use (search, calculate, run code)
  • Integration with robotics (visual grounding for manipulation)
  • Example: Visual-ChatGPT, HuggingGPT

5. Higher Resolution and Detail

  • Moving beyond 224×224 and 336×336 images
  • Patch-level detail understanding
  • Multiple resolution inputs
  • Example: LLaVA-NeXT supports up to 672×672

Further Reading