VLM Architecture — Encoders, Connectors, Positional Encoding
Intuition
Modern VLMs stitch together three pillars: a pretrained vision encoder, a small connector, and a pretrained LLM. The connector is the only randomly initialised piece. Attention is then masked so the image + prompt act as 'free context' (bidirectional) while only the answer is generated causally.
Explanation
The modality gap: text is discrete (vocab index → embedding via lookup); images are continuous (pixels) and require a learned encoder. A VLM's job is to align these into a shared latent space where similar image/text content sits close.
Three-pillar VLM blueprint (PaliGemma): (1) Vision Encoder — pretrained, frozen, e.g., SigLIP ViT-So400m (pixels → patch embeddings); (2) Connector — the only randomly initialised piece, a single linear layer (or small MLP) reshaping from D_enc to D_llm; (3) LLM backbone — pretrained, e.g., Gemma-2B decoder, autoregressive over [visual tokens || text tokens].
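A minimal sketch of the stitched forward pass, using the dimensions above (1152 → 2048); the random tensors stand in for the frozen encoder's output and the LLM's text embeddings, not the actual SigLIP/Gemma APIs:

```python
import torch
import torch.nn as nn

D_ENC, D_LLM = 1152, 2048  # SigLIP-So400m feature dim -> Gemma-2B token dim

class Connector(nn.Module):
    """The only randomly initialised piece: one linear projection."""
    def __init__(self, d_enc: int = D_ENC, d_llm: int = D_LLM):
        super().__init__()
        self.proj = nn.Linear(d_enc, d_llm)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, d_enc) -> (batch, num_patches, d_llm)
        return self.proj(patch_feats)

# Stand-ins for the frozen vision encoder output and text embeddings:
patch_feats = torch.randn(1, 256, D_ENC)
text_embeds = torch.randn(1, 12, D_LLM)
visual_tokens = Connector()(patch_feats)
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # [visual || text]
print(llm_input.shape)  # torch.Size([1, 268, 2048])
```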
SigLIP vs CLIP. CLIP uses softmax cross-entropy over the N × N batch similarity matrix — every image's loss depends on all other texts in the batch (synchronisation cost at scale). SigLIP uses pairwise SIGMOID binary CE: L = −(1/n²) Σᵢⱼ [y_ij log σ(z_ij) + (1 − y_ij) log(1 − σ(z_ij))]. Each pair is independent → scales to arbitrarily large batch sizes without batch-wide softmax sync.
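A sketch of the SigLIP loss as given by the formula above, assuming L2-normalised embeddings and learnable scalars τ and b (the init values used below are illustrative):

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                tau: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise sigmoid BCE over all n^2 image-text pairs.

    Each pair's term is independent, so no batch-wide softmax
    normalisation (and no cross-device sync) is needed.
    """
    n = img_emb.shape[0]
    logits = tau * img_emb @ txt_emb.T + b           # z_ij
    labels = torch.eye(n, device=logits.device)      # y_ij = 1 iff matched pair
    # reduction="mean" averages over all n^2 terms -> the 1/n^2 factor
    return F.binary_cross_entropy_with_logits(logits, labels, reduction="mean")

img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(siglip_loss(img, txt, tau=torch.tensor(10.0), b=torch.tensor(-10.0)))
```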
Prefix-LM masking in PaliGemma. Sequence: [img_1 … img_N | BOS | t_1 … t_P (prompt) | SEP | s_1 … s_M (answer)]. Mask: image and prompt tokens attend BIDIRECTIONALLY (every token can see every other); suffix tokens attend CAUSALLY (only past). Why: the image + question is 'free context' the model should fully digest before generating; causal masking is still essential for autoregressive answer generation. Loss is computed ONLY on suffix tokens.
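A sketch of the mask construction (the boolean convention True = may attend is a choice made here for readability):

```python
import torch

def prefix_lm_mask(n_prefix: int, n_suffix: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend).

    Prefix = image + BOS + prompt tokens: fully bidirectional.
    Suffix = answer tokens: see the whole prefix plus earlier suffix tokens.
    """
    n = n_prefix + n_suffix
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :n_prefix] = True  # every token attends to the full prefix
    mask[n_prefix:, n_prefix:] = torch.tril(         # causal within the suffix
        torch.ones(n_suffix, n_suffix, dtype=torch.bool))
    return mask

print(prefix_lm_mask(3, 3).int())
# The training loss would then be masked to the suffix positions only.
```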
PaliGemma supports detection and segmentation purely through text. Vocabulary is extended with 1024 location tokens (⟨loc0000⟩ … ⟨loc1023⟩) encoding normalised [y_min, x_min, y_max, x_max] coords, plus 128 segmentation codewords (⟨seg000⟩ … ⟨seg127⟩) from a learned VQ-VAE codebook. Detection: 'detect person' → four ⟨loc….⟩ tokens (y_min, x_min, y_max, x_max) followed by the label 'person'. Segmentation: bbox + VQ codeword sequence decoded to a mask.
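A sketch of decoding a detection string back into pixel coordinates. The regex, the example bin indices, and the index/1023 bin convention are assumptions for illustration; the token order [y_min, x_min, y_max, x_max] follows the text above:

```python
import re

def decode_detection(text: str, width: int, height: int):
    """Parse e.g. '<loc0102><loc0303><loc0810><loc0911> person'."""
    bins = [int(m) for m in re.findall(r"<loc(\d{4})>", text)][:4]
    y0, x0, y1, x1 = (i / 1023 for i in bins)        # normalised coords
    label = re.sub(r"<loc\d{4}>", "", text).strip()
    return label, (x0 * width, y0 * height, x1 * width, y1 * height)

print(decode_detection("<loc0102><loc0303><loc0810><loc0911> person", 640, 480))
```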
Qwen2-VL fixes two PaliGemma limitations. Resolution bottleneck: PaliGemma encodes at fixed 224² → fine detail (a 3 mm nodule on a 2048² X-ray) becomes sub-pixel and is irrecoverably lost. Aspect ratio distortion: a 16:9 photo forced into 1:1 distorts spatial relationships. Solution: Dynamic Resolution — process at native resolution + aspect ratio; tile-count clamped by user-controlled N_min ≤ N ≤ N_max (explicit speed/quality knob).
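A sketch of the dynamic-resolution clamp. The 28-pixel token unit assumes patch 14 plus a 2×2 token merge; Qwen2-VL's exact rounding rules may differ:

```python
import math

def fit_resolution(h: int, w: int, n_min: int = 256, n_max: int = 1280,
                   unit: int = 28) -> tuple[int, int]:
    """Resolution near native whose token count lands in [n_min, n_max]."""
    tokens = (h / unit) * (w / unit)
    scale = 1.0
    if tokens > n_max:                 # too many tokens -> downscale
        scale = math.sqrt(n_max / tokens)
    elif tokens < n_min:               # too few -> upscale
        scale = math.sqrt(n_min / tokens)
    # Snap to the token grid while preserving the native aspect ratio.
    new_h = max(unit, round(h * scale / unit) * unit)
    new_w = max(unit, round(w * scale / unit) * unit)
    return new_h, new_w

print(fit_resolution(512, 768))     # in range -> kept near native size
print(fit_resolution(2048, 3072))   # clamped down toward n_max tokens
```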
Positional encoding for images and video. 1D RoPE breaks for images: patches at (row=2, col=3) and (row=3, col=2) get different raster indices and look 'far apart' to attention, even though spatially nearby. 2D-RoPE splits the head dimension into two halves: rotate first half by row, second half by column → attention depends on (Δrow, Δcol). M-RoPE (Qwen2-VL) splits into thirds for (t, r, c): full spatio-temporal positioning for video. For static images t=0; for text tokens t=r=c=token_index.
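A sketch of the M-RoPE angle computation for one token; real implementations apply these angles as interleaved sin/cos rotations to query/key pairs, which is omitted here:

```python
import torch

def mrope_angles(pos_trc: tuple[int, int, int], head_dim: int,
                 base: float = 10000.0) -> torch.Tensor:
    """Rotation angles for one token: head dim split into thirds,
    each third rotated by a different coordinate of (t, r, c)."""
    third = head_dim // 3
    # Standard RoPE frequencies, one per 2-dim rotation pair in each third.
    freqs = base ** (-torch.arange(0, third, 2).float() / third)
    return torch.cat([p * freqs for p in pos_trc])

# Video patch at (t=5, row=2, col=3); a text token at index m would use
# (m, m, m), and a static-image patch uses t = 0.
print(mrope_angles((5, 2, 3), head_dim=48))
```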
Gemma 4 (native multimodal) replaces the stitched architecture: vision and language share Transformer blocks from early layers — no connector bottleneck. Both modalities trained jointly from scratch. The 'modality gap' is no longer reconciled by a thin projection; the entire stack learns joint representations.
Definitions
- Modality gap — Text is discrete-vocabulary; images are continuous pixels; a VLM must align them into a shared latent space.
- Connector — Small (linear or MLP) projection from vision encoder output dim to LLM token dim; only randomly initialised component in stitched VLMs.
- SigLIP — Sigmoid CLIP — pairwise binary CE replaces softmax; scales without batch-wide sync.
- Prefix-LM mask — Bidirectional attention over [image + prompt]; causal over answer; loss only on answer.
- Dynamic resolution (Qwen2-VL) — Tile image at native aspect ratio and resolution; token count clamped to [N_min, N_max].
- M-RoPE — Multimodal RoPE — head dim split into thirds for (time, row, col) rotations; static images: t=0; text: all three=token index.
Formulas
L_{\text{CLIP}} = \tfrac{1}{2}(L_{i\to t} + L_{t\to i}) \quad \text{(symmetric softmax CE)}
L_{\text{SigLIP}} = -\tfrac{1}{n^2}\sum_{i,j}\Big[y_{ij}\log\sigma(z_{ij}) + (1-y_{ij})\log(1-\sigma(z_{ij}))\Big]
z_{ij} = \tau \cdot f_I(x_i)^\top f_T(t_j) + b
\text{M-RoPE position triple: } (t, r, c). \ \text{Static image: } t = 0.
Derivations
Why Prefix-LM is the best of both: causal masking on the entire sequence (decoder-only LM) limits image+prompt comprehension because patches can only see previously processed patches. Bidirectional on the prompt + causal on the answer gives encoder-like rich understanding of the input while preserving autoregressive generation.
1D RoPE breakdown for images: with raster index m = row · W + col, patches at (2, W−1) and (3, 0) sit at opposite ends of consecutive rows yet have m differing by 1 — they look adjacent — while the vertically adjacent (2, 0) and (3, 0) differ by W — they look far apart. The 1D ordering is incompatible with 2D geometry; see the arithmetic below.
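The mismatch in two lines of arithmetic (W = 16 as in a 224² grid at patch 14):

```python
W = 16                                        # patches per row (224 / 14)
raster = lambda r, c: r * W + c
print(abs(raster(2, W - 1) - raster(3, 0)))   # 1: opposite ends, "look adjacent"
print(abs(raster(2, 0) - raster(3, 0)))       # 16: vertical neighbours, "look far"
```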
Examples
- PaliGemma forward: 224² image → SigLIP-So400m → 256 patch features (1152-d) → linear → 256 patch tokens (2048-d) → concat with text → Gemma decoder → causal next-token loss on suffix.
- M-RoPE for video frame at (t=5, row=2, col=3): rotation applied to three thirds of head dim by 5·θ, 2·θ, 3·θ respectively.
- Qwen2-VL with N_min = 256, N_max = 1280: a 512×768 image at patch 14 with 2×2 token merging → ~500 visual tokens (within range, processed natively); a 2048×3072 image (~8000 merged tokens) → downscaled until the count is clamped to ≤ 1280.
Diagrams
- Three-pillar VLM: SigLIP (frozen) → linear connector (random init) → Gemma decoder. Annotate which is trained vs frozen.
- Prefix-LM attention mask: lower-triangular for suffix; full square for image+prefix.
- M-RoPE positions: video tensor → (t, r, c) per patch; head dim split in thirds.
Edge cases
- PaliGemma at 224²: fine details lost irrecoverably (downsampling).
- Aspect ratio distortion when resizing wide images to square — breaks horizontal scenes.
- Connector bottleneck: a single linear map applies one affine transformation; an MLP adds non-linearity but is still a thin bridge compared to native multimodal.
Common mistakes
- Saying 'CLIP and SigLIP have the same loss with different normalisation' — SigLIP is pairwise sigmoid BCE, fundamentally independent across pairs.
- Forgetting that loss is computed ONLY on suffix tokens in PaliGemma — image+prompt tokens don't contribute to loss.
- Treating M-RoPE as 'RoPE with three encodings stacked' — it splits head dim into thirds and rotates each by a different position.
- Claiming Qwen2-VL processes 224² fixed — it processes native resolution clamped by N_min/N_max.
Shortcuts
- Three pillars to recite: Vision Encoder + Connector + LLM.
- Connector is the only randomly initialised component in stitched VLMs.
- Prefix-LM: bidirectional on prompt, causal on answer, loss on answer only.
- M-RoPE = (t, r, c) triple, split head dim in thirds.