Video Architectures — 3D CNNs, Two-Stream, SlowFast, ViViT, TimeSformer
Intuition
Video ≠ images × T. The thing the model needs to capture is the TEMPORAL pattern — 'sit down' and 'stand up' are the same image set in reverse order. The architecture choices are about how to consume the time dimension efficiently.
Explanation
Video task families: trimmed action classification (single label per clip — Kinetics, UCF-101), temporal action localisation (start, end, class in untrimmed video), spatio-temporal localisation (per-frame boxes + actions — AVA), captioning, dense event captioning, text-to-video retrieval, long-form understanding.
Six action-recognition challenges (Prof's list): who is doing the action; when does it start; how long is it; actions vs interactions; what are the essential components; background scene importance (many models accidentally classify the scene — 'basketball court' → 'basketball').
3D convolution: input B × C_in × T_in × H × W; filter C_out × C_in × K_T × K_H × K_W. The temporal kernel K_T (typically 3) lets filters span consecutive frames → learns spatio-temporal features (motion). A 2D conv has no time dimension; the 3D version adds T to the input, the filters, and the output feature maps.
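A minimal PyTorch sketch of the shapes (the kernel/stride choices below are illustrative, not from any specific model):

```python
# Sketch: 3D conv over a video tensor; the shapes are the point here.
import torch
import torch.nn as nn

B, C_in, T, H, W = 2, 3, 16, 112, 112            # batch of 16-frame RGB clips
conv3d = nn.Conv3d(in_channels=C_in, out_channels=64,
                   kernel_size=(3, 7, 7),        # K_T=3 spans 3 consecutive frames
                   stride=(1, 2, 2), padding=(1, 3, 3))

x = torch.randn(B, C_in, T, H, W)
y = conv3d(x)
print(y.shape)  # torch.Size([2, 64, 16, 56, 56]); time preserved, space halved
```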
I3D = Inflated 3D ConvNet (Carreira & Zisserman). Problem: training a 3D CNN from scratch needs huge video datasets. Trick: take a 2D CNN pretrained on ImageNet, inflate every K×K filter to K_T×K×K by replicating it along the new time axis and dividing by K_T. A 'boring video' (a static image repeated over time) then produces the same activations as the 2D net on the original image → a sensible initialisation. Then fine-tune on Kinetics.
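A hedged sketch of the inflation trick; `inflate_conv` is a hypothetical helper written for illustration, not the official I3D code:

```python
# Sketch of I3D-style inflation: replicate a pretrained 2D kernel K_T times
# along a new time axis and divide by K_T so magnitudes stay matched.
import torch
import torch.nn as nn

def inflate_conv(conv2d: nn.Conv2d, K_T: int = 3) -> nn.Conv3d:
    # Assumes integer-tuple stride/padding on the 2D layer.
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(K_T, *conv2d.kernel_size),
                       stride=(1, *conv2d.stride),
                       padding=(K_T // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    # (C_out, C_in, K_H, K_W) -> (C_out, C_in, K_T, K_H, K_W), normalised by K_T
    w3d = conv2d.weight.data.unsqueeze(2).repeat(1, 1, K_T, 1, 1) / K_T
    conv3d.weight.data.copy_(w3d)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d
```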
Two-Stream networks (Simonyan & Zisserman, 2014). Spatial stream: single RGB frame → CNN → action class (what — appearance). Temporal stream: stack of L optical flow fields (e.g., L=10 → 2L=20 channels) → CNN → action class (how — motion). Late fusion: average the softmax outputs. Motion-only often outperforms appearance — many actions are defined by motion pattern.
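A sketch of the two-stream input layout and late fusion; the two tiny backbones below are placeholders standing in for the paper's CNNs:

```python
# Sketch: spatial stream sees one RGB frame, temporal stream sees 2L flow channels.
import torch
import torch.nn as nn

L = 10                                           # flow fields per clip
rgb = torch.randn(1, 3, 224, 224)                # one RGB frame (spatial stream)
flow = torch.randn(1, 2 * L, 224, 224)           # L stacked (u, v) fields -> 2L channels

spatial_net = nn.Sequential(nn.Conv2d(3, 64, 7, 2, 3), nn.AdaptiveAvgPool2d(1),
                            nn.Flatten(), nn.Linear(64, 400))
temporal_net = nn.Sequential(nn.Conv2d(2 * L, 64, 7, 2, 3), nn.AdaptiveAvgPool2d(1),
                             nn.Flatten(), nn.Linear(64, 400))

# Late fusion: average the two softmax distributions.
probs = (spatial_net(rgb).softmax(-1) + temporal_net(flow).softmax(-1)) / 2
```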
SlowFast (Feichtenhofer et al., 2019). Two parallel pathways on the same video at different frame rates: Slow pathway (low fps, many channels) for semantics; Fast pathway (high fps, few channels) for motion. Motivated by human visual processing (separate semantic/motion). Lateral connections fuse between pathways; the fast pathway is param-light because motion is low-dim.
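A toy sketch of the two pathways and one lateral connection, assuming a 64-frame clip, temporal strides 16 and 2, and the paper's channel ratio β = 1/8; the stem shapes are illustrative:

```python
# Sketch: slow pathway = few frames / many channels, fast = many frames / few channels.
import torch
import torch.nn as nn

video = torch.randn(1, 3, 64, 56, 56)            # 64-frame clip
alpha, beta = 8, 1 / 8                           # fast is 8x denser, 1/8 the channels

slow_in = video[:, :, ::16]                      # 4 frames  (semantics)
fast_in = video[:, :, ::2]                       # 32 frames (motion)

slow_stem = nn.Conv3d(3, 64, (1, 7, 7), (1, 2, 2), (0, 3, 3))  # no temporal kernel
fast_stem = nn.Conv3d(3, int(64 * beta), (5, 7, 7), (1, 2, 2), (2, 3, 3))

slow, fast = slow_stem(slow_in), fast_stem(fast_in)
# Lateral connection: a time-strided conv maps fast features onto the slow clock,
# then they are concatenated along channels.
lateral = nn.Conv3d(8, 16, (5, 1, 1), stride=(alpha, 1, 1), padding=(2, 0, 0))
slow = torch.cat([slow, lateral(fast)], dim=1)   # (1, 64+16, 4, 28, 28)
```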
ViViT (Arnab et al., 2021) — Video ViT. Two token-extraction strategies: (1) Uniform frame sampling — sample T frames, treat each as ViT input, concat all patch sequences; (2) Tubelet embedding — divide the video into 3D tubelets t × h × w (e.g., 2 × 16 × 16) and linearly project each tubelet into a token (carries spatio-temporal info from the start). Then standard transformer encoder. Challenge: naive global attention is O((TN)²) → use factorised variants.
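Tubelet embedding reduces to a 3D conv whose kernel equals its stride; a minimal sketch with the 2 × 16 × 16 tubelets from above:

```python
# Sketch: each non-overlapping 2x16x16 tubelet becomes one d-dim token.
import torch
import torch.nn as nn

t, h, w, d = 2, 16, 16, 768                      # tubelet size and token dim
embed = nn.Conv3d(3, d, kernel_size=(t, h, w), stride=(t, h, w))

video = torch.randn(1, 3, 16, 224, 224)          # 16 frames of 224x224 RGB
tokens = embed(video)                            # (1, 768, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)       # (1, 8*14*14 = 1568, 768)
```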
TimeSformer (Bertasius et al., 2021). 'Where to attend?' Variants tested: space-only (per frame); joint space-time (every token attends to every other, O((TN)²)); divided space-time (temporal attention first, where each spatial location attends across time, then spatial attention within each frame); and sparse local-global. Divided won: far cheaper than joint attention (quadratic in T and N separately, not jointly) with the best accuracy/efficiency trade-off.
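A minimal sketch of one divided space-time block; the class token, LayerNorms, and MLP are omitted, and `nn.MultiheadAttention` stands in for the paper's attention implementation:

```python
# Sketch: temporal attention over T per spatial location, then spatial over N per frame.
import torch
import torch.nn as nn

B, T, N, d = 2, 8, 196, 768
x = torch.randn(B, T, N, d)                      # tokens as (frames, patches, dim)

time_attn = nn.MultiheadAttention(d, num_heads=12, batch_first=True)
space_attn = nn.MultiheadAttention(d, num_heads=12, batch_first=True)

# Temporal attention: each spatial location attends across the T frames.
xt = x.permute(0, 2, 1, 3).reshape(B * N, T, d)
xt = xt + time_attn(xt, xt, xt)[0]
x = xt.reshape(B, N, T, d).permute(0, 2, 1, 3)

# Spatial attention: each frame's N patches attend to each other.
xs = x.reshape(B * T, N, d)
xs = xs + space_attn(xs, xs, xs)[0]
x = xs.reshape(B, T, N, d)
```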
Definitions
- I3D — Inflated 3D ConvNet. Take a 2D ImageNet CNN, inflate kernels K×K → K_T×K×K along time, divide by K_T, fine-tune on video.
- Optical flow — Per-pixel 2D vector (u, v) describing motion between consecutive frames. Captures pure motion, no appearance.
- Two-Stream — Spatial stream (RGB, appearance) + temporal stream (stacked optical flow, motion) with late fusion.
- SlowFast — Two parallel pathways at different fps; slow = semantics, fast = motion; lateral connections fuse.
- Tubelet embedding — ViViT token construction by projecting 3D tubelets (t × h × w) instead of 2D patches.
- Divided space-time attention — TimeSformer's factorisation: temporal attention first (per spatial location across time), then spatial attention (per frame).
Formulas
\text{3D conv: input } B \times C_{in} \times T \times H \times W,\ \text{filter } C_{out} \times C_{in} \times K_T \times K_H \times K_W
\text{I3D inflation: } W_{3D}(t, h, w) = W_{2D}(h, w) / K_T
\text{Joint space-time attention cost: } O((TN)^2 \cdot d)
\text{Two-Stream temporal channels: } 2L\ \text{(one } (u, v) \text{ pair per flow field, } L \text{ frames)}
Derivations
Why divide by K_T in I3D inflation: a 2D conv output is Σ_{h,w} W_{2D}(h,w) · x(h,w). On a static ('boring') video, x(t,h,w) = x(h,w) for all t, so the inflated 3D version with K_T replications outputs Σ_t Σ_{h,w} W_{2D}(h,w) · x(h,w) = K_T · (the 2D activation). Dividing the weights by K_T matches the magnitudes, so the inflated net reproduces the 2D net's activations on each frame.
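A quick numeric check of this derivation (a sketch with arbitrary small shapes): on a static clip, interior output frames of the inflated filter match the 2D activations; boundary frames see zero temporal padding, so they differ.

```python
# Verify: inflated filter / K_T reproduces the 2D activation on a 'boring' video.
import torch
import torch.nn as nn

conv2d = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
K_T = 3
conv3d = nn.Conv3d(3, 8, kernel_size=(K_T, 3, 3), padding=(1, 1, 1), bias=False)
conv3d.weight.data = conv2d.weight.data.unsqueeze(2).repeat(1, 1, K_T, 1, 1) / K_T

img = torch.randn(1, 3, 32, 32)
clip = img.unsqueeze(2).repeat(1, 1, 8, 1, 1)    # static 8-frame video of one image

y2d = conv2d(img)                                # (1, 8, 32, 32)
y3d = conv3d(clip)                               # (1, 8, 8, 32, 32)
print(torch.allclose(y3d[:, :, 4], y2d, atol=1e-5))  # True for interior frames
```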
Examples
- I3D pretrained on ImageNet + Kinetics: ~75% top-1 on Kinetics-400 with 32-frame clips.
- Two-Stream temporal channels for L=10 flow fields: 2·10 = 20 input channels to the temporal CNN.
- TimeSformer with divided attention on 8 × 14×14 = 1568 tokens: temporal attention costs O(T² · N · d), spatial attention costs O(T · N² · d); each is quadratic in one factor and linear in the other (arithmetic check below).
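The same numbers as one-line arithmetic, counting pairwise attention scores and dropping the shared factor of d:

```python
# Worked cost comparison for T=8 frames, N=14*14=196 patches.
T, N = 8, 14 * 14
joint = (T * N) ** 2            # every token attends to every token
divided = T**2 * N + N**2 * T   # temporal pass + spatial pass
print(joint, divided, joint / divided)   # 2458624 319872 -> about 7.7x fewer scores
```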
Diagrams
- I3D inflation: 2D filter K×K → 3D filter K×K×K via replication along time axis, with normalisation by K_T.
- Two-Stream architecture: spatial stream (RGB) and temporal stream (stacked optical flow) → late fusion.
- SlowFast pathways: slow (4 fps, high channels) and fast (32 fps, low channels) with lateral connections.
- ViViT tubelet embedding: 3D tubelet → linear projection → spatio-temporal token.
Edge cases
- Background-only models can score high on Kinetics — control via temporal-shuffling sanity checks.
- Optical flow computation is expensive; modern methods predict flow in the network (RAFT).
- Long videos: full attention across frames scales quadratically with video length (O(T²)), which is infeasible; use temporal subsampling or memory tokens.
Common mistakes
- Treating video as just stacked images — temporal patterns are lost.
- Confusing C3D, I3D, and SlowFast — I3D is INFLATION of 2D CNN; C3D is 3D conv from scratch; SlowFast is two pathways at different fps.
- Forgetting that two-stream's temporal stream has 2L channels, not L.
Shortcuts
- I3D = inflation trick. Bridges image pretraining to video.
- SlowFast: slow = semantics, fast = motion. Lateral fusion.
- TimeSformer winner: divided space-time attention.