Computer Vision
CSE471 • Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits
Unit 11 — Multimodal LLMs (PaliGemma / Qwen2-VL / Gemma 4)
Vision-Language Models: the 3-pillar blueprint, Prefix-LM masking, SigLIP vs CLIP loss, dynamic resolution + M-RoPE for video, and the move toward native multimodal models.