
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

ML — Logistic, NN+Backprop, Ensembles, Density, RNN, Metrics, kNN, Regression, PCA/SVD, Clustering

Unit 3 — Machine Learning Recap

What ML Brings To The Vision Party

Every modern computer vision system has ML inside it. Sometimes obviously (a CNN is a stack of learnable filters), sometimes subtly (a softmax head is a multinomial logistic regression; a feature normalisation step is implicit feature engineering). The ML toolbox is the universal language of the modern field — when you see "logistic regression" inside a detection head, it's the exact same logistic regression you'd write for a binary tabular classifier.

The course assumes you've seen most of this; this unit is the rapid 90-minute review.

The Three Task Families

Supervised: given pairs $(x, y)$, learn $f : x \mapsto y$. *Classification* (discrete $y$) and *regression* (continuous $y$) are the two main flavours.

Unsupervised: given only $x$, find structure. Clustering, density estimation, dimensionality reduction.

Reinforcement: agent takes actions, gets rewards. Not the focus of this course.

The Splits Nobody Should Mess Up

Train fits weights. Val tunes hyperparameters (number of layers, learning rate, regularisation strength, augmentation policy). Test reports final performance, exactly once. If you tune on test, test becomes a second val and you have no unbiased estimate of generalisation. This rule is sacred — exam graders will dock marks for "we tuned the model based on the test set's performance".
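The split discipline above can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `train_val_test_split` and the 70/15/15 proportions are hypothetical choices, not something the course prescribes.

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Return three disjoint index arrays that together cover range(n_samples)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # shuffle once, then slice
    n_test = int(round(test_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    test_idx = idx[:n_test]                   # touched exactly once, at the very end
    val_idx = idx[n_test:n_test + n_val]      # used for hyperparameter choices
    train_idx = idx[n_test + n_val:]          # used to fit weights
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(1000)
```

Because the three index sets are disjoint, nothing tuned on `val_idx` can leak into the number reported on `test_idx`.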

A related warning: the curse of dimensionality. In high dimensions, distances concentrate (the nearest and farthest neighbours end up almost equally far away), volumes blow up, and methods like kNN or kernel density estimation need exponentially more data to fill the space.
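Distance concentration is easy to see numerically. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative gap between the farthest and nearest neighbour of a random query; `distance_contrast` is an illustrative name.

```python
import numpy as np

def distance_contrast(dim, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))      # uniform in the unit hypercube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast_low = distance_contrast(2)       # large: nearest is much closer than farthest
contrast_high = distance_contrast(1000)   # small: all points are roughly equally far
```

When the contrast shrinks like this, "nearest" carries little information, which is why kNN degrades in raw high-dimensional spaces.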

Logistic Regression — The Canonical Classifier

For binary classification, the simplest model that respects "output should be a probability" is logistic regression: $\hat{y} = \sigma(w^\top x + b)$, where $\sigma(z) = 1/(1 + e^{-z})$ squashes any real number to $(0, 1)$.

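Train with binary cross-entropy: $L = -[\,y \log \hat{y} + (1 - y) \log(1 - \hat{y})\,]$. Not MSE. Two reasons: (a) MSE on top of a sigmoid is non-convex in the weights, so gradient descent can stall in flat regions or poor local minima. (b) The MSE gradient carries a factor $\sigma'(z)$ that vanishes at saturation; CE has the clean form $\partial L / \partial z = \hat{y} - y$ (with $z = w^\top x + b$) because the sigmoid derivative cancels.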
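A minimal NumPy sketch of logistic regression trained with binary cross-entropy, using the $\hat{y} - y$ gradient form. The toy data, learning rate, and helper names (`bce_loss`, `bce_grad`) are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, X, y):
    p = sigmoid(X @ w + b)
    eps = 1e-12                                # guards log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def bce_grad(w, b, X, y):
    err = sigmoid(X @ w + b) - y               # dL/dz = y_hat - y, no sigma'(z) factor
    return X.T @ err / len(y), err.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # linearly separable toy labels

w, b = np.zeros(2), 0.0
initial_loss = bce_loss(w, b, X, y)
for _ in range(500):                           # plain full-batch gradient descent
    gw, gb = bce_grad(w, b, X, y)
    w -= 0.5 * gw
    b -= 0.5 * gb
final_loss = bce_loss(w, b, X, y)
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
```

Zero initialisation is harmless here because there is a single output unit and the loss is convex; the symmetry problem only appears once hidden layers are involved.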

The decision boundary is linear: $w^\top x + b = 0$. So logistic regression can only separate classes that are linearly separable in feature space. (Apply a non-linear feature map first; that's what kernel methods and neural networks do.)

For $K$ classes, generalise to softmax: $p_k = e^{z_k} / \sum_{j=1}^{K} e^{z_j}$, loss $L = -\sum_{k} y_k \log p_k$ with one-hot $y$. This is what sits on top of every CNN and ViT: the final "MLP head + softmax" is multinomial logistic regression.
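A short sketch of softmax with cross-entropy, assuming NumPy and integer class labels. Subtracting the row maximum before exponentiating is a standard stability trick; it does not change the result.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)   # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean of -log p[true class] over the batch."""
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))

logits = np.array([[2.0, 1.0, 0.1], [1000.0, 1000.0, 1000.0]])
labels = np.array([0, 2])
probs = softmax(logits)
loss = cross_entropy(logits, labels)
```

With two classes and logits $(z, 0)$, the first softmax output equals $\sigma(z)$, which is the sense in which softmax generalises the sigmoid.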

Neural Networks — Just Layered Logistic Regression

A neuron is $a = g(w^\top x + b)$ for some activation $g$. Stack into layers. Activations: sigmoid (saturates at both ends, vanishing gradient), tanh (zero-centred but still saturates), ReLU (fast, sparse, gradient = 1 in the positive half, can "die" if its input is always negative). Leaky ReLU and GELU mitigate the dying problem by keeping a non-zero gradient for negative inputs.

Backpropagation is just the chain rule walked backwards through the network. Modern frameworks (PyTorch, TF) build a computation graph at forward time and run autodiff at backward time. You will be asked to compute backprop by hand on a 2-layer MLP — practise this once, it's mostly bookkeeping.
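To practise the bookkeeping, here is a hand-written backward pass for a 2-layer MLP (ReLU hidden layer, sigmoid output, cross-entropy loss) checked against central differences. The layer sizes and function names are arbitrary assumptions for the sketch.

```python
import numpy as np

def forward(params, x, y):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h = np.maximum(z1, 0.0)                    # ReLU
    z2 = W2 @ h + b2                           # single logit, shape (1,)
    p = 1.0 / (1.0 + np.exp(-z2))
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p)).item()
    return loss, (x, z1, h, p)

def backward(params, cache, y):
    W1, b1, W2, b2 = params
    x, z1, h, p = cache
    dz2 = p - y                                # sigmoid + CE collapse to y_hat - y
    dW2, db2 = np.outer(dz2, h), dz2
    dh = W2.T @ dz2                            # chain rule into the hidden layer
    dz1 = dh * (z1 > 0)                        # ReLU gradient is 1 where z1 > 0
    dW1, db1 = np.outer(dz1, x), dz1
    return dW1, db1, dW2, db2

def numerical_grad(params, x, y, index, eps=1e-6):
    """Central-difference gradient with respect to params[index]."""
    target = params[index]
    grad = np.zeros_like(target)
    for i in np.ndindex(target.shape):
        old = target[i]
        target[i] = old + eps
        loss_plus, _ = forward(params, x, y)
        target[i] = old - eps
        loss_minus, _ = forward(params, x, y)
        target[i] = old
        grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
params = [rng.normal(scale=0.5, size=(4, 3)), rng.normal(scale=0.5, size=4),
          rng.normal(scale=0.5, size=(1, 4)), rng.normal(scale=0.5, size=1)]
x, y = rng.normal(size=3), 1.0
loss, cache = forward(params, x, y)
analytic = backward(params, cache, y)
max_error = max(np.abs(analytic[i] - numerical_grad(params, x, y, i)).max()
                for i in range(4))
```

A small `max_error` indicates the hand derivation agrees with the finite-difference estimate; frameworks such as PyTorch automate exactly this backward walk.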

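Initialisation matters. Don't init to zero: all neurons in a layer would compute the same output, receive the same gradient, and stay identical forever (symmetry). Use Xavier/Glorot ($\mathrm{Var}(w) = 2/(n_{\text{in}} + n_{\text{out}})$, which is $1/n$ when fan-in and fan-out are equal) for sigmoid/tanh and Kaiming/He ($\mathrm{Var}(w) = 2/n_{\text{in}}$) for ReLU. The factor of 2 in Kaiming compensates for the half-zeroed activations.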
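The effect of the factor of 2 can be checked by pushing random data through a deep ReLU stack. This sketch assumes square layers (so the Xavier variance reduces to $1/n$), no biases, and Gaussian weights; the depth and width are arbitrary.

```python
import numpy as np

def final_activation_std(init_std_fn, n_layers=20, width=512, seed=0):
    """Std of the activations after n_layers of linear + ReLU with the given init."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(256, width))
    for _ in range(n_layers):
        W = rng.normal(scale=init_std_fn(width), size=(width, width))
        h = np.maximum(h @ W, 0.0)             # ReLU zeroes about half the units
    return h.std()

kaiming_std = final_activation_std(lambda n_in: np.sqrt(2.0 / n_in))
xavier_std = final_activation_std(lambda n_in: np.sqrt(1.0 / n_in))
```

Under these assumptions the Kaiming scale keeps activations at a usable magnitude, while the $1/n$ scale roughly halves the second moment per ReLU layer and the signal fades with depth.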

Six Tricks That Speed Up NN Training

(1) Batch Normalisation: normalise activations to zero mean and unit variance per mini-batch, then apply a learnable scale and shift. Faster training, lets you use higher learning rates, less init-sensitive, slight regularisation. (2) Higher LR (BN enables this). (3) Dropout: randomly zero a fraction $p$ of activations during training ($p \approx 0.5$ in FC layers, $\approx 0.1$–$0.3$ in CNNs); at inference, turn it off. Acts as ensemble + regulariser. (4) Shuffle data between epochs. (5) Less L2 when BN is present. (6) LR decay / scheduling: start higher, decay over epochs.
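A training-time sketch of the first and third tricks, assuming NumPy. The batch-norm function omits the running statistics a real layer keeps for inference, and the dropout uses the common "inverted" scaling by $1/(1-p)$ so that nothing needs rescaling at test time; both simplifications are assumptions of this sketch.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Normalise each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero a fraction p of units and rescale the survivors."""
    if not training or p == 0.0:
        return x                               # inference: dropout is switched off
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
bn_out = batchnorm_train(x, gamma=np.ones(8), beta=np.zeros(8))
drop_out = dropout(np.ones((1000, 100)), p=0.5, rng=rng)
```

With `gamma=1` and `beta=0` the output has per-feature mean 0 and standard deviation close to 1, and the dropout output keeps the same expected value as its input.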

Ensembles — Trade Variance For Bias Or Vice Versa

Bagging = Bootstrap Aggregating. Sample with replacement, train independent models, average predictions. Reduces VARIANCE. Random Forest is the canonical example.

Boosting = Sequential. Each model focuses on the previous model's errors via reweighting. Reduces BIAS. AdaBoost is textbook; Viola-Jones face detection (2001) used AdaBoost over Haar features to make real-time face detection practical.
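The two sampling schemes can be contrasted in a few lines. This is a sketch assuming NumPy and labels in $\{-1, +1\}$; `bootstrap_indices` and `adaboost_reweight` are illustrative names, and the reweighting follows the standard discrete AdaBoost rule $\alpha = \tfrac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$.

```python
import numpy as np

def bootstrap_indices(n, rng):
    """Bagging: draw n indices with replacement (each model sees its own resample)."""
    return rng.integers(0, n, size=n)

def adaboost_reweight(weights, y_true, y_pred):
    """Boosting: one AdaBoost round, returning (alpha, normalised new weights)."""
    miss = y_true != y_pred
    err = weights[miss].sum() / weights.sum()          # weighted error of this learner
    alpha = 0.5 * np.log((1.0 - err) / err)            # learner's vote strength
    new_w = weights * np.exp(-alpha * y_true * y_pred) # up-weight the mistakes
    return alpha, new_w / new_w.sum()

rng = np.random.default_rng(0)
sample = bootstrap_indices(10_000, rng)
unique_fraction = len(np.unique(sample)) / 10_000      # about 1 - 1/e of the data

y_true = np.array([1, 1, 1, -1, -1, -1, 1, -1, 1, -1])
y_pred = np.array([1, 1, -1, -1, -1, 1, 1, -1, 1, -1]) # 2 mistakes out of 10
alpha, new_weights = adaboost_reweight(np.full(10, 0.1), y_true, y_pred)
```

After reweighting, the misclassified points carry half of the total weight, so the next learner cannot do well by repeating the previous one's predictions.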

Mix-up: confusing bagging and boosting is a guaranteed mark loss.

Density Estimation And GMM

To model $p(x)$ from data: histograms are bin-dependent and break in high dimension $d$. KDE puts a Gaussian on every point. GMM is the parametric workhorse: $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$.

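Train with EM. *E-step*: compute responsibilities $\gamma_{ik} = \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) / \sum_{j} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$. *M-step*: re-estimate $\pi_k, \mu_k, \Sigma_k$ by weighted MLE. Iterate. The likelihood never decreases from one iteration to the next, but you'll only reach a local maximum, so init matters.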
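A minimal EM loop for a one-dimensional GMM, assuming NumPy. It initialises the means from random data points and tracks the log-likelihood so the monotone behaviour can be inspected; the function names and toy data are assumptions, and there is no safeguard against a collapsing variance.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] under the current parameters
        weighted = pi * gaussian_pdf(x[:, None], mu, var)      # shape (n, k)
        total = weighted.sum(axis=1, keepdims=True)
        resp = weighted / total
        log_liks.append(np.log(total).sum())
        # M-step: responsibility-weighted maximum likelihood estimates
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var, np.array(log_liks)

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(4.0, 1.5, 700)])
pi, mu, var, log_liks = em_gmm_1d(x)
```

Plotting `log_liks` shows a non-decreasing curve; rerunning with a different `seed` may land on a different local maximum.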

RNNs — Sequences

$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b)$. The same $W_{hh}$ and $W_{xh}$ are shared across all time steps. Train with BPTT (back-propagation through time).

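The problem: $\partial h_T / \partial h_t = \prod_{k=t+1}^{T} \partial h_k / \partial h_{k-1}$, a product of Jacobians that each contain $W_{hh}$, which either vanishes or explodes after many time steps. LSTM introduces a *cell state* that's updated additively: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$. The three gates (forget $f_t$, input $i_t$, output $o_t$) are sigmoids; $\tilde{c}_t$ is a tanh proposal. The additive path is the constant error carousel: gradient can flow back across many timesteps without vanishing.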
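One LSTM step written out in NumPy. The stacked weight layout (forget, input, output, candidate) and the sizes are assumptions of this sketch; libraries order the gates differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W has shape (4*H, D+H); rows are stacked as [forget, input, output, candidate]."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[:H])                         # forget gate
    i = sigmoid(z[H:2 * H])                    # input gate
    o = sigmoid(z[2 * H:3 * H])                # output gate
    c_tilde = np.tanh(z[3 * H:])               # tanh proposal
    c_t = f * c_prev + i * c_tilde             # additive cell-state update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

H, D = 4, 3
rng = np.random.default_rng(0)
x_t, h_prev, c_prev = rng.normal(size=D), rng.normal(size=H), rng.normal(size=H)
W, b = rng.normal(scale=0.1, size=(4 * H, D + H)), np.zeros(4 * H)
h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, b)

# Forget gate saturated open, input gate shut: the cell state is copied through.
b_copy = np.concatenate([np.full(H, 20.0), np.full(H, -20.0), np.zeros(2 * H)])
_, c_copied = lstm_step(x_t, h_prev, c_prev, np.zeros_like(W), b_copy)
```

The copy-through case is the carousel in its purest form: $\partial c_t / \partial c_{t-1} = f_t \approx 1$, so the gradient along the cell path is passed back almost unchanged.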

GRU is the simpler variant: one update gate (merges forget + input) and one reset gate. No separate cell state. Often as good as LSTM, fewer parameters.

For exploding gradients (a different problem the gating doesn't fix), use gradient clipping.
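Clipping by global norm can be sketched as below, assuming NumPy and gradients held as a list of arrays. `clip_by_global_norm` is an illustrative name; deep learning frameworks ship their own equivalents.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))   # never scales up
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]      # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

Because every array is multiplied by the same factor, the update direction is preserved and only its length is capped.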

RNN I/O shapes: 1-to-1 (image classification), 1-to-many (image captioning: CNN encoder → RNN decoder), many-to-1 (video classification, sentiment), many-to-many synced (per-frame labels), many-to-many shifted (translation, video captioning).

Metrics — The Bit Students Always Lose Marks On

Confusion matrix: TP, FP, TN, FN. From these:

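Precision = $\text{TP} / (\text{TP} + \text{FP})$: of what you retrieved, how much was right. Monitor when false positives are costly (a spam filter flagging real mail; a wrongful criminal conviction).

Recall = TPR = $\text{TP} / (\text{TP} + \text{FN})$: of what was right, how much you retrieved. Monitor when false negatives are costly (cancer screening missing a tumour; safety inspections missing a fault).

F1 = $2PR / (P + R)$, the harmonic mean of precision $P$ and recall $R$. Punishes one-sided extremes: retrieving a single sure hit gives near-perfect precision but near-zero recall, and F1 stays near zero.

Specificity (TNR) = $\text{TN} / (\text{TN} + \text{FP})$. FPR = $1 - \text{TNR} = \text{FP} / (\text{FP} + \text{TN})$.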
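The definitions above in plain Python, applied to a made-up imbalanced confusion matrix (13 positives among 1000 samples). The counts are invented for illustration, and the function assumes every denominator is non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                    # also called TPR or sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)               # TNR
    fpr = fp / (fp + tn)                       # equals 1 - specificity
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1,
            "specificity": specificity, "fpr": fpr, "accuracy": accuracy}

m = classification_metrics(tp=8, fp=2, tn=985, fn=5)
```

Here accuracy is 0.993 and specificity is about 0.998, yet recall is only about 0.62: the abundant true negatives hide the missed positives.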

Average Precision = area under the PR curve for one class. mAP = mean over classes (object detection) or queries (retrieval).
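One common way to compute AP from a ranked list is to average the precision at each rank where a positive appears. The sketch below uses this non-interpolated variant (detection benchmarks often use interpolated versions instead) and assumes NumPy, distinct scores, and at least one positive.

```python
import numpy as np

def average_precision(scores, labels):
    """Mean of precision@k taken at the ranks k where a positive is retrieved."""
    order = np.argsort(-np.asarray(scores))            # highest score first
    hits = np.asarray(labels)[order]
    cum_hits = np.cumsum(hits)
    precision_at_k = cum_hits / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

# Ranked relevance 1, 0, 1, 1, 0 gives precisions 1/1, 2/3, 3/4 at the hits.
ap = average_precision([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 1, 0])
```

mAP is then the mean of this quantity over classes or queries.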

ROC vs PR. ROC plots TPR vs FPR. PR plots Precision vs Recall. For balanced data they tell similar stories. For imbalanced data with rare positives, ROC looks deceptively flattering (the abundant TNs inflate the AUC); PR is preferred — precision directly tracks how often you're right on the rare class.

kNN And Regression

kNN. Lazy. No training. To predict for a query $x$, find its $k$ nearest training points and vote. Small $k$ → high variance (noisy boundary, overfits). Large $k$ → high bias (smooths over real structure, underfits). Use odd $k$ to avoid ties in binary problems. Distances: L2, L1, cosine, Mahalanobis. Curse of dimensionality bites hard.
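A brute-force kNN classifier in NumPy, as a sketch. It assumes L2 distance and non-negative integer labels (required by `np.bincount`); the toy points are invented.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Majority vote among the k nearest training points under L2 distance."""
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]         # indices of the k closest
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
preds = knn_predict(X_train, y_train, np.array([[0.5, 0.5], [5.5, 5.5]]), k=3)
```

All the work happens at query time, which is what "lazy" means: there is no fitting step, only stored data and a distance computation.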

Linear regression. $\hat{y} = w^\top x + b$. Least squares minimises SSE. Closed-form: $w = (X^\top X)^{-1} X^\top y$ (normal equations, with the bias absorbed into $w$ via a column of ones in $X$). Polynomial regression is still linear regression: linear in the coefficients $w$, just with polynomial features. Truly non-linear models (logistic, NN) need iterative methods.
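Polynomial regression as linear least squares, sketched in NumPy. The quadratic target is invented and noise-free so the recovered coefficients can be read off directly; in practice `np.linalg.lstsq` is usually preferred to forming $X^\top X$ explicitly.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x ** 2               # noise-free quadratic target

# Polynomial features: columns 1, x, x^2. The model is still linear in w.
X_poly = np.vander(x, N=3, increasing=True)

# Normal equations: solve (X^T X) w = X^T y rather than inverting explicitly.
w_normal = np.linalg.solve(X_poly.T @ X_poly, X_poly.T @ y)

# Same least-squares problem via an SVD-based solver.
w_lstsq, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
```

Both routes recover the coefficients $(1, 2, -3)$ up to floating-point error.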

L1 vs L2 regularisation. L1 (Lasso) → exact zeros → feature selection. L2 (Ridge) → smooth shrinkage, no exact zeros.
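The "exact zeros" behaviour is easiest to see in the special case of an orthonormal design, where both penalised solutions have closed forms in terms of the unregularised coefficients. That special case, the objective scalings ($\tfrac12\|y - Xw\|^2 + \lambda\|w\|_1$ and $\tfrac12\|y - Xw\|^2 + \tfrac{\lambda}{2}\|w\|_2^2$), and the example numbers are assumptions of this sketch.

```python
import numpy as np

def lasso_orthonormal(w_ols, lam):
    """Soft-thresholding: shrink by lam and clamp small coefficients to zero."""
    return np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

def ridge_orthonormal(w_ols, lam):
    """Uniform shrinkage: every coefficient is scaled by 1 / (1 + lam)."""
    return w_ols / (1.0 + lam)

w_ols = np.array([3.0, -0.5, 0.2, -2.0])
w_l1 = lasso_orthonormal(w_ols, lam=1.0)       # small entries become exactly 0
w_l2 = ridge_orthonormal(w_ols, lam=1.0)       # all entries shrink, none vanish
```

The L1 solution drops the two small coefficients entirely, which is the feature-selection effect; the L2 solution halves everything and keeps all four.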

PCA / SVD — The Dimensionality Toolkit

PCA finds the orthogonal directions of maximum variance. Algorithm: centre the data; compute covariance $C = \frac{1}{n-1} X^\top X$; eigendecompose $C = V \Lambda V^\top$; keep the top-$k$ eigenvectors; project. $\lambda_i$ = variance along $v_i$. The top-$k$ projection captures the most variance among all $k$-dimensional linear projections.

SVD decomposes any matrix: $X = U \Sigma V^\top$. For centred $X$, the right singular vectors (columns of $V$) are the PCA components (eigenvectors of $X^\top X$). Numerically stable (no need to form $X^\top X$, which squares the condition number). Best rank-$k$ approximation in Frobenius norm is $X_k = U_k \Sigma_k V_k^\top$ (Eckart–Young).
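Both routes to PCA can be compared directly in NumPy. The random data below is an assumption for the sketch; eigenvectors are only defined up to sign, so the comparison should use absolute dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features
Xc = X - X.mean(axis=0)                                   # centre first
n = len(Xc)

# Route 1: eigendecompose the covariance matrix.
C = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)                      # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centred data, without forming X^T X.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vars = S ** 2 / (n - 1)                               # variances along components

k = 2
Z = Xc @ Vt[:k].T                                         # top-k projection
X_k = (U[:, :k] * S[:k]) @ Vt[:k]                         # best rank-k approximation
rank_k_error = np.linalg.norm(Xc - X_k, "fro")
```

By Eckart–Young, `rank_k_error` equals the square root of the sum of the discarded squared singular values.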

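LoRA = Low-Rank Adaptation: write the weight update as $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ and $r \ll \min(d, k)$. Train only $A$ and $B$ with the pretrained $W$ frozen, which saves vast amounts of memory and compute when fine-tuning huge LLMs. Spiritually the same trick as SVD/PCA: low-rank structure compresses information.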
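A forward-pass sketch of a LoRA-adapted linear layer in NumPy, with a parameter count. The sizes, the random stand-in for the pretrained weight, and the helper name `lora_forward` are assumptions; initialising $B$ to zero so that the update starts at zero follows the usual LoRA convention.

```python
import numpy as np

d, k, r = 1024, 1024, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))                # frozen pretrained weight (random stand-in)
B = np.zeros((d, r))                       # trainable, zero init: Delta W starts at 0
A = rng.normal(scale=0.01, size=(r, k))    # trainable

def lora_forward(x):
    """Compute x (W + BA)^T without ever materialising the d x k matrix BA."""
    return x @ W.T + (x @ A.T) @ B.T

x = rng.normal(size=(4, k))
out = lora_forward(x)

full_params = d * k                        # what full fine-tuning would update
lora_params = r * (d + k)                  # what LoRA updates
```

With these sizes LoRA trains 16,384 parameters instead of 1,048,576, a 64-fold reduction for this one layer.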

Eigenfaces apply PCA to face images: top eigenvectors look like ghostly average faces; represent any face as a combination of them, classify by nearest neighbour in the low-dim space.

Clustering — Unsupervised Grouping

k-means. Init centres; assign each point to its nearest centre; recompute each centre as the mean of its assigned points; iterate until assignments stop changing. Four issues to memorise: HARD assignments, SPHERICAL bias, INIT sensitivity, OUTLIER sensitivity. Fixes: k-means++ (smart init that spreads centres), GMM (soft assignments + covariance shapes), k-medoids (each centre is an actual data point, the medoid, which makes it more robust to outliers).
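The assign/recompute loop (Lloyd's algorithm) in NumPy, as a sketch. It uses plain random initialisation rather than k-means++, and the two-blob data is invented.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # random init
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre (hard assignment).
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre becomes the mean of its points
        # (an empty cluster keeps its old centre).
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(5.0, 0.5, size=(100, 2))])
centres, labels = kmeans(X, k=2)
```

On well-separated blobs this converges in a few iterations; on harder data, different seeds can give different clusterings, which is the init sensitivity listed above.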

Hierarchical clustering. Agglomerative (bottom-up): each point starts as a cluster, merge closest pairs. Divisive (top-down): one cluster, recursively split. Output: a dendrogram. Cut at different heights → different numbers of clusters. Linkage: single (closest pair — can chain), complete (farthest pair — compact), average, Ward (variance-minimising).

What You Walk In Carrying

Train/Val/Test discipline. Logistic regression: sigmoid + CE, SGD update $w \leftarrow w - \eta (\hat{y} - y) x$, why not MSE. NN basics: layers, activations, backprop, BatchNorm, dropout, Kaiming init, why no zero init. Six speed-up tricks. Bagging vs Boosting (RF vs AdaBoost / Viola-Jones). GMM and EM (E-step responsibilities, M-step weighted MLE, monotone improvement). RNN: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b)$; LSTM 3 gates + additive cell state; GRU 2 gates; BPTT + clipping. Metrics: confusion matrix, P/R/F1, AP/mAP, ROC vs PR (PR for imbalanced). kNN: lazy, $k$ odd, variance/bias trade-off, curse of dim. Linear regression: normal equations $w = (X^\top X)^{-1} X^\top y$; L1 vs L2. PCA: centre → eigendecompose → top-$k$; SVD; Eckart–Young; LoRA. Clustering: k-means + 4 issues + k-means++ fix; GMM soft; hierarchical + linkage.
