Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

DIP — Filters, Histograms, Fourier/DCT, Morphology, Geometric Ops, Hough, Templates

Unit 2 — Digital Image Processing Recap

What DIP Gives Us

Before any 'learning' comes the bag of operations that act *on* the pixels. Every CNN op has a DIP ancestor. Convolution is from DIP. Pooling is morphology in disguise. Batch normalisation is histogram equalisation, philosophically. Understanding the ancestor explains the descendant, and the exam wants you fluent in both languages.

Images As Functions

A digital image is a function $f(x, y)$. Domain = the discrete 2D grid of pixel locations. Range = intensity (one value per channel). Grayscale is one channel; RGB is three. "8 bits/pixel" means each pixel uses 8 bits of storage, giving $2^8 = 256$ grey levels.

Image acquisition turns a continuous analog signal (light hitting a sensor) into this digital function via sampling (which pixel positions exist) and quantisation (how many gray levels per pixel). Sample more finely → higher spatial resolution. Quantise to more levels → smoother gradients.

Quick compression maths the lecture wants you to do: a 1920×1080 RGB image at 24 bits/pixel takes $1920 \times 1080 \times 24 \approx 49.8 \times 10^6$ bits $\approx 6.2$ MB uncompressed. 30 seconds of 30 fps video = 900 frames $\approx 5.6$ GB raw. Real videos are 100–1000× smaller because of *spatial redundancy* (adjacent pixels correlated) and *temporal redundancy* (frame $t$ ≈ frame $t+1$). Compression exploits these: JPEG uses the spatial redundancy within one image; MPEG/H.264 uses both for video.
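The arithmetic can be checked in a few lines. This is only a sketch of the calculation, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes); the variable names are illustrative.

```python
# Raw storage for one 1920x1080 RGB frame at 24 bits/pixel.
WIDTH, HEIGHT, BITS_PER_PIXEL = 1920, 1080, 24

frame_bits = WIDTH * HEIGHT * BITS_PER_PIXEL   # 49,766,400 bits
frame_bytes = frame_bits // 8                  # 6,220,800 bytes
frame_mb = frame_bytes / 1e6                   # decimal megabytes

# 30 seconds of video at 30 frames per second.
n_frames = 30 * 30                             # 900 frames
video_gb = n_frames * frame_bytes / 1e9        # decimal gigabytes

print(f"one frame: {frame_bits:,} bits = {frame_mb:.2f} MB")
print(f"{n_frames} frames: {video_gb:.2f} GB raw")
```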

Three Paradigms In The Spatial Domain

The lecture organises DIP into three paradigms by neighbourhood size.

Point → Point. Operate on one pixel at a time. Image negative $s = L - 1 - r$ (inverts intensities). Log $s = c \log(1 + r)$ (compresses dynamic range; brightens dark regions). Gamma $s = c\, r^{\gamma}$ (camera/display calibration; $\gamma < 1$ brightens, $\gamma > 1$ darkens). Thresholding (binarise at $T$).

Neighbourhood → Point. Operate on a window. This is where the bulk of DIP lives — convolution, filtering, edges, sharpening, smoothing.

Global → Point. Operate on the whole image. Histograms, equalisation, normalisation, statistics.
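The point operations are one-liners in NumPy. The sketch below assumes the usual textbook forms (negative `L-1-r`, log `c*log(1+r)`, gamma `c*r**gamma` on intensities normalised to [0, 1]) and an 8-bit image; the function names and the choice of `c` are illustrative, not taken from the lecture.

```python
import numpy as np

L = 256  # number of grey levels for an 8-bit image

def negative(img):
    return (L - 1) - img

def log_transform(img):
    c = (L - 1) / np.log(L)  # assumed scaling so that L-1 maps to L-1
    return c * np.log1p(img.astype(np.float64))

def gamma_transform(img, gamma):
    r = img.astype(np.float64) / (L - 1)  # normalise to [0, 1]
    return (L - 1) * r ** gamma

def threshold(img, T):
    return np.where(img >= T, L - 1, 0)

img = np.array([[0, 64], [128, 255]], dtype=np.uint8)
print(negative(img))
print(log_transform(img).round(1))         # dark values are lifted
print(gamma_transform(img, 0.5).round(1))  # gamma < 1 brightens
print(gamma_transform(img, 2.0).round(1))  # gamma > 1 darkens
print(threshold(img, 128))
```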

The Spatial Filter Zoo

You will be asked to recognise these and write the kernels.

Mean filter: every weight is $1/9$ in a 3×3 window. Sums to 1 (preserves brightness). Blurs everything. Boxy.

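Gaussian filter: weighted by distance, $G(x, y) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)}$. Smoother, no boxy artefact. Separable (2D = 1D ⊗ 1D), so a $k \times k$ kernel costs $O(2k)$ per pixel instead of $O(k^2)$. Isotropic. No ringing, because its Fourier transform is also a Gaussian. $\sigma$ controls spread; kernel size ≈ $6\sigma$ (that is, $\pm 3\sigma$) captures 99% of the mass.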
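Separability is easy to verify numerically. The sketch below assumes a sampled Gaussian truncated at ±3σ and normalised to sum to 1; `gaussian_1d` and `gaussian_2d` are illustrative helper names.

```python
import numpy as np

def gaussian_1d(sigma):
    """Sampled 1D Gaussian, truncated at +/- 3 sigma, normalised to sum to 1."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def gaussian_2d(sigma):
    """Sampled 2D Gaussian on the same support, built directly from x^2 + y^2."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(x, x, indexing="ij")
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return g / g.sum()

g1 = gaussian_1d(1.0)           # 7 taps for sigma = 1
g2 = gaussian_2d(1.0)           # 7 x 7 kernel
separable = np.outer(g1, g1)    # 1D (x) 1D

print(g1.size, g2.shape)
print(np.allclose(separable, g2))  # the outer product reproduces the 2D kernel
```

With 7 taps, filtering rows and then columns needs 14 multiplies per pixel instead of 49.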

Sobel-x — first derivative in x, detects VERTICAL edges (where intensity changes left-to-right). Kernel sums to 0.

Sobel-y — same but for horizontal edges. Sums to 0.

From these you get the gradient $\nabla f = (G_x, G_y)$. Magnitude $|\nabla f| = \sqrt{G_x^2 + G_y^2}$. Direction $\theta = \operatorname{atan2}(G_y, G_x)$.

Critical fact the exam loves: the edge is PERPENDICULAR to the gradient. The gradient points across the edge (from dark to bright); the edge runs along the iso-intensity contour. Picture a step from black on the left to white on the right: the gradient points horizontally; the edge runs vertically.

The sum rule: smoothing kernels sum to 1 (so flat regions stay flat). Derivative/edge kernels sum to 0 (so flat regions give zero response).
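A small numerical check of the gradient and the sum rule. This sketch assumes the standard 3×3 Sobel kernels and applies them as cross-correlation (no flip) on the valid region only; `correlate_valid` is an illustrative helper, not a library function.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T
MEAN_3X3 = np.full((3, 3), 1 / 9)

def correlate_valid(img, kernel):
    """Cross-correlation over positions where the kernel fits entirely."""
    windows = sliding_window_view(img, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

# Vertical step edge: black (0) on the left, white (1) on the right.
img = np.zeros((5, 6))
img[:, 3:] = 1.0

gx = correlate_valid(img, SOBEL_X)
gy = correlate_valid(img, SOBEL_Y)
magnitude = np.hypot(gx, gy)
direction = np.arctan2(gy, gx)   # 0 rad: gradient points along +x, dark to bright

print(gx)                              # non-zero only next to the step
print(SOBEL_X.sum(), MEAN_3X3.sum())   # derivative sums to 0, smoothing to 1
```

The gradient is horizontal (`gy` is zero everywhere) while the edge itself runs vertically.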

Laplacian: second derivative, isotropic, detects edges via zero-crossings. Very noise-sensitive (a second derivative scales each frequency component by $\omega^2$, so high-frequency noise is amplified quadratically in frequency). LoG (Laplacian of Gaussian) combines them: smooth first, then Laplacian. Robust edge detection. The Marr–Hildreth operator.

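Unsharp masking: extract detail (= original − blurred), add back. $g = f + k\,(f - f_{\text{blur}})$. Enhances edges. Highboost is the generalisation $g = A\,f + k\,(f - f_{\text{blur}})$; $A = 1$ is standard unsharp masking, $A > 1$ also brightens.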

Median filter — non-linear. Sort the 9 values, take the 5th. Kills salt-and-pepper noise (outliers occupy the sorted tails and never reach the median). Mean filter would average them in and smear them — exactly wrong.
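The difference is easy to see on a flat patch with one corrupted pixel. A minimal sketch, assuming a 3×3 window and replicate padding at the border; `filter3x3` is an illustrative helper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def filter3x3(img, reducer):
    """Apply a 3x3 sliding-window reducer (e.g. np.median, np.mean)."""
    padded = np.pad(img, 1, mode="edge")
    windows = sliding_window_view(padded, (3, 3))
    return reducer(windows, axis=(-2, -1))

img = np.full((5, 5), 100.0)
img[2, 2] = 255.0                      # a single "salt" pixel

median_out = filter3x3(img, np.median)
mean_out = filter3x3(img, np.mean)

print(median_out[1:4, 1:4])            # outlier gone
print(mean_out[1:4, 1:4].round(1))     # outlier smeared over its neighbours
```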

Bilateral filter — non-linear, edge-preserving. Two Gaussians: spatial (closer = more weight, like a normal Gaussian) and range/intensity (similar intensity = more weight). The product collapses near edges (intensity jumps) → no smoothing across edges → flat regions get denoised but edges stay crisp. Used in HDR tone mapping, "beauty filters" for skin smoothing, and as a pre-step before edge detection.

Convolution vs Cross-Correlation

Cross-correlation: $(f \star w)(x, y) = \sum_{s}\sum_{t} w(s, t)\, f(x + s, y + t)$. Convolution flips the kernel: $(f * w)(x, y) = \sum_{s}\sum_{t} w(s, t)\, f(x - s, y - t)$. For *symmetric* kernels (mean, Gaussian, Laplacian) they give identical results. For *asymmetric* kernels (Sobel-x, Sobel-y) they differ, and PyTorch's nn.Conv2d actually computes cross-correlation (no flip) despite the name. Important when you implement a gradient filter and care which way is positive.
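The sign flip shows up immediately on a ramp image. This is a pure NumPy sketch (valid region only, no padding); `cross_correlate` and `convolve` are illustrative helpers, not the PyTorch implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cross_correlate(img, kernel):
    """Slide the kernel as-is (no flip)."""
    windows = sliding_window_view(img, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def convolve(img, kernel):
    """True convolution: flip the kernel on both axes, then correlate."""
    return cross_correlate(img, kernel[::-1, ::-1])

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
MEAN_3X3 = np.full((3, 3), 1 / 9)

# Intensity increases left to right by 1 per pixel.
ramp = np.tile(np.arange(5.0), (5, 1))

corr = cross_correlate(ramp, SOBEL_X)   # positive response
conv = convolve(ramp, SOBEL_X)          # same magnitude, opposite sign

print(corr[0, 0], conv[0, 0])
print(np.allclose(cross_correlate(ramp, MEAN_3X3), convolve(ramp, MEAN_3X3)))
```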

The convolution theorem is the bridge to the frequency domain: convolution in space ↔ multiplication in frequency. This justifies FFT-based filtering for large kernels.

Padding

A kernel centred at a border pixel reaches outside the image. Zero padding fills with 0s: simple, used in CNNs, but it produces a dark halo at borders and fake Sobel edges along the image boundary. Replicate padding copies the nearest border pixel: no halo, smoother. Reflect mirrors. Wrap treats the image as a torus (only useful for periodic signals).

"Same" padding with odd kernel size $k$: pad $p = (k - 1)/2$ on every side, so the output is the same size as the input. "Valid" padding: no pad, output shrinks by $k - 1$ along each dimension.

Histograms And Equalisation

A histogram $h(r_k) = n_k$ counts the pixels at each intensity level $r_k$. It tells you contrast and dynamic range, not spatial layout: a black-and-white checkerboard and a half-black/half-white split have identical histograms.

Histogram equalisation redistributes intensities so the output histogram is approximately uniform. Algorithm: compute histogram → normalise to PDF → compute CDF → map $s_k = \operatorname{round}\big((L - 1)\,\mathrm{CDF}(r_k)\big)$. The CDF is exactly the function that flattens any distribution into a uniform one. Beautiful, but caveat: amplifies noise in flat regions.

Contrast stretching is the simpler linear version: $s = (L - 1)\,\dfrac{r - r_{\min}}{r_{\max} - r_{\min}}$. Histogram shape is preserved, just stretched to fill the axis. Gentle and predictable.
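Both mappings fit in a few lines. A minimal sketch for 8-bit images, assuming the CDF mapping `round((L-1)*CDF)` and the min-max stretch; the function names are illustrative, and `stretch` assumes the image is not constant.

```python
import numpy as np

L = 256  # grey levels for an 8-bit image

def equalise(img):
    hist = np.bincount(img.ravel(), minlength=L)     # histogram
    pdf = hist / img.size                            # normalise to PDF
    cdf = np.cumsum(pdf)                             # CDF
    lut = np.round((L - 1) * cdf).astype(np.uint8)   # mapping r_k -> s_k
    return lut[img]

def stretch(img):
    r = img.astype(np.float64)
    r_min, r_max = r.min(), r.max()                  # assumes r_max > r_min
    return np.round((L - 1) * (r - r_min) / (r_max - r_min)).astype(np.uint8)

# Low-contrast image: all values squeezed into [100, 130].
img = np.array([[100, 100, 110, 110],
                [110, 120, 120, 130]], dtype=np.uint8)

print(equalise(img))   # levels spread according to how many pixels lie below each
print(stretch(img))    # levels spread linearly, spacing preserved
```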

Why Frequency?

Three reasons.

Convolution theorem. Spatial convolution = frequency multiplication. For an $N \times N$ image and a $k \times k$ kernel, convolution naively costs $O(N^2 k^2)$; via FFT-multiply-IFFT it's $O(N^2 \log N)$, which wins decisively for big $k$.

Compact representation. Natural images concentrate energy in low frequencies. Throw away the high-frequency tail and lose little perceptually — the principle behind JPEG.

Periodic noise removal. Scanlines or screen-tones appear as isolated spikes in the spectrum; zero those bins and the noise vanishes. Spatial removal of the same pattern is hard.
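The convolution theorem can be verified directly with NumPy's FFT. Note the assumption: the DFT version of the theorem holds for *circular* convolution, so the sketch compares against a wrap-around convolution; `circular_convolve` is an illustrative helper.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random((8, 8))

# 3x3 mean kernel, zero-padded to the image size.
w = np.zeros((8, 8))
w[:3, :3] = 1 / 9

def circular_convolve(f, w):
    """g[x, y] = sum_{s,t} w[s, t] * f[(x - s) mod N, (y - t) mod M]."""
    out = np.zeros_like(f)
    for s, t in zip(*np.nonzero(w)):
        out += w[s, t] * np.roll(f, shift=(s, t), axis=(0, 1))
    return out

direct = circular_convolve(f, w)
via_fft = np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(w)))

print(np.allclose(direct, via_fft))
```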

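2D DFT. Forward $F(u, v) = \sum_{x=0}^{M-1}\sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi (ux/M + vy/N)}$; inverse $f(x, y) = \frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} F(u, v)\, e^{+j 2\pi (ux/M + vy/N)}$, with the $+$ sign in the exponent. FFT is the $O(MN \log(MN))$ algorithm that computes it.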

Ideal low-pass filter: $H(u, v) = 1$ inside a disk of radius $D_0$, 0 outside. Sharp cutoff → sinc-like kernel in the spatial domain → ringing artefacts. Gaussian LPF smoothly attenuates, no ringing.
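One way to see the ringing is to look at the spatial kernel each filter corresponds to. The sketch below is illustrative only: the grid size `N`, cutoff `D0`, and the `ringing` measure (most negative lobe relative to the peak) are assumptions chosen for the demonstration.

```python
import numpy as np

N, D0 = 64, 6.0
u = np.fft.fftfreq(N) * N                          # frequency indices in FFT order
D = np.hypot(*np.meshgrid(u, u, indexing="ij"))    # distance from the DC term

H_ideal = (D <= D0).astype(float)                  # 1 inside the disk, 0 outside
H_gauss = np.exp(-D**2 / (2 * D0**2))              # smooth roll-off

def spatial_kernel(H):
    """Spatial-domain kernel equivalent to multiplying the spectrum by H."""
    return np.real(np.fft.ifft2(H))

def ringing(H):
    """Most negative kernel value relative to the peak (0 means no negative lobes)."""
    k = spatial_kernel(H)
    return k.min() / k.max()

print(f"ideal LPF:    {ringing(H_ideal):.3f}")     # clearly negative side lobes
print(f"Gaussian LPF: {ringing(H_gauss):.3e}")     # essentially zero
```

The negative side lobes of the ideal filter's kernel are what produce the oscillating halos around edges.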

DCT, And Why JPEG Uses It Instead Of DFT

DCT is all real-valued, all cosines, and mirror-extends the signal so there is no boundary discontinuity. DFT assumes periodic extension; boundary discontinuities show up as high-frequency energy that wastes bits. So DCT has better energy compaction — most of the image energy concentrates in fewer low-frequency coefficients — which is exactly what you want when you're about to zero out the tail.

JPEG pipeline. RGB → YCbCr (luma + chroma) → 4:2:0 chroma subsampling (human eye is less sensitive to colour resolution) → split into 8×8 blocks → 2D DCT per block → quantise (divide by quant table; bigger divisors on high-freq coefficients) → zigzag scan (puts low-freq first) → run-length encode + Huffman. The lossy step is quantisation. To decompress, run the inverse.
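The transform-and-quantise core of the pipeline can be sketched on a single 8×8 block. Assumptions, stated loudly: the quantisation table `Q` below is a made-up table whose step grows with frequency, NOT the JPEG standard table, and the block is a synthetic smooth gradient; colour conversion, zigzag, and entropy coding are omitted.

```python
import numpy as np

N = 8
n = np.arange(N)

# Orthonormal DCT-II matrix: C[k, m] = a_k * cos(pi * (2m + 1) * k / (2N)).
C = np.sqrt(2 / N) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
C[0, :] = np.sqrt(1 / N)

def dct2(block):
    return C @ block @ C.T

def idct2(coeffs):
    return C.T @ coeffs @ C

x, y = np.meshgrid(n, n, indexing="ij")
pixels = 188.0 + 5 * x + 3 * y          # hypothetical smooth block, values 188..244
block = pixels - 128                    # level shift, as JPEG does before the DCT

Q = 8.0 + 4 * (x + y)                   # hypothetical table: coarser at high frequency

coeffs = dct2(block)
quantised = np.round(coeffs / Q)        # the lossy step
restored = idct2(quantised * Q) + 128   # dequantise, inverse DCT, undo level shift

print("non-zero coefficients:", np.count_nonzero(quantised), "of 64")
print("max abs pixel error:", np.abs(restored - pixels).max().round(2))
```

For a smooth block almost all quantised coefficients are zero, which is what makes the later run-length and Huffman stages effective.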

Morphology — Set Theory On Binary Images

Define a structuring element (SE) — a small binary pattern with an origin. Morphology is set-theoretic operations on the foreground (1-pixels).

Erosion — the SE must fit ENTIRELY inside the foreground. Equivalent to a MIN filter. Shrinks objects, kills small noise dots.

Dilation — the SE just has to touch the foreground. Equivalent to a MAX filter. Grows objects, fills small holes.

Opening = erode then dilate. Kills small noise dots and preserves the shape of large objects. Closing = dilate then erode. Fills small holes and gaps. Both are idempotent — applying them twice gives the same result as once.

Boundary detection = $A - (A \ominus B)$ (or equivalently $A \cap (A \ominus B)^c$), where $A$ is the foreground set and $B$ the SE. Cheap, one-line edge for binary images.
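The basic operators can be sketched as min and max filters. Assumptions: a square all-ones structuring element with its origin at the centre, and background (0) outside the image; the function names are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def erode(img, size=3):
    """Binary erosion with a size x size all-ones SE (a MIN filter)."""
    padded = np.pad(img, size // 2, mode="constant", constant_values=0)
    return sliding_window_view(padded, (size, size)).min(axis=(-2, -1))

def dilate(img, size=3):
    """Binary dilation with a size x size all-ones SE (a MAX filter)."""
    padded = np.pad(img, size // 2, mode="constant", constant_values=0)
    return sliding_window_view(padded, (size, size)).max(axis=(-2, -1))

def opening(img):
    return dilate(erode(img))

def closing(img):
    return erode(dilate(img))

A = np.zeros((8, 8), dtype=np.uint8)
A[1:6, 1:6] = 1          # a 5x5 object
A[7, 7] = 1              # an isolated noise dot

opened = opening(A)                   # noise dot removed, object kept
boundary = opened - erode(opened)     # object minus its erosion: a one-pixel ring

print(opened)
print(boundary)
```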

Hit-or-Miss transform matches a foreground pattern AND its background pattern simultaneously — used for isolated points, line endpoints, T-junctions, corners.

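Otsu's method for thresholding: pick the threshold $T$ that maximises between-class variance $\sigma_B^2(T) = w_0(T)\, w_1(T)\, [\mu_0(T) - \mu_1(T)]^2$. Equivalently, it minimises within-class variance $\sigma_W^2(T)$; the two are dual because the total variance $\sigma^2 = \sigma_W^2 + \sigma_B^2$ is constant. No iterative optimisation is needed: sweep every candidate $T$ over the histogram and keep the best.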

Global threshold fails on uneven illumination (one corner brighter than the other → no single $T$ works). Adaptive methods (Sauvola, Niblack) compute $T(x, y)$ from local statistics.
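A histogram-based sketch of Otsu's sweep, assuming an 8-bit image and the convention that class 0 holds pixels with value ≤ T; `otsu_threshold` is an illustrative name. It uses the identity σ_B² = (μ_total·w0 − m)² / (w0·w1), where m is the cumulative first moment, which equals w0·w1·(μ0 − μ1)².

```python
import numpy as np

def otsu_threshold(img, levels=256):
    hist = np.bincount(img.ravel(), minlength=levels).astype(np.float64)
    p = hist / hist.sum()
    i = np.arange(levels)

    w0 = np.cumsum(p)            # weight of class 0 (pixels <= T) for every T
    w1 = 1.0 - w0                # weight of class 1
    m = np.cumsum(p * i)         # cumulative first moment
    mu_total = m[-1]

    denom = w0 * w1
    sigma_b2 = np.zeros(levels)
    valid = denom > 0            # skip thresholds that leave one class empty
    sigma_b2[valid] = (mu_total * w0[valid] - m[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b2)), sigma_b2

# Bimodal toy image: a dark cluster around 50 and a bright cluster around 200.
img = np.array([40, 50, 60, 190, 200, 210] * 10, dtype=np.uint8).reshape(6, 10)

T, sigma_b2 = otsu_threshold(img)
binary = img > T
print("threshold:", T)
print("foreground fraction:", binary.mean())
```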

Geometric Transformations

Every standard transform fits in a $3 \times 3$ matrix in homogeneous coordinates.

Translation (2 DoF). Rigid = R + t (3 DoF). Similarity = R + t + uniform scale (4 DoF). Affine (6 DoF) — preserves parallel lines, allows shear. Projective / homography (8 DoF) — preserves only straight lines, allows perspective foreshortening.

Forward vs inverse warping. Forward iterates over source pixels $(x, y)$ and pushes each to its destination $(x', y') = T(x, y)$. Two failures: HOLES (the integer destination grid is not fully covered) and OVERLAPS (multiple source pixels map to the same destination). Inverse iterates over destination pixels $(x', y')$ and pulls from $T^{-1}(x', y')$, so every destination is filled exactly once. The catch: $T^{-1}(x', y')$ usually lands at a non-integer source coordinate, so you need interpolation (nearest-neighbour fast/blocky, bilinear default, bicubic smoother).

Always use inverse warping.
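A minimal sketch of inverse warping with bilinear interpolation for a single-channel image. Assumptions: `M` is a 3×3 homogeneous matrix mapping source (x, y, 1) to destination, and destination pixels whose source location falls outside the image are set to 0; `warp_inverse` is an illustrative name.

```python
import numpy as np

def warp_inverse(src, M, out_shape):
    h, w = src.shape
    H, W = out_shape
    M_inv = np.linalg.inv(M)

    # Every destination pixel, in homogeneous coordinates (x, y, 1).
    ys, xs = np.mgrid[0:H, 0:W]
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])

    # Pull: where does each destination pixel come from in the source?
    sx, sy, sw = M_inv @ dst
    sx, sy = sx / sw, sy / sw
    inside = (sx >= 0) & (sx <= w - 1) & (sy >= 0) & (sy <= h - 1)

    # Bilinear interpolation between the four surrounding source pixels.
    x0 = np.clip(np.floor(sx).astype(int), 0, w - 1)
    y0 = np.clip(np.floor(sy).astype(int), 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    ax, ay = sx - x0, sy - y0
    val = ((1 - ax) * (1 - ay) * src[y0, x0] + ax * (1 - ay) * src[y0, x1]
           + (1 - ax) * ay * src[y1, x0] + ax * ay * src[y1, x1])

    return np.where(inside, val, 0.0).reshape(H, W)

src = np.arange(16, dtype=float).reshape(4, 4)
shift_half = np.array([[1, 0, 0.5],     # translate by half a pixel in x
                       [0, 1, 0.0],
                       [0, 0, 1.0]])

out = warp_inverse(src, shift_half, src.shape)
print(out)   # each value is the average of two horizontal neighbours
```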

Colour

RGB: additive, screens, perceptually non-uniform. CMYK: subtractive, printing; the K = black is added because triple-overprinting C+M+Y produces muddy dark grey, not deep black. CIE Lab: perceptually uniform; equal Euclidean distance in $L^*a^*b^*$ space ≈ equal perceived colour difference. HSI/HSV: intuitive (hue, saturation, intensity). YCbCr: luminance + 2 chrominance; JPEG uses it with chroma subsampling because the human eye is less sensitive to chroma than luma.

Hough Transform

Each edge pixel votes for every line that passes through it. To avoid the vertical-line problem (slope = ∞), parameterise the line in polar form $\rho = x\cos\theta + y\sin\theta$. Build a 2D accumulator over $(\rho, \theta)$; for each edge pixel, increment every cell corresponding to a line through it. Peaks = detected lines.

Generalises to circles (3D accumulator over centre + radius) and arbitrary curves.
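A compact sketch of the line accumulator, assuming the polar form ρ = x·cosθ + y·sinθ with θ sampled every degree in [0°, 180°) and ρ rounded to the nearest integer; `hough_lines` is an illustrative name and peak detection is left out.

```python
import numpy as np

def hough_lines(edge_img, n_theta=180):
    h, w = edge_img.shape
    diag = int(np.ceil(np.hypot(h, w)))            # largest possible |rho|
    thetas = np.deg2rad(np.arange(n_theta))        # 0, 1, ..., 179 degrees
    rhos = np.arange(-diag, diag + 1)
    acc = np.zeros((len(rhos), n_theta), dtype=np.int64)

    ys, xs = np.nonzero(edge_img)
    for x, y in zip(xs, ys):
        # One vote per theta: the rho of the line through (x, y) at that angle.
        rho = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rho + diag, np.arange(n_theta)] += 1
    return acc, rhos, thetas

# A vertical line x = 3: impossible to express as y = mx + c, trivial in polar form.
edges = np.zeros((10, 10), dtype=np.uint8)
edges[:, 3] = 1

acc, rhos, thetas = hough_lines(edges)
votes = acc[rhos == 3, 0][0]      # cell (rho = 3, theta = 0)
print("votes for rho=3, theta=0:", votes, "of", int(edges.sum()), "edge pixels")
```

Neighbouring cells (θ of a degree or two) collect the same number of votes for such a short line, which is why practical detectors apply non-maximum suppression on the accumulator.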

Template Matching

Slide a template over the image; at each position, score similarity.

SSD (sum of squared differences) — sensitive to brightness shifts.

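NCC (normalised cross-correlation): divides by the norms of both patches, so it is invariant to a multiplicative change in intensity (gain) but not to an additive offset. Range $[-1, 1]$.

Correlation coefficient (zero-mean NCC): subtracts each patch's mean before normalising, so it is invariant to additive brightness offsets AND multiplicative contrast changes.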
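The three scores can be compared on a toy patch. The definitions below are assumptions about what the lecture intends: NCC divides the dot product by both norms without subtracting means, and the correlation coefficient subtracts each patch's mean first. Some sources use "NCC" for the mean-subtracted version, so check the definition before quoting invariances.

```python
import numpy as np

def ssd(a, b):
    return float(np.sum((a - b) ** 2))

def ncc(a, b):
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def corr_coeff(a, b):
    return ncc(a - a.mean(), b - b.mean())

template = np.array([[10.0, 20.0],
                     [30.0, 40.0]])
brighter = template + 50            # additive brightness offset
higher_contrast = 2 * template      # multiplicative gain
both = 2 * template + 50            # gain and offset together

for name, patch in [("same", template), ("brighter", brighter),
                    ("contrast", higher_contrast), ("both", both)]:
    print(f"{name:9s} SSD={ssd(template, patch):8.1f} "
          f"NCC={ncc(template, patch):.3f} "
          f"CC={corr_coeff(template, patch):.3f}")
```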

The fundamental limit: pure pixel matching has no rotation or scale invariance. Rotate the template by 5° and it fails. This motivates the move to invariant feature descriptors (SIFT, ORB) and later to learned features (CNNs).

What You Walk In Carrying

Image as $f(x, y)$; sampling vs quantisation. Compression numbers. Three spatial paradigms. Negative, log, gamma point-ops. Mean / Gaussian / Sobel / Laplacian / LoG / unsharp / median / bilateral: what each does, its kernel, its noise behaviour. Edge ⊥ gradient. Smoothing sums to 1, derivative sums to 0. Convolution vs cross-correlation (PyTorch is correlation). Padding flavours. Histograms, equalisation algorithm, contrast stretching. Why frequency domain; DFT/IDFT formulas; ideal-LPF ringing; JPEG uses DCT not DFT (better energy compaction, no boundary discontinuity). Morphology: erosion MIN, dilation MAX, opening kills noise, closing fills holes, boundary = $A - (A \ominus B)$. Otsu maximises between-class variance. Homogeneous coordinates and the DoF ladder (2 → 3 → 4 → 6 → 8). Inverse warping ✓. Colour models. Hough polar form. Template matching distances and limitations.
