DIP — Filters, Histograms, Fourier/DCT, Morphology, Geometric Ops, Hough, Templates
Intuition
DIP is the bag of operations that act *on* the pixels — before any 'learning'. Three big paradigms: spatial (operate on pixel neighbourhoods directly), transform (move to a different basis like Fourier/DCT, operate there, transform back), and morphological (set-theoretic operations on binary images via a structuring element). Every modern CNN op (conv, pooling, padding, BN) has a DIP ancestor; understanding the ancestor explains the descendant.
Explanation
**A digital image is a function $f(x, y)$.** Domain: discrete 2D grid of pixel locations. Range: intensity. Grayscale = 1 channel; RGB = 3. '8 bits/pixel' means each pixel uses 8 bits → $2^8 = 256$ values, levels $0$ to $255$. Image acquisition = analog (light hitting sensor) → digital, via sampling (which pixel positions exist) and quantisation (how many gray levels per pixel).
Compression matters. A 1920×1080 RGB image at 24 bits/pixel is $1920 \times 1080 \times 24 = 49{,}766{,}400$ bits $\approx 6.22$ MB *uncompressed*. 30 s of 30 fps video → 900 frames × 6.22 MB $\approx$ 5.6 GB raw. Real videos are 100–1000× smaller because of (a) spatial redundancy (neighbouring pixels are correlated) and (b) temporal redundancy (frame $t$ $\approx$ frame $t+1$). Compression (JPEG, MPEG, H.264) exploits both via DCT + quantisation + entropy coding, with the video codecs adding motion-compensated prediction between frames.
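The arithmetic behind those figures can be checked in a few lines. This is only a sketch of the calculation; it assumes decimal units (1 MB = $10^6$ bytes, 1 GB = $10^9$ bytes), which is what the 6.22 MB and 5.6 GB figures imply.

```python
# Raw (uncompressed) size of one 1920x1080 RGB frame and of a 30 s clip.
WIDTH, HEIGHT, BITS_PER_PIXEL = 1920, 1080, 24
FPS, SECONDS = 30, 30

frame_bits = WIDTH * HEIGHT * BITS_PER_PIXEL   # bits in one frame
frame_bytes = frame_bits // 8                  # bytes in one frame
frames = FPS * SECONDS                         # number of frames in the clip
clip_bytes = frames * frame_bytes              # bytes in the whole clip

print(f"frame: {frame_bytes / 1e6:.2f} MB")    # decimal megabytes
print(f"clip:  {clip_bytes / 1e9:.2f} GB")     # decimal gigabytes
```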
Three spatial-domain paradigms. *Point→Point* (operate on one pixel at a time — negatives, log, gamma, thresholding). *Neighbourhood→Point* (operate on a window — convolution, mean/Gaussian/Sobel/Laplacian, median, bilateral). *Global→Point* (operate on the whole image — histogram equalisation, normalisation, statistics).
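Point operations: formulas to memorise. *Negative*: $s = (L - 1) - r$, useful for inverting X-ray contrast. *Log*: $s = c \log(1 + r)$, compresses dynamic range and brightens dark regions. *Power-law (gamma)*: $s = c\, r^{\gamma}$. $\gamma < 1$ → brighter; $\gamma > 1$ → darker; $\gamma = 1$ → identity. Used for gamma correction (CRT monitors have $\gamma \approx 2.2$, so cameras pre-compensate with $\gamma \approx 1/2.2 \approx 0.45$). Common bug: forgetting to normalise $r$ to $[0, 1]$ before exponentiation.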
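A minimal sketch of the three point operations for 8-bit values follows. The function names and the choice of the constant $c$ (picked so that the largest input maps to the largest output) are assumptions made for illustration, not something the notes prescribe.

```python
import math

L = 256  # number of gray levels in an 8-bit image

def negative(r):
    """s = (L - 1) - r"""
    return (L - 1) - r

def log_transform(r):
    """s = c * log(1 + r), with c chosen so that r = 255 maps to s = 255."""
    c = (L - 1) / math.log(L)
    return round(c * math.log(1 + r))

def gamma_transform(r, gamma):
    """s = r ** gamma, computed on values normalised to [0, 1] first."""
    normalised = r / (L - 1)          # normalise BEFORE exponentiation
    return round((L - 1) * normalised ** gamma)

for r in (0, 64, 128, 255):
    print(r, negative(r), log_transform(r),
          gamma_transform(r, 0.5), gamma_transform(r, 2.0))
```

Skipping the normalisation step would raise raw values such as 200 to the power $\gamma$ directly, which is the bug the paragraph warns about.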
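Spatial filters: what each kernel is doing. *Mean*: $\frac{1}{9}\begin{bmatrix}1&1&1\\1&1&1\\1&1&1\end{bmatrix}$, sums to 1 (preserves brightness), blurs everything including edges → boxy artefacts. *Gaussian*: $G(x, y) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)}$, weighted, smooth, separable (2D = 1D ⊗ 1D, $O(2k)$ per pixel, not $O(k^2)$), isotropic, no ringing. $\sigma$ controls spread; kernel size ≈ $6\sigma + 1$ (that is, $\pm 3\sigma$) captures ~99% of the mass. *Sobel-x*: $\begin{bmatrix}-1&0&1\\-2&0&2\\-1&0&1\end{bmatrix}$, first derivative, sums to 0, detects VERTICAL edges (horizontal change). *Sobel-y* = transpose, detects HORIZONTAL edges. *Laplacian*: $\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$; kernel $\begin{bmatrix}0&1&0\\1&-4&1\\0&1&0\end{bmatrix}$, second derivative, isotropic, very noise-sensitive. *LoG (Laplacian of Gaussian)*: pre-smooth then Laplacian, 'Mexican hat', robust edge detection (Marr-Hildreth).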
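Edge facts. Gradient $\nabla f = (G_x, G_y) = (\partial f/\partial x, \partial f/\partial y)$. Magnitude $\sqrt{G_x^2 + G_y^2}$ (or $|G_x| + |G_y|$ for speed). Direction $\theta = \operatorname{atan2}(G_y, G_x)$. Edge ⊥ gradient: the gradient points across the edge (max intensity change), the edge is the iso-intensity contour. Smoothing kernels sum to 1; edge/derivative kernels sum to 0. Coefficient sign rule: any kernel that detects change must sum to zero so that it gives zero response on a flat region (else flat regions would fire).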
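The gradient facts can be checked on a single 3×3 patch. The sketch below uses a hypothetical patch (dark left, bright right) and the standard Sobel kernels; `correlate3x3` is an illustrative helper, not a library function.

```python
import math

SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
# Sobel-y is the transpose of Sobel-x.
SOBEL_Y = [list(col) for col in zip(*SOBEL_X)]

def correlate3x3(patch, kernel):
    """Sum of element-wise products: the filter response at the patch centre."""
    return sum(patch[i][j] * kernel[i][j] for i in range(3) for j in range(3))

# Hypothetical patch: a vertical edge (intensity changes left to right).
patch = [[10, 10, 50],
         [10, 10, 50],
         [10, 10, 50]]
flat = [[7, 7, 7]] * 3   # a flat region

gx = correlate3x3(patch, SOBEL_X)
gy = correlate3x3(patch, SOBEL_Y)
magnitude = math.hypot(gx, gy)
direction_deg = math.degrees(math.atan2(gy, gx))

print(gx, gy, magnitude, direction_deg)
print(correlate3x3(flat, SOBEL_X))   # derivative kernel on a flat region
```

The gradient direction comes out horizontal while the edge itself runs vertically, which is the "edge ⊥ gradient" rule in action.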
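Sharpening: unsharp masking & highboost. Detail = original − Gaussian-blurred. $g = f + k\,(f - f_{\text{blur}})$ enhances edges. Highboost: $g = A f - f_{\text{blur}} = (A - 1) f + (f - f_{\text{blur}})$. $A = 1$ → standard unsharp; $A > 1$ → also brightens. Used in printing and photography.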
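A one-dimensional sketch of unsharp masking, $g = f + k\,(f - f_{\text{blur}})$, shows the characteristic overshoot on both sides of a step edge. For brevity it swaps the Gaussian blur for a 3-tap box blur with replicate padding; that substitution, the helper names and the signal values are assumptions made for illustration.

```python
def box_blur_1d(signal):
    """3-tap mean filter with replicate padding (stand-in for a Gaussian blur)."""
    padded = [signal[0]] + signal + [signal[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3
            for i in range(len(signal))]

def unsharp(signal, k=1.0):
    """Add k times the detail (original minus blurred) back to the original."""
    blurred = box_blur_1d(signal)
    return [f + k * (f - b) for f, b in zip(signal, blurred)]

step = [10, 10, 10, 40, 40, 40]      # hypothetical step edge
print(box_blur_1d(step))             # the edge is smeared over two samples
print(unsharp(step))                 # undershoot before the edge, overshoot after
```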
Median + bilateral: non-linear, edge-preserving. *Median*: sort the 9 values in a 3×3 window, take the middle. KILLS salt-and-pepper noise because outliers occupy the tails of the sorted list and never reach the median. Mean filter, being linear, smears outliers across the neighbourhood. Median's weakness: doesn't preserve fine 1-pixel lines (3 of 9 = minority → median picks background). *Bilateral*: two Gaussians, one spatial $G_{\sigma_s}$ and one range/intensity $G_{\sigma_r}$. Weight of neighbour $q$ for centre pixel $p$: $w(p, q) = G_{\sigma_s}(\lVert p - q \rVert)\, G_{\sigma_r}(|I_p - I_q|)$. Across an edge, intensity-different neighbours get weight ≈ 0 → no blur across edges → edge-preserving smoothing.
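The median's strength and its weakness both show up on flattened 3×3 windows. The pixel values below are hypothetical, chosen only to illustrate the two cases.

```python
from statistics import mean, median

# Hypothetical window: seven plausible values plus one "pepper" (0)
# and one "salt" (255) outlier.
noisy_window = [24, 25, 26, 0, 25, 255, 23, 27, 25]

print(sorted(noisy_window))    # the outliers sit at the two ends
print(median(noisy_window))    # the middle value ignores both outliers
print(mean(noisy_window))      # the mean is dragged towards 255

# Weakness: a 1-pixel-wide bright vertical line on a dark background.
# Only 3 of the 9 samples belong to the line, so the median returns background.
line_window = [0, 255, 0,
               0, 255, 0,
               0, 255, 0]
print(median(line_window))
```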
Convolution vs cross-correlation. Cross-correlation: $(f \star h)(x, y) = \sum_i \sum_j h(i, j)\, f(x + i, y + j)$. Convolution flips the kernel: $(f * h)(x, y) = \sum_i \sum_j h(i, j)\, f(x - i, y - j)$. For SYMMETRIC kernels (mean, Gaussian, Laplacian) the flip is invisible and the result is the same. For ASYMMETRIC kernels (Sobel-x, Sobel-y) the two DIFFER. PyTorch's nn.Conv2d actually computes CROSS-CORRELATION internally despite the name, a quiz favourite. Convolution theorem: convolution in space ↔ multiplication in frequency.
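The difference is easy to see at a single pixel: convolution is cross-correlation with the kernel rotated by 180°. The sketch below uses a hypothetical 3×3 patch; `correlate` and `convolve` are illustrative helpers that evaluate the response at the patch centre only.

```python
def correlate(patch, kernel):
    """Cross-correlation at the patch centre: no kernel flip."""
    return sum(patch[i][j] * kernel[i][j] for i in range(3) for j in range(3))

def convolve(patch, kernel):
    """Convolution at the patch centre: correlate with the 180-degree-rotated kernel."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return correlate(patch, flipped)

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]     # antisymmetric under the flip
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]     # symmetric under the flip

patch = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 10]]

print(correlate(patch, SOBEL_X), convolve(patch, SOBEL_X))       # opposite signs
print(correlate(patch, LAPLACIAN), convolve(patch, LAPLACIAN))   # identical
```

For Sobel the two results differ only in sign, so the gradient magnitude is unaffected but the gradient direction flips.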
Padding. A $K \times K$ kernel centred at a border pixel reaches outside the image. *Zero padding*: virtual pixels = 0, simple, common in CNNs, causes a dark halo at borders + fake Sobel edges along the image boundary. *Replicate padding*: virtual pixels = nearest border pixel; no halo. Other: *reflect*, *wrap* (periodic). 'Same' padding = $P = (K - 1)/2$ for odd $K$; 'valid' = no pad, output shrinks by $K - 1$.
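The sketch below shows the output-size rule $\lfloor (N + 2P - K)/S \rfloor + 1$ and the four padding modes on a 1D row, then reproduces the dark halo with a 3-tap mean filter. The helper names are hypothetical, and `pad_row` assumes $1 \le p <$ row length.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def pad_row(row, p, mode):
    """Pad a 1D row by p samples on each side (assumes 1 <= p < len(row))."""
    if mode == "zero":
        left, right = [0] * p, [0] * p
    elif mode == "replicate":
        left, right = [row[0]] * p, [row[-1]] * p
    elif mode == "reflect":      # mirror about the border sample, not repeating it
        left, right = row[p:0:-1], row[-2:-2 - p:-1]
    elif mode == "wrap":         # periodic extension
        left, right = row[-p:], row[:p]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return left + row + right

def mean3(row, mode):
    """3-tap mean filter with 'same' output size."""
    padded = pad_row(row, 1, mode)
    return [sum(padded[i:i + 3]) / 3 for i in range(len(row))]

flat = [90, 90, 90, 90]
print(mean3(flat, "zero"))        # borders darken: the halo
print(mean3(flat, "replicate"))   # borders unchanged
```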
Histograms. $h(r_k) = n_k$ = number of pixels with intensity $r_k$. Tells you contrast & dynamic range, NOT spatial layout (a checkerboard and a half-black/half-white split have identical histograms). Low-contrast underexposed: mass bunched in a narrow band at the dark end. High-contrast: bimodal at extremes. Well-exposed: spread across $[0, 255]$.
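Histogram equalisation. Goal: redistribute intensities so the output histogram is ≈ uniform → maximise contrast. Algorithm for an $M \times N$ image with $L$ levels: (1) compute $h(r_k)$; (2) normalise to PDF $p(r_k) = h(r_k)/(MN)$; (3) compute CDF $c(r_k) = \sum_{j \le k} p(r_j)$; (4) map $s_k = \operatorname{round}\big((L - 1)\, c(r_k)\big)$. For a continuous distribution, the CDF is precisely the function that flattens it to uniform. Caveat: enhances noise in flat regions; non-linear, can produce unnatural results.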
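The four steps can be sketched on a tiny image. The 16 pixel values and the choice of $L = 8$ levels below are hypothetical, picked so the numbers stay readable; `equalise` is an illustrative helper.

```python
def equalise(pixels, levels):
    """Histogram-equalise a flat list of integer intensities in [0, levels - 1]."""
    n = len(pixels)
    hist = [0] * levels
    for v in pixels:                      # step 1: histogram
        hist[v] += 1
    mapping, running = [], 0
    for count in hist:
        running += count                  # steps 2 and 3: running / n is the CDF
        mapping.append(round((levels - 1) * running / n))   # step 4
    return [mapping[v] for v in pixels], mapping

# Hypothetical low-contrast image: 16 pixels squeezed into levels 2..5 of 0..7.
pixels = [2] * 4 + [3] * 6 + [4] * 4 + [5] * 2
equalised, mapping = equalise(pixels, levels=8)

print(mapping)                   # lookup table: input level -> output level
print(sorted(set(pixels)))       # levels used before
print(sorted(set(equalised)))    # levels used after: spread towards the top
```

The output still uses only four distinct levels, because equalisation can merge or move bins but never split one, so the result is only approximately uniform.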
Contrast stretching is the linear cousin: $s = (L - 1)\,\dfrac{r - r_{\min}}{r_{\max} - r_{\min}}$. Preserves histogram shape, just stretches it horizontally. Gentle, predictable. Use stretching for a gentle boost; equalisation for aggressive correction.
Why frequency domain? Three reasons: (1) Convolution theorem: spatial convolution = frequency multiplication. Large-kernel convolution of an $N \times N$ image with a $K \times K$ kernel via FFT: $O(N^2 \log N)$ vs naive $O(N^2 K^2)$, which wins for big $K$. (2) Compact representation: natural images concentrate energy in low frequencies; discard the tail and lose little perceptually (JPEG). (3) Periodic noise removal: scanlines/screen-tones appear as isolated spectrum spikes, trivially zeroed.
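2D DFT. Forward: $F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi (ux/M + vy/N)}$. Inverse: $f(x, y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u, v)\, e^{+j 2\pi (ux/M + vy/N)}$. Complex-valued. FFT is the $O(N \log N)$ algorithm for $N$ samples (vs $O(N^2)$ direct). Ideal LPF: $H(u, v) = 1$ if $D(u, v) \le D_0$, else 0, where $D(u, v)$ is the distance from the spectrum centre. Sharp cutoff → sinc spatial response → ringing artefacts. Gaussian LPF: $H(u, v) = e^{-D^2(u, v)/(2 D_0^2)}$. Smooth, no ringing. High-pass = edges, low-pass = smoothing.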
DCT vs DFT — why JPEG uses DCT. DCT is real-valued, cosines-only. DFT assumes periodic signal → boundary discontinuities → high-frequency energy (ringing). DCT mirror-extends → no boundary discontinuity → better energy compaction (most energy concentrates in fewer low-frequency coefficients) → fewer artefacts at 8×8 block boundaries in JPEG.
JPEG pipeline. RGB → YCbCr → chroma subsample (4:2:0; human eye less sensitive to colour resolution) → split into 8×8 blocks → 2D DCT per block → quantise (divide by quant table; low-freq small divisors, high-freq big divisors — kills the high-freq tail) → zigzag scan → run-length encode + Huffman/arithmetic. Lossy step = quantisation. Decompression is the inverse.
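The transform-and-quantise core of the pipeline can be sketched in one dimension. The code below applies an orthonormal 8-point DCT-II to a hypothetical row of pixels and quantises with the first row of the standard JPEG luminance table used as step sizes; treating that row as a 1D table is a simplification for illustration, since real JPEG works on 8×8 blocks with a 2D DCT.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a list of samples."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n)
                               for t in range(n)))
    return out

def idct_1d(coeffs):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum((math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)) * coeffs[k]
                * math.cos(math.pi * (t + 0.5) * k / n) for k in range(n))
            for t in range(n)]

row = [52, 55, 61, 66, 70, 61, 64, 73]      # hypothetical pixel row
STEPS = [16, 11, 10, 16, 24, 40, 51, 61]    # illustrative quantisation steps

shifted = [v - 128 for v in row]            # level shift to centre on zero
coeffs = dct_1d(shifted)
quantised = [round(c / q) for c, q in zip(coeffs, STEPS)]   # the lossy step
dequantised = [c * q for c, q in zip(quantised, STEPS)]
restored = [round(v) + 128 for v in idct_1d(dequantised)]

print([round(c, 1) for c in coeffs])   # energy concentrated in the first entries
print(quantised)                       # high-frequency tail rounds to zero
print(restored)                        # close to, but not equal to, the input
```

Without the quantisation step the round trip is exact up to floating-point error, which is why quantisation is the only lossy stage.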
Morphology: set-theoretic operations on binary images. Structuring element (SE) = small binary pattern with origin and possibly 'don't-care' cells. Erosion $A \ominus B$ (SE must fit ENTIRELY inside A) → MIN filter → shrinks foreground. Dilation $A \oplus B$ (SE touches A) → MAX filter → grows foreground. Opening $A \circ B = (A \ominus B) \oplus B$: erosion then dilation; kills small noise, preserves the shape of large objects. Closing $A \bullet B = (A \oplus B) \ominus B$: dilation then erosion; fills small holes. Both are idempotent. Boundary = $A - (A \ominus B)$ (the inner boundary; the outer variant is $(A \oplus B) - A$). Hit-or-miss locates isolated points, line endpoints, T-junctions.
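With a 3×3 square SE, erosion and dilation reduce to a sliding MIN and MAX. The sketch below assumes pixels outside the image count as background (0), which is a choice made here and not the only possible border convention; the test image is hypothetical.

```python
def window3x3(img, r, c):
    """Values under a 3x3 square SE centred at (r, c); outside the image is 0."""
    h, w = len(img), len(img[0])
    return [img[i][j] if 0 <= i < h and 0 <= j < w else 0
            for i in range(r - 1, r + 2) for j in range(c - 1, c + 2)]

def erode(img):      # MIN filter: 1 only where the SE fits entirely inside A
    return [[min(window3x3(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]

def dilate(img):     # MAX filter: 1 wherever the SE touches A
    return [[max(window3x3(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]

def opening(img):
    return dilate(erode(img))

# Hypothetical image: a 3x3 square plus one isolated noise pixel (top right).
noisy = [[0, 0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 0, 0],
         [0, 0, 1, 1, 1, 0, 0],
         [0, 0, 1, 1, 1, 0, 0],
         [0, 0, 1, 1, 1, 0, 0],
         [0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0]]

opened = opening(noisy)              # noise pixel removed, square kept
eroded = erode(opened)
boundary = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(opened, eroded)]

for row in opened:
    print(row)
print(sum(map(sum, boundary)))       # number of inner-boundary pixels
```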
Thresholding & Otsu. *Global* threshold: $T$ chosen once for the whole image (fails for uneven illumination). *Variable / adaptive* (Sauvola, Niblack): $T(x, y)$ depends on local statistics. Otsu (bimodal): pick $T$ that MAXIMISES between-class variance $\sigma_B^2(T)$ (equivalently, MINIMISES within-class variance $\sigma_W^2(T)$). The criterion is closed-form for each candidate $T$; the optimum is found by sweeping all thresholds.
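A minimal sketch of the Otsu sweep over a histogram follows. It assumes the convention that class 0 holds intensities $\le T$ and scores each candidate with $\sigma_B^2 = \omega_0\,\omega_1\,(\mu_0 - \mu_1)^2$; the 8-level histogram is hypothetical.

```python
def otsu_threshold(hist):
    """Return the T maximising between-class variance (class 0 = levels <= T)."""
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    n0, sum0 = 0, 0                      # running pixel count and intensity sum
    for t, count in enumerate(hist):
        n0 += count
        sum0 += t * count
        n1 = total - n0
        if n0 == 0 or n1 == 0:           # one class empty: variance undefined
            continue
        mu0, mu1 = sum0 / n0, (sum_all - sum0) / n1
        var_between = (n0 / total) * (n1 / total) * (mu0 - mu1) ** 2
        if var_between > best_var:       # keep the first maximum on ties
            best_t, best_var = t, var_between
    return best_t

# Hypothetical bimodal histogram over 8 levels: clusters around 1 and 6.
hist = [5, 10, 5, 0, 0, 5, 10, 5]
print(otsu_threshold(hist))
```

Thresholds 2, 3 and 4 score identically here because levels 3 and 4 are empty; the sketch returns the first of them, and any value in the valley separates the modes equally well.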
Geometric transformations. All expressed as $3 \times 3$ matrices in homogeneous coordinates. *Translation* (2 DoF), *Rigid* (R+t, 3 DoF), *Similarity* (R+t+s, 4 DoF), *Affine* (6 DoF, preserves PARALLEL lines), *Projective/homography* (8 DoF, preserves only STRAIGHT lines, allows perspective). Forward vs inverse warping: forward iterates over the source and pushes pixel $(x, y)$ to $T(x, y)$ → holes + overlaps (bad). Inverse iterates over the destination and pulls from $T^{-1}(x', y')$ → every dest pixel filled exactly once, but needs INTERPOLATION (nearest / bilinear / bicubic) at non-integer source coords. Always use inverse warping in practice.
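Inverse warping with bilinear interpolation can be sketched for the simplest transform, a pure 2× scaling. The helper names, the clamping of source coordinates at the image border, and the 2×2 source image are assumptions made for illustration.

```python
import math

def bilinear(img, x, y):
    """Sample img at real-valued (x, y); x indexes columns, y indexes rows."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0                     # fractional offsets
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom

def inverse_warp_scale(src, scale):
    """Upscale by pulling every destination pixel from the source."""
    h, w = len(src), len(src[0])
    dst = []
    for yd in range(int(h * scale)):
        row = []
        for xd in range(int(w * scale)):
            xs = min(xd / scale, w - 1)         # T^{-1}, clamped to the image
            ys = min(yd / scale, h - 1)
            row.append(bilinear(src, xs, ys))
        dst.append(row)
    return dst

src = [[0, 100],
       [100, 200]]
dst = inverse_warp_scale(src, 2)
for row in dst:
    print(row)

# Forward warping of the same transform: 4 source pixels land on 4 of the
# 16 destination positions, leaving the rest as holes.
filled = {(2 * x, 2 * y) for y in range(2) for x in range(2)}
holes = 16 - len(filled)
print(holes)
```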
Colour models. *RGB* (additive, screens, non-uniform perceptually). *CMYK* (subtractive, printing; K = black for deep darks instead of triple-overprint). *CIE Lab* (approximately perceptually uniform: equal $\Delta E$ in $L^*a^*b^*$ ≈ equal perceived difference). *HSI/HSV* (intuitive: hue + saturation + intensity). *YCbCr* (luminance + 2 chrominance, used in JPEG with chroma subsampling).
Hough transform: line detection. Each image point votes for all lines passing through it. Parameterise lines in polar form $\rho = x\cos\theta + y\sin\theta$ (not slope-intercept $y = mx + c$, where vertical lines have infinite slope). Build a 2D accumulator over $(\rho, \theta)$; each edge pixel adds 1 to every $(\rho, \theta)$ bin corresponding to a line through it; peaks = detected lines. Extensible to circles (3D accumulator over centre $(a, b)$ and radius $r$) and arbitrary curves.
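A minimal sketch of the voting scheme, using $\rho = x\cos\theta + y\sin\theta$ with 1° and 1-pixel bins. The three points are hypothetical and lie on a vertical line, the case that breaks the slope-intercept form; `hough_lines` is an illustrative helper.

```python
import math
from collections import Counter

def hough_lines(points, theta_step_deg=1, rho_step=1.0):
    """Accumulate votes over (rho bin, theta in degrees) for each point."""
    acc = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = x * math.cos(theta) + y * math.sin(theta)
            acc[(round(rho / rho_step), theta_deg)] += 1
    return acc

# Hypothetical edge pixels on the vertical line x = 5.
points = [(5, 0), (5, 20), (5, 40)]
acc = hough_lines(points)

(rho_bin, theta_deg), votes = acc.most_common(1)[0]
print(rho_bin, theta_deg, votes)    # the peak: rho = 5, theta = 0 degrees
```

Each point traces one sinusoid through the accumulator, and the three sinusoids meet in the bin that describes the shared line.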
Template matching: three scoring measures. *SSD*: sum of squared differences; sensitive to brightness. *NCC* (normalised cross-correlation): divides by the norms of patch and template, so it is invariant to a multiplicative brightness (gain) change; range $[-1, 1]$. *Correlation coefficient* (zero-mean NCC): also subtracts the means, so it is invariant to brightness offset AND contrast. Limitation: no geometric invariance, so it fails under rotation/scale. Motivates scale- and rotation-invariant features (SIFT, ORB) and learned features (CNN, ViT).
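The three scores can be compared on flattened patches. The sketch assumes the plain definition of NCC (no mean subtraction) and treats the correlation coefficient as its zero-mean variant; naming differs between texts, so these definitions are an assumption. The patch values are hypothetical.

```python
import math

def ssd(a, b):
    """Sum of squared differences: 0 means identical."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ncc(a, b):
    """Normalised cross-correlation without mean subtraction."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den

def zncc(a, b):
    """Correlation coefficient: NCC of the mean-subtracted patches."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return ncc([x - mean_a for x in a], [y - mean_b for y in b])

template = [10, 20, 30, 40]
offset = [v + 50 for v in template]      # brightness shift
gain = [2 * v for v in template]         # contrast change

print(ssd(template, offset))             # large: SSD is fooled by the shift
print(ncc(template, gain))               # ~1: unaffected by gain
print(ncc(template, offset))             # below 1: plain NCC notices the offset
print(zncc(template, offset))            # ~1: unaffected by offset
```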
Definitions
- Sampling vs quantisation — Sampling = discretising spatial position (which pixels exist). Quantisation = discretising intensity (how many gray levels per pixel).
- Spatial domain — Operations on pixel values directly. Three flavours: Point→Point, Neighbourhood→Point, Global→Point.
- Transform domain — Operations after transforming to another basis (Fourier/DCT/wavelet); useful for convolution, compression, periodic-noise removal.
- Convolution theorem — Convolution in space ↔ multiplication in frequency. Justifies FFT-based filtering for large kernels.
- Separable filter — A 2D filter $h(x, y) = h_1(x)\,h_2(y)$ that can be applied as two 1D passes. Gaussian is separable; Laplacian is not.
- Bilateral filter — Edge-preserving smoothing. Weights = spatial Gaussian × range (intensity) Gaussian. Across an edge, intensity differs → range weight collapses → no blur across edge.
- Otsu's method — Automatic threshold for bimodal histograms. Picks T that maximises between-class variance (equivalently minimises within-class). $O(L)$ sweep over all intensity levels, each candidate scored in closed form.
- Morphological erosion / dilation — Erosion $A \ominus B$: SE fits inside A → MIN filter, shrinks. Dilation $A \oplus B$: SE touches A → MAX filter, grows. Duals.
- Opening / Closing — Opening = erode then dilate (kills noise dots, preserves shape). Closing = dilate then erode (fills small holes). Both idempotent.
- Hit-or-Miss transform (HAM) — Match a foreground pattern AND its background simultaneously. Used to locate isolated points, line endpoints, T-junctions, corners.
- Affine vs projective — Affine (6 DoF) preserves parallel lines. Projective/homography (8 DoF) preserves only straight lines — allows perspective.
- Forward vs inverse warping — Forward: iterate over source pixels, push to T(x,y) — produces holes + overlaps. Inverse: iterate over destination, pull from T⁻¹(x', y') — every dest filled exactly once + needs interpolation.
- Hough transform — Voting in parameter space. Each image edge pixel votes for all lines through it; peaks in accumulator = detected lines. Polar form avoids the vertical-line infinity problem.
- Template matching — Slide a template over the image; score similarity (SSD, NCC, correlation coefficient). Fast but NOT rotation/scale invariant — motivates feature descriptors.
- Histogram equalisation — Map intensity via CDF: $s_k = \operatorname{round}\big((L - 1)\,\mathrm{CDF}(r_k)\big)$. Output histogram is approximately uniform; contrast maximised; can amplify noise in flat regions.
- DCT vs DFT — DCT is real-valued (cosines only) and mirror-extends → no boundary discontinuity → better energy compaction. JPEG uses DCT; spectrum analysis uses DFT.
Formulas
Derivations
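**Why the Gaussian is separable.** $e^{-(x^2 + y^2)/(2\sigma^2)} = e^{-x^2/(2\sigma^2)} \cdot e^{-y^2/(2\sigma^2)}$. Therefore $G(x, y) = g(x)\,g(y)$: two 1D passes instead of one 2D pass, $O(2k)$ vs $O(k^2)$ multiplications per pixel for a $k \times k$ kernel.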
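The factorisation can be verified numerically: a sampled, normalised 2D Gaussian kernel equals the outer product of the sampled, normalised 1D kernel with itself. The radius and $\sigma$ below are arbitrary illustrative choices.

```python
import math

def gaussian_1d(radius, sigma):
    """Sampled 1D Gaussian, normalised to sum to 1."""
    g = [math.exp(-(t * t) / (2 * sigma ** 2)) for t in range(-radius, radius + 1)]
    total = sum(g)
    return [v / total for v in g]

def gaussian_2d(radius, sigma):
    """Sampled 2D Gaussian evaluated directly, normalised to sum to 1."""
    g = [[math.exp(-(x * x + y * y) / (2 * sigma ** 2))
          for x in range(-radius, radius + 1)]
         for y in range(-radius, radius + 1)]
    total = sum(map(sum, g))
    return [[v / total for v in row] for row in g]

radius, sigma = 2, 1.0
k = 2 * radius + 1                                   # kernel width
g1 = gaussian_1d(radius, sigma)
g2 = gaussian_2d(radius, sigma)
outer = [[gy * gx for gx in g1] for gy in g1]        # 1D (x) 1D

max_diff = max(abs(g2[i][j] - outer[i][j]) for i in range(k) for j in range(k))
print(max_diff)            # zero up to floating-point error
print(2 * k, k * k)        # multiplications per pixel: separable vs direct
```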
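Otsu's optimal threshold. With class weights $\omega_0, \omega_1$, class means $\mu_0, \mu_1$ and class variances $\sigma_0^2, \sigma_1^2$ (all functions of $T$), define within-class variance $\sigma_W^2(T) = \omega_0 \sigma_0^2 + \omega_1 \sigma_1^2$ and between-class variance $\sigma_B^2(T) = \omega_0\,\omega_1\,(\mu_0 - \mu_1)^2$. Total variance $\sigma_T^2 = \sigma_W^2 + \sigma_B^2$ is constant (independent of $T$). Therefore $\arg\max_T \sigma_B^2(T) = \arg\min_T \sigma_W^2(T)$. Otsu sweeps all 256 thresholds and picks the max; each candidate is scored in closed form.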
Convolution theorem (statement + sketch). $\mathcal{F}\{f * h\} = F \cdot H$. Proof sketch: substitute the convolution sum into the DFT definition; swap sums; the inner sum is the DFT of a shifted kernel, which factors. Practical consequence: large-kernel filtering by FFT-multiply-IFFT.
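The theorem can be checked numerically in 1D with a naive $O(N^2)$ DFT. For the DFT it holds for circular convolution, so the hypothetical signal and kernel below are zero-padded to a common length long enough that the circular and linear results coincide.

```python
import cmath

def dft(x):
    """Naive forward DFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]

def circular_convolve(f, h):
    """Direct circular convolution of two equal-length sequences."""
    n = len(f)
    return [sum(f[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]

f = [1, 2, 3, 4, 0, 0, 0, 0]     # signal, zero-padded to length 8
h = [1, 1, 1, 0, 0, 0, 0, 0]     # 3-tap box kernel, zero-padded to length 8

direct = circular_convolve(f, h)
product = [a * b for a, b in zip(dft(f), dft(h))]   # multiply in frequency
via_dft = [v.real for v in idft(product)]

print(direct)
print([round(v, 6) for v in via_dft])
```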
Why median kills salt-and-pepper but mean smears it. Salt-and-pepper replaces a small fraction of pixels with extreme values (0 or 255). In a 9-pixel window with 1 outlier (255) and 8 reasonable values $v_1 \le \dots \le v_8$, the sorted order is $[v_1, \dots, v_8, 255]$; the 5th element (median) is one of the reasonable values, so the outlier is ignored. The mean is $\big(\sum_i v_i + 255\big)/9$, which is biased upward by $(255 - \bar{v})/9$, where $\bar{v}$ is the mean of the 8 reasonable values. Median is a rank statistic, mean is linear; outliers don't survive ranking but do survive averaging.
Bilateral preserves edges. At an edge, neighbours on the OTHER side have very different intensity. The range Gaussian collapses to $\approx 0$ for those neighbours. Effective kernel becomes one-sided → smoothing happens only WITHIN the side the centre pixel $p$ belongs to → edge preserved.
Examples
- 3×3 mean filter on centre. Patch with values $p_1, \dots, p_9$. Output centre = $\frac{1}{9}\sum_{i=1}^{9} p_i$.
- Sobel on a vertical edge. Patch with values $(a, a, b)$ in each row, $b > a$ → $G_x = (1 + 2 + 1)(b - a) = 4(b - a)$. $G_y = 0$ (no vertical change). Gradient points right; edge runs vertical. ✓
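- Median on salt-and-pepper. Patch of nine values containing the outliers 0 and 255. Sorted, 0 lands first and 255 last. Median = 25, so the outliers 0 and 255 are ignored. The mean would be pulled towards the 255 outlier and distorted.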
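- Histogram equalisation on 16-pixel image. Counts $h(r_k)$ for $k = 0, \dots, L - 1$. PDF = $h(r_k)/16$. CDF = $c(r_k) = \sum_{j \le k} h(r_j)/16$. Map: $s_k = \operatorname{round}\big((L - 1)\,c(r_k)\big)$. Output uses a wider span of levels than the input: contrast stretched.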
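- Otsu on a bimodal histogram. Two clean clusters at $\mu_0$ and $\mu_1$, with weights $\omega_0$ and $\omega_1$. $\sigma_B^2 = \omega_0\,\omega_1\,(\mu_0 - \mu_1)^2$. Maximised when classes are well separated, so the algorithm picks $T$ between the modes.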
- JPEG quality knob. Quality 90 → quantisation table has small divisors → most DCT coefficients survive → file size large, artefacts invisible. Quality 30 → big divisors → only low-freq coefficients survive → small file, blocky artefacts visible at 8×8 grid boundaries.
Diagrams
- Cartesian map of filter taxonomy: linear/non-linear × smoothing/sharpening/edge. Place mean, Gaussian, median, bilateral, Sobel, Laplacian, LoG, unsharp.
- Sobel/Laplacian kernels side-by-side with a 3×3 input patch showing the convolution arithmetic.
- Histogram equalisation pipeline: histogram → PDF → CDF → mapped image, with arrows.
- Forward vs inverse warping: source-image grid mapped to dest with arrows; show 'holes' under forward and 'interpolation' under inverse.
- Erosion vs dilation: small binary shape with SE shown sliding; show shrunk vs grown output.
- JPEG block pipeline: image → YCbCr → 8×8 split → DCT → quant table → zigzag → entropy coded bitstream.
- Hough accumulator: image with 3 collinear points → 3 sinusoids in $(\rho, \theta)$ space intersecting at one bin = the line.
Edge cases
- Mean filter at borders with zero padding produces a dark halo — the filter sees mostly zeros, output dims. Use replicate padding for image-processing tasks.
- Median on 1-pixel-wide lines removes them (only 3 of 9 pixels are line; median picks background).
- Ideal LPF rings. Sharp cutoff in frequency → sinc in space → oscillations. Always use a smooth filter (Gaussian, Butterworth).
- Otsu on unimodal histograms is meaningless — it will pick some arbitrary T. Use a different method when the histogram has no valley.
- Global threshold on uneven illumination fails: one half of the image may be entirely above $T$, the other below. Use local/adaptive (Sauvola, Niblack).
- Forward warping produces holes even on simple transforms because non-integer destinations leave gaps. Always invert.
- **Hough with too-coarse bins** merges adjacent lines; too-fine bins spread votes and miss the peak. Use about 1° in $\theta$ and 1 pixel in $\rho$.
- Template matching under rotation/scale fails entirely — pure pixel matching has no invariance.
Common mistakes
- **Confusing $\gamma < 1$ and $\gamma > 1$.** $\gamma < 1$ BRIGHTENS dark regions; $\gamma > 1$ DARKENS them. Forgetting to normalise to $[0, 1]$ before exponentiation gives nonsense results.
- Saying 'edge IS the gradient direction'. No — edge is PERPENDICULAR to the gradient direction. Gradient points ACROSS the edge.
- Treating cross-correlation and convolution as identical. They differ for asymmetric kernels. PyTorch's nn.Conv2d is cross-correlation.
- Mean kernel sums to 1, edge kernel sums to 1. Wrong: smoothing sums to 1, derivative kernels (Sobel/Laplacian) sum to 0.
- JPEG uses DFT. Wrong — DCT. Better energy compaction, no boundary discontinuity.
- Otsu minimises between-class variance. Wrong — MAXIMISES between-class (equivalent to MINIMISING within-class).
- 'Histogram equalisation gives a uniform output.' Only approximately — discrete intensities + rounding mean it's 'as uniform as possible'.
- Erosion = MAX, dilation = MIN. Backwards. Erosion = MIN (shrinks), Dilation = MAX (grows).
- Forward warping is preferred. Wrong — INVERSE warping is preferred (avoids holes; needs interpolation).
- **Hough in $y = mx + c$.** Slope goes to infinity for vertical lines. Use polar $(\rho, \theta)$.
Shortcuts
- Conv output: $O = \lfloor (N + 2P - K)/S \rfloor + 1$. Same pad odd K: $P = (K - 1)/2$.
- Smoothing kernels sum to 1; derivative kernels sum to 0.
- Gaussian is SEPARABLE → $O(2k)$, not $O(k^2)$.
- Edge ⊥ gradient. Memorise.
- Erosion = MIN (shrink), Dilation = MAX (grow). Opening kills noise; closing fills holes.
- Boundary = Dilation − Original (or Original − Erosion).
- Otsu maximises between-class variance.
- JPEG uses DCT (not DFT) — better energy compaction, no boundary discontinuity.
- Inverse warping ✓; forward warping has holes.
- **Hough polar form avoids vertical-line problem.**
Proofs / Algorithms
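Separability of Gaussian. $G(x, y) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)}$. Factor: $e^{-(x^2 + y^2)/(2\sigma^2)} = e^{-x^2/(2\sigma^2)}\, e^{-y^2/(2\sigma^2)}$. Therefore $G(x, y) = g(x)\,g(y)$ with 1D Gaussians $g(t) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-t^2/(2\sigma^2)}$ (after normalising constants). Two 1D convolutions replace one 2D convolution.
Total variance = within + between. $\sigma_T^2 = \sum_i (i - \mu_T)^2\, p_i$, where $p_i$ is the normalised histogram and $\mu_T$ the global mean. Expand using $\mu_T = \omega_0 \mu_0 + \omega_1 \mu_1$. After algebra (Bishop §9): $\sigma_T^2 = \sigma_W^2 + \sigma_B^2$. Since $\sigma_T^2$ is fixed, maximising $\sigma_B^2$ ≡ minimising $\sigma_W^2$.
Convolution theorem (statement). For $f, h$ with DFTs $F, H$: $\mathcal{F}\{f * h\} = F \cdot H$. Inverse: $f * h = \mathcal{F}^{-1}\{F \cdot H\}$. Therefore an $N \times N$ convolution with a large $K \times K$ kernel costs $O(N^2 \log N)$ via FFT vs $O(N^2 K^2)$ direct, which wins once $K^2$ exceeds roughly $\log N$ (up to constant factors).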