NeRF & 3DGS — Per-Scene Optimisation and Differentiable Rendering
Intuition
NeRF encodes a scene as a neural network — query (x, y, z, view) and get density + colour. 3DGS keeps the rendering equation but replaces the neural query with an explicit, depth-sorted set of 3D Gaussians, getting real-time rendering at NeRF-comparable quality.
Explanation
3D representations split into explicit (discrete primitives — points, meshes, voxels) and implicit (continuous functions — signed distance fields, NeRF). NeRF MLPs map (x, y, z, view direction) → (density, RGB); rendering uses dense ray marching with many samples per pixel and a heavy network call at each sample — quality is excellent but training takes hours and rendering takes seconds per frame.
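A minimal NumPy sketch of the volume-rendering quadrature NeRF uses along one ray; the sigmas and rgbs arrays stand in for the per-sample MLP outputs, and all values here are hypothetical:

    import numpy as np

    def render_ray(sigmas, rgbs, deltas):
        """sigmas: (N,) densities; rgbs: (N, 3) colours; deltas: (N,) sample spacings."""
        alphas = 1.0 - np.exp(-sigmas * deltas)                          # per-segment opacity
        trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))  # transmittance T_i
        weights = trans * alphas                                         # per-sample weight
        return (weights[:, None] * rgbs).sum(axis=0)                     # expected colour C(r)

    # Hypothetical samples along one ray: dense medium near the third sample.
    sigmas = np.array([0.1, 0.5, 3.0, 0.2])
    rgbs = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
    print(render_ray(sigmas, rgbs, np.full(4, 0.25)))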
3D Gaussian Splatting (3DGS) is per-scene optimisation, NOT a neural network: there are no learned weights generalising across scenes. Each new scene is a fresh optimisation over the parameters of a set of 3D Gaussians. Implication: no train/test split — every input image is used to fit the Gaussians. This is closer to classical bundle adjustment than to standard deep learning.
The three pillars: (1) Scene modelling — represent the scene as a collection of 3D Gaussians with mean μ, covariance Σ, opacity α, and view-dependent colour via spherical harmonics; (2) Image formation — project Gaussians to 2D given a camera pose and composite via differentiable rasterisation; (3) Optimisation — gradient descent on image-space loss + Adaptive Density Control to manage the number of Gaussians.
Per-Gaussian parameter accounting: μ (3) + Σ via R·S (3 scales + 4 quaternion = 7) + opacity (1) + spherical-harmonics colour degree 3 (16 coefficients × 3 channels = 48) = 59 learnable values. The covariance is parameterised as Σ = R · S · Sᵀ · Rᵀ so the optimiser cannot escape the PSD cone — a direct symmetric-matrix parameterisation can produce invalid covariances during gradient descent.
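A minimal NumPy sketch of this parameterisation; the quaternion and scale values are hypothetical, and the scales live in log-space (as noted in the Derivations section) so positivity is automatic:

    import numpy as np

    def quat_to_rot(q):
        """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
        w, x, y, z = q / np.linalg.norm(q)
        return np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])

    def covariance(quat, log_scales):
        """Sigma = R S S^T R^T; log-space scales guarantee positive axis lengths."""
        R = quat_to_rot(quat)
        S = np.diag(np.exp(log_scales))
        return R @ S @ S.T @ R.T

    quat = np.array([0.92, 0.0, 0.38, 0.0])    # hypothetical rotation
    log_scales = np.log([0.05, 0.02, 0.01])    # hypothetical ellipsoid axes
    Sigma = covariance(quat, log_scales)
    print(np.linalg.eigvalsh(Sigma))           # all eigenvalues >= 0: valid covariance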
Why spherical harmonics for colour: real surfaces have view-dependent appearance (specular highlights, anisotropic reflection). RGB is one colour regardless of view direction; SH up to degree 3 gives a 16-basis angular function per channel that's evaluated at render time using the actual view direction.
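A degree-1 sketch of SH colour evaluation (4 bases per channel instead of the full 16 at degree 3); the coefficient values are hypothetical and the basis signs follow the common real-SH convention:

    import numpy as np

    C0, C1 = 0.28209479177387814, 0.4886025119029199  # real SH normalisation constants

    def sh_to_rgb(sh_coeffs, view_dir):
        """sh_coeffs: (4, 3) degree-1 coefficients per channel; view_dir: 3-vector."""
        x, y, z = view_dir / np.linalg.norm(view_dir)
        basis = np.array([C0, -C1 * y, C1 * z, -C1 * x])  # Y_0^0, Y_1^-1, Y_1^0, Y_1^1
        return basis @ sh_coeffs                           # (3,) view-dependent RGB

    sh = np.zeros((4, 3))
    sh[0] = [1.8, 1.2, 0.9]                    # hypothetical: mostly diffuse base colour
    sh[3] = [0.4, 0.4, 0.4]                    # hypothetical: mild highlight along -x
    print(sh_to_rgb(sh, np.array([1.0, 0.0, 0.0])))
    print(sh_to_rgb(sh, np.array([-1.0, 0.0, 0.0])))   # colour shifts with view direction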
Rendering: sort all 3D Gaussians by depth from the camera (front-to-back), project each to 2D (using the projection Jacobian to transform Σ), then composite via front-to-back alpha blending: C = Σ_i c_i · α_i · ∏_{j<i} (1 − α_j), where the product is the transmittance left after closer Gaussians. Pre-processing: COLMAP runs Structure-from-Motion to produce camera intrinsics + extrinsics + a sparse point cloud, which is used to initialise the Gaussian means.
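A per-pixel sketch of the compositing step, assuming each Gaussian's 2D footprint has already been reduced to a scalar opacity α_i at this pixel (the real rasteriser evaluates it from the projected Gaussian; all values here are hypothetical):

    import numpy as np

    def composite(depths, colors, alphas):
        """C = sum_i c_i * a_i * prod_{j<i} (1 - a_j), over depth-sorted Gaussians."""
        order = np.argsort(depths)        # sort front-to-back before blending
        C, T = np.zeros(3), 1.0           # accumulated colour, remaining transmittance
        for i in order:
            C += colors[i] * alphas[i] * T
            T *= 1.0 - alphas[i]
            if T < 1e-4:                   # early termination once the pixel saturates
                break
        return C

    depths = np.array([2.0, 0.5, 1.2])
    colors = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]], dtype=float)
    alphas = np.array([0.9, 0.4, 0.6])
    print(composite(depths, colors, alphas))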
Adaptive Density Control (ADC): under-reconstructed regions get more Gaussians; over-reconstructed regions are simplified. Three operations — Clone (small Gaussian, high position gradient → copy); Split (large Gaussian, high position gradient → split into two smaller ones); Prune (low opacity or excessive size → delete). Gradients on Gaussian means signal 'wants to move' → that region needs capacity.
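A sketch of the per-Gaussian ADC decision; the threshold constants are hypothetical placeholders, not the tuned values of the reference implementation:

    # Hypothetical thresholds on position-gradient norm, largest scale, and opacity.
    GRAD_THRESH, SIZE_THRESH, MIN_OPACITY = 0.0002, 0.01, 0.005

    def adc_decision(pos_grad_norm, max_scale, opacity):
        if opacity < MIN_OPACITY:
            return "prune"                 # nearly transparent: delete
        if pos_grad_norm > GRAD_THRESH:    # 'wants to move' -> region needs capacity
            if max_scale > SIZE_THRESH:
                return "split"             # large + high gradient: under-fits detail
            return "clone"                 # small + high gradient: add capacity
        return "keep"

    print(adc_decision(0.001, 0.05, 0.8))     # -> split
    print(adc_decision(0.001, 0.002, 0.8))    # -> clone
    print(adc_decision(0.0001, 0.05, 0.001))  # -> prune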
Loss: L = (1 − λ) · L₁ + λ · L_D-SSIM, with λ ≈ 0.2. L₁ gives per-pixel colour signal; D-SSIM captures structural/perceptual fidelity. Metrics for evaluation: PSNR (↑, log-scale of MSE; ≥30 dB good), SSIM (↑, local-window structural similarity), LPIPS (↓, distance in deep feature space, closer to human perception).
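A sketch of the loss and the PSNR metric for images in [0, 1]; the D-SSIM term is passed in as a precomputed scalar because windowed SSIM is a larger routine (e.g. scikit-image provides one):

    import numpy as np

    def loss(pred, gt, dssim, lam=0.2):
        """L = (1 - lam) * L1 + lam * D-SSIM, with the paper's lam ~ 0.2."""
        l1 = np.abs(pred - gt).mean()
        return (1 - lam) * l1 + lam * dssim

    def psnr(pred, gt, peak=1.0):
        """10 * log10(R^2 / MSE); unbounded above as MSE -> 0."""
        mse = ((pred - gt) ** 2).mean()
        return 10 * np.log10(peak**2 / mse)

    gt = np.random.rand(64, 64, 3)
    pred = np.clip(gt + 0.01 * np.random.randn(*gt.shape), 0, 1)
    print(psnr(pred, gt))                  # roughly 40 dB for 1% noise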
Definitions
- Explicit vs implicit 3D representation — Explicit = discrete primitives (points, mesh, voxels). Implicit = continuous function (SDF, NeRF).
- Spherical harmonics (SH) — Orthonormal basis on the unit sphere; degree-3 ⇒ 16 basis functions; evaluate at view direction to get view-dependent colour.
- Adaptive Density Control — Clone / split / prune Gaussians during optimisation based on position-gradient magnitude and opacity.
- COLMAP — Open-source Structure-from-Motion + MVS pipeline; produces camera poses and sparse point cloud for 3DGS initialisation.
- PSNR / SSIM / LPIPS — Pixel-wise log-MSE / structural similarity / deep-feature perceptual distance. ↑ ↑ ↓.
Formulas
- C = \sum_i c_i \alpha_i \prod_{j<i}(1 - \alpha_j) \quad \text{(alpha compositing)}
- \Sigma = R\, S\, S^\top R^\top \quad \text{(PSD by construction)}
- L = (1 - \lambda)\, L_1 + \lambda\, L_{D\text{-SSIM}}, \quad \lambda \approx 0.2
- \text{Per-Gaussian params} = 3 + 7 + 1 + 48 = 59
- \text{PSNR} = 10 \log_{10}(R^2 / \text{MSE})
Derivations
Why decompose Σ as RSSᵀRᵀ: a 3 × 3 symmetric matrix has 6 free parameters but the optimiser must keep eigenvalues non-negative. The decomposition with R from a unit quaternion (4 params) and diagonal positive S (3 params, optimised in log-space) gives 7 free parameters and enforces PSD by construction. Geometrically: S scales the ellipsoid axes, R rotates them.
Why the log in PSNR is harmless for ranking: PSNR is a strictly decreasing function of MSE (PSNR ↑ as MSE ↓), so comparing two reconstructions by PSNR is equivalent to comparing their MSEs — the log rescales but never reorders.
Examples
- A scene with 1 M Gaussians has 59 M scalar parameters — comparable in size to a small CNN, but the parameters describe THIS scene only.
- If the camera moves slightly, the 2D projection of each Gaussian shifts; differentiable rasterisation pushes gradients from the rendered pixel back to (μ, Σ, α, SH).
- Adaptive split-vs-clone decision: large Gaussian with high gradient on its position → split into two smaller ones (under-fits high-frequency detail); small Gaussian with high gradient → clone (region needs more capacity).
Diagrams
- 3DGS pipeline: images → COLMAP (camera poses + sparse point cloud) → initialise Gaussians at sparse points → rasterise → image loss → backprop → ADC step → repeat.
- Alpha compositing illustration: 3 sorted Gaussians at different depths along a ray; transmittance product showing residual light after each.
- PSD parameterisation: 3 scale axes → axis-aligned ellipsoid; rotation by R gives oriented ellipsoid.
Edge cases
- Random init without COLMAP point cloud wastes early optimisation; quality and runtime suffer.
- Over-fitting to individual training views — common when input views are sparse; quality on held-out viewpoints degrades. ADC pruning is the safety valve.
- SH at low degree can't model very view-dependent surfaces (mirror finishes). Higher SH degree or learned BRDFs needed.
Common mistakes
- Claiming 3DGS 'learns' weights that generalise — it's per-scene optimisation, no cross-scene generalisation.
- Forgetting to sort by depth before alpha compositing — composite order matters.
- Optimising Σ as a free 3 × 3 matrix — produces invalid covariances; use the R·S·Sᵀ·Rᵀ decomposition.
- Stating PSNR is bounded — it's not; better reconstructions can have very large PSNR (limited only by numerical precision of MSE).
Shortcuts
- Per-Gaussian param count: 3 + 7 + 1 + 48 = 59.
- Loss λ ≈ 0.2 (D-SSIM weight).
- Metrics direction: PSNR ↑, SSIM ↑, LPIPS ↓.
Proofs / Algorithms
Σ = R · S · Sᵀ · Rᵀ is positive semi-definite: for any vector v, vᵀ Σ v = vᵀ R S Sᵀ Rᵀ v = ‖Sᵀ Rᵀ v‖² ≥ 0. Hence Σ is PSD by construction.
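A self-contained numeric spot-check of this identity; the rotation angle and scale values are arbitrary:

    import numpy as np

    theta = 0.7                                   # hypothetical rotation about z
    R = np.array([[np.cos(theta), -np.sin(theta), 0],
                  [np.sin(theta),  np.cos(theta), 0],
                  [0, 0, 1]])
    S = np.diag([0.05, 0.02, 0.01])               # positive axis scales
    Sigma = R @ S @ S.T @ R.T

    rng = np.random.default_rng(0)
    for _ in range(5):
        v = rng.normal(size=3)
        # v^T Sigma v equals ||S^T R^T v||^2, hence is never negative.
        assert np.isclose(v @ Sigma @ v, np.linalg.norm(S.T @ R.T @ v) ** 2)
        assert v @ Sigma @ v >= 0
    print(np.linalg.eigvalsh(Sigma))              # all eigenvalues non-negative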