3D Representations — VoxNet, PointNet, PointNet++, DGCNN, MeshCNN

Intuition

Point clouds break CNNs because CNNs assume a regular grid. The two big ideas are (1) voxelize and use 3D conv (memory-heavy, low resolution), or (2) build symmetric, permutation-invariant operators that act directly on the unordered point set.

Explanation

Point clouds are unstructured (no grid), have irregular density, are unordered (N! equivalent permutations), and are expensive to annotate. Any operator that consumes them must produce the same output regardless of input ordering — formally, it must be a symmetric function.

VoxNet voxelizes a point cloud into a 3D occupancy grid (binary or probabilistic density), then applies a 3D CNN ending in a softmax classifier. Voxelization makes the data regular, so 3D convolution works. Drawbacks: memory grows as O(R³) in the grid side length R (a 1024³ grid is infeasible), most voxels are empty (compute wasted on zeros), practical resolution is capped at ~32³–64³ (fine detail is lost), and the continuous-to-discrete mapping introduces quantization artifacts.
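To make the sparsity point concrete, here is a minimal voxelization sketch in NumPy. It is not VoxNet's actual preprocessing: the `voxelize` helper, the isotropic rescaling, and the synthetic sphere are assumptions chosen for illustration.

```python
import numpy as np

def voxelize(points, resolution=32):
    """Map an (N, 3) point cloud to a binary occupancy grid of shape (R, R, R)."""
    points = np.asarray(points, dtype=np.float64)
    lo = points.min(axis=0)
    extent = (points.max(axis=0) - lo).max()
    # Scale isotropically into [0, 1]; guard against a degenerate cloud.
    unit = (points - lo) / max(extent, 1e-12)
    # Quantize to integer cell indices; clip so points on the upper face stay in range.
    idx = np.clip((unit * resolution).astype(int), 0, resolution - 1)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

rng = np.random.default_rng(0)
# Hypothetical test shape: 2048 points on the surface of a unit sphere.
pts = rng.normal(size=(2048, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

grid = voxelize(pts, resolution=32)
occupancy = grid.mean()
print(f"occupied fraction at 32^3: {occupancy:.3f}")
```

With 2048 points the grid can have at most 2048 occupied cells out of 32³ = 32768, so at least about 94% of the voxels are empty regardless of the shape.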

PointNet operates directly on the raw N × 3 (or N × D) point set. Each point is passed independently through a SHARED MLP (same weights, applied per point), producing a per-point feature. A symmetric MAX-pool aggregates over the N points into a global feature, which a final MLP turns into class logits. Because the per-point step treats every point identically and the max is symmetric, the whole network is permutation-invariant by construction. The universal approximation theorem of Qi et al. (2017) states that any set function that is continuous with respect to the Hausdorff distance can be approximated arbitrarily well by this scheme, given enough feature dimensions at the max-pool layer.
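The pipeline above can be sketched in a few lines of NumPy. This is a toy forward pass under stated assumptions: the weights are random stand-ins for trained parameters, the layer sizes are arbitrary, and the T-Net alignment stage is omitted. The helper names (`init_mlp`, `mlp`, `pointnet_forward`) are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(rng, sizes):
    """Random (weight, bias) pairs; stand-ins for trained parameters."""
    return [(rng.normal(scale=0.5, size=(a, b)), rng.normal(scale=0.1, size=b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp(x, layers, final_relu=True):
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if final_relu or i < len(layers) - 1:
            x = relu(x)
    return x

def pointnet_forward(points, h_layers, gamma_layers):
    per_point = mlp(points, h_layers)      # (N, F): same weights for every point
    global_feat = per_point.max(axis=0)    # (F,): symmetric max-pool over points
    return mlp(global_feat, gamma_layers, final_relu=False)  # class logits

rng = np.random.default_rng(0)
h_layers = init_mlp(rng, [3, 16, 32])
gamma_layers = init_mlp(rng, [32, 16, 4])

cloud = rng.normal(size=(128, 3))
shuffled = cloud[rng.permutation(len(cloud))]

logits = pointnet_forward(cloud, h_layers, gamma_layers)
logits_shuffled = pointnet_forward(shuffled, h_layers, gamma_layers)
print(np.allclose(logits, logits_shuffled))
```

Shuffling the rows of the input only reorders the rows of the per-point feature matrix, and the column-wise max ignores row order, so both calls give the same logits.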

Why MAX rather than sum/average: MAX picks the most salient point per feature dimension, giving a sparse, interpretable global descriptor. The 'critical points' are the subset of input points whose features survive the max-pool; with a K-dimensional global feature there are at most K of them. Empirically they trace out the object's skeleton. The network is robust to dropping non-critical points but vulnerable to targeted removal of critical ones.
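A small sketch of how the critical set can be read off the max-pool. The single random linear layer standing in for the shared MLP is an assumption, as are the sizes; only the argmax bookkeeping is the point here.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(200, 3))
w = rng.normal(size=(3, 16))             # stand-in for a trained shared MLP
features = np.maximum(points @ w, 0.0)   # (N, F) per-point features

global_feat = features.max(axis=0)
# A point is critical if it wins the max in at least one feature dimension.
critical_idx = np.unique(features.argmax(axis=0))

# Keeping only the critical points reproduces the global feature exactly.
kept = features[critical_idx].max(axis=0)

# Removing the critical points lowers the global feature.
mask = np.ones(len(points), dtype=bool)
mask[critical_idx] = False
removed = features[mask].max(axis=0)

print(len(critical_idx), "critical points out of", len(points))
print(np.array_equal(kept, global_feat), np.allclose(removed, global_feat))
```

The number of critical points can never exceed the feature width (16 here), which is why random dropout of the other points leaves the descriptor untouched.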

PointNet limitations: no local context (each point processed in isolation before the single global pool), no hierarchy, depends on absolute coordinates. PointNet++ adds farthest-point sampling + ball queries to define local neighbourhoods, then applies PointNet locally and stacks scales — hierarchical PointNet.
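The sampling and grouping stage of PointNet++ can be sketched as follows. This is a simplified illustration, not the reference implementation: the function names are hypothetical, groups are truncated rather than padded to a fixed size, and the "local PointNet" is replaced by a plain max over centred coordinates.

```python
import numpy as np

def farthest_point_sampling(points, m, start=0):
    """Greedily pick m indices, each farthest from those already chosen."""
    chosen = np.empty(m, dtype=int)
    chosen[0] = start
    # Distance from every point to its nearest chosen centroid so far.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, m):
        chosen[i] = int(min_dist.argmax())
        d = np.linalg.norm(points - points[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, d)
    return chosen

def ball_query(points, centroids, radius, max_samples):
    """For each centroid, return up to max_samples point indices within radius."""
    groups = []
    for c in centroids:
        d = np.linalg.norm(points - c, axis=1)
        groups.append(np.flatnonzero(d <= radius)[:max_samples])
    return groups

rng = np.random.default_rng(0)
cloud = rng.uniform(size=(512, 3))

centroid_idx = farthest_point_sampling(cloud, m=32)
groups = ball_query(cloud, cloud[centroid_idx], radius=0.2, max_samples=16)

# Placeholder for the local PointNet: centre each group on its centroid
# (relative coordinates), then max-pool over the group.
local_feats = np.stack([
    (cloud[g] - cloud[c]).max(axis=0) for g, c in zip(groups, centroid_idx)
])
print(local_feats.shape)
```

Centring each group on its centroid is what removes the dependence on absolute coordinates; stacking this stage with a growing radius gives the hierarchy.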

DGCNN's EdgeConv computes edge features over a kNN graph using relative offsets (x_j − x_i), aggregates symmetrically (max), and crucially rebuilds the kNN graph in FEATURE space at every layer, so 'neighbours' become semantic rather than geometric (dynamic graph).

MeshCNN operates on the edges of a triangle mesh: in a manifold mesh each interior edge has 4 neighbour edges (from its two adjacent triangles). The 5-d input feature is intrinsic (dihedral angle, two inner angles, two edge-length ratios), and pooling is edge collapse.
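The EdgeConv step and the dynamic graph rebuild can be sketched like this. It is a rough illustration under assumptions: h is a single random linear layer with ReLU, a point is excluded from its own neighbourhood, and the names `knn_indices` and `edge_conv` are hypothetical.

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest neighbours of each row of x (excluding itself)."""
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def edge_conv(x, w, k):
    """One EdgeConv layer: ReLU([x_i, x_j - x_i] @ w), max-aggregated over j."""
    idx = knn_indices(x, k)                               # (N, k), built in the space of x
    centre = np.repeat(x[:, None, :], k, axis=1)          # (N, k, D)
    offset = x[idx] - centre                              # (N, k, D) relative offsets
    edge_in = np.concatenate([centre, offset], axis=-1)   # (N, k, 2D)
    edge_feat = np.maximum(edge_in @ w, 0.0)              # (N, k, F)
    return edge_feat.max(axis=1)                          # (N, F)

rng = np.random.default_rng(0)
xyz = rng.normal(size=(64, 3))
w1 = rng.normal(size=(6, 8))     # stand-in weights, layer 1 (input is 2 * 3)
w2 = rng.normal(size=(16, 8))    # stand-in weights, layer 2 (input is 2 * 8)

f1 = edge_conv(xyz, w1, k=8)     # graph built from xyz coordinates
f2 = edge_conv(f1, w2, k=8)      # graph rebuilt from layer-1 features
print(f1.shape, f2.shape)
```

The second call passes `f1` rather than `xyz` to the neighbour search, which is the whole difference between a dynamic and a fixed graph.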

Definitions

Symmetric function — f(x_1, …, x_n) = f(π(x_1, …, x_n)) for every permutation π. Permutation invariant by construction.
Voxelization — Discretizing a point cloud onto a 3D grid of voxels (occupied / unoccupied or density).
PointNet — Shared per-point MLP + symmetric (max) aggregation + final MLP. Universal approximator for symmetric set functions.
PointNet++ — Hierarchical PointNet: farthest-point sampling + ball-query neighbourhoods, PointNet applied locally and stacked.
EdgeConv — Edge feature h(x_i, x_j − x_i) over kNN graph; max-aggregated; graph optionally rebuilt in feature space (dynamic).
MeshCNN — Edge-centric mesh operator; 5-d intrinsic edge features; pooling = edge collapse.

Formulas

\text{PointNet}(P) = \gamma\!\left(\max_{p \in P} h(p)\right)
\text{EdgeConv:}\ \ e_{ij} = h_\Theta(x_i,\ x_j - x_i),\ \ x_i' = \max_{j \in \mathcal{N}(i)} e_{ij}
\text{MeshCNN edge feature:}\ (\alpha_{\text{dihedral}},\ \alpha_a,\ \alpha_b,\ h_a/|e|,\ h_b/|e|),\ \ h_a, h_b = \text{heights of the two adjacent triangles over edge } e
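A sketch of how the MeshCNN edge feature could be computed for one edge. Several choices here are assumptions of the sketch rather than statements about the reference code: the inner angles are taken as the angles opposite the edge, the ratios as triangle height divided by edge length, and the function names are hypothetical.

```python
import numpy as np

def angle_between(u, v):
    """Angle in radians between two vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def height_over_edge(a, b, apex):
    """Distance from apex to the line through a and b."""
    e = b - a
    return float(np.linalg.norm(np.cross(e, apex - a)) / np.linalg.norm(e))

def edge_feature(a, b, c, d):
    """5-d feature of edge (a, b) shared by triangles (a, b, c) and (b, a, d)."""
    e = b - a
    edge_len = float(np.linalg.norm(e))
    n1 = np.cross(e, c - a)    # normal of (a, b, c)
    n2 = np.cross(d - a, e)    # normal of (b, a, d), same orientation convention
    dihedral = np.pi - angle_between(n1, n2)   # equals pi when the faces are coplanar
    # Angle opposite the edge in each face; sorted so face order does not matter.
    inner = sorted([angle_between(a - c, b - c), angle_between(a - d, b - d)])
    ratios = sorted([height_over_edge(a, b, c) / edge_len,
                     height_over_edge(a, b, d) / edge_len])
    return np.array([dihedral, *inner, *ratios])

a = np.array([0.0, 0.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.5, 1.0, 0.0])
flat = edge_feature(a, b, c, np.array([0.5, -1.0, 0.0]))     # coplanar faces
folded = edge_feature(a, b, c, np.array([0.5, 0.0, 1.0]))    # right-angle fold

# Rotate about z, scale uniformly, translate: the feature should not change.
t = 0.7
rot = np.array([[np.cos(t), -np.sin(t), 0.0],
                [np.sin(t),  np.cos(t), 0.0],
                [0.0,        0.0,       1.0]])
move = lambda p: 2.5 * (rot @ p) + np.array([1.0, -2.0, 3.0])
moved = edge_feature(move(a), move(b), move(c), move(np.array([0.5, 0.0, 1.0])))
print(np.allclose(folded, moved))
```

All five entries are angles or length ratios, so no absolute position, orientation, or scale enters the feature.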

Derivations

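Permutation invariance of PointNet: let π be any permutation of the points. π(P) contains exactly the same points as P, so {h(p) : p ∈ π(P)} = {h(p) : p ∈ P} and the feature-wise max over both is identical. Hence PointNet(π(P)) = γ(max_{p ∈ π(P)} h(p)) = γ(max_{p ∈ P} h(p)) = PointNet(P): the output is unchanged by reordering the input points.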

Universal approximation (sketch): for any set function f that is continuous with respect to the Hausdorff distance and any ε > 0, there exist a continuous h and γ such that |f(P) − γ(max_{p ∈ P} h(p))| < ε for every input set P, provided the max-pool layer is wide enough. The max-pool acts as a feature-wise symmetric aggregator; the shared MLP h provides the function space; γ post-processes to the final form.

Examples

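Voxel grid 64³ with 4 bytes per voxel = 1 MiB per object for the input alone. Every 3D-conv feature map multiplies that by its channel count and the batch size, and doubling the resolution multiplies it by 8, so memory tightens quickly even though most cells are empty.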
DGCNN dynamic graph: in early layers neighbours are geometrically close (xyz space); in deep layers neighbours can be on the opposite side of the object but semantically similar (both 'wing tips' on an airplane).
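MeshCNN pooling = edge collapse: collapse the edge whose learned feature vector has the smallest norm; its two endpoints merge, the five edges of the two adjacent triangles reduce to two, and each surviving edge takes the average of the features it absorbs.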
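The voxel-grid memory figure in the first example above follows from simple arithmetic, worked through below for a few resolutions. The `dense_grid_bytes` helper and the 4-byte (float32) default are illustrative assumptions.

```python
def dense_grid_bytes(resolution, bytes_per_voxel=4, channels=1):
    """Bytes needed to store one dense R x R x R grid with the given channel count."""
    return resolution ** 3 * bytes_per_voxel * channels

for r in (32, 64, 128, 256, 1024):
    mib = dense_grid_bytes(r) / 2**20
    print(f"{r:>5}^3 -> {mib:,.3f} MiB")

# A single 32-channel feature map at 64^3 already costs 32 MiB per object.
print(dense_grid_bytes(64, channels=32) / 2**20, "MiB")
```

At 1024³ a single-channel float32 grid alone needs 4 GiB, which is why that resolution is called infeasible for dense 3D convolution.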

Diagrams

PointNet architecture: N × 3 input → T-Net → shared MLP (N × 64 → N × 1024) → max-pool → 1024-d global feature → MLP → class logits.
EdgeConv: a centre point and its k nearest neighbours; arrows from neighbours back to centre; edge features computed using (x_j − x_i); max aggregation.
MeshCNN edge with 4 neighbours: depict the two triangles sharing the edge and the four other edges of those triangles.

Edge cases

VoxNet's resolution ceiling — fine geometric detail invisible at 32³.
PointNet vulnerability: targeted removal of critical points (those that survive max-pool) collapses recognition; random dropout is fine.
DGCNN dynamic graph adds compute (recompute kNN per layer); fixed-graph variants trade off semantic adaptivity for speed.

Common mistakes

Saying PointNet uses 'sum-pooling' — it's max-pool (symmetric AND sparse-descriptor).
Confusing PointNet++ (hierarchical PointNet) with DGCNN (dynamic graph CNN). Different ideas.
Voxelizing too coarsely and reporting 'PointNet wins' — comparison unfair without matching resolution.
Treating mesh edges as vertices in MeshCNN — the operator is edge-centric, not vertex-centric.

Shortcuts

Permutation invariance = shared MLP + symmetric pool. Memorise this as the pattern.
Dynamic graph = kNN in FEATURE space (DGCNN), updated per layer.
MeshCNN intrinsic features = invariant to translation, rotation, and uniform scaling.

Proofs / Algorithms

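PointNet approximates any set function that is continuous with respect to the Hausdorff distance (Qi et al., 2017). Permutation invariance alone does not imply this; it only guarantees that the network is a valid set function. Proof outline: on a compact domain f is uniformly continuous, so a small enough Hausdorff perturbation of the input changes f by less than ε. Partition the domain into cells finer than that tolerance and let each output dimension of h act as a soft indicator for one cell. The max-pool then records which cells are occupied, and γ maps that occupancy pattern to the value of f on a representative set. With enough feature dimensions the error falls below ε. A companion stability result shows that the output is determined by a sparse subset of 'critical points', at most as many as the max-pool width.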
