3D Representations — VoxNet, PointNet, PointNet++, DGCNN, MeshCNN

Unit 4 — 3D Data (PointNet, DGCNN, MeshCNN)

Reading the Stranger in the Cloud

Imagine a self-driving car at night. Cameras struggle in the dark, but bolted to its roof is a LiDAR sensor. The LiDAR fires laser pulses in all directions, times their reflection, and produces — for each pulse — an $(x, y, z)$ coordinate where something solid was. A million of these, scattered through space, no order, no grid. That's a point cloud.

This is the input the heroine of this story wrestles with, and her opponent is brutal. How do you build a neural network that ingests *this*?

Why your CNN is useless here

A CNN works because images sit on a regular grid. Pixel $(i, j)$ has eight neighbours at $(i + \Delta i, j + \Delta j)$ with $\Delta i, \Delta j \in \{-1, 0, 1\}$, not both zero. A $3 \times 3$ filter knows exactly which nine pixels to multiply. The whole machinery (convolution, pooling, translation equivariance) depends on that grid.

A point cloud destroys all three things a CNN relies on:

Unstructured: there is no grid. There are just floating points.
Irregularly distributed: some regions are dense (the surface of a car), others sparse (the sky above it).
Unordered: the same shape can be written as $\{p_{1}, p_{2}, \dots, p_{N}\}$ or $\{p_{99}, p_{7}, p_{2}, \dots\}$. There are $N!$ orderings of the same object. Your network must give the *same* answer for all of them.

And there is no ImageNet for point clouds — labelling 3D data is brutally expensive. Whatever architecture you invent must be data-efficient too.

The five families

The course's mental map is a clean taxonomy of five ways to make 3D learnable:

View-based (render from many camera angles, use a 2D CNN) — Multi-view CNN.

Volumetric (discretise space into voxels, use a 3D CNN) — VoxNet.

Point-based (operate directly on raw points) — PointNet, PointNet++.

Graph-based (points as nodes, kNN as edges) — DGCNN.

Mesh-based (exploit triangle-mesh topology) — MeshCNN.

A sixth family, fuzzy explicit primitives, arrives with 3D Gaussian Splatting in the next unit. Keep this list in your head; it is the spine of every exam question on this lecture.

And the three tasks are equally clean: 3D classification (one label per cloud), part segmentation (label each point with a part), semantic segmentation (label each point with a scene category). Backbones are shared; only the final head differs.

Approach 1 — VoxNet, the sledgehammer

The most natural first idea: if a 2D CNN works on a 2D grid, build a 3D grid and use a 3D CNN. VoxNet does exactly this. Overlay a 3D grid of cubes on the point cloud — each voxel is occupied (a point fell inside) or empty. Pass this 3D binary tensor through two 3D convolutional layers, then fully-connected layers, then a softmax.
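The voxelisation step can be sketched in a few lines. This is a minimal illustration, not VoxNet's actual preprocessing: the function name `voxelize`, the bounding-box normalisation, and the toy points are all assumptions made for the example.

```python
def voxelize(points, resolution):
    """Map 3D points to the set of occupied cells in a resolution^3 grid."""
    mins = [min(p[axis] for p in points) for axis in range(3)]
    maxs = [max(p[axis] for p in points) for axis in range(3)]
    # One shared scale for all axes, so the shape is not stretched.
    extent = max(maxs[axis] - mins[axis] for axis in range(3)) or 1.0
    occupied = set()
    for p in points:
        cell = tuple(
            # Clamp so points on the upper boundary land in the last cell.
            min(int((p[axis] - mins[axis]) / extent * resolution), resolution - 1)
            for axis in range(3)
        )
        occupied.add(cell)
    return occupied


# Hypothetical toy cloud: the first two points are distinct but very close.
points = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]
grid = voxelize(points, resolution=4)
occupancy = len(grid) / 4 ** 3
print(sorted(grid), occupancy)
```

Four points occupy only three cells here, because the two nearby points fall into the same voxel, and 61 of the 64 cells stay empty.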

It works. It's clean. And it dies at high resolution.

The four problems — memorise these, they're the most-asked-about thing in the whole lecture:

1. Memory blows up cubically. A $1024^{3}$ grid has roughly a billion voxels; a $256^{3}$ grid has about 16.8 million. Multiply by feature channels and batch size and the dense activations quickly exceed GPU memory.
2. Most voxels are empty. A point cloud of a chair lives on the *surface*. The interior and the surrounding air are zeros, so the vast majority of the memory is spent on nothing.
3. Resolution is brutally limited: most VoxNet-style models use $32^{3}$, postage-stamp resolution.
4. Quantisation artefacts. Snapping continuous coordinates to discrete voxels turns smooth surfaces into jagged staircases.
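The cubic growth is easy to check with arithmetic. The sketch below assumes a dense grid stored as float32 with 32 feature channels; both numbers are illustrative choices, not figures from the lecture.

```python
def dense_grid_bytes(resolution, channels=1, bytes_per_value=4):
    """Bytes needed for one dense resolution^3 feature volume."""
    return resolution ** 3 * channels * bytes_per_value


for r in (32, 256, 1024):
    voxels = r ** 3
    gib = dense_grid_bytes(r, channels=32) / 2 ** 30
    print(f"{r}^3: {voxels:,} voxels, {gib:.3f} GiB for one 32-channel float32 volume")
```

Under these assumptions a single activation volume needs 2 GiB at $256^{3}$ and 128 GiB at $1024^{3}$, before counting batches, other layers, or gradients.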

When VoxNet wins: dense, naturally volumetric data, like medical CT scans, where everything is on a regular grid anyway. Outside of that niche, the field needed a fundamentally different idea.

Approach 2 — PointNet, the genius idea

Charu, the lecturer, asks the research question of the entire lecture: *can we feed raw points directly into a network, with no grid in between?*

The naïve answer, flattening $[x_{1}, y_{1}, z_{1}, x_{2}, y_{2}, z_{2}, \dots]$ into one big vector and feeding an MLP, fails immediately. Reshuffle the points, get a different input vector, get a different output. The model learns the *ordering*, not the shape. It also can't generalise to different $N$.

PointNet (Qi et al., CVPR 2017) solves this with one beautiful idea: build the architecture out of symmetric functions.

A function $f (x_{1}, \dots, x_{N})$ is symmetric if reordering the inputs doesn't change the output. The two canonical examples you already know:

$\max\{x_{1}, x_{2}, \dots, x_{N}\}$ and $x_{1} + x_{2} + \dots + x_{N}$

PointNet writes its global feature as

$f(x_{1}, \dots, x_{N}) = \gamma\left(\max_{i} h(x_{i})\right)$

where $h$ is a *shared* MLP (same weights, applied independently to every point), $\max$ is taken element-wise across all $N$ points, and $\gamma$ is another MLP. Because $h$ treats every point identically and $\max$ is symmetric, the whole composition is symmetric. **Permutation invariance is achieved *by construction*: the network literally cannot distinguish point orderings.**

Memorise the three-step recipe: shared MLP per point → max-pool across points → MLP.
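The three-step recipe can be sketched in plain Python. This is a toy under stated assumptions, not the PointNet implementation: $h$ and $\gamma$ are each a single random linear layer with ReLU, the layer sizes (3 to 8 to 4) are arbitrary, and `linear_relu`, `global_feature`, and `random_layer` are names invented for the example.

```python
import random


def linear_relu(x, weights, bias):
    """One fully connected layer followed by ReLU, applied to a single vector."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def global_feature(points, h_params, gamma_params):
    per_point = [linear_relu(p, *h_params) for p in points]  # shared h on every point
    pooled = [max(channel) for channel in zip(*per_point)]   # max over points, per channel
    return linear_relu(pooled, *gamma_params)                # gamma on the pooled vector


rng = random.Random(0)


def random_layer(n_in, n_out):
    """Random weights and bias standing in for trained parameters."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, bias


h_params = random_layer(3, 8)
gamma_params = random_layer(8, 4)
cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(16)]
reordered = cloud[::-1]  # same points, different order

original = global_feature(cloud, h_params, gamma_params)
permuted = global_feature(reordered, h_params, gamma_params)
smaller = global_feature(cloud[:5], h_params, gamma_params)  # a different N also works
print(original == permuted, len(smaller))
```

The two outputs are identical because each point passes through exactly the same arithmetic and the max ignores order. The same weights also accept a 5-point cloud, which the flattened-vector MLP could not.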

The architecture, slightly more concretely: $N \times 3$ input → a small T-Net predicts a $3 \times 3$ alignment matrix (it brings the input into a roughly canonical pose, which gives approximate rather than guaranteed robustness to rigid transformations) → shared MLP lifts each point to 64 dims → a second T-Net predicts a $64 \times 64$ matrix that aligns the features → shared MLP to 1024 dims → max-pool over all $N$ points → a single 1024-d global feature. Classification head: MLP → $K$ scores. Segmentation head: tile the global feature back onto every per-point feature, then more shared MLPs → $N \times m$ per-point scores.

Why max, not sum or average

Max picks the most salient point per feature dimension. The points whose features *survive* the max-pool are called critical points; empirically they trace the object's silhouette. PointNet is robust to dropping non-critical points but *adversarially vulnerable* to targeted removal of critical ones — a standard exam talking point.
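A small sketch makes the critical-point idea concrete. The feature rows below are hypothetical stand-ins for $h(x_{i})$, and `critical_indices` is a name chosen for the example.

```python
def max_pool(features):
    """Channel-wise max over all points."""
    return [max(channel) for channel in zip(*features)]


def critical_indices(features):
    """Indices of points that supply the max in at least one channel."""
    critical = set()
    for channel in zip(*features):
        critical.add(max(range(len(channel)), key=channel.__getitem__))
    return critical


# Hypothetical per-point features: 4 points, 3 channels.
features = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.2, 0.2, 0.2],
    [0.1, 0.7, 0.95],
]
critical = critical_indices(features)
only_critical = [features[i] for i in sorted(critical)]
without_point_0 = features[1:]
print(critical, max_pool(only_critical), max_pool(without_point_0))
```

Point 2 never wins a channel, so dropping it leaves the pooled feature untouched. Dropping point 0, which is critical, changes the first channel from 0.9 to 0.3.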

The Universal Approximator theorem (yes, examinable)

The PointNet paper proves that *any continuous symmetric function on a compact set can be approximated arbitrarily well* by a network of this form, given enough capacity in $h$, in particular enough feature dimensions going into the max-pool. So PointNet is not just *a* permutation-invariant architecture: in principle, the shared-MLP-plus-max form is a *universal* one that gives up no expressive power. This is why the paper became a landmark.

Compare to attention from later units: self-attention without positional encoding is permutation-*equivariant*, meaning that permuting the tokens permutes the outputs in the same way. PointNet's shared MLP is equivariant in the same sense, and its max-pool then produces *invariance* (a single global vector independent of order) at $O(N)$ cost rather than the $O(N^{2})$ of self-attention. Different tools, closely related properties.

PointNet's fatal weakness — and PointNet++

Each point is processed independently until the max-pool. There is no notion of a local neighbourhood — no equivalent of a $3 \times 3$ receptive field. PointNet captures *global* shape but misses *local* geometric texture: the curvature where a chair-leg meets the seat, the crisp edge of a desk, the subtle bend of an airplane wing.

PointNet++ patches this. Sample anchor points with Farthest Point Sampling (FPS), a greedy procedure: pick a point, then repeatedly pick the point whose distance to its nearest already-chosen anchor is largest, which produces evenly spaced anchors. Group neighbours around each anchor via ball query (all points within radius $r$, in practice capped at a fixed number per group). Apply PointNet *inside each group*. Repeat hierarchically. You now have a CNN-like receptive-field hierarchy: early layers see local geometry, later layers see global structure.
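FPS and ball query can be sketched directly from that description. This is a minimal $O(NM)$ illustration with invented names and a toy cloud of three clusters. It omits the per-group neighbour cap that real implementations apply.

```python
import math


def farthest_point_sampling(points, m, start=0):
    """Greedily pick m indices, each maximising distance to the chosen set."""
    chosen = [start]
    # min_dist[i] = distance from point i to its nearest chosen anchor so far
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def ball_query(points, centre, radius):
    """Indices of all points within `radius` of `centre`."""
    return [i for i, p in enumerate(points) if math.dist(p, centre) <= radius]


# Toy cloud: a tight cluster at the origin, a pair near x = 5, one point at y = 5.
points = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (5, 0, 0), (5.1, 0, 0), (0, 5, 0)]
anchors = farthest_point_sampling(points, 3)
groups = [ball_query(points, points[a], radius=0.5) for a in anchors]
print(anchors, groups)
```

The three anchors land in three different clusters, and each ball query recovers the local group that a per-group PointNet would then summarise.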

Approach 3 — DGCNN, the social network of points

DGCNN says: *a point cloud is really just a graph.* Make every point a node; connect each point to its $k$ nearest neighbours; those are the edges. Now apply something convolution-like on the graph.

The operation is EdgeConv: for point $i$, find its $k$ nearest neighbours $j_{1}, \dots, j_{k}$. For each neighbour $j$, form an edge feature by concatenating $[x_{i}, x_{j} - x_{i}]$: the point itself, plus the *relative offset* to its neighbour. This relative offset is what gives DGCNN its locality, encoding geometry the way image convolutions encode "this pixel relative to its neighbours". Pass through an MLP, then aggregate the $k$ edge features symmetrically (max or sum) into a new feature for point $i$.

The aggregation has to be symmetric — same reason as PointNet — so the order of neighbours doesn't matter.
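A single EdgeConv layer can be sketched as follows. Assumptions: a brute-force kNN, max aggregation, and `toy_mlp`, a single random linear layer with ReLU standing in for the learned MLP. All function names are invented for the example.

```python
import math
import random


def knn(vectors, i, k):
    """Indices of the k vectors closest to vectors[i], excluding i itself."""
    others = [j for j in range(len(vectors)) if j != i]
    others.sort(key=lambda j: math.dist(vectors[i], vectors[j]))
    return others[:k]


def edge_conv(features, k, mlp):
    out = []
    for i, x_i in enumerate(features):
        edge_outputs = []
        for j in knn(features, i, k):
            offset = [b - a for a, b in zip(x_i, features[j])]  # x_j - x_i
            edge_outputs.append(mlp(list(x_i) + offset))        # [x_i, x_j - x_i]
        # Symmetric aggregation: channel-wise max over the k edges.
        out.append([max(channel) for channel in zip(*edge_outputs)])
    return out


rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(4)]  # 6 inputs, 4 outputs


def toy_mlp(edge):
    """Hypothetical stand-in for the learned edge MLP."""
    return [max(0.0, sum(w * v for w, v in zip(row, edge))) for row in W]


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0),
          (7.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
new_features = edge_conv(points, k=2, mlp=toy_mlp)
reversed_run = edge_conv(points[::-1], k=2, mlp=toy_mlp)
print(len(new_features), len(new_features[0]))
```

Each point keeps its identity but gains a 4-d feature built from its two nearest neighbours. Reversing the input order only reverses the output order, because the max does not care which neighbour came first.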

The dynamic part is the cleverest bit. After each layer, *rebuild the kNN graph in feature space, not in coordinate space.* Early layers connect points that are physically close; late layers connect points that are *semantically* similar, so all "wing-tip" points form a group even if they're spatially on opposite sides of an airplane. This is the *Dynamic* in Dynamic Graph CNN.
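The effect of rebuilding the graph in feature space shows up even in a four-point toy. The coordinates and the "learned" features below are hypothetical numbers picked so that the two wing tips look alike in feature space.

```python
import math


def nearest(vectors, i, k=1):
    """Indices of the k vectors closest to vectors[i], excluding i itself."""
    others = [j for j in range(len(vectors)) if j != i]
    others.sort(key=lambda j: math.dist(vectors[i], vectors[j]))
    return others[:k]


# Index 0: left wing tip, 1: left wing root, 2: right wing root, 3: right wing tip.
coords = [(-10.0, 0.0, 0.0), (-9.0, 0.0, 0.0), (9.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
# Hypothetical deep-layer features: tips resemble tips, roots resemble roots.
feats = [(1.0, 0.0), (0.2, 0.9), (0.25, 0.85), (0.95, 0.05)]

spatial_neighbour = nearest(coords, 0)
feature_neighbour = nearest(feats, 0)
print(spatial_neighbour, feature_neighbour)
```

In coordinate space the left tip's neighbour is the adjacent wing root. In feature space it is the right tip, 20 units away.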

Limitations to mention: all-pairs distance is $O (N^{2})$ memory; fixed $k$ can't handle wildly varying density; there's no spatial downsampling across layers.

Approach 4 — MeshCNN, when you actually have a mesh

Sometimes your data isn't a point cloud at all — it's a mesh of vertices joined into triangles. A mesh is strictly richer than a point cloud because it encodes *surface topology* explicitly.

MeshCNN's insight: treat edges as the convolutional unit, because every interior edge of a manifold triangle mesh sits between exactly two faces. That gives a fixed-size neighbourhood of 4 surrounding edges for any edge (the other two edges of each adjacent face), just like a $3 \times 3$ patch around a pixel.

Each edge gets a 5-D input feature: the dihedral angle between its two faces, the two inner angles opposite the edge (one in each face), and two edge-length ratios, each relating the edge's length to the height of one adjacent face measured perpendicular to the edge. All five are intrinsic: invariant to translation, rotation, and uniform scaling.
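The five features can be computed from four vertices: the edge's endpoints and the vertex opposite it in each face. The sketch below is an illustration rather than MeshCNN's code. The helper names are invented, and taking the ratio as face height divided by edge length is an assumption about the direction of the ratio.

```python
import math


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def norm(a):
    return math.sqrt(sum(x * x for x in a))


def angle_between(a, b):
    cos = sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error


def edge_features(p, q, u, v):
    """Features of edge (p, q) shared by triangles (p, q, u) and (q, p, v)."""
    edge = sub(q, p)
    n1 = cross(edge, sub(u, p))  # normal of face (p, q, u)
    n2 = cross(sub(v, p), edge)  # normal of face (q, p, v), consistently oriented
    dihedral = math.pi - angle_between(n1, n2)
    # Inner angle at the vertex opposite the edge, in each face.
    inner = [angle_between(sub(p, w), sub(q, w)) for w in (u, v)]
    # Assumed ratio: face height over the edge, divided by edge length.
    ratios = [norm(cross(edge, sub(w, p))) / norm(edge) ** 2 for w in (u, v)]
    return [dihedral] + inner + ratios


# Two coplanar triangles sharing the edge from (0,0,0) to (1,0,0).
flat = edge_features((0, 0, 0), (1, 0, 0), (0.5, 1, 0), (0.5, -1, 0))
# Same mesh translated and uniformly scaled by 2.
moved = edge_features((1, 2, 3), (3, 2, 3), (2, 4, 3), (2, 0, 3))
# Second face folded up by 90 degrees.
folded = edge_features((0, 0, 0), (1, 0, 0), (0.5, 1, 0), (0.5, 0, 1))
print(flat, folded[0])
```

The flat pair gives a dihedral angle of $\pi$, the folded pair gives $\pi / 2$, and the translated, scaled copy yields the same five numbers as the original.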

There's a subtlety. The four neighbouring edges $(a, b, c, d)$ have two valid orderings, $(a, b, c, d)$ or $(c, d, a, b)$, depending on which face you start labelling from. To make the conv invariant to that choice, MeshCNN feeds in symmetric combinations: $(a + c, |a - c|, b + d, |b - d|)$ instead of the raw edges. Sums and absolute differences are unchanged when the pair is swapped. The conv kernel has 5 weights $(k_{0}, \dots, k_{4})$: one for the centre edge, one per symmetric channel.
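The ordering argument can be checked numerically. The sketch below uses one scalar feature per edge and made-up kernel weights. `mesh_conv` and `naive_conv` are illustrative names, with the naive version included only as a contrast.

```python
def mesh_conv(e, a, b, c, d, kernel):
    """Convolution over the centre edge e and symmetric functions of its neighbours."""
    k0, k1, k2, k3, k4 = kernel
    return k0 * e + k1 * (a + c) + k2 * abs(a - c) + k3 * (b + d) + k4 * abs(b - d)


def naive_conv(e, a, b, c, d, kernel):
    """Contrast case: weights applied to the raw neighbours in the order given."""
    k0, k1, k2, k3, k4 = kernel
    return k0 * e + k1 * a + k2 * b + k3 * c + k4 * d


kernel = (1.0, 0.5, -1.0, 2.0, 0.25)      # hypothetical weights k0..k4
e, a, b, c, d = 1.0, 0.5, 2.0, -1.5, 0.25  # hypothetical edge features

symmetric_1 = mesh_conv(e, a, b, c, d, kernel)
symmetric_2 = mesh_conv(e, c, d, a, b, kernel)  # start labelling from the other face
naive_1 = naive_conv(e, a, b, c, d, kernel)
naive_2 = naive_conv(e, c, d, a, b, kernel)
print(symmetric_1, symmetric_2, naive_1, naive_2)
```

The symmetric version returns 3.4375 for both orderings, while the naive version gives -3.6875 and 1.5, so its output would depend on an arbitrary labelling choice.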

Pooling = learned edge collapse. Edges whose learned features have the smallest magnitude are collapsed first (their two endpoints merge, the surrounding mesh topology updates, and the features of the merged edges are averaged). This is task-driven simplification: the network learns to keep edges that matter and discard the rest. *Mesh unpooling* stores the collapse history so the simplification can be reversed for tasks like segmentation that need the original resolution.

What you carry into the exam

The three properties of point clouds — unstructured, irregular, unordered — and why each one breaks a CNN. The five families and their hero methods. The three tasks and their output shapes. VoxNet's four problems and the medical-CT niche where it still wins. PointNet's symmetric-function recipe in three steps, plus the Universal Approximator theorem in one sentence. PointNet's weakness and PointNet++'s FPS + ball-query fix. EdgeConv's relative-offset feature and DGCNN's dynamic feature-space graph. MeshCNN's five intrinsic edge features, its symmetric-channel conv, and its learned edge-collapse pooling.

Now you've seen at least one method per family. The next unit returns to 3D with the *fuzzy explicit* primitives of Gaussian Splatting, the bridge between the discrete world of points and the implicit world of NeRF.