Conventional methods of rendering 3D objects and spaces specify geometry and material properties in some format: a polygon mesh (made out of vertices, edges, and faces) for geometry, a texture map for color, and a normal map for light reflection properties. You then render a viewpoint from that information by simulating how light interacts with the scene.
A Neural Radiance Field (NeRF), designed by Mildenhall et al.1, takes over the role of the file format and part of the rendering step in the form of a neural network. Once trained, it takes a world coordinate and a viewing direction, and returns an RGB tuple and a density. If you query the NeRF often enough, you can render any traditional 2D image or build a 3D model from it by combining all the data points.
Here's a video from the website that accompanied the original paper showing 360° views generated from a number of different NeRFs:
The advantages of a NeRF over a traditional 3D model are:
The NeRF design implements a function that maps a world coordinate $(x,y,z)$ and a viewing direction or pose2 $(\theta, \phi)$ to a color $(r, g, b)$ and a density $(\sigma)$.
Density is what you would expect it to be: an abstract measurement of how dense the material at the specified location is. In mathematical terms, the function is:
$$F_\Theta: (x, y, z, \theta, \phi) \rightarrow (r, g, b, \sigma) $$
Since density does not depend on the viewing direction, the NeRF is actually made up of two separate, linked networks. The first maps a world coordinate to a density and an abstract, 256-dimensional feature vector. The second maps this feature vector and the viewing direction (pose) to a color.
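To make that structure concrete, here's a minimal PyTorch sketch of the two linked networks. Everything about it is illustrative: the name `TinyNeRF`, the layer counts, and passing the viewing direction as a unit vector rather than $(\theta, \phi)$ are my choices, and the paper's actual model is deeper and encodes its inputs first.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Two linked networks: position -> (density, feature), then (feature, direction) -> color."""

    def __init__(self, hidden=256):
        super().__init__()
        # Network 1: world coordinate (x, y, z) -> 256-dimensional feature vector + density.
        self.position_net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.feature_head = nn.Linear(hidden, hidden)
        # Network 2: feature vector + viewing direction -> RGB color.
        self.color_net = nn.Sequential(
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3),
        )

    def forward(self, xyz, view_dir):
        h = self.position_net(xyz)
        sigma = torch.relu(self.density_head(h))   # density is non-negative and view-independent
        feature = self.feature_head(h)
        rgb = torch.sigmoid(self.color_net(torch.cat([feature, view_dir], dim=-1)))
        return rgb, sigma
```

Splitting off the density head before the viewing direction enters the network is what enforces the view-independence of $\sigma$.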
A fully trained NeRF holds an abstract representation of a scene in its weights. It returns colors and densities as if you were actually looking at the scene from the given direction. An algorithm to generate a regular image from a NeRF is volume ray casting: shoot a ray from the camera through each pixel, sample the NeRF at points along that ray, and alpha-composite the returned colors using the densities as opacities.
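Below is a sketch of that compositing step for a batch of rays, assuming the `TinyNeRF` model above. The uniform sampling, the `near`/`far` bounds, and the name `render_rays` are simplifying assumptions of mine; the paper uses a more elaborate hierarchical sampling scheme.

```python
import torch

def render_rays(model, origins, directions, near=2.0, far=6.0, n_samples=64):
    """Volume ray casting: sample the NeRF along each ray and alpha-composite the results."""
    # Depths of the sample points, shared by all rays for simplicity.
    t = torch.linspace(near, far, n_samples)                                   # (n_samples,)
    points = origins[:, None, :] + t[None, :, None] * directions[:, None, :]   # (n_rays, n_samples, 3)
    dirs = directions[:, None, :].expand(-1, n_samples, -1)

    rgb, sigma = model(points, dirs)             # (n_rays, n_samples, 3), (n_rays, n_samples, 1)
    sigma = sigma.squeeze(-1)

    # Turn densities into per-sample opacities, then accumulate front to back.
    delta = (far - near) / n_samples                      # spacing between consecutive samples
    alpha = 1.0 - torch.exp(-sigma * delta)               # (n_rays, n_samples)
    ones = torch.ones_like(alpha[:, :1])
    transmittance = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * transmittance                       # contribution of each sample to the pixel
    return (weights[..., None] * rgb).sum(dim=-2)         # (n_rays, 3) pixel colors
```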
A way to build a 3D model based on a NeRF is to sample its density on a regular 3D grid, threshold the result into an occupancy volume, and extract a surface mesh from it, for example with marching cubes.
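A rough sketch of that extraction, reusing the model above and assuming the object fits inside a known bounding box. The resolution, bounding box size, and density threshold are all knobs you would have to tune; `scikit-image` supplies the marching cubes step.

```python
import torch
from skimage import measure

@torch.no_grad()
def extract_mesh(model, resolution=128, bound=1.5, threshold=50.0):
    """Sample the density field on a regular grid and extract a triangle mesh from it."""
    # Regular grid of query points covering the assumed bounding box [-bound, bound]^3.
    axis = torch.linspace(-bound, bound, resolution)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1).reshape(-1, 3)

    sigmas = []
    for chunk in torch.split(grid, 65536):      # query in chunks to bound memory use
        # Density does not depend on the viewing direction, so a dummy direction is fine here.
        _, sigma = model(chunk, torch.zeros_like(chunk))
        sigmas.append(sigma.squeeze(-1))
    density = torch.cat(sigmas).reshape(resolution, resolution, resolution).numpy()

    # Marching cubes turns the thresholded density volume into vertices and triangle faces.
    verts, faces, normals, _ = measure.marching_cubes(density, level=threshold)
    return verts, faces, normals
```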
Neural Radiance Fields are trained with images of the scene and information on where each training image was taken from (its camera pose). You compare the training images with identically positioned ones rendered from the NeRF and minimize the reconstruction error.
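In code, that training loop could look roughly like the sketch below, reusing `render_rays` from above. The helper `sample_training_rays` is hypothetical: it stands in for whatever code turns your training images and their camera poses into ray origins, ray directions, and ground-truth pixel colors.

```python
import torch

def train_nerf(model, sample_training_rays, n_iters=200_000, lr=5e-4, batch_size=1024):
    """Minimize the reconstruction error between rendered and real pixel colors."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(n_iters):
        # Ray origins and directions come from the known camera poses of the training images.
        origins, directions, target_rgb = sample_training_rays(batch_size)
        pred_rgb = render_rays(model, origins, directions)
        loss = torch.mean((pred_rgb - target_rgb) ** 2)   # photometric (MSE) reconstruction error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```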
Obtaining camera pose information for training images is not part of the NeRF algorithm. I suggest COLMAP, an open-source structure-from-motion toolkit that matches feature points (SIFT-style) across images to recover the camera matrices for a set of images. With COLMAP, I had more success with relatively wide-angle shots (35mm) than with portrait-style images (50mm).
Here are a few open problems and some approaches to solving them.
The original NeRF algorithm is slow, especially when compared to photogrammetry, an object-scanning method that doesn't use machine learning. In a project I was involved in, a state-of-the-art proprietary photogrammetry pipeline could process ~80GB worth of 24MP photos into a micrometer-level accurate 3D model in about 8 hours, while the fastest NeRF implementation took the same time to train a model on just 46 pictures at 0.2MP.
On the other hand, it may not be necessary to use equally high-resolution training images to reach competitive image quality. A NeRF is a continuous function, so it may correctly fill in details based on relatively little data. In addition, photogrammetry requires a much more controlled image-capture setup and is more prone to errors in the data-gathering step than a NeRF.
Generating views directly from a NeRF is quite slow. To find the color of a single pixel in a 2D image, the network has to be queried once for every sample point along that pixel's ray. It takes 20 to 30 seconds to render a single 800×800 image, depending on the hardware and implementation used.
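To put a number on it: the original paper samples each ray at 64 coarse and 128 fine points, so a single 800×800 frame already requires roughly

$$800 \times 800 \times (64 + 128) \approx 1.2 \times 10^8$$

network evaluations.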
There have been a number of papers suggesting algorithmic improvements in the rendering step. The most recent ones are those by Hedman et al.5, Yu et al.6 and Garbin et al.7, each of which proposes a way of baking the NeRF into another representation that can be queried at higher frame rates.
Once trained, a NeRF contains an implicit representation of a scene in its weights. That means the usual caveats about neural networks being black boxes apply. The geometry, color, lighting, reflections, and any other properties that affect appearance are all baked in. Here are some challenges and proposed directions to solve them:
There's a lot more to explore. You can find a curated list of papers that build on the NeRF design on GitHub: yenchenlin/awesome-NeRF. If you thought this was cool, you might be a nerd (in a good way). Come back later for more, and follow/DM me on Twitter! 🤓
Specified in the spherical coordinate system without the distance from the origin $(r)$, because only the direction is of interest.
In other words: define the internal (intrinsic) and external (extrinsic) camera matrices.