NeRF implicitly represents a 3D scene with a multi-layer perceptron (MLP) $F: (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$ for a position $\mathbf{x}$, view direction $\mathbf{d}$, color $\mathbf{c}$, and "opacity" (volume density) $\sigma$. The rendered results are spectacular.
There have been a number of articles introducing NeRF since its publication in 2020, but most of them neglect a subtle operation in the implementation: scenes from the LLFF dataset are projected into NDC space before being modeled by the MLP.
This post elaborates on NDC space and corresponding projection operations. Both mathematical derivation and implementation will be analyzed.
In a graphics pipeline, viewing transformation is responsible for mapping each 3D location x in the canonical coordinate ("world" coordinate) system to image space, measured in pixels. Such a procedure typically includes three components:
camera transformation
projection transformation
viewport transformation
which is illustrated below.
"A camera transformation is a rigid body transformation that places the camera at the origin in a convenient orientation. It depends only on the position and orientation, or pose, of the camera."
A projection transformation maps points in camera space into a $[-1, 1]^3$ cube centered at the origin. Such a cube is called the canonical view volume, and its coordinates are the normalized device coordinates (NDC).
A viewport transformation "flattens" the $[-1, 1]^3$ NDC cube and maps the resulting $2 \times 2$ square to a raster image of height $H$ and width $W$, measured in pixels.
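One standard form of this viewport matrix, under the common assumption (not stated in the post) that pixel centers sit at integer coordinates $0, \dots, W-1$ and $0, \dots, H-1$, is

$$ M_{\text{vp}} = \begin{bmatrix} \frac{W}{2} & 0 & 0 & \frac{W-1}{2} \\ 0 & \frac{H}{2} & 0 & \frac{H-1}{2} \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, $$

which maps $x, y \in [-1, 1]$ to pixel coordinates and passes $z$ through unchanged for depth testing.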
NeRF does not adopt the camera transformation because the positions $\mathbf{x}$ fed to the multi-layer perceptron (MLP) are expressed directly in world coordinates. Neither does it use the viewport transformation, since the scene is queried implicitly from the MLP rather than rasterized onto a pixel grid. NeRF performs the projection transformation directly on world coordinates for the LLFF dataset. Let's figure out how the projection transformation works.
Frame of reference
Although NeRF performs the NDC conversion in world space, this post derives it w.r.t. the camera frame. The transition to world coordinates is implemented by a matrix multiplication with c2w, which is composed of a rotation matrix $R \in \mathbb{R}^{3 \times 3}$ and a translation vector $\mathbf{t} \in \mathbb{R}^3$:
$$ \text{c2w} = \begin{bmatrix} R & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{bmatrix} $$
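As a concrete illustration, here is a minimal sketch (hypothetical function and variable names, loosely mirroring the repository's get_rays helper) of how camera-frame ray directions and the camera origin can be moved into world coordinates with c2w:

```python
import torch

def camera_to_world_rays(dirs_cam: torch.Tensor, c2w: torch.Tensor):
    """Sketch: rotate camera-frame ray directions into the world frame and use
    the translation column of c2w (the camera center) as the shared ray origin."""
    R, t = c2w[:3, :3], c2w[:3, 3]      # rotation R and translation t of the camera pose
    rays_d = dirs_cam @ R.T             # each direction: d_world = R @ d_cam
    rays_o = t.expand(rays_d.shape)     # every ray starts at the camera center
    return rays_o, rays_d
```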
The projection transformation is decomposed into a perspective projection followed by an orthographic projection.
Perspective projection converts the camera frustum into a cuboid bounded by the following planes:
| Planes | Coordinates |
| --- | --- |
| left | $x = l < 0$ |
| right | $x = r > 0$ |
| bottom | $y = b < 0$ |
| top | $y = t > 0$ |
| far | $z = f' < 0$ |
| near | $z = n' < 0$ |
Camera coordinates
In a camera coordinate system, the $z$-axis points backward (the camera looks along $-z$) by the right-hand rule. Consequently, $0 > n' > f'$.
Focus on the 1D perspective transformation first. Suppose the gaze direction $\mathbf{g}$ coincides with the $-z$-axis; then the object appears, scaled down, on the image plane at $z = n'$.
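Concretely, by similar triangles, a point at height $y$ and depth $z$ (with $z < 0$ in front of the camera) lands on the image plane $z = n'$ at height

$$ y_s = \frac{n'}{z}\, y, $$

so the farther the point (the larger $|z|$), the smaller its image.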
Coordinates are "homogeneous" in that they can be translated, rotated, and scaled via a single matrix multiplication. Appending a fourth entry to a 3D coordinate $\mathbf{x}$ and multiplying by a $4 \times 4$ transformation $M$ is how such transformations are carried out; this homogeneous formulation is ubiquitous in rendering and computer vision.
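For example, a translation by $(t_x, t_y, t_z)$, which is not a linear map on raw 3D coordinates, becomes a single matrix multiplication in homogeneous coordinates:

$$ \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} x + t_x \\ y + t_y \\ z + t_z \\ 1 \end{bmatrix}. $$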
Generalizing perspective projection to 3D gives a $4 \times 4$ perspective matrix $P$.
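One common way to write this matrix, assuming the convention above where the near and far planes sit at $z = n' < 0$ and $z = f' < 0$ (the convention of Marschner and Shirley's textbook, which the quote below follows), is

$$ P = \begin{bmatrix} n' & 0 & 0 & 0 \\ 0 & n' & 0 & 0 \\ 0 & 0 & n' + f' & -f'n' \\ 0 & 0 & 1 & 0 \end{bmatrix}. $$

It sends a homogeneous point $(x, y, z, 1)$ to $(n'x,\; n'y,\; (n'+f')z - f'n',\; z)$; after dividing by the last entry, $x$ and $y$ are scaled by $n'/z$, while points on the $z = n'$ and $z = f'$ planes keep their depth.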
"Perspective projection leaves points on the z=βnβ² plane unchanged and transforms the large and (potentially very) far rectangle at the back of the perspective volume to the z=βfβ² rectangle at the back of the orthographic volume." Essentially, such projection "maps any line through the camera (or eye) to a line parallel to the z-axis without moving the point on the line at z=βnβ²."
Perspective projection matrices are defined only up to scale.
Any scaled perspective projection matrix $cP$ with $c \in \mathbb{R} \setminus \{0\}$ is equivalent to $P$, since the coordinates of a projected point $\mathbf{x}_{\text{proj}}$ are ratios of the entries of $cP\mathbf{x}$.
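To see why, write the projected point as ratios of homogeneous entries; the factor $c$ cancels:

$$ \mathbf{x}_{\text{proj}} = \left( \frac{(cP\mathbf{x})_1}{(cP\mathbf{x})_4},\; \frac{(cP\mathbf{x})_2}{(cP\mathbf{x})_4},\; \frac{(cP\mathbf{x})_3}{(cP\mathbf{x})_4} \right) = \left( \frac{(P\mathbf{x})_1}{(P\mathbf{x})_4},\; \frac{(P\mathbf{x})_2}{(P\mathbf{x})_4},\; \frac{(P\mathbf{x})_3}{(P\mathbf{x})_4} \right). $$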
Orthographic projection scales the $[l, r] \times [b, t] \times [f', n']$ volume to a $2 \times 2 \times 2$ cube and further shifts it to a $[-1, 1]^3$ cube centered at the origin. Such an operation is a scale followed by a translation.
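A standard form of this scale-and-translate matrix, mapping $[l, r] \times [b, t] \times [f', n']$ onto $[-1, 1]^3$, is

$$ M_{\text{orth}} = \begin{bmatrix} \frac{2}{r-l} & 0 & 0 & -\frac{r+l}{r-l} \\ 0 & \frac{2}{t-b} & 0 & -\frac{t+b}{t-b} \\ 0 & 0 & \frac{2}{n'-f'} & -\frac{n'+f'}{n'-f'} \\ 0 & 0 & 0 & 1 \end{bmatrix}, $$

so the combined projection is $M_{\text{orth}} P$.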
There remains one subtle difference between the official projection transformation matrix and our derived one: the third row, i.e., the sign of the $z$ coordinates. Graphics frameworks such as OpenGL adopt the convention that NDC depth increases from the near plane toward the far plane, so the sign of the third row is flipped (after rescaling the whole matrix by $-1$, which the previous paragraph permits).
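With that flip, and writing $n = -n' > 0$ and $f = -f' > 0$ for the distances to the near and far planes, the final matrix coincides with the standard OpenGL projection matrix (the one produced by glFrustum):

$$ M = \begin{bmatrix} \frac{2n}{r-l} & 0 & \frac{r+l}{r-l} & 0 \\ 0 & \frac{2n}{t-b} & \frac{t+b}{t-b} & 0 \\ 0 & 0 & -\frac{f+n}{f-n} & -\frac{2fn}{f-n} \\ 0 & 0 & -1 & 0 \end{bmatrix}. $$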
In retrospect, reversing the sign of the $z$ coordinates amounts to flipping the $z$-axis to the opposite direction. This affects both the marched rays and the object. Nonetheless, an object in NeRF is modeled implicitly by an MLP, so simply warping the sampled rays, in this case the ray origins $\mathbf{o}$ and directions $\mathbf{d}$, is enough.
The camera frustum is generally symmetric along the $x$- and $y$-axes, i.e., $l = -r$ and $b = -t$.
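With this symmetry, applying the projection matrix to a point $(x, y, z)$ and dividing by the homogeneous coordinate yields the shorthand used in the NeRF paper's appendix (here the $t$ in $a_y$ is the top plane coordinate, not the ray parameter):

$$ \pi\!\left(\begin{bmatrix} x \\ y \\ z \end{bmatrix}\right) = \begin{bmatrix} a_x\, \frac{x}{z} \\ a_y\, \frac{y}{z} \\ a_z + \frac{b_z}{z} \end{bmatrix}, \qquad a_x = -\frac{n}{r}, \quad a_y = -\frac{n}{t}, \quad a_z = \frac{f+n}{f-n}, \quad b_z = \frac{2fn}{f-n}. $$

For a projected ray to remain a ray, NeRF seeks an origin $\mathbf{o}'$, a direction $\mathbf{d}'$, and a reparameterized distance $t'$ such that

$$ \pi(\mathbf{o} + t\,\mathbf{d}) = \mathbf{o}' + t'\,\mathbf{d}' \quad \text{for all } t, $$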
with $\mathbf{o}'$, $t'$, and $\mathbf{d}'$ undetermined. There are infinitely many solutions to this system of 3 equations with 7 degrees of freedom. Focus on the ray origins first: suppose the coordinates of $\mathbf{o}'$ agree with the projection of $\mathbf{o}$ itself, i.e., let $t = t' = 0$.
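Setting $t = t' = 0$ in the constraint pins the new origin to the projection of the old one (the same step as in the paper's appendix):

$$ \mathbf{o}' = \pi(\mathbf{o}) = \begin{bmatrix} a_x\, \frac{o_x}{o_z} \\ a_y\, \frac{o_y}{o_z} \\ a_z + \frac{b_z}{o_z} \end{bmatrix}. $$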
Then

$$ t' = \frac{t\, d_z}{o_z + t\, d_z} = 1 - \frac{o_z}{o_z + t\, d_z}, \qquad \mathbf{d}' = \begin{bmatrix} a_x \left( \frac{d_x}{d_z} - \frac{o_x}{o_z} \right) \\ a_y \left( \frac{d_y}{d_z} - \frac{o_y}{o_z} \right) \\ -\, b_z\, \frac{1}{o_z} \end{bmatrix}. $$
Why does NDC space work?
Note that $\lim_{t \to \infty} t' = 1$, which means an infinite depth range is mapped to $[0, 1]$ after the NDC transformation. The quantity $t'$ is affine in inverse depth, which is also called disparity.
This is particularly useful for the LLFF dataset, where rays from front-facing cameras may not "hit" any object, i.e., the depth is effectively infinite. The infinite camera frustum is warped into a bounded cube. "NDC effectively reallocates NeRF MLP's capacity in a way that is consistent with the geometry of perspective projection."
Projection transformation isn't omnipotent.
The LLFF dataset contains scenes where the camera frustum is unbounded in a single direction. NeRF does not perform well on unbounded 360° scenes. This problem is explored in follow-up works.
Suppose the camera is modeled such that the image plane lies exactly on the near plane ($z = -n$) and the far plane ($z = -f$) is pushed to infinity. Then $n$ is the focal length $f_{\text{camera}}$, and $r$ and $t$ are $\frac{W}{2}$ and $\frac{H}{2}$ respectively.
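Letting $f \to \infty$ sends $a_z \to 1$ and $b_z \to 2n$; substituting $a_x = -\frac{f_{\text{camera}}}{W/2}$ and $a_y = -\frac{f_{\text{camera}}}{H/2}$ gives the NDC-space rays (the same expressions as in the NeRF paper's appendix):

$$ \mathbf{o}' = \begin{bmatrix} -\frac{f_{\text{camera}}}{W/2}\, \frac{o_x}{o_z} \\ -\frac{f_{\text{camera}}}{H/2}\, \frac{o_y}{o_z} \\ 1 + \frac{2n}{o_z} \end{bmatrix}, \qquad \mathbf{d}' = \begin{bmatrix} -\frac{f_{\text{camera}}}{W/2} \left( \frac{d_x}{d_z} - \frac{o_x}{o_z} \right) \\ -\frac{f_{\text{camera}}}{H/2} \left( \frac{d_y}{d_z} - \frac{o_y}{o_z} \right) \\ -\frac{2n}{o_z} \end{bmatrix}. $$

These are exactly the quantities computed in the PyTorch implementation analyzed next.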
```python
if __name__ == '__main__':
    torch.set_default_tensor_type('torch.cuda.FloatTensor')

    train()
```
The subroutines of train(…) are illustrated below.
The code for the NDC transformation is encapsulated in a function, ndc_rays(…), in run_nerf_helpers.py. It is called inside render(…) as ndc_rays(H, W, K[0][0], 1., rays_o, rays_d).
K is the calibration matrix, i.e., the camera intrinsics

$$ K = \begin{bmatrix} f_{\text{camera}} & 0 & \frac{W}{2} \\ 0 & f_{\text{camera}} & \frac{H}{2} \\ 0 & 0 & 1 \end{bmatrix}. $$

K[0][0] is the top-left element of K, i.e., the focal length of the camera.
We first present the function in its entirety; it will then be analyzed step by step.
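Below is a sketch of ndc_rays(…) reconstructed from the formulas derived above; the authoritative listing lives in run_nerf_helpers.py and follows the same structure and variable names.

```python
import torch

def ndc_rays(H, W, focal, near, rays_o, rays_d):
    """Warp rays into NDC space (sketch reconstructed from the derivation above)."""
    # Shift ray origins onto the near plane z = -near
    t = -(near + rays_o[..., 2]) / rays_d[..., 2]
    rays_o = rays_o + t[..., None] * rays_d

    # Projected origin o' (a_x = -focal/(W/2), a_y = -focal/(H/2), a_z -> 1, b_z -> 2*near)
    o0 = -1. / (W / (2. * focal)) * rays_o[..., 0] / rays_o[..., 2]
    o1 = -1. / (H / (2. * focal)) * rays_o[..., 1] / rays_o[..., 2]
    o2 = 1. + 2. * near / rays_o[..., 2]

    # Projected direction d'
    d0 = -1. / (W / (2. * focal)) * (rays_d[..., 0] / rays_d[..., 2] - rays_o[..., 0] / rays_o[..., 2])
    d1 = -1. / (H / (2. * focal)) * (rays_d[..., 1] / rays_d[..., 2] - rays_o[..., 1] / rays_o[..., 2])
    d2 = -2. * near / rays_o[..., 2]

    rays_o = torch.stack([o0, o1, o2], -1)
    rays_d = torch.stack([d0, d1, d2], -1)
    return rays_o, rays_d
```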
Initially, the ray origins $\mathbf{o}$ are located at the origin of the camera coordinate system. The first two statements of ndc_rays shift the ray origins to the intersection points of the rays with the near plane, $\mathbf{o}_n$, before the NDC conversion.
To determine the shift distance $t_n$, let $\mathbf{o}_n = \mathbf{o} + t_n \mathbf{d}$. Considering only the $z$-axis, $-n = o_z + t_n d_z$, so $t_n = -\frac{n + o_z}{d_z}$.
Now that the depth range $[n, f) = [n, \infty)$ is mapped to $[0, 1]$ after the NDC transformation, placing $\mathbf{o}_n$ on the near plane enables convenient sampling (uniform in disparity) along a ray by picking $t'_i \in [0, 1]$.
In the code, o0, o1, and o2 correspond to $o'_x$, $o'_y$, and $o'_z$; similarly, d0, d1, and d2 are $d'_x$, $d'_y$, and $d'_z$ respectively. This assignment is exactly what was derived previously.
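To make the sampling claim concrete, here is a minimal sketch (tensor shapes and names are illustrative, not taken from the repository): once rays are in NDC, picking $t'$ uniformly in $[0, 1]$ directly yields the sample points.

```python
import torch

# Illustrative NDC-space rays of shape [N_rays, 3] (e.g., output of the ndc_rays sketch above).
rays_o = torch.zeros(1024, 3)
rays_d = torch.rand(1024, 3)

# After NDC conversion the useful depth range [n, inf) corresponds to t' in [0, 1],
# so uniform samples of t' are uniform in disparity of the original space.
N_samples = 64
t_vals = torch.linspace(0., 1., steps=N_samples)                      # t'_i in [0, 1]
pts = rays_o[..., None, :] + rays_d[..., None, :] * t_vals[:, None]   # [1024, 64, 3]
```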
The viewing transformation is a component of the graphics pipeline. This post introduced the audience to the viewing transformation, in which the NDC transformation plays a crucial role. We discussed its derivation in NeRF and analyzed its implementation.