NeRF: A Volume Rendering Perspective


Overview

NeRF implicitly represents a 3D scene with a multi-layer perceptron (MLP) $F: (\boldsymbol{x}, \boldsymbol{d}) \rightarrow (\boldsymbol{c}, \sigma)$ for a position $\boldsymbol{x} \in \mathbb{R}^3$, view direction $\boldsymbol{d} \in [0, \pi) \times [0, 2\pi)$, color $\boldsymbol{c}$, and "opacity" $\sigma$. The rendered results are spectacular.

There have been a number of articles introducing NeRF since its publication in 2020. While most posts cover the general method, few elaborate on why the volume rendering procedure

$$
\mathbf{C}(\boldsymbol{r}) = \mathbf{C}(z; \boldsymbol{o}, \boldsymbol{d}) = \int_{z_n}^{z_f} T(z)\, \sigma\left(\boldsymbol{r}(z)\right) \boldsymbol{c}\left(\boldsymbol{r}(z), \boldsymbol{d}\right) dz, \quad T(z) = \exp\left(-\int_{z_n}^{z} \sigma\left(\boldsymbol{r}(s)\right) ds\right)
$$

for a ray $\boldsymbol{r} = \boldsymbol{o} + z\boldsymbol{d}$ works, and how the equation reduces to

$$
\hat{\mathbf{C}}(\boldsymbol{r}) = \hat{\mathbf{C}}(z_1, z_2, \ldots, z_N; \boldsymbol{o}, \boldsymbol{d}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \boldsymbol{c}_i, \quad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)
$$

via numerical quadrature, let alone exploring its implementation via the Monte Carlo method.

This post delves into the volume rendering aspect of NeRF. The equations are derived, and their implementation is analyzed.

Prerequisites

Background

The rendering formula

Ray model

A ray with origin $\boldsymbol{o}$ and direction $\boldsymbol{d}$ is cast into an arbitrary region of bounded space. Assume, for simplicity, that the cross-sectional area $A$ is uniform along the ray. Let's focus on a slice of the region with thickness $\Delta z$.

Slice of region

Occluding objects are modeled as spherical particles with radius $r$. Let $\rho(z; \boldsymbol{o}, \boldsymbol{d})$ denote the particle density, i.e. the number of particles per unit volume, along that ray. If $\Delta z$ is small enough that $\rho(z)$ is constant across the slice, then there are

$$
\underbrace{A \cdot \Delta z}_\text{slice volume} \cdot \rho(z)
$$

particles contained in the slice. As $\Delta z \rightarrow 0$, the solid particles do not overlap along the ray, so an area of

$$
\underbrace{A \cdot \Delta z}_\text{slice volume} \cdot \rho(z) \cdot \underbrace{\pi r^2}_\text{particle area}
$$

is occluded, which amounts to a fraction $\frac{A \Delta z \cdot \rho(z) \cdot \pi r^2}{A} = \pi r^2 \rho(z) \Delta z$ of the cross section. Let $I(z; \boldsymbol{o}, \boldsymbol{d})$ denote the light intensity at depth $z$ from origin $\boldsymbol{o}$ along direction $\boldsymbol{d}$. If a fraction $\pi r^2 \rho(z) \Delta z$ of all rays is occluded, then the intensity at depth $z + \Delta z$ decreases to

$$
I(z + \Delta z) = (1 - \underbrace{\pi r^2 \rho(z) \Delta z}_\text{occluded fraction})\, I(z)
$$

The difference in intensity is

$$
\begin{align*} \Delta I &= I(z + \Delta z) - I(z) \\ &= -\pi r^2 \rho(z) \Delta z \, I(z) \end{align*}
$$

Taking the step from discrete to continuous, we have

$$
dI(z) = -\underbrace{\pi r^2 \rho(z)}_{\sigma(z)} I(z) \, dz
$$

Define the volume density (or voxel "opacity") $\sigma(z; \boldsymbol{o}, \boldsymbol{d}) := \pi r^2 \rho(z; \boldsymbol{o}, \boldsymbol{d})$. This definition makes sense because the attenuation depends on both the number of occluding particles and their size. The solution to the ODE

$$
dI(z) = -\sigma(z) I(z) \, dz
$$

is

$$
I(z) = I(z_0) \, \underbrace{e^{-\int_{z_0}^{z} \sigma(s) \, ds}}_{T(z)}
$$
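As a quick numerical sanity check (not part of NeRF's code; the density function below is an arbitrary choice), the discrete attenuation rule $I(z + \Delta z) = (1 - \sigma(z)\Delta z)\, I(z)$ converges to this exponential as the slices shrink:

```python
import math

def attenuate(sigma, z0, z1, n_slices, I0=1.0):
    """March a ray from z0 to z1, applying I(z + dz) = (1 - sigma(z) dz) * I(z) slice by slice."""
    dz = (z1 - z0) / n_slices
    I = I0
    for k in range(n_slices):
        I *= 1.0 - sigma(z0 + k * dz) * dz
    return I

sigma = lambda z: 0.5 + 0.3 * z        # arbitrary smooth density
analytic = math.exp(-4.4)              # exp(-integral of sigma over [0, 4]) = exp(-(0.5*4 + 0.3*4**2/2))
for n in (10, 100, 10_000):
    print(n, attenuate(sigma, 0.0, 4.0, n), analytic)
```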

Step-by-step solution

Rearrange the terms on both sides of the ODE:

$$
\frac{1}{I(z)} \, dI(z) = -\sigma(z) \, dz
$$

which is a separable differential equation. Integrating both sides,

$$
\begin{align*} \int \frac{1}{I(z)} \, dI(z) &= \int -\sigma(z) \, dz \\ \ln I(z) &= -\int \sigma(z) \, dz + C \\ I(z) &= C e^{-\int \sigma(z) \, dz} \end{align*}
$$

Suppose $I$ takes the value $I(z_0)$ at depth $z_0$; then

$$
I(z) = I(z_0) \, e^{-\int_{z_0}^{z} \sigma(s) \, ds}
$$

Define the accumulated transmittance $T(z) := e^{-\int_{z_0}^{z} \sigma(s) \, ds}$; then $I(z) = I(z_0)T(z)$ is the remaining intensity after a ray travels from $z_0$ to $z$. $T(z)$ can also be viewed as the probability that a ray travels from $z_0$ to $z$ without hitting any particle. But no color is observed if a ray only passes through empty space; radiance is "emitted" only when a ray makes contact with a particle. Define

$$
H(z) := 1 - T(z)
$$

as the cumulative distribution function (CDF) of the event that a ray hits a particle between $z_0$ and $z$; its probability density function (PDF) is then

$$
p_\text{hit}(z) = \underbrace{e^{-\int_{z_0}^{z} \sigma(s) \, ds}}_{T(z)} \, \sigma(z)
$$

CDF to PDF

Differentiate the CDF $H(z)$ with respect to $z$ to get $p_\text{hit}(z)$:

$$
\begin{align*} p_\text{hit}(z) &= \frac{dH}{dz} \\ &= -\frac{dT}{dz} \\ &= -\frac{d}{dz} e^{-\int_{z_0}^{z} \sigma(s) \, ds} \\ &= -e^{-\int_{z_0}^{z} \sigma(s) \, ds} \, \frac{d}{dz} \left( -\int_{z_0}^{z} \sigma(s) \, ds \right) \\ &= e^{-\int_{z_0}^{z} \sigma(s) \, ds} \, \sigma(z) \end{align*}
$$
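As a concrete special case (an illustrative check, not from the paper): for a homogeneous medium with constant density $\sigma(z) \equiv \sigma$,

$$
T(z) = e^{-\sigma (z - z_0)}, \qquad p_\text{hit}(z) = \sigma \, e^{-\sigma (z - z_0)},
$$

i.e. the hit depth follows an exponential distribution with mean free path $1/\sigma$, and $\int_{z_0}^{\infty} p_\text{hit}(z) \, dz = 1$.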

Let a random variable $\mathbf{R}$ denote the emitted radiance; then

$$
\begin{align*} p_{\mathbf{R}}(\boldsymbol{r}) &= p_{\mathbf{R}}(z; \boldsymbol{o}, \boldsymbol{d}) \\ &:= P[\mathbf{R} = \boldsymbol{c}(z)] \\ &= p_\text{hit}(z) \end{align*}
$$

Hence, the color of a pixel is the expectation of emitted radiance:

$$
\begin{align*} \mathbf{C}(\boldsymbol{r}) = \mathbf{C}(z; \boldsymbol{o}, \boldsymbol{d}) &= \mathbb{E}(\mathbf{R}) \\ &= \int_0^\infty \mathbf{R} \, p_{\mathbf{R}} \, dz \\ &= \int_0^\infty \boldsymbol{c} \, p_\text{hit} \, dz \\ &= \int_0^\infty T(z) \sigma(z) \boldsymbol{c}(z) \, dz \end{align*}
$$

concluding the proof.

Integration bounds

In practice, $\boldsymbol{c}$, obtained by querying the MLP, is a function of both the position $z$ (or coordinate $\boldsymbol{x}$) and the view direction $\boldsymbol{d}$. The integration bounds also differ: a computer cannot integrate over an infinite range, so the lower and upper bounds of integration become $z_n = \texttt{near}$ and $z_f = \texttt{far}$, within the range of floating-point representation:

$$
\mathbf{C}(\boldsymbol{r}) = \int_\texttt{near}^\texttt{far} T(z) \sigma(z) \boldsymbol{c}(z) \, dz
$$

In NeRF, $\texttt{near} = 0$ and $\texttt{far} = 1$ for front-facing scenes after conversion to normalized device coordinates (NDC), while the bounded synthetic scenes use $\texttt{near} = 2$ and $\texttt{far} = 6$.

Numerical quadrature

We took a step from discrete to continuous to derive the rendering integral. Nevertheless, computers cannot integrate over a continuous domain, so we fall back on numerical quadrature. Sample $\texttt{near} < z_1 < z_2 < \cdots < z_N < \texttt{far}$ along a ray, and define the differences between adjacent samples as

$$
\delta_i := z_{i+1} - z_i \quad \forall i \in \{1, \ldots, N-1\}
$$

then the transmittance is approximated by

$$
\begin{align*} T_i :=&\ T(z_i) \\ \approx&\ e^{-\sum_{j=1}^{i-1} \sigma_j \delta_j} \end{align*}
$$

where $T_1 = 1$ and $\sigma_j = \sigma(z_j; \boldsymbol{o}, \boldsymbol{d})$. Meanwhile, the continuous PDF $p_\text{hit}$ is replaced by the probability mass of a hit within each interval $[z_i, z_{i+1})$. That is,

$$
\begin{align*} p_i &:= \int_{z_i}^{z_{i+1}} p_\text{hit}(z) \, dz \\ &= H(z_{i+1}) - H(z_i) \\ &= 1 - T(z_{i+1}) - \left(1 - T(z_i)\right) \\ &= T(z_i) - T(z_{i+1}) \\ &= T_i \left(1 - e^{-\sigma_i \delta_i}\right) \end{align*}
$$

Step-by-step solution

$$
\begin{align*} T(z_i) - T(z_{i+1}) &= T(z_i) \left(1 - \frac{T(z_{i+1})}{T(z_i)}\right) \\ &= T(z_i) \left(1 - \frac{e^{-\sum_{j=1}^{i} \sigma_j \delta_j}}{e^{-\sum_{j=1}^{i-1} \sigma_j \delta_j}}\right) \\ &= T(z_i) \left(1 - e^{-\sum_{j=1}^{i} \sigma_j \delta_j + \sum_{j=1}^{i-1} \sigma_j \delta_j}\right) \\ &= T_i \left(1 - e^{-\sigma_i \delta_i}\right) \end{align*}
$$

Hence, the discretized emitted radiance is

$$
\begin{align*} \hat{\mathbf{C}}(\boldsymbol{r}) = \hat{\mathbf{C}}(z_1, \ldots, z_N; \boldsymbol{o}, \boldsymbol{d}) &= \mathbb{E}(\mathbf{R}) \\ &= \int_\texttt{near}^\texttt{far} \mathbf{R} \, p_{\mathbf{R}} \, dz \\ &\approx \sum_{i=1}^N \boldsymbol{c}_i T_i \left(1 - e^{-\sigma_i \delta_i}\right) \end{align*}
$$

where $\boldsymbol{c}_i := \boldsymbol{c}(z_i; \boldsymbol{o}, \boldsymbol{d})$ is the RGB output of the MLP queried at $\{z_i \mid \boldsymbol{o}, \boldsymbol{d}\}$.

Note that if we denote $\alpha_i := 1 - e^{-\sigma_i \delta_i}$, then $\hat{\mathbf{C}}(\boldsymbol{r}) = \sum_{i=1}^N \alpha_i T_i \boldsymbol{c}_i$, which resembles classical alpha compositing.
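To make the quadrature concrete, here is a minimal NumPy sketch of the formula above for a single ray (an illustration, not the repository's raw2outputs(…)). For a constant density and color the discrete sum telescopes, so it matches the closed form $\boldsymbol{c}\left(1 - e^{-\sigma(\texttt{far} - \texttt{near})}\right)$:

```python
import numpy as np

def render_ray(z, sigma, c):
    """Quadrature estimate of C(r) for one ray.

    z     : (N,)   sorted sample depths
    sigma : (N,)   volume density at each sample
    c     : (N, 3) RGB at each sample
    """
    delta = np.diff(z, append=z[-1])       # delta_N = 0 here; the repository appends a very large distance instead
    alpha = 1.0 - np.exp(-sigma * delta)   # alpha_i = 1 - exp(-sigma_i * delta_i)
    T = np.exp(-np.concatenate([[0.0], np.cumsum(sigma * delta)[:-1]]))  # T_1 = 1
    weights = T * alpha                    # w_i = T_i * alpha_i
    return (weights[:, None] * c).sum(axis=0)

near, far, N = 0.0, 1.0, 64
z = np.linspace(near, far, N)
sigma = np.full(N, 3.0)
c = np.tile(np.array([0.2, 0.5, 0.8]), (N, 1))
print(render_ray(z, sigma, c))                                        # quadrature
print(np.array([0.2, 0.5, 0.8]) * (1 - np.exp(-3.0 * (far - near))))  # closed form
```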

Alpha compositing

Consider the case where a foreground object is placed in front of the background. Since a pixel displays only a single color, the colors of the two must be blended. Compositing applies when there are partially transparent regions within the foreground object, or when the foreground object only partially covers the background.

Partial coverage

In alpha compositing, a parameter $\alpha$ determines the extent to which each object contributes to what is displayed in a pixel. Let $\alpha$ denote the opacity (or pixel coverage) of the foreground object; then a pixel showing foreground color $\boldsymbol{c}_f$ over background color $\boldsymbol{c}_b$ is composited as

$$
\boldsymbol{c} = \alpha \boldsymbol{c}_f + (1 - \alpha) \boldsymbol{c}_b
$$

Alpha compositing
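A quick numeric example (illustrative, not from the original post): compositing a 60%-opaque pure-red foreground over a pure-blue background gives

$$
\boldsymbol{c} = 0.6 \, (1, 0, 0) + (1 - 0.6) \, (0, 0, 1) = (0.6, 0, 0.4).
$$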

When blending the colors of multiple objects, one can adopt a divide-and-conquer approach: at each step, composite the nearest not-yet-processed object over the remaining objects, treated as a single entity. Such a strategy is formulated as

$$
\begin{align*}
\boldsymbol{c} &= \alpha_1 \boldsymbol{c}_1 + (1 - \alpha_1)\biggl( \alpha_2 \boldsymbol{c}_2 + (1 - \alpha_2)\Bigl( \alpha_3 \boldsymbol{c}_3 + (1 - \alpha_3)\bigl( \alpha_4 \boldsymbol{c}_4 + (1 - \alpha_4)(\cdots) \bigr) \Bigr) \biggr) \\
&= \alpha_1 \boldsymbol{c}_1 + (1 - \alpha_1)\alpha_2 \boldsymbol{c}_2 + (1 - \alpha_1)(1 - \alpha_2)\Bigl( \alpha_3 \boldsymbol{c}_3 + (1 - \alpha_3)\bigl( \alpha_4 \boldsymbol{c}_4 + (1 - \alpha_4)(\cdots) \bigr) \Bigr) \\
&= \alpha_1 \boldsymbol{c}_1 + (1 - \alpha_1)\alpha_2 \boldsymbol{c}_2 + (1 - \alpha_1)(1 - \alpha_2)\alpha_3 \boldsymbol{c}_3 + (1 - \alpha_1)(1 - \alpha_2)(1 - \alpha_3)\bigl( \alpha_4 \boldsymbol{c}_4 + (1 - \alpha_4)(\cdots) \bigr) \\
&= \alpha_1 \boldsymbol{c}_1 + (1 - \alpha_1)\alpha_2 \boldsymbol{c}_2 + (1 - \alpha_1)(1 - \alpha_2)\alpha_3 \boldsymbol{c}_3 + (1 - \alpha_1)(1 - \alpha_2)(1 - \alpha_3)\alpha_4 \boldsymbol{c}_4 + (1 - \alpha_1)(1 - \alpha_2)(1 - \alpha_3)(1 - \alpha_4)(\cdots) \\
&= \cdots
\end{align*}
$$

which is essentially a tail recursion.

Alpha compositing in NeRF

There are $N$ samples along each ray in NeRF. Consider the first $N - 1$ samples as occluding foreground objects with opacities $\alpha_i$ and colors $\boldsymbol{c}_i$, and the last sample as the background; then the blended pixel value is

$$
\begin{align*}
\tilde{\mathbf{C}} &= \alpha_1 \boldsymbol{c}_1 + (1 - \alpha_1) \alpha_2 \boldsymbol{c}_2 \\
&\quad + (1 - \alpha_1)(1 - \alpha_2) \alpha_3 \boldsymbol{c}_3 \\
&\quad + (1 - \alpha_1)(1 - \alpha_2)(1 - \alpha_3) \alpha_4 \boldsymbol{c}_4 \\
&\quad + \cdots \\
&\quad + (1 - \alpha_1)(1 - \alpha_2)(1 - \alpha_3) \cdots (1 - \alpha_{N-1}) \boldsymbol{c}_N \\
&= \sum_{i=1}^{N} \left(\alpha_i \boldsymbol{c}_i \prod_{j=1}^{i-1} \left(1 - \alpha_j\right)\right)
\end{align*}
$$

where the background sample is treated as opaque, i.e. $\alpha_N = 1$. Recall $\hat{\mathbf{C}} = \sum_{i=1}^N \alpha_i T_i \boldsymbol{c}_i$; it remains to show that $\prod_{j=1}^{i-1} \left(1 - \alpha_j\right) = T_i$:

$$
\begin{align*} \prod_{j=1}^{i-1} \left(1 - \alpha_j\right) &= \prod_{j=1}^{i-1} e^{-\sigma_j \delta_j} \\ &= \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \\ &= T_i \end{align*}
$$

concluding the proof. This manifests the elegance of differentiable volume rendering.

Writing $w_i := \alpha_i T_i$, the expected emitted radiance $\mathbf{C}(\boldsymbol{r}) = \sum_{i=1}^N w_i \boldsymbol{c}_i$ is a weighted sum of colors.
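This equivalence is also how the weights can be computed in practice: an exclusive cumulative product of $(1 - \alpha_j)$ yields $T_i$. The PyTorch sketch below illustrates the idea; it mirrors, but is not copied from, the repository's raw2outputs(…):

```python
import torch

def composite(alpha, rgb):
    """Alpha-composite samples along rays.

    alpha : (N_rays, N_samples)    alpha_i = 1 - exp(-sigma_i * delta_i)
    rgb   : (N_rays, N_samples, 3) color of each sample
    """
    # T_i = prod_{j < i} (1 - alpha_j), an exclusive cumulative product (T_1 = 1)
    ones = torch.ones_like(alpha[:, :1])
    T = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * T                                  # w_i = alpha_i * T_i
    return (weights[..., None] * rgb).sum(dim=-2), weights

alpha = torch.rand(4, 64)
rgb = torch.rand(4, 64, 3)
color, weights = composite(alpha, rgb)
print(color.shape, weights.shape)                        # torch.Size([4, 3]) torch.Size([4, 64])
```

The tiny constant 1e-10 guards against a cumulative product of exact zeros.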

Why (trivially) differentiable?

Given the above renderer, a high-level sketch of the training pipeline is

$$
\left. \begin{array}{r} \text{ground truth} \ \mathbf{C} \\ (\boldsymbol{x}, \boldsymbol{d}) \xrightarrow{\text{MLP}} (\boldsymbol{c}, \sigma) \xrightarrow{\text{discrete rendering}} \text{prediction} \ \hat{\mathbf{C}} \end{array} \right\} \xrightarrow{\text{MSE}} \mathcal{L} = \| \hat{\mathbf{C}} - \mathbf{C} \|_2^2
$$

If the discrete renderer is differentiable, then the end-to-end model can be trained with gradient descent. Unsurprisingly, given a (sorted) sequence of random samples $\boldsymbol{t} = \{t_1, t_2, \ldots, t_N\}$, the derivatives are

$$
\begin{align*}
\left. \frac{d \hat{\mathbf{C}}}{d \boldsymbol{c}_i} \right|_{\boldsymbol{t}} &= T_i \left(1 - e^{-\sigma_i \delta_i} \right) \\
\left. \frac{d \hat{\mathbf{C}}}{d \sigma_i} \right|_{\boldsymbol{t}} &= \boldsymbol{c}_i \left( \frac{d T_i}{d \sigma_i} \left( 1 - e^{-\sigma_i \delta_i} \right) + T_i \frac{d}{d \sigma_i} \left( 1 - e^{-\sigma_i \delta_i} \right) \right) \\
&= \boldsymbol{c}_i \left( \left( 1 - e^{-\sigma_i \delta_i} \right) \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j \right) \underbrace{\frac{d}{d \sigma_i} \left( -\sum_{j=1}^{i-1} \sigma_j \delta_j \right)}_{0} + T_i \left( -e^{-\sigma_i \delta_i} \right) \frac{d(-\sigma_i \delta_i)}{d \sigma_i} \right) \\
&= \delta_i T_i \boldsymbol{c}_i e^{-\sigma_i \delta_i}
\end{align*}
$$

Strictly speaking, the second line differentiates only the $i$-th summand; $\sigma_i$ also appears in $T_k$ for $k > i$, and automatic differentiation accounts for those terms through the chain rule as well. Once the renderer is differentiable, the weights and biases of the MLP can be updated via backpropagation.
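As a quick check (an illustrative sketch, not from the repository), autograd reproduces the analytical gradient $d\hat{\mathbf{C}}/d\boldsymbol{c}_i = T_i\left(1 - e^{-\sigma_i \delta_i}\right)$:

```python
import torch

N = 8
sigma = torch.rand(N, requires_grad=True)
delta = torch.rand(N) * 0.1
c = torch.rand(N, 3, requires_grad=True)

alpha = 1.0 - torch.exp(-sigma * delta)
T = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha]), dim=0)[:-1]   # T_1 = 1
C_hat = (T[:, None] * alpha[:, None] * c).sum(dim=0)                     # rendered color, shape (3,)

C_hat.sum().backward()                     # sum over channels so backward() gets a scalar
print(torch.allclose(c.grad, (T * alpha)[:, None].expand(N, 3)))         # True
```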

Coarse-to-fine approach

NeRF jointly optimizes a coarse and a fine network. The coarse network is evaluated at $N_C$ stratified samples, and its weights $w_i$ define a piecewise-constant PDF along the ray from which $N_F$ additional samples are drawn for the fine network (hierarchical sampling).
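A minimal sketch of the underlying inverse-transform sampling (a simplified stand-in for the repository's sample_pdf(…), assuming the bin edges are given explicitly):

```python
import torch

def sample_from_weights(bin_edges, weights, n_fine):
    """Draw extra depths per ray by inverting the CDF of the coarse weights.

    bin_edges : (N_rays, N_bins + 1) depth interval edges along each ray
    weights   : (N_rays, N_bins)     coarse weights w_i = T_i * alpha_i
    Returns (N_rays, n_fine) depths concentrated where the weights are large.
    """
    pdf = weights / (weights.sum(dim=-1, keepdim=True) + 1e-5)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)     # (N_rays, N_bins + 1)

    u = torch.rand(cdf.shape[0], n_fine)                               # uniform samples in [0, 1)
    above = torch.searchsorted(cdf, u, right=True).clamp(max=cdf.shape[-1] - 1)
    below = (above - 1).clamp(min=0)

    cdf_lo, cdf_hi = torch.gather(cdf, -1, below), torch.gather(cdf, -1, above)
    z_lo, z_hi = torch.gather(bin_edges, -1, below), torch.gather(bin_edges, -1, above)
    t = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-5)                        # position within the chosen bin
    return z_lo + t * (z_hi - z_lo)
```

Samples drawn this way concentrate near depths where the coarse weights, i.e. the estimated $p_\text{hit}$, are large; they are then sorted together with the coarse samples and fed to the fine network.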

Analysis

Whereas NeRF is originally implemented in TensorFlow, the code analysis here is based on a faithful reproduction in PyTorch. The repository is organized as

Directory organization

Let's experiment with the LLFF dataset, which comprises front-facing scenes with camera poses. The pertinent directories and files are

| Item | Type | Description |
| --- | --- | --- |
| `configs` | directory | contains per-scene configurations (`.txt`) for the LLFF dataset |
| `download_example_data.sh` | shell script | downloads the datasets |
| `load_llff.py` | Python script | data loader for the LLFF dataset |
| `run_nerf.py` | Python script | main procedures |
| `run_nerf_helpers.py` | Python script | utility functions |

Modified indentation and comments

The code in this post deviates slightly from the original version. Dataflow and function calls remain intact, whereas indentation and comments are modified for readability.

The big picture

if __name__ == '__main__':
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
    train()


 

As shown, train(…) in run_nerf.py is the entry point of the project. The entire training process is

def train():

    parser = config_parser()
    args = parser.parse_args()

    # load data
    K = None
    if args.dataset_type == 'llff':
        images, poses, bds, render_poses, i_test = load_llff_data(args.datadir, args.factor,
                                                                  recenter=True, bd_factor=.75,
                                                                  spherify=args.spherify)
        hwf = poses[0,:3,-1]
        poses = poses[:,:3,:4]
        print('Loaded llff', images.shape, render_poses.shape, hwf, args.datadir)
        if not isinstance(i_test, list):
            i_test = [i_test]

        if args.llffhold > 0:
            print('Auto LLFF holdout,', args.llffhold)
            i_test = np.arange(images.shape[0])[ : :args.llffhold]

        i_val = i_test
        i_train = np.array([i for i in np.arange(int(images.shape[0]))
                                  if (i not in i_test and i not in i_val)])

        print('DEFINING BOUNDS')
        if args.no_ndc:
            near = np.ndarray.min(bds) * .9
            far = np.ndarray.max(bds) * 1.
            
        else:
            near = 0.
            far = 1.
        print('NEAR FAR', near, far)

    elif args.dataset_type == 'blender':
        images, poses, render_poses, hwf, i_split = load_blender_data(args.datadir, args.half_res, args.testskip)
        print('Loaded blender', images.shape, render_poses.shape, hwf, args.datadir)
        i_train, i_val, i_test = i_split

        near = 2.
        far = 6.

        if args.white_bkgd:
            images = images[...,:3]*images[...,-1:] + (1.-images[...,-1:])
        else:
            images = images[...,:3]

    elif args.dataset_type == 'LINEMOD':
        images, poses, render_poses, hwf, K, i_split, near, far = load_LINEMOD_data(args.datadir, args.half_res, args.testskip)
        print(f'Loaded LINEMOD, images shape: {images.shape}, hwf: {hwf}, K: {K}')
        print(f'[CHECK HERE] near: {near}, far: {far}.')
        i_train, i_val, i_test = i_split

        if args.white_bkgd:
            images = images[...,:3]*images[...,-1:] + (1.-images[...,-1:])
        else:
            images = images[...,:3]

    elif args.dataset_type == 'deepvoxels':

        images, poses, render_poses, hwf, i_split = load_dv_data(scene=args.shape,
                                                                 basedir=args.datadir,
                                                                 testskip=args.testskip)

        print('Loaded deepvoxels', images.shape, render_poses.shape, hwf, args.datadir)
        i_train, i_val, i_test = i_split

        hemi_R = np.mean(np.linalg.norm(poses[:,:3,-1], axis=-1))
        near = hemi_R-1.
        far = hemi_R+1.

    else:
        print('Unknown dataset type', args.dataset_type, 'exiting')
        return

    # cast intrinsics to right types
    H, W, focal = hwf
    H, W = int(H), int(W)
    hwf = [H, W, focal]

    if K is None:
        K = np.array([
            [focal, 0    , 0.5*W],
            [0    , focal, 0.5*H],
            [0    , 0    , 1    ]
        ])

    if args.render_test:
        render_poses = np.array(poses[i_test])

    # create log dir and copy the config file
    basedir = args.basedir
    expname = args.expname
    os.makedirs(os.path.join(basedir, expname), exist_ok=True)
    f = os.path.join(basedir, expname, 'args.txt')
    with open(f, 'w') as file:
        for arg in sorted(vars(args)):
            attr = getattr(args, arg)
            file.write('{} = {}\n'.format(arg, attr))
    if args.config is not None:
        f = os.path.join(basedir, expname, 'config.txt')
        with open(f, 'w') as file:
            file.write(open(args.config, 'r').read())

    # create nerf model
    render_kwargs_train, render_kwargs_test, start, grad_vars, optimizer = create_nerf(args)
    global_step = start

    bds_dict = {'near': near,
                'far' : far}
    render_kwargs_train.update(bds_dict)
    render_kwargs_test.update(bds_dict)

    # move test data to GPU
    render_poses = torch.Tensor(render_poses).to(device)

    # short circuit if rendering from trained model
    if args.render_only:
        print('RENDER ONLY')
        with torch.no_grad():
            if args.render_test:
                images = images[i_test] # switch to test poses
            else:
                # default is smoother render_poses path
                images = None

            testsavedir = os.path.join(basedir, expname, 'renderonly_{}_{:06d}'.format('test' if args.render_test else 'path', start))
            os.makedirs(testsavedir, exist_ok=True)
            print('test poses shape', render_poses.shape)

            rgbs, _ = render_path(render_poses, hwf, K, args.chunk, render_kwargs_test, gt_imgs=images, savedir=testsavedir, render_factor=args.render_factor)
            print('Done rendering', testsavedir)
            imageio.mimwrite(os.path.join(testsavedir, 'video.mp4'), to8b(rgbs), fps=30, quality=8)

            return

    # prepare raybatch tensor if batching random rays
    N_rand = args.N_rand
    use_batching = not args.no_batching
    if use_batching:
        # random ray batching
        print('get rays')
        rays = np.stack([get_rays_np(H, W, K, p) for p in poses[:,:3,:4]], 0) # (num_img, ro+rd, H, W, 3)
        print('done, concats')
        rays_rgb = np.concatenate([rays, images[:,None]], 1) # (num_img, ro+rd+rgb, H, W, 3)
        rays_rgb = np.transpose(rays_rgb, [0,2,3,1,4]) # (num_img, H, W, ro+rd+rgb, 3)
        rays_rgb = np.stack([rays_rgb[i] for i in i_train], 0) # training set only
        rays_rgb = np.reshape(rays_rgb, [-1,3,3]) # ((num_img-1)*H*W, ro+rd+rgb, 3)
        rays_rgb = rays_rgb.astype(np.float32)
        print('shuffle rays')
        np.random.shuffle(rays_rgb)

        print('done')
        i_batch = 0

    # move training data to GPU
    if use_batching:
        images = torch.Tensor(images).to(device)
    poses = torch.Tensor(poses).to(device)
    if use_batching:
        rays_rgb = torch.Tensor(rays_rgb).to(device)


    N_iters = 200000 + 1
    print('Begin')
    print('TRAIN views are', i_train)
    print('TEST views are', i_test)
    print('VAL views are', i_val)

    # summary writers
    #writer = SummaryWriter(os.path.join(basedir, 'summaries', expname))
    
    start = start + 1
    for i in trange(start, N_iters):
        time0 = time.time()

        # sample random ray batch
        if use_batching:
            # random over all images
            batch = rays_rgb[i_batch:i_batch+N_rand] # (B, 2+1, 3*?)
            batch = torch.transpose(batch, 0, 1)
            batch_rays, target_s = batch[:2], batch[2]

            i_batch += N_rand
            if i_batch >= rays_rgb.shape[0]:
                print("Shuffle data after an epoch!")
                rand_idx = torch.randperm(rays_rgb.shape[0])
                rays_rgb = rays_rgb[rand_idx]
                i_batch = 0
        else:
            # random from one image
            img_i = np.random.choice(i_train)
            target = images[img_i]
            target = torch.Tensor(target).to(device)
            pose = poses[img_i, :3,:4]

            if N_rand is not None:
                rays_o, rays_d = get_rays(H, W, K, torch.Tensor(pose))  # (H, W, 3), (H, W, 3)

                if i < args.precrop_iters:
                    dH = int(H//2 * args.precrop_frac)
                    dW = int(W//2 * args.precrop_frac)
                    coords = torch.stack(
                        torch.meshgrid(
                            torch.linspace(H//2 - dH, H//2 + dH - 1, 2*dH), 
                            torch.linspace(W//2 - dW, W//2 + dW - 1, 2*dW)
                        ), -1)
                    if i == start:
                        print(f"[Config] Center cropping of size {2*dH} x {2*dW} is enabled until iter {args.precrop_iters}")                
                else:
                    coords = torch.stack(torch.meshgrid(torch.linspace(0, H-1, H), torch.linspace(0, W-1, W)), -1)  # (H, W, 2)

                coords = torch.reshape(coords, [-1,2])  # (H * W, 2)
                select_inds = np.random.choice(coords.shape[0], size=[N_rand], replace=False)  # (N_rand,)
                select_coords = coords[select_inds].long()  # (N_rand, 2)
                rays_o = rays_o[select_coords[:, 0], select_coords[:, 1]]  # (N_rand, 3)
                rays_d = rays_d[select_coords[:, 0], select_coords[:, 1]]  # (N_rand, 3)
                batch_rays = torch.stack([rays_o, rays_d], 0)
                target_s = target[select_coords[:, 0], select_coords[:, 1]]  # (N_rand, 3)

        #####  core optimization loop  #####
        rgb, disp, acc, extras = render(H, W, K, chunk=args.chunk, rays=batch_rays,
                                                 verbose=i < 10, retraw=True,
                                                 **render_kwargs_train)
        optimizer.zero_grad()
        img_loss = img2mse(rgb, target_s)
        trans = extras['raw'][...,-1]
        loss = img_loss
        psnr = mse2psnr(img_loss)

        if 'rgb0' in extras:
            img_loss0 = img2mse(extras['rgb0'], target_s)
            loss = loss + img_loss0
            psnr0 = mse2psnr(img_loss0)

        loss.backward()
        optimizer.step()

        # NOTE: IMPORTANT!
        ###   update learning rate   ###
        decay_rate = 0.1
        decay_steps = args.lrate_decay * 1000
        new_lrate = args.lrate * (decay_rate ** (global_step / decay_steps))
        for param_group in optimizer.param_groups:
            param_group['lr'] = new_lrate
        ################################

        dt = time.time()-time0
        # print(f"Step: {global_step}, Loss: {loss}, Time: {dt}")
        #####           end            #####

        # rest is logging
        if i%args.i_weights==0:
            path = os.path.join(basedir, expname, '{:06d}.tar'.format(i))
            torch.save({
                'global_step': global_step,
                'network_fn_state_dict': render_kwargs_train['network_fn'].state_dict(),
                'network_fine_state_dict': render_kwargs_train['network_fine'].state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
            }, path)
            print('Saved checkpoints at', path)

        if i%args.i_video==0 and i > 0:
            # test mode
            with torch.no_grad():
                rgbs, disps = render_path(render_poses, hwf, K, args.chunk, render_kwargs_test)
            print('Done, saving', rgbs.shape, disps.shape)
            moviebase = os.path.join(basedir, expname, '{}_spiral_{:06d}_'.format(expname, i))
            imageio.mimwrite(moviebase + 'rgb.mp4', to8b(rgbs), fps=30, quality=8)
            imageio.mimwrite(moviebase + 'disp.mp4', to8b(disps / np.max(disps)), fps=30, quality=8)

            #if args.use_viewdirs:
            #    render_kwargs_test['c2w_staticcam'] = render_poses[0][:3,:4]
            #    with torch.no_grad():
            #        rgbs_still, _ = render_path(render_poses, hwf, args.chunk, render_kwargs_test)
            #    render_kwargs_test['c2w_staticcam'] = None
            #    imageio.mimwrite(moviebase + 'rgb_still.mp4', to8b(rgbs_still), fps=30, quality=8)

        if i%args.i_testset==0 and i > 0:
            testsavedir = os.path.join(basedir, expname, 'testset_{:06d}'.format(i))
            os.makedirs(testsavedir, exist_ok=True)
            print('test poses shape', poses[i_test].shape)
            with torch.no_grad():
                render_path(torch.Tensor(poses[i_test]).to(device), hwf, K, args.chunk, render_kwargs_test, gt_imgs=images[i_test], savedir=testsavedir)
            print('Saved test set')

        if i%args.i_print==0:
            tqdm.write(f"[TRAIN] Iter: {i} Loss: {loss.item()}  PSNR: {psnr.item()}")
        """
            print(expname, i, psnr.numpy(), loss.numpy(), global_step.numpy())
            print('iter time {:.05f}'.format(dt))

            with tf.contrib.summary.record_summaries_every_n_global_steps(args.i_print):
                tf.contrib.summary.scalar('loss', loss)
                tf.contrib.summary.scalar('psnr', psnr)
                tf.contrib.summary.histogram('tran', trans)
                if args.N_importance > 0:
                    tf.contrib.summary.scalar('psnr0', psnr0)

            if i%args.i_img==0:

                # log a rendered validation view to Tensorboard
                img_i=np.random.choice(i_val)
                target = images[img_i]
                pose = poses[img_i, :3,:4]
                with torch.no_grad():
                    rgb, disp, acc, extras = render(H, W, focal, chunk=args.chunk, c2w=pose,
                                                        **render_kwargs_test)
                psnr = mse2psnr(img2mse(rgb, target))

                with tf.contrib.summary.record_summaries_every_n_global_steps(args.i_img):

                    tf.contrib.summary.image('rgb', to8b(rgb)[tf.newaxis])
                    tf.contrib.summary.image('disp', disp[tf.newaxis,...,tf.newaxis])
                    tf.contrib.summary.image('acc', acc[tf.newaxis,...,tf.newaxis])

                    tf.contrib.summary.scalar('psnr_holdout', psnr)
                    tf.contrib.summary.image('rgb_holdout', target[tf.newaxis])

                if args.N_importance > 0:

                    with tf.contrib.summary.record_summaries_every_n_global_steps(args.i_img):
                        tf.contrib.summary.image('rgb0', to8b(extras['rgb0'])[tf.newaxis])
                        tf.contrib.summary.image('disp0', extras['disp0'][tf.newaxis,...,tf.newaxis])
                        tf.contrib.summary.image('z_std', extras['z_std'][tf.newaxis,...,tf.newaxis])
        """
        global_step += 1

which is lengthy and potentially obscure. The function calls are visualized below to aid comprehension of the rendering pipeline.

As captioned, let's concentrate on the implementation of volume rendering in this post. Our journey starts from ray generation (get_rays_np(…) at line `144`) and culminates in the learning-rate update (lines `242` to `246`) of each iteration.

Prior to rendering are the data loader (lines `7` to `104`) and network initialization (lines `107` to `116`). We omit the analysis of the data loader; loaded images and poses will be introduced upon their first appearance, and feel free to peruse this post for details. Neither will we delve into lines `119` to `136`, which terminate the program immediately after rendering a novel view or video, since testing NeRF is not our primary concern. Network creation and initialization will nevertheless be covered in the appendix.

Function analyses are organized in horizontal tabs.

As the function flow chart shows, the procedure calls in NeRF are complex. For clarity, each section contains a few horizontal tabs, each responsible for a single function.

Training set

Training preparation

    # move training data to GPU
    if use_batching:
        images = torch.Tensor(images).to(device)
    poses = torch.Tensor(poses).to(device)
    if use_batching:
        rays_rgb = torch.Tensor(rays_rgb).to(device)


    N_iters = 200000 + 1
    print('Begin')
    print('TRAIN views are', i_train)
    print('TEST views are', i_test)
    print('VAL views are', i_val)

    # summary writers
    #writer = SummaryWriter(os.path.join(basedir, 'summaries', expname))
    
    start = start + 1
    for i in trange(start, N_iters):
        time0 = time.time()

        # sample random ray batch
        if use_batching:
            # random over all images
            batch = rays_rgb[i_batch:i_batch+N_rand] # [B, 2+1, 3*?]
            batch = torch.transpose(batch, 0, 1)
            batch_rays, target_s = batch[:2], batch[2]

            i_batch += N_rand
            if i_batch >= rays_rgb.shape[0]:
                print("Shuffle data after an epoch!")
                rand_idx = torch.randperm(rays_rgb.shape[0])
                rays_rgb = rays_rgb[rand_idx]
                i_batch = 0
        else:

Lines `2` to `6` convert the training set from NumPy arrays to PyTorch tensors and "transfer" them to GPU RAM, and start = start + 1 (line `18`) marks the commencement of the training iterations. The training data are divided into batches: let $B$ denote the batch size ($B =$ N_rand); then the input batch_rays and the ground truth target_s have shapes $(2, B, 3)$ and $(B, 3)$ (line `27`). Lines `30` to `34` handle the out-of-bound case where the index i_batch exceeds the total number of rays. We do not care about the else block starting at line `35` since use_batching is enabled by default.

Rendering

for i in trange(start, N_iters):
        #####  core optimization loop  #####
        rgb, disp, acc, extras = render(H, W, K, chunk=args.chunk, rays=batch_rays,
                                                 verbose=i < 10, retraw=True,
                                                 **render_kwargs_train)

What ensues is volume rendering. The command-line argument chunk defines the number of rays processed concurrently, which affects performance rather than correctness. render_kwargs_train is a dictionary returned when the NeRF model is created (line `107` in train), with 2 more keys injected at line `112` (in train). Its contents are

| Key | Element | Description |
| --- | --- | --- |
| network_query_fn | a function | a subroutine that takes data and a network as input to perform the query |
| perturb | 1. | whether to adopt stratified sampling; 1. for True |
| N_importance | 128 | number of additional samples $N_F$ per ray in hierarchical sampling |
| network_fine | an object | the fine network |
| N_samples | 64 | number of samples $N_C$ per ray for the coarse network |
| network_fn | an object | the coarse network |
| use_viewdirs | True | whether to feed viewing directions to the network, indispensable for view-dependent appearance |
| white_bkgd | False | whether to assume a white background for rendering; this applies to the synthetic dataset only, which contains images (.png) with transparent backgrounds |
| raw_noise_std | 1. | magnitude of the noise injected into the volume density |
| near | 0. | lower bound of the rendering integral |
| far | 1. | upper bound of the rendering integral |

Note

See appendix for how a NeRF model is implemented.

Keyword arguments are passed from the high-level interface down to the "worker" procedures.

Wrapped by batchify_rays(…) and render(…), the training options render_kwargs_train defined above are passed down to the low-level "worker" render_rays(…), which is the core of volume rendering.
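A simplified sketch of the chunking wrapper (illustrative only; render_rays(…) is the repository's function, and the real arguments are elided):

```python
import torch

def batchify_rays(rays_flat, chunk=1024 * 32, **kwargs):
    """Render rays in smaller minibatches ("chunks") to avoid exhausting GPU memory."""
    all_ret = {}
    for i in range(0, rays_flat.shape[0], chunk):
        ret = render_rays(rays_flat[i:i + chunk], **kwargs)   # the low-level volume renderer
        for k, v in ret.items():
            all_ret.setdefault(k, []).append(v)
    return {k: torch.cat(v, 0) for k, v in all_ret.items()}
```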

Optimization

for i in trange(start, N_iters):
        …

        optimizer.zero_grad()
        img_loss = img2mse(rgb, target_s)
        trans = extras['raw'][...,-1]
        loss = img_loss
        psnr = mse2psnr(img_loss)

        if 'rgb0' in extras:
            img_loss0 = img2mse(extras['rgb0'], target_s)
            loss = loss + img_loss0
            psnr0 = mse2psnr(img_loss0)

        loss.backward()
        optimizer.step()

        # NOTE: IMPORTANT!
        ###   update learning rate   ###
        decay_rate = 0.1
        decay_steps = args.lrate_decay * 1000
        new_lrate = args.lrate * (decay_rate ** (global_step / decay_steps))
        for param_group in optimizer.param_groups:
            param_group['lr'] = new_lrate
        ################################

The radiance rgb $\in \mathbb{R}^{B \times 3}$ is compared against the ground truth target_s to obtain the MSE loss at line `5`. The total loss also includes that of the coarse network (line `12`). The coarse and fine networks are jointly optimized at lines `15` and `16`. Finally, the learning rate decays at lines `20` to `24`.
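The decay follows an exponential schedule. Written as a standalone function (restating the listing above, with illustrative argument values):

```python
def decayed_lrate(lrate, lrate_decay, global_step, decay_rate=0.1):
    """Exponential decay used above: lrate * 0.1 ** (global_step / (lrate_decay * 1000))."""
    decay_steps = lrate_decay * 1000
    return lrate * (decay_rate ** (global_step / decay_steps))

# e.g. an initial rate of 5e-4 with lrate_decay=250 drops tenfold every 250k steps
print(decayed_lrate(5e-4, 250, 250_000))   # ~5e-05
```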

Summary

This post derives the volume rendering integral and its numerical quadrature, and explains its connection to classical alpha compositing. The second part elaborates on the implementation of the rendering pipeline. Illustrations are included to assist in understanding procedures such as ray generation and Monte Carlo sampling. Most importantly, the article specifies the physical meaning of each variable and provides the mathematical operation behind each statement. In sum, the post functions as a complete guide for an in-depth comprehension of NeRF.

References

- Chapter 2.1 in Computer Vision: Algorithms and Applications
- Fundamentals of Computer Graphics
- NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
- NeRF PyTorch implementation by Yen-Chen Lin
- Neural Radiance Field's Volume Rendering 公式分析
- Optical Models for Direct Volume Rendering
- Part 1 of the SIGGRAPH 2021 course on Advances in Neural Rendering
- 深度解读yenchenlin/nerf-pytorch项目

Please use the following BibTeX to cite this post:

@misc{yyu2022nerfrendering,
    author = {YU Yue},
    title  = {NeRF: A Volume Rendering Perspective},
    year   = {2022},
    howpublished = {\url{https://yconquesty.github.io/blog/ml/nerf/nerf_rendering.html}}
}

Appendix

Content on the way. Stay tuned!

Errata

| Time | Modification |
| --- | --- |
| Aug 31 2022 | Initial release |
| Nov 24 2022 | Rectify reference list |
| Dec 2 2022 | Add BibTeX for citation |
| Apr 20 2023 | Elaborate on static compute graph |