TöRF: Time-of-Flight Radiance Fields for Dynamic Scene View Synthesis | NeurIPS 2021

Benjamin Attal (CMU), Eliot Laidlaw (Brown), Aaron Gokaslan (Cornell), Changil Kim (Facebook), Christian Richardt (U. of Bath), James Tompkin (Brown), and Matthew O'Toole (CMU)

TöRF = ToF + NeRF. Pronounced just like ‘Turf’.

Abstract

Neural networks can represent and accurately reconstruct radiance fields for static 3D scenes (e.g., NeRF). Several works extend these to dynamic scenes captured with monocular video, with promising performance. However, the monocular setting is known to be an under-constrained problem, and so methods rely on data-driven priors for reconstructing dynamic content. We replace these priors with measurements from a time-of-flight (ToF) camera, and introduce a neural representation based on an image formation model for continuous-wave ToF cameras. Instead of working with processed depth maps, we model the raw ToF sensor measurements to improve reconstruction quality and avoid issues with low reflectance regions, multi-path interference, and a sensor's limited unambiguous depth range. We show that this approach improves robustness of dynamic scene reconstruction to erroneous calibration and large motions, and discuss the benefits and limitations of integrating RGB+ToF sensors that are now available on modern smartphones.



Paper

Video



Key idea


Novel-view synthesis (NVS) is a long-standing problem in computer graphics and computer vision, where the objective is to photorealistically render images of a scene from novel viewpoints. Given a number of images taken from different viewpoints, it is possible to infer both the geometry and appearance of a scene, and then use this information to synthesize images at novel camera poses.

The key idea behind TöRF is to tackle the NVS problem by taking advantage of the depth sensors available on many consumer devices (phones, tablets, laptops). Specifically, we use both the RGB images captured with regular cameras and the depth information captured with time-of-flight (ToF) cameras to optimize for a neural radiance field. Because the depth images produced by ToF cameras can be unreliable, TöRF instead models the raw data captured by a continuous-wave ToF camera (referred to as phasor images), which leads to better view synthesis results.


Illustration of Time-of-Flight Radiance Fields. (a) We move a handheld imaging system around a dynamic scene, capturing (b) color images and (c) raw phasor images with a continuous-wave ToF camera. (d) Then, we optimize for a continuous neural radiance field of the scene that predicts the captured color and phasor images. This allows novel view synthesis.
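For reference, a continuous-wave ToF camera measures at each pixel a complex-valued phasor: its phase encodes the round-trip travel time of the light, and its magnitude encodes the strength of the returned signal; a "derived" depth map is obtained by inverting the phase. The sketch below is a minimal NumPy illustration assuming an idealized single-bounce model; the function names and the 30 MHz modulation frequency are our own choices, not taken from the released code.

import numpy as np

C = 3e8        # speed of light [m/s]
F_MOD = 30e6   # modulation frequency [Hz]; illustrative value

def phasor_from_depth(depth, amplitude):
    """Idealized single-bounce phasor for a return at `depth` metres.

    The phase advances with the round-trip distance 2 * depth; the
    magnitude is the strength of the light returned to the sensor.
    """
    phase = 4.0 * np.pi * F_MOD * depth / C
    return amplitude * np.exp(1j * phase)

def depth_from_phasor(phasor):
    """Depth *derived* from a measured phasor, as a ToF camera would report it.

    The angle is only recovered modulo 2*pi, which is the root cause of the
    phase wrapping discussed below.
    """
    phase = np.angle(phasor) % (2.0 * np.pi)
    return C * phase / (4.0 * np.pi * F_MOD)

TöRF supervises the real and imaginary parts of a rendered phasor against the raw measurement, rather than against the output of a depth_from_phasor-style conversion.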

Benefits of using raw phasor supervision vs. derived depth


Resolving phase wrapping

Using raw phasor supervision allows us to reconstruct depth ranges that otherwise wrap around in the derived depth.
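For intuition, the unambiguous range of a continuous-wave ToF camera is c / (2f), and any surface beyond it is aliased back into that range when depth is derived from phase. A small self-contained sketch, using the same illustrative 30 MHz modulation frequency as above:

import numpy as np

C, F_MOD = 3e8, 30e6                       # speed of light, modulation frequency
unambiguous_range = C / (2.0 * F_MOD)      # 5 m at 30 MHz

true_depth = 6.5                           # a wall beyond the unambiguous range
phase = (4.0 * np.pi * F_MOD * true_depth / C) % (2.0 * np.pi)
derived_depth = C * phase / (4.0 * np.pi * F_MOD)
print(unambiguous_range, derived_depth)    # 5.0, 1.5 -- wrapped by one period

In the derived depth, such a wall would appear at 1.5 m; supervising on the raw phasor instead leaves the radiance field free to place it at its true distance, provided the rendered phasor matches the measurement.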

In the Photocopier sequence below, the derived ToF depth of the far wall on the right is incorrect; our approach reconstructs this region more faithfully, whereas VideoNeRF, which uses the derived depth, produces large errors (video far right).

  


Low-reflectance noise

Depth values become unreliable (noisy) when the amount of light reflected to the camera is small. Modeling the phasor images directly makes the solution more robust to sensor noise.
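One way to see why: sensor noise is roughly additive on the raw phasor, so when little light returns, the phase (and therefore the derived depth) becomes very noisy, while residuals measured directly on the phasor stay at the noise level. A toy simulation of our own; the amplitudes and noise level are illustrative, not calibrated to any sensor.

import numpy as np

C, F_MOD = 3e8, 30e6
rng = np.random.default_rng(0)

def simulate(amplitude, depth=2.0, sigma=0.01, n=10_000):
    """Per-pixel phasor with additive Gaussian noise on the real/imaginary parts."""
    phase = 4.0 * np.pi * F_MOD * depth / C
    clean = amplitude * np.exp(1j * phase)
    noisy = clean + sigma * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
    derived = C * (np.angle(noisy) % (2.0 * np.pi)) / (4.0 * np.pi * F_MOD)
    return derived.std(), np.abs(noisy - clean).mean()

for amplitude in (0.5, 0.02):              # bright vs. dark (low-reflectance) pixel
    print(amplitude, simulate(amplitude))
# The derived-depth scatter grows by more than an order of magnitude for the
# dark pixel, while the raw phasor residual stays at the same scale.

A loss defined on the phasor therefore naturally down-weights unreliable low-amplitude pixels instead of trying to fit their noisy derived depths.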

In the DeskBox sequence below, the derived ToF depth on the back of the monitor is noisy (bright yellow sparkles), but our approach reconstructs this surface more cleanly.

  


Better handling of multi-path interference

In ToF imaging, the detected light may not travel along a single path. The sensor then measures a mixture of phasors, whose phase may not correspond to any single depth. This occurs near depth edges, as well as for specular reflections and transparent surfaces. We model the response from multiple single-scattering events along a ray to handle such scenarios better.
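To make this concrete, the toy example below (our own, with the same illustrative modulation frequency as above) shows how a two-path mixture corrupts the derived depth, and how accumulating one single-scattering phasor per sample along a ray, in the spirit of NeRF-style volume rendering, explains the same raw measurement with density at both true depths.

import numpy as np

C, F_MOD = 3e8, 30e6   # speed of light, modulation frequency (as above)

def phasor_from_depth(depth, amplitude=1.0):
    return amplitude * np.exp(1j * 4.0 * np.pi * F_MOD * depth / C)

def depth_from_phasor(p):
    return C * (np.angle(p) % (2.0 * np.pi)) / (4.0 * np.pi * F_MOD)

# A pixel near a depth edge (or looking through glass) receives two returns of
# similar strength; the camera measures their sum.
mixed = 0.5 * phasor_from_depth(1.0) + 0.5 * phasor_from_depth(3.0)
print(depth_from_phasor(mixed))      # ~2.0 m -- matches neither surface

# Accumulating one single-scattering contribution per ray sample (the weights
# stand in for density * transmittance) reproduces the raw measurement, so
# geometry at *both* depths is consistent with the phasor supervision.
rendered = sum(w * phasor_from_depth(d) for d, w in zip([1.0, 3.0], [0.5, 0.5]))
print(abs(rendered - mixed))         # ~0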

The Cupboard sequence below shows improved reconstruction of specular surfaces, such as the fridge on the right, whose depth values should reflect the surrounding environment.

  


Results

Note: This page contains many videos. Each set of results can be shown or hidden with a button to reduce the load on the browser.

Baselines

  • NeRF: A method to represent static scenes without any depth or raw time-of-flight supervision.
  • NSFF: A method to represent dynamic scenes using depth and flow supervision as input. Following the authors' approach, we provide input depth estimated by a single-image depth estimation network.
    Note that as NSFF computes scene flow, it can interpolate in time as well as space; we show results without time interpolation.
  • VideoNeRF: A method to represent dynamic scenes using depth supervision as input. IMPORTANT: For this method, we modified VideoNeRF to accept input depth derived from a time-of-flight sensor. Moreover, note that on our dynamic sequences (e.g., Cupboard, DeskBox, Photocopier), VideoNeRF sometimes fails to reconstruct accurate camera poses via COLMAP; we therefore use poses estimated by our TöRF approach to enable a reasonable comparison.
  • TöRF: Our proposed method to represent dynamic scenes using supervision from raw phasor time-of-flight data rather than derived depth. For the closest comparison of derived-depth versus raw-phasor supervision, compare VideoNeRF and TöRF.
Note: We originally relied on BRISQUE scores to evaluate image quality of the synthesized images when no reference image is available. We no longer report these scores, as they provide a false impression of the performance of the various baseline methods.


Code and data

Bibtex

@article{attal2021torf,
  title={T{\"o}RF: Time-of-Flight Radiance Fields for Dynamic Scene View Synthesis},
  author={Attal, Benjamin and Laidlaw, Eliot and Gokaslan, Aaron and Kim, Changil and Richardt, Christian and Tompkin, James and O'Toole, Matthew},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

Sponsors

For funding, Matthew O'Toole acknowledges support from NSF IIS-2008464, James Tompkin thanks an Amazon Research Award and NSF CNS-2038897, and Christian Richardt acknowledges funding from an EPSRC-UKRI Innovation Fellowship (EP/S001050/1) and RCUK grant CAMERA (EP/M023281/1, EP/T022523/1).


Copyright © 2021 Benjamin Attal