Benjamin Attal (CMU), Eliot Laidlaw (Brown), Aaron Gokaslan (Cornell), Changil Kim (Facebook), Christian Richardt (U. of Bath), James Tompkin (Brown), and Matthew O'Toole (CMU)
Neural networks can represent and accurately reconstruct radiance fields for static 3D scenes (e.g., NeRF). Several works extend these to dynamic scenes captured with monocular video, with promising performance. However, the monocular setting is known to be an under-constrained problem, and so methods rely on data-driven priors for reconstructing dynamic content. We replace these priors with measurements from a time-of-flight (ToF) camera, and introduce a neural representation based on an image formation model for continuous-wave ToF cameras. Instead of working with processed depth maps, we model the raw ToF sensor measurements to improve reconstruction quality and avoid issues with low reflectance regions, multi-path interference, and a sensor's limited unambiguous depth range. We show that this approach improves robustness of dynamic scene reconstruction to erroneous calibration and large motions, and discuss the benefits and limitations of integrating RGB+ToF sensors that are now available on modern smartphones.
Novel-view synthesis (NVS) is a long-standing problem in computer graphics and computer vision, where the objective is to photorealistically render images of a scene from novel viewpoints. Given a number of images taken from different viewpoints, it is possible to infer both the geometry and appearance of a scene, and then use this information to synthesize images at novel camera poses.
The key idea behind TöRF is to tackle the NVS problem by taking advantage of the depth sensors available on many consumer devices (phones, tablets, laptops). Specifically, we use both the RGB images captured with regular cameras and the depth information captured with time-of-flight (ToF) cameras to optimize for a neural radiance field. Because the depth images produced by ToF cameras can be unreliable, TöRF instead models the raw data captured by a continuous-wave ToF camera (referred to as phasor images), which leads to better view synthesis results.
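To give a feel for what rendering a phasor image involves, below is a minimal sketch of volume-rendering a complex-valued phasor along one camera ray. It assumes a co-located light source and sensor, a NeRF-style quadrature over sampled densities and reflected amplitudes, and inverse-square falloff; the function and its arguments are simplified stand-ins, not the actual TöRF implementation.

import numpy as np

def render_phasor_along_ray(sigmas, amplitudes, t_vals, wavelength):
    """Sketch: volume-render a complex phasor for one ray of a CW-ToF camera.

    sigmas     : (N,) volume densities at the sampled points
    amplitudes : (N,) reflected-light amplitudes at the sampled points
    t_vals     : (N,) distances of the samples from the camera
    wavelength : modulation wavelength (unambiguous range = wavelength / 2)
    """
    deltas = np.append(np.diff(t_vals), 1e10)            # spacing between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)              # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # transmittance

    # A single-scattering event at distance t contributes a phasor whose phase
    # encodes the round-trip path length 2t, attenuated by inverse-square falloff.
    phasors = (amplitudes / np.maximum(t_vals, 1e-6) ** 2) * np.exp(
        1j * 2.0 * np.pi * (2.0 * t_vals) / wavelength)

    weights = trans * alphas
    return np.sum(weights * phasors)                     # complex phasor-image pixel

The color image for the same ray is rendered with the usual real-valued NeRF quadrature from the same field, so the captured color and phasor images jointly supervise the optimization.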
Illustration of Time-of-Flight Radiance Fields. (a) We move a handheld imaging system around a dynamic scene, capturing (b) color images and (c) raw phasor images with a continuous-wave ToF camera. (d) Then, we optimize for a continuous neural radiance field of the scene that predicts the captured color and phasor images. This allows novel view synthesis.
Supervising with raw phasor images allows us to reconstruct scene regions beyond the sensor's unambiguous range, which would otherwise wrap around in the derived depth.
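As a toy illustration of this wrap-around (the 7.5 m unambiguous range and 9 m wall below are made-up numbers, not our capture setup):

import numpy as np

unambiguous_range = 7.5   # metres; equal to half the modulation wavelength
true_depth = 9.0          # a wall beyond the unambiguous range

phase = 2.0 * np.pi * (2.0 * true_depth) / (2.0 * unambiguous_range)  # round-trip phase
derived_depth = (phase % (2.0 * np.pi)) / (2.0 * np.pi) * unambiguous_range
print(derived_depth)      # 1.5 -- the wall wraps around to 1.5 m in the derived depth map

Because the phasor itself is periodic, it remains consistent with the true 9 m surface; together with the multi-view color supervision, the radiance field can place the geometry at the correct unwrapped depth.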
In the Photocopier sequence below, the derived ToF depth is incorrect on the far wall to the right, but our approach reconstructs this region more faithfully. Using the derived depth in VideoNeRF produces large errors (video far right).
Depth values become unreliable (noisy) when little light is reflected back to the camera. Modeling the phasor images directly makes the reconstruction more robust to this sensor noise.
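A quick simulation of why this helps (the amplitudes and noise level below are purely illustrative): the same additive sensor noise that barely perturbs a bright pixel causes large errors once a dim pixel's phasor is converted to depth, whereas the raw phasor still carries the low amplitude that marks the measurement as unreliable.

import numpy as np

rng = np.random.default_rng(0)
unambiguous_range = 7.5
true_depth = 3.0
phase = 2.0 * np.pi * true_depth / unambiguous_range

# Identical additive noise applied to a bright and a dim phasor.
noise = 0.02 * (rng.standard_normal(10000) + 1j * rng.standard_normal(10000))
for amplitude in (1.0, 0.05):
    phasor = amplitude * np.exp(1j * phase) + noise
    derived = np.angle(phasor) % (2.0 * np.pi) / (2.0 * np.pi) * unambiguous_range
    print(amplitude, derived.std())   # depth noise grows sharply as amplitude drops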
In the DeskBox sequence below, the derived ToF depth is noisy on the back of the monitor (bright yellow sparkles), but our approach reconstructs this surface more cleanly.
In ToF imaging, the detected light may not travel along a single path; this produces a mixture of phasors whose phase may not correspond to any single depth. Such multi-path interference occurs near depth edges, as well as at specular reflections and transparent surfaces. We model the response from multiple single-scattering events along each ray to better handle these scenarios.
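A toy example of such phasor mixing (made-up depths, equal weights, no distance falloff): two returns at different depths sum to a phasor that decodes to a depth matching neither surface.

import numpy as np

unambiguous_range = 7.5

def phasor(depth, amplitude=1.0):
    # Toy single-bounce phasor for a surface at the given depth.
    return amplitude * np.exp(1j * 2.0 * np.pi * depth / unambiguous_range)

def decode_depth(p):
    return np.angle(p) % (2.0 * np.pi) / (2.0 * np.pi) * unambiguous_range

# At a depth edge, a pixel receives light from a foreground surface at 1 m
# and from the background at 3 m; the mixture decodes to a phantom depth.
mixed = 0.5 * phasor(1.0) + 0.5 * phasor(3.0)
print(decode_depth(mixed))   # 2.0 -- matches neither surface

Summing single-scattering phasor contributions along each ray, as in the rendering sketch above, lets the model explain such mixtures rather than committing to a single depth per pixel.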
The Cupboard sequence below shows improved reconstruction of specular surfaces, such as the fridge on the right, whose measured depth values should reflect the surrounding environment.
Please visit https://github.com/breuckelen/torf.
@article{attal2021torf,
title={T{\"o}RF: Time-of-Flight Radiance Fields for Dynamic Scene View Synthesis},
author={Attal, Benjamin and Laidlaw, Eliot and Gokaslan, Aaron and Kim, Changil and Richardt, Christian and Tompkin, James and O'Toole, Matthew},
journal={Advances in Neural Information Processing Systems},
volume={34},
year={2021}
}
For funding, Matthew O'Toole acknowledges support from NSF IIS-2008464, James Tompkin thanks an Amazon Research Award and NSF CNS-2038897, and Christian Richardt acknowledges funding from an EPSRC-UKRI Innovation Fellowship (EP/S001050/1) and RCUK grant CAMERA (EP/M023281/1, EP/T022523/1).