High-quality reconstruction of controllable 3D head avatars from 2D videos is highly desirable for virtual human applications in movies, games, and telepresence. Neural implicit fields provide a powerful representation to model 3D head avatars with personalized shape, expressions, and facial parts, e.g., hair and mouth interior, that go beyond the linear 3D morphable model (3DMM). However, existing methods trained on monocular videos either fail to capture fine-scale facial features or lack local control over facial parts, and cannot extrapolate to asymmetric expressions. Further, most condition only on 3DMM parameters, which have poor locality, and resolve local features with a single global neural field. We build on part-based implicit shape models that decompose a global deformation field into local ones. Our novel formulation models multiple implicit deformation fields with local semantic rig-like control via 3DMM-based parameters and representative facial landmarks. Further, we propose a local control loss and an attention mask mechanism that promote sparsity in each learned deformation field. Our formulation renders locally controllable nonlinear deformations that are sharper than those of previous implicit monocular approaches, especially for the mouth interior, asymmetric expressions, and facial details.
Given an input video sequence, we run a face tracker to obtain, at each frame, the following parameters of a linear 3DMM: the expression code, the pose code, and sparse 3D facial landmarks. An attention mask with local spatial support is also pre-computed from the 3DMM. We model dynamic deformations as a translation of each observed point to the canonical space. We decompose the global deformation field into multiple local fields, each centered around a representative landmark. We enforce sparsity of each field via an attention mask that modulates the expression code. Our implicit representation is learned using RGB information, geometric regularization and priors, and a novel local control loss.
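The decomposition above can be sketched in a minimal NumPy form. This is an illustrative assumption of how the pieces fit together, not the paper's implementation: the Gaussian blending weights, the `sigma` bandwidth, and the per-part `mlps` callables are all hypothetical stand-ins for the learned local deformation fields.

```python
import numpy as np

def deform_to_canonical(points, landmarks, expr_code, attention_masks, mlps, sigma=0.1):
    """Sketch: decompose a global deformation field into local fields,
    each centered on a representative 3D landmark.

    points:          (N, 3) observed points
    landmarks:       (K, 3) representative 3D facial landmarks
    expr_code:       (E,)   3DMM expression code
    attention_masks: (K, E) per-part masks that modulate the expression code
    mlps:            list of K callables mapping (N, 3 + E) -> (N, 3) translations
    """
    N = points.shape[0]
    # Radial weights so each local field only influences nearby points
    # (Gaussian falloff around each landmark is an assumption for this sketch).
    d2 = ((points[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)   # (N, K)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    w = w / (w.sum(axis=1, keepdims=True) + 1e-8)                      # normalize

    total = np.zeros_like(points)
    for k, mlp in enumerate(mlps):
        # Attention mask sparsifies which expression dimensions each part sees.
        masked_code = expr_code * attention_masks[k]
        inp = np.concatenate([points, np.tile(masked_code, (N, 1))], axis=1)
        total += w[:, k : k + 1] * mlp(inp)                            # (N, 3)
    # Dynamic deformation = translation of each observed point to canonical space.
    return points + total
```

With zero-output MLPs the deformation vanishes and each point maps to itself, which makes the blending easy to sanity-check before plugging in learned networks.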
Our approach reconstructs test images in great detail with accurate geometry. The facial deformations caused by expressions are decomposed into local deformations, shown in the following visualization.
We capture a user's expressions (top) via DECA and transfer the 3DMM parameters (middle) to the neural head model of the subject. The model produces asymmetric expressions under the user's pose with a high level of detail (bottom) that surpasses the linear 3DMM. Note that none of the transferred expressions were in the training set.