If you're paying attention to algorithmically generated avatars, you're probably asking yourself one question: what does it take to create my own avatar?
The answer we give: a short video shot with your mobile phone is enough! Reconstructing a high-fidelity 3D digital avatar from phone footage has always been a challenging task, mainly because it is difficult to control expressions accurately, especially exaggerated expressions and micro-expressions. Existing algorithms typically rely on a limited set of linear expression coefficients to parameterize expressions, and such linear representations struggle to model rich facial expression details.
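For context, the linear expression models referred to here, such as 3DMM blendshape coefficients, typically write the face as a mean shape plus a weighted sum of a fixed set of expression bases, roughly:

$$ V \approx \bar{V} + \sum_{i=1}^{n} \beta_i \, B_i $$

where a few dozen coefficients $\beta_i$ are all that control the expression; this low-dimensional linear span is what makes fine wrinkles and subtle micro-expressions hard to reproduce.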
Recently, researchers from Tsinghua University and Xinchangyuan Technology proposed a new method, LatentAvatar, which uses latent features learned by deep networks together with neural radiance field techniques to tackle these difficulties head-on. The work was published at SIGGRAPH 2023, the top conference in graphics. Convenient capture equipment, high-fidelity avatars, lifelike expression control: LatentAvatar delivers all of it.
Next, let's take a look at how it works.
Expression-controllable neural radiance field.
The core idea of LatentAvatar is to abandon expression modeling based on linear expression bases and instead model expressions in the latent space of a deep network. To do this, LatentAvatar first constructs an expression latent space, together with an encoder that maps the face image into this latent space. A tri-plane-based neural radiance field is then generated from the expression latent code, as shown in the figure below. The tri-plane representation and the neural radiance field give the learned latent space 3D awareness, and with a simple reconstruction loss the latent space can capture the high-frequency facial texture details of the target subject, handling exaggerated expressions and micro-expressions well. Compared with previous solutions, because the expression latent code is learned end-to-end from the input monocular video, LatentAvatar avoids the tracking and fitting errors of face templates, achieving richer and more accurate expression modeling.
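To make the pipeline concrete, here is a minimal PyTorch-style sketch of the idea described above. It is not the authors' code; all module names, layer sizes, and dimensions are illustrative assumptions. An encoder maps a face image to an expression latent code, a generator turns that code into tri-plane features, and a small MLP decodes features sampled at 3D points into NeRF density and color.

```python
# Minimal sketch (not the official LatentAvatar implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionEncoder(nn.Module):
    """Maps a face image to a low-dimensional expression latent code."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, latent_dim)

    def forward(self, img):                           # img: (B, 3, H, W)
        return self.fc(self.conv(img).flatten(1))     # (B, latent_dim)

class TriPlaneGenerator(nn.Module):
    """Generates three feature planes (XY, XZ, YZ) from the expression latent."""
    def __init__(self, latent_dim=256, feat_dim=32, res=64):
        super().__init__()
        self.res, self.feat_dim = res, feat_dim
        self.fc = nn.Linear(latent_dim, 3 * feat_dim * res * res)

    def forward(self, z):                              # z: (B, latent_dim)
        return self.fc(z).view(-1, 3, self.feat_dim, self.res, self.res)

class NeRFDecoder(nn.Module):
    """Decodes aggregated tri-plane features at 3D points into density + RGB."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 4),          # (sigma, r, g, b)
        )

    def forward(self, planes, pts):                    # pts: (B, N, 3) in [-1, 1]
        feats = 0
        for i, (a, b) in enumerate([(0, 1), (0, 2), (1, 2)]):  # XY, XZ, YZ planes
            coords = pts[..., [a, b]].unsqueeze(2)              # (B, N, 1, 2)
            sampled = F.grid_sample(planes[:, i], coords, align_corners=True)
            feats = feats + sampled.squeeze(-1).permute(0, 2, 1)  # (B, N, C)
        out = self.mlp(feats)
        sigma, rgb = out[..., :1], torch.sigmoid(out[..., 1:])
        return sigma, rgb
```

In a full system these pieces would be trained jointly with a volume-rendering reconstruction loss against the input video frames, which is what lets the latent space pick up the high-frequency detail mentioned above.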
Cross-identity driving.
Since the reconstructed head avatar is controlled by the learned expression latent code, driving it with another person first requires mapping that person's face image to the avatar's corresponding expression latent code.
To this end, LatentAvatar introduces a Y-shaped network architecture consisting of a shared encoder and two per-identity decoders. The shared encoder takes face images of both the avatar subject and the new driver as input and learns a shared expression latent code, while a mapping multilayer perceptron bridges the shared latent space and the avatar's own expression latent space, as sketched below.
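Again as a hedged sketch rather than the official implementation, the Y-shaped module might look roughly like this in PyTorch; all class names, layer sizes, and dimensions are assumptions.

```python
# Minimal sketch of a Y-shaped cross-identity module (illustrative only).
import torch
import torch.nn as nn

def conv_encoder(latent_dim=128):
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
        nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, latent_dim),
    )

def conv_decoder(latent_dim=128, res=64):
    return nn.Sequential(
        nn.Linear(latent_dim, 64 * (res // 4) ** 2),
        nn.Unflatten(1, (64, res // 4, res // 4)),
        nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
        nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid(),
    )

class CrossIdentityMapper(nn.Module):
    def __init__(self, shared_dim=128, avatar_dim=256):
        super().__init__()
        self.shared_encoder = conv_encoder(shared_dim)   # shared by both identities
        self.decoder_avatar = conv_decoder(shared_dim)   # reconstructs the avatar subject
        self.decoder_driver = conv_decoder(shared_dim)   # reconstructs the new driver
        # MLP bridge from the shared latent space to the avatar's expression latent space
        self.mapping = nn.Sequential(
            nn.Linear(shared_dim, 256), nn.ReLU(),
            nn.Linear(256, avatar_dim),
        )

    def forward(self, driver_img):
        """Driving path: map a driver face image to the avatar's expression latent."""
        z_shared = self.shared_encoder(driver_img)
        return self.mapping(z_shared)

    def reconstruct(self, img, is_avatar: bool):
        """Training path: reconstruct the input image through its own decoder."""
        z_shared = self.shared_encoder(img)
        decoder = self.decoder_avatar if is_avatar else self.decoder_driver
        return decoder(z_shared)
```

At driving time, only the shared encoder and the mapping MLP would be needed; the resulting avatar latent code is then fed to the avatar's tri-plane neural radiance field for rendering.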
In this way, LatentAvatar lets another person drive the reconstructed digital avatar, which is what we see in the results at the beginning of the article. It not only achieves high-fidelity rendering but also transfers the driver's expressions accurately, including exaggerated expressions, subtle expressions, and emotions.
Experimental results. The authors also compared LatentAvatar with previous monocular head avatar reconstruction algorithms, including NerFACE, IMAvatar, Deep Video Portraits (DVP), and a Coeff+Tri-Plane baseline designed to rule out the possibility that the improvement comes from the tri-plane representation alone. In the qualitative results, the avatars synthesized by LatentAvatar look the most realistic, model expression consistency and exaggerated expressions best, and are also more robust. The quantitative results likewise show that LatentAvatar achieves the best scores in the numerical evaluation.
Recent monocular-video reconstruction methods all use 3DMM face templates as the driving signal for avatar expression control. Tracking and fitting the 3DMM face template usually introduces errors, resulting in inaccurate expressions, so when the head avatar is animated in post-production, blurred or inconsistent expressions are hard to avoid. LatentAvatar also reconstructs the head avatar from monocular video, but instead of a 3DMM it learns the implicit expression latent code directly from the training data. As a result, the synthesized avatar remains lifelike under a wide range of exaggerated expressions, greatly alleviating the stiff, unrealistic, lifeless expressions that currently plague driven digital humans.
Headquartered in Hangzhou, Xinchangyuan Technology Co., Ltd. works in a three-city collaboration with Tsinghua University's industry-university-research base in Beijing and the virtual digital human center of Tsinghua's research institute in Shenzhen, focusing on digital-human technology research and talent training. Going forward, it plans to cover more scenarios and needs, work with industry partners to bring "AIGC + digital human" applications to multiple scenarios, gradually popularize consumer-grade digital humans, and provide strong, comprehensive technical support for all walks of life.
References. Yuelang Xu, Hongwen Zhang, Lizhen Wang, Xiaochen Zhao, Han Huang, Guojun Qi, and Yebin Liu. LatentAvatar: Learning Latent Expression Code for Expressive Neural Head Avatar. In ACM SIGGRAPH 2023 Conference Proceedings, 2023.
Project homepage: liuyebin.com/havatar