Meta released the Audio2Photoreal technology framework, which can synthesize realistic virtual characters

Mondo Technology Updated on 2024-01-31

Meta's new artificial intelligence research, the Audio2Photoreal framework, can generate realistic faces, bodies, and gestures from conversational speech. The researchers developed Audio2Photoreal to create photorealistic avatars that produce natural gestures and expressions based on what people say and how they say it.

The main contribution of the Audio2Photoreal research is the combination of vector quantization and diffusion models to generate dynamic, more expressive motion. Vector quantization is a technique that compresses a large space of possible data into a small set of representative samples; in Audio2Photoreal, it is used to select representative poses from the vast space of possible gestures.
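To make the idea concrete, here is a minimal vector-quantization sketch in Python: a continuous pose vector is snapped to the nearest entry in a codebook of representative samples. The codebook size, pose dimension, and random codebook values below are illustrative assumptions for the sketch, not details taken from Meta's paper.

```python
import numpy as np

# Hypothetical codebook of K representative gesture poses, each a
# D-dimensional pose vector. K, D, and the random entries are
# illustrative only; a real codebook would be learned from data.
K, D = 512, 104
rng = np.random.default_rng(0)
codebook = rng.normal(size=(K, D))

def quantize(pose: np.ndarray) -> tuple[int, np.ndarray]:
    """Map a continuous pose vector to its nearest codebook entry.

    Returns the discrete code index and the quantized pose. This is
    the core of vector quantization: a large space of possible poses
    is compressed to K representative samples.
    """
    dists = np.linalg.norm(codebook - pose, axis=1)  # distance to each code
    idx = int(np.argmin(dists))                      # nearest representative
    return idx, codebook[idx]

# Example: quantize one frame of a (synthetic) gesture sequence.
frame = rng.normal(size=D)
code, quantized = quantize(frame)
print(f"frame mapped to code {code}; quantization error "
      f"{np.linalg.norm(frame - quantized):.3f}")
```

The coarse, discrete output of this stage is what the diffusion model described next then refines.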

The role of the diffusion model is to add high-frequency detail and improve the quality of the gestures. Diffusion models are widely used to generate and refine images and video, especially where fine detail or visual realism matters. Applied to virtual character gesture generation, diffusion makes the motion more natural and smooth, bringing it closer to real human movement.
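As a rough sketch of how a diffusion model can refine a coarse pose, the Python snippet below runs a standard DDPM-style reverse process: starting from noise and iteratively denoising, conditioned on the coarse pose from the quantization stage. The noise schedule, step count, and the stand-in denoiser are all assumptions made so the example runs end to end; they are not Meta's actual model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative linear noise schedule (T and betas are assumptions).
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x_t: np.ndarray, t: int, coarse: np.ndarray) -> np.ndarray:
    """Stand-in for a learned denoiser conditioned on the coarse pose.

    A real model would be a neural network taking (x_t, t, conditioning);
    here we invert the forward process analytically, treating the coarse
    pose as the clean signal, so the loop is runnable.
    """
    return (x_t - np.sqrt(alpha_bars[t]) * coarse) / np.sqrt(1.0 - alpha_bars[t])

def refine(coarse_pose: np.ndarray) -> np.ndarray:
    """DDPM reverse process: denoise from pure noise toward a detailed pose."""
    x = rng.normal(size=coarse_pose.shape)  # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = predict_noise(x, t, coarse_pose)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.normal(size=x.shape)  # noise except at t=0
    return x

coarse = np.zeros(104)   # coarse gesture frame from the quantization stage
detailed = refine(coarse)
print("refined pose shape:", detailed.shape)
```

In a trained system, the learned denoiser is what injects the high-frequency detail (subtle finger and wrist motion, for example) that the coarse quantized poses lack.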

In this study, the researchers also built a multi-view dataset of two-person conversations, with dialogue scenes captured from multiple camera angles, which allows Audio2Photoreal to reconstruct avatars more faithfully.

In addition, compared with traditional mesh-based models, Audio2Photoreal generates highly realistic virtual characters and accurately captures the fine details of conversational gestures, such as finger pointing, wrist rotation, or a shrug, making the motion more natural and lifelike. The research team has released the code and dataset publicly to support further work in this area.
