How to Use AI to See My Father Again (1)
1. App-side implementation.
I looked into both native and cross-platform solutions and, mainly to keep costs under control, chose a cross-platform Flutter implementation.
Homepage development
I use Muke as the collaborative UI annotation platform. After analyzing the mock-up, the following layout approach was adopted:
The Scaffold body is filled with a full-width background image; the avatars use a Stack layout and are placed at irregular positions, with up to 30 preset slots shown at random; the bottom bar uses absolute positioning over an arc-shaped background image, with buttons added as ElevatedButton. The details are as follows (the source code will be open sourced once the series of articles is finished; leave a comment if you want it in advance).
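A minimal sketch of that layout, assuming hypothetical asset names and a random subset of the 30 preset avatar slots (the real open-sourced code may differ):

```dart
import 'dart:math';

import 'package:flutter/material.dart';

/// Hypothetical home page sketch: full-bleed background image, avatars
/// stacked at irregular preset positions, and a bottom bar positioned
/// over an arc-shaped background image.
class HomePage extends StatelessWidget {
  HomePage({super.key});

  // Up to 30 preset (left, top) slots; only a random subset is displayed.
  final List<Offset> _presetSlots = List.generate(
      30, (i) => Offset(20.0 + (i % 6) * 55, 80.0 + (i ~/ 6) * 90));
  final Random _random = Random();

  @override
  Widget build(BuildContext context) {
    final visibleSlots = (_presetSlots.toList()..shuffle(_random)).take(8);
    return Scaffold(
      body: Stack(
        fit: StackFit.expand,
        children: [
          // Background image covering the whole body (asset name is assumed).
          Image.asset('assets/home_bg.png', fit: BoxFit.cover),
          // Avatars placed at irregular preset positions.
          for (final slot in visibleSlots)
            Positioned(
              left: slot.dx,
              top: slot.dy,
              child: const CircleAvatar(
                radius: 28,
                backgroundImage: AssetImage('assets/avatar_placeholder.png'),
              ),
            ),
          // Bottom bar pinned to the bottom, drawn over an arc background image.
          Positioned(
            left: 0,
            right: 0,
            bottom: 0,
            child: Stack(
              alignment: Alignment.center,
              children: [
                Image.asset('assets/bottom_arc.png',
                    width: double.infinity, height: 96, fit: BoxFit.fill),
                ElevatedButton(
                  onPressed: () {
                    // Navigate to the agent page (route name is assumed).
                    Navigator.of(context).pushNamed('/collect_agent');
                  },
                  child: const Text('Start'),
                ),
              ],
            ),
          ),
        ],
      ),
    );
  }
}
```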
Agent collection page
As before, analyze the layout first: it is a half-screen pop-up, with a text-input area at the bottom; overall it is a chat/message page.
The half pop-up uses showModalBottomSheet to slide up from the bottom with a degree of transparency; the dialogue area is built with a Column plus a scrolling container; the voice module uses the platform's native TTS component. The key details are as follows:
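A minimal sketch of that page, assuming the flutter_tts plugin as the bridge to the platform's native TTS (the actual implementation will be in the open-sourced code):

```dart
import 'package:flutter/material.dart';
import 'package:flutter_tts/flutter_tts.dart'; // wraps the platform's native TTS

/// Hypothetical half-screen chat sheet: a semi-transparent bottom sheet,
/// a scrolling message list, and a text field at the bottom.
Future<void> showAgentSheet(BuildContext context) {
  return showModalBottomSheet(
    context: context,
    isScrollControlled: true,
    backgroundColor: Colors.black.withOpacity(0.6), // partial transparency
    builder: (context) => const FractionallySizedBox(
      heightFactor: 0.5, // half pop-up page
      child: AgentChatSheet(),
    ),
  );
}

class AgentChatSheet extends StatefulWidget {
  const AgentChatSheet({super.key});

  @override
  State<AgentChatSheet> createState() => _AgentChatSheetState();
}

class _AgentChatSheetState extends State<AgentChatSheet> {
  final List<String> _messages = [];
  final TextEditingController _input = TextEditingController();
  final FlutterTts _tts = FlutterTts();

  Future<void> _send() async {
    final text = _input.text.trim();
    if (text.isEmpty) return;
    setState(() => _messages.add(text));
    _input.clear();
    await _tts.speak(text); // read the message aloud via native TTS
  }

  @override
  Widget build(BuildContext context) {
    return Column(
      children: [
        // Dialogue area: Column + scrolling container.
        Expanded(
          child: ListView.builder(
            itemCount: _messages.length,
            itemBuilder: (_, i) => ListTile(title: Text(_messages[i])),
          ),
        ),
        // Bottom text input.
        Padding(
          padding: const EdgeInsets.all(8),
          child: Row(
            children: [
              Expanded(
                child: TextField(
                  controller: _input,
                  decoration:
                      const InputDecoration(hintText: 'Say something...'),
                ),
              ),
              IconButton(onPressed: _send, icon: const Icon(Icons.send)),
            ],
          ),
        ),
      ],
    );
  }
}
```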
Real-time communication page
This is where the pitfalls began. I originally used D-ID, whose demo is a pure front-end solution. At the time I assumed it was all standard WebRTC, and since Flutter itself has standard components that support WebRTC, I didn't think much about device compatibility. After implementing it, I found that different Android and iOS versions all have compatibility issues with native WebRTC support to varying degrees. In the end I decided to move the conversation to an H5 page and use a native-plus-H5 approach to solve the compatibility problem. The key code is as follows:
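A minimal sketch of the native-plus-H5 approach, assuming the webview_flutter plugin and a placeholder URL for the H5 page that hosts the D-ID streaming demo (camera/microphone permission handling is omitted here):

```dart
import 'package:flutter/material.dart';
import 'package:webview_flutter/webview_flutter.dart'; // assumed plugin for the H5 embed

/// Hypothetical real-time conversation page: instead of talking WebRTC
/// directly from Flutter plugins, it embeds an H5 page that hosts the
/// D-ID streaming demo, sidestepping per-OS WebRTC differences.
class RealTimeChatPage extends StatefulWidget {
  const RealTimeChatPage({super.key, required this.h5Url});

  final String h5Url; // e.g. 'https://example.com/did_chat.html' (placeholder)

  @override
  State<RealTimeChatPage> createState() => _RealTimeChatPageState();
}

class _RealTimeChatPageState extends State<RealTimeChatPage> {
  late final WebViewController _controller;

  @override
  void initState() {
    super.initState();
    _controller = WebViewController()
      // The D-ID demo is JavaScript-heavy, so scripts must be enabled.
      ..setJavaScriptMode(JavaScriptMode.unrestricted)
      ..loadRequest(Uri.parse(widget.h5Url));
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('Real-time chat')),
      body: WebViewWidget(controller: _controller),
    );
  }
}
```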
The other miscellaneous pages aren't difficult, so I won't go into them. Next are the implementation details of the server side.
2. Server-side implementation.
I'm more familiar with Java, Go, and Rust, but because the server needs to integrate with different models and third-party SDKs, I chose Python: its development cost is low, and it has clear advantages at small scale.
For the voice/TTS module I use ElevenLabs; compared with iFLYTEK, Microsoft TTS, and Volcano Engine voice packages, its voice cloning works better. For the dialogue module I use GPT-4; compared with GLM-3, Wenxin Yiyan, and Tongyi Qianwen on multi-turn dialogue and role-playing, its overall results are better. The overall flow: first, speech is converted to text on the app side and sent to the backend over WebSocket; a GPT-4 assistant generates the reply; the reply text is converted to speech with ElevenLabs; finally D-ID is called to use that voice to drive the lip movement of the photo (I have a hunch the finished pipeline will be very slow). The key code logic is as follows (make a note of it; it will be open sourced later):
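The server side is Python and will appear when the code is open sourced; here is only a minimal sketch of the app-side half of that flow (on-device speech recognition, sending the text over WebSocket, receiving the URL of the generated clip), assuming the speech_to_text and web_socket_channel plugins and a made-up message format:

```dart
import 'dart:convert';

import 'package:speech_to_text/speech_to_text.dart' as stt; // on-device speech recognition
import 'package:web_socket_channel/web_socket_channel.dart';

/// Hypothetical app-side half of the pipeline:
/// speech -> text on the device, text -> backend over WebSocket,
/// backend replies with a URL for the D-ID talking-head clip.
class ConversationClient {
  ConversationClient(String wsUrl)
      : _channel = WebSocketChannel.connect(Uri.parse(wsUrl));

  final WebSocketChannel _channel;
  final stt.SpeechToText _speech = stt.SpeechToText();

  /// Listens once and sends the recognized sentence to the backend.
  Future<void> captureAndSend() async {
    final available = await _speech.initialize();
    if (!available) return;
    _speech.listen(onResult: (result) {
      if (result.finalResult) {
        // Message format is an assumption; the real protocol is not published yet.
        _channel.sink.add(
            jsonEncode({'type': 'user_text', 'text': result.recognizedWords}));
      }
    });
  }

  /// The backend pushes back the URL of the generated talking video/audio.
  Stream<String> replies() => _channel.stream.map((raw) {
        final data = jsonDecode(raw as String) as Map<String, dynamic>;
        return data['media_url'] as String; // play this in the H5/WebView page
      });

  void dispose() {
    _speech.stop();
    _channel.sink.close();
  }
}
```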
Note that GPT-4 requires a way around the network restrictions ("scientific internet access"), so we need a server that can reach OpenAI's API (I won't go into the Linux version of Clash here; ask in the comments if you need it).
At this point the demo version was complete, and next came our internal roast session.
Having finished the app side, H5 side, and server side with a great sense of achievement, I started internal testing with my friends:
Fake friend A: Bro, why does it take a whole minute to respond after I finish speaking?
Fake friend B: +1
Fake friend C: +1
Me: Well, I have to convert the speech to text first, send the text to GPT and wait for its reply, then call the API to generate the voice, and finally use that voice to drive the lip movement.
Fake friend D: Ah, I see. But still, why is it so slow?
Fake friend E: +1
Fake friend F: +1
Fake friend N: Ugh, this thing is trash, there are way too many bugs. You can't pull this off, bro. Sure enough, product people shouldn't get involved in development.
Me: ... And so began fixing all kinds of bugs and doing all kinds of optimization.
Eventually I shortened each response to about 30 seconds. Sweat! That is still very slow. So now there are two paths: one is to keep optimizing, either by training a model myself or by using the hyper-realistic MetaHuman model; the other is to change the approach from the product perspective.
1. Research on different voice-driven lip-sync technologies.
SadTalker:
A model jointly released by Xi'an Jiaotong University, Tencent AI Lab, and Ant Group that makes a still portrait speak. After trying it for free via the WebUI on Colab, I found it is still fairly slow, and if the source image quality is not high the result gets even worse.
Wav2Lip:
After deploying it on Colab, support for video files is good, and GFPGAN can also repair awkward mouth shapes, but support for still images is only so-so and you need to adapt it yourself. The project is fairly down-to-earth: it targets a 3080, you have to upgrade things yourself for a 4080, and the parallelism logic also needs strengthening. (I saw someone on Bilibili who modified it and the result was decent, but it still can't be fully real-time: for a 500×500, roughly one-minute video, the delay is about 20-30 seconds.)
VideoReTalking:
Personally it feels more like an upgraded version of SadTalker; it handles an image in a fixed position better, but making a video speak requires adapting it yourself, and the resolution requirements are higher. In the end it is still a latency problem: for a one-minute video, the best result on a 4080 is 13 seconds.
In the end I found that anything tied to a real person's likeness just doesn't look very good, so I changed the research direction and evaluated whether game-style modeling could meet my requirements.
MetaHuman: a hyper-realistic digital human released by Unreal Engine; the entire body and the surrounding space can be driven. I can't repeat the D-ID mistake here. An iPhone 12 (or newer) plus a desktop computer can be turned into a complete facial-capture and animation pipeline. But my father did not leave behind many photos or voice recordings, and extracting facial expressions and physical characteristics from photos and voice is still quite troublesome; anyone else would have to start from scratch as well. It is friendlier to people who are still alive, so it can serve as an alternative: for example, turn a person's photos into a 3D model, refine the details, and import it into MetaHuman.
NVIDIA Omniverse Audio2Face:
The official site's introduction: generative AI can be used to instantly create facial-expression animation from an audio source. Isn't that exactly what I had in mind? Then I looked at how hard it is to get started and felt a bit hesitant, and once I realized the local deployment is the enterprise edition, well, OK, I was wrong.
These are roughly the mainstream server-side solutions you can find by searching and by asking for help on forums at home and abroad. This took about a week; time to change the approach.
After the completion of the article series, the relevant prototypes, designs, source code, databases, etc. will be open sourced.
To be continued.