Become a top real-time translation master: Meta's open-source universal language translation model S

Mondo Technology Updated on 2024-01-27

Meta researchers announced Thursday that they have developed a new suite of artificial intelligence models called "Seamless Communication" that aims to enable more natural and authentic cross-language communication, essentially making the concept of a universal speech translator a reality. At the same time, Meta AI published the related research paper and data.

The main model, known as Seamless, is built from three sub-models: SeamlessExpressive, SeamlessStreaming, and SeamlessM4T v2. Seamless combines all of their functions into one unified system. According to the research paper, Seamless is "the first open system that unlocks expressive cross-language communication in real time".

The Seamless translator represents a new frontier in cross-language communication using artificial intelligence. It combines three sophisticated neural network models to enable real-time translation between more than 100 spoken and written languages, while preserving the vocal style, emotion, and prosody of the speaker's voice.

SeamlessExpressive focuses on preserving the vocal style and emotional nuances of the speaker's voice when translating between languages. As stated in the paper, "Translation should capture the nuances of human expression. While existing translation tools are proficient at capturing the content of a conversation, they often rely on monotonous robotic text-to-speech systems for output."

To preserve the speaker's vocal style across languages, the researchers incorporated an expressivity encoder into the SeamlessM4T v2 base model. This ensures that unit generation is guided by the expected pace and rhythm of speech. In addition, replacing the HiFi-GAN unit vocoder in SeamlessM4T v2 with an expressive unit-to-speech generator conditioned on the source voice allows tone, emotional expression, and vocal style to carry over into the translated speech.

SeamlessStreaming enables near-real-time translation with a latency of only about two seconds. According to the researchers, this is "the first large-scale multilingual model" that can provide such fast translation speeds in nearly 100 spoken and written languages. SeamlessStreaming is able to intelligently decide when there is enough context to output the next target text or speech fragment. It achieves this through a learned read-write policy that determines whether the model should "write" and produce output based on the partial audio input received so far, or "read" and continue to wait for more input. The model automatically adapts to different language structures, resulting in stronger performance on many different language pairs.
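The read/write decision described above can be illustrated with a minimal sketch. This is not Meta's learned policy: it swaps in a simple fixed-lag rule (write once the input is a set number of words ahead of the output), and the names `toy_translate`, `should_write`, `simultaneous_translate`, and the `lag` parameter are all hypothetical, chosen only for illustration.

```python
from typing import Iterable, List, Tuple


def toy_translate(source_word: str) -> str:
    """Stand-in for a real decoder step (assumption): upper-cases the word."""
    return source_word.upper()


def should_write(num_read: int, num_written: int, lag: int, source_done: bool) -> bool:
    """Fixed-lag rule standing in for a learned policy.

    Write when the input is `lag` words ahead of the output,
    or when the input has ended and words remain untranslated.
    """
    if num_written >= num_read:
        return False  # nothing left to translate yet
    return source_done or (num_read - num_written) >= lag


def simultaneous_translate(source: Iterable[str], lag: int = 2) -> List[Tuple[str, str]]:
    """Interleave READ and WRITE actions over a stream of source words."""
    read_buffer: List[str] = []
    written: List[str] = []
    actions: List[Tuple[str, str]] = []

    def flush(source_done: bool) -> None:
        # Keep writing for as long as the rule allows it.
        while should_write(len(read_buffer), len(written), lag, source_done):
            out = toy_translate(read_buffer[len(written)])
            written.append(out)
            actions.append(("WRITE", out))

    for word in source:
        read_buffer.append(word)        # READ: consume one more input word
        actions.append(("READ", word))
        flush(source_done=False)
    flush(source_done=True)             # input ended: emit whatever remains
    return actions
```

With `lag=2`, the input `["a", "b", "c"]` yields READ a, READ b, WRITE A, READ c, WRITE B, WRITE C. A learned policy would replace the fixed threshold in `should_write` with a model that estimates, from the input seen so far, whether enough context is available.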

The third model, SeamlessM4T v2, is the basis for the other two models. It is an upgraded version of the original SeamlessM4T model released last year. The new architecture "improves consistency between text and speech output," the paper said.

The upgraded SeamlessM4T v2 has a non-autoregressive text-to-unit decoder. Its w2v-BERT 2.0 encoder was trained on 4.5 million hours of speech data, compared to 1 million hours for the previous version. In addition, SeamlessM4T v2 is supplemented with more data from SeamlessAlign for low-resource languages.

SeamlessM4T v2 was comprehensively evaluated across all tasks and languages using automatic metrics (BLEU, ASR-BLEU, BLASER 2.0, etc.), and its performance is significantly better than that of previous state-of-the-art models. It was also tested for robustness, bias, and hallucinated toxicity.

"All in all, Seamless has given us a critical understanding of the technical fundamentals needed to transform the Universal Speech Translator from a science-fiction concept to a real-world technology," the researchers wrote.

The capabilities of these models enable new voice-based communication experiences, from real-time multilingual conversations using smart glasses to automatically dubbed videos and podcasts. Researchers say it can also help break down language barriers for immigrants and others who have difficulty communicating.

"By publicly publishing our work, we hope that researchers and developers will be able to amplify the impact of our contributions by building technologies designed to bridge multilingual connections in an increasingly interconnected and interdependent world," the paper states.

However, the researchers acknowledge that the technology could also be misused for voice phishing scams, deepfakes, and other harmful applications. To promote safe and responsible use of the models, they have implemented several measures, including audio watermarking and new techniques to reduce hallucinated toxic output.

These seamless communication models are publicly available on Hugging Face and GitHub.

The collection includes Seamless, SeamlessExpressive, SeamlessStreaming, and SeamlessM4T v2 models with accompanying metadata.
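As a rough sketch of how the published SeamlessM4T v2 checkpoint might be used for text-to-text translation through the Hugging Face `transformers` library: the article does not name a checkpoint or an API, so the model ID `facebook/seamless-m4t-v2-large`, the language codes, and the helper name `translate_text` are assumptions here, and the first call downloads a large model.

```python
# Assumed checkpoint name on Hugging Face; not stated in the article.
MODEL_ID = "facebook/seamless-m4t-v2-large"


def translate_text(text: str, src_lang: str = "eng", tgt_lang: str = "fra") -> str:
    """Translate text with SeamlessM4T v2 (hypothetical helper, for illustration)."""
    # Imported lazily so this file can be loaded without the heavy dependencies
    # (requires `transformers`, `torch`, and `sentencepiece` to actually run).
    from transformers import AutoProcessor, SeamlessM4Tv2Model

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = SeamlessM4Tv2Model.from_pretrained(MODEL_ID)

    # Tokenize the source text and tag it with its language code.
    inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")

    # generate_speech=False requests text tokens instead of a waveform.
    output_tokens = model.generate(**inputs, tgt_lang=tgt_lang, generate_speech=False)
    return processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```

Calling `translate_text("Hello, how are you?", "eng", "fra")` would be expected to return a French sentence. Leaving `generate_speech` at its default should produce translated audio rather than text.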

By making these state-of-the-art natural language processing models available for free, Meta hopes to enable researchers and developers to build on and extend this work to help connect people across languages and cultures.

Amid the ongoing generative AI revolution, Meta has been working to open-source its own large-model research, including its flagship large models Llama and Llama 2. This open-source release reaffirms Meta's approach to AI development and provides a valuable new resource for the research community.

"Overall, the multidimensional experience that Seamless may result in could lead to a dramatic change in the way machine-assisted cross-language communication is implemented," the researchers concluded.
