Meta Releases SeamlessM4T, a Multimodal Model That Seamlessly Translates and Transcribes Speech and Text

Mondo Technology Updated on 2024-02-01

Machine translation has long been a focus of AI research; after all, the world has thousands of languages and a boundless vocabulary, and how to bridge human writing and speech has occupied researchers for decades. Early machine translation worked much like looking up a table: many years ago, systems simply mapped words in one language directly onto words in another, and the resulting output was recognizable as machine translation at a glance.

With Google's release of the Transformer model, the power of machine translation increased dramatically, and on some benchmarks machine translation is now claimed to match or even exceed human-level quality.

When the Transformer model was first released, it was built for the comparatively narrow application of machine translation. As the attention mechanism grew in popularity, Transformers were extended to tasks such as object detection and object classification, and then to multimodal tasks. With the release of ChatGPT, the Transformer has been pushed to the forefront.

Although machine translation is a niche task, few models cover so many languages across both speech and text for multimodal translation. SeamlessM4T (Massively Multilingual and Multimodal Machine Translation) is a foundational multilingual, multitask model that seamlessly translates and transcribes speech and text: a single model that supports speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation, plus automatic speech recognition, in up to 100 languages.

SeamlessM4T supports:

- Automatic speech recognition for nearly 100 languages
- Speech-to-text translation for nearly 100 input and output languages
- Speech-to-speech translation for nearly 100 input languages and 35 (+ English) output languages
- Text-to-text translation for nearly 100 languages
- Text-to-speech translation for nearly 100 input languages and 35 (+ English) output languages

The SeamlessM4T model uses the multitask UnitY architecture, which can directly generate both translated text and translated speech. This architecture also covers automatic speech recognition and text-to-text, text-to-speech, speech-to-text, and speech-to-speech translation, all within the single UnitY model.

The multitask UnitY model is made up of three main components.

Text and speech encoders recognize speech input in nearly 100 languages. The text decoder converts that meaning into text in nearly 100 languages. A text-to-unit model, followed by a vocoder, then decodes the result into speech in 36 languages.
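To make the data flow through these three stages concrete, here is a minimal, hypothetical sketch of how they chain together; every class and attribute name below is an illustrative stand-in, not the real seamless_communication API.

# Hypothetical sketch of the three-stage UnitY data flow; all names here are
# illustrative stand-ins, not the actual seamless_communication API.
import torch
from torch import nn

class UnitYSketch(nn.Module):
    def __init__(self, speech_encoder, text_encoder, text_decoder, text_to_unit, vocoder):
        super().__init__()
        self.speech_encoder = speech_encoder  # stage 1: e.g. w2v-BERT 2.0, for audio input
        self.text_encoder = text_encoder      # stage 1: e.g. NLLB-based, for text input
        self.text_decoder = text_decoder      # stage 2: emits translated text tokens
        self.text_to_unit = text_to_unit      # stage 3: text tokens -> discrete acoustic units
        self.vocoder = vocoder                # stage 3: acoustic units -> output waveform

    def translate_speech(self, audio: torch.Tensor):
        hidden = self.speech_encoder(audio)   # speech -> hidden states
        text = self.text_decoder(hidden)      # hidden states -> translated text tokens
        units = self.text_to_unit(text)       # translated text -> acoustic units
        return text, self.vocoder(units)      # return text plus synthesized speech

    def translate_text(self, tokens: torch.Tensor):
        hidden = self.text_encoder(tokens)    # same stages 2 and 3, but from text input
        text = self.text_decoder(hidden)
        units = self.text_to_unit(text)
        return text, self.vocoder(units)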

SeamlessM4T's self-supervised speech encoder, w2v-BERT 2.0, is an improved version of w2v-BERT that learns the structure and meaning of speech by analyzing millions of hours of multilingual audio.
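As a rough illustration of what such a speech encoder produces, the sketch below extracts frame-level representations with the w2v-BERT 2.0 checkpoint published on Hugging Face; it assumes a recent transformers release that ships Wav2Vec2BertModel and the facebook/w2v-bert-2.0 checkpoint, and "input.wav" is a hypothetical file name.

# Sketch: pulling frame-level speech features out of w2v-BERT 2.0 on its own.
# Assumes a recent transformers release with Wav2Vec2BertModel and the
# facebook/w2v-bert-2.0 checkpoint; "input.wav" is a hypothetical 16 kHz mono file.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
model = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")

waveform, sample_rate = torchaudio.load("input.wav")  # expected to already be 16 kHz
inputs = feature_extractor(waveform.squeeze().numpy(),
                           sampling_rate=sample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, frames, hidden_size) representations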

SeamlessM4T's text encoder is based on the NLLB model and is trained to understand text in nearly 100 languages and generate translated text accordingly.

Data-driven models like SeamlessM4T benefit from large volumes of high-quality end-to-end data, i.e., speech-to-text and speech-to-speech pairs. Relying solely on human transcription and translation cannot meet the challenge of translating speech into 100 languages. Meta AI therefore built SONAR, a new large-scale multilingual and multimodal text embedding space covering 200 languages, which performs significantly better than existing methods such as LASER3 or LaBSE on multilingual similarity search.
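To illustrate the kind of multilingual similarity search such an embedding space enables, here is a minimal sketch of scoring candidate translation pairs by cosine similarity; the sonar package, pipeline class, and model card names are assumptions to verify against the SONAR repository.

# Sketch: mining translation pairs by embedding-space similarity, in the spirit
# of SONAR. The package, class, and model card names below are assumptions;
# check the SONAR repository before relying on them.
import torch
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

encoder = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder",
                                       tokenizer="text_sonar_basic_encoder")

english = ["The weather is nice today."]
candidates = ["Il fait beau aujourd'hui.", "J'aime le café."]

emb_en = encoder.predict(english, source_lang="eng_Latn")
emb_fr = encoder.predict(candidates, source_lang="fra_Latn")

# Cosine similarity in the shared embedding space: the true translation
# should score highest against the English sentence.
scores = torch.nn.functional.cosine_similarity(emb_en, emb_fr)
print(scores)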

SeamlessM4T achieves state-of-the-art results in nearly 100 languages and handles automatic speech recognition, speech-to-text, speech-to-speech, text-to-text, and text-to-speech translation in a single multitask model. The official paper runs to more than 100 pages with dozens of authors, which gives some idea of the model's complexity. Even so, Meta AI open-sourced SeamlessM4T, and its usage can be checked directly on GitHub.

Officially, two model files were released for SeamlessM4T: a medium model with 1.2 billion parameters and a large model with 2.3 billion parameters. Each model covers the following tasks, which we can run directly with the official code from GitHub.

speech-to-speech translation (s2st)
speech-to-text translation (s2tt)
text-to-speech translation (t2st)
text-to-text translation (t2tt)
automatic speech recognition (asr)

Install the package first:

pip install .

Then each task can be run from the command line with m4t_predict; the angle-bracketed arguments are placeholders to fill in:

s2st task: m4t_predict <path_to_input_audio> s2st <tgt_lang> --output_path <path_to_save_audio>
t2tt task: m4t_predict <input_text> t2tt <tgt_lang> --src_lang <src_lang>
s2tt task: m4t_predict <path_to_input_audio> s2tt <tgt_lang>
t2st task: m4t_predict <input_text> t2st <tgt_lang> --src_lang <src_lang> --output_path <path_to_save_audio>
asr task: m4t_predict <path_to_input_audio> asr <tgt_lang>
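For example, translating an English recording into French speech might look like the following; the file names are hypothetical, and fra is the target-language code for French:

m4t_predict input_en.wav s2st fra --output_path output_fr.wav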

The SeamlessM4T model requires 16 kHz audio files, and the official code also shows how to resample the input audio.

import torchaudio

resample_rate = 16000  # SeamlessM4T expects 16 kHz audio
waveform, sample_rate = torchaudio.load(<path_to_input_audio>)
resampler = torchaudio.transforms.Resample(sample_rate, resample_rate, dtype=waveform.dtype)
resampled_waveform = resampler(waveform)
torchaudio.save(<path_to_resampled_audio>, resampled_waveform, resample_rate)

Once the audio is resampled, we can use the following code to run the SeamlessM4T model.

import torch
import torchaudio
from seamless_communication.models.inference import Translator

# Initialize the translator with the large model and its 36-language vocoder
translator = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)

# S2ST: speech-to-speech translation
translated_text, wav, sr = translator.predict(<path_to_input_audio>, "s2st", <tgt_lang>)

# T2ST: text-to-speech translation
translated_text, wav, sr = translator.predict(<input_text>, "t2st", <tgt_lang>, src_lang=<src_lang>)

# Save the synthesized speech
torchaudio.save(<path_to_save_audio>, wav[0].cpu(), sample_rate=sr)

# S2TT: speech-to-text translation
translated_text, _, _ = translator.predict(<path_to_input_audio>, "s2tt", <tgt_lang>)

# ASR: automatic speech recognition
transcribed_text, _, _ = translator.predict(<path_to_input_audio>, "asr", <tgt_lang>)

# T2TT: text-to-text translation
translated_text, _, _ = translator.predict(<input_text>, "t2tt", <tgt_lang>, src_lang=<src_lang>)
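Filling in concrete values, a speech-to-speech call might look like the following; the file names are hypothetical, and fra (French) stands in for any supported target-language code.

# Illustrative call: translate English speech into French speech.
translated_text, wavs, sr = translator.predict("input_en_16khz.wav", "s2st", "fra")
print(translated_text)  # the French translation as text
torchaudio.save("output_fr.wav", wavs[0].cpu(), sample_rate=sr)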

Of course, an official demo has also been released, so we can try SeamlessM4T directly on the official site.

Likewise, the SeamlessM4T model is already available on Hugging Face, and we can try it through the Hugging Face demo link as well. Personally, I find the Hugging Face interface clear and easy to understand, which makes it easier to get started.

We can pick whichever task we need from the interface prompts, whether speech recognition, speech-to-text, speech-to-speech, text-to-speech, or text-to-text machine translation, and try it out.

Hugging Face experience link
Official experience link
Open source link
