A reporter from the Science and Technology Innovation Board Daily noted that, in terms of final form, multimodal large models differ from single-modality models in that they can process text, audio, and other types of information simultaneously, which is closer to the way humans receive, process, and express information and makes them better suited to serve as intelligent assistants.
At present, Google has launched the multimodal large model Gemini 1.5 Pro; Meta has successively open-sourced multimodal large models such as ImageBind and AnyMAL; and OpenAI has been dropping frequent hints that GPT-5 will pursue breakthroughs in voice input and output, image output, and eventually video input, and may achieve true multimodality.
Shi Xiaojun, an analyst on the computer team of the Huafu Securities Research Institute, said on February 18 that multimodality represents a new round of revolution in large AI models. Multimodality improves the generalization ability of large models, allowing them to achieve "multiple specialties and multiple capabilities" in a multi-information environment, and has broad application scenarios and market value in vertical fields, he said.
The Science and Technology Innovation Board Daily reporter also noted that, compared with text generation, video-generation large models and applications remain few in number, constrained by data, computing power, and other factors.
Online, Google's Gemini has been pitted against Sora in head-to-head comparisons, and observers have found that some scenes generated by Sora appear to violate basic common sense; the results do not yet look perfect.
Overall, it remains unclear how multimodal models will change the industry. The Science and Technology Innovation Board Daily will continue to follow and report on the subsequent business developments of the listed companies mentioned above.