Key points:
Monkey is a high-performance multimodal large model jointly launched by Huazhong University of Science and Technology and Kingsoft. It addresses the difficulties existing models have with complex scenes and fine visual detail by increasing the input resolution and introducing a multi-level description generation method. Monkey can be built on top of existing visual encoders without pre-training from scratch, which greatly improves R&D efficiency.

1. Monkey is a high-performance multimodal large model that delivers excellent performance in complex scenes and fine-grained visual detail processing.
2. Monkey does not need to be pre-trained from scratch: it can be built on an existing visual encoder, raising the large model's input resolution to 896x1344 pixels (see the tiling sketch after this list).
3. Monkey adopts a multi-level description generation method that supplies rich contextual information, guiding the model to learn the associations between scenes and objects.
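The article does not spell out how the higher resolution is achieved. A plausible mechanism, consistent with reusing a fixed-resolution encoder, is to resize the input to the target resolution and cut it into encoder-sized tiles; the 448x448 tile size below is an assumption (896 and 1344 are both multiples of 448), and the code is a minimal sketch rather than Monkey's actual implementation.

```python
# Minimal sketch of tiling a high-resolution input into patches that a
# fixed-resolution visual encoder can consume. The 896x1344 target comes
# from the article; the 448x448 tile size is an assumption.
from PIL import Image

TILE = 448  # assumed native input resolution of the reused visual encoder

def split_into_tiles(image, width=1344, height=896):
    """Resize to the target resolution, then cut into non-overlapping
    TILE x TILE patches."""
    image = image.resize((width, height))
    tiles = []
    for top in range(0, height, TILE):
        for left in range(0, width, TILE):
            tiles.append(image.crop((left, top, left + TILE, top + TILE)))
    return tiles

# 896x1344 splits evenly into a 2x3 grid of six 448x448 patches.
patches = split_into_tiles(Image.new("RGB", (1024, 768)))
assert len(patches) == 6
```

Each tile can then pass through the same encoder, so the resolution increase costs extra tiles rather than a newly pre-trained encoder.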
Monkey's multi-level description generation method provides rich contextual information that guides the model to learn the associations between scenes and objects. Tested on 16 different datasets, Monkey achieved excellent results on multimodal tasks such as image captioning, visual question answering, and document classification. It demonstrates fine-grained perception of subtle visual information and strong understanding of complex scenes, and has a wide range of potential applications.
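The multi-level description method is described only at a high level. As a rough illustration of the idea, a scene-level caption can be paired with object-level detail to form richer training text; the helper below and its inputs are hypothetical, not Monkey's actual pipeline.

```python
# Hedged sketch: merge a coarse global caption with fine-grained region
# descriptions so the text ties objects to their surrounding scene.
# build_multilevel_description is a hypothetical illustration.
def build_multilevel_description(global_caption: str,
                                 region_descriptions: list[str]) -> str:
    """Combine scene-level and object-level text into one description."""
    details = " ".join(region_descriptions)
    return f"{global_caption} In more detail: {details}"

caption = build_multilevel_description(
    "A busy street market at dusk.",
    ["A vendor arranges oranges on a wooden stall.",
     "A red bicycle leans against a lamp post."],
)
print(caption)
```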
The project is open source.
The quality of Monkey's training dataset is key to its improved capability. The researchers generated hundreds of thousands of high-quality image description samples, using multiple models to automatically generate text descriptions and fusing the outputs of the different models to improve the large model's understanding of image details.
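How the outputs of the different models were fused is not specified in the article. The sketch below shows one naive fusion strategy, deduplicating detail sentences across candidate descriptions; the captioner names and the length-based ordering heuristic are illustrative assumptions, not the researchers' pipeline.

```python
# Naive fusion sketch: keep every detail sentence that is not an exact
# duplicate, ordered from the most to the least detailed candidate.
def fuse_descriptions(candidates: dict[str, str]) -> str:
    seen, merged = set(), []
    for _, text in sorted(candidates.items(),
                          key=lambda kv: len(kv[1]), reverse=True):
        for sentence in text.split(". "):
            key = sentence.strip().lower().rstrip(".")
            if key and key not in seen:
                seen.add(key)
                merged.append(sentence.strip().rstrip("."))
    return ". ".join(merged) + "."

fused = fuse_descriptions({
    "captioner_a": "A dog runs on a beach.",
    "captioner_b": "A golden retriever runs on a sandy beach. "
                   "Waves break behind it.",
})
print(fused)
```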
In terms of model selection, Monkey uses the open-source Qwen-VL as the language decoder and the 2-billion-parameter ViT-BigHuge as the visual encoder, avoiding the resource waste of repeated pre-training. To improve Monkey's recognition ability and input resolution, and to let it generate richer image descriptions and understand complex scenes, three training stages were adopted: multi-level description generation, high-resolution encoding, and multi-task training.
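As a hedged sketch of this composition, the snippet below wires a frozen visual encoder into a language decoder through a linear projection. The projection layer, tensor shapes, and stand-in modules are assumptions for illustration; per the article, the real components would be ViT-BigHuge and Qwen-VL.

```python
# Illustrative composition of a pre-trained visual encoder with a
# pre-trained language decoder, so neither is trained from scratch.
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder, language_decoder,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a ViT, kept frozen
        self.projector = nn.Linear(vision_dim, text_dim)  # bridges the two
        self.language_decoder = language_decoder  # e.g. an LLM decoder
        for p in self.vision_encoder.parameters():
            p.requires_grad = False               # reuse, don't re-pre-train

    def forward(self, image_tiles, text_embeds):
        vision_feats = self.projector(self.vision_encoder(image_tiles))
        # Prepend visual tokens to the text sequence for the decoder.
        return self.language_decoder(
            torch.cat([vision_feats, text_embeds], dim=1))

# Smoke test with dummy stand-ins for the encoder and decoder.
model = VisionLanguageModel(nn.Linear(16, 32), nn.Identity(),
                            vision_dim=32, text_dim=64)
out = model(torch.randn(1, 6, 16), torch.randn(1, 10, 64))
print(out.shape)  # torch.Size([1, 16, 64])
```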
Monkey was thoroughly validated on 16 different datasets covering tasks such as image captioning, general visual question answering, and document-oriented question answering. On general visual Q&A, Monkey shows a clear advantage across multiple datasets. On image captioning, Monkey also performed well on the TextCaps dataset, demonstrating its multimodal understanding of text within images.
On document-oriented Q&A, Monkey achieved good results on multiple document image understanding datasets. According to the researchers, Monkey has broad application potential in medical imaging, satellite imagery, and other fields, and the team will continue to optimize the model's perception, association, reasoning, and generalization capabilities.
In summary, Monkey is a high-performance multimodal large model that addresses the challenges of complex scenes and visual detail processing by increasing the input resolution and introducing a multi-level description generation method. Monkey does not need to be pre-trained from scratch and can be built on top of existing visual encoders, giving it high R&D efficiency and a wide range of applications. Across tests on multiple datasets, Monkey achieved excellent results on multimodal tasks, demonstrating strong perception of visual information and scene understanding. Going forward, the team will continue to optimize the model's perception, association, reasoning, and generalization capabilities and further enhance its application value across fields.