Commercially available, over 12,000 stars! Microsoft's open-source multimodal model LLaVA-1.5

Mondo Technology Updated on 2024-02-01

With OpenAI's release of GPT-4V, multimodal capabilities have rapidly become mainstream, and excellent open-source multimodal models such as MiniGPT-4 and LLaVA have emerged.

Researchers from Microsoft Research and the University of Wisconsin–Madison have open-sourced LLaVA-1.5, built on top of LLaVA. Compared with the previous generation, LLaVA-1.5 introduces a new cross-modal connector and academic visual Q&A datasets with specific response formats, comprehensively improving its multimodal understanding and generation capabilities.

To evaluate LLaVA-1.5, the researchers tested it on 11 well-known benchmarks such as MME, MMBench, SQA (ScienceQA), and POPE, covering visual question answering and other multimodal tasks. The results showed that LLaVA-1.5 achieved the best results among open-source models across the board, comparable to the performance of GPT-4V.

Open-source address: *

Demo: *

Address: *

A brief introduction to LLaVA-1.5

LLaVA-1.5 keeps the overall architecture of the previous LLaVA, which consists of three parts: a vision model, a large language model, and a vision-language connector. An MLP connector replaces the original linear projection, which greatly improves visual understanding and generation capabilities.
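
The minimal PyTorch-style sketch below shows how these three parts could be wired together; the class and method names are illustrative and are not taken from the released LLaVA code.

```python
import torch
import torch.nn as nn

class LlavaStyleModel(nn.Module):
    """Sketch of the three-part architecture: vision encoder -> connector -> LLM."""

    def __init__(self, vision_encoder, projector, language_model):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g. a CLIP ViT backbone
        self.projector = projector              # MLP mapping image features into the LLM embedding space
        self.language_model = language_model    # e.g. a Vicuna-style decoder

    def forward(self, pixel_values, input_ids):
        # 1) Encode the image into a sequence of patch features.
        image_features = self.vision_encoder(pixel_values)         # (B, N_patches, D_vis)
        # 2) Project the image features into the LLM's word-embedding space.
        image_tokens = self.projector(image_features)               # (B, N_patches, D_llm)
        # 3) Embed the text tokens and prepend the projected image tokens.
        text_embeds = self.language_model.embed_tokens(input_ids)   # (B, T, D_llm)
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        # 4) The decoder attends over both modalities and generates text.
        return self.language_model(inputs_embeds=inputs_embeds)
```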

1) Vision model: LLaVA-1.5 uses CLIP ViT-L at 336 px input resolution (CLIP-ViT-L-336px), a vision model pre-trained on large-scale data, to extract feature representations of the input image.

After CLIP encoding, a fixed-length sequence of feature vectors is obtained that represents the semantic content of the image. Compared with the previous version of LLaVA, the CLIP encoder operates at a substantially higher input resolution, capturing finer-grained visual detail.
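
As a concrete illustration, the snippet below extracts patch-level image features with the public CLIP ViT-L/14-336px checkpoint via the Hugging Face transformers library; the checkpoint name and the local image path are assumptions for the example, not details taken from the article.

```python
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Public CLIP ViT-L/14 checkpoint at 336 px input resolution (assumed checkpoint name).
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

image = Image.open("example.jpg")            # any local image file
inputs = processor(images=image, return_tensors="pt")

outputs = encoder(**inputs)
patch_features = outputs.last_hidden_state   # (1, 577, 1024): CLS token + 24x24 patches
```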

2) Large language model: LLaVA-1.5 uses Vicuna v1.5 with 13 billion parameters as its large language model. It helps LLaVA-1.5 understand the text entered by the user, captures the semantic content of that text, and provides strong reasoning and generation capabilities.

Unlike approaches that keep the language model frozen and tune only the visual side, LLaVA-1.5 also updates the parameters of the large language model during training. In this way, the language model directly learns how to integrate visual information for reasoning, without relying on other modules to control its output, which increases the model's autonomy.
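
A rough sketch of what this means in terms of which parameter groups receive gradients during instruction tuning (module names reuse the illustrative model above and are not the project's actual attribute names):

```python
def set_trainable(model):
    """Freeze the vision encoder; leave the connector and the LLM trainable."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False   # CLIP weights stay fixed
    for p in model.projector.parameters():
        p.requires_grad = True    # the MLP connector is trained
    for p in model.language_model.parameters():
        p.requires_grad = True    # the language model itself is also updated
```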

3) Vision-language connector: LLaVA-1.5 uses a two-layer MLP connector in place of the previous linear projection, which better maps the image features output by the CLIP encoder into the word-embedding space of the large language model.
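
A minimal version of such a two-layer MLP projector might look like the following; the hidden sizes shown (1024 for CLIP ViT-L features, 5120 for a 13B Vicuna) are typical values rather than figures quoted in the article.

```python
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP that maps CLIP features into the LLM's word-embedding space."""

    def __init__(self, vision_dim=1024, llm_dim=5120):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features):
        # (B, N_patches, vision_dim) -> (B, N_patches, llm_dim)
        return self.net(image_features)
```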

Training method, datasets, and instruction tuning

In terms of the training process, LLaVA-1.5 follows LLaVA's two-stage training method. The first stage pre-trains the vision-language representation on about 600,000 image-text pairs and takes roughly 1 hour. The second stage fine-tunes the model on 650,000 multimodal instruction samples and takes about 20 hours.

This efficient two-stage approach ensures that the model converges and allows the entire training process to be completed within a single day. Compared with models that need to be trained on millions or even hundreds of millions of samples, the compute and time cost is orders of magnitude lower.
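
The two stages can be summarized in a small configuration sketch. Which components are frozen in each stage follows the standard LLaVA recipe (stage 1 trains only the connector, stage 2 also updates the language model); the names below are illustrative, not the project's actual configuration files.

```python
# Which data each stage uses and which components it updates.
TRAINING_STAGES = {
    "stage1_feature_alignment": {
        "data": "about 600,000 image-text pairs",
        "trainable": ["projector"],
        "frozen": ["vision_encoder", "language_model"],
        "approx_duration": "about 1 hour",
    },
    "stage2_visual_instruction_tuning": {
        "data": "about 650,000 multimodal instruction samples",
        "trainable": ["projector", "language_model"],
        "frozen": ["vision_encoder"],
        "approx_duration": "about 20 hours",
    },
}
```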

In terms of training data, LLaVA-1.5 integrates six categories of datasets, covering typical applications such as visual question answering and language dialogue. These include: image Q&A datasets (VQA), which provide image-question-answer triples; OCR datasets, which require extracting information from text in images; regional VQA datasets, which require attending to and answering about local regions of an image; language dialogue datasets, which provide multi-turn chat corpora; and so on.
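
For intuition, two hypothetical samples in a unified conversation format are shown below; the field names and file paths are illustrative only and do not reproduce the released data.

```python
# Hypothetical VQA and OCR samples in a unified conversation format.
vqa_sample = {
    "image": "coco/train2017/000000123456.jpg",      # illustrative path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat color is the bus?"},
        {"from": "gpt", "value": "Red"},
    ],
}

ocr_sample = {
    "image": "textvqa/train_images/00001.jpg",        # illustrative path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is written on the sign?"},
        {"from": "gpt", "value": "No parking"},
    ],
}
```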

In addition, the researchers deliberately designed matching response-format prompts to guide the model to adjust its output format according to the type of interaction, so that it meets the needs of specific scenarios.
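
A simple way to picture this is a helper that appends a format prompt depending on the interaction type; the short-answer wording follows the LLaVA-1.5 paper, while the helper itself is hypothetical.

```python
# Response-format prompts appended according to the interaction type.
FORMAT_PROMPTS = {
    "short_vqa": "Answer the question using a single word or phrase.",
    "multiple_choice": "Answer with the option's letter from the given choices directly.",
    "conversation": "",   # free-form chat keeps the question unchanged
}

def build_prompt(question: str, interaction_type: str) -> str:
    suffix = FORMAT_PROMPTS.get(interaction_type, "")
    return f"{question}\n{suffix}".strip()

print(build_prompt("What color is the bus?", "short_vqa"))
```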

In terms of visual instruction tuning, LLaVA-1.5 used several different types of datasets, including VQA, OCR, regional VQA, visual dialogue, and language dialogue, for a total of about 650,000 samples. This data gives the model rich supervision for reasoning about visual scenes and interacting with users about them.

