Researchers from Anhui Polytechnic University, Nanyang Technological University, and Lehigh University have open-sourced a large multimodal model, TinyGPT-V.
TinyGPT-V uses Microsoft's open-source PHI-2 as its base large language model and the vision model EVA to achieve multimodal capabilities. Although TinyGPT-V has only 2.8 billion parameters, its performance is comparable to that of models with tens of billions of parameters.
In addition, TinyGPT-V can be trained on a single 24GB GPU, with no need for high-end graphics cards such as the A100 or H100.
This makes it well suited to small and medium-sized businesses and individual developers, and it can be deployed on devices such as phones and laptops.
Open source address:
TinyGPT-V main architecture.
TinyGPT-V is mainly composed of three parts: the large language model PHI-2, a visual encoder, and linear projection layers.
The developers chose Microsoft's latest open-source PHI-2 as the underlying large language model for TinyGPT-V. With only 2.7 billion parameters, PHI-2 has very strong comprehension and reasoning capabilities, and has been shown to match or exceed 13-billion-parameter models on a number of complex benchmarks.
The visual encoder uses the same architecture as MiniGPT-v2 and is based on the ViT-based EVA model. It is a pre-trained vision foundation model that remains frozen throughout the training of TinyGPT-V.
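As a rough illustration (not the authors' actual code), keeping a pre-trained vision backbone frozen in PyTorch typically looks like the following sketch; the torchvision ViT here is only a stand-in for the EVA encoder.

```python
import torch.nn as nn
from torchvision.models import vit_b_16  # stand-in backbone; TinyGPT-V actually uses EVA

def freeze(module: nn.Module) -> nn.Module:
    """Freeze every parameter so the optimizer never updates this module."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()  # eval() also disables dropout during forward passes

# Illustrative only: a torchvision ViT standing in for the frozen EVA encoder.
vision_encoder = freeze(vit_b_16(weights=None))
```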
The role of the linear projection layers is to map the image features extracted by the visual encoder into the embedding space of the large language model, so that the language model can understand the image information.
The first linear projection layer in TinyGPT-V adopts the Q-Former structure from BLIP-2, which allows it to reuse BLIP-2's pre-training results as much as possible.
The second linear projection layer is newly initialized from a Gaussian distribution in order to bridge the dimensional gap between the output of the previous layer and the language model's embedding layer.
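A minimal sketch of these two projection steps in PyTorch is shown below. The feature sizes (1408 for the visual features, 768 for the Q-Former-style bottleneck, 2560 for the PHI-2 embeddings) are illustrative assumptions rather than figures from the article, and the Q-Former itself is simplified here to a single linear layer.

```python
import torch
import torch.nn as nn

class VisualProjection(nn.Module):
    """Sketch of the two-step mapping from vision features to LLM embeddings.

    Dimensions are assumptions for illustration only; the Q-Former is a full
    transformer module in BLIP-2 but is reduced to one linear layer here.
    """

    def __init__(self, vit_dim=1408, qformer_dim=768, llm_dim=2560):
        super().__init__()
        # First projection: maps frozen visual features into a BLIP-2-style space.
        self.proj1 = nn.Linear(vit_dim, qformer_dim)
        # Second projection: bridges the gap to the language model's embedding
        # size and is freshly initialized from a Gaussian, as described above.
        self.proj2 = nn.Linear(qformer_dim, llm_dim)
        nn.init.normal_(self.proj2.weight, mean=0.0, std=0.02)
        nn.init.zeros_(self.proj2.bias)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vit_dim) from the frozen encoder
        return self.proj2(self.proj1(image_features))

# The projected tokens would then be concatenated with text token embeddings
# before being fed to the language model.
```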
TinyGPT-V training process.
The training of TinyGPT-V went through four stages, each using different datasets and experimental setups.
The first stage is warm-up training, aimed at adapting the PHI-2 model to image-modality input. The training data for this stage consists of three datasets, Conceptual Captions, SBU, and LAION, with a total of about 5 million images and corresponding descriptive text.
The second stage is pre-training, aimed at further reducing the loss on image-text pairs. This stage also uses the Conceptual Captions, SBU, and LAION datasets from stage 1, and the experiment was set up in 4 phases of 5,000 iterations each.
In the third stage, instruction tuning is carried out: the model is trained on instruction-style image-text pairs from MiniGPT-4 and LLaVA, with instructions such as "describe the content of this image".
In the fourth stage, multi-task tuning is performed using more complex and richer multimodal datasets, such as semantically aligned sentences from LLaVA, object parsing data from Flickr30K, a multi-task mixed corpus, and a plain-text corpus.
At the same time, a learning rate strategy similar to that of the second stage was adopted, which ultimately brought the loss down from 2.720 to 1.399.
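A highly simplified sketch of how such a staged schedule might be driven in PyTorch follows. Only the 4 x 5,000 iterations of stage 2 comes from the description above; the other iteration counts, the learning rates, and the model(images, text) interface are placeholder assumptions.

```python
import torch

# Placeholder stage table loosely mirroring the four stages described above.
# Except for stage 2's 4 x 5,000 iterations, the numbers are illustrative only.
STAGES = [
    {"name": "warm-up",          "iters": 20_000,    "lr": 1e-4},
    {"name": "pre-training",     "iters": 4 * 5_000, "lr": 1e-4},
    {"name": "instruction-tune", "iters": 5_000,     "lr": 3e-5},
    {"name": "multi-task",       "iters": 35_000,    "lr": 1e-5},
]

def run_stage(model, loader, iters, lr):
    """Train only the parameters that still require gradients (e.g. the
    projection layers); the frozen vision encoder is never updated."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    for _, (images, text) in zip(range(iters), loader):
        loss = model(images, text)  # assumed to return the language-modeling loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```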
To test TinyGPT-V, the researchers evaluated its performance from multiple perspectives on vision-language tasks such as visual question answering, visual-spatial reasoning, and image captioning.
The results show that despite its very small parameter count, TinyGPT-V delivers very strong performance; for example, on the VSR spatial reasoning task it achieved 53.2% accuracy, higher than all the other models tested.