AI Large Model Topic: With the rapid development of large model capabilities, AI will reshape all walks of life

Mondo Technology Updated on 2024-01-30

Today we share an in-depth research report from the AI series: "AI Large Model Topic: With the rapid development of large model capabilities, AI will reshape all walks of life."

Report producer: Guoxin.

Full report: 68 pages.

Featured report source: The School of Artificial Intelligence

There are three main ways to improve the capabilities of AI large models: increasing the number of model parameters, the amount of training data, and the number of training rounds.

Improvements in the number of model parameters: Taking OpenAI's GPT series as an example, the first-generation GPT-1 had only 117 million parameters; GPT-2 increased the count to 1.5 billion; GPT-3 raised it further to 175 billion; and GPT-4 is widely reported to be at the trillion level. Parameter counts thus grow roughly exponentially between generations (about 13x from GPT-1 to GPT-2 and about 117x from GPT-2 to GPT-3), and model capability rises markedly as the parameter count grows.

Improvement in the amount of training data: 1) more language (text) training data; 2) the addition of multimodal training data, such as images, audio, and video, which greatly increases the size of the training dataset.

Improvement in the number of training rounds: 1) New models: multiple rounds of training improve model capability, but too many rounds lead to overfitting (see the early-stopping sketch below); 2) Existing models: periodic retraining (weekly or monthly) improves both model capability and the timeliness of its data.
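That overfitting trade-off is commonly managed with early stopping on a validation set: keep training while validation loss improves, and halt once it turns back up. A minimal sketch in Python, with made-up loss values standing in for a real training run:

```python
# Early-stopping sketch; the loss values are hypothetical, not from the report.
val_losses = [2.31, 1.87, 1.52, 1.33, 1.28, 1.27, 1.29, 1.34, 1.41]  # one per epoch

patience = 2                 # extra epochs to tolerate after the last improvement
best_loss = float("inf")
epochs_without_improvement = 0

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:
        best_loss = loss
        epochs_without_improvement = 0   # new best: reset the counter
    else:
        epochs_without_improvement += 1  # validation loss is no longer improving
    print(f"epoch {epoch}: val_loss={loss:.2f} (best={best_loss:.2f})")
    if epochs_without_improvement >= patience:
        print(f"stopping at epoch {epoch}: further rounds would overfit")
        break
```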

A multimodal model combines different types of data (such as images, text, and speech) for analysis and processing. Correlating and combining different data types can greatly improve a model's accuracy and robustness and further expand its application scenarios. Multimodality is also closer to how humans learn: in the physical world, humans do not perceive a thing through a single modality. Recognizing a pet, for example, draws on vision (its appearance), hearing (its meow), smell (its body odor), and touch (its fur, its body temperature), forming a comprehensive, three-dimensional perception. Multimodality is therefore the future direction of artificial intelligence.

Multimodal large model classification: single-tower and dual-tower structures. 1) Single-tower structure: a single deep neural network handles the interactive fusion of image and text, which is essentially an early-fusion scheme. 2) Dual-tower structure: separate neural networks extract information from each modality, and interaction and fusion happen only at the last layer; this late-fusion scheme has the advantages of strong modality independence and high training efficiency. A sketch of both layouts follows.
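A minimal sketch of the two layouts, assuming PyTorch; the encoders and layer sizes are illustrative stand-ins, not architectures from the report:

```python
import torch
import torch.nn as nn

class SingleTower(nn.Module):
    """Early fusion: image and text features are concatenated up front,
    then a single shared network models their interaction."""
    def __init__(self, img_dim=512, txt_dim=256, hidden=256):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
        )

    def forward(self, img_feat, txt_feat):
        return self.fusion(torch.cat([img_feat, txt_feat], dim=-1))

class DualTower(nn.Module):
    """Late fusion: each modality has its own encoder; the two towers
    interact only at the end (here via a dot product, CLIP-style)."""
    def __init__(self, img_dim=512, txt_dim=256, embed=128):
        super().__init__()
        self.img_tower = nn.Sequential(nn.Linear(img_dim, embed), nn.ReLU(), nn.Linear(embed, embed))
        self.txt_tower = nn.Sequential(nn.Linear(txt_dim, embed), nn.ReLU(), nn.Linear(embed, embed))

    def forward(self, img_feat, txt_feat):
        img = nn.functional.normalize(self.img_tower(img_feat), dim=-1)
        txt = nn.functional.normalize(self.txt_tower(txt_feat), dim=-1)
        return (img * txt).sum(dim=-1)  # image-text similarity score

img, txt = torch.randn(4, 512), torch.randn(4, 256)
print(SingleTower()(img, txt).shape)  # torch.Size([4, 256])
print(DualTower()(img, txt).shape)    # torch.Size([4])
```

Because the dual-tower encoders only meet at the final layer, each tower can be trained, or its embeddings precomputed, independently of the other, which is where the modality-independence and training-efficiency advantages come from.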

Leading vendors are optimistic about the future of multimodal large models. In the second half of 2023, the pace of multimodal large model releases accelerated: OpenAI led with GPT-4V, and in December Google released the natively multimodal Gemini. Multimodal large models have become the focus of the major model vendors.

The commercialization of large models has accelerated, and the market is growing rapidly. Current deployment falls into three modes: large model, large model + computing power, and large model + application. 1) Large model: enterprise users buy the model product outright or rent the model (e.g., Chinasoft International's model factory). 2) Large model + computing power: vendors sell the model bundled with computing power. 3) Large model + application: vendors sell enterprise users upper-layer applications that integrate the large model's capabilities, with users paying software license fees. As model applications and the surrounding ecosystem mature, the share of the large model + application mode is expected to rise gradually.

Global market: According to Titanium Media International Think Tank data, the global large model market was 10.8 billion US dollars in 2022 and is expected to reach 109.5 billion US dollars by 2028, a CAGR of about 47% over 2022-2028; the global market is growing rapidly.

Chinese market: According to the same Titanium Media International Think Tank data, China's large model market was 7 billion yuan in 2022 and is expected to reach 117.9 billion yuan by 2028, a CAGR of about 60% over 2022-2028.
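Both CAGR figures follow directly from the endpoint values; a quick arithmetic check:

```python
# Verify the report's CAGR figures from the 2022 and 2028 market sizes above.
def cagr(start, end, years):
    """Compound annual growth rate between two readings."""
    return (end / start) ** (1 / years) - 1

print(f"Global: {cagr(10.8, 109.5, 6):.0%}")  # ~47%, USD billions, 2022-2028
print(f"China:  {cagr(7.0, 117.9, 6):.0%}")   # ~60%, RMB billions, 2022-2028
```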

GPT-4V extends the boundaries of the language model to handle a wider range of tasks. OpenAI's GPT-4V represents an important extension of the language model: it retains the traditional text-processing capabilities and adds image-input processing, providing a richer interactive interface and new functionality. This advance lets GPT-4V handle a more diverse range of tasks and enhances the user experience.
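For concreteness, image input on this kind of model is exposed through the chat interface. A minimal sketch, assuming the OpenAI Python SDK (v1+), an OPENAI_API_KEY in the environment, and a vision-capable model name that may differ by account and date:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed GPT-4V-era model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is in this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)
```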

GPT-4V's technical foundation remains GPT-4: it keeps the same training process, using large amounts of text and image data from the Internet, combined with reinforcement learning from human feedback (RLHF) to optimize its outputs.

On safety, OpenAI has applied rigorous evaluation and mitigation measures to GPT-4V, including expert red-team testing and safety restrictions with a high refusal rate for sensitive content. GPT-4V's OCR capability has also been improved to read text from pixels more accurately.

GPT-4V still struggles with multimodal attacks and with scientific literature, especially in the medical field; medical advice from the model should not be accepted, as it may hallucinate, partly because of OCR inaccuracy. Overall, GPT-4V is an important step in expanding the reach of AI and improving the user experience, but its safety and accuracy still need improvement.

Compared with traditional multimodal models, Gemini was designed to be natively multimodal from the start, integrating different types of information more intuitively and effectively.

Previously, the conventional approach to building multimodal models was to train separate components for different modalities and then stitch them together to approximate some of this functionality. Such models perform well on certain tasks, such as describing images, but often struggle with more conceptual, complex reasoning.

Gemini was designed from the beginning as a natively multimodal model, pre-trained across modalities; Google then fine-tuned it with additional multimodal data to further improve its effectiveness. This lets Gemini understand and reason about its inputs seamlessly from the ground up, with state-of-the-art capabilities in nearly every domain. For example:

1) Complex reasoning: Gemini 1.0 has sophisticated multimodal reasoning that helps it make sense of complex written and visual information, giving it a unique ability to surface knowledge that is hard to discern in large volumes of data. Its remarkable ability to extract insights from hundreds of thousands of documents, by reading, filtering, and understanding information, promises new breakthroughs at digital speed in science, finance, and many other fields.

2) Understanding text, images, audio, and more: Gemini 1.0 was trained to recognize and understand text, images, audio, and other content simultaneously, so it is better at grasping nuanced information and answering questions on complex topics. This makes it particularly good at explaining reasoning in subjects such as mathematics and physics.

3) Advanced coding: Gemini 1.0 can understand, explain, and generate high-quality code in the world's most popular programming languages, such as Python, Java, C++, and Go. Its strength in multilingual processing and complex-information reasoning makes it one of the world's leading foundation models for coding.
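As a concrete illustration of the coding use case, here is a minimal sketch assuming the google-generativeai Python SDK; the API key is a placeholder, and the model name "gemini-pro" reflects the Gemini 1.0 era and may have changed since:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")      # placeholder; use a real key
model = genai.GenerativeModel("gemini-pro")  # assumed Gemini 1.0 text model name

response = model.generate_content(
    "Write a Python function that returns the n-th Fibonacci number, with tests."
)
print(response.text)  # the generated code
```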
