Beyond GPT-4? Claude 3, the strongest model yet, shakes up the large-model landscape

Mondo Technology Updated on 2024-03-07

Media Investment Views: Since its release, Claude 3 has drawn outstanding industry reviews and is a noteworthy event in the recent wave of large-model iteration. Its emergence confirms two of our recent views: 1) In the short term, global large-model iteration is still accelerating, along two main lines: catching up with OpenAI, and iterating multimodal models. Multiple models now claim to approach or benchmark against GPT-4. 2) Domestic large models have a chance to approach GPT-4 level this year. We believe investors may be too pessimistic about the iteration speed and capability of Chinese large models; pragmatically, we recommend paying more attention to applications, whose future boundaries and room for growth may exceed those of the models themselves. During the Spring Festival we began organizing our views on AI large models and applications, and within a day or two after the holiday we continued to convey to the market our positive, optimistic attitude toward this wave of **. We expect that with the release of Sora this year, related applications in the ** field may appear. We recommend focusing on investment opportunities in domestic large models and film/TV applications.

Experts share: Claude 3 was released as three models: Opus, Sonnet, and Haiku. Opus is the strongest, with the most parameters and the highest inference cost; Sonnet is the middle option; Haiku is effectively the lightweight version, with a low inference cost suited to consumer-facing applications. Compared with Claude 2.1, Claude 3 reasons far better on harder questions, with a large jump in solution accuracy. 【Technical Indicators】Comparison with GPT-4, Gemini, and other models: In the past, many large-model vendors compared their own metrics against GPT-4 and claimed to exceed it. Claude 3's metrics are relatively objective, drawn from standard industry benchmarks, and indicate that it has largely caught up with GPT-4. 1. Proficiency tests: On graduate-level reasoning, mathematical problem-solving, multilingual math, coding, and other indicators (circled in the figure below), it clearly beats GPT-4; on the remaining indicators GPT-4 already scores 80+, leaving almost no room for improvement.

2. 4-shot capability: GPT-4's accuracy is 52%, while Claude 3 Opus reaches 60%. In a 4-shot evaluation, the model is shown four worked examples of the task and scored on how well it adapts to them; 0-shot measures the accuracy the base model achieves with no examples from the dataset at all. 3. Long-text capability: Claude 3 supports a 200K context window versus GPT-4's 128K, and for some B2B custom customers Claude 3 offers up to 1 million context tokens. Claude 3 is not only a large language model; it also supports vision, and its vision model is roughly on par with GPT-4's. Overall, Claude 3 is significantly optimized and leads the industry in logical reasoning and long text; in multimodality it is about level with its peers.
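The 4-shot vs. 0-shot distinction above can be made concrete with a small sketch. This is a hypothetical example (the task, examples, and labels are invented for illustration, not from any benchmark): a 4-shot prompt simply prepends four worked examples before the query, while a 0-shot prompt contains only the query.

```python
# Minimal sketch of 0-shot vs. 4-shot prompting. The sentiment task and
# all examples below are hypothetical, chosen only to show prompt shape.

EXAMPLES = [  # the four in-context examples a 4-shot prompt "teaches" from
    ("The movie was wonderful.", "positive"),
    ("I want my money back.", "negative"),
    ("Best purchase I've made all year.", "positive"),
    ("The box arrived crushed and late.", "negative"),
]

def build_prompt(query: str, shots: int = 0) -> str:
    """Assemble a classification prompt with `shots` worked examples."""
    lines = ["Classify the sentiment as positive or negative."]
    for text, label in EXAMPLES[:shots]:
        lines.append(f"Text: {text}\nSentiment: {label}")
    lines.append(f"Text: {query}\nSentiment:")  # the actual query, unlabeled
    return "\n\n".join(lines)

zero_shot = build_prompt("The battery died after a week.", shots=0)
four_shot = build_prompt("The battery died after a week.", shots=4)
# The 4-shot prompt contains four labeled examples; the 0-shot prompt none.
# The model itself is unchanged -- only the prompt differs.
```

The model weights are identical in both settings; the accuracy gap (52% vs. 60% in the benchmark cited above) comes purely from how well the model adapts to the in-context examples.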

【Logical Reasoning Ability】In logical reasoning, Claude 3 achieves about 50% accuracy on a graduate-level test and over 95% on elementary-school math, suggesting it can basically handle work an undergraduate could do. It scores above 80 in multilingual math, programming, text reasoning, and knowledge quizzes, and has reached GPT-4 level on the law school entrance exam, the bar exam, and American math competitions.

Based on past multi-dimensional evaluations of GPT-4 (see the matrix diagram below), Claude 3's capabilities are speculated to be close to human level.

Representative examples: 1) Claude 3 shows marked improvement in reading images, such as recognizing recipes and interpreting equations. 2) It can directly solve a hard problem that "only GPT-4 has solved so far". 3) The mid-tier Claude 3 model can now also respond to questions posed in ASCII code, something only models at GPT-4 level or above can do.

Minutes**: [Wen Bagu Research] Mini Program

We tested some questions ourselves: 1) "Why did Lu Xun beat Zhou Shuren?" Claude 3 recognized that Lu Xun and Zhou Shuren are the same person and gave a very detailed explanation of the question's meaning and historical context; GPT-4 also produced the correct answer. 2) "There are 9 birds in a tree; if you shoot one, how many are left?" GPT-4's answer was relatively simple, while Claude 3's was more detailed and comprehensive. 3) The brain teaser "When yesterday is tomorrow, what day is today?": neither Claude 3 nor GPT-4 answered it correctly. Tentatively, Claude 3 has reached GPT-4 level, but not yet a very strong level of intelligence. Among domestic large models, Wenxin answered two of the questions incorrectly, iFlytek Spark one, and GLM one.

【Long Text Ability】The key differentiator of the entire Claude series is long-text ability. Previously, Claude 2.1's context length already reached 200K, significantly larger than GPT-4's 128K, but its retrieval ability was poor. In the needle-in-a-haystack test results below, green means 100% of the contextual information was retrieved and red means none was. Claude 2.1's retrieval accuracy does not hold up as context length grows; it actually declines with length, and the model easily forgets information. GPT-4, while supporting only 128K, performs better across the whole context. The best-known long-text model in China is Moonshot's Kimi, which matches GPT-4 in retrieval accuracy at 128K context but lags it significantly in logical reasoning.

Claude 3 also supports a 200K context window, with greatly improved retrieval accuracy: it completes the needle-in-a-haystack task very well and genuinely supports retrieval over the full 200K context. As for the claimed 1-million-token context, there are currently no direct test metrics. In long-text capability, Claude 3 catches up with and overtakes Kimi. Overall, Claude 3 Opus is currently the most well-rounded large model.
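The needle-in-a-haystack test described above can be sketched as a simple harness: a known fact (the "needle") is buried at varying depths in filler text, and the model is asked to retrieve it. Everything here is illustrative; `ask_model` is a stub standing in for a real API call to Claude 3 or GPT-4, not an actual client.

```python
# Simplified needle-in-a-haystack harness. `ask_model` is a hypothetical
# placeholder; a real harness would call a model API with the long context.

NEEDLE = "The magic number is 7481."
FILLER = "The sky was clear and the market was quiet that day."

def build_haystack(total_sentences: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * total_sentences
    sentences.insert(int(depth * total_sentences), NEEDLE)
    return " ".join(sentences)

def ask_model(context: str, question: str) -> str:
    # Stub: a real harness sends (context + question) to the model here.
    return "7481" if NEEDLE in context else "unknown"

def retrieval_accuracy(depths, total_sentences=1000):
    """Fraction of insertion depths at which the model finds the needle."""
    hits = 0
    for d in depths:
        context = build_haystack(total_sentences, d)
        if "7481" in ask_model(context, "What is the magic number?"):
            hits += 1
    return hits / len(depths)

score = retrieval_accuracy([0.0, 0.25, 0.5, 0.75, 1.0])
```

Real evaluations sweep both context length and needle depth, producing exactly the green/red grid described above: green cells where the model retrieves the fact, red where it forgets it.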

【Industry Impact】Claude 3 comes in three versions, each at a different cost. Opus is 50% more expensive than GPT-4 for input and 150% more expensive for output. Sonnet performs roughly on par with GPT-4 but is much cheaper. Haiku's performance sits between GPT-4 and GPT-3.5, yet in cost-effectiveness it far exceeds GPT-4.
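The relative prices quoted above translate directly into per-call cost multipliers. A quick back-of-the-envelope check, using placeholder base prices (the GPT-4 figures below are invented for illustration, not real list prices; only the 1.5x/2.5x ratios come from the text):

```python
# Relative cost arithmetic from the quoted figures: Opus input is 50%
# dearer (x1.5) and output 150% dearer (x2.5) than GPT-4.
# GPT4_INPUT / GPT4_OUTPUT are hypothetical prices per 1M tokens.

GPT4_INPUT = 1.0   # hypothetical $ per 1M input tokens
GPT4_OUTPUT = 2.0  # hypothetical $ per 1M output tokens

OPUS_INPUT = GPT4_INPUT * 1.5    # "50% more expensive for input"
OPUS_OUTPUT = GPT4_OUTPUT * 2.5  # "150% more expensive for output"

def call_cost(input_tokens: int, output_tokens: int,
              p_in: float, p_out: float) -> float:
    """Cost of one call, given prices per 1M tokens."""
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# A typical call: long prompt in, short answer out.
gpt4_cost = call_cost(10_000, 2_000, GPT4_INPUT, GPT4_OUTPUT)
opus_cost = call_cost(10_000, 2_000, OPUS_INPUT, OPUS_OUTPUT)
ratio = opus_cost / gpt4_cost  # ~1.79x with these placeholder prices
```

Note that the blended multiplier depends on the input/output token mix of the workload: output-heavy calls drift toward the 2.5x output premium, prompt-heavy calls toward the 1.5x input premium.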

Market Positioning: Opus targets task automation, including scientific R&D and financial strategy formulation; it is the top-end intelligent model on the market. Sonnet is the more economical choice for B2B applications such as data processing and sales tasks. Haiku leans toward B2C uses such as customer interaction and content moderation, cutting task costs and offering the fastest response time. Launching three model tiers effectively covers market demand and makes the lineup more competitive. After industry testing, the view is that Claude 3 can now fully replace GPT-4; in other words, its economics and results are better than GPT-4's. The current ChatGPT market is huge, and if Claude can capture part of it, the valuation of the entire Anthropic team will be very large.

Market Impact: 1) GPT-4 is no longer the most powerful model in the world; Claude 3 is now the best and most comprehensive model across the board. 2) OpenAI's position has been shaken by Anthropic. GPT-4 was released around March 15, 2023, and Claude 3 arrived in less than a year, meaning the gap between the most advanced AI companies and OpenAI is at most about a year. Resources may therefore no longer all tilt toward OpenAI, and if Sam Altman wants to leverage the industry's upstream and downstream to serve OpenAI through monopoly-scale financing, he will face challenges. 3) Competition in large-model development will intensify, and training-side compute demand is expected to rise. Although OpenAI may already have GPT-5 internally, it has not been released, probably because of technology costs or insufficient human alignment work. More players will be confident they can catch up with the leaders, and more convinced that star startup teams are suited to this work; the release of Claude 3 is expected to re-stimulate competitive demand for model training.

【Q&A】

Q: Why is Claude 3 more cost-effective, and how should this be understood from a technical R&D perspective?

A: It relates to the model's parameter count and architecture. As far as we know, GPT-5 will cost more, perhaps already so much that its inference cost is hard for ordinary B2B vendors to accept. To reduce inference cost, GPT-4 adopted an MoE (mixture-of-experts) architecture: the whole model consists of a task-allocation (router) model plus many expert models. The cost differences among the Claude 3 versions may likewise come down to the mixture-of-experts design. Opus may be closer to a unified large model: many parameters must be activated at inference, there are few experts, and each expert covers a wide range. Sonnet likely does a better job in the up-front task-allocation step, splitting the expert models more finely so that fewer parameters are activated and inference is cheaper. Haiku is similar but with even more granular experts, which may reduce reasoning ability, because the finer the experts, the less each one understands. How Claude 3 can be trained to GPT-4 quality while splitting its experts more finely may be both its difficulty and its advantage.
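The activated-parameter trade-off in the answer above can be illustrated with a toy calculation. All sizes here are made up for illustration (the speaker's MoE speculation about Claude 3's internals is unconfirmed); the point is only that with more, smaller experts and the same top-k routing, fewer parameters fire per token:

```python
# Toy mixture-of-experts arithmetic: parameters activated per token when a
# router picks top_k of num_experts equally sized experts. All numbers are
# hypothetical; they illustrate the coarse-vs-fine expert trade-off only.

def moe_active_params(total_params: int, num_experts: int, top_k: int) -> int:
    """Parameters activated per token under top-k routing."""
    expert_size = total_params // num_experts
    return expert_size * top_k

TOTAL = 1_000_000_000  # hypothetical 1B parameters in the expert layers

# Few, large experts (the "unified" Opus-like case in the speculation):
coarse = moe_active_params(TOTAL, num_experts=4, top_k=2)   # 500M active
# Many, fine-grained experts (the Sonnet/Haiku-like case):
fine = moe_active_params(TOTAL, num_experts=32, top_k=2)    # 62.5M active

# Finer experts activate far fewer parameters per token, cutting inference
# cost -- at the possible price of each expert "knowing" less.
```

Total capacity is identical in both configurations; only the per-token compute (and hence the inference bill) differs, which is exactly the lever the answer attributes to the Sonnet and Haiku tiers.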

Q: What determines a model's ability to comprehend long texts? Where do the technical directions or capabilities differ?

A: The technical question is how to modify the model's input side so it accepts a larger context window. This alone is actually not hard: simply extending long-text capacity, as Claude 2.1 did, means increasing the input length and feeding longer text to the model, but that does not mean the model can understand the text and capture the information in it. Still, when you need to feed in a long article, a limited context window blocks that kind of usage, so simply extending the context length does solve part of the need.

The work done by GPT-4, Kimi, and Claude 3 is mainly about improving retrieval accuracy, i.e., mixed attention mechanisms. Attention is the defining feature of the Transformer. The question is how to design the attention model so that attention can be spread across every part of a long text, while using a pyramid-style, multi-layer attention mechanism to capture semantic information at different levels: the shallowest layer captures attention among larger spans of text, the middle layers capture attention at the level of semantic understanding, and the highest layer captures attention among high-level semantic information, all stored in a pyramid structure. Retrieval accuracy therefore depends on whether the attention architecture is well designed. This remains a hot topic in model-algorithm research, squarely in the domain of academic work and at the core of large-model competitiveness.
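For reference, the basic attention operation those architectures build on is scaled dot-product attention. This is a minimal single-query sketch in pure Python (the long-context designs discussed above stack, sparsify, and layer this operation; none of that is shown here):

```python
# Minimal scaled dot-product attention: one query vector attends over a
# list of key/value vectors. This is the textbook Transformer primitive;
# the pyramid/multi-layer schemes described above are built on top of it.

import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)  # how much attention each position receives
    # Output = attention-weighted mix of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query matches the first key far more strongly, so the output is
# pulled toward the first value vector.
out = attention(query=[1.0, 0.0],
                keys=[[1.0, 0.0], [0.0, 1.0]],
                values=[[10.0, 0.0], [0.0, 10.0]])
```

The long-context challenge is that this operation is quadratic in sequence length, which is why the designs above try to distribute and hierarchically compress attention rather than compute it densely over 200K tokens.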

Q: How do you see the gap between the current development stage of domestic large models and those overseas?

A: Some domestic manufacturers are doing well, and they are also in a state of continuous catch-up.

