Author | zer0
Edit | Desert Shadow
How do you build the strongest AI at the smallest scale?
Zhidongxi reported on February 2 that this Thursday, Facewall Intelligence, the first startup in China to pursue a "large model + agent" approach, released its new-year flagship device-side model, MiniCPM.
This is a 2B-class "little steel cannon" of performance: with only 2.4 billion parameters, it can take on large models of tens of billions of parameters.
Mistral AI, the European generative-AI unicorn that shot to fame earlier, successfully challenged Llama 2 with its 7-billion-parameter Mistral-7B, a bold route that made it the benchmark proof that billion-scale models can achieve high performance.
Today this "Chinese Mistral" arrives fully open as a dark horse: it surpasses Mistral-7B on a number of mainstream evaluation leaderboards and, for the first time, makes multimodal deployment on the device side genuinely work, with measured examples. It can chat and write, and it can also understand image content and give accurate answers.
The team behind these results, a mix of young and senior talent, is based in Wudaokou, Beijing, the densest cluster of large-model companies in China. Founded only a year ago, it has a research team of more than 100 people, 80% of them from Tsinghua or Peking University, with an average age of just 28.
During the press conference, Zhidongxi had in-depth exchanges with Facewall Intelligence's core founding team. According to their sharing, Facewall Intelligence holds three technical advantages in large models: 1) algorithm optimization, with its self-developed "model sandbox"; 2) cost-saving tricks, supporting inference on CPUs and efficient training and fine-tuning on consumer graphics cards; 3) data governance, forming a closed loop from data governance to multi-dimensional evaluation that drives rapid model iteration.
Since the second half of last year, the smart-hardware scene has grown livelier and livelier: Huawei, Xiaomi, OPPO, vivo, Honor, and other major manufacturers have raced to put billion-parameter models on mobile phones; the AI PC concept debuted at CES 2024, the international consumer-electronics show; and a number of startups have taken a shot at new forms of AI hardware.
Using smaller models to build stronger AI has become another focal direction now that the large-model race has reached the 100-billion-parameter scale. It reflects a problem facing smart-hardware products: cloud-hosted models are strong enough, but if the device side cannot stand on its own, issues such as network outages and slow responses degrade the end user's experience.
The key to landing large models on the device side is to get three things right: first, the size must be small enough; second, the performance must be good enough; third, the cost must be low enough.
Because smart hardware has limited memory capacity and bandwidth, the smaller the device-side model, the lower its computation and memory footprint, the lower its compute cost, power consumption, and inference latency, and the faster the device-side AI application responds.
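A rough back-of-envelope sketch illustrates why (my own illustration; the 30 GB/s bandwidth figure is an assumption for an older mobile SoC, not a number from the article): autoregressive decoding is largely memory-bandwidth bound, so fewer bits per weight means both a smaller footprint and more tokens per second.

```python
# Back-of-envelope estimate of on-device memory use and decode speed.

def model_bytes(params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in bytes at a given quantization level."""
    return params * bits_per_weight / 8

def decode_tokens_per_sec(model_gb: float, mem_bandwidth_gb_s: float) -> float:
    # Each new token requires streaming all weights through memory roughly once,
    # so bandwidth / model size gives an upper bound on tokens per second.
    return mem_bandwidth_gb_s / model_gb

params = 2.4e9  # MiniCPM-scale parameter count
for bits in (16, 8, 4):
    gb = model_bytes(params, bits) / 1e9
    tps = decode_tokens_per_sec(gb, 30)  # 30 GB/s: assumed mobile bandwidth
    print(f"{bits:>2}-bit: ~{gb:.1f} GB weights, ~{tps:.0f} tok/s upper bound")
```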
In this fast-moving technology race, cost has become one of the most telling competitive dimensions for large models. Cost determines a large model's profit margin and is the focus of smart-terminal companies. Device-side models are always available and cheap to run; through device-cloud collaboration they can offset the inherent cost and barrier-to-entry disadvantages of deploying 100-billion-parameter models at scale, and relieve the compute burden on cloud data centers.
Facewall Intelligence, a startup incorporated only in August 2022, had previously focused on 100-billion-parameter models and AI agents, and lacks the natural hardware advantage of the major phone makers. Why, then, did it choose to enter the device side?
It starts with the team's mission: Facewall Intelligence's vision of "intelligence in everything" and the OpenBMB open-source community's vision of "letting large models fly into thousands of households" both aim to let as many people as possible enjoy the general intelligence of large models in as many places and scenarios as possible.
Just as human intelligence divides its work among the brainstem, the cerebellum, and the cerebrum, models of different sizes will in the future take on tasks of different complexity, making the path to artificial general intelligence (AGI) more efficient.
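A minimal sketch of what such size-tiered division of labor could look like (my illustration of the idea, not Facewall's implementation; the complexity heuristic is a placeholder):

```python
# Route each request to the smallest model tier that can handle its complexity,
# falling back to the largest (cloud) model.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str
    max_complexity: int          # crude complexity budget this tier can serve
    run: Callable[[str], str]    # the model call for this tier

def estimate_complexity(prompt: str) -> int:
    # Placeholder heuristic; a real system would use a learned router.
    return len(prompt.split())

def route(prompt: str, tiers: list[Tier]) -> str:
    c = estimate_complexity(prompt)
    for tier in tiers:                       # tiers ordered small -> large
        if c <= tier.max_complexity:
            return tier.run(prompt)
    return tiers[-1].run(prompt)             # largest model as final fallback

tiers = [
    Tier("on-device 2B", 32, lambda p: f"[2B] {p}"),
    Tier("cloud 100B", 10_000, lambda p: f"[100B] {p}"),
]
print(route("What time is it?", tiers))
```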
Betting on the device side is therefore an important part of Facewall Intelligence's strategy.
A 2B-scale model can run on mobile devices that are closer to the user and more portable, playing a role in more places and easing the high cost and high barrier that large models face in real deployments.
From the perspective of technical research and judgment, the launches of ChatGPT and GPT-4 showed by 2023 that the technical route of large models was basically settled; the next step is to explore their scientific mechanisms and push efficiency to the limit.
Liu Zhiyuan, tenured associate professor in Tsinghua University's Department of Computer Science and co-founder of Facewall Intelligence, said he hopes this device-side model will make more people realize that even a 2B-size model has an upper limit of capability far beyond imagination. Just as ships and airplanes are built on the foundations of fluid mechanics, the team's commitment to scientific research on large models is an important driving force for real commercialization and sustainable development.
At the same time, by catalyzing applications through device-cloud collaboration, the device-side large model can better serve Facewall Intelligence's "large model + agent" dual-engine strategy. The technology accumulated on device-side large models aligns with the continuing miniaturization of cloud-side large models, and will ultimately help accelerate the march toward AGI.
"If agent capabilities are applied to the device-side model, it can better serve specific scenarios and create more value. I think these two directions can support each other and produce some wonderful chemical reactions," said Zeng Guoyang, co-founder and CTO of Facewall Intelligence.
In 2023, Mistral-7B arrived and defeated Llama 2, the reigning open-source large language model with tens of billions of parameters, using only 7B parameters, becoming the exemplar of "fighting big with small" and setting a new benchmark for open source.
At the beginning of this year, Facewall Intelligence picked up the baton of "miniaturizing large models": it launched the "performance flagship" MiniCPM which, with a 2B parameter scale and 1T tokens of curated data, swept a number of mainstream evaluation leaderboards. Its average score across Chinese and English exceeds Mistral-7B, and its Chinese and general capabilities exceed those of Microsoft's star model Phi-2 (distilled from GPT-4).
Given the compound test question "Which is the tallest mountain in Shandong Province? Is it taller or shorter than Mount Huangshan, and by how much?", MiniCPM not only gives the correct altitudes but also computes the difference, noticeably faster than searching and calculating by hand.
MiniCPM-2B not only has stronger general and Chinese capabilities; in English it can also wrestle with models of billions or even tens of billions of parameters.
It also sidesteps the pitfalls of mixed-language translation: asked in English to translate a sentence mixing Chinese and English into French, it understands the intent and outputs the correct answer.
MiniCPM is also adept at role-playing: playing Li Kui asking Song Jiang for money, it vividly captures the character's tone and manner of speaking; writing a love letter to a wife, it deliberately slips in emojis that express affection. It can therefore drive emotional-companion chatbot applications on the device.
In addition, MiniCPM's programming capability surpasses Mistral-7B's: it can run on the device and write code, helping to save programming effort.
Even pitted against large models of tens of billions of parameters, MiniCPM-2B leads in most evaluations.
On MT-Bench, the evaluation set closest to human ratings, MiniCPM received a good score.
After INT4 quantization, MiniCPM can be deployed on a mobile phone for inference, with a streaming output speed slightly faster than human speech.
MiniCPM open-source address:
MiniCPM can not only talk but also see: it is among the first to get a multimodal large model deployed and running on a mobile phone. MiniCPM-V outperforms other multimodal models of the same scale in evaluations, performing comparably to, or even better than, the 9.6B Qwen-VL-Chat, and it can interpret image details and understand abstract memes.
Why bring multimodal capabilities to the device? Li Dahai, co-founder and CEO of Facewall Intelligence, gave an extreme example: you are camping in the wild with a poor signal and run into a snake. How do you tell whether it is venomous? Take a photo, feed it to the device-side large model, and you get a timely answer. In an emergency, you can likewise turn first to the device-side model when the network is down.
The multimodal capability does not stop there: the larger OmniLMM has achieved capabilities leading the open-source community at its scale. For example, in a guess-the-game demo, it pairs with the text-only ChatGPT-3.5 in a multimodal streaming setup to play rock-paper-scissors.
The streaming real-time interaction works by using OmniLMM 12B to convert video frames into text descriptions, then having the text-only ChatGPT-3.5 answer questions based on those descriptions and the user's queries.
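The described two-stage pipeline could be sketched roughly as follows; `describe_frame` is a hypothetical stand-in for an OmniLMM 12B captioning call, while the ChatGPT-3.5 call uses the real OpenAI chat-completions API:

```python
# Two-stage demo pipeline: multimodal model captions a frame, text-only
# ChatGPT-3.5 reasons over the caption.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_frame(frame_bytes: bytes) -> str:
    """Hypothetical stand-in for an OmniLMM 12B captioning call."""
    return "a hand showing scissors (placeholder; replace with a real OmniLMM call)"

def play_round(frame_bytes: bytes) -> str:
    caption = describe_frame(frame_bytes)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "We are playing rock-paper-scissors."},
            {"role": "user", "content": f"The camera sees: {caption}. "
                                        "What should I play to win? Answer in one word."},
        ],
    )
    return resp.choices[0].message.content
```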
Multimodal large models can pick up many image details. In the left image, the dog wears no guide-dog vest, yet the model judges it to be a guide dog from surrounding cues; in the right image, the model infers from the station logo that it is a TV program.
These capabilities are currently integrated in the 12B model and will be brought to MiniCPM-V at a later date.
OmniLMM open-source address:
Liu Zhiyuan shared that in multimodal large models the gap between China and the rest of the world is relatively small, but the technology is less mature than large language models: the handling of different modalities is inconsistent, and image generation and understanding have not yet been well unified. Multimodal architectures remain diverse today, leaving room for further exploration.
All-around cost reduction is a highlight of MiniCPM.
As a cost-saving large model, MiniCPM supports CPU inference and consumer-grade graphics-card training. After INT4 quantization it occupies only 2 GB of space, meeting the conditions for deployment on device-side mobile phones.
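For reference, a minimal sketch of loading a MiniCPM-scale model with 4-bit weights via Hugging Face transformers and bitsandbytes (my example, assuming the openly published checkpoint ID; the team's actual on-phone stack is different):

```python
# 4-bit (NF4) weight loading with transformers + bitsandbytes; needs a CUDA GPU.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "openbmb/MiniCPM-2B-sft-bf16"  # assumed published checkpoint ID
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weight quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
# 2.4B params * 4 bits ~= 1.2 GB of weights, plus activations and KV cache,
# which is consistent with the ~2 GB footprint quoted above.
```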
Do the simple arithmetic: a 600-yuan device with a Snapdragon 855 chip runs 7.5 tokens per second; amortized over 5 years, MiniCPM's device-side inference costs only 1 yuan per roughly 1.7 million tokens, a small fraction of Mistral-Medium's cloud price, a cliff-edge drop.
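Reproducing that arithmetic (with utilization assumed at 100%, which is why the result lands slightly above the quoted 1.7 million tokens per yuan):

```python
# Amortized cost of on-device inference, using the article's figures.

phone_price_yuan = 600   # Snapdragon 855 device, per the article
tokens_per_sec = 7.5     # measured MiniCPM decode speed, per the article
years = 5
utilization = 1.0        # assume the device decodes around the clock

lifetime_tokens = tokens_per_sec * 3600 * 24 * 365 * years * utilization
tokens_per_yuan = lifetime_tokens / phone_price_yuan
print(f"lifetime tokens: {lifetime_tokens:.2e}")   # ~1.18e9
print(f"tokens per yuan: {tokens_per_yuan:.2e}")   # ~2e6, same order as the
                                                   # ~1.7M tokens/yuan cited
```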
Beyond device-side inference, its cost advantage also shows in cheap continued improvement and secondary development. Because the model is small enough, a single 1080 or 2080 graphics card suffices for parameter-efficient fine-tuning, a single 3090 or 4090 for full-parameter fine-tuning, and one machine for continued parameter training; the quantized version is compressed by 75% with essentially no loss in performance.
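A minimal parameter-efficient fine-tuning sketch using LoRA via the peft library, the kind of setup a 1080/2080-class card can handle (my illustration; the target module names are assumptions that depend on the model's architecture):

```python
# LoRA fine-tuning setup: only small adapter matrices are trained.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "openbmb/MiniCPM-2B-sft-bf16", trust_remote_code=True  # assumed checkpoint ID
)
lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically <1% of weights trainable, which
                                    # is why a consumer GPU can handle the job
```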
At present, MiniCPM lands mainly on mobile phones, and user value in more smart-terminal scenarios remains to be explored. According to Li Dahai, MiniCPM already runs across mainstream international phone brands and terminal CPU chips, and handles older phones without strain.
The Facewall team has not yet done deep optimization or systematic testing of the phone inference models; it has only verified, as an outside developer would, the feasibility of running MiniCPM inference on phone chips.
To train a good model, efficiency is key. In the Facewall team's view, efficient full-process infrastructure (infra) is the moat of large-model entrepreneurship and determines the technical ceiling: without it you may still get good results in the short term, but deeper work will be constrained by infra.
Facewall Intelligence has built a full-process optimization and acceleration tool suite, ModelForce, including: BMTrain, an efficient training framework developed in 2021 that reaches industry SOTA in distributed training and lowers the threshold for training 100-billion-parameter models to 64 GPUs; BMInf, an efficient inference framework using efficient sampling acceleration and sparse activation to deliver 3x inference speedups; BMCook, an efficient compression framework with near-lossless INT4 compression for over 5x inference speedup and 70% lower storage overhead; and BMTune, an efficient tuning framework offering toolkits for fine-tuning, prompt learning, and more.
With these tools, Facewall Intelligence achieves 10x faster inference and 90% lower cost.
Han Xu, chief researcher at Facewall Intelligence, said that much infra work accelerates training by exploiting various devices and compute, while on the algorithm side the team actively looks for efficient properties that match the hardware, gaining efficiency at the algorithm and model level; the two working in concert improve the inference performance of the device-side large model.
Throughout the conversation, Facewall Intelligence's core founding team kept returning to one keyword: efficiency.
Small size is the ultimate arena for model technology, and efficiency has long been Facewall's traditional strength. Its ability to achieve "small yet mighty" comes from the team's layered optimization of compute, data, and algorithms: beyond the cost-saving playbook above, it stacks two more buffs, data governance and algorithm optimization.
On data governance, Facewall Intelligence has built a modern "data factory", forming an effective closed loop from data governance to multi-dimensional evaluation and driving rapid model iteration through accumulated high-quality data and continued-training-friendly data strategies. Zeng Guoyang said that Facewall Intelligence's experience in handling anomalous data and its understanding of data selection are technical barriers underpinning its continued progress on large models.
Two things let MiniCPM reach high performance on 1T tokens of data: first, high-quality data, training on carefully curated datasets; second, thousands of preliminary experiments, which brings in the more efficient training techniques Facewall Intelligence has explored in algorithm optimization.
On algorithm optimization, Facewall Intelligence invented its own "model sandbox" technique: training larger models on the same amount of data, using small models to push large-model performance to the limit, sharing hyperparameter schemes between small and large models, and sustaining an optimal, efficient, and scalable training strategy. Liu Zhiyuan drew an analogy: the technical barrier here is like cooking, where getting the recipe does not make you a three-Michelin-star chef.
A sandbox is a security mechanism that gives running programs an isolated environment, often used to experiment with programs that are untrusted, destructive, or of unknown intent. Before releasing MiniCPM, Facewall Intelligence ran thousands of model sandbox experiments to explore optimal hyperparameter configurations, ensuring that a model of any size can be trained to its best result.
For example, the team improved on the learning-rate schedulers in common use and arrived at the warmup-stable-decay (WSD) scheduler, which is very friendly to continued training. The new scheduling strategy finds the best decay step count and trains more efficiently under continuation. A scheduler of this kind lets one trained model be further optimized for different downstream purposes.
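A minimal sketch of the WSD shape (shape only, under assumed warmup and decay fractions; Facewall's exact configuration is in their technical report):

```python
# Warmup-stable-decay learning-rate schedule: short warmup, long plateau,
# short final decay. Checkpoints on the plateau suit continued training.

def wsd_lr(step: int, total: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1,
           min_lr: float = 0.0) -> float:
    warmup_end = int(total * warmup_frac)
    decay_start = int(total * (1 - decay_frac))
    if step < warmup_end:                        # linear warmup
        return peak_lr * step / max(warmup_end, 1)
    if step < decay_start:                       # stable plateau
        return peak_lr
    frac = (step - decay_start) / max(total - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * frac   # short linear decay

total = 100_000
for s in (0, 500, 50_000, 95_000, 99_999):
    print(s, round(wsd_lr(s, total, peak_lr=0.01), 5))
```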
Besides finding that a learning rate of 0.01 yields the best loss at any model scale, the "model sandbox" also delivered: hyperparameter-stable model-size scaling, with adjustments close to Cerebras-GPT so that one set of hyperparameters governs all model sizes; optimal batch size, the best balance between convergence speed and resource consumption; a fixed model-growth multiplier that can be annealed at any time, giving the optimal stage-wise growth multiple; and a data curriculum friendly to continued training, adding high-quality data in the WSD scheduler's decay phase for better capabilities while still supporting continued training.
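For intuition, a hedged sketch of hyperparameter-stable width scaling in the spirit of muP as used by Cerebras-GPT (illustrative scaling rules, not Facewall's exact scheme): tune on a small proxy width, then rescale for the target width.

```python
# Illustrative muP-style rules: tune once at a small width, transfer by rescaling.

def scale_hparams(base: dict, base_width: int, target_width: int) -> dict:
    m = target_width / base_width
    return {
        "lr_hidden": base["lr_hidden"] / m,             # hidden LR shrinks ~1/width
        "init_std_hidden": base["init_std_hidden"] / m ** 0.5,  # init std ~1/sqrt(width)
        "output_mult": base["output_mult"] / m,         # scale down output logits
    }

base = {"lr_hidden": 0.01, "init_std_hidden": 0.02, "output_mult": 1.0}
print(scale_hparams(base, base_width=256, target_width=4096))
```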
Liu Zhiyuan said that a "large model" is not merely a big model; it is really a technology with big data, parameter governance, and scientific method built in. Today Facewall's technology can train a 2B model to do what previously required at least a 4B model, and the same methods carry over in the same vein, for instance to doing the work of 80B or even 800B models with smaller ones.
For more details on MiniCPM's algorithm optimizations, see the technical report uploaded to its open-source project.
Direct link:
In Li Dahai's view, an important competitive advantage in building large models is a strong enough capacity for original technical exploration.
As one of the earliest large-model research teams, Facewall Intelligence is among the few startups that brought in industry managers the moment it stepped out of the laboratory, operating and thinking as a commercial company from the start.
Co-founder Liu Zhiyuan is a tenured associate professor at Tsinghua University; co-founder and CEO Li Dahai was CTO of Zhihu; co-founder and CTO Zeng Guoyang is a prodigy who began programming at age 8; and chief researcher Han Xu is a postdoctoral fellow in Tsinghua's Department of Computer Science.
According to reports, Facewall Intelligence grew out of the Tsinghua NLP Laboratory: in 2018 the lab released ERNIE, the world's first knowledge-guided pre-trained model; in December 2020 it formed the main lineup behind the first Wudao large model and released CPM, the world's first Chinese open-source large model, with 2 billion parameters; and in April 2022 it founded the OpenBMB open-source community.
In the large-model era, AI technology is mature enough to be standardized and productized across industries. Realizing that a university laboratory alone could not carry out the most cutting-edge exploration, Liu Zhiyuan began preparing the company in 2021, with the founding aspiration of "bringing large models into thousands of households". Li Dahai, then CTO of Zhihu, first joined as an investor and later became CEO of Facewall Intelligence, taking a direct hand in management.
Facewall Intelligence was incorporated in August 2022, received angel-round financing from Zhihu in April 2023, and launched a variety of foundation models and representative agent products throughout 2023.
Beyond its cooperation with the Tsinghua NLP Laboratory, Li Dahai revealed that Facewall Intelligence also works extensively with Zhihu, whose data plays a very large role in multimodal large-model training, another of Facewall Intelligence's advantages.
Facewall Intelligence currently has three main product lines: large models, AI agents, and AI infra.
Facewall Intelligence is not obsessed with the "bigger is better" route, but neither has it abandoned research on ultra-large language models. Its 100-billion-parameter model CPM-C already surpasses GPT-3.5 in performance, at an inference cost currently half the price of GPT-3.5 Turbo, with plenty of room for further cost reduction. A larger, stronger CPM-D is in training.
Hu Shengding, a Ph.D. student in Tsinghua's Department of Computer Science and a member of Facewall Intelligence's research team, explained that scaling up models is very important, and that experimenting on smaller models is not an end but a means: it ultimately serves especially large models on the path to superintelligence. Meanwhile, developing smaller models lowers the cost of intelligence, meets more use cases, and brings intelligence within more people's reach.
"It looks like we do many things, but the core is actually very clear," Liu Zhiyuan said. The shared vision of Facewall Intelligence and the Tsinghua NLP Lab is to realize AGI and have it serve all of human society: "We will do whatever AGI needs."
Next, Facewall Intelligence will press on with its "large model + agent" dual-engine strategy, exploring smaller models, faster speeds, and lower costs, and open-sourcing its full family of models as contributions to the community.
MiniCPM sets a new performance benchmark for 2B large models, leads in being "extremely efficient, extremely low-cost, and extremely small", and brings multimodal capabilities to the device side for the first time in the industry.
Beyond open-sourcing the models, Facewall Intelligence also discloses the experimental results and data-mixture recipes from its R&D, hoping to advance together with peers and jointly push toward artificial general intelligence.
In Liu Zhiyuan's view, making AGI benefit everyone requires both pursuing larger models with stronger capabilities and fully finding and mining the performance ceiling of models of a fixed size; both are important missions on the road to AGI. The field's next task must be to make this route more scientific and standardized, which is also an important mission of the industry-academia collaboration between Facewall Intelligence and the Tsinghua NLP Laboratory.
One last easter egg: with Chinese New Year approaching, Facewall Intelligence has built an application called "Heart" with a large-model-powered "CP matching" feature; the beta is open and everyone is welcome to try it.