On January 29, Baichuan Intelligence released Baichuan 3, a large language model with more than 100 billion parameters. Its basic capabilities have been improved across the board: in a number of authoritative evaluations, its English ability is close to GPT-4's and its Chinese ability exceeds GPT-4's.
Baichuan 3 also excels in specific fields. In medicine, for example, Baichuan Intelligence built a medical dataset of more than 100 billion tokens for the pre-training stage, including medical research literature, real electronic medical records, professional books and knowledge-base resources in the medical field, and Q&A material on medical questions. In Chinese-language performance on authoritative medical evaluations such as MCMLE, MedExam, and CMExam, which demand strong logical reasoning and professional knowledge, Baichuan 3 also exceeds GPT-4, making it the best-performing large model on Chinese medical tasks.
Baichuan 3 also achieves a breakthrough in "iterative reinforcement learning", which further improves its semantic understanding and generation abilities; it performs well on the format, rhyme, and expression of meaning in poetry writing.
The basic ability has been comprehensively improved
Surpasses GPT-4 on a number of authoritative Chinese-task evaluations
Baichuan 3 performed well in several English evaluations, reaching a level close to GPT-4. On a number of Chinese evaluation lists such as CMMLU, Gaokao, HumanEval, and MBPP, it surpassed GPT-4, showing its strength on Chinese tasks.
In addition, on alignment benchmarks such as MT-Bench and IFEval, Baichuan 3 surpassed GPT-3.5, Claude, and other large models, placing it at the leading level in the industry.
Unlike training models with tens of billions of parameters, training a model with more than 100 billion parameters places requirements on data quality, training stability, and training efficiency that are several orders of magnitude higher. To address these problems, Baichuan Intelligence proposed a number of innovative techniques and solutions for the training process, such as "dynamic data selection", "importance maintenance", and "asynchronous checkpoint storage", which effectively improved the capabilities of Baichuan 3.
In terms of high-quality data, traditional data screening relies on manually defined rules and screens data with methods such as filtering, quality scoring, and textbook-style selection. Baichuan Intelligence believes that data optimization and sampling is a dynamic process that should be optimized together with the training of the model itself, rather than relying solely on manually defined, a priori sampling and screening.
To improve data quality across the board, Baichuan Intelligence designed a dynamic training-data selection scheme based on causal sampling, which selects training data dynamically while the model is being trained and greatly improves data quality.
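The article does not describe how the causal-sampling scheme works, so the following is only a minimal sketch of the general idea of choosing data dynamically during training. It assumes a hypothetical rule in which each data source keeps a running average of its recent loss and is sampled in proportion to it; the class name, the source names, and the update rule are all illustrative assumptions, not Baichuan's method.

```python
import random


class DynamicSourceSampler:
    """Toy sampler whose per-source weights change as training progresses.

    Hypothetical illustration only: the update rule (an exponential moving
    average of the loss observed on each source) is an assumption, not the
    causal-sampling scheme mentioned in the article.
    """

    def __init__(self, sources, smoothing=0.9, seed=0):
        self.sources = list(sources)
        self.smoothing = smoothing
        # Start from a uniform belief: every source has the same running loss.
        self.running_loss = {name: 1.0 for name in self.sources}
        self.rng = random.Random(seed)

    def weights(self):
        """Return normalized sampling probabilities, one per source."""
        total = sum(self.running_loss.values())
        return {name: loss / total for name, loss in self.running_loss.items()}

    def sample(self):
        """Pick the source of the next batch according to the current weights."""
        probs = self.weights()
        return self.rng.choices(
            self.sources, weights=[probs[s] for s in self.sources], k=1
        )[0]

    def update(self, source, observed_loss):
        """Fold the loss just observed on `source` into its running average."""
        old = self.running_loss[source]
        self.running_loss[source] = (
            self.smoothing * old + (1 - self.smoothing) * observed_loss
        )


sampler = DynamicSourceSampler(["web", "books", "medical"])
for step in range(100):
    chosen = sampler.sample()
    # Stand-in for a real training step: pretend "medical" is still hard.
    fake_loss = 2.0 if chosen == "medical" else 0.5
    sampler.update(chosen, fake_loss)
```

In this toy run the "medical" source ends up with the largest sampling weight, because its simulated loss stays high. The point is only that the data mix is decided by feedback from training rather than fixed in advance.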
In terms of training stability, because a model with more than 100 billion parameters is so large, problems such as gradient explosion, loss divergence, and non-convergence often occur during training.
To address this, Baichuan Intelligence proposed "importance maintenance" (salience-consistency), a progressive initialization method that keeps the initial stage of training stable. It also optimized how the training process is monitored: in addition to indicators such as gradient and loss, it introduced the "effective rank" of parameters to detect training problems early, which greatly speeds up locating those problems and ensures that the final model converges well.
In addition, to train models with more than 100 billion parameters efficiently and stably on thousands of GPUs, Baichuan Intelligence optimized the model's training stability and the training framework together and adopted an "asynchronous checkpoint storage" mechanism. This allows checkpoints to be saved more frequently without performance loss and reduces the impact of machine failures on the training job. As a result, Baichuan 3 trained stably for more than one month, with a fault recovery time of no more than 10 minutes.
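The general idea behind asynchronous checkpoint storage can be sketched as follows: the training loop pays only for an in-memory snapshot, while serialization and disk writes run on a background thread. This is a minimal single-process illustration using the Python standard library. The class and function names are hypothetical, and a real large-scale system would snapshot sharded GPU tensors rather than a small dictionary.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Write checkpoints on a background thread so training is not blocked.

    Hypothetical sketch, not Baichuan's implementation.
    """

    def __init__(self, directory):
        self.directory = directory
        self._worker = None

    def save(self, step, state):
        # Wait for the previous write so at most one checkpoint is in flight.
        self.wait()
        # Snapshot now, so later updates to `state` cannot leak into the file.
        snapshot = copy.deepcopy(state)
        self._worker = threading.Thread(target=self._write, args=(step, snapshot))
        self._worker.start()

    def _write(self, step, snapshot):
        final_path = os.path.join(self.directory, f"step_{step}.pkl")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        # Rename only after the file is complete, so a crash mid-write
        # never leaves a truncated file under the final name.
        os.replace(tmp_path, final_path)

    def wait(self):
        """Block until the checkpoint currently being written is on disk."""
        if self._worker is not None:
            self._worker.join()
            self._worker = None


def load_checkpoint(directory, step):
    with open(os.path.join(directory, f"step_{step}.pkl"), "rb") as f:
        return pickle.load(f)


checkpoint_dir = tempfile.mkdtemp()
checkpointer = AsyncCheckpointer(checkpoint_dir)
state = {"step": 0, "weights": [0.0, 0.0]}
for step in range(1, 4):
    state["step"] = step
    state["weights"] = [w + 0.1 for w in state["weights"]]  # stand-in for a training step
    checkpointer.save(step, state)  # returns as soon as the snapshot is taken
checkpointer.wait()
restored = load_checkpoint(checkpoint_dir, 3)
```

Because saving no longer stalls the training loop for the duration of the disk write, checkpoints can be taken more often, which in turn shortens the amount of work lost when a machine fails.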
In terms of training efficiency, Baichuan Intelligence carried out a series of optimizations for parallel training of models with more than 100 billion parameters, such as highly optimized RoPE and SwiGLU operators. In data parallelism, parameter communication is overlapped with computation, and in sequence parallelism, activation communication is overlapped with computation, which effectively reduces the share of time spent on communication. In pipeline parallelism, a technique for offloading activations to the GPU is introduced; it solves the problem of uneven memory usage across pipeline stages, reduces the number of pipeline stages, and significantly lowers the bubble rate.
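To see why fewer pipeline stages lower the bubble rate, a commonly used idealized estimate for a simple synchronous (GPipe-style) schedule is helpful: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). The sketch below assumes that schedule and equal per-stage compute time; it is not a model of Baichuan's actual scheduler, and the function name is illustrative.

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idealized idle ("bubble") fraction of a synchronous pipeline schedule.

    Assumes every stage takes the same time per micro-batch, so one pass
    takes (m + p - 1) time slots, of which (p - 1) are idle on each stage.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# Same number of micro-batches, progressively fewer pipeline stages.
for stages in (16, 8, 4):
    fraction = pipeline_bubble_fraction(stages, num_microbatches=32)
    print(f"{stages} stages, 32 micro-batches: {fraction:.1%} idle")
```

Under these assumptions, halving the number of stages at a fixed number of micro-batches roughly halves the idle time, which is consistent with the article's claim that reducing the number of stages reduces the bubble rate.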
Through these technological innovations, the performance of the Baichuan 3 training framework is improved by more than 30% compared with mainstream frameworks in the industry.
The number of tokens in medical datasets exceeds 100 billion
Medical capabilities are approaching GPT-4
From disease diagnosis and treatment to patient care and drug research and development, large models can help doctors improve the efficiency and quality of diagnosis and treatment and help patients get better services and experiences; they can also help society reduce medical costs and risks and make medical resources more widely and equitably available. In addition, medical problems are highly specialized, the knowledge is updated quickly, accuracy requirements are high, and individual differences are large, so they fully exercise the various capabilities of a large model. For this reason, Baichuan Intelligence calls medicine "the crown jewel of large models".
Therefore, leading large model companies such as OpenAI and Google treat medicine as a key training direction for their models and an important benchmark for performance evaluation.
ChatGPT passed the United States Medical Licensing Examination (USMLE) as early as February 2023, showing strong ability in the medical field. Google attaches even more importance to medicine: it built the large medical model Med-PaLM on top of the PaLM model, and the iterated Med-PaLM 2 scored more than 80 points on the medical exam benchmark MedQA, reaching expert level.
In the medical field, the all-round capabilities of large models play a crucial role.
First, their multimodal learning capability can integrate text, images, sound, and other types of medical data to provide more comprehensive and accurate analysis and diagnosis.
Second, the deep reasoning ability of large models can help with complex medical decisions. In addition, stable performance and the ability to keep knowledge up to date ensure that medical advice is reliable and timely. At the same time, their language understanding and generation capabilities enable them to handle technical terms and complex sentence patterns.
Finally, the pattern recognition and learning capabilities of large models enable them to learn and identify important patterns and features in complex medical data.
Therefore, it is not easy for a large model to perform well in the medical field: doing so requires not only rich medical knowledge and appropriate prompts but also strong logical reasoning ability in the model itself.
To inject rich medical knowledge into Baichuan 3, Baichuan Intelligence built a medical dataset of more than 100 billion tokens for the pre-training stage, including medical research literature, real electronic medical records, professional books and knowledge-base resources in the medical field, and Q&A material on medical questions. The dataset covers medical knowledge from theory to practical operation and from basic theory to clinical application, ensuring the model's professionalism and depth of knowledge in the medical field.
To elicit this medical knowledge, Baichuan Intelligence systematically studied and optimized prompts for the inference stage. With accurate task descriptions and appropriate example selection, the model produces more accurate output and clearer logical reasoning steps. This not only improves Baichuan 3's performance on a number of medical exams but also gives users more accurate and detailed feedback in real medical Q&A scenarios.
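The article does not publish the prompts themselves. As a rough illustration of what "accurate task description plus appropriate example selection" can look like, the sketch below assembles a few-shot prompt by picking the solved examples most similar to the new question. The word-overlap similarity measure, the function names, the prompt layout, and the sample questions are all assumptions made for illustration.

```python
def word_overlap(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def build_prompt(task_description, example_pool, question, num_examples=2):
    """Build a few-shot prompt from the examples most similar to `question`."""
    # Rank candidate demonstrations by similarity to the new question.
    ranked = sorted(
        example_pool,
        key=lambda ex: word_overlap(ex["question"], question),
        reverse=True,
    )
    parts = [task_description.strip(), ""]
    for ex in ranked[:num_examples]:
        parts += [
            f"Question: {ex['question']}",
            f"Reasoning: {ex['reasoning']}",
            f"Answer: {ex['answer']}",
            "",
        ]
    # End with an open "Reasoning:" slot so the model writes its steps first.
    parts += [f"Question: {question}", "Reasoning:"]
    return "\n".join(parts)


# Hypothetical example pool (illustrative items, not from any real benchmark).
pool = [
    {
        "question": "Which vitamin deficiency causes scurvy?",
        "reasoning": "Scurvy results from impaired collagen synthesis, which depends on vitamin C.",
        "answer": "Vitamin C",
    },
    {
        "question": "Which vitamin deficiency causes rickets in children?",
        "reasoning": "Rickets reflects poor bone mineralization, which depends on vitamin D.",
        "answer": "Vitamin D",
    },
    {
        "question": "Which organ produces insulin?",
        "reasoning": "Insulin is secreted by the beta cells of the pancreatic islets.",
        "answer": "Pancreas",
    },
]

prompt = build_prompt(
    "Answer the medical exam question. Explain your reasoning step by step, then give the answer.",
    pool,
    "Which vitamin deficiency causes night blindness?",
)
```

Here the two vitamin questions are selected as demonstrations and the unrelated insulin question is left out. A production system would more likely use embedding-based retrieval than word overlap, but the structure (task description, selected worked examples, then the new question) is the same.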
In terms of logical reasoning, Baichuan 3 has surpassed GPT-4 on Chinese tasks in many authoritative evaluations such as mathematics and code, which demonstrates its strong basic logical reasoning ability.
With rich, high-quality professional medical knowledge that can be fully elicited through optimized prompts, combined with the reasoning ability of a model with more than 100 billion parameters, Baichuan 3's performance on medical tasks has improved significantly: its results on various Chinese and English medical tests have increased by 2 to 14 percentage points.
Baichuan 3 has performed well in a number of authoritative medical evaluations. Its results on Chinese medical tasks such as MCMLE, MedExam, and CMExam exceed GPT-4's, and its results on English medical tasks such as USMLE and MedMCQA are close to GPT-4's, making it the Chinese large model with the strongest medical capabilities.
Breakthrough in "iterative reinforcement learning" technology
The accuracy of creation has been greatly improved
Semantic understanding and text generation are the most basic capabilities of large models and the pillars of their other capabilities. To improve them, the industry has done a great deal of exploration and practice; RLHF (reinforcement learning from human feedback) and RLAIF (reinforcement learning from AI feedback), introduced by OpenAI, Google, and Anthropic, are the key technologies.
A model aligned with reinforcement learning can not only understand user instructions more accurately, especially instructions with multiple constraints or in multi-turn dialogue, but also further improve the quality of generated content. However, getting the full benefit of reinforcement learning in large models requires not only a stable and efficient reinforcement learning training framework and high-quality partial-order (preference) data, but also a balance between exploration and exploitation so that model capabilities keep climbing.
Baichuan Intelligence has studied these problems in depth and offers targeted solutions.
For the reinforcement learning training framework, Baichuan Intelligence developed a PPO training framework that fuses training and inference engines and schedules multiple models in parallel. It supports efficient training of models with more than 100 billion parameters, and its training efficiency is 400% higher than that of mainstream frameworks in the industry.
For partial-order data, Baichuan Intelligence combines RLHF and RLAIF to generate high-quality partial-order data, achieving a better balance between data quality and data cost.
On this basis, to address the fundamental challenge of exploration versus exploitation, Baichuan Intelligence implements "iterative reinforcement learning" (iterative RLHF & RLAIF) by upgrading the PPO exploration space and the reward model's evaluation space in step. Iterating versions with reinforcement learning draws out more of the base model's potential beyond what SFT achieves and greatly improves Baichuan 3's semantic understanding and generative writing capabilities.
Take Tang and Song poetry, the most challenging form of text creation, as an example. As a treasure of traditional Chinese culture, these poems not only follow strict constraints on format, tonal pattern, antithesis, and rhyme, but are also highly concise in content and far-reaching in meaning. With SFT fine-tuning alone, on the one hand, creating high-quality poetry data requires extremely costly expert effort; on the other hand, the model cannot adequately understand and follow constraints such as tonal pattern, antithesis, and prosody.
In addition, the traditional single-round RLHF paradigm also struggles with Tang and Song poetry: the responses PPO generates during training may fall outside the range the reward model can evaluate, causing the "exploration" process to get out of control.
Baichuan 3 combines RLHF & RLAIF with iterative reinforcement learning to bring the poetry-writing ability of large models to a new level. Usability is up to 500% higher than that of the current best model in the industry, far exceeding GPT-4.
Even for Song Ci, a difficult form with changeable formats, deep structure, and rich rhyme, the generated content is neatly paired and harmonious in rhyme. This can not only enrich the public's cultural literacy but also help traditional Chinese culture truly "come alive" in the era of large models.
As a large language model with a parameter scale of more than 100 billion, Baichuan 3 not only achieves a level close to GPT-4 in English, but also surpasses GPT-4 in the performance of a number of general Chinese tasks, which is a new milestone for Baichuan Intelligence.
Baichuan 3's comprehensive general capabilities and strong performance in the medical field will create a "super application" for Baichuan Intelligence, and provide strong support for the implementation of large model technology in many complex application scenarios.