It runs on a phone! Microsoft's small model beats Llama 2 after 14 days of training on 96 A100 GPUs

Zhidongxi

Author | Cheng Qian

Editor | Li Shuiqing

Zhidongxi, December 13: yesterday evening, Microsoft made a big move with a small model.

Microsoft has released Phi-2, a small language model with 2.7 billion parameters. According to its researchers' tests, Phi-2 demonstrates state-of-the-art performance among models with fewer than 13 billion parameters.

In terms of performance, Phi-2 excels on Big Bench Hard (BBH), common sense reasoning, language understanding, math, and coding benchmarks. Its average scores surpass those of the 7-billion-parameter Mistral and the 7- and 13-billion-parameter Llama 2, and it surpasses Google's Gemini Nano 2 on some benchmarks.

Another big advantage of Phi-2 is that its parameter count is small enough for it to run on laptops, mobile phones, and other mobile devices.
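As a rough back-of-the-envelope check on why parameter count matters for on-device use, the sketch below estimates the memory needed just to hold a model's weights at a few numeric precisions. The byte widths and the helper name `weight_memory_gb` are illustrative assumptions, not figures from Microsoft, and real memory use is higher because of activations, caches, and runtime overhead.

```python
# Rough estimate of weight-only memory for a model.
# Assumption: every parameter is stored at the same precision, and
# activations, caches, and runtime overhead are ignored.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # Parameter counts mentioned in the article.
    models = {"2.7B": 2.7e9, "7B": 7e9, "13B": 13e9}
    for name, n in models.items():
        sizes = ", ".join(
            f"{p}: {weight_memory_gb(n, p):.2f} GB" for p in BYTES_PER_PARAM
        )
        print(f"{name} -> {sizes}")
```

Under these assumptions, 2.7 billion parameters take about 5.4 GB at 16-bit precision and about 1.35 GB at 4-bit precision, while a 13-billion-parameter model needs about 26 GB at 16-bit precision.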

Over the past few months, Microsoft Research's Machine Learning Foundations team has released the Phi series of small language models (SLMs).

The first model in the series is the 1.3-billion-parameter Phi-1. According to the official blog, Phi-1 achieves the best Python coding performance among SLMs, especially on the HumanEval and MBPP benchmarks. The second model is Phi-1.5, also with 1.3 billion parameters, which focuses on common sense reasoning and language understanding.

Now Microsoft has released Phi-2, which is available in the Azure AI Studio model catalog to help researchers explore mechanistic interpretability, safety improvements, or fine-tuning experiments on a variety of tasks.

Some large models have hundreds of billions of parameters, which gives rise to many emergent capabilities. Can the same capabilities be achieved at a smaller scale by changing the training strategy? Microsoft's family of small language models (SLMs) may be the answer to this question.

Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes over a mix of synthetic and web datasets for NLP and coding.
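To make the next-word objective concrete, here is a minimal sketch of the standard cross-entropy loss used for this kind of training. It is a toy in pure Python with a three-token vocabulary, not Microsoft's training code; the function names and the example inputs are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] holds the model's scores over the vocabulary
    after it has read token_ids[: t + 1]. The last position has no target
    and is ignored.
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    tokens = [0, 2, 1]  # toy sequence over a 3-token vocabulary
    clueless = [[0.0, 0.0, 0.0]] * 3  # uniform guess at every position
    confident = [[0.0, 0.0, 10.0], [0.0, 10.0, 0.0], [0.0, 0.0, 0.0]]
    print("uniform model loss:  ", next_token_loss(clueless, tokens))
    print("confident model loss:", next_token_loss(confident, tokens))
```

A model that guesses uniformly gets a loss equal to the natural log of the vocabulary size, and the loss falls toward zero as the model puts more probability on the token that actually comes next.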

Phi-2 was trained on 96 A100 GPUs for 14 days. As a base model, it has not been aligned through reinforcement learning from human feedback (RLHF), nor has it been instruction fine-tuned.
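The training budget can be restated in GPU-hours with simple arithmetic. The sketch below assumes all 96 GPUs ran continuously for the full 14 days, which the source does not confirm; the helper name `gpu_hours` is made up for this example.

```python
def gpu_hours(num_gpus: int, days: int) -> int:
    """Total GPU-hours, assuming every GPU runs 24 hours a day."""
    return num_gpus * days * 24


if __name__ == "__main__":
    total = gpu_hours(96, 14)
    print(f"{total} GPU-hours")
```

Under that assumption, the run comes to 32,256 A100 GPU-hours.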

Nonetheless, the researchers observed that Phi-2 is no worse at avoiding offensive, harmful, and biased content than Llama 2-7B, an existing open-source model that has gone through alignment.

The researchers computed safety scores across 13 demographic groups from ToxiGen. They selected a subset of 6,541 sentences and scored them between 0 and 1 based on scaled perplexity and sentence toxicity. A higher score indicates that the model is less likely to produce offensive, harmful sentences.
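The score above relies on perplexity, a standard measure of how surprising a sentence is to a language model. The sketch below shows only the textbook perplexity calculation from per-token log-probabilities. How Microsoft scales perplexity and combines it with toxicity is not described in the source, so that step is deliberately left out, and the example log-probabilities are invented.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


if __name__ == "__main__":
    # Hypothetical per-token log-probabilities for two sentences.
    likely_sentence = [math.log(0.5)] * 4     # model finds each token likely
    unlikely_sentence = [math.log(0.05)] * 4  # model finds each token unlikely
    print("likely:  ", perplexity(likely_sentence))
    print("unlikely:", perplexity(unlikely_sentence))
```

A lower perplexity means the model considers the sentence more probable, so a model that assigns high perplexity to toxic sentences relative to benign ones is less inclined to generate them.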

Comparison of Llama 2 and Phi-2 on generating offensive, harmful, and biased content (Source: Microsoft's official blog).

With Phi-2, Microsoft challenges the conventional scaling laws of language models, relying on two key insights:

The first is that the quality of the training data is critical to model performance. Microsoft's training data includes specially created synthetic datasets that teach the model common sense reasoning and general knowledge in science, psychology, and other fields.

The researchers also selected web data to further enrich the training corpus, filtering it based on the value and quality of the content.

The second is scaled knowledge transfer. Starting from the 1.3-billion-parameter Phi-1.5, Microsoft researchers embedded the knowledge of Phi-1.5 into the 2.7-billion-parameter Phi-2. This approach not only accelerates training convergence but also improves Phi-2's benchmark scores.

Comparison of Phi-2 and Phi-1.5 (Source: Microsoft's official blog).

Microsoft summarized how Phi-2 performs against mainstream language models on academic benchmarks.

The benchmarks cover Big Bench Hard (BBH), common sense reasoning (PIQA, WinoGrande, ARC Easy and Challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU, SQuADv2), math (GSM8K), and coding (HumanEval, MBPP).

With 2.7 billion parameters, Phi-2 surpasses Mistral (7 billion parameters) and Llama 2 (7 billion and 13 billion parameters) on BBH, common sense reasoning, language understanding, math, and coding.

Even compared with the 70-billion-parameter Llama 2, which is about 25 times larger, Phi-2 performs better on multi-step reasoning tasks such as coding and math.

Performance comparison of Llama 2, Mistral, and Phi-2 (Source: Microsoft's official blog).

Microsoft also compared Phi-2 with Google's recently released Gemini Nano 2, which has a published parameter count of 3.25 billion. Phi-2 outperforms Gemini Nano 2 on some benchmarks.

Performance comparison of Phi-2 and Gemini Nano 2 (Source: Microsoft's official blog).

Because data from some public benchmarks could leak into the training data, Microsoft conducted an extensive decontamination study on the first model, Phi-1, to rule out this possibility.

Reasoning that the best way to judge a language model is to test it on concrete use cases, the researchers evaluated Phi-2 on multiple Microsoft-internal proprietary datasets and tasks, again comparing it with Mistral and Llama 2. The result: on average, Phi-2 outperforms Mistral-7B, which in turn outperforms the Llama 2 models at the 7 billion, 13 billion, and 70 billion parameter scales.

In addition to benchmarking, the researchers tested some prompts commonly used in the community, and the behavior they observed was consistent with what the benchmark results would suggest.

Among these, the researchers tested a question that had been used to evaluate the ability of Google's Gemini Ultra model to solve physics problems.

As in the Gemini test, the researchers then gave Phi-2 a student's incorrect answer to see whether it could identify the error.

However, this is not exactly a like-for-like comparison with the Gemini Ultra output described in the Gemini report: there, the student's answer was supplied as an image of handwritten text, whereas the Phi-2 test used raw text.

Phi-2 has only 2.7 billion parameters, yet its performance holds up against larger models with 7 billion and 13 billion parameters. Microsoft's push into the small-model market also confirms the value of small models in the era of large models.

Close cooperation between Microsoft and OpenAI has left the GPT models unrivaled in the large-model market. Combined with Microsoft's smaller Phi series, this could help it further capture the long-tail market for open-source models. For the time being, however, the Phi series is licensed for research purposes only.

From a market perspective, more and more players have begun to explore deploying large models on mobile devices such as phones, and Microsoft's move may accelerate the adoption of model capabilities on the device side.
