The large language model race is really heating up!
In November, OpenAI first launched GPTs, upending the business of countless GPT wrapper apps, and then its board drama drew yet another wave of traffic.
Google, under pressure, rushed out its flagship multimodal model Gemini before the end of the year, even at the cost of a staged demo.
Just today, Microsoft officially released Phi-2, which it previewed at the Ignite conference in November.
At only 2.7B parameters, the small language model (SLM) Phi-2 beats almost every model below 13B, including Google's newly released Gemini Nano 2.
Through innovations in model scaling and training-data curation, Phi-2 demonstrates excellent reasoning and language-understanding capabilities; on complex benchmarks it matches or even slightly outperforms models up to 25 times its size.
It delivers strong performance in a very compact size.
This makes Phi-2 an ideal platform for researchers and model developers to explore interpretability, safety improvements, and fine-tuning for other tasks.
Phi-2 is now accessible through Azure AI Studio.
However, it is worth noting that, unlike many open-source models released under the Apache 2.0 license, which permits commercial use, Phi-2 may only be used for research purposes and is not licensed for commercial use.
Microsoft's strongest small model is here!
Large language models have now grown to hundreds of billions of parameters, and their sheer scale has brought powerful performance that has changed the landscape of natural language processing. However, can small language models achieve similar capabilities with appropriate training methods, such as data selection?
Microsoft's Phi-2 provides the answer.
Phi-2 breaks the traditional scaling law of language models: its test scores can go toe-to-toe with models 25 times its size.
Microsoft attributes Phi-2's success to two key points:
Point 1: training data quality plays a critical role in model performance.
This is already a consensus among large-model developers, and Microsoft's researchers took it a step further by using textbook-quality data.
When Phi-1 was released, the team put forward the idea that "textbooks are all you need."
In developing Phi-2, the team took this principle to the extreme.
Phi-2's training data consists of synthetic datasets specifically designed to teach the model common-sense reasoning and general knowledge (science, daily activities, theory of mind, etc.).
In addition, the team filtered carefully selected web data by educational value and content quality to further expand the training corpus.
Point 2: use innovative techniques to scale up the model.
Starting from the 1.3B-parameter Phi-1.5 as a base, the team embedded its knowledge into the 2.7B-parameter Phi-2. This large-scale knowledge transfer not only accelerates training convergence but also significantly improves Phi-2's benchmark scores.
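Microsoft has not published the exact transfer procedure, but one common way to seed a larger model from a smaller one is to copy the smaller model's weight matrices into the corresponding slices of the larger model's, leaving the rest randomly initialized. A minimal sketch of that idea (illustrative shapes only, not Phi-2's real dimensions):

```python
import numpy as np

def grow_weights(small_w: np.ndarray, new_shape: tuple) -> np.ndarray:
    """Initialize a larger weight matrix by copying the smaller model's
    weights into the top-left slice; remaining entries get a small
    random init. Illustrative only: not Microsoft's actual procedure."""
    big_w = np.random.normal(0.0, 0.02, size=new_shape)
    slices = tuple(slice(0, min(s, n)) for s, n in zip(small_w.shape, new_shape))
    big_w[slices] = small_w[slices]
    return big_w

# Example: grow a 4x4 projection into a 6x6 one.
small = np.arange(16, dtype=np.float64).reshape(4, 4)
big = grow_weights(small, (6, 6))
assert np.allclose(big[:4, :4], small)  # transferred weights preserved
```

The copied slice gives the larger model a head start, which is one plausible reason such a transfer would speed up convergence.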
The diagram above compares Phi-2 and Phi-1.5 on benchmark tests (BBH and MMLU use 3-shot CoT (chain of thought) and 5-shot, respectively).
We can see that with the support of these innovative techniques, Phi-2's performance improves significantly.
96 A100s, 14 days of training
Phi-2 is a Transformer-based model trained on 1.4T tokens (a mix of synthetic and web datasets covering NLP and coding). Training used 96 A100 GPUs and took 14 days.
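As a rough sanity check on those numbers, a back-of-envelope calculation (assuming the 1.4T-token, 96-GPU, 14-day figures stated above) gives the implied per-GPU token throughput:

```python
tokens = 1.4e12          # total training tokens
gpus = 96                # A100 GPUs
seconds = 14 * 86400     # 14 days in seconds

per_gpu_tokens_per_sec = tokens / (gpus * seconds)
print(round(per_gpu_tokens_per_sec))  # 12056, i.e. roughly 12k tokens/s per GPU
```

That order of magnitude is plausible for a ~2.7B-parameter model on A100-class hardware.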
Phi-2 is a base model that has not been aligned via reinforcement learning from human feedback (RLHF) and has not been instruction fine-tuned.
Nonetheless, compared with existing aligned open-source models, Phi-2 performs better on toxicity and bias, thanks to its tailor-made data curation.
The graph above shows safety scores computed across 13 demographic groups in ToxiGen.
A subset of 6,541 sentences was selected and scored on a scale of 0 to 1 based on perplexity and sentence toxicity; the higher the score, the less likely the model is to produce toxic sentences.
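The exact scoring formula is not given here, but the aggregation step is straightforward: average the per-sentence scores within each demographic group. A minimal sketch (the group names and per-sentence numbers below are made up for illustration):

```python
from statistics import mean

def safety_score(sentence_scores):
    """Average per-sentence scores (each in [0, 1], higher = less toxic)
    into one safety score for a demographic group. In practice the
    per-sentence scores would come from a toxicity classifier and
    perplexity; here they are just example numbers."""
    return mean(sentence_scores)

# Hypothetical per-sentence scores for two demographic groups:
groups = {
    "group_a": [0.9, 0.8, 0.95],
    "group_b": [0.7, 0.85, 0.8],
}
scores = {g: round(safety_score(s), 3) for g, s in groups.items()}
print(scores)  # {'group_a': 0.883, 'group_b': 0.783}
```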
Below, the team summarizes Phi-2's performance against popular language models on academic benchmarks.
The benchmarks cover multiple categories: Big Bench Hard (BBH, 3-shot with CoT); common-sense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA); language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ); math (GSM8K (8-shot)); and coding (HumanEval, MBPP (3-shot)).
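The k-shot settings above simply mean prepending k worked examples to each test question, optionally with a chain-of-thought reasoning trace. A minimal sketch of how such a prompt is assembled (the example questions are made up):

```python
def build_few_shot_prompt(examples, question, cot=False):
    """Assemble a k-shot prompt. Each example is a (question, reasoning,
    answer) triple; with cot=True the reasoning chain is included."""
    parts = []
    for q, reasoning, a in examples:
        if cot:
            parts.append(f"Q: {q}\nA: {reasoning} The answer is {a}.")
        else:
            parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")  # model completes from here
    return "\n\n".join(parts)

# Hypothetical 2-shot CoT prompt:
shots = [
    ("2 + 2?", "Adding 2 and 2 gives 4.", "4"),
    ("3 * 3?", "Multiplying 3 by 3 gives 9.", "9"),
]
prompt = build_few_shot_prompt(shots, "5 - 1?", cot=True)
print(prompt)
```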
With only 2.7B parameters, Phi-2 outperforms Mistral 7B and Llama-2 13B across various benchmarks.
Moreover, compared with the 25-times-larger Llama-2 70B, it performs even better on multi-step reasoning tasks (i.e., coding and math).
In addition, Phi-2 also outperforms the recently released Google Gemini Nano 2, despite being slightly smaller.
Considering that many benchmark test sets may have been contaminated by training data, the research team took care to rule out such contamination during the development of Phi-1.
The Microsoft research team believes the best way to judge a language model's performance is to test it on concrete, real-world use cases.
In this pragmatic spirit, Microsoft also evaluated Phi-2 on several proprietary Microsoft datasets and tasks, again comparing it with Mistral and Llama-2. The results likewise show that Phi-2's average performance beats Mistral-7B and the Llama-2 family (7B, 13B, and 70B).
Beyond these benchmarks, Microsoft couldn't resist taking a dig at Google's now much-criticized Gemini demo, which showed how Google's upcoming most powerful AI model, Gemini Ultra, solves a fairly complex physics problem and even corrects a student's mistake on it.
It turns out that Phi-2, despite having far fewer parameters than Gemini Ultra, can answer the question correctly and correct the student using the same prompts.
The figure above shows Phi-2's output on a simple physics problem, including an approximately correct square-root calculation.
Mirroring the Gemini test, Phi-2 was then given the student's incorrect answer to see whether it could identify the mistake.
We can see that although Phi-2 has not been fine-tuned for chat or instruction following, it still identifies the problem.
It should be noted, however, that Google's demo used an image of the student's handwritten work as input, while the Phi-2 test used plain text.
A prompt-engineering twist: GPT-4 strikes back at Gemini Ultra
Microsoft has also released a study on prompt engineering, Medprompt. Using innovative LLM prompt-engineering techniques, it achieved performance gains in the medical domain that previously required specialized training or fine-tuning.
Building on this work, Microsoft found that the prompting strategy generalizes beyond medicine. Ultimately, by steering GPT-4 with a modified version of Medprompt, Microsoft achieved SOTA results on MMLU.
Just a little better than the score Google reported when Gemini was released.
Microsoft used this seemingly incidental result to snipe at Google's claim, made at Gemini's launch, that Gemini's CoT@32 beat GPT-4's 5-shot score.
No one declares the rivalry outright, but the flexing is unmistakable, like two top students in a class one-upping each other.
Netizens discussed it heatedly. Earlier, a Microsoft executive had posted MT-Bench results for several models:
We can see that the 2.7B Phi-2 still performs very well.
As for Phi-2's performance, netizens were unstinting in their praise:
"Wow, Phi-2 sounds like a game changer! It's powerful enough to rival large language models, yet small enough to run on a laptop or mobile device, which is fantastic. This opens up a whole new world of natural language processing on resource-limited devices."
Some netizens couldn't wait to try it:
"Has anyone figured out how to run Microsoft's new Phi-2 on a Mac?"
Of course, sharper-tongued netizens dragged OpenAI into it:
"If you don't feed the model garbage in the first place, it seems you don't have to worry about alignment. @openai"
There are also netizens who are hopeful about the prospects of small language models:
"Really hoping Phi-3 will outperform GPT-3.5 on all tasks."