Contributed by the multi-LLM agent team | QbitAI
Really..." Three stinkers, one Zhuge Liang"——
The collaboration of three agents based on the open source small model is comparable to the tool invocation effect of GPT-4!
Without further ado, let's look at two execution records from the system.
In the first task, the user says they are a music enthusiast who wants to explore different genres, and explicitly instructs the model to use the Deezer and Shazam APIs to search for some tracks and the corresponding artist information.
The agents, playing three different roles, then work together and complete the task in two steps.
To make things harder, the second task does not specify which tools to use: the model must find the most popular landscape tutorial video and retrieve its channel details.
In scenarios like this, the model often runs into changes in tool state: a tool may be taken offline, or the definition of the parameters it requires may change.
Here, the model first tries the "Video for YouTube Search" API at step 0 to fetch the details it needs, only to find that this API is broken and cannot be called.
The agent playing the planner role therefore changes course and tells the agent in the caller role to try a different API; with the new API it finds the details and completes the user's task.
Behind these demos is α-UMi, a multi-model collaborative agent framework built on open-source small models, jointly proposed by Sun Yat-sen University and Alibaba's Tongyi Lab.
α-UMi fine-tunes multiple open-source small models so that they operate collaboratively, achieving results on tool-calling datasets comparable to GPT-4's.
Overall, α-UMi's advantages over frameworks built on closed-source model APIs are as follows:

- In the α-UMi multi-model collaboration framework, three small models (a planner, a caller, and a summarizer) are responsible for path planning, tool calling, and summarizing the reply, respectively. This splits the workload a single small model would otherwise carry and allows a more flexible prompt design than a single-model agent.
- It surpasses single-model agent frameworks on multiple benchmarks such as ToolBench and the ToolAlpaca corpus, achieving performance comparable to GPT-4.
- It introduces a global-to-local progressive fine-tuning paradigm (GLPFT) that successfully trains the multi-model collaboration framework on open-source small models. Experiments show this two-stage paradigm is the best recipe explored so far for training multi-model collaborative agents, and it can be applied widely.

What does the α-UMi multi-model collaboration framework look like?
Tool-learning agents that use large models to call APIs, functions, and interpreters, such as OpenAI's Code Interpreter and AutoGPT, have attracted extensive attention in both industry and academia.
With the support of external tools, large models can independently complete more complex tasks such as web browsing, data analysis, and address navigation, which is why AI agents are regarded as an important route for putting large models into practice.
However, these mainstream projects are mostly built on the closed-source ChatGPT and GPT-4, models strong enough at reasoning, step planning, generating call requests, and composing summary replies.
In contrast, limited by capacity and pre-training strength, a single open-source small model cannot match large models on reasoning and planning, tool calling, and reply generation.
To solve this problem, the researchers propose α-UMi.
α-UMi consists of three small models: a planner, a caller, and a summarizer.
The planner is the system's core brain: at each step of the agent's execution, it decides whether to activate the caller or the summarizer and provides the corresponding rationale as guidance.
The caller and the summarizer then carry out the rest of the step under the planner's guidance: the caller generates the instructions used to interact with tools, while the summarizer composes the final reply that is fed back to the user.
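To make the division of labor concrete, here is a minimal sketch of how such a loop could be wired up. Everything in it (function names, the string-based history, the routing convention) is an illustrative assumption, not the authors' actual implementation.

```python
# Minimal sketch of a planner/caller/summarizer loop (illustrative only).

def run_agent(task, tools, planner, caller, summarizer, max_steps=8):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # The planner reads the trajectory so far, emits a rationale,
        # and routes the step to either the caller or the summarizer.
        rationale, next_role = planner.plan("\n".join(history))
        history.append(f"Rationale: {rationale}")

        if next_role == "summarizer":
            # The summarizer turns the collected observations into the
            # final natural-language reply for the user.
            return summarizer.summarize("\n".join(history))

        # Otherwise the caller turns the rationale into a concrete tool
        # call, which is executed against the environment.
        action = caller.act("\n".join(history), tools)
        observation = tools.execute(action)  # may fail, e.g. a broken API
        history.append(f"Action: {action}\nObservation: {observation}")

    return "Gave up: step budget exhausted."
```

Because each role conditions only on the context it needs, the prompt for each small model stays short and specialized; this is where the flexible prompt design mentioned above comes from.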
All three models are obtained by fine-tuning open-source small models on different types of data.
In addition, the researchers propose a global-to-local progressive fine-tuning paradigm called GLPFT.
Implementing a multi-model collaboration framework on top of open-source small models is not straightforward, because two diametrically opposed factors are at play:
First, the three subtasks of generating the rationale, the action, and the final answer promote one another during training and strengthen the model's global understanding of the agent task. This is why most existing work trains a single model to generate rationale, action, and final answer all at once.
Second, limits on model capacity, data ratios, and so on make it hard for a single model to reach peak performance on all three tasks simultaneously.
As the figure below shows, the amount of training data a single-model agent needs to peak differs from metric to metric, and it is hard to find one data volume and one model checkpoint that peak on all metrics at once.
This problem can be solved through multi-model collaboration.
Weighing these two points, the researchers propose the "global-to-local" multi-stage training method: first exploit the mutual promotion among rationale, action, and final answer to obtain a well-initialized single model, and then fine-tune separate models to push performance higher on each subtask.
The diagram above illustrates this multi-stage fine-tuning. In the first stage, a pre-trained LLM is fine-tuned on the complete tool-call agent task, yielding a single-model agent that serves as the initialization.
In the second stage, the researchers reconstruct the training data of the tool-call agent task, decomposing it into three subtasks: generating the rationale, generating the tool-interaction action, and generating the final reply. They then make three copies of the single-LLM agent backbone from stage one and fine-tune each copy further on a different subtask.
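A minimal sketch of that second-stage data reconstruction is below. The trajectory schema (`instruction`, `rationale`, `action`, `observation`, `final_answer`) is an assumption made for illustration; the actual training format will differ.

```python
# Sketch of GLPFT stage two: split full agent trajectories into three
# subtask datasets, one per role. Field names are illustrative assumptions.

def build_subtask_datasets(trajectories):
    planner_data, caller_data, summarizer_data = [], [], []
    for traj in trajectories:
        context = traj["instruction"]
        for step in traj["steps"]:
            # Planner target: the rationale for this step.
            planner_data.append({"input": context, "target": step["rationale"]})
            context += step["rationale"]
            if "action" in step:
                # Caller target: the concrete tool-call request.
                caller_data.append({"input": context, "target": step["action"]})
                context += step["action"] + step["observation"]
        # Summarizer target: the final reply, conditioned on the whole run.
        summarizer_data.append({"input": context, "target": traj["final_answer"]})
    return planner_data, caller_data, summarizer_data
```

Stage one trains a single backbone on full trajectories; stage two clones that backbone three times and continues fine-tuning each clone on one of the datasets above.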
Performance comparable to GPT-4
Static evaluation.
In the static evaluation, the outputs of all baselines are compared against annotated reference outputs. As the results show:
α-UMi significantly outperforms ChatGPT as well as ToolLLaMA, the open-source tool-calling model, and performs on par with GPT-4. Notably, ToolLLaMA requires a sequence length of 8192 to produce satisfactory results, whereas α-UMi needs an input length of only 4096, thanks to the more flexible prompt design that the multi-model framework affords.
In the comparison of fine-tuning schemes for the multi-model collaboration framework, neither directly fine-tuning three separate models nor multi-task fine-tuning a single model makes the framework effective; only the multi-stage GLPFT achieves the best performance, which points the way for future work on multi-model collaborative training.
Real API call evaluation.
The authors also introduce an evaluation of real API calls on the ToolBench dataset, with the following results:
In the real-API-call experiments, α-UMi again beats ChatGPT and ToolLLaMA, and achieves a success rate comparable to GPT-4's.
Model overhead.
An obvious question is whether multi-model collaboration introduces extra cost. The authors also compare the overhead of the multi-model collaboration framework against single-model agents across the training, inference, and storage phases.
Overall, the multi-model collaboration framework does introduce higher overhead in training and in parameter storage, but its inference speed is comparable to that of a single-model framework.
And since the multi-model collaboration agent with a 7B backbone far outperforms a 13B single-model agent, the total overhead can actually be lower: one can pick a multi-model collaborative agent framework built on smaller models, cutting overhead while still surpassing a single-model agent built on a larger model.
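As a back-of-envelope check on those numbers (raw parameter counts only; real memory use also depends on precision, optimizer state, and caches, so treat this as a rough illustration):

```python
# Rough storage vs. latency comparison, illustrative only.
single_13b = 13e9           # one 13B single-model agent
multi_7b = 3 * 7e9          # planner + caller + summarizer at 7B each

print(multi_7b / single_13b)  # ~1.6x the parameters to store
# At any given step only one of the three 7B models is active, so
# per-step latency tracks a single 7B model rather than the 21B total.
```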
Finally, the researchers conclude that multi-agent collaboration is the future trend of agent development, and that teaching open-source small models to collaborate well is a key step toward practical deployment. This paper opens a new path for multi-agent collaboration built on open-source small models, surpassing single-model agent baselines on multiple tool-call benchmarks and matching GPT-4's tool-calling results.
Looking ahead, the team plans to strengthen the planner's generalization so it can serve a wider range of agent task scenarios, to deploy the caller model privately on-device so it can focus on local tool-invocation tasks, and to explore a "large-small" collaboration framework that pairs cloud-hosted large models with local small models.