Reported by the Heart of the Machine.
Editors: Du Wei, Chen Ping
Fine-tuning large models currently relies primarily on human-generated data, and Google DeepMind has explored a more efficient way to reduce this dependence. Large language models (LLMs) are changing the landscape of deep learning, demonstrating impressive capabilities in generating human-quality text and solving a variety of language tasks. While the industry has further improved performance on specific tasks through supervised fine-tuning on human-collected data, obtaining high-quality human data is a significant bottleneck. This is especially evident for tasks that involve solving complex problems, which require significant resources and expertise.
How can this be solved? Synthetic data generated by models is a potential alternative that can be scalable and cost-effective, as long as the quality of the data is maintained.
While LLMs are able to evaluate the data they generate themselves, in this work Google DeepMind explores a simpler setup that uses an external scalar feedback signal as a quality metric for each generated sample.
To study training on model-generated data, the researchers consider a simple but powerful self-training method for language models that requires only two capabilities: generating samples from the model, and evaluating those samples with a scoring mechanism.
For clarity and consistency, the researchers adopt a reinforced self-training method, ReST^EM, and show that it can be viewed as applying expectation-maximization (EM) to reinforcement learning. Specifically, ReST^EM alternates between an expectation step and a maximization step:
Generate (E-step): The language model generates multiple output samples for each input context, and these samples are then filtered with a binary reward to collect the training dataset.
Improve (M-step): The original language model is fine-tuned with supervision on the training dataset from the previous E-step, and the resulting model is used in the next E-step.
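The alternation between the two steps can be sketched in a few lines of Python. This is a minimal toy sketch, not the paper's implementation: the "model" is a hypothetical lookup table of output probabilities, and `e_step`, `m_step`, `self_train`, and the small arithmetic task are illustrative names and data that do not come from the original work. The M-step here simply re-estimates probabilities from the kept samples, as a stand-in for supervised fine-tuning.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

Model = Dict[str, Dict[str, float]]  # toy "policy": input -> {output: probability}
Example = Tuple[str, str]            # (input, output)


def sample_output(model: Model, x: str, rng: random.Random) -> str:
    """Draw one output for input x from the toy policy."""
    outputs = list(model[x])
    weights = [model[x][y] for y in outputs]
    return rng.choices(outputs, weights=weights, k=1)[0]


def e_step(model: Model, inputs: List[str], reward: Callable[[str, str], int],
           samples_per_input: int, rng: random.Random) -> List[Example]:
    """Generate: sample several outputs per input and keep those with reward 1."""
    kept: List[Example] = []
    for x in inputs:
        for _ in range(samples_per_input):
            y = sample_output(model, x, rng)
            if reward(x, y) == 1:
                kept.append((x, y))
    return kept


def m_step(base_model: Model, dataset: List[Example]) -> Model:
    """Improve: toy stand-in for fine-tuning the ORIGINAL (base) model.

    Inputs with at least one kept sample get the empirical distribution of
    their kept outputs; inputs with none keep the base distribution.
    """
    new_model = {x: dict(dist) for x, dist in base_model.items()}
    by_input: Dict[str, Counter] = {}
    for x, y in dataset:
        by_input.setdefault(x, Counter())[y] += 1
    for x, counts in by_input.items():
        total = sum(counts.values())
        new_model[x] = {y: c / total for y, c in counts.items()}
    return new_model


def self_train(base_model: Model, inputs: List[str],
               reward: Callable[[str, str], int],
               iterations: int, samples_per_input: int, seed: int = 0) -> Model:
    """Alternate E-step and M-step; each M-step restarts from the base model."""
    rng = random.Random(seed)
    model = base_model
    for _ in range(iterations):
        dataset = e_step(model, inputs, reward, samples_per_input, rng)
        model = m_step(base_model, dataset)
    return model


# Hypothetical toy task: the reward is 1 only for the correct answer.
ANSWERS = {"2+2": "4", "3*3": "9"}
base = {"2+2": {"4": 0.3, "5": 0.7}, "3*3": {"9": 0.2, "6": 0.8}}
reward = lambda x, y: 1 if ANSWERS[x] == y else 0
tuned = self_train(base, list(base), reward, iterations=2, samples_per_input=32)
```

Because only reward-1 samples are kept, any input for which a correct output was sampled ends up with all of its probability mass on that output in this toy setting; a real language model would of course be updated by gradient-based fine-tuning instead.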
The researchers note that ReST and its variants have been successful in improving language models across a variety of domains, including machine translation, semantic parsing, preference alignment, and elementary reasoning.
However, previous work has mainly applied ReST^EM to relatively small models (up to 7 billion parameters) and has shown limited scalability to larger models. Therefore, this paper investigates the effectiveness and scalability of model-generated synthetic data compared with human-generated data in two challenging but less-studied domains: competition-level mathematical problem-solving (MATH) and code generation (APPS).
The empirical results show significant capability improvements in mathematical reasoning and code generation when ReST^EM is applied to PaLM 2 models of different scales. Models fine-tuned on model-generated synthetic data achieved greater performance gains than models trained on human-written data. Interestingly, performance degrades beyond a certain number of ReST^EM iterations, suggesting that overfitting can occur on a small number of training problems.
In addition, models fine-tuned with ReST^EM improve on pass@k metrics and on majority voting performance. These fine-tuned models also show performance gains on related but held-out benchmarks, including math problems (GSM8K and Hungarian HS finals), coding (HumanEval), and Big-Bench Hard tasks.
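For readers unfamiliar with majority voting, the idea is to sample several answers for the same problem and report the most frequent one. A minimal sketch follows; the function name `majority_vote` and the convention of using `None` for samples with no parseable answer are assumptions made for illustration, not details from the paper.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent final answer among several samples.

    Samples with no parseable answer (None) are ignored. Ties are broken
    in favour of the answer that appeared first.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Five hypothetical samples for one math problem.
voted = majority_vote(["12", "15", "12", None, "7"])  # "12"
```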
In conclusion, the results of this paper suggest that self-training with feedback is a promising way to reduce dependence on human data.
Expectation-maximization (EM) for reinforced self-training
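First, building on earlier work by Dayan and Hinton, the study describes an EM-based reinforcement learning framework for language models. Specifically, the authors first define a binary optimality variable $O$ such that $p(O = 1 \mid x, y) \propto f(r(x, y))$ for some non-decreasing function $f$, where $x$ is the input, $y$ is the output sequence, and $r(x, y)$ is the reward. Maximizing the log-likelihood of observing $O = 1$ (obtaining a high reward) then gives the following objective:
$$\log p(O = 1; x) := \log \sum_{y} p_\theta(y \mid x)\, p(O = 1 \mid x, y) \quad (1)$$
However, the sum over all sequences $y$ in the above equation is hard to compute. Therefore, instead of maximizing $\log p(O = 1; x)$ directly, the paper maximizes its ELBO $\mathcal{L}(p_\theta, q)$ with respect to the parameters $\theta$ and the variational distribution $q(y \mid x)$. Specifically:
$$\log p(O = 1; x) = \log \mathbb{E}_{q(y \mid x)}\left[\frac{p(O = 1 \mid x, y)\, p_\theta(y \mid x)}{q(y \mid x)}\right] \ge \mathbb{E}_{q(y \mid x)}\left[\log p(O = 1 \mid x, y)\right] - \mathrm{KL}\left[q(y \mid x) \,\|\, p_\theta(y \mid x)\right] =: \mathcal{L}(p_\theta, q) \quad (2)$$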
The EM algorithm for Equation (2) alternates between an E-step (expectation) and an M-step (maximization).
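To make the alternation concrete, the two steps can be written out as coordinate ascent on the ELBO. The sketch below assumes the ELBO has the standard form $\mathcal{L}(p_\theta, q) = \mathbb{E}_{q(y \mid x)}[\log p(O = 1 \mid x, y)] - \mathrm{KL}[q(y \mid x) \,\|\, p_\theta(y \mid x)]$; the iteration index $t$ is notation introduced here for illustration.

```latex
\begin{align*}
\text{E-step:}\quad
  q^{t+1} &= \arg\max_{q}\; \mathcal{L}(p_{\theta^{t}}, q),
  &\text{so that}\quad q^{t+1}(y \mid x) &\propto p(O = 1 \mid x, y)\, p_{\theta^{t}}(y \mid x), \\
\text{M-step:}\quad
  \theta^{t+1} &= \arg\max_{\theta}\; \mathcal{L}(p_{\theta}, q^{t+1})
  = \arg\min_{\theta}\; \mathrm{KL}\!\left[q^{t+1}(y \mid x) \,\|\, p_{\theta}(y \mid x)\right]
  &&= \arg\max_{\theta}\; \mathbb{E}_{q^{t+1}(y \mid x)}\!\left[\log p_{\theta}(y \mid x)\right].
\end{align*}
```

Under this reading, the E-step reweights samples from the current model by how likely they are to earn a high reward, and the M-step is a weighted maximum-likelihood fit to those samples.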
ReST^EM: Inspired by the EM framework, the article next discusses a simplified version of the ReST approach proposed by Gulcehre et al. For clarity, this article refers to this simplified approach as ReST^EM; it decouples data collection (E-step) from policy optimization (M-step) in the RL pipeline. As shown in Algorithm 1:
Generate (E-step): In this step, the study builds a dataset $D_i$ by sampling multiple output sequences from the current policy $p_\theta$, that is, $D_i = \{(x^j, y^j)\}_{j=1}^{N}$ with $x^j \sim D$ and $y^j \sim p_\theta(y \mid x^j)$. Here, the inputs are resampled from the original dataset $D$. The output sequences in $D_i$ are then scored with the binary reward function $r(x, y)$.
Improve (M-step): In the $i$-th iteration, the study uses the new dataset $D_i$ from the E-step to fine-tune the policy $p_\theta$. Unlike Gulcehre et al., the authors always fine-tune the base pretrained language model, which mitigates task-specific overfitting and minimizes drift from the base model. For fine-tuning, the study minimizes the reward-weighted negative log-likelihood loss $J(\theta) = \mathbb{E}_{(x, y) \sim D_i}\left[-r(x, y) \log p_\theta(y \mid x)\right]$.
Once the policy has been improved, a new dataset with higher-quality samples can be generated again.
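As an illustration of the fine-tuning objective, the reward-weighted negative log-likelihood can be computed as below. This is a sketch under stated assumptions: each sample is represented by the probabilities the model assigned to its tokens, the loss is averaged over the batch, and `sequence_log_prob` and `reward_weighted_nll` are hypothetical helper names rather than anything from the paper.

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """log p(y | x) for one output: the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def reward_weighted_nll(batch_token_probs: Sequence[Sequence[float]],
                        rewards: Sequence[float]) -> float:
    """Batch mean of -r(x, y) * log p(y | x).

    With binary rewards, samples with reward 0 contribute nothing, so this
    equals ordinary negative log-likelihood on the kept samples, up to the
    choice of normalising constant.
    """
    if len(batch_token_probs) != len(rewards):
        raise ValueError("need exactly one reward per sample")
    total = 0.0
    for token_probs, r in zip(batch_token_probs, rewards):
        total += -r * sequence_log_prob(token_probs)
    return total / len(rewards)


# Two hypothetical samples: the first is correct (reward 1), the second is not.
loss = reward_weighted_nll([[0.5, 0.5], [0.1, 0.9]], [1, 0])  # about 0.693 (ln 2)
```

In a real training setup the token probabilities would come from the language model and the loss would be minimized with a gradient-based optimizer; the sketch only shows how the reward enters the objective.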
Experiments and analysis
The main goal of the experiments in this paper is to answer the following questions:
How effective is ReST^EM compared with fine-tuning on human-generated data?
How many iterations are needed to reach the best performance? How quickly does ReST^EM lead to overfitting on the training set?
How does ReST^EM affect pass@k and majority voting performance?
If a model is fine-tuned on model-generated data for a specific task, do the gains transfer to other tasks? When the fine-tuned models are evaluated on a broad range of tasks, does performance degrade compared with the base model?
How much input data is needed to get most of the performance gains from ReST^EM? Is one iteration of ReST^EM sufficient?
The study ran its experiments with PaLM 2 models available through public APIs on Google Cloud, including PaLM 2-S (Bison), PaLM 2-S* (Codey), and PaLM 2-L (Unicorn). The training data come from the MATH and APPS datasets.
Figures 2 and 3 show the performance of ReST^EM trained on the MATH and APPS datasets, respectively. MATH benefits from multiple iterations of ReST^EM, both in performance on the MATH test set and in transfer to GSM8K. For APPS, on the other hand, most of the gains come from the first iteration, and running more iterations degrades performance on both APPS and HumanEval.
Gap between training and test performance. Figure 4 shows that while training set performance increases linearly with the number of ReST^EM iterations, test set performance does not. For MATH, the improvement in test performance was small after the first iteration, while for APPS, a performance regression was observed in the second iteration. The study speculates that this regression could be due to overfitting. Since the APPS dataset is about one-third the size of the MATH dataset, it is more susceptible to this issue.
Figure 5 shows the performance of the PaLM 2-L model on pass@k metrics. The results show that the model fine-tuned with ReST^EM is stronger for all values of k, with the performance gap usually largest at k=1.
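The article does not say how pass@k was computed. A common choice is the unbiased estimator introduced alongside the HumanEval benchmark: draw n samples per problem, count the c samples that pass, and estimate the probability that at least one of k randomly chosen samples passes. The sketch below assumes that estimator; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem as 1 - C(n - c, k) / C(n, k).

    n: number of samples generated, c: number of samples that pass,
    k: number of samples the metric is allowed to pick from.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: any k samples include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples, 3 of which pass.
p1 = pass_at_k(n=10, c=3, k=1)  # approximately 0.3
p5 = pass_at_k(n=10, c=3, k=5)  # approximately 0.917
```

The benchmark-level score is then the average of this quantity over all problems. At k=1 it reduces to the fraction of passing samples, which is why a gap at k=1 reflects a higher per-sample success rate.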