Reported by Heart of the Machine.
Editors: Chen Ping, Du Wei
Joint research from MIT and Microsoft: no additional training is required to improve task performance while reducing the size of large language models.
In the era of large models, the Transformer has single-handedly propped up the entire research field. Since its release, Transformer-based LLMs have demonstrated superior performance on a wide variety of tasks. The underlying Transformer architecture has become the state of the art for natural language modeling and inference, and it has also shown strong promise in areas such as computer vision and reinforcement learning.
However, current Transformer architectures are very large and typically require significant computing resources for training and inference.
This is intentional, since a Transformer trained with more parameters or more data is clearly more capable than other models. Nonetheless, a growing body of work shows that Transformer-based models, like other neural networks, do not need all of their fitted parameters to retain what they have learned.
In general, large-scale over-parameterization appears to help when training a model, but the model can be heavily pruned before inference; studies have shown that neural networks can often shed more than 90% of their weights without any significant drop in performance. This phenomenon has prompted researchers to study pruning strategies that aid model inference.
In "The Truth Is In There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction", researchers from MIT and Microsoft have made a surprising finding that careful pruning on specific layers of a Transformer model can significantly improve the model's performance on certain tasks.
Paper address:
Project homepage:
The study calls this simple intervention LASER (LAyer-SElective Rank reduction). Using singular value decomposition, LASER selectively removes the higher-order components of a learned weight matrix at a particular layer of the Transformer model, which significantly improves LLM performance. The intervention is performed after the model is trained and requires no additional parameters or data.
In operation, the rank reduction is applied to specific weight matrices and layers of the model. The study also found that many such matrices can be reduced substantially, with no performance degradation observed until more than 90% of the components are removed.
The study further found that these reductions can significantly improve accuracy, and the finding does not appear to be limited to natural language: performance gains were also observed in reinforcement learning.
In addition, the study tries to infer what is stored in the higher-order components, such that removing them improves performance. The researchers found that after LASER the model answers questions correctly, whereas before the intervention the original model mainly responded with high-frequency words (such as "the", "of", etc.) that were not even of the same semantic type as the correct answer; in other words, without intervention these components cause the model to generate irrelevant high-frequency words.
With a certain amount of rank reduction, however, the model's answer flips to the correct one.
To understand this, the study also examined what the remaining components encode on their own, by approximating the weight matrix using only its higher-order singular vectors. It turned out that these components describe either a different answer of the same semantic category as the correct one, or generic high-frequency words.
These results suggest that when the noisy higher-order components are combined with the lower-order components, their conflicting responses average out to an answer that may be incorrect. Figure 1 provides a visual illustration of the Transformer architecture and the procedure followed by LASER. Here, the weight matrix of the multilayer perceptron (MLP) at a particular layer is replaced with its low-rank approximation.
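To make the operation concrete, below is a minimal PyTorch sketch (not the authors' code) of this low-rank replacement: the weight matrix is factored with SVD and rebuilt from only its top singular components, with `rho` playing the role of the rank-reduction parameter described in the next section.

```python
import torch

def low_rank_approx(weight: torch.Tensor, rho: float) -> torch.Tensor:
    """Rebuild `weight` from its top floor(rho * max_rank) singular components."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    k = max(1, int(rho * S.numel()))                  # number of components to keep
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

# Example: keep only 10% of the maximum rank of a hypothetical MLP weight matrix.
W = torch.randn(4096, 1024)
W_reduced = low_rank_approx(W, rho=0.10)
```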
LASER at a glance
The researchers describe the LASER intervention in detail. A single-step LASER intervention is defined by a triple (τ, ℓ, ρ) consisting of a parameter type τ, a layer number ℓ, and the rank reduction ρ. Together these values describe which matrix will be replaced by its low-rank approximation and how severe that approximation is. The parameter type τ classifies the kind of matrix to intervene on.
The researchers focus on the matrices in W = {W_q, W_k, W_v, W_o, U_in, U_out}, i.e., the matrices that make up the MLP and the attention layers. The layer number ℓ denotes the layer being intervened on (the first layer is indexed as 0). For example, Llama-2 has 32 layers, so ℓ ∈ {0, 1, ..., 31}.
Finally, ρ ∈ [0, 1) describes what fraction of the maximum rank should be kept when making the low-rank approximation. For example, suppose the selected matrix has maximum rank d; the researchers then replace it with a ⌊ρ·d⌋-rank approximation.
Figure 1 below shows an example of LASER, in which τ = U_in and ℓ = L denote updating the weight matrix of the first layer of the MLP in the Transformer block at layer L. The remaining parameter ρ controls k in the rank-k approximation.
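As an illustration of how such a triple could be applied in practice, here is a hedged sketch for a Hugging Face GPT-J model, reusing `low_rank_approx` from the earlier snippet. The mapping from τ to module paths (`mlp.fc_in`, `attn.q_proj`, etc.) follows the `transformers` GPT-J implementation and is an assumption of this sketch, not code from the paper.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed mapping from the parameter type tau to GPT-J submodules (not from the paper).
TAU_TO_MODULE = {
    "U_in":  "mlp.fc_in",     # first MLP matrix
    "U_out": "mlp.fc_out",    # second MLP matrix
    "W_q":   "attn.q_proj",
    "W_k":   "attn.k_proj",
    "W_v":   "attn.v_proj",
    "W_o":   "attn.out_proj",
}

@torch.no_grad()
def apply_laser(model, tau: str, layer: int, rho: float) -> None:
    """Replace the weight selected by (tau, layer) with its rank-reduced version."""
    block = model.transformer.h[layer]
    module = block.get_submodule(TAU_TO_MODULE[tau])
    module.weight.copy_(low_rank_approx(module.weight, rho))

# Hypothetical usage: reduce the first MLP matrix of layer 20 to 5% of its max rank.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")
apply_laser(model, tau="U_in", layer=20, rho=0.05)
```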
LASER restricts the flow of certain information through the network and, unexpectedly, can produce significant performance benefits. The interventions can also be composed easily, for example by applying a set of interventions in any order.
The LASER method is simply a straightforward search over interventions of this type, choosing the one that yields the greatest benefit. There are many other ways to combine these interventions, however, and the researchers leave that to future work.
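A naive version of that search might simply sweep over candidate triples and keep whichever one most improves a held-out metric. The sketch below assumes an `evaluate_accuracy(model)` callback supplied by the user; it is illustrative only, and for a 6B-parameter model one would restore the single modified matrix rather than deep-copy the whole model.

```python
import copy
import itertools

def search_laser(model, evaluate_accuracy, taus, layers, rhos):
    """Brute-force search over single-step interventions (tau, layer, rho)."""
    best_score, best_intervention = evaluate_accuracy(model), None   # baseline model
    for tau, layer, rho in itertools.product(taus, layers, rhos):
        candidate = copy.deepcopy(model)      # simple to write, expensive to run
        apply_laser(candidate, tau, layer, rho)
        score = evaluate_accuracy(candidate)
        if score > best_score:
            best_score, best_intervention = score, (tau, layer, rho)
    return best_score, best_intervention
```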
Experimental results
In the experiments, the researchers used a GPT-J model pre-trained on the Pile dataset, with 27 layers and 6 billion parameters. The model's behavior is then evaluated on the CounterFact dataset, which contains samples of (subject, relation, answer) triples and provides three paraphrased prompts for each question.
The first analysis concerns the GPT-J model on the CounterFact dataset. Figure 2 below illustrates the effect on the dataset's classification loss of applying different amounts of rank reduction to each matrix in the Transformer architecture. Each Transformer layer contains a small two-layer MLP, whose input and output matrices are shown separately. Different colors indicate different percentages of removed components.
Regarding improved accuracy and robustness to paraphrasing, as shown in Figure 2 above and Table 1 below, the researchers found that the factual accuracy of the GPT-J model on the CounterFact dataset increased from 13.1% to 24.0% when rank reduction was applied to a single layer. It is important to note that these improvements come from rank reduction alone and involve no further training or fine-tuning of the model.
Which facts in the dataset are recovered by the rank reduction? The researchers found that the facts recovered through rank reduction are most likely those that appear only rarely in the data, as shown in Figure 3 below.
What do the higher-order components store? The researchers approximated the final weight matrix using its higher-order components (unlike LASER, which approximates it with the lower-order components), as shown in Figure 5(a) below. When approximating the matrix with different numbers of higher-order components, they measured the mean cosine similarity between the true answer and the predicted answer, as shown in Figure 5(b) below.
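The complementary approximation used in this analysis can be sketched by flipping the earlier snippet: keep only the tail (higher-order) singular components that LASER would normally discard. As before, this is an assumption-laden illustration, not the authors' code.

```python
import torch

def high_order_approx(weight: torch.Tensor, rho: float) -> torch.Tensor:
    """Rebuild `weight` from only the higher-order components that LASER would remove."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    k = max(1, int(rho * S.numel()))          # components LASER would have kept
    return U[:, k:] @ torch.diag(S[k:]) @ Vh[k:, :]
```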
Finally, the researchers evaluated the generality of their findings on multiple language-understanding tasks across three different LLMs. For each task, they measured model performance with three metrics: generation accuracy, classification accuracy, and loss. As shown in Table 1 above, even a large amount of rank reduction does not decrease the model's accuracy, and it can improve performance.