Logical reasoning overturned! GPT-4 and Gemini exposed to a major flaw, with LLM performance seriously degraded

Mondo Education, updated on 2024-02-26

Key points:

1. The logical reasoning performance of large models is affected by the order of their premises; shuffling that order can cause accuracy to drop by more than 30%.

2. Changing the order in which premises are narrated has a significant impact on the reasoning performance of large models; both Gemini Pro and GPT-3.5-Turbo declined.

3. Reordering the premises in a logical reasoning task greatly reduces LLM performance, an effect that calls for further study.

Webmaster's House (Chinaz.com), February 26 news: Researchers from Google DeepMind and Stanford University recently found that the order in which premise information is presented has a decisive impact on the performance of large language models on logical reasoning tasks.

In logical reasoning and math problems, models perform best when the premises are arranged in their natural logical order, i.e., the order in which they are actually used to derive the answer. For large language models, changing the order of the premise narration can cause significant performance degradation, especially when distracting rules are added.

The researchers found that by shuffling the order of the problem statements in the GSM8K test set to construct an R-GSM test set, almost all major LLMs perform worse on the new set. Although humans also show a preference for premise order in logical reasoning, LLMs are far more susceptible to order effects, which may be related to the training objectives and data biases of autoregressive models.
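To make the construction concrete, here is a minimal Python sketch of how one might build an R-GSM-style reordered variant of a GSM8K word problem. The function name `reorder_premises` and the sentence-splitting heuristic are illustrative assumptions, not the paper's actual pipeline, which manually verifies that the reordered text still reads naturally and keeps the same answer.

```python
import random
import re

def reorder_premises(problem: str, seed: int = 0) -> str:
    """Build an R-GSM-style variant of a word problem by shuffling
    its premise sentences while keeping the final question in place.
    (Illustrative sketch only, not the paper's exact procedure.)"""
    # Split into sentences, keeping each one's end punctuation.
    sentences = [s.strip() for s in re.findall(r"[^.?!]+[.?!]", problem)]
    *premises, question = sentences          # assume the question comes last
    random.Random(seed).shuffle(premises)    # permute the premise order
    return " ".join(premises + [question])

problem = (
    "Alice has 3 apples. Bob gives Alice 2 more apples. "
    "Carol takes 1 apple from Alice. How many apples does Alice have now?"
)
print(reorder_premises(problem, seed=42))
```

The answer is unchanged by the permutation, so any accuracy drop on the reordered version isolates the model's sensitivity to presentation order rather than to problem difficulty.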

Changing the order of the premises can reduce a model's accuracy by more than 30%, and different orderings affect different models differently; for example, the GPT models perform relatively better under reverse ordering. The researchers also found that adding more distracting rules, or combining several premise orderings, makes the problem harder still and will require further research. In logical reasoning, premise order has a significant impact on the inference performance of large language models, and how to mitigate this sensitivity remains an open challenge.
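A simple way to observe this effect is to evaluate the same problems under forward, reverse, and shuffled premise orders and compare accuracies. The sketch below assumes a hypothetical `query_model` callable standing in for whatever LLM API is under test; it is not a real library call.

```python
import random

def accuracy_by_order(problems, query_model, order="forward"):
    """Measure accuracy when premises are presented in a given order.
    Each problem is a tuple (premises: list[str], question: str, answer: str).
    `query_model` is a hypothetical function mapping a prompt to an answer."""
    correct = 0
    for premises, question, answer in problems:
        if order == "reverse":
            premises = premises[::-1]
        elif order == "shuffled":
            premises = random.sample(premises, len(premises))
        prompt = " ".join(premises + [question])
        if query_model(prompt).strip() == answer:
            correct += 1
    return correct / len(problems)

# Comparing the three orderings side by side exposes the sensitivity:
# for order in ("forward", "reverse", "shuffled"):
#     print(order, accuracy_by_order(problems, query_model, order))
```

Because only the prompt's premise order varies between runs, any spread in the three accuracy numbers directly reflects the order effect described above.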
