Recently, UCLA and other institutions released a new multimodal mathematical reasoning benchmark called MathVista, together with a detailed 112-page evaluation report focusing on the mathematical reasoning performance of large multimodal models.
Mathematical reasoning ability is seen as a key step toward achieving artificial general intelligence. Beyond traditional text-only scenarios, many mathematical studies and applications also involve rich graphical content, which places higher demands on a model's multimodal processing capabilities. Mathematical problems have a long history, dating back to Mesopotamia around 2000 BC, when clay tablets were already used to record problems involving trapezoids and triangles. Studies suggest that the Pythagorean theorem (known in China as the Gougu theorem) was mastered long before the Greek philosopher Pythagoras lived.
The Zhoubi Suanjing, a classic of ancient Chinese mathematics, not only contains an elegant proof of the Pythagorean theorem but also demonstrates the depth of ancient Chinese achievement in the field.
In the mathematics education we receive from an early age, vivid and engaging figures appear everywhere, underscoring the importance of visual elements in mathematical understanding.
In modern scientific research, mathematical analysis of large amounts of image data has become indispensable. This is especially true with the development of large language models (LLMs) and large multimodal models (LMMs), which demonstrate impressive problem-solving capabilities across a wide range of tasks and domains.
However, the mathematical reasoning ability of these models in visual scenarios has not been systematically studied. To explore this area, the University of California, Los Angeles (UCLA), the University of Washington (UW), and Microsoft jointly developed the new MathVista benchmark. The dataset combines the challenges of multiple mathematical and visual tasks and contains 6,141 questions drawn from 28 existing multimodal datasets and 3 newly annotated datasets: IQTest, FunctionQA, and PaperQA. The abundance of task types, reasoning skills, and image types in MathVista poses a great challenge to existing large models. The study provides a comprehensive evaluation of 12 of the latest large models. The experimental results show that GPT-4V, currently the most powerful model, reaches an accuracy of 49.9% on MathVista, outperforming the second-ranked Bard model by a significant 15.1%. Compared with human performance, however, GPT-4V still trails by 10.4%. This gap is mainly due to its shortcomings in understanding complex figures and performing rigorous reasoning.
In addition, the report highlights GPT-4V's self-verification ability, its self-consistency, and its potential for handling multi-turn dialogue. These analyses point to multiple directions for future research, especially in improving the model's comprehension and reasoning capabilities in complex visual contexts. Although multiple text-based mathematical reasoning datasets and multimodal question-answering datasets exist, there are still significant gaps in comprehensively evaluating the mathematical reasoning ability of large models, especially in multimodal settings. To this end, the research team proposed the MathVista dataset, which focuses on mathematical question-answering tasks in visual scenarios. MathVista contains 6,141 math problems from 28 existing datasets and 3 newly annotated datasets: IQTest, FunctionQA, and PaperQA.
The three newly annotated datasets each have their own focus: IQTest targets intelligence-test questions, FunctionQA focuses on reasoning over function graphs, and PaperQA focuses on in-depth understanding of figures from the academic literature, effectively filling gaps left by existing datasets.
MathVista covers two main question types: multiple-choice questions (55.2%) and numerical open-ended questions (44.8%). It also spans five task categories: figure question answering (FQA), geometry problem solving (GPS), math word problems (MWP), textbook question answering (TQA), and visual question answering (VQA), which represent the current frontier challenges in mathematical reasoning.
Mathematical reasoning abilities and image diversity in MathVista.
MathVista subdivides mathematical reasoning into seven ability areas: arithmetic, statistical, algebraic, geometric, numeric commonsense, scientific, and logical reasoning. These areas cover the core elements of mathematical reasoning and reflect MathVista's comprehensive coverage of mathematical cognition.
In terms of image diversity, MathVista also shows unique breadth and depth. The dataset contains more than a dozen image types, ranging from natural images to geometric diagrams, from abstract scenes to synthetic scenes, as well as a variety of figures, charts, and plots. This abundance of image types not only increases the complexity of the dataset but also poses a comprehensive challenge to large multimodal models processing different kinds of visual information.
The first comprehensive quantitative evaluation.
The report is the first to comprehensively evaluate the mathematical reasoning ability of current large models in visual scenarios. The MathVista dataset used in the report is divided into two subsets: testmini and test. The testmini subset contains 1,000 questions and is mainly used to quickly assess model performance. The test subset contains the remaining 5,141 questions and is intended for standardized evaluation; to avoid test-data contamination, the answer labels of this subset are not publicly released. The model evaluation process consists of three key phases: answer generation, answer extraction, and score calculation. In the answer generation phase, depending on the type of question, the research team used specific templates to guide the model to output an answer.
Considering that current large models usually produce long, dialogue-style text responses, the report designed a GPT-4-based answer extractor for its experiments. The extractor prompts GPT-4 with a few examples so that it pulls a short answer matching the question type out of the model's long response. This approach avoids both the high cost of manual evaluation and the inaccuracies that can arise from rule-based answer extraction. The extracted short answers are then used to compute the model's overall accuracy as well as its accuracy in each subcategory.
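To make this extract-then-score pipeline concrete, below is a minimal sketch of what a GPT-4-based answer extractor and the accuracy computation might look like; the few-shot prompt and function names are illustrative assumptions, not the exact templates used in the report.

```python
# Minimal sketch of the extract-then-score phases. The few-shot prompt and
# helper names are illustrative; the report's exact templates may differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT_EXTRACTION_PROMPT = """\
Extract the final short answer from the model response.

Question type: multiple-choice (A/B/C/D)
Response: Adding the two sides gives 14, so the answer is (B) 14.
Answer: B

Question type: free-form number
Response: The 2019 bar is about 3.5 units higher, so the difference is 3.5.
Answer: 3.5

Question type: {question_type}
Response: {response}
Answer:"""


def extract_answer(response: str, question_type: str) -> str:
    """Ask GPT-4 to pull a short answer out of a long free-form response."""
    prompt = FEW_SHOT_EXTRACTION_PROMPT.format(
        question_type=question_type, response=response
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Overall accuracy; per-subcategory scores apply the same metric to subsets."""
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)
```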
Large-model evaluation experiments on MathVista.
Twelve large models were evaluated on the testmini subset: three large language models (ChatGPT, GPT-4, and Claude-2) and nine large multimodal models, including LLaVA, LLaMA-Adapter, MiniGPT-4, Bard, and GPT-4V. For the large language models, two settings were designed: the first uses only the textual information of the problem, while the second additionally uses image captioning descriptions and OCR text as external augmentation. In addition, two random baselines and a human performance baseline were included.
The experimental results show that the overall performance of current large models on MathVista still leaves much room for improvement. The top-performing GPT-4V reaches 49.9% accuracy, but there is still a significant gap compared with the human performance of 60.3%. It is followed by the Bard model at 34.8%, while the best open-source model, LLaVA, reaches 26.1%. These numbers show that large models still have substantial room for improvement in mathematical reasoning within visual contexts. Interestingly, the performance of the augmented large language model GPT-4 (33.9%) is close to that of the multimodal model Bard (34.8%). This finding suggests that, with appropriate tool augmentation, large language models have great potential in the multimodal domain. The main models were also evaluated quantitatively across different mathematical reasoning abilities and image-type subcategories. The results show that GPT-4V approaches or even surpasses human performance in reasoning areas such as algebra, geometry, and science, as well as on image types such as function plots, geometry diagrams, scatter plots, and scientific figures.
In the evaluation on the test subset, the experiments compared the two best large language models (CoT GPT-4 and PoT GPT-4) and the best open-source large multimodal model (LLaVA), providing a comprehensive overview of model performance.
Bard's performance on MathVista.
The evaluation shows that Bard's overall performance ranks just behind GPT-4V. Through specific case studies, the report finds that Bard often produces so-called hallucinations, that is, it introduces information into its answers that is not present in the question text or the image. In addition, Bard is prone to errors when performing calculations.
For example, in the case below, Bard makes a calculation error while simplifying the fraction 8/10. This kind of problem highlights the model's limitations when dealing with mathematical problems.
GPT-4's performance on MathVista.
Although GPT-4 is essentially a language model, with tool augmentation (e.g., a combination of OCR text and image captioning descriptions) its performance on MathVista is comparable to that of the multimodal model Bard. Specifically, when OCR text and captioning descriptions are provided as auxiliary input, GPT-4 can successfully solve many multimodal math problems. This finding shows GPT-4's potential in multimodal problem solving. However, GPT-4 depends heavily on the accuracy of this augmented information. If the OCR text or captions contain errors or inaccuracies, GPT-4 can easily head in the wrong direction during reasoning and arrive at incorrect results. This highlights the importance of input quality when augmenting large language models with tools.
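As a rough sketch of this kind of tool augmentation (with invented example output, not the report's actual tools or prompt templates), the image is first converted into a caption and OCR text, and both are spliced into the text-only prompt given to GPT-4:

```python
# Sketch of tool-augmented prompting for a text-only LLM. The caption and OCR
# strings below are invented examples standing in for external tool output.

def build_augmented_prompt(question: str, caption: str, ocr_text: str) -> str:
    """Splice image-derived context into a text-only prompt."""
    return (
        f"Image caption: {caption}\n"
        f"OCR text detected in the image: {ocr_text}\n\n"
        f"Question: {question}\n"
        "Answer the question step by step."
    )

# If the captioning or OCR tool is wrong, the language model inherits the error.
prompt = build_augmented_prompt(
    question="At which x value does the function cross the x-axis?",
    caption="A line plot of a function crossing the x-axis near x = 2.",
    ocr_text="y = f(x)   0   2   4",
)
```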
A full analysis of GPT-4V on MathVista.
GPT-4V is currently the most advanced large multimodal model, and an in-depth analysis of its capabilities is of great significance for future research. Through a large number of examples, the report analyzes GPT-4V's capabilities along different dimensions, especially its self-verification ability, self-consistency, and potential for multi-turn dialogue.
Algebraic reasoning ability: On MathVista's algebra problems, GPT-4V demonstrates an excellent ability to understand functions shown in images and infer their properties, even surpassing other large models and humans. However, GPT-4V still faces challenges when dealing with low-resolution images and images containing multiple functions.
Arithmetic reasoning ability: Arithmetic problems in MathVista require not only accurate basic operations but also an understanding of diverse visual scenarios. As shown in the figure below, GPT-4V shows significant improvement over existing models in this regard.
Geometric reasoning ability: In geometric reasoning, GPT-4V performs on par with humans on MathVista. In the following two examples, GPT-4V gives correct answers with detailed explanations to both an elementary-level and a higher-grade question.
Logical reasoning ability: In MathVista's logical reasoning problems, the model must infer implicit patterns of numbers or shapes from abstract figures. GPT-4V struggles here, with an accuracy of only 21.6%, just above the random-guess baseline of 8.1%.
Numeric commonsense reasoning ability: Numeric commonsense reasoning in MathVista involves everyday objects and knowledge about well-known people. This type of problem is challenging for large models. For example, in the problem shown in the figure below, only GPT-4V can correctly interpret the optical illusion in the image.
However, in some cases, such as identifying the maximum capacity of a beaker, both GPT-4V and Bard perform poorly.
Scientific reasoning ability: GPT-4V significantly outperforms other large models on MathVista's scientific reasoning problems. It can usually interpret domain-specific information in a figure accurately and carry out subsequent reasoning.
However, the application of certain basic concepts, such as relative motion, remains a weakness of GPT-4V.
Statistical reasoning ability: GPT-4V demonstrates strong statistical reasoning in understanding the various charts, plots, and graphs in MathVista, and it solves mathematical problems involving chart analysis more accurately than any other large model.
GPT-4V's self-verification ability.
Self-verification is a concept from social psychology whose core idea is that individuals want others to understand them the way they perceive themselves; this leads individuals to take proactive action to ensure that others can see their stable self-view (Talaifar & Swann, 2020).
In experiments, GPT-4V shows a similar self-verification ability: during reasoning, it can autonomously examine its own behavior and proactively correct possible errors.
It is important to note that this self-verification ability is different from relying solely on external feedback or multi-turn dialogue to improve the model's output.
For example, in some cases, GPT-4V is able to self-review a set of candidate answers in a single output, identifying valid answers that meet all the given criteria.
In the multi-step reasoning problem below, GPT-4V shows remarkable capability. Not only can it carry out coherent reasoning, it also verifies the validity of key steps. In particular, when invalid intermediate results appear, such as a negative length, GPT-4V can proactively detect and flag these errors.
This capability allows GPT-4V to optimize its reasoning process: after identifying an error, it tries different approaches to solve the problem.
The application of self-consistency to GPT-4V and its limitations.
Self-consistency is a technique widely used in large language models to improve the accuracy of the model when dealing with complex inference tasks. This approach typically involves sampling multiple inference paths and selecting the answer that appears most frequently as the final solution.
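As a rough sketch of how such a majority vote can be implemented (the sampling function is an assumed helper that runs one reasoning path and returns its extracted short answer, for example via the extractor sketched earlier), consider:

```python
# Minimal sketch of self-consistency: sample several reasoning paths and take a
# majority vote over their extracted short answers.
import random
from collections import Counter
from typing import Callable


def self_consistent_answer(sample_answer: Callable[[], str], k: int = 5) -> str:
    """sample_answer() should run one reasoning path at temperature > 0 and
    return the extracted short answer; the most frequent answer wins."""
    votes = [sample_answer() for _ in range(k)]
    return Counter(votes).most_common(1)[0][0]


# Example with a stand-in sampler; a real one would query the model and run
# the answer extractor on each sampled response.
print(self_consistent_answer(lambda: random.choice(["3.5", "3.5", "4"])))
```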
Experiments verify the effectiveness of the self-consistent technique in improving the performance of GPT-4V on Mathvista.
The results show that self-consistency plays a significant role in correcting GPT-4V's errors in visual perception and calculation, as well as reducing hallucinations.
However, experiments also reveal the limitations of self-consistency: when GPT-4V has difficulty correctly understanding complex visual scenes, the improvement from self-consistency is not significant.
This suggests that although self-consistency is an effective method of improvement, its success depends largely on the model's basic understanding of visual information.
GPT-4V's multi-turn dialogue capability on MathVista.
The report concludes with an evaluation of GPT-4V's ability to conduct multi-turn human-machine interactive dialogue on MathVista.
The experimental results show that, in multi-turn dialogue, GPT-4V can effectively use user-provided hints to optimize its reasoning process.
This includes correcting misunderstandings in visual perception based on the user's guidance, fixing inconsistencies in reasoning logic, correcting domain knowledge, and even understanding and solving extremely complex diagram problems with human assistance.
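As a purely illustrative example of this interaction pattern (the dialogue content is invented, not taken from the report), such an exchange might be structured as follows, with the user's corrective hint appended as an extra turn before the model's revised answer is generated:

```python
# Illustrative structure of a multi-turn exchange in which a user hint corrects
# a visual misreading; the dialogue content is invented, not from the report.
messages = [
    {"role": "user",
     "content": "Question about the attached bar chart: what is the difference "
                "between the two tallest bars?"},
    {"role": "assistant",
     "content": "The tallest bars look like 8 and 5, so the difference is 3."},
    # The user supplies a corrective hint; the next model turn is generated
    # conditioned on the full history, so it can revise its earlier reading.
    {"role": "user",
     "content": "Look again: the second-tallest bar reaches 6, not 5."},
]
```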
Major Chinese authors.
Pan Lu is a Ph.D. student at the University of California, Los Angeles (UCLA) and a member of the UCLA Natural Language Processing (NLP) Group and the Center for Vision, Cognition, Learning, and Autonomy (VCLA).
Before that, he received his master's degree in computer science from Tsinghua University. He has interned at Microsoft and the Allen Institute for Artificial Intelligence.
He is an author of works including ScienceQA and Chameleon, and has been awarded an Amazon Ph.D. Fellowship, a Bloomberg Ph.D. Fellowship, and a Qualcomm Innovation Fellowship.
Tony Xia is a master's student in computer science at Stanford University. Previously, he received his undergraduate degree in Computer Science from the University of California, Los Angeles.
Jiacheng Liu is a Ph.D. student at the University of Washington, where he conducts research on common-sense reasoning, mathematical reasoning, and text generation.
Previously, he received his undergraduate degree from the University of Illinois Urbana-Champaign. He is a recipient of the Qualcomm Innovation Fellowship.
Chunyuan Li is a Principal Researcher at Microsoft Research, Redmond.
Previously, he received his Ph.D. from Duke University, where he studied machine learning under Professor Lawrence Carin. He has served as an area chair for NeurIPS, ICML, ICLR, EMNLP, and AAAI, as well as a guest editor for IJCV.
He is an author of LLaVA (visual instruction tuning) and other work on instruction tuning.
Hao Cheng is a Senior Researcher at Microsoft Research, Redmond, and an adjunct professor at the University of Washington.
Previously, he received his Ph.D. from the University of Washington. He was a key member of the team that won the 2017 Alexa Prize.