1. Conclusions up front.
This work argues for code reasoning tasks as a complement to code generation when evaluating LLMs on programming. It introduces CodeMind, a framework that supports several code reasoning tasks, and uses CodeMind in a large-scale study to evaluate state-of-the-art LLMs' code reasoning. The results show that, in general, the LLMs know how code constructs work and can reason about program specifications and trace how inputs evolve into outputs through execution. However, their abilities are limited as code becomes more complex, i.e., as control flow or data flow grows more complex, the code contains non-primitive types, or it calls APIs. It was also observed that specification reasoning, which is critical for generating code from a given specification, does not imply that a model can also reason about code execution.
2. A brief introduction to CodeMind.
2.1 Background.
Large language models (LLMs) demonstrate exceptional programming ability when instruction-tuned or prompted with chain-of-thought (CoT) or tree-of-thought (ToT) prompting and in-context learning. However, several studies have shown that LLMs struggle to generalize this ability, especially as datasets become more complex or when tasks require understanding code rather than natural language. This is mainly because LLMs are trained to associate code synthesis with natural-language specifications, i.e., to reason about how to combine code constructs similar to the examples they have seen in order to meet the requirements explained in the specification.
To illustrate how reasoning tasks can evaluate LLMs, Figure 1-a shows code that GPT-3.5 synthesized from a natural-language specification. The code constructs corresponding to parts of the specification are highlighted in matching colors. Due to an ambiguity in the natural language, the generated code returns the smallest number in the list rather than the number at the index equal to the smallest value. Therefore, for the given input [2,5,4,3], the code returns 2 instead of 4, and the assertion fails.
One way to evaluate LLM code reasoning is to include a specific expected program behavior in the prompt and check whether the generated code reproduces that behavior. This requires a certain level of reasoning, which is called specification reasoning (SR). Figure 1-b shows the new specification and the corresponding generated code. Given the specified input-output pair, executing the code makes the test pass, which indicates GPT-3.5's ability to understand the given specification and generate the right code.
Including test data in prompts is a known practice for improving model performance on programming tasks. However, it is only a weak indicator of reasoning, as it still involves associations with natural language. A deeper level of reasoning is reasoning about the execution output for a given input, which is called execution reasoning (ER). This task challenges LLMs even more, requiring them to reason about code without any natural-language cross-reference. Figure 1-c shows GPT-3.5's CoT reasoning for the ER task. Although the model can generate code that produces the expected output (verified as correct by the test), it cannot correctly reason about executing that code on the same input.
2.2 The CodeMind code reasoning framework.
To automate code reasoning evaluation, CodeMind is proposed. CodeMind currently offers three code reasoning tasks: Independent Execution Reasoning (IER) and Dependent Execution Reasoning (DER) evaluate whether an LLM can reason about how a given input evolves into an output, for arbitrary code (IER) or only for the code it correctly synthesizes itself (DER), and Specification Reasoning (SR) evaluates the extent to which an LLM implements a specified behavior.
2.2.1 CodeMind
A program specification defines a function s: S_I → S_O, where S_I is the set of all possible inputs to the program and S_O is the set of corresponding outputs. The code c that implements the specification is likewise a function c: C_I → C_O. A program is correct with respect to the specification if it satisfies all of the following conditions:
C_I ⊆ S_I, C_O ⊆ S_O, and ∀ i ∈ C_I: c(i) = s(i)
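To make the formalization concrete, here is a minimal sketch (a hypothetical specification and implementation, not from the paper) that checks the last condition over a small sampled input set:

def s(xs):          # specification: return the index of the smallest element
    return xs.index(min(xs))

def c(xs):          # a candidate implementation a model might produce
    return min(range(len(xs)), key=lambda k: xs[k])

# Check the correctness condition on sampled inputs: c(i) == s(i) for every i.
sample_inputs = [[2, 5, 4, 3], [7], [3, 3, 1]]
assert all(c(i) == s(i) for i in sample_inputs)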
Code reasoning therefore requires the model to reason about how an input evolves into a given output through the implementation (execution reasoning) and to produce an implementation that yields the correct output for a given input (specification reasoning).
2.2.1.1 Execution reasoning.
Given the formalization above, the two execution reasoning tasks are defined as follows.
Definition 1 (Independent Execution Reasoning, IER). Given a program c: C_I → C_O and an input i ∈ C_I, an LLM L correctly reasons about code execution if ô = c(i), where ô = L(c, i) is the output L predicts for input i. Note that this task does not involve the specification, so it can evaluate an LLM's execution reasoning on arbitrary code for which ground-truth ⟨i, o⟩ pairs exist.
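As a sketch of how a single IER check could be scored, assuming the program defines an entry function named f and a hypothetical ask_llm_for_output helper that returns the model's predicted output as a Python value:

def ier_score(code_str, args, ask_llm_for_output):
    """CRS for one program: 1 if the model's predicted output equals the real
    execution result, 0 otherwise. No specification is shown to the model."""
    namespace = {}
    exec(code_str, namespace)
    expected = namespace["f"](*args)                # ground truth by actually running the code
    predicted = ask_llm_for_output(code_str, args)  # hypothetical model call: code + input only
    return int(predicted == expected)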
IER evaluates LLMs' general code reasoning over arbitrary code, which requires knowledge of code constructs, arithmetic and logical operations, and control flow. However, even for human developers, it is easier to reason about code they developed themselves than about arbitrary code. In addition, as a measure of self-consistency, LLMs should be able to reason about code they can correctly synthesize. This motivates the following reasoning task.
Definition 2 (Dependent Execution Reasoning, DER). Given a specification s: S_I → S_O, a program c: C_I → C_O generated by LLM L, and an input i such that c passes the test ⟨i, o⟩, L correctly reasons about code execution if ô = c(i), where ô = L(c, i) is the output L predicts for input i. The assumption here is that when L generates code c that passes the test ⟨i, o⟩, it should also be able to predict o correctly.
2.2.1.2 Specification reasoning.
In addition to execution reasoning, a model should understand the specification in order to synthesize correct code. The specification reasoning task is formally defined as follows.
Definition 3 (Specification Reasoning, SR). Given a specification s: S_I → S_O, an arbitrary pair ⟨i, o⟩ included with the natural-language specification in the prompt, where i ∈ S_I, o ∈ S_O, and s(i) = o, and a program c: C_I → C_O generated by LLM L, L correctly reasons about the specification if c(i) = s(i). In other words, when ⟨i, o⟩ is explicitly given in the prompt, the generated code should pass the test ⟨i, o⟩.
2.2.1.3 Evaluating code reasoning.
The reasoning performance of a model on a given program is measured using the Correct Reasoning Score (CRS), which is 1 if the model reasons correctly and 0 otherwise. A collective metric, the Correct Reasoning Rate (CRR), measures the degree to which a given LLM can reason about multiple programs in a benchmark. For the M programs in benchmark P, CRR is the fraction of programs reasoned about correctly: CRR = (CRS_1 + CRS_2 + ... + CRS_M) / M.
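As a sketch, computing CRR from the per-program scores is just an average:

def crr(crs_scores):
    """Correct Reasoning Rate: fraction of programs reasoned about correctly.
    crs_scores is a list of 0/1 Correct Reasoning Scores, one per program."""
    return sum(crs_scores) / len(crs_scores)

print(crr([1, 0, 1, 1]))  # 3 correct out of 4 programs -> 0.75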
2.3 Evaluation results.
Using CodeMind, a large-scale study was conducted to evaluate the code reasoning ability of LLMs. Nine models were selected, including general-purpose and code-specialized LLMs, and prompted to perform the IER, DER, and SR tasks on 5395 programs written in Java and Python. These programs come from five programming benchmarks: HumanEval, MBPP, CRUXEval, CodeNet, and Avatar. The study observed:
1) LLMs have a good grasp of code constructs, possibly because these constructs align with concepts in natural-language specifications. The models can interpret code statement by statement and generally follow the execution of a program. However, their reasoning ability is limited to simple programs. Moreover, although models such as GPT-3.5 and MagicCoder can correctly explain what code does, they may still fail when tracking the data flow and reasoning about the execution output. Open-source LLMs that achieve effectiveness comparable to the GPT models in code synthesis lag far behind them in code reasoning.
2) LLMs can reason about test data in the specification, even when it is deceptive, and incorporate it into the synthesis process. However, their reasoning is constrained by their inherent limitations: they achieve higher performance when reasoning about code they can synthesize correctly.
3) On datasets containing complex programs, the correlation between ranking models by synthesis performance (generating code that passes all tests) and by reasoning performance is negligible or non-existent. This calls for CodeMind's tasks and metrics to complement LLM code evaluation.
4) Nested constructs, complex conditional predicates and loop conditions, non-trivial arithmetic and logical operators, and API calls can significantly challenge LLM reasoning.
2.3.1 Experimental Setup.
The study included 9 LLMs and 5395 Java and Python programs from 5 programming datasets. The LLM and program selection are explained below.
Subject LLMs: Nine pre-trained or instruction-tuned models were selected, including general-purpose and code-specialized LLMs. The choice is limited by available computing resources, so models with fewer than 20B parameters that outperform their peers were chosen. The subject LLMs are GPT-4, GPT-3.5, Llama 2 (13B), Mistral, CodeLlama (13B, instruction-tuned), StarCoder (15.5B), WizardCoder (15B, instruction-tuned), MagicCoder (7B), and DeepSeekCoder (6.7B). Following best practices, the prompt template is customized for each model (all prompts are publicly available for further investigation). The temperature is set to zero to ensure reproducibility of the results. CodeMind is open-sourced so that users can evaluate other models and temperatures.
Subject programs: The criteria for selecting subject programs are the availability of test data (inputs and corresponding expected outputs) and implementations of the same program in several programming languages (to study the impact of language on reasoning). Among existing benchmarks, programs from HumanEval, MBPP, CodeNet, Avatar, and CRUXEval were selected, using the Java and Python versions since these are widely used languages. HumanEval and MBPP are well-known code generation benchmarks, CodeNet and Avatar are code translation benchmarks, and CRUXEval is a benchmark of relatively simple Python programs generated by CodeLlama (34B) to evaluate LLMs' input and output prediction.
Figure 2 shows the complexity distribution of the programs, measured as cyclomatic complexity (CC) and lines of code (LOC). CC measures the number of linearly independent execution paths in a program's control-flow graph (CFG). It is calculated as CC = E − N + 2P, where E and N are the numbers of edges and nodes in the CFG, respectively, and P is the number of methods in the class. In general, a higher CC indicates a more complex program. For a reasoning task, the model must work out which execution path a given input takes to produce the output, so the more independent paths there are, the less likely the model is to succeed by chance. CC may correlate with the number of lines in a program, but more lines do not necessarily mean a higher CC. For example, a 10-line program with no conditional or loop constructs has only one execution path, while an 8-line program with two nested conditional statements has 3 or 4 execution paths, depending on the conditional predicates.
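For a single structured function, CC can be approximated as one plus the number of decision points, which is equivalent to E − N + 2 for one connected component. A rough illustrative sketch for Python code (not the tooling used in the study) using the standard ast module:

import ast

def approx_cyclomatic_complexity(source: str) -> int:
    """Approximate CC of a single function: 1 + number of decision points.
    Counts if/for/while/except and boolean operators; an illustrative heuristic."""
    tree = ast.parse(source)
    decisions = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.If, ast.IfExp, ast.For, ast.While, ast.ExceptHandler)):
            decisions += 1
        elif isinstance(node, ast.BoolOp):   # each extra and/or adds a branch
            decisions += len(node.values) - 1
    return decisions + 1

print(approx_cyclomatic_complexity(
    "def f(x):\n    if x > 0:\n        return x\n    return -x"
))  # -> 2: one conditional plus the default path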
2.3.2 Evaluation of LLMs on IER.
To evaluate the performance of the LLMs on the IER task, they are prompted under two settings: direct answer and CoT. For direct answer, each model is prompted to predict the output for a given input. Under the CoT setting, the model is first instructed to simulate the execution step by step, reporting the values of variables after each statement executes, and is then asked to predict the output for the given input. In both settings, the prompt contains an in-context example that serves two purposes: introducing the IER task and indicating the response format.
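Illustrative prompt skeletons for the two settings (simplified sketches; the study's actual templates are customized per model and published with CodeMind):

DIRECT_TEMPLATE = """You are given a program and an input.
Predict the output of the program for the given input. Answer with the output only.

Program:
{code}

Input: {test_input}
Output:"""

COT_TEMPLATE = """You are given a program and an input.
First simulate the execution step by step, stating the values of the variables
after each statement. Then give the final output for the given input.

Program:
{code}

Input: {test_input}
Step-by-step execution:"""

prompt = COT_TEMPLATE.format(code="def f(x): return x + 1", test_input="3")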
Since IER only requires arbitrary code and corresponding ground-truth ⟨i, o⟩ pairs, all 5395 subject programs are used to prompt the LLMs. Table 1 shows the results of this experiment with CoT prompting. From these results, it can be observed:
The GPT models outperform the other models on the IER task, with a large gap over the best open-source model: 33.92% (GPT-4) and 15.56% (GPT-3.5). Among the open-source models, MagicCoder performs best, with an average advantage of 4.83%, except on the Avatar dataset.
On the datasets that contain both Java and Python samples, all models perform worse on Java (an average decrease of 2.91%, and 2.33% on Avatar). This may be because Java enforces a stricter syntax and type system than Python, which makes execution reasoning more challenging.
CoT prompting (where the model describes the execution process in natural language before predicting the output) improves the models' IER performance by 5.24% on average. However, even with CoT prompting, the accuracy of the (open-source) models is still far from satisfactory and needs fundamental improvement.
Moving down the table, the models face more challenges on IER (i.e., execution reasoning) for CodeNet and Avatar programs than for MBPP, HumanEval, and CRUXEval. One potential reason is the complexity of these programs, as shown in Figure 2. A detailed analysis of the models' performance (Figure 3) shows a strong negative correlation between cyclomatic complexity (CC) and correct reasoning rate (CRR), measured with Spearman's rank correlation coefficient, confirming that the models struggle more with IER on more complex code. At the same time, some models, namely Llama 2, CodeLlama, MagicCoder, StarCoder, and WizardCoder, perform worse on CRUXEval than on HumanEval, even though CRUXEval programs are less complex in terms of LOC and CC. Further investigation is needed to understand what factors other than CC affect the models' CRR.
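The correlation analysis described here can be reproduced with scipy (a sketch with made-up numbers, not the study's data):

from scipy.stats import spearmanr

# Hypothetical data: CRR of one model over program buckets of increasing CC.
cc_buckets = [1, 2, 3, 4, 5, 6]
crr_values = [0.82, 0.74, 0.65, 0.51, 0.40, 0.33]

rho, p_value = spearmanr(cc_buckets, crr_values)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")  # strongly negative in this toy example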
2.3.3 Evaluation of LLMs on DER.
The key question is whether models can correctly reason about the execution of the correct programs they themselves generate. This assessment combines the generation and reasoning tasks. The pipeline for evaluating DER consists of three steps (a sketch of the pipeline follows the list):
1) Follow best practices and prompt the subject LLM to generate code;
2) Run the synthesized program against the existing tests;
3) For programs that pass the tests, prompt the model CoT-style to reason about execution using one of the test inputs. Note that comments are removed from the generated code to ensure fairness.
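A minimal sketch of this three-step pipeline, assuming hypothetical helpers generate_code (step 1) and ask_llm_for_output (step 3) that call the model, and a single entry function named f in the generated code:

def der_score(spec, tests, generate_code, ask_llm_for_output):
    """DER for one specification. Returns None if the generated code fails the
    tests (the program is then excluded), else 1/0 for correct execution reasoning.
    generate_code and ask_llm_for_output are hypothetical model-call helpers."""
    code = strip_comments(generate_code(spec))          # step 1, comments removed for fairness
    namespace = {}
    exec(code, namespace)
    f = namespace["f"]                                  # assumed entry point named f
    if not all(f(*args) == expected for args, expected in tests):
        return None                                     # step 2: must pass the existing tests
    args, expected = tests[0]                           # step 3: reason about one passing test
    return int(ask_llm_for_output(code, args) == expected)

def strip_comments(code):
    # naive '#' comment stripping, enough for this sketch
    return "\n".join(line.split("#")[0] for line in code.splitlines())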
Programs from CRUXEval, CodeNet, and Avatar are excluded, as these datasets are not designed for code generation and lack proper program specifications. In addition, the code generation results of Llama 2 could not be reproduced, so it is also excluded from the subject LLMs. Similar to the IER experiment, the temperature is set to zero to control non-determinism and make the results reproducible. As a result of this design decision, the synthesis results may differ from existing leaderboards.
Table 2 shows the results of this experiment. The GPT models still outperform the open-source models on the DER task, with gaps of 17.97% (GPT-4) and 13.13% (GPT-3.5) over the best open-source model. Compared to IER, the gap between the GPT models and the open-source models narrows. It can also be observed that, except for CodeLlama on HumanEval, the models achieve an average CRR 6.84% higher on DER than on IER.
Before concluding that the models are more competent at reasoning about the execution of programs they correctly synthesized, the programs in this experiment were compared to those in the IER experiment: if the generated programs were simpler, their lower complexity could be the root cause of the higher CRR on the DER task. Figure 4 shows the CC distribution of the programs in MBPP and HumanEval compared to the programs generated by the subject LLMs. It can be observed that the synthesized code is as complex as, if not more complex than, the ground-truth programs in these datasets. This confirms that models can better reason about code they synthesized correctly. However, there is still a large gap between the generation and reasoning abilities of LLMs, especially for the open-source models.
Since generation and reasoning are combined in DER, the Spearman rank correlation coefficient between model rankings based on synthesis and on reasoning is computed for each dataset. The results show a strong positive correlation (ρ = 0.85) on MBPP, but a negligible correlation on HumanEval (ρ = 0.17). These results convey a strong message: ranking LLMs by generation ability (pass@k) can differ significantly from ranking them by reasoning ability on the same code. This calls for frameworks like CodeMind to evaluate aspects of LLM code abilities beyond generation.
2.3.4 Specification Reasoning (SR) Evaluation.
Specification Reasoning (SR) provides a new perspective on how LLMs generate code, specifically how they leverage input-output specifications. To evaluate the SR capability of an LLM, it is prompted to generate code under the following three settings:
1) A natural-language specification that contains a genuine input-output pair. In this setting, an existing test is randomly selected and added to the specification. Only this test is used to verify the generated code.
2) A natural-language specification without input-output pairs. The test added to the specification in the previous setting is removed and the LLM is re-prompted to generate code. Only the test from the previous setting is used to validate the generated code. Intuitively, if including test data helps LLM generation, a performance degradation will be observed here.
3) A natural-language specification that contains a misleading input-output pair.
The expected output of the test from the first setting is mutated and added to the specification, and the original test is used to verify the generated code. The mutation changes the expected output to a value inconsistent with the specification: for example, if the expected output is true, it becomes false; if it is a positive integer, it becomes a negative number far from the original value. Intuitively, misleading input-output pairs should further degrade LLM performance, since they deviate from the natural-language specification.
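A sketch of how such a misleading expected output could be produced (illustrative mutation rules, assumed rather than taken from the paper):

def mutate_expected_output(value):
    """Turn a correct expected output into a misleading one for setting 3
    (illustrative rules; the paper's exact mutation operators may differ)."""
    if isinstance(value, bool):
        return not value                  # true -> false and vice versa
    if isinstance(value, int):
        return -(value + 100)             # push far away and flip the sign
    if isinstance(value, str):
        return value[::-1] + "_mutated"   # clearly wrong string
    if isinstance(value, list):
        return value[::-1] + [None]       # reorder and pollute the list
    return None

print(mutate_expected_output(True))   # False
print(mutate_expected_output(7))      # -107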
This experiment is performed only on MBPP and HumanEval programs. Prompts in HumanEval that originally contain input-output samples are also preprocessed. The results in Table 3 show that including test data in the specification improves LLMs' generation performance by 7.36% on average. Introducing deceptive tests into the specification, relative to legitimate tests, negatively impacts generation performance; however, the average degradation across all models and programs is only 2.65%. Regardless, these results demonstrate LLMs' ability to reason about and leverage the test data in the specification.
2.3.5 In-depth analysis of the results.
The IER results, which evaluate LLMs' general code reasoning ability, are analyzed further. The first question is whether LLMs know how different code constructs work; without understanding the logic of each construct, a model cannot reason about code execution.
To do this, each of the 5395 programs is tagged according to the constructs used in its implementation: for, while, if, try, switch, nested loops, nested if, and basic. Programs marked basic use no special code constructs. Next, the programs were clustered by label and the CRR of the LLMs for each cluster was calculated. Figure 5 shows the results of this analysis for the five top-performing LLMs. It can be observed that the models handle conditional statements better than loops, with try-catch or try-except statements being an exception. In addition, for nested constructs, the CRR drops significantly.
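A simplified sketch of such tagging for Python programs using the standard ast module (the study tags both Java and Python programs; its exact tooling is not described here):

import ast

def construct_tags(source: str) -> set[str]:
    """Label a Python program by the constructs it uses: for, while, if, try,
    nested loops, nested if, or basic (simplified sketch of the tagging step)."""
    tree = ast.parse(source)
    tags = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.For, ast.While)):
            tags.add("for" if isinstance(node, ast.For) else "while")
            if any(isinstance(ch, (ast.For, ast.While)) for ch in ast.walk(node) if ch is not node):
                tags.add("nested loops")
        elif isinstance(node, ast.If):
            tags.add("if")
            if any(isinstance(ch, ast.If) for ch in ast.walk(node) if ch is not node):
                tags.add("nested if")
        elif isinstance(node, ast.Try):
            tags.add("try")
    return tags or {"basic"}

print(construct_tags("def f(xs):\n    for x in xs:\n        if x > 0:\n            return x"))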
Effect of loop properties: Given that the models struggle most with loop constructs, the next step focuses on programs tagged for, while, and nested loops. The hypothesis is that this struggle is related either to the loop length or to determining the loop length. The former questions whether it is harder for the model to track the program's data flow as loops get longer; the latter questions the model's ability to reason about how many times a code block should be repeated, regardless of its length. Figure 6 plots the distribution of correct versus incorrect cases and the CRR for each loop length in the Java programs. The subplot labels show the Spearman coefficient between loop length and CRR. A moderate to strong negative correlation can be observed, i.e., as the loop length increases, the CRR decreases. Manual investigation of the IER failure cases also shows that LLMs are mostly unable to reason correctly about loop conditions; without knowing the loop condition and the number of iterations, execution cannot be reasoned about correctly. In addition, there are cases where the model reasons correctly about the loop condition and the number of iterations but loses track of the data flow inside the loop and predicts the output incorrectly.
In the code snippet below (P03059 from CodeNet, Java), the loop condition depends on a constant variable (c) and a variable (time) whose value changes inside the loop.
input: 3 5 7, expected output: 10
int a = sc.nextInt(), b = sc.nextInt(), c = sc.nextInt();
int amount = 0, time = a;
for (int i = 2; time <= c + 0.5; i++) {
    amount += b;   // loop body reconstructed from the models' step-by-step analyses below
    time = a * i;
}
println(amount);
All models fail on IER for this relatively simple program. Looking at the CoT analyses, GPT-3.5 understands the loop condition and correctly deduces all the variable values in the first two iterations, but it stays in the loop for one extra iteration. StarCoder cannot reason about the loop condition correctly and iterates the code block six times. MagicCoder's CoT only explains the code (correctly) and does not convey how it reasoned about the variable values.
GPT-3.5, predicted output: 15
analysis: enters for loop and iterates from 2 onwards
until the value of time exceeds c + 0.5. in first
iteration, amount becomes 5 and time becomes 6. in
second iteration, amount becomes 10 and time becomes
9. in third iteration, amount becomes 15 and time
becomes 12. 12 is greater than 7+0.5, loop exits.
StarCoder, predicted output: 105
analysis: it enters a for loop and it iterates from 2
to 7 to check each integer i in this range. in each
iteration, the value of i is passed to variable time.
the condition is met when i is 2,3,4,5,6 and 7.
MagicCoder, predicted output: 15
enters for loop and iterates from 2 to infinity. inside
the loop, it adds the value of b to variable amount.
then it calculates new value of time by multiplying a
with current iteration i. the loop continues until time
is greater than c+0.5. after the loop finishes, print
the final value of amount as output, which is 15.
The loop condition in the code snippet below (P02994 from CodeNet, Java) is simple. As a result, most models reason correctly about the number of iterations.
input: 5 2, expected output: 18
int n = scanner.nextInt();
int l = scanner.nextInt();
List<Integer> diff = new ArrayList<>();
for (int i = 1; i < n + 1; i++)
    diff.add(value);   // value: the i-th list element (elided in the original snippet)
int removek = diff.indexOf(Collections.min(diff)) + 1;
println(n*l + n*(n+1)/2 - n - (l + removek - 1));
Still, some LLMs have difficulty reasoning correctly about code execution inside the loop. CodeLlama can correctly reason about the number of loop iterations, but it cannot compute the values in the diff list, which leads to a wrong output:

CodeLlama, predicted output: 24
analysis: in the given inputs n is 5 and l is 2. the sequence of numbers from 1 to 5 is [1,2,3,4,5]. the differences between l and each number in the sequence are [1,0,1,2,3]. the smallest difference is 0 at index 1. so, the index 1 is removed from the sequence. the new sequence becomes [1,3,4,5]. the sum of the original sequence [1,2,3,4,5] and the new sequence [1,3,4,5] is calculated as (5*2+5*(5+1)/2-5-(2+1-1))=15+15-5-2=24.

MagicCoder can also correctly reason about the loop length and, similar to the previous example, correctly explains what the code does. However, it does not predict the output correctly (it outputs 10), which makes it impossible to tell whether it understands the code as well as it explains it.

Effect of if-else properties: Manual investigation revealed that LLMs struggle with conditional constructs for two reasons: (1) confusion about branches and (2) incorrect reasoning about conditional predicates. Regarding the former, LLMs sometimes have difficulty recognizing conditional statements; for example, regardless of which branch is taken, they step into both the "if" and the "else" branch when reasoning about execution. This rarely happens, however; in most cases, LLMs fail to evaluate the conditional predicate correctly, take the wrong branch, and predict an incorrect output.

Effect of mathematical and logical operators: Knowing how mathematical and logical operators work is essential for reasoning about how an input evolves into an output through execution, and the observations from the previous analyses also highlight their importance for reasoning about program control flow. Manual investigation of failed IER cases reveals the prevalence of arithmetic and logical operators in the programs, with the former dominating. LLMs find it harder to reason about bitwise operators (e.g., >>, &, and |) than about arithmetic operators (e.g., + and -) and comparison operators (e.g., < and >).

The following snippet is from MBPP (Task 311) and contains several arithmetic and logical operators.

input: 10, expected output: 14
def set_left_most_unset_bit(n):
    if not (n & (n + 1)):
        return n
    pos, temp, count = 0, n, 0
    while temp:
        if not (temp & 1):
            pos = count
        count += 1
        temp >>= 1
    return (n | (1 << pos))

Despite not predicting the correct output, StarCoder knows how most of the operators work. For example, it can infer that the bitwise AND of n and n+1 is used to determine whether n is a power of 2, and it can reason about the loop condition and the check of whether the least significant bit of temp is 0. However, it does not seem to understand the meaning of the return statement (setting the left-most unset bit), and therefore fails to produce the correct output:

analysis: the variable n is initialized to 10. the code checks if n is a power of 2. if it is, the code returns n. if it is not, the code calculates the position of the leftmost unset bit. the code then returns the result of n | 1 << pos.

Effect of the output type: Programs are classified by output type to check whether (1) LLMs can correctly predict the output type (type match) and (2) whether they can correctly reason about the output value (value match). The output types identified in the subject programs are int (e.g., 2), decimal (e.g., 2.34), string (e.g., "codemind"), binary (e.g., true or false), list (e.g., [1,3,4,7]), and tuple (Python-specific, e.g., (2,7)). Figure 7 shows the detailed results. Overall, LLMs achieve high type matching (>80%), although they struggle to get the values right (value matching). Among the different types, models find it hardest to predict output values of the tuple, list, and decimal types. Tuples and lists consist of multiple items, each of which can change during program execution, so it is not surprising that the models struggle to trace the flow of inputs through potentially different execution paths and reason about the composite output as a whole. Also, since operations on these types often involve API calls such as min(), next(), and charAt(), reasoning about the changes requires LLMs to know how the APIs work, which demands extra effort.
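As a sketch of how type match and value match could be scored per prediction (illustrative, assuming both the predicted and expected outputs have already been parsed into Python values):

def match_scores(predicted, expected):
    """Return (type_match, value_match) for one prediction, as in the
    output-type analysis (sketch; both values are assumed already parsed)."""
    type_match = int(type(predicted) is type(expected))
    value_match = int(predicted == expected)
    return type_match, value_match

print(match_scores([1, 3, 4], [1, 3, 4, 7]))  # (1, 0): right type, wrong value
print(match_scores("14", 14))                 # (0, 0): wrong type and value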
The contributions are:
1) The CodeMind framework for code reasoning, with three code reasoning tasks formally defined. CodeMind is open source and accepts contributions from researchers to integrate more reasoning tasks;
2) A large-scale evaluation of LLMs' code reasoning using CodeMind;
3) A comprehensive, in-depth analysis of the results, providing a catalog of root causes that negatively impact LLMs' code reasoning ability. This catalog will be a valuable guide for developing better benchmarks that truly assess LLM programming capabilities.
Title: CodeMind: A Framework to Challenge Large Language Models for Code Reasoning