EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking

arXiv:2502.12466

Anjiang Wei\(^{1}\), Jiannan Cao\(^{2}\), Ran Li\(^{1, 3}\), Hongyu Chen\(^{4}\), Yuhui Zhang\(^{1}\), Ziheng Wang\(^{1}\), Yaofeng Sun\(^{5}\), Yuan Liu\(^{3}\), Thiago S. F. X. Teixeira\(^{6}\), Diyi Yang\(^{1}\), Ke Wang\(^{7}\), Alex Aiken\(^{1}\)
\(^1\)Stanford University \(^2\)MIT \(^3\)Google \(^4\)Nanjing University \(^5\)DeepSeek \(^6\)Intel \(^7\)Visa Research

Abstract

Equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs, underpins a broad range of applications, including software refactoring, testing, and optimization. We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models (LLMs). We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories. These pairs are systematically generated through program analysis, compiler scheduling, and superoptimization, covering nontrivial structural transformations that demand deep semantic reasoning beyond simple syntactic variations. Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%. In the most challenging categories, the best accuracies are 62.3% and 68.8%, only modestly above the 50% random baseline for binary classification, indicating significant room for improvement in current models’ code reasoning capabilities.


Leaderboard

All numbers are accuracy (%).

Model DCE CUDA x86-64 OJ_A OJ_V OJ_VA Overall
o3-mini-2025-01-31 68.8 59.0 84.5 84.2 88.2 83.2 78.0
DeepSeek-R1 52.2 61.0 78.2 79.8 91.5 78.0 73.5
o1-mini-2024-09-12 55.8 50.7 74.2 80.0 89.8 78.8 71.5
claude3.5-sonnet-2024-10-22 38.5 62.3 70.0 71.2 78.0 73.5 65.6
gpt-4o-2024-11-20 43.2 49.5 65.2 71.0 87.0 73.8 65.0
DeepSeek-V3 41.0 50.7 69.2 73.0 83.5 72.5 65.0
Llama-3.1-405B-Instruct-Turbo 40.0 49.0 75.0 72.2 74.5 72.8 63.9
Qwen2.5-72B-Instruct-Turbo 42.8 56.0 64.8 72.0 76.5 70.8 63.8
gpt-4o-mini-2024-07-18 46.8 50.2 56.8 64.5 91.2 64.0 62.2
Qwen2.5-7B-Instruct-Turbo 50.5 49.2 58.0 62.0 80.8 63.0 60.6
QwQ-32B-Preview 48.2 50.5 62.7 65.2 71.2 64.2 60.3
Llama-3.1-70B-Instruct-Turbo 47.5 50.0 58.5 66.2 72.0 67.5 60.3
Mixtral-8x22B-Instruct-v0.1 46.8 49.0 62.7 63.5 76.0 62.7 60.1
Mixtral-8x7B-Instruct-v0.1 50.2 47.0 64.2 59.0 61.5 55.0 56.1
Mistral-7B-Instruct-v0.3 51.0 57.2 73.8 50.7 50.5 50.2 55.6
Llama-3.1-8B-Instruct-Turbo 41.8 49.8 50.5 57.5 75.5 56.8 55.3
Llama-3.2-3B-Instruct-Turbo 50.0 49.8 50.0 51.5 51.5 51.5 50.7
Random Baseline 50.0 50.0 50.0 50.0 50.0 50.0 50.0
Mean 47.9 52.4 65.8 67.3 76.4 67.0 62.8

Overview

EquiBench is a comprehensive benchmark designed to evaluate the code reasoning capabilities of Large Language Models (LLMs) through equivalence checking tasks. This framework helps researchers and developers assess how well different LLMs understand code semantics, reason about program functionality, and determine when two code snippets are functionally equivalent despite syntactic differences.
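
At the task level, each EquiBench instance is a binary classification query: given two programs, the model must decide whether they are equivalent or inequivalent. The Python sketch below illustrates the shape of such a query; the prompt wording and the query_llm helper are hypothetical and do not reflect EquiBench's actual harness.

# Illustrative sketch of a single equivalence-checking query.
# The prompt wording and the query_llm helper are hypothetical, not EquiBench's API.

def build_prompt(program_1: str, program_2: str) -> str:
    """Ask the model to decide functional equivalence of two programs."""
    return (
        "Determine whether the following two programs are functionally "
        "equivalent, i.e., produce identical outputs for all possible inputs.\n\n"
        f"Program 1:\n{program_1}\n\n"
        f"Program 2:\n{program_2}\n\n"
        "Answer with exactly one word: 'equivalent' or 'inequivalent'."
    )

def parse_answer(response: str) -> bool:
    """Map a free-form model response to a binary equivalence verdict."""
    text = response.strip().lower()
    return "inequivalent" not in text and "equivalent" in text

# Example usage with a hypothetical query_llm(prompt) -> str function:
# prediction = parse_answer(query_llm(build_prompt(pair["program_1"], pair["program_2"])))
# correct = (prediction == (pair["label"] == "equivalent"))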

Key Features

  • Diverse Test Cases: 2400 program pairs, half equivalent and half inequivalent, across six distinct categories (DCE for C programs, x86-64 for assembly programs, CUDA for CUDA programs, and OJ_A, OJ_V, OJ_VA for Python competitive programming problems)
  • Multiple Prompting Strategies: Support for zero-shot, few-shot, and chain-of-thought variations to evaluate different reasoning approaches
  • Wide Model Support: Compatible with leading LLMs from OpenAI, Anthropic, Meta, Mistral AI, Qwen, and DeepSeek
  • Standardized Methodology: Consistent evaluation framework enabling fair comparison across different model architectures

Example Program Pairs and Results

Figure 1: An inequivalent pair from the DCE category in EquiBench. In the left program, c = 1 is dead code and has no effect on the program state, whereas in the right program, it is executed and alters the program state. Such cases are generated using the Dead Code Elimination (DCE) pass in compilers.
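
To make the dead/live distinction concrete without the figure, here is a minimal Python sketch of the same idea. It is only an illustration: the actual DCE pairs in EquiBench are C programs, and this toy pair is not drawn from the dataset.

# Toy illustration (not from the dataset; EquiBench's DCE pairs are C programs).

def original(x):
    c = 1          # dead code: c is never read afterwards
    return x + 2

def transformed(x):
    return x + 2   # removing the dead assignment preserves behavior -> equivalent pair

def original_live(x):
    c = 1          # live code: c affects the return value
    return x + c + 2

def transformed_live(x):
    return x + 2   # removing live code changes behavior -> inequivalent pair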


Figure 2: An equivalent pair from the CUDA category in EquiBench. Both programs perform matrix-vector multiplication (\(y = Ax\)). The right-hand program uses shared memory tiling to improve performance. Tensor compilers are used to explore different scheduling strategies, automating the generation of such pairs.


Figure 3: An equivalent pair from the x86-64 category in EquiBench. Both programs are compiled from the same C function shown above—the left using a compiler and the right using a superoptimizer. The function counts the number of set bits in the input %rdi register and stores the result in %rax. Their equivalence has been formally verified by the superoptimizer.


Figure 4: Equivalent pairs from the OJ_A, OJ_V, OJ_VA categories in EquiBench. OJ_A pairs demonstrate algorithmic equivalence, OJ_V pairs involve variable renaming transformations, and OJ_VA pairs combine both types of variations.
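
To illustrate these transformation styles, consider the toy Python functions below; they are illustrative only and are not pairs taken from the dataset.

# Toy illustrations of the OJ transformation styles (not taken from EquiBench itself).

def sum_to_n(n):                 # reference solution: iterative summation
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_to_n_renamed(k):         # OJ_V-style: same algorithm, variables renamed
    acc = 0
    for j in range(1, k + 1):
        acc += j
    return acc

def sum_to_n_closed_form(n):     # OJ_A-style: different algorithm, same behavior
    return n * (n + 1) // 2

# OJ_VA-style pairs combine both kinds of variation (renaming plus a different algorithm).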


Figure 5: Scaling Trend on EquiBench.


Supported Categories

EquiBench contains 2400 pairs of programs across six distinct categories of code equivalence tasks:

  • DCE (Dead Code Elimination for C programs): C code pairs that differ by the removal of dead code (equivalent) or live code (inequivalent), generated with a compiler's dead code elimination pass
  • x86-64 (Superoptimization for x86-64 programs): Assembly code pairs compiled from the same C function, one by a standard compiler and one by a superoptimizer
  • CUDA (Compiler Scheduling for CUDA programs): CUDA code pairs transformed by tensor compiler scheduling, such as shared memory tiling
  • OJ_A (Python Competitive Programming - Algorithm): Different algorithmic solutions to the same programming problem
  • OJ_V (Python Competitive Programming - Variable Renaming): Code pairs with variable renaming transformations
  • OJ_VA (Python Competitive Programming - Variables + Algorithms): Code pairs with both variable renaming and algorithmic differences

Each category contains 400 pairs of programs (200 equivalent and 200 inequivalent), providing a diverse range of challenges for LLMs to reason about code semantics.

Supported Prompt Types

EquiBench evaluates models using four different prompting strategies:

  • ZERO: Zero-shot prompting (directly asking the model without examples)
  • FEW: Few-shot prompting (providing example problems and solutions)
  • ZERO_COT: Zero-shot chain of thought (encouraging step-by-step reasoning)
  • FEW_COT: Few-shot chain of thought (examples with step-by-step reasoning)

Each strategy tests different aspects of a model's reasoning capabilities, from basic understanding to advanced reasoning chains.
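
The sketch below shows one way the four strategies might be assembled into prompts; the exact wording and few-shot example selection used by EquiBench may differ, so treat these templates as illustrative assumptions.

# Illustrative prompt templates for the four strategies (the wording is an
# assumption, not EquiBench's actual templates).

QUESTION = (
    "Are the following two programs functionally equivalent?\n"
    "Program 1:\n{p1}\n\nProgram 2:\n{p2}\n"
)
COT_SUFFIX = "Think step by step, then give a final answer: equivalent or inequivalent."
DIRECT_SUFFIX = "Answer with one word: equivalent or inequivalent."

def make_prompt(p1, p2, strategy, examples=()):
    """Build a prompt under one of ZERO, FEW, ZERO_COT, FEW_COT."""
    few_shot = "\n\n".join(examples) if strategy in ("FEW", "FEW_COT") else ""
    suffix = COT_SUFFIX if strategy.endswith("COT") else DIRECT_SUFFIX
    body = QUESTION.format(p1=p1, p2=p2)
    return "\n\n".join(part for part in (few_shot, body, suffix) if part)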

HuggingFace Dataset

The EquiBench dataset is hosted on HuggingFace as anjiangwei/EquiBench-Datasets.
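
A minimal loading sketch using the Hugging Face datasets library is shown below. The available configurations, splits, and field names are not assumed here; the snippet simply prints them, and the dataset card should be consulted for the authoritative schema.

# Requires: pip install datasets
from datasets import load_dataset

# Load EquiBench from the Hugging Face Hub.
# NOTE: split/config names and field names are not hard-coded below; consult the
# dataset card at huggingface.co/datasets/anjiangwei/EquiBench-Datasets.
ds = load_dataset("anjiangwei/EquiBench-Datasets")

print(ds)                      # inspect the available splits and columns
example = next(iter(ds[list(ds.keys())[0]]))
print(example.keys())          # discover the actual field names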

Dataset Statistics

Category Language Equivalent Pairs Inequivalent Pairs Total
DCE C 200 200 400
x86-64 x86-64 200 200 400
CUDA CUDA 200 200 400
OJ_A Python 200 200 400
OJ_V Python 200 200 400
OJ_VA Python 200 200 400
Total (all) 1200 1200 2400

Explore EquiBench!


BibTeX

@article{wei2025equibench,
  title={EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking},
  author={Wei, Anjiang and Cao, Jiannan and Li, Ran and Chen, Hongyu and Zhang, Yuhui and Wang, Ziheng and Sun, Yaofeng and Liu, Yuan and Teixeira, Thiago S. F. X. and Yang, Diyi and Wang, Ke and Aiken, Alex},
  journal={arXiv preprint arXiv:2502.12466},
  year={2025}
}