Model | DCE | CUDA | x86-64 | OJ_A | OJ_V | OJ_VA | Overall Accuracy |
---|---|---|---|---|---|---|---|
o3-mini-2025-01-31 | 68.8 | 59.0 | 84.5 | 84.2 | 88.2 | 83.2 | 78.0 |
DeepSeek-R1 | 52.2 | 61.0 | 78.2 | 79.8 | 91.5 | 78.0 | 73.5 |
o1-mini-2024-09-12 | 55.8 | 50.7 | 74.2 | 80.0 | 89.8 | 78.8 | 71.5 |
claude3.5-sonnet-2024-10-22 | 38.5 | 62.3 | 70.0 | 71.2 | 78.0 | 73.5 | 65.6 |
gpt-4o-2024-11-20 | 43.2 | 49.5 | 65.2 | 71.0 | 87.0 | 73.8 | 65.0 |
DeepSeek-V3 | 41.0 | 50.7 | 69.2 | 73.0 | 83.5 | 72.5 | 65.0 |
Llama-3.1-405B-Instruct-Turbo | 40.0 | 49.0 | 75.0 | 72.2 | 74.5 | 72.8 | 63.9 |
Qwen2.5-72B-Instruct-Turbo | 42.8 | 56.0 | 64.8 | 72.0 | 76.5 | 70.8 | 63.8 |
gpt-4o-mini-2024-07-18 | 46.8 | 50.2 | 56.8 | 64.5 | 91.2 | 64.0 | 62.2 |
Qwen2.5-7B-Instruct-Turbo | 50.5 | 49.2 | 58.0 | 62.0 | 80.8 | 63.0 | 60.6 |
QwQ-32B-Preview | 48.2 | 50.5 | 62.7 | 65.2 | 71.2 | 64.2 | 60.3 |
Llama-3.1-70B-Instruct-Turbo | 47.5 | 50.0 | 58.5 | 66.2 | 72.0 | 67.5 | 60.3 |
Mixtral-8x22B-Instruct-v0.1 | 46.8 | 49.0 | 62.7 | 63.5 | 76.0 | 62.7 | 60.1 |
Mixtral-8x7B-Instruct-v0.1 | 50.2 | 47.0 | 64.2 | 59.0 | 61.5 | 55.0 | 56.1 |
Mistral-7B-Instruct-v0.3 | 51.0 | 57.2 | 73.8 | 50.7 | 50.5 | 50.2 | 55.6 |
Llama-3.1-8B-Instruct-Turbo | 41.8 | 49.8 | 50.5 | 57.5 | 75.5 | 56.8 | 55.3 |
Llama-3.2-3B-Instruct-Turbo | 50.0 | 49.8 | 50.0 | 51.5 | 51.5 | 51.5 | 50.7 |
Random Baseline | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
Mean | 47.9 | 52.4 | 65.8 | 67.3 | 76.4 | 67.0 | 62.8 |
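The Overall Accuracy column is consistent with an unweighted mean of the six per-category accuracies, which makes sense given that every category contributes the same number of pairs; a quick check for the top row, assuming that aggregation:

```python
# Sanity check (assumption: Overall Accuracy is the unweighted mean of the
# six per-category accuracies, which all cover the same number of pairs).
o3_mini = [68.8, 59.0, 84.5, 84.2, 88.2, 83.2]  # DCE, CUDA, x86-64, OJ_A, OJ_V, OJ_VA
print(round(sum(o3_mini) / len(o3_mini), 1))    # 78.0, matching the table
```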
EquiBench is a comprehensive benchmark designed to evaluate the code reasoning capabilities of Large Language Models (LLMs) through equivalence checking tasks. This framework helps researchers and developers assess how well different LLMs understand code semantics, reason about program functionality, and determine when two code snippets are functionally equivalent despite syntactic differences.
- Diverse Test Cases: 2400 program pairs across six distinct categories (`DCE` for C programs, `x86-64` for x86-64 assembly programs, `CUDA` for CUDA programs, and `OJ_A`, `OJ_V`, `OJ_VA` for Python competitive programming problems)
- Multiple Prompting Strategies: Support for zero-shot, few-shot, and chain-of-thought variants to evaluate different reasoning approaches
- Wide Model Support: Compatible with leading LLMs from OpenAI, Anthropic, Meta, Mistral AI, Qwen, and DeepSeek
- Standardized Methodology: Consistent evaluation framework enabling fair comparison across different model architectures
Example Program Pairs and Results
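The pair below is a small, self-contained illustration of the task (it is not drawn from the EquiBench dataset): the first two Python functions are syntactically different but functionally equivalent, while the third is a near-identical inequivalent variant.

```python
# Illustrative only -- not taken from EquiBench.
def sum_squares_loop(n: int) -> int:
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total

def sum_squares_formula(n: int) -> int:
    # Closed form for 1^2 + 2^2 + ... + n^2: equivalent to the loop above
    return n * (n + 1) * (2 * n + 1) // 6

def sum_squares_off_by_one(n: int) -> int:
    # Inequivalent: range(1, n) stops at n - 1, so the n^2 term is dropped
    return sum(i * i for i in range(1, n))

assert all(sum_squares_loop(n) == sum_squares_formula(n) for n in range(50))
assert sum_squares_loop(3) != sum_squares_off_by_one(3)  # 14 vs 5
```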
EquiBench contains 2400 pairs of programs across six distinct categories of code equivalence tasks:
- DCE (Dead Code Elimination for C programs): Code pairs that differ by the removal of dead code (equivalent) or live code (inequivalent)
- x86-64 (Superoptimization for x86-64 programs): Assembly code pairs transformed by a superoptimizer
- CUDA (Compiler Scheduling for CUDA programs): CUDA kernel pairs produced by different compiler scheduling choices for tensor operations
- OJ_A (Python Competitive Programming - Algorithm): Different algorithmic solutions to the same programming problem
- OJ_V (Python Competitive Programming - Variable Renaming): Code pairs with variable renaming transformations
- OJ_VA (Python Competitive Programming - Variables + Algorithms): Code pairs with both variable renaming and algorithmic differences
Each category contains 400 pairs of programs (200 equivalent and 200 inequivalent), providing a diverse range of challenges for LLMs to reason about code semantics.
EquiBench evaluates models using four different prompting strategies:
- `ZERO`: Zero-shot prompting (directly asking the model without examples)
- `FEW`: Few-shot prompting (providing example problems and solutions)
- `ZERO_COT`: Zero-shot chain-of-thought (encouraging step-by-step reasoning)
- `FEW_COT`: Few-shot chain-of-thought (examples with step-by-step reasoning)
Each strategy tests different aspects of a model's reasoning capabilities, from basic understanding to advanced reasoning chains.
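As a rough illustration of the `ZERO` strategy, a minimal prompt builder might look like the sketch below; the exact wording used by EquiBench is not reproduced here, so treat the template as an assumption.

```python
# Hypothetical zero-shot prompt template -- the wording is an assumption,
# not EquiBench's actual prompt.
def build_zero_shot_prompt(program_a: str, program_b: str) -> str:
    return (
        "Determine whether the following two programs are semantically "
        "equivalent, i.e. they produce the same output on every valid input.\n\n"
        f"Program 1:\n{program_a}\n\n"
        f"Program 2:\n{program_b}\n\n"
        "Answer with exactly one word: equivalent or inequivalent."
    )
```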
The EquiBench dataset is hosted on Hugging Face as `anjiangwei/EquiBench-Datasets`.
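A minimal sketch of pulling the data with the `datasets` library is shown below; the configuration name and split layout are assumptions, so check the dataset card for the actual schema.

```python
# Assumes `pip install datasets`. The configuration name "DCE" and the split
# layout are assumptions -- consult the dataset card for the actual schema.
from datasets import load_dataset

ds = load_dataset("anjiangwei/EquiBench-Datasets", "DCE")
print(ds)  # inspect the splits and columns that are actually available
```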
Category | Language | Equivalent Pairs | Inequivalent Pairs | Total |
---|---|---|---|---|
DCE | C | 200 | 200 | 400 |
x86-64 | x86-64 | 200 | 200 | 400 |
CUDA | CUDA | 200 | 200 | 400 |
OJ_A | Python | 200 | 200 | 400 |
OJ_V | Python | 200 | 200 | 400 |
OJ_VA | Python | 200 | 200 | 400 |
Total | | 1200 | 1200 | 2400 |
Explore EquiBench!
BibTeX
@article{wei2025equibench,
title={EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking},
author={Wei, Anjiang and Cao, Jiannan and Li, Ran and Chen, Hongyu and Zhang, Yuhui and Wang, Ziheng and Sun, Yaofeng and Liu, Yuan and Teixeira, Thiago S. F. X. and Yang, Diyi and Wang, Ke and Aiken, Alex},
journal={arXiv preprint arXiv:2502.12466},
year={2025}
}