EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking

arXiv:2502.12466

Anjiang Wei\(^{1}\), Jiannan Cao\(^{2}\), Ran Li\(^{1, 3}\), Hongyu Chen\(^{4}\), Yuhui Zhang\(^{1}\), Ziheng Wang\(^{1}\), Yaofeng Sun\(^{5}\), Yuan Liu\(^{3}\), Thiago S. F. X. Teixeira\(^{6}\), Diyi Yang\(^{1}\), Ke Wang\(^{7}\), Alex Aiken\(^{1}\)
\(^1\)Stanford University \(^2\)MIT \(^3\)Google \(^4\)Nanjing University \(^5\)DeepSeek \(^6\)Intel \(^7\)Visa Research

Abstract

Equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs, underpins a broad range of applications, including software refactoring, testing, and optimization. We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models (LLMs). We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories. These pairs are systematically generated through program analysis, compiler scheduling, and superoptimization, covering nontrivial structural transformations that demand deep semantic reasoning beyond simple syntactic variations. Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%. In the most challenging categories, the best accuracies are 62.3% and 68.8%, only modestly above the 50% random baseline for binary classification, indicating significant room for improvement in current models’ code reasoning capabilities.


Leaderboard

All numbers are accuracy (%).

Model DCE CUDA x86-64 OJ_A OJ_V OJ_VA Overall
o3-mini-2025-01-31 68.8 59.0 84.5 84.2 88.2 83.2 78.0
DeepSeek-R1 52.2 61.0 78.2 79.8 91.5 78.0 73.5
o1-mini-2024-09-12 55.8 50.7 74.2 80.0 89.8 78.8 71.5
claude3.5-sonnet-2024-10-22 38.5 62.3 70.0 71.2 78.0 73.5 65.6
gpt-4o-2024-11-20 43.2 49.5 65.2 71.0 87.0 73.8 65.0
DeepSeek-V3 41.0 50.7 69.2 73.0 83.5 72.5 65.0
Llama-3.1-405B-Instruct-Turbo 40.0 49.0 75.0 72.2 74.5 72.8 63.9
Qwen2.5-72B-Instruct-Turbo 42.8 56.0 64.8 72.0 76.5 70.8 63.8
gpt-4o-mini-2024-07-18 46.8 50.2 56.8 64.5 91.2 64.0 62.2
Qwen2.5-7B-Instruct-Turbo 50.5 49.2 58.0 62.0 80.8 63.0 60.6
QwQ-32B-Preview 48.2 50.5 62.7 65.2 71.2 64.2 60.3
Llama-3.1-70B-Instruct-Turbo 47.5 50.0 58.5 66.2 72.0 67.5 60.3
Mixtral-8x22B-Instruct-v0.1 46.8 49.0 62.7 63.5 76.0 62.7 60.1
Mixtral-8x7B-Instruct-v0.1 50.2 47.0 64.2 59.0 61.5 55.0 56.1
Mistral-7B-Instruct-v0.3 51.0 57.2 73.8 50.7 50.5 50.2 55.6
Llama-3.1-8B-Instruct-Turbo 41.8 49.8 50.5 57.5 75.5 56.8 55.3
Llama-3.2-3B-Instruct-Turbo 50.0 49.8 50.0 51.5 51.5 51.5 50.7
Random Baseline 50.0 50.0 50.0 50.0 50.0 50.0 50.0
Mean 47.9 52.4 65.8 67.3 76.4 67.0 62.8

Overview

EquiBench is a comprehensive benchmark designed to evaluate the code reasoning capabilities of Large Language Models (LLMs) through equivalence checking tasks. This framework helps researchers and developers assess how well different LLMs understand code semantics, reason about program functionality, and determine when two code snippets are functionally equivalent despite syntactic differences.
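
At the task level, each EquiBench instance is a binary classification query: given two programs, the model must decide whether they are equivalent or inequivalent. The Python sketch below illustrates the shape of such a query; the prompt wording and the query_llm helper are hypothetical and do not reflect EquiBench's actual harness.

# Illustrative sketch of a single equivalence-checking query.
# The prompt wording and the query_llm helper are hypothetical, not EquiBench's API.

def build_prompt(program_1: str, program_2: str) -> str:
    """Ask the model to decide functional equivalence of two programs."""
    return (
        "Determine whether the following two programs are functionally "
        "equivalent, i.e., produce identical outputs for all possible inputs.\n\n"
        f"Program 1:\n{program_1}\n\n"
        f"Program 2:\n{program_2}\n\n"
        "Answer with exactly one word: 'equivalent' or 'inequivalent'."
    )

def parse_answer(response: str) -> bool:
    """Map a free-form model response to a binary equivalence verdict."""
    text = response.strip().lower()
    return "inequivalent" not in text and "equivalent" in text

# Example usage with a hypothetical query_llm(prompt) -> str function:
# prediction = parse_answer(query_llm(build_prompt(pair["program_1"], pair["program_2"])))
# correct = (prediction == (pair["label"] == "equivalent"))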

Key Features

  • Diverse Test Cases: 2400 program pairs, half equivalent and half inequivalent, across six distinct categories (DCE for C programs, x86-64 for assembly programs, CUDA for CUDA programs, and OJ_A, OJ_V, OJ_VA for Python competitive programming problems)
  • Multiple Prompting Strategies: Support for zero-shot, few-shot, and chain-of-thought variations to evaluate different reasoning approaches
  • Wide Model Support: Compatible with leading LLMs from OpenAI, Anthropic, Meta, Mistral AI, Qwen, and DeepSeek
  • Standardized Methodology: Consistent evaluation framework enabling fair comparison across different model architectures

Example Program Pairs and Results

Figure 1: An inequivalent pair from the DCE category in EquiBench. In the left program, c = 1 is dead code and has no effect on the program state, whereas in the right program, it is executed and alters the program state. Such cases are generated using the Dead Code Elimination (DCE) pass in compilers.
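
To make the dead/live distinction concrete without the figure, here is a minimal Python sketch of the same idea. It is only an illustration: the actual DCE pairs in EquiBench are C programs, and this toy pair is not drawn from the dataset.

# Toy illustration (not from the dataset; EquiBench's DCE pairs are C programs).

def original(x):
    c = 1          # dead code: c is never read afterwards
    return x + 2

def transformed(x):
    return x + 2   # removing the dead assignment preserves behavior -> equivalent pair

def original_live(x):
    c = 1          # live code: c affects the return value
    return x + c + 2

def transformed_live(x):
    return x + 2   # removing live code changes behavior -> inequivalent pair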


Figure 2: An equivalent pair from the CUDA category in EquiBench. Both programs perform matrix-vector multiplication (\(y = Ax\)). The right-hand program uses shared memory tiling to improve performance. Tensor compilers are used to explore different scheduling strategies, automating the generation of such pairs.


Figure 3: An equivalent pair from the x86-64 category in EquiBench. Both programs are compiled from the same C function shown above—the left using a compiler and the right using a superoptimizer. The function counts the number of set bits in the input %rdi register and stores the result in %rax. Their equivalence has been formally verified by the superoptimizer.


Figure 4: Equivalent pairs from the OJ_A, OJ_V, OJ_VA categories in EquiBench. OJ_A pairs demonstrate algorithmic equivalence, OJ_V pairs involve variable renaming transformations, and OJ_VA pairs combine both types of variations.
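
To illustrate these transformation styles, consider the toy Python functions below; they are illustrative only and are not pairs taken from the dataset.

# Toy illustrations of the OJ transformation styles (not taken from EquiBench itself).

def sum_to_n(n):                 # reference solution: iterative summation
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_to_n_renamed(k):         # OJ_V-style: same algorithm, variables renamed
    acc = 0
    for j in range(1, k + 1):
        acc += j
    return acc

def sum_to_n_closed_form(n):     # OJ_A-style: different algorithm, same behavior
    return n * (n + 1) // 2

# OJ_VA-style pairs combine both kinds of variation (renaming plus a different algorithm).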


Figure 5: Scaling Trend on EquiBench.


Supported Categories

EquiBench contains 2400 pairs of programs across six distinct categories of code equivalence tasks:

  • DCE (Dead Code Elimination for C programs): C code pairs that differ by the removal of dead code (equivalent) or live code (inequivalent), generated with a compiler's dead code elimination pass
  • x86-64 (Superoptimization for x86-64 programs): Assembly code pairs compiled from the same C function, one by a standard compiler and one by a superoptimizer
  • CUDA (Compiler Scheduling for CUDA programs): CUDA code pairs transformed by tensor compiler scheduling, such as shared memory tiling
  • OJ_A (Python Competitive Programming - Algorithm): Different algorithmic solutions to the same programming problem
  • OJ_V (Python Competitive Programming - Variable Renaming): Code pairs with variable renaming transformations
  • OJ_VA (Python Competitive Programming - Variables + Algorithms): Code pairs with both variable renaming and algorithmic differences

Each category contains 400 pairs of programs (200 equivalent and 200 inequivalent), providing a diverse range of challenges for LLMs to reason about code semantics.

Supported Prompt Types

EquiBench evaluates models using four different prompting strategies:

  • ZERO: Zero-shot prompting (directly asking the model without examples)
  • FEW: Few-shot prompting (providing example problems and solutions)
  • ZERO_COT: Zero-shot chain of thought (encouraging step-by-step reasoning)
  • FEW_COT: Few-shot chain of thought (examples with step-by-step reasoning)

Each strategy tests different aspects of a model's reasoning capabilities, from basic understanding to advanced reasoning chains.
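
The sketch below shows one way the four strategies might be assembled into prompts; the exact wording and few-shot example selection used by EquiBench may differ, so treat these templates as illustrative assumptions.

# Illustrative prompt templates for the four strategies (the wording is an
# assumption, not EquiBench's actual templates).

QUESTION = (
    "Are the following two programs functionally equivalent?\n"
    "Program 1:\n{p1}\n\nProgram 2:\n{p2}\n"
)
COT_SUFFIX = "Think step by step, then give a final answer: equivalent or inequivalent."
DIRECT_SUFFIX = "Answer with one word: equivalent or inequivalent."

def make_prompt(p1, p2, strategy, examples=()):
    """Build a prompt under one of ZERO, FEW, ZERO_COT, FEW_COT."""
    few_shot = "\n\n".join(examples) if strategy in ("FEW", "FEW_COT") else ""
    suffix = COT_SUFFIX if strategy.endswith("COT") else DIRECT_SUFFIX
    body = QUESTION.format(p1=p1, p2=p2)
    return "\n\n".join(part for part in (few_shot, body, suffix) if part)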

HuggingFace Dataset

The EquiBench dataset is hosted on HuggingFace as anjiangwei/EquiBench-Datasets.
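
A minimal loading sketch using the Hugging Face datasets library is shown below. The available configurations, splits, and field names are not assumed here; the snippet simply prints them, and the dataset card should be consulted for the authoritative schema.

# Requires: pip install datasets
from datasets import load_dataset

# Load EquiBench from the Hugging Face Hub.
# NOTE: split/config names and field names are not hard-coded below; consult the
# dataset card at huggingface.co/datasets/anjiangwei/EquiBench-Datasets.
ds = load_dataset("anjiangwei/EquiBench-Datasets")

print(ds)                      # inspect the available splits and columns
example = next(iter(ds[list(ds.keys())[0]]))
print(example.keys())          # discover the actual field names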

Dataset Statistics

Category Language Equivalent Pairs Inequivalent Pairs Total
DCE C 200 200 400
x86-64 x86-64 200 200 400
CUDA CUDA 200 200 400
OJ_A Python 200 200 400
OJ_V Python 200 200 400
OJ_VA Python 200 200 400
Total (all) 1200 1200 2400

Explore EquiBench!


BibTeX

@article{wei2025equibench,
  title={EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking},
  author={Wei, Anjiang and Cao, Jiannan and Li, Ran and Chen, Hongyu and Zhang, Yuhui and Wang, Ziheng and Sun, Yaofeng and Liu, Yuan and Teixeira, Thiago S. F. X. and Yang, Diyi and Wang, Ke and Aiken, Alex},
  journal={arXiv preprint arXiv:2502.12466},
  year={2025}
}