SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

arXiv:2505.14615

Anjiang Wei\(^{1*}\), Yuheng Wu\(^{1*}\), Yingjia Wan\(^{2}\), Tarun Suresh\(^{3}\), Huanmi Tan\(^{4}\), Zhanke Zhou\(^{1}\), Sanmi Koyejo\(^{1}\), Ke Wang\(^{5}\), Alex Aiken\(^{1}\)
* Equal contribution
\(^1\)Stanford University \(^2\)UCLA \(^3\)UIUC \(^4\)CMU \(^5\)Nanjing University

Abstract

We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference rule-based reasoning, which often involves deducing conclusions from a set of premises, our approach leverages the search-based nature of SAT problems, where the objective is to find a solution that fulfills a specified set of logical constraints. Each instance in SATBench is generated from a SAT formula, then translated into a story context and conditions using LLMs. The generation process is fully automated and allows for adjustable difficulty by varying the number of clauses. All 2100 puzzles are validated through both LLM-assisted and solver-based consistency checks, with human validation on a subset. Experimental results show that even the strongest model, o4-mini, achieves only 65.0% accuracy on hard UNSAT problems, close to the random baseline of 50%. SATBench exposes fundamental limitations in the search-based logical reasoning abilities of current LLMs and provides a scalable testbed for future research in logical reasoning.
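
To make the search-based setting concrete, the sketch below samples a random CNF formula and labels it with an off-the-shelf solver, mirroring the first step of the SATBench pipeline. It is a minimal illustration, not the benchmark's actual generation code; the use of PySAT, the function name, and the default counts are our own assumptions (the counts are chosen near the dataset averages in Table 2).

# Minimal sketch (not the benchmark's generation code): sample a random
# 3-CNF formula and label it with a SAT solver via the PySAT library.
# The default counts below are illustrative, chosen near the Table 2 averages.
import random
from pysat.solvers import Glucose3

def sample_cnf(num_vars=36, num_clauses=20, k=3):
    """Return a random k-CNF formula as a list of signed-integer clauses."""
    clauses = []
    for _ in range(num_clauses):
        chosen = random.sample(range(1, num_vars + 1), k)
        clauses.append([v if random.random() < 0.5 else -v for v in chosen])
    return clauses

formula = sample_cnf()
with Glucose3(bootstrap_with=formula) as solver:
    if solver.solve():
        print("SAT", solver.get_model())  # a satisfying assignment exists
    else:
        print("UNSAT")                    # no assignment satisfies all clauses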


Leaderboard

Model                   SAT                   UNSAT                 Overall               Avg.
                        Easy   Medium  Hard   Easy   Medium  Hard   Easy   Medium  Hard
Random Baseline         50.0   50.0    50.0   50.0   50.0    50.0   50.0   50.0    50.0   50.0
LLaMA3.1-8B             57.9   60.0    48.9   30.4   14.8    17.5   44.1   37.4    33.2   38.2
DeepSeek-Distill-7B     63.9   27.6    16.8   69.1   43.8    42.1   66.5   35.7    29.5   43.9
Qwen3-1.7B              77.1   65.7    53.2   53.4   30.5    42.5   65.3   48.1    47.9   53.7
gpt-4o-mini             82.1   82.4    90.7   42.3   12.9    13.2   62.2   47.6    52.0   53.9
LLaMA4-Scout            84.3   76.7    66.4   52.0   24.3    37.5   68.1   50.5    52.0   56.9
LLaMA3.1-70B            82.0   55.7    45.4   55.2   59.0    48.9   68.6   57.4    47.1   57.7
gpt-4o                  85.5   83.3    78.6   54.3   27.1    18.9   69.9   55.2    48.8   58.0
LLaMA3.3-70B            90.7   89.0    75.7   39.5   27.1    30.0   65.1   58.1    52.9   58.7
DeepSeek-Distill-14B    82.9   51.4    41.1   85.7   59.0    51.8   84.3   55.2    46.4   62.0
LLaMA4-Maverick         80.2   86.2    86.1   76.8   25.7    17.9   78.5   56.0    52.0   62.1
Qwen3-4B                84.1   78.1    78.6   80.7   31.9    22.1   82.4   55.0    50.4   62.6
Qwen3-8B                82.7   76.7    67.5   81.6   34.8    32.1   82.1   55.7    49.8   62.6
DeepSeek-Distill-32B    84.5   53.8    42.1   90.0   68.1    58.6   87.2   61.0    50.4   66.2
Qwen3-14B               87.1   72.9    80.0   88.9   47.6    22.1   88.0   60.2    51.1   66.4
Qwen3-235B-Int8         90.0   83.3    83.2   86.1   46.2    19.6   88.0   64.8    51.4   68.1
Qwen-QwQ-32B            92.5   75.7    59.3   84.1   51.9    46.4   88.3   63.8    52.9   68.3
Claude-3.7-Sonnet       88.4   77.6    83.6   93.8   63.3    42.1   91.1   70.5    62.9   74.8
DeepSeek-V3             93.6   83.8    71.4   97.5   83.3    74.3   95.5   83.6    72.9   84.0
DeepSeek-R1             94.8   87.1    73.6   98.2   89.5    83.6   96.5   88.3    78.6   87.8
o4-mini                 97.0   96.7    91.1   98.2   88.1    65.0   97.6   92.4    78.0   89.3
Average                 84.1   73.2    66.7   72.9   46.4    39.3   78.5   59.8    53.0   63.8
Table 1: Model accuracy on SATBench using zero-shot prompting for satisfiability prediction. Difficulty levels are categorized as follows: Easy (4-19 clauses), Medium (20-30 clauses), and Hard (31-50 clauses). All open-source models are instruction-tuned.
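
The difficulty buckets in Table 1 are determined purely by clause count; a small sketch of the binning (the function name is ours) is shown below.

def difficulty(num_clauses):
    """Map a puzzle's clause count to the Table 1 difficulty bucket."""
    if 4 <= num_clauses <= 19:
        return "Easy"
    if 20 <= num_clauses <= 30:
        return "Medium"
    if 31 <= num_clauses <= 50:
        return "Hard"
    raise ValueError("clause count outside the benchmark's range")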


Overview

SATBench Overview
Figure 1: Overview of the SATBench methodology. The generation pipeline begins by sampling Conjunctive Normal Form (CNF) formulas, followed by LLM-driven creation of story backgrounds and conditions. To ensure the quality of each logical puzzle, both LLM-assisted and solver-based consistency validations are employed. The evaluation pipeline then examines the prediction outcome on each puzzle and checks the model's reasoning process.
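
One way to read the solver-based consistency check is as an equivalence test between the sampled formula and a formula recovered from the generated conditions. The sketch below assumes the puzzle's conditions have already been parsed back into clauses over the same variable numbering and checks mutual entailment clause by clause with PySAT; the exact procedure used in SATBench may differ.

# Sketch of a solver-based consistency check, assuming the generated puzzle
# has been parsed back into clauses over the same variable numbering as the
# original formula. Two CNF formulas are equivalent iff each entails every
# clause of the other; each entailment query is one UNSAT call.
from pysat.solvers import Glucose3

def entails(formula, clause):
    """True iff formula AND NOT(clause) is unsatisfiable."""
    with Glucose3(bootstrap_with=formula) as solver:
        return not solver.solve(assumptions=[-lit for lit in clause])

def equivalent(f1, f2):
    """Check mutual entailment clause by clause."""
    return all(entails(f1, c) for c in f2) and all(entails(f2, c) for c in f1)

original  = [[1, -2], [2, 3]]
recovered = [[2, 3], [-2, 1]]
print(equivalent(original, recovered))  # True: same constraints, reordered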


Dataset

Metric Value
Number of Instances 2100
Average Number of Variables 36.0
Average Number of Clauses 20.6
Average Number of Words 546.2
Average Number of Sentences 55.2
Table 2: Dataset statistics for SATBench.


Benchmark Curation Pipeline Example

Example
Figure 2: Benchmark curation pipeline example. The process starts with sampling SAT formulas, followed by using an LLM to generate variable mappings and a story background. Clauses in the formula are then translated into narrative conditions. Consistency between the original formula and the generated puzzle is ensured through both LLM-based and solver-based validation.
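
For illustration, a hypothetical variable mapping and a simple renderer show how a clause, i.e., a disjunction of literals, becomes an "at least one of the following holds" condition; the statements and helper below are invented for this sketch, not drawn from the benchmark.

# Hypothetical variable mapping and clause renderer; the statements are
# invented for illustration. In SATBench the mapping and story background
# are produced by an LLM, and each clause becomes one narrative condition.
mapping = {
    1: "Alice joins the hiking trip",
    2: "Bob brings the map",
    3: "Carol drives the van",
}

def render_clause(clause, mapping):
    """Render a clause (a disjunction of literals) as a natural-language condition."""
    parts = [
        mapping[abs(lit)] if lit > 0 else "it is not the case that " + mapping[abs(lit)]
        for lit in clause
    ]
    return "At least one of the following holds: " + "; ".join(parts) + "."

for clause in [[1, -2], [2, 3]]:
    print(render_clause(clause, mapping))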


Scaling Trend

Scaling trend on SATBench
Figure 3: Scaling trend on SATBench.


Difficulty Analysis

Difficulty Analysis
Figure 4: Difficulty analysis: impact of the number of clauses on accuracy.


Reasoning Trace Evaluation

Model                 SAT               UNSAT             Overall
                      Pred.   Trace     Pred.   Trace     Trace
Qwen-QwQ              75.5    52.3      60.7    52.4      52.4
Claude-3.7-Sonnet     83.2    47.4      66.4    61.1      54.2
DeepSeek-V3           82.9    65.7      85.0    71.1      68.4
o4-mini               94.7    74.6      83.6    74.1      74.4
DeepSeek-R1           85.2    73.8      90.3    82.1      78.0
Table 3: Accuracy of satisfiability prediction (Pred.) and reasoning trace evaluation (Trace).
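
For SAT instances, one natural way to score a reasoning trace is to extract the assignment the model claims and check it against the original formula. The sketch below assumes the assignment has already been parsed into a variable-to-Boolean map; the parsing step and the exact trace-evaluation protocol behind Table 3 are not shown here.

# Check whether an assignment extracted from a model's reasoning trace
# actually satisfies the original formula (SAT instances only). Parsing the
# assignment out of free-form text is assumed to have happened already.
def satisfies(formula, assignment):
    """True iff every clause contains a literal made true by the assignment."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )

formula = [[1, -2], [2, 3]]
assignment = {1: True, 2: False, 3: True}
print(satisfies(formula, assignment))  # True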


BibTeX

@article{wei2025satbench,
  title={SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas},
  author={Wei, Anjiang and Wu, Yuheng and Wan, Yingjia and Suresh, Tarun and Tan, Huanmi and Zhou, Zhanke and Koyejo, Sanmi and Wang, Ke and Aiken, Alex},
  journal={arXiv preprint arXiv:2505.14615},
  year={2025}
}