Each model works under a capped number of iterations, so it must make good use of compiler feedback and commit to decisive edits to produce a working engine before its cycles run out.
ChessBench
Language models are asked to write complete C++ UCI chess engines in a standardized agentic harness. The generated engines then play each other to climb the ELO leaderboard.
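As a rough illustration of what "UCI engine" implies, the sketch below shows the minimal command loop such an engine needs. It is an assumption-laden sketch, not the harness's reference code: `run_uci_loop`, `pick_move`, and the engine name are hypothetical, and `pick_move` returns a fixed placeholder move where a real engine would search the current position.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical placeholder: a real engine would search the current position.
// A fixed move is returned here only to show the protocol flow.
std::string pick_move() { return "e2e4"; }

// Reads UCI commands from `in` and writes replies to `out` until "quit".
void run_uci_loop(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream tokens(line);
        std::string command;
        tokens >> command;
        if (command == "uci") {
            // Identify the engine, then signal that UCI mode is ready.
            out << "id name SketchEngine\n";
            out << "id author example\n";
            out << "uciok\n";
        } else if (command == "isready") {
            out << "readyok\n";
        } else if (command == "position") {
            // A real engine parses "startpos" or "fen ..." and "moves ..." here.
        } else if (command == "go") {
            out << "bestmove " << pick_move() << "\n";
        } else if (command == "quit") {
            break;
        }
        out.flush();  // The GUI or harness reads replies line by line.
    }
}
```

A `main` function would simply call `run_uci_loop(std::cin, std::cout)`. Everything that decides playing strength (board representation, legal move generation, search) sits behind the `position` and `go` branches.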
Engines: 12
Finished games: 4,761
Ply count: 131,773
ELO standings
Rank  Engine                  Developer  Score  Games  W    D    L
1     Stockfish               Stockfish  97.1%  681    652  19   10
2     Claude Opus 4.7         Anthropic  91.0%  680    604  29   47
3     Gemini 3.1 Pro Preview  Google     75.5%  681    465  98   118
4     Gemini 3 Flash Preview  Google     73.9%  681    464  78   139
5     GPT 5.4                 OpenAI     64.8%  681    383  116  182
6     GPT 5.5                 OpenAI     60.6%  681    331  164  186
7     Kimi K2.6               Moonshot   59.0%  680    346  110  224
8     Megalodon               Megalodon  57.2%  677    339  96   242
What ChessBench measures
Match play is unforgiving of logic errors. A single mistake in move generation or board-state handling leads straight to an illegal move and a large loss of hard-earned ELO.
Knowing the rules is not enough to win. The top engines are the ones that successfully combine sophisticated search heuristics with deep performance optimizations.
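The page does not name specific heuristics. As one common example of the kind of search technique meant here, the sketch below shows negamax with alpha-beta pruning on a toy game tree. The `Node` type and the `leaf`, `branch`, and `negamax` helpers are hypothetical and stand in for a real board and move generator.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Toy game tree. A leaf's `value` is its evaluation from the point of view
// of the side to move at that leaf.
struct Node {
    int value = 0;
    std::vector<Node> children;
};

Node leaf(int v) { Node n; n.value = v; return n; }
Node branch(std::vector<Node> kids) { Node n; n.children = std::move(kids); return n; }

constexpr int kInf = 1000000;  // Safe to negate, unlike INT_MIN.

// Negamax with alpha-beta pruning. `visited` counts the nodes examined so
// the effect of pruning can be observed.
int negamax(const Node& node, int alpha, int beta, int& visited) {
    ++visited;
    if (node.children.empty()) return node.value;
    int best = -kInf;
    for (const Node& child : node.children) {
        // The child's score is from the opponent's view, so negate it and
        // swap the search window.
        const int score = -negamax(child, -beta, -alpha, visited);
        best = std::max(best, score);
        alpha = std::max(alpha, best);
        if (alpha >= beta) break;  // The opponent will avoid this line; skip the rest.
    }
    return best;
}
```

For the tree `branch({branch({leaf(3), leaf(5)}), branch({leaf(2), leaf(9)})})`, calling `negamax(root, -kInf, kInf, visited)` returns 3 and visits 6 of the 7 nodes, because the leaf worth 9 is pruned. Real engines typically layer move ordering, transposition tables, and similar refinements on top of this core.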