ChessBench

Language models are asked to write complete C++ UCI chess engines in a standardized agentic harness. The generated engines then play each other to climb the ELO leaderboard.
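To make "UCI engine" concrete, here is a minimal sketch of the command loop such an engine needs. It covers only the basic handshake (`uci`, `isready`, `position`, `go`, `quit`) and is not taken from the ChessBench harness. The names `run_uci`, `MovePicker`, and `SketchEngine` are hypothetical, and move selection is delegated to a caller-supplied callback because a real move generator is far beyond a short example.

```cpp
#include <functional>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical callback: receives the most recent "position ..." line and
// returns a move in UCI coordinate notation (for example "e2e4").
using MovePicker = std::function<std::string(const std::string&)>;

// Reads UCI commands from `in` and writes replies to `out` until "quit"
// or end of input. Commands not listed here are silently ignored.
void run_uci(std::istream& in, std::ostream& out, const MovePicker& pick_move) {
    std::string line;
    std::string position = "position startpos";  // default until the GUI sets one
    while (std::getline(in, line)) {
        std::istringstream tokens(line);
        std::string cmd;
        tokens >> cmd;
        if (cmd == "uci") {
            // Identify the engine, then acknowledge UCI mode.
            out << "id name SketchEngine\n"   // placeholder name (assumption)
                << "id author unknown\n"
                << "uciok\n";
        } else if (cmd == "isready") {
            out << "readyok\n";
        } else if (cmd == "position") {
            position = line;  // remember the full line for the move picker
        } else if (cmd == "go") {
            // Search parameters after "go" are ignored in this sketch.
            out << "bestmove " << pick_move(position) << "\n";
        } else if (cmd == "quit") {
            break;
        }
        out.flush();  // the GUI waits on these replies, so flush every time
    }
}
```

A real engine would call `run_uci(std::cin, std::cout, picker)` from `main`, where `picker` parses the position and runs a search. Passing streams in keeps the loop testable without a chess GUI.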


Engines: 12

Finished games: 4,761

Ply count: 131,773

ELO standings

Updated May 11, 2026

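| Rank | Engine | Developer | ELO | Score | Games | W | D | L |
|---|---|---|---|---|---|---|---|---|
| 1 | Stockfish | Stockfish | 3,626 | 97.1% | 681 | 652 | 19 | 10 |
| 2 | Claude Opus 4.7 | Anthropic | 2,135 | 91.0% | 680 | 604 | 29 | 47 |
| 3 | Gemini 3.1 Pro Preview | Google | 1,646 | 75.5% | 681 | 465 | 98 | 118 |
| 4 | Gemini 3 Flash Preview | Google | 1,618 | 73.9% | 681 | 464 | 78 | 139 |
| 5 | GPT 5.4 | OpenAI | 1,465 | 64.8% | 681 | 383 | 116 | 182 |
| 6 | GPT 5.5 | OpenAI | 1,402 | 60.6% | 681 | 331 | 164 | 186 |
| 7 | Kimi K2.6 | Moonshot | 1,377 | 59.0% | 680 | 346 | 110 | 224 |
| 8 | Megalodon | Megalodon | 1,300 | 57.2% | 677 | 339 | 96 | 242 |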
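The percentage in each standings row is consistent with the usual chess score, where a win counts as one point and a draw as half a point. The sketch below computes that figure and, for context, the standard Elo expected-score formula. How ChessBench actually fits its ratings is not stated here, so `expected_score` should be read as the textbook logistic model and not as the site's method. Both function names are hypothetical.

```cpp
#include <cmath>

// Score percentage: wins count 1 point, draws 0.5, losses 0.
double score_percent(int wins, int draws, int losses) {
    const int games = wins + draws + losses;
    if (games == 0) return 0.0;  // avoid division by zero for unplayed engines
    return 100.0 * (wins + 0.5 * draws) / games;
}

// Textbook Elo expected score of player A against player B:
//   E_A = 1 / (1 + 10^((R_B - R_A) / 400))
// Assumption: shown for context only; the leaderboard's rating fit may differ.
double expected_score(double rating_a, double rating_b) {
    return 1.0 / (1.0 + std::pow(10.0, (rating_b - rating_a) / 400.0));
}
```

For example, `score_percent(604, 29, 47)` gives roughly 90.96, which matches the 91.0% shown for Claude Opus 4.7. Under the textbook model, a 400-point rating advantage corresponds to an expected score of about 0.91 per game.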

What ChessBench measures


Autonomous Efficiency

Models work under a capped number of iterations, so they must use compiler feedback effectively and make decisive edits to produce a working engine before the budget runs out.

Uncompromising Precision

The simulation does not forgive logic errors. A single mistake in move generation or board-state handling leads straight to an illegal move and a large drop in ELO.

Algorithmic Dominance

Knowing the rules is not enough to win. The top engines are the ones that successfully combine sophisticated search heuristics with deep performance optimizations.
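As one illustration of what a search heuristic looks like, the sketch below shows negamax with alpha-beta pruning on a toy game tree. Alpha-beta is a common technique in chess engines, but nothing here says which techniques the ranked engines actually use. The `Node` type, the `leaf` and `inner` helpers, and the `INF` bound are hypothetical stand-ins for a real board, move generator, and evaluation function.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical toy tree. A leaf's score is from the point of view of the
// side to move at that leaf; inner nodes only hold children.
struct Node {
    int score = 0;
    std::vector<Node> children;
};

Node leaf(int s) { Node n; n.score = s; return n; }
Node inner(std::vector<Node> kids) { Node n; n.children = std::move(kids); return n; }

constexpr int INF = 1000000;  // larger than any score in the toy tree

// Negamax with alpha-beta pruning. Returns the value of `n` for the side
// to move and counts visited nodes so the effect of pruning is visible.
int negamax(const Node& n, int alpha, int beta, long& visited) {
    ++visited;
    if (n.children.empty()) return n.score;
    int best = -INF;
    for (const Node& child : n.children) {
        // The opponent's best result is our worst, hence the negation and
        // the swapped, negated search window.
        const int value = -negamax(child, -beta, -alpha, visited);
        best = std::max(best, value);
        alpha = std::max(alpha, value);
        if (alpha >= beta) break;  // cutoff: the opponent will avoid this line
    }
    return best;
}
```

On a three-branch tree with leaves {3, 12, 8}, {2, 4, 6}, and {14, 5, 2}, calling `negamax(root, -INF, INF, visited)` returns 3 and visits 11 of the 13 nodes, because the second branch is abandoned after its first leaf. In a real engine the savings grow with depth and depend heavily on move ordering.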