Apr 30, 2025
Interesting paper from Cohere, I think this might cause a bit of a storm - basically it’s an investigation into biases towards closed-source model companies (OpenAI, Meta and Google DeepMind are named) in Chatbot Arena.
There are three ways that the proprietary shops are favoured:
There are private testing practices that mean these model providers can test multiple variants before public release, enabling selective disclosure of results.
Proprietary closed models are sampled at higher rates (they appear in more battles) and are removed from the arena less often than open-source/open-weights equivalents.
These two policies combine to create a large data asymmetry - the closed-source model providers get a lot more feedback/data to tune on than open-source providers. This creates a feedback loop (more data means bigger performance gains, but can also result in overfitting to arena-specific dynamics).
At an extreme, Meta submitted 27 private LLM variants in the lead-up to the Llama-4 release. Eesh.
Quite juicy/spicy, and obviously this only represents one side’s standpoint, but it reinforces that you can’t really trust any of the benchmark results that you see - the gold standard is to build your own private eval and periodically test. I wrote up a post previously on Karpathy’s method for doing this here.
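For a sense of what that looks like in practice, here’s a minimal sketch of a private eval harness: a handful of prompts you never publish, each with a simple pass/fail check, re-run against each new model. It assumes an OpenAI-compatible client; the model name, prompts and checks are placeholders, not a prescription.

```python
# Minimal private-eval sketch: keep the prompts out of any public dataset,
# run them periodically, and track the pass rate per model over time.
from openai import OpenAI

PRIVATE_EVALS = [
    # (prompt, pass/fail check on the response text) - placeholders
    ("What is 17 * 23? Answer with just the number.", lambda r: "391" in r),
    ("Name the HTTP status code for 'Not Found'.", lambda r: "404" in r),
]

def run_evals(model: str) -> float:
    client = OpenAI()
    passed = 0
    for prompt, check in PRIVATE_EVALS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        text = resp.choices[0].message.content or ""
        passed += check(text)
    return passed / len(PRIVATE_EVALS)

if __name__ == "__main__":
    # Re-run this whenever a new model or release lands.
    print(f"pass rate: {run_evals('gpt-4o-mini'):.0%}")
```

The point isn’t the specific checks - it’s that the questions stay private, so no lab can have tuned on them.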
Anyway, the paper is well worth reading and is hosted on arXiv here.
The Leaderboard Illusion