Do model leaderboards matter?

The inherent bias in measuring performance

By Doug Cook
31 Jul 2025

Every few weeks, a new leaderboard makes the rounds, with all the popular models jostling for the top spot across tasks like reasoning, coding, search, and multimodal perception.

The latest from LM Arena shows Gemini 2.5 Pro (June 5 preview) topping both the text and web dev charts, with Claude Opus and OpenAI’s GPT-4o not far behind.

It’s easy to get swept up in these rankings. Each score, each update feels like a moment in the ongoing AI Olympics. But what are these leaderboards actually telling us? And what do they miss?

[Image: AI model leaderboard showing rankings, scores, and votes, organized by model function: Text, WebDev, Vision, and Search]

The upside of model rankings

Leaderboards like LM Arena offer one clear benefit: comparability. When models are benchmarked across the same tasks and scored via blind head-to-head matchups, you get a clearer sense of where strengths lie. You can spot shifts, like Gemini’s rapid improvement in coding tasks or Claude’s growing dominance in long-context reasoning.
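
LM Arena's actual methodology is more involved (it fits a Bradley-Terry-style model with confidence intervals over a large pool of community votes), but the core idea of turning blind pairwise votes into a ranking can be sketched with a simple Elo-style update. This is an illustrative sketch only; the model names, vote data, and constants K and BASE below are made up, not LM Arena's real parameters.

```python
# Minimal sketch: turn blind head-to-head votes into a ranking with an
# Elo-style update. Names, votes, and constants are hypothetical.

from collections import defaultdict

K = 32          # update step size (assumed, for illustration)
BASE = 1000.0   # starting rating for every model

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed outcome of one matchup."""
    e_win = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_win)
    ratings[loser]  -= K * (1.0 - e_win)

# Hypothetical (winner, loser) pairs from blind side-by-side comparisons.
votes = [
    ("model-a", "model-b"),
    ("model-a", "model-c"),
    ("model-b", "model-c"),
    ("model-c", "model-a"),
]

ratings = defaultdict(lambda: BASE)
for winner, loser in votes:
    update(ratings, winner, loser)

for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Even this toy version shows why the rankings shift so often: a handful of new votes can reorder models whose ratings sit close together.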

They also democratize model evaluation. LM Arena’s community voting allows real users to decide which answers are better, bringing a useful human signal into a space dominated by technical metrics.

And for those building applications, these leaderboards act as a quick gut check: Which model is performing best today for my use case? Text generation? Web coding? Vision tasks? The snapshot view is handy for choosing the right engine under the hood.

[Image: Two charts of AI model performance: left, confidence intervals for model strength ratings; right, average win rate of each model against all others]

The limits of the leaderboard

But here’s the thing: a single score doesn’t equal user experience.

Real-world usage is messy, contextual, and shaped by far more than accuracy. Latency, reliability, UX integration, prompt architecture, and fine-tuning all matter more than raw scores in production settings. A model ranked #1 in abstract reasoning might underperform in your specific domain or be cost-prohibitive to scale.

Leaderboards also risk becoming optimization traps. When benchmarks become the goal, models are often tuned to “beat the test,” not to improve holistic user value. We’ve seen this before in web search, academia, even school testing: teach to the test, and you risk missing the forest for the trees.

And as impressive as these metrics are, they’re still task-based proxies for intelligence. They don’t reflect values like transparency, alignment, or creativity: the qualities that shape how we actually use and trust these tools.

[Image: Two heatmaps of AI model comparisons: left, win rates between model pairs; right, total battle counts for each pairing]

A better frame: purpose over points

So how should we think about these rankings?

Use them, but don’t worship them. They’re a useful starting point, not an endpoint. Let them inform your choice of tools, but not define your expectations. Ask deeper questions:

  • How does this model behave in your real-world workflows?
  • What kinds of user experiences is it enabling?
  • Where do its limitations show up in your specific context?
  • What are the inherent biases represented in the model?
  • Should you consider an open-source or locally hosted model?

The best model isn’t necessarily #1 on the chart. It’s the one that best serves your users, your product, and your values.

Got an idea or something to share? Subscribe to our newsletter and follow the conversation on LinkedIn!

Doug Cook

FOUNDER AND PRINCIPAL

Doug is the founder of thirteen23. When he’s not providing strategic creative leadership on our engagements, he can be found practicing the time-honored art of getting out of the way.