Do model leaderboards matter?

The inherent bias in measuring performance

By Doug Cook
31 Jul 2025

Every few weeks, a new leaderboard makes the rounds, with all the popular models jostling for the top spot across tasks like reasoning, coding, search, and multimodal perception.

The latest from LM Arena shows Gemini 2.5 Pro (June 5 preview) topping both the text and web dev charts, with Claude Opus and OpenAI’s GPT-4o not far behind.

It’s easy to get swept up in these rankings. Each score, each update feels like a moment in the ongoing AI Olympics. But what are these leaderboards actually telling us? And what do they miss?

[Image: AI model leaderboard showing rankings, scores, and votes, organized by model function: Text, WebDev, Vision, and Search]

The upside of model rankings

Leaderboards like LM Arena offer one clear benefit: comparability. When models are benchmarked across the same tasks and scored via blind head-to-head matchups, you get a clearer sense of where strengths lie. You can spot shifts, like Gemini’s rapid improvement in coding tasks or Claude’s growing dominance in long-context reasoning.
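
LM Arena's actual methodology is more involved (it fits a Bradley-Terry-style model with confidence intervals over a large pool of community votes), but the core idea of turning blind pairwise votes into a ranking can be sketched with a simple Elo-style update. This is an illustrative sketch only; the model names, vote data, and constants K and BASE below are made up, not LM Arena's real parameters.

```python
# Minimal sketch: turn blind head-to-head votes into a ranking with an
# Elo-style update. Names, votes, and constants are hypothetical.

from collections import defaultdict

K = 32          # update step size (assumed, for illustration)
BASE = 1000.0   # starting rating for every model

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed outcome of one matchup."""
    e_win = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_win)
    ratings[loser]  -= K * (1.0 - e_win)

# Hypothetical (winner, loser) pairs from blind side-by-side comparisons.
votes = [
    ("model-a", "model-b"),
    ("model-a", "model-c"),
    ("model-b", "model-c"),
    ("model-c", "model-a"),
]

ratings = defaultdict(lambda: BASE)
for winner, loser in votes:
    update(ratings, winner, loser)

for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Even this toy version shows why the rankings shift so often: a handful of new votes can reorder models whose ratings sit close together.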

They also democratize model evaluation. LM Arena’s community voting allows real users to decide which answers are better, bringing a useful human signal into a space dominated by technical metrics.

And for those building applications, these leaderboards act as a quick gut check: Which model is performing best today for my use case? Text generation? Web coding? Vision tasks? The snapshot view is handy for choosing the right engine under the hood.

[Image: Two charts of AI model performance: left, confidence intervals for model strength ratings; right, average win rate of each model against all others]

The limits of the leaderboard

But here’s the thing: a single score doesn’t equal user experience.

Real-world usage is messy, contextual, and shaped by far more than accuracy. Latency, reliability, UX integration, prompt architecture, and fine-tuning all matter more than raw scores in production settings. A model ranked #1 in abstract reasoning might underperform in your specific domain or be cost-prohibitive to scale.

Leaderboards also risk becoming optimization traps. When benchmarks become the goal, models are often tuned to “beat the test,” not to improve holistic user value. We’ve seen this before in web search, academia, even school testing: teach to the test, and you risk missing the forest for the trees.

And as impressive as these metrics are, they’re still task-based proxies for intelligence. They don’t reflect values like transparency, alignment, or creativity: the qualities that shape how we actually use and trust these tools.

[Image: Two heatmaps of AI model comparisons: left, win rates between model pairs; right, total battle counts for each pairing]

A better frame: purpose over points

So how should we think about these rankings?

Use them, but don’t worship them. They’re a useful starting point, not an endpoint. Let them inform your choice of tools, but not define your expectations. Ask deeper questions:

  • How does this model behave in your real-world workflows?
  • What kinds of user experiences is it enabling?
  • Where do its limitations show up in your specific context?
  • What are the inherent biases represented in the model?
  • Should you consider an open-source or locally hosted model?

The best model isn’t necessarily #1 on the chart. It’s the one that best serves your users, your product, and your values.

Got an idea or something to share? Subscribe to our newsletter and follow the conversation on LinkedIn!

Doug Cook

FOUNDER AND PRINCIPAL

Doug is the founder of thirteen23. When he’s not providing strategic creative leadership on our engagements, he can be found practicing the time-honored art of getting out of the way.