
AI Model Scoreboard v4 · Methodology

How the v4 scores are produced (high-level overview).

Last updated: 2025-12-14

This page explains, at a high level, how the scores on AI Model Scoreboard v4 are produced.

It is intentionally written to be:

  • Transparent enough for users to understand what the numbers roughly mean
  • Stable enough that we don’t change the rules every week
  • Specific enough to be useful, without exposing every internal detail or exact weight

The goal is to give you a mental model of how the scoreboard thinks about models – not a step-by-step recipe to “game” the ranking.

1. What is AMS v4 trying to measure?

AMS v4 is a comparative scoreboard for large language models.

We focus on three questions:

  1. “How strong is this model in real use?”
  2. “How safe and reliable is it to depend on?”
  3. “Is it a realistic choice for teams and developers to adopt?”

To answer these, v4 combines multiple public signals into a single score between 0 and 100.

The scoring is fully offline:

  • A private engine gathers data from public sources
  • Scores are computed in batches as “snapshots”
  • The public site only reads static JSON snapshots (public/data/v4/*.json)

There is no live API call to any vendor’s service when you load the website.
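To make the "static snapshot" idea concrete, here is a minimal sketch of how a site like this could load one. The file name and field names below are assumptions for illustration, not necessarily the real AMS v4 schema.

```ts
// Minimal sketch: the site reads a pre-built JSON snapshot, nothing else.
// File name and field names are illustrative assumptions, not the real schema.

interface SnapshotModel {
  id: string;                        // e.g. "vendor/model-name" (hypothetical)
  score: number;                     // final blended score, 0-100
  layer: "full" | "provisional";
}

interface Snapshot {
  generatedAt: string;               // ISO timestamp of when the snapshot was taken
  models: SnapshotModel[];
}

// No vendor API is called here; this is a plain static-file fetch.
async function loadSnapshot(): Promise<Snapshot> {
  const res = await fetch("/data/v4/snapshot.json"); // hypothetical file name
  if (!res.ok) {
    throw new Error(`Snapshot fetch failed: ${res.status}`);
  }
  return (await res.json()) as Snapshot;
}
```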

2. Which models are included?

AMS v4 only tracks models that meet a basic baseline:

  • The model is accessible to the public (API or SaaS)
  • There is enough public information to estimate performance and cost
  • The model is still maintained (not clearly abandoned)

Each model is assigned to one of three layers:

Full

Models with solid data across multiple dimensions. They are part of the main scoreboard.

Provisional

Models where data is incomplete, noisy, or in transition. They appear on the scoreboard, but parts of the score may rely on estimates.

Rejected

Models that are excluded from the main list. Reasons can include:

  • Extremely poor transparency
  • Repeated incidents or withdrawals
  • Clearly abandoned or no longer available
  • Not enough information to assign a responsible score

Rejected models are not shown on the main leaderboard, but the engine keeps track of them internally to avoid flip-flopping.
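For illustration only, the layer assignment could be represented as simply as the sketch below; the type and field names are hypothetical, not the engine's actual data model.

```ts
// Hypothetical sketch of the three layers; not the engine's actual data model.

type Layer = "full" | "provisional" | "rejected";

interface TrackedModel {
  id: string;
  layer: Layer;
  // Rejected models keep a recorded reason so the engine stays consistent
  // across snapshots instead of flip-flopping.
  rejectionReason?: string;
}

// Only Full and Provisional models appear on the public leaderboard.
function isListed(model: TrackedModel): boolean {
  return model.layer !== "rejected";
}
```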

3. The five score pillars

Each model is evaluated along five pillars. The final 0–100 score is a blend of these; performance has the largest impact.
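As a purely illustrative sketch of what a "blend" can look like, the snippet below combines pillar scores using placeholder weights. These weights are invented for the example; they are not the real AMS v4 weights.

```ts
// Purely illustrative blend. The weights are placeholders invented for this
// example; the real AMS v4 weights and formula are not published.

interface PillarScores {
  performance: number;           // each pillar assumed normalized to 0-100
  safetyReliability: number;
  adoptionSupport: number;
  opennessTransparency: number;
  costEfficiency: number;
}

const EXAMPLE_WEIGHTS: Record<keyof PillarScores, number> = {
  performance: 0.4,              // performance has the largest impact
  safetyReliability: 0.2,
  adoptionSupport: 0.15,
  opennessTransparency: 0.1,
  costEfficiency: 0.15,
};

function blend(scores: PillarScores, weights = EXAMPLE_WEIGHTS): number {
  const total = (Object.keys(weights) as (keyof PillarScores)[]).reduce(
    (sum, pillar) => sum + scores[pillar] * weights[pillar],
    0
  );
  // Keep the result inside the published 0-100 range.
  return Math.min(100, Math.max(0, total));
}
```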

We do not publish the exact formulas or weights, but the qualitative meaning is:

3.1 Performance

“How strong is this model when you actually use it?”

Signals include, for example:

  • Public benchmark results (reasoning, coding, general LLM evals)
  • Community leaderboards and head-to-head evals
  • Signs of real-world capability (if and where they’re available)

Higher scores mean:

  • The model tends to solve more tasks correctly
  • It behaves competitively against other current-generation models

3.2 Safety & Reliability

“Can you depend on this model not to break in bad ways?”

We look at things like:

  • Publicly documented safety measures
  • Known incidents, recalls, or major regressions
  • How vendors respond to issues and ship safety updates

A model moves downward in this pillar when:

  • Serious incidents are widely reported
  • The vendor quietly removes versions or ships unstable ones
  • There is clear evidence of poor handling of safety problems

3.3 Adoption & Support

“Is it realistic to use this model in a real project?”

Signals include:

  • Recent updates (how stale or fresh the model is)
  • SDKs, documentation, and developer experience
  • Reliability of status pages and infrastructure
  • Signs that the model is part of an active product, not a dead branch

Higher Adoption & Support means:

  • The model is being maintained
  • It is not just a one-off research drop
  • Developers have a reasonable chance of integrating and running it at scale

3.4 Openness & Transparency

“How much does the vendor actually tell you?”

We do not reward or punish models for being open-source vs closed-source. Instead, we focus on how clear and honest the documentation is:

  • Is there a model card or equivalent?
  • Is the training data at least described at a high level?
  • Are limitations, biases, and known issues discussed?
  • Are there public policies around data handling and usage?

Higher scores here mean:

  • You can tell what you are getting into before betting on the model
  • The vendor treats transparency as part of the product

3.5 Cost Efficiency

“Does the price roughly match what the model can do?”

We combine:

  • Token prices (input and output)
  • Very rough performance tiers
  • The idea that “same strength but cheaper” is usually better

We do not try to predict your exact bill. Instead, we give a relative sense of how expensive a given level of capability is.

Cheap but extremely weak models will not automatically rank high here.
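A minimal sketch of the underlying idea, with invented numbers: relate a blended token price to a coarse capability tier, so that "same strength but cheaper" comes out ahead while "cheap but weak" does not. The tier scale and the input/output price blend are assumptions for this example, not the actual AMS v4 formula.

```ts
// Rough sketch of the idea only: relate price to a coarse capability tier.
// The tier scale and the 25/75 input/output price blend are invented for
// this example, not the actual AMS v4 formula.

interface PricedModel {
  inputPricePerMTok: number;     // USD per million input tokens
  outputPricePerMTok: number;    // USD per million output tokens
  capabilityTier: 1 | 2 | 3 | 4; // 1 = weak ... 4 = frontier (very rough)
}

// Lower is better: dollars per unit of rough capability.
// "Same strength but cheaper" scores lower (better); a very cheap but very
// weak model does not automatically win, because its tier is low too.
function costPerCapability(m: PricedModel): number {
  const blendedPrice = 0.25 * m.inputPricePerMTok + 0.75 * m.outputPricePerMTok;
  return blendedPrice / m.capabilityTier;
}
```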

4. Where does the data come from?

AMS v4 uses only publicly available information, such as:

  • Official documentation and pricing pages
  • Public benchmark dashboards and eval suites
  • Vendor blog posts and changelogs
  • Publicly reported incidents and withdrawals
  • Widely cited community resources

We intentionally do not use private sources or scraped datasets.

When there is a conflict between sources, the engine errs on the side of being conservative (e.g. Provisional instead of Full).
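A tiny sketch of that conservative rule, under the assumption that conflicts are detected by comparing values reported by different sources; the shapes and tolerance value are illustrative only.

```ts
// Illustrative only: if public sources roughly agree on a signal, treat the
// data as solid; if they conflict, fall back to the Provisional layer.
// The shapes and the tolerance value are assumptions for this sketch.

interface SourcedValue {
  source: string; // e.g. an official docs page or a benchmark dashboard
  value: number;  // the same signal as reported by that source
}

function resolveLayer(
  readings: SourcedValue[],
  tolerance = 0.1
): "full" | "provisional" {
  if (readings.length === 0) return "provisional";
  const values = readings.map((r) => r.value);
  const spread = Math.max(...values) - Math.min(...values);
  return spread <= tolerance ? "full" : "provisional";
}
```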

5. How often are scores updated?

At the moment, updates are manual snapshots rather than a live feed:

  1. The private engine is run offline
  2. It reads the latest public data and internal bootstrap lists
  3. It writes static JSON files (the “snapshot”)
  4. The snapshot is copied into the public repository and deployed

This means:

  • Scores are not real-time
  • A model may have improved (or regressed) since the last snapshot
  • The Updated timestamp on the site reflects when the snapshot was taken, not when each individual data point changed

In the future, this process may be automated via scheduled jobs, but only after the methodology is stable.
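For illustration, the engine's final step might look roughly like the sketch below. The output path and field names are assumptions rather than the real pipeline, but it shows why the site's "Updated" timestamp reflects the snapshot time, not per-datapoint freshness.

```ts
// Illustrative final step of the offline engine: write one snapshot file.
// The output path and field names are assumptions, not the real pipeline.
import { writeFileSync } from "node:fs";

interface ScoredModel {
  id: string;
  score: number; // 0-100
  layer: "full" | "provisional";
}

function writeSnapshot(models: ScoredModel[], outPath: string): void {
  const snapshot = {
    // This single timestamp is what the site's "Updated" label reflects,
    // not the freshness of each individual data point.
    generatedAt: new Date().toISOString(),
    models,
  };
  writeFileSync(outPath, JSON.stringify(snapshot, null, 2));
}

// e.g. writeSnapshot(scored, "public/data/v4/snapshot.json");
```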

6. What the scores are – and are not

The AMS v4 scores are:

  • A compressed summary of multiple public signals
  • Opinionated by design (we chose what to care about and what to ignore)
  • Useful for getting a rough sense of the model landscape

The scores are not:

  • A guarantee that one model is “objectively best” for every use case
  • A replacement for your own evaluations
  • A ranking based on hype cycles, social media volume, or marketing

You should think of this site as:

“One carefully designed scoreboard, focused on a few specific criteria” rather than “the final word on all LLMs”.