Partnership · CodeSOTA × Xfaang

We don't just track the state of the art. We run it.

CodeSOTA is the open, dated benchmark registry Xfaang pays for — and is not allowed to touch. The wall between us is the most useful thing we built.

Kacper Wikiel · CodeSOTA, sponsored by Xfaang · Warsaw · 25 May 2026 · 8 min

Fig. 01 — The frontier, kept current. Solid bars are today's record; the dotted screen behind each is the result it replaced.

"State of the art" is the most abused phrase in artificial intelligence. It shows up on every launch slide and in every sales deck, almost always missing the two things that would let you check it: a date, and a source. Best — when? Measured how? Against whom? Reproduced by anyone, or asserted by the team that shipped the model and priced the API? Strip those away and "state of the art" stops being a measurement. It becomes a mood.

We got tired of the mood. So we built the instrument.

It is called CodeSOTA, and as of the April 2026 snapshot it holds 9,102 benchmark results across 163 models and 371 datasets, sorted into nine capability areas and 121 tasks. Every row carries a date and a source. None of them carry anyone's opinion. This piece is about an arrangement that confuses people when they first hear it: why an automation firm — Xfaang — pays to keep a public registry running that it is structurally forbidden from editing, and what that wall says about how we work.

§ 01 — The map, not the mood

The frontier moves; most people are quoting last year

The uncomfortable fact about this field is that the frontier moves faster than the slides describing it. A record set in spring is folklore by autumn. Benchmarks saturate and get quietly abandoned. A number screenshotted from a vendor chart tells you nothing about whether it was self-reported, cherry-picked, or run on a test set the model had already seen.

CodeSOTA treats the leaderboard as what it actually is — not a screenshot, but a graph of dated claims. Nine capability areas cover the whole working surface of modern AI: Language & Knowledge, Vision & Documents, Audio & Speech, Multimodal Media, Code & Software Engineering, Agents & Tool Use, Structured Data & Forecasting, Robotics & RL, and Science, Medicine & Industry. Each result resolves to a task, a dataset, a metric with its direction made explicit — accuracy climbs, word-error-rate falls — a model, a score, and a link back to the proof.
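A minimal sketch of what "a graph of dated claims" could look like in code, assuming all claims share one task, dataset, and metric. The field names, the `beats` and `frontier` helpers, and the sample values are illustrative assumptions, not the registry's actual schema or data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Result:
    """One dated claim. Field names are hypothetical, not the registry's schema."""
    task: str
    dataset: str
    metric: str
    higher_is_better: bool  # accuracy climbs (True); word-error-rate falls (False)
    model: str
    score: float
    reported: date
    proof_url: str


def beats(new: Result, old: Result) -> bool:
    """True if `new` improves on `old`, respecting the metric's direction."""
    if new.higher_is_better:
        return new.score > old.score
    return new.score < old.score


def frontier(results: list[Result]) -> tuple[Optional[Result], Optional[Result]]:
    """Walk claims in date order; return (current record, the record it replaced)."""
    current: Optional[Result] = None
    replaced: Optional[Result] = None
    for r in sorted(results, key=lambda r: r.reported):
        if current is None or beats(r, current):
            replaced, current = current, r
    return current, replaced


# Made-up word-error-rate claims, for illustration only.
claims = [
    Result("asr", "example-set", "wer", False, "model-a", 5.1, date(2025, 3, 1), "https://example.org/a"),
    Result("asr", "example-set", "wer", False, "model-b", 4.2, date(2025, 9, 1), "https://example.org/b"),
    Result("asr", "example-set", "wer", False, "model-c", 4.8, date(2026, 1, 15), "https://example.org/c"),
]
current, replaced = frontier(claims)
print(current.model, replaced.model)  # model-b model-a
```

Because the metric's direction is stored on the row, the later but worse `model-c` claim does not displace the record, which is the difference between a dated graph and a screenshot.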

9,102 results · 163 models · 371 datasets · 9 capability areas

A score without a date is a rumor with good posture.

§ 02 — The arrangement

The wall is the product

Fig. 02 — Two sides, one wall. CodeSOTA, the open record, sits on the left; Xfaang, the applied edge, sits on the right; a dotted integrity wall with a seal stands between them. Xfaang funds the compute and commissions runs; it cannot move a rank. Findings flow into shipped software; money funds the record but never the ranking.

Here is the part that makes people squint. Xfaang sponsors CodeSOTA — it pays for the compute, commissions benchmark runs, and leans on the findings in client work. Xfaang also has no ability whatsoever to change a ranking. The registry is single-author, built in public, and keeps a hard line between paid vendor work and the integrity of the public record. A vendor can buy a benchmark run. Nobody can buy a benchmark result.

That sounds like a constraint Xfaang tolerates. It is actually the asset Xfaang is buying. A ranking that can be bought tells you something about the buyer. A ranking that can't tells you something about the model. When we hand a client a recommendation, the value isn't that we like the model — it's that the number behind it survives someone looking it up.

The conflict of interest you can see is the one you can trust.

§ 03 — The receipts

A claim you can't fake

Fig. 03 — Anatomy of a claim you can't fake. The chain (task, dataset, metric, model, score, proof) is ordinary; the stamp underneath it (snapshot date, source tier, verification status, contamination flags) is the part most leaderboards leave off.

The discipline lives in what has to be true before a number is allowed to count. Every result in the registry carries:

- A date and a snapshot ID, so you know when the claim held.
- A source tier: paper, vendor/API-verified, or independently reproduced with a container hash.
- A verification status, where self-reported "claim-only" is visibly distinct from "verified."
- Contamination and saturation flags, surfaced rather than buried.
- A correction trail: wrong scores are fixed in place with a note, and misleading rows are retracted in the open, struck through with a reason.

This is the same standard we hold ourselves to before we put a component into anyone's system. A benchmark you can't date and can't source isn't evidence — it's marketing. CodeSOTA marks the difference on every row, including the rows that flatter no one.
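To make the stamp concrete, here is a rough sketch of how such a row-level record and a filter over it might be modeled. The class names, field names, enum values, snapshot ID format, and the `counts_as_evidence` policy are all assumptions for illustration; the registry's real schema and rules may differ.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional


class SourceTier(Enum):
    PAPER = "paper"
    VENDOR_VERIFIED = "vendor/API-verified"
    REPRODUCED = "reproduced"  # independently reproduced, with a container hash


class Status(Enum):
    CLAIM_ONLY = "claim-only"  # self-reported
    VERIFIED = "verified"


@dataclass
class Stamp:
    """Hypothetical per-row stamp: when, from where, and how far it was checked."""
    snapshot_id: str
    snapshot_date: date
    source_tier: SourceTier
    status: Status
    contaminated: bool = False
    saturated: bool = False
    retracted_reason: Optional[str] = None  # set when a row is struck through
    corrections: list[str] = field(default_factory=list)  # notes for in-place fixes


def counts_as_evidence(stamp: Stamp) -> bool:
    """Assumed policy: verified, not retracted, and no contamination flag."""
    return (
        stamp.status is Status.VERIFIED
        and stamp.retracted_reason is None
        and not stamp.contaminated
    )


# Illustrative rows only; the snapshot ID is a placeholder.
reproduced = Stamp("snap-example", date(2026, 4, 27), SourceTier.REPRODUCED, Status.VERIFIED)
self_reported = Stamp("snap-example", date(2026, 4, 27), SourceTier.PAPER, Status.CLAIM_ONLY)
print(counts_as_evidence(reproduced), counts_as_evidence(self_reported))  # True False
```

Keeping the retraction reason and correction notes on the row, rather than deleting the row, mirrors the idea that fixes happen in the open.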

§ 04 — The original work

We don't read the scores. We run them.

Fig. 04 — Where published evidence runs thin, the registry stops reading and starts measuring. Left: a blind A/B listening waveform with a spectral fingerprint. Right: a six-axis quality radar (naturalness, intelligibility, latency, cost, control, languages).

A registry that only collected other people's numbers would be a reading list. What makes it research is that, when the published evidence is thin, we run the measurements ourselves.

Voice is the clearest case. Speech quality refuses to fit inside a single mean opinion score (MOS), so the Speech hub pairs reported scores with original blind listening studies on identical hard text, broken out across naturalness, intelligibility, latency, controllability, cost and language coverage. The spectral fingerprints are published underneath the rankings, so a reader gets more than a position. Elsewhere the same instinct shows up in three forms: lineages that trace how attention moved from HumanEval to LiveCodeBench to SWE-bench Verified, so you know which benchmark is still the frontier and which is exhausted; an Academy whose deliverable is a registry row rather than a certificate (reproduce a published result, then beat it); and The Dispatch, a short note sent only when a tracked score actually moves.

Reading a leaderboard tells you who won. Running it tells you whether the win survives contact with your data.

§ 05 — The payoff

What the wall buys you

All of this would be an academic hobby if it stopped at the registry. It doesn't. CodeSOTA is API-first: one CORS-open, no-key endpoint returns the current, dated, sourced pick for any task — citable in a paper, callable from an agent, droppable into a build.

# no key · CORS-open · citable · callable from anything
curl "https://www.codesota.com/api/sota?task=ocr"

# → { "task": "ocr", "leader": "...", "metric": "accuracy",
#     "score": ..., "source_tier": "reproduced",
#     "snapshot": "2026-04-27", "runners_up": [ ... ] }

That endpoint is where the partnership cashes out. When Xfaang scopes a client feature — say, parsing a pile of messy documents — we don't reach for the model that trended on launch day. We query the registry, take the current leader for that kind of document, and wire it in with the evidence still attached. Independent research on one side of the wall; applied delivery on the other; a documented path between them.
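As a sketch of what "wiring it in with the evidence still attached" might involve, the snippet below fetches a pick and checks its stamp before use. It assumes the response has the shape shown in the example above; the `fetch_pick` and `check_pick` helpers, the 90-day freshness threshold, the accepted tier set, and the placeholder leader and score are hypothetical.

```python
import json
from datetime import date
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://www.codesota.com/api/sota"  # endpoint as shown in the article


def fetch_pick(task: str, timeout: float = 10.0) -> dict:
    """GET the current pick for a task. Needs network access; no API key."""
    with urlopen(f"{API}?{urlencode({'task': task})}", timeout=timeout) as resp:
        return json.load(resp)


def check_pick(pick: dict, today: date, max_age_days: int = 90,
               accepted_tiers: frozenset = frozenset({"reproduced"})) -> list:
    """Return reasons to distrust a pick; an empty list means it passes.

    The age threshold and accepted tiers are assumptions, not registry policy.
    """
    problems = []
    age = (today - date.fromisoformat(pick["snapshot"])).days
    if age > max_age_days:
        problems.append(f"snapshot is {age} days old")
    if pick["source_tier"] not in accepted_tiers:
        problems.append(f"source tier is {pick['source_tier']!r}")
    return problems


# Offline stand-in for a response; leader and score are placeholders, not real data.
sample = {
    "task": "ocr", "leader": "example-model", "metric": "accuracy",
    "score": 0.0, "source_tier": "reproduced",
    "snapshot": "2026-04-27", "runners_up": [],
}
print(check_pick(sample, today=date(2026, 5, 25)))  # []
```

In live use, `check_pick(fetch_pick("ocr"), today=date.today())` would run the same checks against the current record, so a stale or claim-only pick is caught before it reaches a build.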

So when Xfaang tells you a particular model is the right component for your product, the recommendation arrives with a snapshot date, a source tier, and — where it counts — a benchmark we ran ourselves. That is the standard. CodeSOTA is just where we keep it in the open.

State of the art is not a trophy you win once. It is a timestamp you have to keep earning.

CodeSOTA is where we keep the timestamps in public. Xfaang is where we put them to work. The wall in the middle isn't a limitation — it's the reason the number means anything at all.

Disclosure

Because the registry preaches disclosure: Xfaang funds CodeSOTA's compute and commissions some of its benchmark runs. It does not, and cannot, set or edit public rankings. Figures reflect the April 2026 snapshot and drift as the live record updates. Where a row reads "reproduced," we ran it; where it reads "claim-only," we didn't, and we say so.

See the frontier for yourself

Explore the open registry, hit the free API, or bring us a problem and we'll pick the state of the art for it — with receipts.

Explore CodeSOTA → · Work with Xfaang