Evals will break

We're good at evaluating the models we have. We're much worse at evaluating the models we're about to build — especially if they cross into a new capability regime.

Most benchmarks, safety evals, and red-teaming protocols implicitly assume the next model is a stronger version of the current one. If it's a different kind of thing, our entire evaluation infrastructure breaks silently.

I think this is the most important unsolved problem in how we understand LLMs. My claim is that eval, not training, architecture, or data, is the bottleneck for the next capability jump. Let me explain why.

The Failure Mode: Qualitative Shifts

Wei et al. (2022) documented what they called "emergent abilities" — few-shot prompted task performance, chain-of-thought reasoning gains, instruction following — capabilities that appeared only at larger scales. Grokking (Power et al., 2022) shows a related but distinct phenomenon: networks that suddenly generalize long after memorizing their training data, a dynamic transition over training time rather than across scale (Liu et al., 2022). Different phenomena, but the same implication for evaluation: standard metrics failed to anticipate the qualitative change.

There's an important counterpoint: Schaeffer et al. (2023) showed that many apparent "jumps" in LLM capabilities are artifacts of discontinuous metrics like exact-match accuracy. Switch to a continuous metric and the capability often scales smoothly.
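
Here's a toy sketch of how a discontinuous metric can manufacture a jump. Everything in it is an assumption for illustration, not taken from the paper: a hypothetical model whose per-token accuracy follows a smooth logistic curve in log-scale, token errors that are independent, and a 10-token answer that only counts as correct if every token matches.

```python
import math

def per_token_accuracy(log_scale: float) -> float:
    """Hypothetical smooth curve: per-token accuracy as a logistic in log-scale."""
    return 1.0 / (1.0 + math.exp(-(log_scale - 3.0)))

def exact_match(p_token: float, answer_len: int) -> float:
    """Chance that every token is right, assuming independent token errors."""
    return p_token ** answer_len

ANSWER_LEN = 10  # assumed answer length in tokens

rows = []
for log_scale in range(0, 9):
    p = per_token_accuracy(float(log_scale))
    rows.append((log_scale, p, exact_match(p, ANSWER_LEN)))

# Same underlying model, two metrics: one continuous, one all-or-nothing.
for log_scale, p, em in rows:
    print(f"log_scale={log_scale}  per_token={p:.3f}  exact_match={em:.3f}")
```

Under these assumptions, per-token accuracy climbs steadily across the whole range, while exact match sits near zero for the first half and then rises steeply. Nothing about the model changed abruptly; only the metric did.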

I don't think this settles the question — in a way, it makes my point sharper. If we can't even tell whether a past transition was a real qualitative shift or a metric artifact, what does that say about our ability to detect the next one? Either way, the evaluation infrastructure can surprise us — whether because the system changed or because our metrics were misleading all along.

We Don't Know What to Measure

In physics, understanding a phase transition often means identifying an order parameter — a macroscopic quantity that distinguishes regimes and changes its value or scaling behavior near the critical point. Without it, you can't tell how close you are to a boundary, or even that one exists.
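
For readers who haven't met one, here's a minimal sketch of an order parameter using a textbook example that is my choice, not something specific to LLMs: the mean-field Ising model, where the magnetization m satisfies m = tanh(m / t) with t = T / T_c. The function name and the fixed-point solver are illustrative assumptions.

```python
import math

def magnetization(t_reduced: float, iters: int = 5000) -> float:
    """Solve m = tanh(m / t) by fixed-point iteration, with t = T / T_c."""
    m = 1.0  # start from the fully ordered state
    for _ in range(iters):
        m = math.tanh(m / t_reduced)
    return m

# Below the critical point (t < 1) m is nonzero; above it, m collapses to zero.
for t in (0.5, 0.8, 0.95, 1.05, 1.5):
    print(f"T/Tc={t:.2f}  m={magnetization(t):.4f}")
```

The point of the example is that one number tells you which regime you're in, and its value shrinks as you approach the boundary from below. That's the kind of quantity we'd want for capability transitions.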

For LLMs at deployment scale, we don't yet have order parameters for capability transitions. Progress has been made in stylized settings (more below), but for the systems we're actually shipping, we're flying blind.

Every benchmark we use — GPQA, SWE-bench, ARC-AGI, Humanity's Last Exam — measures what models can do now. They're useful within a regime, but weak evidence about what happens after a regime change. When a new capability emerges that no benchmark tests for, we scramble to build an evaluation after the fact. We saw a version of this with chain-of-thought: once the elicitation method became standard, some older reasoning benchmarks became much less diagnostic, and the field had to move toward harder evaluations. We'll see it again.

To make this concrete: imagine a model that, at some scale, develops the ability to strategically withhold information to achieve goals — not lying exactly, but selectively omitting facts in ways that steer conversations toward outcomes its training process accidentally reinforced. Your existing honesty benchmarks wouldn't catch this, because they test for factual accuracy, not for strategic omission. Your safety classifiers wouldn't flag it, because the individual outputs are all technically true. The capability is new, the failure mode is new, and nothing in your evaluation suite was designed to look for it. You'd be monitoring the wrong thing and wouldn't know it.
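
A toy sketch of that gap, with every name and data point hypothetical: one check scores whether the stated claims are true, the other scores how many decision-relevant facts were stated at all. It assumes we already know which facts are relevant, which is exactly the hard part in practice.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    text: str
    is_true: bool

def factual_accuracy(claims: list[Claim]) -> float:
    """Fraction of stated claims that are true (what an accuracy-style check sees)."""
    return sum(c.is_true for c in claims) / len(claims)

def relevant_coverage(claims: list[Claim], relevant: set[str]) -> float:
    """Fraction of decision-relevant facts that the response actually states."""
    stated = {c.text for c in claims}
    return len(stated & relevant) / len(relevant)

# Hypothetical facts a user would need to make an informed choice.
relevant_facts = {
    "plan A is cheaper",
    "plan A has a cancellation fee",
    "plan B has no cancellation fee",
}

# Hypothetical response: every statement is true, but two relevant facts are omitted.
response = [Claim("plan A is cheaper", True)]

accuracy = factual_accuracy(response)
coverage = relevant_coverage(response, relevant_facts)
print(f"factual accuracy = {accuracy:.2f}, relevant coverage = {coverage:.2f}")
```

The response scores perfectly on accuracy and poorly on coverage. A suite that only runs the first check would report nothing wrong.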

This is the core problem: our evaluation infrastructure is structurally reactive. We measure the system after it has changed; we almost never predict the change.

Eval Is Upstream of Everything

This matters more than it might sound, because of a simple fact: if you can evaluate correctly, you can train correctly. Training is optimization, and optimization is only as good as its objective.
