Essay · 9 min read

Evaluating frontier LLMs for production

The single artifact that separates a production AI system you trust from a demo you hope works is the evaluation set. Everything else — the model choice, the prompt, the orchestration framework — is a tactical decision. The eval set is the contract.

Most teams skip it. They run a few examples by hand, see good outputs, and ship. Six weeks later they discover a failure mode their domain expert would have caught on day one. By then the system is in production and the rollback is expensive. This post describes what we wish more teams had done before the rollback.

Start with the labels, not the model

An evaluation set is a list of inputs paired with what the right output should have been. The labels are the work. The model is irrelevant until you have them.

For a contract triage workflow: fifty to a hundred contracts, each with the correct routing decision, the correct urgency tag, and the correct three-sentence summary. For a credit memo workflow: fifty source packets, each paired with the memo your senior analyst would have written. For a document extraction pipeline: fifty source documents, each with the correct structured extraction.

This is the part everyone wants to skip and nobody should. The labels encode what "correct" means for your business — and that meaning cannot be inferred by the model, the consultant, or the prompt. It comes from a domain expert sitting down for two days and writing down what the right answer was.
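
Concretely, a labeled example can be as simple as a record pairing the input with the expected output. Here is a minimal sketch for the contract triage case above; the field names are illustrative, not a prescribed schema:

    # One labeled example for a contract triage eval set.
    # Field names are illustrative. The point is that the expected output
    # was written down by a domain expert, not inferred after the fact.
    example = {
        "id": "contract-0042",
        "input": "full text of the contract ...",
        "expected": {
            "routing": "legal-review",
            "urgency": "high",
            "summary": "The three-sentence summary the expert would have written.",
        },
    }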

Pick the examples that matter

The wrong way to build an eval set is random sampling. You'll end up with mostly easy cases that any model handles, while the hard cases — the ones that decide whether the system ships — show up rarely or not at all.

The right way is stratified by failure mode. Before you collect a single example, ask the domain expert: "What are the five categories of failure that would matter most?" For each category, collect ten to twenty examples that would actually trigger that failure. The categories typically split into:

  • Examples the system must handle correctly because they are common
  • Examples that are rare but high-stakes (a regulator complaint, a customer churn trigger)
  • Examples that look easy but are subtly different (the lookalike that gets misclassified)
  • Examples at the edges of the input distribution (truncated documents, unusual formats, missing fields)
  • Adversarial examples (where someone is actively trying to manipulate the output)

One hundred examples chosen this way produce more useful signal than one thousand chosen randomly.
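
If each example carries a failure-category tag when it is collected, checking coverage stays mechanical. A minimal sketch, assuming tags along the lines of the five categories above and a floor of ten examples per category (both are illustrative choices):

    from collections import Counter

    # Category names and the per-category floor are illustrative.
    CATEGORIES = ["common", "rare-high-stakes", "lookalike", "edge-of-distribution", "adversarial"]
    MIN_PER_CATEGORY = 10

    def coverage_gaps(examples):
        """Return the failure categories that are still under-represented."""
        counts = Counter(ex["category"] for ex in examples)
        return [c for c in CATEGORIES if counts[c] < MIN_PER_CATEGORY]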

Score every output, not just the aggregate

An accuracy number on its own — "92% on the eval set" — is unactionable. It tells you nothing about which 8% failed and whether those failures matter.

The output of an evaluation run should be a table: every example, the expected output, the actual model output, a per-example score, and a category tag. The aggregate is a footnote. The table is the artifact.

When you read the table, sort by failure category. If all the failures cluster in one category, that's a prompt or a model problem you can fix. If they're spread evenly, you're at the model's ceiling and the only paths forward are a stronger model, a different decomposition, or a human-in-the-loop gate.
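
What that looks like in code is unglamorous: one row per example, written somewhere a human will actually read it. A sketch, assuming a scorer passed in as score_fn and a model wrapper passed in as model_fn (both are placeholders, since real scoring is task-specific):

    import csv
    import json

    def run_eval(examples, model_fn, score_fn):
        """Produce one row per example; the aggregate comes from the rows, not instead of them."""
        rows = []
        for ex in examples:
            actual = model_fn(ex["input"])
            rows.append({
                "id": ex["id"],
                "category": ex["category"],
                "expected": json.dumps(ex["expected"]),
                "actual": json.dumps(actual),
                "score": score_fn(ex["expected"], actual),
            })
        # Sort by category so failures cluster visibly when you read the table.
        rows.sort(key=lambda r: (r["category"], r["score"]))
        with open("eval_results.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
        return rows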

Measure cost and latency alongside accuracy

Production AI is a three-axis decision: accuracy, cost per call, latency. A system that's 99% accurate at $2 per call and 30 seconds is the wrong fit for a workflow that runs ten thousand times a day. A system that's 92% accurate at $0.02 per call and 800 milliseconds may be the right fit, depending on what the human-in-the-loop gate catches.

Every eval run should report all three. The right model is rarely the most accurate one — it's the most accurate one that fits the workflow's cost and latency budget.
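
In practice that means every row carries latency and cost alongside its score, and the run summary reports all three together. A sketch, assuming latency was timed per call and cost was derived from the API's token usage:

    import statistics

    def summarize_run(rows):
        """Each row is assumed to carry score, latency_s, and cost_usd for one example."""
        latencies = sorted(r["latency_s"] for r in rows)
        return {
            "accuracy": sum(r["score"] for r in rows) / len(rows),
            "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],  # rough p95, fine for small sets
            "cost_per_call_usd": statistics.mean(r["cost_usd"] for r in rows),
        }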

Run it on every change, automatically

Manual eval runs are eval theater. They happen at launch, then never again. Six months in, someone changes a prompt, a vendor updates the model silently, the orchestration framework gets upgraded — and accuracy quietly degrades while everyone assumes it's fine because no alert fired.

The eval set should be wired into the same CI pipeline as the code. Every commit that touches the AI workflow runs the evals. Every model upgrade triggers them. The pipeline blocks the deploy if accuracy drops below the threshold the team agreed on at launch. This is the most underrated discipline in production AI work.
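
The gate itself can be a single test that CI runs on every relevant commit. A sketch in pytest style, assuming the run_eval and summarize_run helpers from the earlier sketches plus a load_eval_set loader, all of which are placeholders for whatever your pipeline actually uses:

    # test_eval_gate.py -- run by CI on every commit that touches the AI workflow.
    ACCURACY_FLOOR = 0.90  # the threshold the team agreed on at launch

    def test_accuracy_does_not_regress():
        rows = run_eval(load_eval_set(), model_fn, score_fn)  # placeholder helpers
        summary = summarize_run(rows)
        assert summary["accuracy"] >= ACCURACY_FLOOR, (
            f"accuracy {summary['accuracy']:.1%} is below the agreed floor of {ACCURACY_FLOOR:.0%}"
        )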

Grow the set when production teaches you something

The eval set you ship with is wrong in at least one specific way: it doesn't cover a failure category that real users will discover. That's not a defect; it's a property of the work. Real-world data finds the gaps your design phase missed.

The discipline is: every time a real failure surfaces — a wrong routing decision, a hallucinated extraction, a memo with a factual error — the example gets added to the eval set with the correct label, before the fix ships. The fix is not "we updated the prompt"; the fix is "we updated the prompt and added five examples covering this failure mode and they now pass." Without the second half, you'll regress on the next iteration.
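
The mechanical half of that discipline is small. A sketch of appending a production failure to the eval set, with the label it should have had, before the fix ships (the record fields mirror the earlier example and are illustrative):

    import json

    def add_regression_example(eval_set_path, input_text, corrected_expected, category, incident_id):
        """Append one production failure, with its corrected label, to a JSONL eval set."""
        record = {
            "id": f"prod-{incident_id}",
            "input": input_text,
            "expected": corrected_expected,
            "category": category,
            "source": "production-failure",
        }
        with open(eval_set_path, "a") as f:
            f.write(json.dumps(record) + "\n")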

The honest version

Building a real eval set takes one to two weeks of a senior expert's time. Running it costs ten to fifty dollars per pass at frontier prices. Maintaining it requires discipline that most teams don't have at launch and have to build.

It is also the cheapest insurance you will ever buy on a production AI system. Every team we know that skipped it spent more cleaning up the consequences than they would have spent doing it right.


Written by Aiveris

Let's talk

Want help building an eval set?

We'll walk through one workflow you've been trying to evaluate, and leave you with a concrete view of what the eval set should cover — whether you hire us or not.

Book a call → Send a note