The temptation with LLM features is to demo first, ship next, evaluate later. The artefact that goes missing in that sequence is the only one that tells you whether your next prompt edit, model swap, or retrieval tweak made the product better or worse. Without evals, every change is a guess wearing a release note. This week: a practical three-layer scoring framework you can stand up in one afternoon.

Why teams ship without evals

Three reasons, in roughly the order they bite.

Writing evals feels slower than writing features, because for the first day it is. Picking a metric requires picking a definition of "good", and that requires a conversation that founders, PMs, and engineers would rather defer. And the demo passes the eyeball check on six examples, which feels like enough until the seventh user hits an edge case the eyeball missed.

The result is the silent regression. A prompt edit ships, the spot-check looks clean, and weeks later a customer flags that responses are now subtly worse on a class of queries nobody happened to test. The change is reverted by hand. The team learns nothing about which kinds of queries broke, because there was no measurement to point at.

Last week's matrix was about measuring vendors against your workload. This week's evals are the same idea applied to your product itself. The thing that gets measured is the thing that improves.

An eval is not a number. It is three layers.

An eval is a layered set of checks. Treat it as three layers stacked on top of one another, each cheap to add and each catching a class of failure the others miss.

Layer 1: deterministic checks. Format, schema, length, latency, tool-call shape. These are the checks that should never need an LLM to grade. If your API contract says the response is JSON with three keys, a deterministic check confirms it. If the answer must cite a source, a citation regex confirms it. Cheap, fast, and they catch the failures that embarrass you fastest.
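A minimal sketch of two such checks, using only the standard library. The three key names (answer, sources, confidence) and the bracketed-number citation style are illustrative assumptions, not part of any particular API; swap in your own contract.

```python
import json
import re

# Hypothetical contract: exactly these three top-level keys.
REQUIRED_KEYS = {"answer", "sources", "confidence"}
# Hypothetical citation style: bracketed numbers such as [1] or [12].
CITATION_RE = re.compile(r"\[\d+\]")


def has_required_keys(raw: str) -> bool:
    """True if raw parses as a JSON object with exactly the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data) == REQUIRED_KEYS


def has_citation(text: str) -> bool:
    """True if the text contains at least one bracketed citation."""
    return CITATION_RE.search(text) is not None
```

Both functions return plain booleans, so they slot into a pass-rate calculation without any conversion.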

Layer 2: reference checks. For tasks with a known correct answer, compare against it. Exact match for IDs, codes, and SQL; embedding cosine similarity or BERTScore for free-form text where wording can vary. These need a golden set, fifty to two hundred examples with their expected outputs, curated by the people who know what good looks like.
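One way a reference check might look, as a sketch. The toy_embed function below is a stand-in (lowercase word counts) so the example runs without a model; in practice you would pass a real embedding function. The 0.8 threshold and the allow_fuzzy flag are assumptions to tune against your own golden set.

```python
import math
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: lowercase word counts. Replace with a real model."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def exact_or_similar(
    actual: str,
    expected: str,
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.8,   # assumed cut-off, not a recommended value
    allow_fuzzy: bool = True,  # set False for IDs, codes, and SQL
) -> bool:
    """Exact match first; optionally fall back to cosine similarity."""
    if actual.strip() == expected.strip():
        return True
    if not allow_fuzzy:
        return False
    return cosine(embed(actual), embed(expected)) >= threshold
```

Keeping the exact-match path first means IDs and codes never pay for an embedding call, and the fuzzy path only applies where wording is allowed to vary.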

Layer 3: LLM-as-judge. For subjective quality (tone, helpfulness, faithfulness), a strong model grades the output against a rubric. Cheap to scale, only as good as its rubric, and correlated with truth rather than equal to it. Stack it on top of reference checks, not under them.

Three layers stacked, cheapest first. A failure at layer 1 means you never spend tokens on the LLM judge; a failure at layer 2 still gets graded by layer 3 for context. The aggregate is one pass rate per layer plus an overall delta, compared against the previous run.

The scoring framework, in fifteen lines

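# eval.py - score one example with stacked graders.
# The graders (is_valid_schema, exact_or_similar, llm_grade) live elsewhere.
def score(example, output):
    layers = {"deterministic": [], "reference": [], "judge": []}
    # Layer 1: cheap checks that never need a model.
    layers["deterministic"].append(is_valid_schema(output))
    layers["deterministic"].append(output["latency_ms"] < example["latency_budget"])
    # Layer 2: compare against the expected answer when one exists.
    if example.get("expected"):
        layers["reference"].append(exact_or_similar(output["text"], example["expected"]))
    # Layer 3: only spend judge tokens if every deterministic check passed.
    if example.get("rubric") and all(layers["deterministic"]):
        layers["judge"].append(llm_grade(output, example["rubric"]))
    # Per-layer pass rate; layers with no checks are left out.
    return {k: sum(v) / len(v) for k, v in layers.items() if v}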

Run this over the golden set, average per layer, weight the layers by how costly each failure mode is to your product, and compare against the previous run. Two numbers matter: the overall pass rate, and the delta from last run. The delta is where regressions live.
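The aggregation step could be sketched like this. The weights are hypothetical placeholders; the right values depend on how costly each failure mode is to your product. Layers that no example exercised are dropped and the remaining weights renormalised, which is one design choice among several.

```python
from typing import Dict, List, Optional

# Hypothetical weights: how costly a failure in each layer is to the product.
WEIGHTS = {"deterministic": 0.5, "reference": 0.3, "judge": 0.2}


def aggregate(per_example: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each layer's pass rate over the examples that exercised it."""
    totals: Dict[str, List[float]] = {}
    for scores in per_example:
        for layer, rate in scores.items():
            totals.setdefault(layer, []).append(rate)
    return {layer: sum(rates) / len(rates) for layer, rates in totals.items()}


def overall(layer_rates: Dict[str, float],
            weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted mean over the layers present, renormalising the weights."""
    present = {k: w for k, w in weights.items() if k in layer_rates}
    total = sum(present.values())
    return sum(layer_rates[k] * w for k, w in present.items()) / total if total else 0.0


def delta(current: float, previous: Optional[float]) -> Optional[float]:
    """Change since the last run; None on the first run, when there is no baseline."""
    return None if previous is None else current - previous
```

Each element of per_example is the dictionary returned by scoring one example, so the runner is a loop over the golden set followed by these three calls.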

The graders themselves are short. is_valid_schema is a pydantic check or a regex. exact_or_similar is == first, embedding cosine fallback for free-form. llm_grade asks a strong model to score the output against the rubric on a 1-to-5 scale, thresholds at four or above to return a boolean, and keeps the one-line reason for the audit trail. Booleans go in, pass rates come out, and the three layers stay comparable. None of the three graders is more than a screenful of code. The work is in curating the golden set, not in writing the runner.
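As an illustration of the judge described above, here is a sketch that takes the model call as a parameter. The ask_model callable, the prompt wording, and the two-line reply format are all assumptions; wire in whichever client you already use. An unparseable reply is treated as a failure rather than a silent pass.

```python
import re
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    passed: bool
    score: int
    reason: str


# Hypothetical prompt; the reply format is what the parser below expects.
PROMPT = (
    "Grade the output against the rubric on a 1-to-5 scale.\n"
    "Reply with exactly two lines:\nscore: <1-5>\nreason: <one line>\n\n"
    "Rubric:\n{rubric}\n\nOutput:\n{output}\n"
)
SCORE_RE = re.compile(r"score:\s*([1-5])\b", re.IGNORECASE)
REASON_RE = re.compile(r"reason:\s*(.+)", re.IGNORECASE)


def llm_grade(output: str, rubric: str,
              ask_model: Callable[[str], str], pass_at: int = 4) -> Verdict:
    """Ask the judge for a 1-to-5 score, threshold it, and keep the reason."""
    reply = ask_model(PROMPT.format(rubric=rubric, output=output))
    score_match = SCORE_RE.search(reply)
    if score_match is None:
        # A reply we cannot parse counts as a failure, never as a pass.
        return Verdict(False, 0, "judge reply could not be parsed")
    reason_match = REASON_RE.search(reply)
    score = int(score_match.group(1))
    reason = reason_match.group(1).strip() if reason_match else ""
    return Verdict(score >= pass_at, score, reason)
```

A runner would append verdict.passed to the judge layer and log verdict.reason alongside the example ID for the audit trail. Passing the model call in as a function also makes the grader testable with a canned reply.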

How a real eval loop runs

The eval is not a one-off script. It is a CI step that runs on every change that could affect output quality: a prompt edit, a model swap, a retrieval parameter change, a new tool added to the chain. The score is stored, compared to the baseline, and the merge is blocked if the regression crosses a threshold.

Every prompt or model change runs the golden set in CI. The per-example traces are what let an engineer fix a regression in minutes instead of days, because the failure is reproducible against a named example, not a vague user report.
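The gate itself can be very small. This sketch assumes each run is stored as a mapping from example ID to a score between 0 and 1, and that a drop of more than two points in overall pass rate should block the merge; both the storage shape and the threshold are assumptions.

```python
from typing import Dict, List

# Hypothetical threshold: block if the overall pass rate drops by more than 0.02.
MAX_DROP = 0.02


def regressions(current: Dict[str, float], baseline: Dict[str, float]) -> List[str]:
    """IDs of examples that scored lower than in the baseline run."""
    return sorted(ex for ex, s in current.items() if s < baseline.get(ex, 0.0))


def gate(current: Dict[str, float], baseline: Dict[str, float],
         max_drop: float = MAX_DROP) -> int:
    """Return a process exit code: 0 allows the merge, 1 blocks it."""
    if not current or not baseline:
        print("missing scores; blocking by default")
        return 1
    cur = sum(current.values()) / len(current)
    base = sum(baseline.values()) / len(baseline)
    drop = base - cur
    print(f"pass rate {cur:.3f} (baseline {base:.3f}, drop {drop:+.3f})")
    # Named examples are what make the failure reproducible.
    for example_id in regressions(current, baseline):
        print(f"  regressed: {example_id}")
    return 1 if drop > max_drop else 0
```

In CI, a thin wrapper would load the two score files and pass the return value of gate to sys.exit, so a non-zero code fails the job. Examples that are new in the current run have no baseline and are never reported as regressions.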

A short list of tools that already implement this loop, so you do not have to build it from scratch. OpenAI Evals (github.com/openai/evals) is the original open-source reference and still useful as a starting point, though active development has slowed. Inspect, from UK AISI, is the newer open framework and probably where the ecosystem is heading. Promptfoo, Ragas, and DeepEval are the other open options worth a look. LangSmith and Braintrust are managed services with hosted dashboards if you would rather buy than build. Pick one. Do not build your own grader runner before you have written any graders.

When to add evals, and what to skip

The lazy answer is "before your first user". The sharper answer is three specific moments.

Before the first prompt template that ships outside your team. Before any model swap, because the first model swap without evals is the day you discover the evals you wish you had. Before adding a second LLM call to a chain, because chains amplify regressions and a two-percent drop at each step compounds.
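The compounding is easy to check with arithmetic. The sketch below assumes each step in the chain fails independently, which is a simplification; real chains often have correlated failures.

```python
import math
from typing import Iterable


def chain_pass_rate(step_rates: Iterable[float]) -> float:
    """Pass rate of a whole chain, assuming steps fail independently."""
    return math.prod(step_rates)


# Two steps that each pass 98% of the time.
two_steps = round(chain_pass_rate([0.98, 0.98]), 4)   # 0.9604
# Five such steps.
five_steps = round(chain_pass_rate([0.98] * 5), 4)    # 0.9039
```

Under that assumption, a two-percent loss per step becomes roughly four percent across two steps and nearly ten percent across five.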

What to skip in v1: anything that takes more than one afternoon to set up. Twenty examples beat zero examples. Three deterministic checks beat one LLM judge. The golden set will grow from production failures over time. The goal of v1 is to catch the boring regressions that hide under your nose, not to win a benchmark.

Common mistakes

Four shortcuts that produce evals you cannot trust.

  1. Only LLM-as-judge. The quickest layer to bolt on and the easiest to overweight. It catches tone and helpfulness, misses schema and citation, and correlates with the judge's mood as much as the system's quality. Stack it on top of reference checks, not under them.

  2. Golden set is the team's favourite examples. The set looks great until you discover none of it matches what users actually ask. Pull half your golden set from real anonymised production queries the day you have them.

  3. No held-out test set. When you tune prompts against the eval, you overfit. Hold out twenty to thirty percent of the golden set, never look at it during prompt iteration, and use it only as the final check before shipping.

  4. No regression alert. An eval that runs and stores the score but does not block bad changes is a graph that gets ignored. Wire it to the merge button or to a Slack message, not to a dashboard alone.
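For the held-out set in mistake 3, one simple approach is to assign each example to a split by hashing its ID, so the assignment stays stable as the golden set grows. This is a sketch; the "id" field and the 25 percent default are assumptions.

```python
import hashlib
from typing import Dict, List, Tuple


def is_held_out(example_id: str, fraction: float = 0.25) -> bool:
    """Deterministically place an example in the held-out split by hashing its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def split(examples: List[Dict], fraction: float = 0.25) -> Tuple[List[Dict], List[Dict]]:
    """Return (tuning set, held-out set); each example needs a stable 'id'."""
    tune: List[Dict] = []
    held: List[Dict] = []
    for ex in examples:
        (held if is_held_out(ex["id"], fraction) else tune).append(ex)
    return tune, held
```

Because the split depends only on the ID, an example never migrates from held-out to tuning when new examples are added, which a random shuffle cannot guarantee. The actual held-out share will be close to the requested fraction rather than exactly equal to it.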

The takeaway

Evals are the product spec for an LLM feature, written in code instead of in a PRD. Three layers, fifty examples, one afternoon. The change that lands without an eval is the change you debug under pressure later, when a customer flags it. If your team cannot agree on what good means, write the eval that pins it down. The product is what comes after.

Production checklist

  • Pick the three deterministic checks that catch your most embarrassing failures. Wire them into CI today.

  • Curate a golden set of fifty examples, half from team-defined "must work" cases, half from real production queries the moment you have them.

  • Hold out twenty to thirty percent as a never-tuned-against test set.

  • Stand up reference checks (exact match, embedding similarity) on the examples that have a known correct answer.

  • Add LLM-as-judge on top, only for the subjective dimensions reference checks cannot reach.

  • Block merges on a regression threshold. Post the diff to the pull request, not to a dashboard.

  • Pull new golden examples from production failures every week. The set grows from the misses, not from imagination.

Further reading