The Evaluation Pyramid for AI Agents

A single eval cannot carry the weight of a production agent.
Agents do too many different kinds of work. They read context, choose tools, transform files, make claims, recover from errors, and sometimes ask a human for approval before crossing a boundary. One scoreboard at the end of a run is not enough. It might tell you that something went right or wrong. It will not tell you where trust broke.
The more useful model is a pyramid.
At the base: deterministic checks. Fast, cheap, boring signals that should catch the failures we can define precisely.
In the middle: LLM-as-judge. Focused review for the places where rules run out and judgment is actually needed.
At the top: human review. Reserved for high-impact decisions, external side effects, and cases where accountability cannot be delegated to a model.
The pyramid matters because each layer has a different job. When we blur those jobs together, evals get expensive, noisy, and hard to debug.
The Base Is Deterministic
Most agent reliability work should start with checks that do not require another model.
Did the agent call the tool it claimed to call? Did the command exit successfully? Did the expected file change? Did the structured output match the schema? Did the run include an approval request before a public action? Did every source mentioned in the final response actually appear in the trace?
These are not glamorous evals. They are closer to linting, unit tests, type checks, and CI gates.
That is why they are valuable.
A deterministic check is cheap to run on every attempt. It is easy to explain when it fails. It can become part of the harness without adding much latency. And it keeps the LLM judge from being asked to do work that the system already knows how to verify.
For agents, the base layer usually includes checks like:
- Required tool calls appeared in the trace.
- Tool errors were surfaced instead of ignored.
- Sources and artifacts cited in the final answer actually appear in the trace.
- Output conforms to a schema or contract.
- Required artifacts exist after the task completes.
- Approval boundaries were respected before irreversible or external actions.
If a failure can be expressed as a rule, start there.
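To make the base layer concrete, here is a minimal sketch of a few of these checks written as plain functions over a recorded run. The `Run` and `ToolCall` shapes, the `request_approval` tool name, and the failure messages are all assumptions made for illustration; a real harness will have its own trace format.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    exit_code: int = 0
    error_reported: bool = False  # True if the agent surfaced the failure


@dataclass
class Run:
    # Hypothetical trace shape, not a real harness format.
    tool_calls: list[ToolCall] = field(default_factory=list)
    cited_sources: set[str] = field(default_factory=set)  # named in the final answer
    seen_sources: set[str] = field(default_factory=set)   # actually read during the run


def missing_tool_calls(run: Run, required: set[str]) -> list[str]:
    called = {call.name for call in run.tool_calls}
    return [f"missing tool call: {name}" for name in sorted(required - called)]


def ignored_tool_errors(run: Run) -> list[str]:
    return [
        f"tool error not surfaced: {call.name}"
        for call in run.tool_calls
        if call.exit_code != 0 and not call.error_reported
    ]


def unsupported_sources(run: Run) -> list[str]:
    missing = sorted(run.cited_sources - run.seen_sources)
    return [f"source not in trace: {name}" for name in missing]


def unapproved_actions(
    run: Run, gated: set[str], approval_tool: str = "request_approval"
) -> list[str]:
    # Simplification: one approval request unlocks every later gated action.
    failures, approved = [], False
    for call in run.tool_calls:
        if call.name == approval_tool:
            approved = True
        elif call.name in gated and not approved:
            failures.append(f"action before approval: {call.name}")
    return failures


def run_base_checks(run: Run, required: set[str], gated: set[str]) -> list[str]:
    """Return every rule violation; an empty list means the base layer passed."""
    return (
        missing_tool_calls(run, required)
        + ignored_tool_errors(run)
        + unsupported_sources(run)
        + unapproved_actions(run, gated)
    )
```

Each function returns a list of human-readable failures instead of a score, so a failing run says which rule broke. None of them calls a model, which is what keeps them cheap enough to run on every attempt.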
The Middle Is Judgment
Some failures are real but not cleanly deterministic.
The agent may have opened the right files and still misunderstood the task. It may have produced code that builds but solves the wrong problem. It may have summarized a customer request accurately in one paragraph and missed the critical caveat in the next. It may have chosen a recovery path that was plausible but risky.
This is where LLM-as-judge earns its place.
The mistake is using the judge as a blanket verdict machine: “Was this run good?” That question is too broad. It hides too many failure modes inside one score.
A better judge has a narrow assignment:
Given the user request, the tool trace, the final answer, and the acceptance criteria, did the agent complete the requested verification before claiming success?
Or:
Given the incident trace and the expected recovery behavior, did the agent update state after the failure, or did it only retry the same operation?
The narrower the question, the more useful the judgment. The judge sees less noise. The result is easier to compare against a human review. The failure can be turned into a regression case instead of a vague complaint about quality.
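One way to keep a judge narrow is to make the question an explicit input and force a binary verdict. The sketch below assumes a hypothetical `complete` callable that takes a prompt string and returns the model's text reply; it stands in for whatever model client you use and is not a real library API. The template wording and the PASS/FAIL convention are also illustrative choices.

```python
from dataclasses import dataclass
from typing import Callable, Optional

JUDGE_TEMPLATE = """You are reviewing one specific aspect of an agent run.

Question: {question}

User request:
{request}

Tool trace:
{trace}

Final answer:
{answer}

Acceptance criteria:
{criteria}

Reply with PASS or FAIL as the first word, then one sentence of justification."""


@dataclass
class Verdict:
    passed: Optional[bool]  # None when the reply could not be parsed
    raw: str


def build_prompt(question: str, request: str, trace: str, answer: str, criteria: str) -> str:
    return JUDGE_TEMPLATE.format(
        question=question, request=request, trace=trace, answer=answer, criteria=criteria
    )


def parse_verdict(raw: str) -> Verdict:
    # Only the first word counts; anything else is treated as unparseable
    # rather than guessed at.
    words = raw.split()
    first = words[0].strip(".:,").upper() if words else ""
    if first == "PASS":
        return Verdict(True, raw)
    if first == "FAIL":
        return Verdict(False, raw)
    return Verdict(None, raw)


def ask_judge(complete: Callable[[str], str], **fields: str) -> Verdict:
    # `complete` is a hypothetical prompt-in, text-out model call.
    return parse_verdict(complete(build_prompt(**fields)))
```

Because the verdict is tied to a single named question, a FAIL can be stored alongside the trace as a regression case, and a human reviewer can agree or disagree with exactly that question.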
The middle layer is not there because LLM judges are magic. It is there because some parts of agent behavior are semantic, contextual, and hard to reduce to a rule without losing the point.
The Top Is Human Review
Human review should be rare enough to matter and early enough to prevent damage.
Some decisions need an accountable person. Publishing a public post. Sending a customer email. Merging a production migration. Deleting data. Spending money. Making a legal or financial commitment.
The eval pyramid does not remove those boundaries. It makes them sharper.
The lower layers should prepare the human reviewer instead of replacing them. Before the run reaches a person, the system should already know whether required checks passed, what evidence was collected, what changed, and where the unresolved risk sits.
A good agent does not hand a human a mystery. It brings the trace.
That changes the role of review. The reviewer is no longer digging through logs to reconstruct what happened. They are making the decision only a person should make, with the supporting material already organized.
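As a rough illustration of "bringing the trace," the lower layers can assemble a small review packet before a person is asked to decide. The `ReviewPacket` name, its fields, and the rendered layout are assumptions for this sketch, not a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewPacket:
    """Hypothetical summary handed to a human reviewer before a gated action."""

    proposed_action: str
    checks_passed: list[str] = field(default_factory=list)
    checks_failed: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    changes: list[str] = field(default_factory=list)
    unresolved_risks: list[str] = field(default_factory=list)

    def render(self) -> str:
        sections = [
            ("Checks passed", self.checks_passed),
            ("Checks failed", self.checks_failed),
            ("Evidence", self.evidence),
            ("Changes", self.changes),
            ("Unresolved risks", self.unresolved_risks),
        ]
        lines = [f"Proposed action: {self.proposed_action}"]
        for title, items in sections:
            lines.append(f"{title}:")
            # Empty sections are shown explicitly so nothing looks omitted.
            lines.extend(f"  - {item}" for item in items or ["(none)"])
        return "\n".join(lines)
```

A usage example: `ReviewPacket(proposed_action="Send refund confirmation email", checks_passed=["schema_valid"]).render()` produces a short text block the reviewer can read in one pass, with empty sections marked `(none)`.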
Don’t Invert the Pyramid
A common failure mode is an upside-down evaluation pyramid.
Every run goes to an LLM judge. Ambiguous scores become the main signal. Humans get pulled in only after something feels off. Deterministic checks are added later, if at all.
That shape is expensive and brittle.
It also makes improvement harder. When a judge says a run is a 6 out of 10, what should the team fix? The prompt? The tool contract? The retry policy? The approval boundary? The context assembly? The model?
Deterministic failures point to the broken part of the system. Judgment failures point to the ambiguous behavior that needs a clearer spec, a narrower judge, or a human decision. Human review points to a boundary that should remain explicit.
Those signals are different. Treat them differently.
The Pyramid Becomes the Operating System
Once the layers are in place, evals stop being a separate research exercise. They become how the agent operates.
A run starts. The harness records the trace. Deterministic checks catch missing actions and invalid outputs. Judge calls review the small set of semantic questions that rules cannot answer. Human approval gates pause the run before high-impact actions. The final result includes not just what the agent did, but how it was evaluated along the way.
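The flow above can be sketched as a single function that runs the layers in order and keeps a record of every evaluation. This is a simplified sketch under stated assumptions: the trace is a plain dict, checks and judges are callables that return a boolean, and the run is paused just before one high-impact action. The names `evaluate_run`, `EvalRecord`, and `RunResult`, and the status strings, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalRecord:
    layer: str  # "deterministic", "judge", or "human"
    name: str
    passed: bool


@dataclass
class RunResult:
    status: str  # "completed", "failed_checks", "failed_judgment", or "rejected"
    records: list[EvalRecord] = field(default_factory=list)


def evaluate_run(
    trace: dict,
    checks: dict[str, Callable[[dict], bool]],
    judges: dict[str, Callable[[dict], bool]],
    needs_approval: Callable[[dict], bool],
    approve: Callable[[dict, list[EvalRecord]], bool],
) -> RunResult:
    records: list[EvalRecord] = []

    # Base layer: cheap rules run first, and all of them run so that
    # every broken rule is reported at once.
    for name, check in checks.items():
        records.append(EvalRecord("deterministic", name, check(trace)))
    if not all(r.passed for r in records):
        return RunResult("failed_checks", records)

    # Middle layer: narrow judge questions, only for runs that passed the rules.
    for name, judge in judges.items():
        records.append(EvalRecord("judge", name, judge(trace)))
    if not all(r.passed for r in records):
        return RunResult("failed_judgment", records)

    # Top layer: a person decides, with the earlier records in hand.
    if needs_approval(trace):
        approved = approve(trace, records)
        records.append(EvalRecord("human", "approval", approved))
        if not approved:
            return RunResult("rejected", records)

    return RunResult("completed", records)
```

The ordering is a design choice: failing fast at the base layer means judge calls and human attention are spent only on runs that already satisfy the rules, and the returned records show how the result was evaluated along the way.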
That is the shape we want in production systems.
Not one score at the end. Not a giant judge prompt. Not blind trust in a confident final answer.
A layered evaluation stack that matches the actual risk of the work.
The pyramid is simple, but it changes the conversation. Instead of asking whether an agent is “good,” we can ask better questions.
What did we verify with rules? What required judgment? What still needed a person? And when something failed, did we save the trace so the same mistake cannot hide again?
That is how evals become infrastructure.