2026-05-04

Evals Are Unit Tests for Agent Behavior

#agents#evals#reliability#engineering

Abstract gold agent traces becoming regression checks

The most useful eval usually starts as a bug report.

An agent skipped a step. It summarized a file it never opened. It said the build passed because one command succeeded, even though the actual test suite never ran. It completed 80% of a task, wrote a confident final message, and left the part that mattered undone.

These are not exotic model failures. They are the normal failures of systems that can reason, use tools, and still lose track of the job.

The fix is not just a better prompt. The fix is turning the failure into a check that can run again.

A failure is not really fixed until it cannot come back silently.

Evals Are Operational Memory

Teams often talk about evals like they are research artifacts: datasets, leaderboards, judge prompts, scorecards. Those can be useful. But for production agents, the more practical framing is simpler.

Evals are unit tests for behavior.

A unit test captures something the code should always do. An eval captures something the agent should always do. The value is not that it proves the system is perfect. The value is that it prevents a known mistake from becoming invisible again.

If an agent once answered from stale context, we want a regression case that forces it to refresh the source. If it once skipped verification, we want a case that fails when the verification step is missing. If it once swallowed a tool error and continued as if nothing happened, we want a check that catches the missing error path.

The pattern is familiar to engineers. Production incident. Root cause. Regression test. CI gate.

Agents need the same muscle.

Start With the Smallest Deterministic Check

Our default is rules before judges.

When an agent fails, the first question is not “can an LLM evaluate this?” It is “what is the cheapest unambiguous signal that would have caught this?”

Sometimes the answer is very small.

Did the agent claim it ran tests? Check the tool trace for the test command.

Did it say it changed a file? Check the diff.

Did it cite a source? Check that the source was actually opened during the run.

Did it finish a multi-step task? Check that each required step has an artifact: a file, a command result, a status update, a build output, a ticket comment, a PR link.

These checks are boring. That is the point. They are fast, cheap, and easy to debug. When they fail, the explanation is precise: the trace does not contain the required action, the artifact is missing, the output schema is invalid, the build command exited nonzero.

No judge prompt needed.

Use LLM Judges for Ambiguity, Not Laziness

Some failures cannot be captured by a simple rule.

Maybe the agent technically touched every file, but misunderstood the user’s intent. Maybe it produced a plan that looked complete but chose the wrong abstraction. Maybe it wrote an answer that was supported by sources but missed the important caveat.

That is where LLM-as-judge can help. But we try to keep the judge focused on the ambiguous part, not the entire run.

A weak judge prompt asks: “Was this task completed well?”

A better judge prompt asks: “Given the user request, the final response, and the tool evidence below, did the agent verify the specific claim it made about build success?”

The narrower the judgment, the more useful the result. The judge gets cleaner context. The score is easier to interpret. The failure is easier to reproduce. And when the judge disagrees with a human reviewer, we can usually see why.

Judges are powerful, but they are not free. They add latency, cost, and another model failure mode. We use them where rules run out.

Keep Human Review at the Top

Some decisions should not be automated away.

Publishing a public announcement. Merging a risky infrastructure change. Sending a customer-facing note. Deleting data. Making a legal or financial commitment.

For those, the eval should not pretend to replace a person. It should prepare the review.

Did the agent collect the evidence? Did it summarize the risk? Did it show the diff? Did it identify the external side effect? Did it ask for approval before crossing the boundary?

The human reviewer should not be doing archaeology. The agent should bring the trace, the artifacts, and the open question to the top of the stack.

That is the evaluation pyramid in practice: deterministic checks at the base, LLM judgment in the middle, human review at the top.

Save the Trace With the Test

A good regression eval needs more than a prompt and an expected answer.

It needs the trace of the failure.

What did the agent see? What tools did it call? What did those tools return? What did it claim in the final response? Which required step was missing? Which assumption went stale? Which error was swallowed?

Without the trace, teams end up testing the wrong thing. They create a synthetic prompt that resembles the incident, but not the actual failure mode. The eval passes, and the bug comes back anyway.

The trace is the specimen. The eval is the lab test built from it.

When we keep them together, the system gets better in a way that compounds. Every recurring failure teaches the harness something. Every harness check gives the agent a sharper boundary. Every boundary makes the next run easier to trust.

The Payoff Is Trust

Agent reliability is not one big breakthrough. It is a pile of small remembered failures.

One check for skipped verification. One check for stale context. One check for missing citations. One check for incomplete task state. One check for tool errors. One check for approval boundaries.

Over time, those checks become the operational memory of the system.

This is why we care about evals. Not because a dashboard score looks good. Not because a judge said the answer was an 8.4. Because the agent made a mistake once, and the platform learned not to let that mistake hide again.

That is how production systems earn trust.

They remember what went wrong.