2026-05-18

Why We Prefer Rules Before LLM Judges

#agents #evals #reliability #engineering

[Image: abstract gold agent traces passing through deterministic rule gates before reaching a smaller LLM judgment node.]

LLM judges are easy to reach for.

An agent finishes a run. The answer looks plausible. The trace is long. Nobody wants to inspect every tool call by hand. So we ask another model: was this good?

Sometimes that is the right move. Some failures are semantic. Some require judgment. Some depend on whether the agent understood the intent behind the task, not just whether it touched the right files.

But if the judge is the first line of defense, the system is usually upside down.

Our default is rules before judges.

Not because rules are more sophisticated. Because they are less mysterious. They catch the failures that should not require interpretation, and they leave the judge with the smaller set of questions that actually need judgment.

Start With What the System Can Know

A surprising amount of agent reliability is not subjective.

Did the agent call the tool it claimed to call? Did the command exit with code zero? Did the expected file appear? Did the diff include the requested change? Did the final answer mention a source that never appeared in the trace? Did the run ask for approval before an external side effect?

None of those questions need an LLM.

They need structured traces, clear acceptance criteria, and boring checks that run every time. If the agent says it ran the build, the harness can look for the build command and its exit code. If the agent says it updated a document, the harness can inspect the changed file. If the task required approval before a public action, the harness can verify that the run stopped at the gate instead of crossing it.
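
As a minimal sketch of one such check, assume the harness records each tool call as a command string plus an exit code. The `ToolCall` schema, the `build_was_verified` name, and the `make build` command are all hypothetical, chosen only for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One recorded tool invocation (hypothetical trace schema)."""
    command: str
    exit_code: int


def build_was_verified(trace: list[ToolCall], build_command: str = "make build") -> bool:
    """Return True if the trace contains the build command with exit code zero."""
    return any(
        call.command == build_command and call.exit_code == 0
        for call in trace
    )


trace = [ToolCall("git status", 0), ToolCall("make build", 2)]
print(build_was_verified(trace))  # prints False: the build ran but did not succeed
```

The check knows nothing about the task beyond one fact: a claim of "the build passed" needs a matching successful command in the trace.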

A rule does not have to understand the whole task. It only has to catch one failure mode cleanly.

That is why rules compound. One check for missing verification. One check for swallowed tool errors. One check for unsupported final claims. One check for incomplete artifacts. Each one removes a class of silent failure from the judge's workload.
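
One way to picture the compounding is a plain list of small rule functions, each responsible for a single failure mode. The sketch below assumes a run is a dict with the keys shown; the key names, rule names, and the keyword heuristic are illustrative assumptions, not a real harness API:

```python
from typing import Callable, Optional

# Hypothetical run record: a plain dict with the keys used below.
Run = dict
# A rule returns a failure message, or None when the run passes.
Rule = Callable[[Run], Optional[str]]


def no_swallowed_tool_errors(run: Run) -> Optional[str]:
    """Flag runs where a tool failed but the final answer never mentions a failure."""
    failed = [c["command"] for c in run["tool_calls"] if c["exit_code"] != 0]
    # Crude keyword heuristic, used only to keep the sketch short.
    if failed and "fail" not in run["final_answer"].lower():
        return f"swallowed tool errors: {failed}"
    return None


def expected_artifacts_exist(run: Run) -> Optional[str]:
    """Flag runs that finished without producing every expected artifact."""
    missing = [p for p in run["expected_artifacts"] if p not in run["artifacts"]]
    return f"missing artifacts: {missing}" if missing else None


RULES: list[Rule] = [no_swallowed_tool_errors, expected_artifacts_exist]


def run_rules(run: Run, rules: list[Rule] = RULES) -> list[str]:
    """Apply every rule and collect the failure messages."""
    failures = []
    for rule in rules:
        message = rule(run)
        if message is not None:
            failures.append(message)
    return failures


example_run = {
    "tool_calls": [{"command": "pytest", "exit_code": 1}],
    "final_answer": "All tests pass and the report is written.",
    "expected_artifacts": ["report.md"],
    "artifacts": [],
}
for failure in run_rules(example_run):
    print(failure)
```

Adding a new failure class means appending one function to `RULES`; nothing else in the sketch has to change.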

Rules Fail Loudly

The best thing about a deterministic check is that it usually fails with an explanation.

The file was not created. The schema did not validate. The trace does not contain the required command. The tool returned an error and the final answer ignored it.

That kind of failure is useful because the next engineering move is obvious. Fix the tool contract. Tighten the prompt. Add a required step. Change the state machine. Improve the artifact check.

A broad judge score rarely gives you that clarity.

If a judge says a run was a 6 out of 10, what broke? The retrieval? The tool call? The plan? The final wording? The model's interpretation of the user's intent? The judge's own rubric?

Scores can be helpful as a signal, but they are a bad substitute for direct evidence. We want the system to say, as often as possible, exactly which invariant was violated.
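
To make the contrast with a bare score concrete, a rule can return a record that names the invariant and the evidence it observed. This is a sketch under assumptions: the `Violation` type, the invariant label, and the idea that cited and retrieved sources are available as sets of strings are all hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    """A named invariant plus the evidence that it was broken (hypothetical type)."""
    invariant: str
    evidence: str


def check_cited_sources(cited: set[str], sources_in_trace: set[str]) -> list[Violation]:
    """Report every source the final answer cites that never appeared in the trace."""
    unsupported = sorted(cited - sources_in_trace)
    return [
        Violation("cited-source-in-trace", f"{source!r} cited but never retrieved")
        for source in unsupported
    ]


violations = check_cited_sources(
    cited={"docs/api.md", "docs/faq.md"},
    sources_in_trace={"docs/api.md"},
)
for violation in violations:
    print(f"{violation.invariant}: {violation.evidence}")
```

Unlike "6 out of 10", each record points at one thing to fix.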

Judges Are for the Ambiguous Layer

There are still plenty of failures rules cannot capture well.

An agent may run the requested command and still draw the wrong conclusion from the output. It may edit the correct file but choose a brittle abstraction. It may summarize the relevant source and omit the caveat that changes the decision. It may recover from an error in a way that works today but leaves hidden state behind for the next run.

Those are real failures. They deserve evaluation.

This is where LLM-as-judge belongs: not as a blanket verdict on the whole run, but as a focused reviewer for a specific ambiguity.

The judge prompt should be narrow enough that a human reviewer would know what evidence to inspect:

Given the user request, the final response, and the tool trace, did the agent verify the build before claiming the build passed?

Or:

Given the error and the recovery steps, did the agent update state after understanding the failure, or did it only repeat the same operation?

The narrower the assignment, the more useful the judgment. The judge sees cleaner context. The answer is easier to compare with a human review. The failure is easier to turn into a regression case.
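
A narrow assignment can be sketched as a prompt builder that takes exactly one question and only the evidence that question needs. The function names and the JSON reply format below are assumptions for illustration; the model call itself is deliberately left out:

```python
import json


def build_judge_prompt(
    question: str,
    user_request: str,
    final_response: str,
    tool_trace: list[str],
) -> str:
    """Assemble a single-question judge prompt from the relevant evidence."""
    trace_text = "\n".join(f"{i}. {step}" for i, step in enumerate(tool_trace, start=1))
    return (
        f"Question: {question}\n\n"
        f"User request:\n{user_request}\n\n"
        f"Final response:\n{final_response}\n\n"
        f"Tool trace:\n{trace_text}\n\n"
        # Assumed reply format: one yes/no verdict plus the evidence used.
        'Reply with JSON only: {"verdict": "yes" or "no", "evidence": "<trace step you relied on>"}'
    )


def parse_verdict(raw_reply: str) -> bool:
    """Turn the judge's JSON reply into a boolean, rejecting anything off-format."""
    reply = json.loads(raw_reply)
    if not isinstance(reply, dict) or reply.get("verdict") not in ("yes", "no"):
        raise ValueError(f"unexpected judge reply: {raw_reply!r}")
    return reply["verdict"] == "yes"


prompt = build_judge_prompt(
    question="Did the agent verify the build before claiming the build passed?",
    user_request="Fix the failing import and confirm the build.",
    final_response="Fixed the import. The build passes.",
    tool_trace=["edit src/app.py", "make build (exit 0)"],
)
print(prompt)
```

A yes/no verdict with cited evidence is easy to compare against a human review of the same trace, and a disagreement can be saved as a regression case.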

The Ordering Changes the Economics

Rules-first evaluation is not just a quality preference. It changes the cost and latency profile of the system.

Deterministic checks can run on every attempt. They are cheap enough to put in the hot path. They can fail fast before a long judge call. They can run in CI, in local development, in background monitors, and in production traces without turning every run into another model workload.

The judge then handles fewer cases and better cases.

Instead of reviewing a messy transcript and deciding whether everything was acceptable, the judge receives a trace that has already been organized by the rule layer. The obvious failures are gone. The required artifacts are present or explicitly missing. The question is about the remaining ambiguity.

That makes the judge more reliable because the input is better. It also makes the system easier to operate because most failures do not require another probabilistic component to explain them.
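
The ordering can be sketched as a small gate: run every rule, and call the judge only when none of them fail. Here `evaluate`, the rule signature, and the injected `judge` callable are hypothetical; the judge is passed in so the sketch does not assume any particular model client:

```python
from typing import Callable, Optional

Run = dict  # hypothetical run record
Rule = Callable[[Run], Optional[str]]  # failure message, or None on pass
Judge = Callable[[Run], bool]  # stand-in for a focused LLM judge call


def evaluate(run: Run, rules: list[Rule], judge: Judge) -> dict:
    """Run the cheap deterministic rules first; call the judge only if they all pass."""
    violations = [message for rule in rules if (message := rule(run)) is not None]
    if violations:
        # Fail fast: no model call is spent on a run that broke a hard invariant.
        return {"status": "failed_rules", "violations": violations, "judge_called": False}
    verdict = judge(run)
    status = "passed" if verdict else "failed_judge"
    return {"status": status, "violations": [], "judge_called": True}


def build_ran(run: Run) -> Optional[str]:
    """Example rule: the run must contain a build command."""
    return None if "make build" in run["commands"] else "no build command in trace"


result = evaluate({"commands": ["git status"]}, [build_ran], judge=lambda run: True)
print(result["status"], result["judge_called"])  # prints: failed_rules False
```

Because the judge is only reached on runs that already satisfy the rules, its workload shrinks to the cases where judgment is the actual question.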

Rules Also Define the Contract

A rule is more than a check. It is a statement about what the agent is expected to do.

If we require source-grounded claims, the trace needs source evidence. If we require build verification, the trace needs a build command and result. If we require approval before publishing, the agent must stop at the boundary and ask. If we require artifacts for a multi-step task, the run is not complete until those artifacts exist.
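
As an illustration of a contract expressed as a check, the approval requirement can be tested against an ordered event log. The event names are hypothetical, and the sketch assumes each approval covers exactly one side effect, which is a stricter policy than the text requires:

```python
def approval_precedes_side_effect(
    events: list[str],
    side_effect: str = "publish",
    approval: str = "approval_granted",
) -> bool:
    """Return True if every side-effect event is preceded by its own approval event."""
    approved = False
    for event in events:
        if event == approval:
            approved = True
        elif event == side_effect:
            if not approved:
                return False
            # Assumption: an approval is consumed by the side effect it authorized.
            approved = False
    return True


print(approval_precedes_side_effect(["plan", "approval_granted", "publish"]))  # prints True
print(approval_precedes_side_effect(["plan", "publish"]))  # prints False
```

Writing the expectation this way forces the contract to be explicit: the runtime has to emit an approval event, and the agent has to stop until it appears.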

Those expectations shape the agent.

They make the invisible contract visible. They tell the prompt, the tools, the runtime, and the reviewer what completion actually means. Without that contract, the system drifts toward vibes: the answer sounded confident, the plan looked reasonable, the judge was mostly satisfied.

Production systems need sharper edges than that.

Use the Most Boring Tool That Works

The point is not to avoid LLM judges. The point is to stop asking them to do work that simpler machinery can do better.

Use a rule when the failure is objective. Use a judge when the failure is semantic. Use a human when the consequence requires accountability.

That ordering keeps each layer honest.

Rules provide fast, debuggable guardrails. Judges handle the ambiguity that remains. Humans review the decisions that should not be delegated away.

For agent reliability, that is the practical path. Not one giant judge prompt. Not blind trust in a final answer. A stack of checks where the cheapest reliable signal comes first.

The LLM is powerful. We should save it for the parts that need it.