2026-05-25

How Agents Fail Quietly

#agents #reliability #observability #engineering


The scary agent failure is not always the one that explodes.

Sometimes the tool call returns an error and the agent keeps going. Sometimes the agent summarizes a file it never opened. Sometimes it finishes the easy 80% of a task, writes a polished final answer, and leaves the part that mattered untouched. Sometimes the output is wrong in a way that looks reasonable until someone downstream trusts it.

That is what makes production agents hard to operate.

A normal software failure usually leaves a loud signal: an exception, a failed health check, a broken deploy, a page that will not load. Agent failures can look like success. The run completes. The answer is fluent. The checklist sounds confident. The mistake is hiding inside the gap between what the agent claimed and what it actually did.

Reliability work starts by naming those quiet failures.

Plausible Wrong Answers

Language models are very good at producing answers that sound complete.

That is useful when the answer is grounded. It is dangerous when confidence outruns evidence. An agent can make a claim from stale context, infer a missing detail, smooth over uncertainty, or compress a messy tool result into a conclusion that is not actually supported.

The failure mode is not “the model hallucinated” in the abstract. The operational failure is more specific: the system allowed an unsupported claim to reach the user without evidence attached.

The fix is not just telling the model to be careful. The harness needs to ask sharper questions.

Did the run include the source the final answer cited? Did the agent inspect the file it summarized? Did the command output support the conclusion? Did the final answer distinguish between verified facts and assumptions?

If the answer sounds plausible but the trace cannot support it, the run should fail.
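One of those questions, whether the run actually opened the sources the final answer cites, can be sketched as a deterministic check. This is a minimal illustration, not a real harness API: the `ToolCall` record, the `read_file` tool name, and the `unsupported_citations` helper are all hypothetical, and it assumes the harness can extract a list of cited sources from the final answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical trace record; field names are assumptions for this sketch."""
    tool: str    # e.g. "read_file"
    target: str  # path or URL the tool touched
    ok: bool     # whether the call succeeded


def unsupported_citations(cited: list[str], trace: list[ToolCall]) -> list[str]:
    """Return cited sources that the trace never successfully read."""
    opened = {c.target for c in trace if c.tool == "read_file" and c.ok}
    return [src for src in cited if src not in opened]


trace = [
    ToolCall("read_file", "docs/setup.md", ok=True),
    ToolCall("read_file", "docs/deploy.md", ok=False),  # the read failed
]
missing = unsupported_citations(
    ["docs/setup.md", "docs/deploy.md", "docs/api.md"], trace
)
print(missing)  # ['docs/deploy.md', 'docs/api.md']
```

A non-empty result is a reason to fail the run: the answer cites something the trace cannot back up, either because the read failed or because it never happened.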

Partial Completion

Agents are especially vulnerable to partial completion because many tasks have a visible easy part and a less visible hard part.

A code agent can edit the right file but skip the tests. A research agent can gather three sources but miss the primary source. A support agent can draft a reply but fail to update the ticket state. A publishing agent can create the Markdown but forget the image, metadata, or build verification.

The final answer often hides the gap because it describes the work that did happen.

“I updated the draft.”

“I checked the logs.”

“I prepared the post.”

Those statements may be true and still incomplete.

Partial completion is why acceptance criteria matter. The system needs a definition of done that is more concrete than “make progress.” For a production agent, completion should usually mean artifacts exist, required checks ran, and the final response names anything still unresolved.

If the task has five required steps, the trace should show five completed steps or an explicit reason the agent stopped.
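That rule is simple enough to write down. The sketch below assumes the harness records which required steps completed and whether the agent gave a stop reason; `RunReport`, `completion_status`, and the step names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunReport:
    """Hypothetical summary of a run; not a real framework type."""
    completed_steps: set[str]
    stop_reason: Optional[str] = None  # explicit reason the agent stopped early


def completion_status(required: list[str], report: RunReport) -> tuple[str, list[str]]:
    """Classify a run as complete, explicitly stopped, or silently incomplete."""
    missing = [step for step in required if step not in report.completed_steps]
    if not missing:
        return "complete", []
    if report.stop_reason:
        return "stopped", missing
    return "silent_gap", missing


required = ["edit", "test", "build", "image", "metadata"]

done = RunReport({"edit", "test", "build", "image", "metadata"})
stopped = RunReport({"edit", "image"}, stop_reason="build tool unavailable")
quiet = RunReport({"edit", "image"})

print(completion_status(required, done))     # ('complete', [])
print(completion_status(required, stopped))  # ('stopped', ['test', 'build', 'metadata'])
print(completion_status(required, quiet))    # ('silent_gap', ['test', 'build', 'metadata'])
```

Only the third case is the quiet failure. The second is an honest partial result, which is a different and much more manageable thing.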

Skipped Verification

One of the most common quiet failures is claiming success before verification.

This is familiar in software work. The agent makes a change, sees that the edit applied cleanly, and reports that the issue is fixed. But it never ran the build. Or it ran the wrong command. Or it ran a narrow check and generalized the result to the whole system.

Humans do this too. Agents just do it at machine speed and with better prose.

The reliability pattern is straightforward: verification claims need verification evidence.

If the agent says the build passed, the trace should contain the build command and the result. If it says a route exists, the generated file or HTTP response should be checked. If it says a workflow is ready to publish, it should have inspected the metadata, image reference, generated page, and git status.

The final answer should not be allowed to upgrade “I changed it” into “it works” without proof.
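One way to enforce that is to make "it works" a claim that can only be issued from a recorded command result. The following is a rough sketch under that assumption; `Evidence`, `run_check`, and `may_claim_success` are hypothetical names, and the commands are stand-ins for a real build or test invocation.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    """Hypothetical record of a verification command and its outcome."""
    command: list[str]
    exit_code: int
    output_tail: str


def run_check(command: list[str]) -> Evidence:
    """Run a command and keep what is needed to support a later claim."""
    proc = subprocess.run(command, capture_output=True, text=True)
    combined = proc.stdout + proc.stderr
    return Evidence(command, proc.returncode, combined[-2000:])


def may_claim_success(evidence: Optional[Evidence]) -> bool:
    """A success claim needs evidence that exists and shows a zero exit code."""
    return evidence is not None and evidence.exit_code == 0


# Stand-in commands; a real harness would run the project's build or tests.
passing = run_check([sys.executable, "-c", "print('build ok')"])
failing = run_check([sys.executable, "-c", "raise SystemExit(3)"])

print(may_claim_success(passing))  # True
print(may_claim_success(failing))  # False
print(may_claim_success(None))     # False: no command ran, so no claim
```

A zero exit code is only the floor. It does not catch the agent that ran the wrong command or a check that was too narrow, which is why the recorded command belongs in the trace next to the result.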

Stale Assumptions

Agents often carry context across a run. That context can go stale quickly.

A file changes after the agent read it. A command output invalidates the plan. A repo has untracked work that changes what is safe to edit. A dependency version differs from memory. A route is generated differently than the agent expected.

Quiet failures happen when the agent continues from the old assumption instead of refreshing the source of truth.

This is why tool evidence matters. The agent should not rely on memory for repository state, current files, package scripts, dates, ports, processes, or deployment status. It should check the live system at the moment the claim matters.

A stale assumption is not always a model problem. Sometimes it is a runtime problem: the harness did not force the agent to re-read the thing it was about to cite.
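A harness can approximate that forcing function by fingerprinting each file when the agent reads it and comparing again before the agent cites it. This is a minimal sketch that assumes content hashing is an acceptable freshness signal; the `ReadLog` class is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


class ReadLog:
    """Hypothetical helper: remembers a content hash for every file read."""

    def __init__(self) -> None:
        self._seen: dict[Path, str] = {}

    def read(self, path: Path) -> str:
        data = path.read_bytes()
        self._seen[path] = hashlib.sha256(data).hexdigest()
        return data.decode("utf-8")

    def is_fresh(self, path: Path) -> bool:
        """True only if the file was read and is unchanged since that read."""
        recorded = self._seen.get(path)
        if recorded is None or not path.exists():
            return False
        return recorded == hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "config.toml"
    config.write_text("port = 8080\n")

    log = ReadLog()
    log.read(config)
    fresh_before = log.is_fresh(config)

    config.write_text("port = 9090\n")  # the file changes after the read
    fresh_after = log.is_fresh(config)

print(fresh_before, fresh_after)  # True False
```

When `is_fresh` returns false, the harness can require a re-read before the claim goes out. The same idea extends to command output or git status, though those need a different fingerprint than a file hash.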

Swallowed Tool Errors

Tool errors are supposed to make failures easier to see. Agents can make them disappear.

A command exits nonzero, but the agent focuses on the useful part of the output. A file read fails, but the agent answers from nearby context. A search returns no results, and the agent treats the empty result as evidence of absence rather than as a possibly failed lookup. A build produces warnings that matter, and the final response says only that the command completed.

The issue is not that every tool error should stop the run. Some errors are recoverable. The issue is that recovery should be explicit.

A good trace shows the error, the interpretation, the recovery step, and the new evidence that the recovery worked. A bad trace has a red flag in the middle and a confident final answer at the end.

For production agents, swallowed errors should be first-class eval cases. If the tool failed, the final response needs to acknowledge the failure or show the recovery.
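As an eval case, that rule might look like the sketch below. It treats a failure as recovered only if the same tool later succeeds on the same target, which is a deliberately narrow assumption; `Step`, `unrecovered_failures`, and `swallowed` are hypothetical names, and it assumes the harness can tell which targets the final response acknowledged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical trace step; field names are assumptions for this sketch."""
    tool: str
    target: str
    exit_code: int


def unrecovered_failures(steps: list[Step]) -> list[Step]:
    """Failed steps with no later success of the same tool on the same target."""
    result = []
    for i, step in enumerate(steps):
        if step.exit_code == 0:
            continue
        recovered = any(
            later.tool == step.tool
            and later.target == step.target
            and later.exit_code == 0
            for later in steps[i + 1:]
        )
        if not recovered:
            result.append(step)
    return result


def swallowed(steps: list[Step], acknowledged: set[str]) -> list[Step]:
    """Unrecovered failures that the final response never mentioned."""
    return [s for s in unrecovered_failures(steps) if s.target not in acknowledged]


steps = [
    Step("read_file", "a.py", 1),  # failed read...
    Step("read_file", "a.py", 0),  # ...retried successfully
    Step("run", "pytest", 2),      # failed and never retried
]

print(swallowed(steps, acknowledged=set()))       # [Step(tool='run', target='pytest', exit_code=2)]
print(swallowed(steps, acknowledged={"pytest"}))  # []
```

A run passes this check in two ways: it recovers, or it says out loud that it did not.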

The Pattern: Claim, Evidence, Boundary

Most quiet failures can be reduced to three questions.

What did the agent claim?

What evidence supports that claim?

What boundary should have stopped the run before the claim reached the user?

The boundary might be deterministic: required artifact missing, command not run, schema invalid, source not opened. It might require judgment: the answer technically cites sources but misses the important caveat. It might require a human: the next step is public, irreversible, risky, or something a person must be accountable for.

This is where evals, observability, and approval gates meet. The trace should make quiet failures visible. The eval stack should turn repeated failures into checks. The approval system should stop the agent before quiet uncertainty becomes an external side effect.
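The three questions map onto a small gate. The sketch below covers only the deterministic boundary and the human boundary; the judgment boundary needs a grader and is left out. `Claim`, `Verdict`, and `gate` are hypothetical, and the evidence identifiers are assumed to come from the trace.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    NEEDS_APPROVAL = "needs_approval"


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record; evidence_ids point at entries in the trace."""
    text: str
    evidence_ids: tuple[str, ...]
    irreversible: bool = False  # public, risky, or hard to undo


def gate(claim: Claim, known_evidence: set[str]) -> Verdict:
    """Decide whether a claim may reach the user."""
    # Deterministic boundary: every claim needs evidence the trace contains.
    if not claim.evidence_ids:
        return Verdict.FAIL
    if any(eid not in known_evidence for eid in claim.evidence_ids):
        return Verdict.FAIL
    # Human boundary: supported but irreversible steps wait for approval.
    if claim.irreversible:
        return Verdict.NEEDS_APPROVAL
    return Verdict.PASS


known = {"cmd-1", "file-7"}

print(gate(Claim("The build passed.", ("cmd-1",)), known))               # Verdict.PASS
print(gate(Claim("The tests cover this path.", ()), known))              # Verdict.FAIL
print(gate(Claim("Ready to publish.", ("file-7",), irreversible=True), known))  # Verdict.NEEDS_APPROVAL
```

The ordering is a design choice: an unsupported claim fails before anyone is asked to approve it, so a human reviewer only sees claims that already have evidence attached.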

Make Quiet Failures Loud

The goal is not to make agents afraid to act. The goal is to make their uncertainty and incompleteness visible.

A reliable agent can still fail. But it should fail in ways the system can see.

Unsupported claims should point to missing evidence. Partial completion should point to missing artifacts. Skipped verification should point to absent commands. Stale assumptions should point to sources that were not refreshed. Swallowed errors should point to recovery that never happened.

Once those failures are loud, they can become regression tests. Once they become regression tests, they stop being mysterious. And once they stop being mysterious, the agent becomes easier to trust.

Not because it never makes mistakes.

Because the mistakes stop hiding.