
2025-06-15

Building Autonomous Systems That Actually Work

#agentic-ai #automation #engineering #architecture

(Figure: layered system architecture)

We shipped our first production agent system about a year ago. Since then, we've built nine more — orchestration platforms, data pipelines, document verification engines, workflow assistants, report generators, quality enforcement agents, execution frameworks. Different domains, different tech stacks, different clients. But the same lessons keep surfacing.

This post is an attempt to write down what we've actually learned, not what sounds good in a pitch.

Most AI Automation Fails for Boring Reasons

It's rarely the model. GPT-4, Claude, Gemini — they're all remarkably capable. The failures we see (and the ones we've caused ourselves, early on) almost always come down to the same handful of engineering problems:

No separation between "what to check" and "how to check it." The first version of one of our verification systems was a single massive prompt. It kind of worked. But when the client needed to adjust the sensitivity of one specific check, we had to rewrite the whole thing. We eventually split it into a pipeline of focused agents — each with a narrow job, its own threshold, and its own evaluation criteria. Now you can tune one check without touching the others.
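The split between "what to check" and "how to check it" can be sketched as a list of narrowly scoped checks, each with its own threshold. This is a minimal illustration, not our actual verification code — the check names and scoring functions are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    """One focused verification step with its own independently tunable threshold."""
    name: str
    score: Callable[[str], float]  # returns a confidence in [0, 1]
    threshold: float               # adjust this check without touching the others

def run_checks(document: str, checks: list[Check]) -> dict[str, bool]:
    # Each check passes or fails on its own criteria.
    return {c.name: c.score(document) >= c.threshold for c in checks}

checks = [
    Check("has_signature", lambda d: 1.0 if "signed:" in d.lower() else 0.0, 0.5),
    Check("long_enough",   lambda d: min(len(d) / 200, 1.0), 0.8),
]
print(run_checks("Signed: A. Author. " + "x" * 200, checks))
```

Tuning the client's sensitive check is now a one-line threshold change instead of a prompt rewrite.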

State management as an afterthought. Agents that run once are easy. Agents that run continuously, pick up where they left off after a crash, and don't reprocess data they've already seen — that's where most teams get stuck. We learned this the hard way building Agent Runner and now bake checkpoint logic into everything from day one.
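The checkpoint idea is simple to state but easy to get wrong. Here is a minimal sketch of the pattern — a set of processed item IDs, persisted atomically after each item so a crash never loses or replays work. The file name and item shape are illustrative, not Agent Runner's actual format:

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical path

def load_checkpoint() -> set[str]:
    # Resume where the last run stopped; empty set on a first run.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def save_checkpoint(done: set[str]) -> None:
    # Write to a temp file, then rename: a crash mid-write can't
    # corrupt the checkpoint, because os.replace is atomic.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, CHECKPOINT)

def process(items: list[str]) -> list[str]:
    done = load_checkpoint()
    newly_processed = []
    for item in items:
        if item in done:            # never reprocess data we've already seen
            continue
        newly_processed.append(item)  # ...real work happens here...
        done.add(item)
        save_checkpoint(done)       # checkpoint after every item
    return newly_processed
```

Running `process` twice with overlapping inputs only does the new work the second time.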

No answer for "why did the system do that?" If your agent produces an output and you can't trace backwards through its decision chain to understand why, you don't have a production system. You have a demo. Every system we build logs the full reasoning chain — which agents ran, what they saw, what they decided, and what they passed downstream.
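The reasoning-chain log doesn't need to be fancy — one structured entry per agent step, keyed by a trace ID, answers "why did the system do that?" after the fact. A minimal sketch (field names are our convention here, not a standard):

```python
import json
import time
import uuid

def log_step(trace_id: str, agent: str, saw: str, decided: str, passed_downstream: dict) -> dict:
    """Append one structured entry to the decision log: which agent ran,
    what it saw, what it decided, and what it handed to the next step."""
    entry = {
        "trace_id": trace_id,
        "ts": time.time(),
        "agent": agent,
        "saw": saw,
        "decided": decided,
        "passed_downstream": passed_downstream,
    }
    print(json.dumps(entry))  # in production: a durable, queryable log sink
    return entry

trace = str(uuid.uuid4())
log_step(trace, "extractor", "invoice.pdf, 3 pages", "fields complete", {"total": 120.0})
log_step(trace, "validator", "extracted fields", "total within policy", {"verdict": "pass"})
```

Filtering the log by `trace_id` reconstructs the full chain for any single output.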

The Architecture Pattern That Keeps Working

After ten projects, we've converged on a pattern. It's not revolutionary — it's just disciplined:

Deterministic first, LLM second. For our document verification platform, the first pass is pure rule-based extraction and validation. Known formats, expected fields, regex patterns. The LLM only gets involved for the ambiguous cases — the stuff that actually requires judgment. This means 70% of inputs never touch the model at all, which is faster, cheaper, and more predictable.
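The deterministic-first routing looks roughly like this: try the cheap rule-based path, and only fall through to the model when the rules can't decide. The regex and the LLM stub below are illustrative — the real system has many more formats and a real model call:

```python
import re

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # one known format, for illustration

def ask_llm_for_date(text: str) -> str:
    # Placeholder for a model call; only the ambiguous minority reaches it.
    return "needs-review"

def extract_date(text: str) -> tuple[str, str]:
    """Deterministic first: known formats via regex.
    LLM second: only inputs that actually require judgment."""
    m = DATE_RE.search(text)
    if m:
        return ("rules", m.group(0))       # fast, cheap, predictable
    return ("llm", ask_llm_for_date(text))  # slow path, used sparingly

print(extract_date("Invoice issued 2025-06-15"))
print(extract_date("issued around mid June last year"))
```

The routing decision itself is deterministic, so you can measure exactly what fraction of traffic ever touches the model.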

Focused agents over monolithic prompts. Our orchestration platform chains 55+ specialized analysis engines and 18 AI agents. Each agent does one thing well. The orchestration layer handles routing, sequencing, and result aggregation. When we tried doing it with fewer, larger agents, the results were less consistent and much harder to debug.

Event-driven, not request-driven. Our data pipeline processes 65k+ pattern definitions across heterogeneous data packages. It's fully serverless — a file lands, events fire, processing chains execute, results aggregate. No polling, no batch jobs, no "run it every hour and hope nothing changed." The architecture matches the shape of the problem.
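The event-driven shape can be sketched with a tiny in-process event bus — in production this is serverless infrastructure (object-store notifications, queues, functions), but the chaining logic is the same. Handler names and payloads here are hypothetical:

```python
from collections import defaultdict

# Minimal event bus: a file "lands", an event fires, handlers chain.
handlers = defaultdict(list)

def on(event: str):
    def register(fn):
        handlers[event].append(fn)
        return fn
    return register

def emit(event: str, payload: dict) -> list:
    # Fire every handler registered for this event; no polling, no cron.
    return [fn(payload) for fn in handlers[event]]

@on("file_landed")
def classify(payload: dict) -> str:
    kind = "csv" if payload["name"].endswith(".csv") else "other"
    emit("classified", {**payload, "kind": kind})  # chain the next stage
    return kind

@on("classified")
def process(payload: dict) -> str:
    return f"processing {payload['name']} as {payload['kind']}"

print(emit("file_landed", {"name": "patterns.csv"}))
```

Each stage only knows the event it consumes and the event it emits, which is what lets the topology grow without rewiring everything.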

Things We Got Wrong

Being honest about mistakes is more useful than listing wins, so here are a few:

We over-indexed on agent autonomy early on. The first system we built gave agents too much latitude to decide their own next steps. The results were impressive when it worked and baffling when it didn't. We've since moved toward more structured agent pipelines — the agents still make judgments, but the pipeline topology is defined upfront. Less magic, more reliability.

We underestimated the value of a good TUI. When we built Bouncer, we added a terminal interface almost as an afterthought. It turned out to be one of the most valuable features. Watching agents process files in real time, seeing verdicts come through, catching issues as they happen — it changed how we debug and how our clients trust the system. We now prototype some kind of observability UI for every project.

We assumed clients wanted dashboards. They wanted Slack. When we built Beacon, we started with a web dashboard. Nobody used it. Then we put the same capabilities behind a Slack interface backed by 13 MCP tool servers, and adoption was immediate. People don't want to learn a new tool. They want intelligence in the tools they already use.

What "Production-Grade" Actually Means

We throw this term around a lot in the industry. Here's what it means to us, concretely:

  • It runs without supervision. Not "it works when someone watches it." It works at 3am on a Saturday when nobody's looking.
  • It handles bad input gracefully. Not by crashing. Not by hallucinating. By recognizing the problem, logging it, notifying someone, and moving on.
  • It's auditable. Every decision can be traced. Every output can be explained. Every change is versioned.
  • It degrades, it doesn't collapse. If one agent in a pipeline fails, the pipeline doesn't die. It flags the failure, skips the step if it can, and produces partial results with a clear note about what's missing.
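The degrade-don't-collapse behavior is worth making concrete. A minimal sketch: each step runs in isolation, failures are flagged rather than fatal, and the output carries both partial results and an explicit note of what's missing. Step names and the simulated failure are illustrative:

```python
def run_step(step: dict, data):
    try:
        return {"ok": True, "result": step["fn"](data)}
    except Exception as exc:
        # Flag the failure instead of letting it kill the pipeline.
        return {"ok": False, "error": f"{step['name']}: {exc}"}

def run_pipeline(steps: list[dict], data) -> dict:
    results, missing = {}, []
    for step in steps:
        out = run_step(step, data)
        if out["ok"]:
            results[step["name"]] = out["result"]
            if step.get("feeds_next"):
                data = out["result"]
        else:
            missing.append(out["error"])  # skip the step, note what's absent
    return {"partial": results, "missing": missing}

steps = [
    {"name": "normalize", "fn": str.strip, "feeds_next": True},
    {"name": "explode",   "fn": lambda d: 1 / 0},  # simulated agent failure
    {"name": "count",     "fn": len},
]
print(run_pipeline(steps, "  hello  "))
```

One failed step still yields two usable results plus a clear record of the gap — partial output with provenance, not silence.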

None of this is glamorous. None of it makes for a good demo. But it's the difference between a system that impresses people in a meeting and a system that actually runs a piece of your business.

Where We're Headed

We open-sourced Agent Runner and Bouncer because we think the production orchestration layer is where the industry needs the most help. The models are good enough. The frameworks for calling them are good enough. What's missing is the boring infrastructure that makes agents reliable at scale.

If you're building agents and hitting the same walls we did — the glue code, the state management, the "it works locally but not in production" gap — check out the tools we've published or reach out. We've probably already solved the problem you're stuck on.