2026-01-08

55 Tools, 18 Agents, 1 Platform: Lessons from Building Conductor

#orchestration #agents #engineering #lessons-learned

55 tools and 18 agents in one orchestration platform

Conductor started as a five-tool prototype. The client had a handful of analysis engines they were running manually, in sequence, copying outputs from one tool into the next. Classic automation opportunity. "Let's chain these together and add some AI to handle the routing."

Six months later, Conductor orchestrates 55+ specialized analysis engines and 18 AI agents across configurable multi-step pipelines. It is, by a wide margin, the most complex system we've built. Here's what we learned.

Lesson 1: You Will Not Design the Right Architecture on Day One

The first architecture was a simple sequential pipeline. Tool A → Tool B → Tool C. Each tool had a fixed position. The output format of each tool was hardcoded to match the input format of the next.

This lasted about three weeks. Then the client said "sometimes we need to run Tool C before Tool B" and "sometimes we skip Tool D entirely" and "we just bought a new tool that needs to slot in between stages 2 and 3."

Version two was a DAG (directed acyclic graph) executor. You define the pipeline as a set of nodes and edges, and the orchestrator handles execution order, parallelism where possible, and data passing between stages. This was better, but still assumed that the pipeline topology was known at definition time.

Version three — the one in production — treats each pipeline as a sequence of stages, where each stage has an agent that decides which tools to run based on the actual data at that point. The topology is still defined upfront (we learned from the "too much agent autonomy" mistake), but the specific tools invoked at each stage are dynamic.
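The version-three shape can be sketched in a few lines. This is a hypothetical illustration, not Conductor's actual code: the stage list fixes the topology upfront, while each stage's `select` function stands in for the agent that picks tools based on the live data.

```python
from dataclasses import dataclass
from typing import Callable

Tool = Callable[[dict], dict]

@dataclass
class Stage:
    name: str
    tools: dict[str, Tool]               # tools *available* at this stage
    select: Callable[[dict], list[str]]  # "agent": data -> tool names to run

def run_pipeline(stages: list[Stage], data: dict) -> dict:
    # Topology is static (the stage order), tool choice is dynamic.
    for stage in stages:
        for tool_name in stage.select(data):
            data = stage.tools[tool_name](data)
    return data

# Usage: a stage whose selector skips its tool when the input is already clean.
clean = Stage(
    name="clean",
    tools={"dedupe": lambda d: {**d, "deduped": True}},
    select=lambda d: [] if d.get("already_clean") else ["dedupe"],
)
```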

If we'd tried to build version three from day one, we would have over-engineered everything and shipped nothing. The iterative path was the right one.

Lesson 2: Failure Isolation Is Everything

When you have 55+ tools, some of them are going to fail. Not occasionally — regularly. External API goes down. A tool takes 10x longer than expected. A tool returns malformed output. A tool works fine with small inputs but OOMs on large ones.

Early Conductor treated every tool failure as a pipeline failure. Tool crashes? Pipeline stops. This meant our pipeline success rate was roughly the product of all individual tool reliability rates. With 55 tools each at 99% reliability, you get 0.99^55 = ~57% pipeline success. Unacceptable.
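The compounding math above is worth making concrete: with N independent tools each at reliability p, any-failure-is-fatal gives a pipeline success rate of p^N.

```python
# With every failure fatal, pipeline success compounds multiplicatively.
p, n = 0.99, 55
success = p ** n      # roughly 0.575, i.e. ~57% as above
```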

The fix was making tool failures non-fatal by default. Each tool invocation is wrapped in a timeout and error handler. If a tool fails, the pipeline records the failure, marks that stage's output as partial, and continues. The final output includes a manifest of which tools ran, which succeeded, which failed, and which were skipped.

For some tools, failure should be fatal — if the primary extraction tool fails, there's no point continuing. You configure this per-tool. But the default is "record and continue," not "crash the whole thing."
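A minimal sketch of that "record and continue" wrapper, assuming each tool is a callable and a per-tool `fatal` flag marks the ones worth stopping for. The names and manifest shape are illustrative, not Conductor's actual interface.

```python
import concurrent.futures as cf

def run_tool(tool, data, *, timeout_s=30, fatal=False, manifest=None):
    """Run one tool with a timeout; record the outcome in the manifest.
    Only tools marked fatal stop the pipeline."""
    if manifest is None:
        manifest = []
    pool = cf.ThreadPoolExecutor(max_workers=1)
    try:
        result = pool.submit(tool, data).result(timeout=timeout_s)
        manifest.append({"tool": tool.__name__, "status": "ok"})
        return result
    except Exception as exc:
        manifest.append({"tool": tool.__name__, "status": "failed",
                         "error": repr(exc)})
        if fatal:
            raise            # e.g. the primary extraction tool
        return data          # record and continue with the prior data
    finally:
        pool.shutdown(wait=False)  # don't block the pipeline on a hung tool
```

The final manifest is what lets the output honestly say "these tools ran, these failed, these were skipped" instead of pretending everything succeeded.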

Lesson 3: The Taxonomy Problem Is Real

When 55 tools each produce their own output format and their own categorization scheme, you need a unified taxonomy. Otherwise the final output is an incoherent pile of incompatible findings.

Tool A might classify something as "high severity." Tool B calls the same thing "critical." Tool C uses a numeric scale. Tool D doesn't assign severity at all.

We built a normalization layer that maps every tool's output into a common taxonomy. This sounds simple but it consumed more engineering time than any other part of the system. Every time a new tool is added, its output format needs to be mapped into the common schema. Every time the taxonomy evolves (which it does, because the client's understanding of their domain evolves), every mapping needs to be reviewed.
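The shape of that normalization layer, in miniature. The tool names and the five-level taxonomy here are invented for illustration; the real mappings are per-tool and far messier.

```python
# A common five-level taxonomy, plus per-tool mappings into it.
COMMON = ["info", "low", "medium", "high", "critical"]

SEVERITY_MAPS = {
    "tool_a": {"high severity": "high"},                      # string labels
    "tool_b": {"critical": "critical"},                       # different labels
    "tool_c": lambda score: COMMON[min(int(score) // 2, 4)],  # 0-9 numeric scale
}

def normalize(tool: str, raw) -> str:
    mapping = SEVERITY_MAPS.get(tool)
    if mapping is None:
        return "medium"        # tools that assign no severity get a default
    if callable(mapping):
        return mapping(raw)
    return mapping.get(str(raw).lower(), "medium")

normalize("tool_a", "High Severity")   # -> "high"
normalize("tool_c", 9)                 # -> "critical"
```

Every new tool means a new entry in `SEVERITY_MAPS`; every taxonomy change means re-reviewing all of them, which is exactly where the engineering time went.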

If we were starting over, we'd define the taxonomy first and build the tool integrations around it. We did it the other way — built the tool integrations first and then tried to unify them after the fact. That was painful.

Lesson 4: Pipeline Composition > Pipeline Configuration

The initial design used a YAML configuration file for each pipeline. You'd list the tools, their order, their parameters, their failure behavior. The config files got long. Really long. And they were fragile — change one tool's parameter name and the pipeline breaks silently.

We moved to a composition model instead. Each pipeline is built from reusable stage definitions. A stage is a unit that says "run these tools with these parameters and feed the output to this normalization mapping." Pipelines are composed by chaining stages together.

The difference is subtle but important. With configuration, you're managing one massive document that defines everything. With composition, you're managing small, testable pieces that snap together. When something breaks, you can test the individual stage in isolation. When you need a new pipeline, you compose existing stages in a new order rather than writing a new config from scratch.
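In code, the composition model can be as small as this. Stages here are plain functions for brevity; the real system's stage definitions carry tools, parameters, and a normalization mapping, but the composition idea is the same.

```python
from typing import Callable

StageFn = Callable[[dict], dict]

def compose(*stages: StageFn) -> StageFn:
    """Chain small, individually testable stages into one pipeline."""
    def pipeline(data: dict) -> dict:
        for stage in stages:
            data = stage(data)  # each stage can be unit-tested in isolation
        return data
    return pipeline

# Hypothetical stages standing in for real tool-running stage definitions.
def extract(d):
    return {**d, "entities": ["acme"]}

def score(d):
    return {**d, "severity": "high" if d["entities"] else "low"}

full  = compose(extract, score)
quick = compose(extract)   # a new pipeline is a new composition, not a new config
```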

Lesson 5: Observability Is Not Optional

With 55 tools and 18 agents, "something went wrong" is a useless error message. You need to know: which stage? Which tool? Which agent made the routing decision? What was the input? What was the output? How long did it take? What was the memory usage?

We instrument everything. Every tool call is logged with its input hash, output hash, duration, and resource usage. Every agent decision is logged with the context it saw and the tools it selected. Every pipeline run gets a unique ID that threads through every log entry.

This level of instrumentation has a performance cost. We budget about 5% overhead for logging and metrics. It's worth it. When a client reports that a pipeline produced unexpected results, we can reconstruct exactly what happened — which tools ran, what they saw, what the agents decided, and where the output diverged from expectations. Without that, debugging a 55-tool pipeline would be archaeology.

Lesson 6: Know When to Stop Adding Tools

This one's more strategic than technical. At some point, the marginal value of adding the 56th tool drops below the marginal cost of maintaining it. Every tool has an ongoing maintenance burden — API changes, output format changes, availability monitoring, taxonomy mapping updates.

We have an informal rule now: a new tool needs to demonstrably improve pipeline output quality on a meaningful percentage of real-world inputs before it gets added to production. "This tool is cool and could theoretically help" isn't good enough. Show us the improvement on actual data.

Three of the original 55 tools have since been retired because their contribution didn't justify their maintenance cost. That's healthy. A platform that only grows and never shrinks is accumulating dead weight.

The Meta-Lesson

Building Conductor taught us that orchestration at scale is fundamentally a systems engineering problem, not an AI problem. The hard parts weren't the AI agents — those were relatively straightforward. The hard parts were failure isolation, taxonomy normalization, pipeline composition, and observability. The boring infrastructure that makes the interesting parts reliable.

If you're building something similar, spend your first month on the orchestration layer. Get failure handling, logging, and pipeline composition right before you add your first tool. The tools are the easy part. The platform that runs them is the hard part.