Dual-Evaluation Architecture: Rules First, LLM Second

When we built our document verification platform, the first prototype was pure LLM. Upload a document, send it to Claude with a detailed prompt, get back a verification result. It worked. Sometimes. The results were inconsistent, the costs were high, and we couldn't explain to the client exactly why the system approved or rejected a specific document.
The production system looks nothing like that prototype. It uses what we call a dual-evaluation architecture — deterministic rule checks first, LLM judgment second — and it's become our default pattern for any system that needs to be fast, cheap, and trustworthy.
The Pattern
The idea is simple. Before an LLM touches a document, run it through a gauntlet of deterministic checks:
Extraction. Pull out structured data using rules, patterns, and OCR. Field names, values, dates, identifiers. No AI needed for this — it's pattern matching against known document formats.
Validation. Check the extracted data against rules. Is this date in the future? Is this value within the expected range? Does this identifier match a known format? Is this field present at all? These are binary checks with unambiguous answers.
Cross-referencing. Compare extracted data against known databases. Does this issuer exist? Does this policy number match our records? Is this entity on the expected list? Again, no judgment needed — just lookups.
Only after a document passes (or fails) these deterministic stages does it move to the LLM evaluation layer. And by that point, the LLM isn't being asked to do everything — it's being asked to handle the specific ambiguities that the rules couldn't resolve.
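The routing described above can be sketched in a few lines. This is a minimal illustration, not the platform's actual code: the field names, the `KNOWN_POLICIES` lookup, and the `evaluate_with_llm` stub are all assumptions standing in for the real extraction, validation, and cross-reference stages.

```python
from dataclasses import dataclass, field

KNOWN_POLICIES = {"P-100", "P-200"}  # stand-in for the carrier database lookup


@dataclass
class CheckResult:
    verdict: str                     # "pass", "fail", or "ambiguous"
    reasons: list = field(default_factory=list)


def evaluate_with_llm(doc: dict, open_questions: list) -> str:
    # Placeholder for the LLM layer; the real system sends a focused prompt.
    return "llm_review"


def run_deterministic_checks(doc: dict) -> CheckResult:
    reasons = []
    # Validation: binary checks with unambiguous answers.
    if "policy_number" not in doc:
        return CheckResult("fail", ["missing required field: policy_number"])
    if doc.get("expiry_date", "") < doc.get("as_of_date", ""):  # ISO date strings
        return CheckResult("fail", ["document expired before as-of date"])
    # Cross-referencing: pure lookups, no judgment.
    if doc["policy_number"] not in KNOWN_POLICIES:
        reasons.append("policy number not found in carrier records")
    return CheckResult("ambiguous" if reasons else "pass", reasons)


def route(doc: dict) -> str:
    result = run_deterministic_checks(doc)
    if result.verdict in ("pass", "fail"):
        return result.verdict                      # most volume stops here
    return evaluate_with_llm(doc, result.reasons)  # only ambiguous cases
```

The key design choice is that `run_deterministic_checks` returns three verdicts, not two: a clear pass and a clear fail both short-circuit, and only "ambiguous" falls through to the model.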
Why This Ordering Matters
70% of documents never need the LLM. This was the number that surprised us most. The majority of documents either clearly pass all deterministic checks (and can be approved without LLM review) or clearly fail on something unambiguous (wrong date, missing required field, invalid format) and can be rejected immediately. The LLM only gets involved in the genuinely ambiguous cases.
Cost drops dramatically. If you're processing 500 documents a day and only 150 of them need LLM evaluation, you've just cut your inference costs by 70%. At scale, this is the difference between a system that's economically viable and one that isn't.
Speed goes through the roof. Deterministic checks run in milliseconds; an LLM call takes seconds. When 70% of your volume completes in milliseconds instead of seconds, mean latency drops several-fold and the LLM stops being the bottleneck for most of your traffic.
Auditability becomes trivial. When a document is rejected by a deterministic check, the explanation is precise: "Field X has value Y, which violates rule Z." No interpretation needed. No wondering what the model was thinking. The client can see exactly what happened and either fix the document or dispute the rule. For compliance use cases, this kind of clarity isn't optional.
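The cost and latency figures above reduce to simple arithmetic. The per-call cost and latency constants below are illustrative assumptions, not real pricing; only the 500-docs-a-day volume and the 70% split come from the text.

```python
# Back-of-envelope numbers: 500 docs/day, 70% resolved by rules alone.
DOCS_PER_DAY = 500
RULES_SHARE = 0.70
LLM_COST_PER_DOC = 0.05    # dollars per LLM call (illustrative assumption)
LLM_LATENCY_S = 5.0        # seconds per LLM call (illustrative assumption)
RULES_LATENCY_S = 0.005    # deterministic checks run in milliseconds

llm_docs = round(DOCS_PER_DAY * (1 - RULES_SHARE))   # 150 docs hit the LLM
daily_cost = llm_docs * LLM_COST_PER_DOC             # vs DOCS_PER_DAY * LLM_COST_PER_DOC all-LLM
mean_latency = RULES_SHARE * RULES_LATENCY_S + (1 - RULES_SHARE) * LLM_LATENCY_S
```

Under these assumed constants, inference cost falls by exactly the 70% routed away from the model, and mean latency drops from 5 seconds to roughly 1.5.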
The LLM Layer Gets Better Input
Here's the subtler benefit that took us a while to appreciate. By the time a document reaches the LLM evaluation layer, it's been fully extracted, validated, and enriched with cross-reference data. The LLM isn't looking at a raw document and trying to figure everything out from scratch. It's looking at structured, validated data with specific questions to answer.
The prompt goes from:
"Review this insurance certificate and determine if it meets requirements."
To something more like:
"The following certificate has been extracted and validated. All required fields are present. The coverage amount ($2M) meets the minimum threshold. However, the policy effective date is 3 days before the contract start date. Given the 30-day retroactive coverage clause in the requirements, assess whether this timing gap is acceptable."
That's a much easier question for a model to answer well. The context is clean, the question is specific, and the model can focus on judgment rather than extraction. The quality of the LLM output goes up because the quality of the LLM input goes up.
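Composing that kind of focused prompt from pipeline output is mostly string assembly. A minimal sketch, assuming hypothetical field names and a made-up template (the real prompt is not shown in this post):

```python
def build_prompt(extracted: dict, flags: list) -> str:
    """Turn validated pipeline output plus unresolved flags into a narrow prompt."""
    lines = [
        "The following certificate has been extracted and validated.",
        f"Coverage amount: ${extracted['coverage_amount']:,} "
        f"(minimum required: ${extracted['required_minimum']:,}).",
        f"Policy effective date: {extracted['effective_date']}.",
        f"Contract start date: {extracted['contract_start']}.",
    ]
    # Only the ambiguities the rules could not resolve become questions.
    for flag in flags:
        lines.append(f"Open question: {flag}")
    lines.append("Assess only the open questions above; all other checks passed.")
    return "\n".join(lines)
```

Because every value has already been validated upstream, the prompt can state facts ("all required fields are present") instead of asking the model to discover them.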
Building the Rule Layer
The rule layer isn't as simple as a bunch of if statements. For our verification platform, we have a 9-stage pipeline with rules at each stage:
Stages 1-3: Extraction. Multiple extraction methods run in parallel — pattern-based, template-based, and OCR-based. Results are compared and reconciled. If two methods disagree on a field value, that field gets flagged for LLM review rather than picking a winner automatically.
Stages 4-5: Structural validation. Schema checks, required field checks, format validation. This catches the obvious problems — missing pages, wrong document type, incomplete forms.
Stages 6-7: Business logic. Domain-specific rules. Coverage minimums, date range validations, entity matching against carrier databases. These rules are configurable per client because every client has different requirements.
Stage 8: Cross-reference. Carrier intelligence — checking extracted data against known patterns for specific issuers. Does this carrier typically issue certificates in this format? Does this policy number follow the expected pattern for this carrier?
Stage 9: LLM evaluation. Only for documents that have ambiguities the rules couldn't resolve. The LLM sees all the context from stages 1-8.
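The reconciliation rule in the extraction stages — disagreement means escalation, never an automatic winner — can be sketched as follows. Method names and the dict shapes are illustrative assumptions:

```python
def reconcile(results: dict) -> tuple:
    """results maps extractor name -> {field: value}; returns (merged, flagged).

    Fields where every extractor agrees go into `merged`; any disagreement
    lands the field in `flagged` for LLM review instead of picking a winner.
    """
    merged, flagged = {}, []
    all_fields = {f for r in results.values() for f in r}
    for f in sorted(all_fields):
        values = {r[f] for r in results.values() if f in r}
        if len(values) == 1:
            merged[f] = values.pop()   # all methods agree
        else:
            flagged.append(f)          # disagreement -> escalate, don't guess
    return merged, flagged
```

An OCR misread of one digit, for example, produces a disagreement on `amount` and routes just that field to the model rather than silently trusting either extractor.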
Each stage produces a structured report. The final output includes the full pipeline trace — which stages passed, which flagged issues, and what the LLM evaluated. A human reviewer can see the entire decision chain.
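A per-stage report and the trace a reviewer sees might look like this. The report shape is an assumption for illustration; the post doesn't specify the actual schema:

```python
from dataclasses import dataclass, asdict


@dataclass
class StageReport:
    stage: str
    passed: bool
    flags: list


def pipeline_trace(reports: list) -> dict:
    """Aggregate stage reports into the full decision chain."""
    return {
        "stages": [asdict(r) for r in reports],
        "needs_llm": any(r.flags for r in reports),      # ambiguity -> stage 9
        "hard_fail": any(not r.passed for r in reports),  # unambiguous reject
    }
```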
When Not to Use This Pattern
It's not universal. If your use case is genuinely open-ended — summarize this article, generate creative content, have a conversation — there aren't meaningful deterministic checks to run first. The LLM is the whole point.
But if your use case has any structure at all — known document types, expected fields, business rules, reference databases — you almost certainly have a deterministic layer waiting to be built. And building it will make everything downstream faster, cheaper, and more reliable.
The pattern also requires upfront investment in understanding the domain. You need to know what the rules are before you can encode them. For our verification platform, we spent weeks studying the document types, interviewing the manual reviewers, and cataloging the decision rules they were applying intuitively. That work paid for itself many times over, but it's not free.
The Broader Principle
The dual-evaluation pattern is really just a specific instance of a broader principle: don't use AI for things that don't require AI.
It sounds obvious, but the current moment in the industry is pushing everyone toward "put an LLM on it" as the default solution. And for genuinely ambiguous, judgment-heavy tasks, that's correct. But most real-world workflows are a mix of deterministic steps and judgment calls. Identifying which parts are which — and building the right tool for each — is where the engineering happens.
The LLM is the most powerful tool in your stack. Use it where it matters. Use cheaper, faster, more predictable tools everywhere else.