Blog | Maplehill Labs

Evidence, Not Confidence, Is the Agent UX

2026-07-06

Production agents should earn trust by showing evidence of what they saw, changed, verified, and left unresolved instead of asking users to trust confident summaries.

The New Agent Stack Is a Control Plane

2026-06-29

Production agents need more than prompts and tools. They need a control plane for permissions, state, observability, approvals, and recovery.

Agent Reliability Is Mostly Plumbing

2026-06-22

Reliable agents depend less on one clever prompt than on queues, timeouts, locks, idempotency, structured outputs, snapshots, and backpressure around the model.

The Difference Between Retry and Recovery

2026-06-15

Retries repeat an operation; recovery updates state after understanding why the operation failed. Reliable agents need both, but they are not the same control loop.

What Should Go in an Agent Trace

2026-06-08

A useful agent trace preserves the request, context, tool evidence, decisions, validation, boundaries, and final state so failures can become debuggable regression cases.

Why Agent Observability Is Harder Than Logging

2026-06-01

Agent observability needs more than logs: it has to preserve what the agent saw, why it acted, what tools returned, and where drift entered the run.

How Agents Fail Quietly

2026-05-25

The most dangerous agent failures are often quiet: plausible answers, partial completion, skipped verification, stale assumptions, and swallowed tool errors.

Why We Prefer Rules Before LLM Judges

2026-05-18

LLM judges are useful for ambiguous agent behavior, but the reliability stack gets cheaper, faster, and easier to debug when deterministic checks run first.

The Evaluation Pyramid for AI Agents

2026-05-11

Reliable agents need layers of evaluation: deterministic checks at the base, LLM judgment for ambiguity, and human review where consequences matter.

Evals Are Unit Tests for Agent Behavior

2026-05-04

Recurring agent failures should not live in Slack threads or postmortems. They should become regression tests that keep the same mistake from coming back silently.

Security Agents Are Becoming the Product

2026-05-02

Claude Mythos, Claude Security, and OpenAI Codex Security point toward the same shift: AI security tools are moving from scanners to agentic remediation systems.

Claude Code Channels: Event-Driven Agents in Your Terminal (and How It Compares to OpenClaw)

2026-03-22

Claude Code Channels let external systems push events (webhooks, alerts, chat) into a live Claude Code session via an MCP channel server. Here’s the mental model, a minimal webhook example, and a neutral comparison to OpenClaw.

NVIDIA NemoClaw: What It Is, What It Adds to OpenClaw, and Why It Matters for Always‑On Agents

2026-03-19

NVIDIA’s NemoClaw isn’t an OpenClaw alternative — it’s a stack that installs the OpenShell runtime and policy guardrails to make always‑on agents safer and more deployable. Here’s what was announced and how to think about it.

The Clawverse: OpenClaw, ZeroClaw, NanoClaw, and NullClaw (and what they optimize for)

2026-03-03

A practical comparison of a new wave of personal AI assistant runtimes — not by hype, but by what they optimize for: portability, isolation, footprint, and extensibility.

Garbage In, Garbage Out

2026-02-25

Your agent is only as good as the data you feed it. Bad input doesn't just produce bad output — it produces convincingly bad output.

Bouncer: Teaching a Folder to Reject Bad Files

2026-02-24

We built a file monitoring agent that runs 12 AI quality checks on every incoming file and fixes the easy stuff automatically.

Agent Runner: Stop Writing Glue Code for Your AI Agents

2026-02-20

We got tired of rewriting the same orchestration boilerplate every time we deployed an agent. So we built a framework that handles it.

Your Agent Doesn't Need a Database. It Needs Redis.

2026-02-10

State management for AI agents — why Redis beats Postgres for checkpoints, pub/sub coordination, and ephemeral state.

Serverless Agents Are Underrated

2026-01-22

Why our data pipeline is fully serverless, when it makes sense for agent workloads, and when it doesn't.

55 Tools, 18 Agents, 1 Platform: Lessons from Building Conductor

2026-01-08

What happens when you try to orchestrate that many moving parts. An engineering retrospective, not a case study.

Dual-Evaluation Architecture: Rules First, LLM Second

2025-11-05

The architectural pattern behind our verification platform. Why running deterministic checks before LLM judgment makes systems faster, cheaper, and more trustworthy.

The MCP Bet

2025-09-10

We went all-in on Model Context Protocol for Beacon instead of building custom integrations. Here's why, and what we learned.

Searching Your Life from the Command Line

2025-09-05

We built an MCP server that lets Claude search your Limitless Pendant recordings. Turns out, making your life queryable changes how you work.

Giving Your AI a Memory It Won't Forget

2025-08-15

AI assistants are brilliant in the moment and useless the next day. We built an MCP server for Mem.ai that fixes the amnesia problem.

Why We Chose Claude Over GPT for Production Agents

2025-07-18

We've built production systems on both. Here's an honest breakdown of what pushed us toward Claude for agent workloads.

Building Autonomous Systems That Actually Work

2025-06-15

Nine production systems later, here's what we've learned about the gap between AI demos and AI that runs your business.