Journal

Blog

Thoughts on building autonomous systems, engineering culture, and the future of AI.

How Agents Fail Quietly

2026-05-25

The most dangerous agent failures are often quiet: plausible answers, partial completion, skipped verification, stale assumptions, and swallowed tool errors.

Why We Prefer Rules Before LLM Judges

2026-05-18

LLM judges are useful for ambiguous agent behavior, but the reliability stack gets cheaper, faster, and easier to debug when deterministic checks run first.

The Evaluation Pyramid for AI Agents

2026-05-11

Reliable agents need layers of evaluation: deterministic checks at the base, LLM judgment for ambiguity, and human review where consequences matter.

Evals Are Unit Tests for Agent Behavior

2026-05-04

Recurring agent failures should not live in Slack threads or postmortems. They should become regression tests that keep the same mistake from coming back silently.

Security Agents Are Becoming the Product

2026-05-02

Claude Mythos, Claude Security, and OpenAI Codex Security point toward the same shift: AI security tools are moving from scanners to agentic remediation systems.

Garbage In, Garbage Out

2026-02-25

Your agent is only as good as the data you feed it. Bad input doesn't just produce bad output; it produces convincingly bad output.

The MCP Bet

2025-09-10

We went all in on the Model Context Protocol (MCP) for Beacon instead of building custom integrations. Here's why, and what we learned.