Agents and evals
Agent systems, code-review loops, evaluation infrastructure, and the work required to make AI reliable outside a demo.
The AI work I keep returning to: orchestration, feedback loops, measurable behavior, and where autonomy breaks down.
Start with the posts on operating agents, then move into review, evals, and local context.
Read first
- Orchestrating AI Coding Agents: What I Learned Running Three Autonomous Sessions at Once (5 min)
I ran three concurrent AI coding agents across four repos. They shipped 20+ PRs, wrote 100+ posts, and handled real review and CI work.
- DiffScope: What Happens When You Give a Code Review Agent Real Context (5 min)
Most AI review tools see a diff. DiffScope sees the diff, the callers, the type hierarchy, the team history, and knows when to shut up. Here is how.
- The Evaluation Infrastructure We Need: Why AI Testing Is Fundamentally Broken (3 min)
Existing evaluation infrastructure was built for deterministic software. AI systems are probabilistic, context-dependent, and non-reproducible. The...
- Building Kestrel: A Context-Aware AI Desktop Assistant in One Session (4 min)
How I built a full LittleBird clone with screen context reading, meeting recording, arena mode, and MCP tool support — from scratch to packaged .app in a single coding session.
Everything else
- AI Code Review Is Reasoning, Not Pattern Matching (3 min)
AI code reviewers moved from rules-based checking to reasoning-based analysis. The gap between what they catch and what humans catch is closing fast.
- The Shift to Async Code Gen: What It Means for Developers (2 min)
Async code generation turns development into specification and review. The coding happens in the background. This changes what it means to be a senior...
- Building the HTTP for Agents: A Complete Guide to Agent Infrastructure (4 min)
Autonomous agents need the same infrastructure primitives that web services got a decade ago: identity, policy, and secrets as first-class citizens.
- Beyond Simple Prompts: Production-Grade LLM Techniques with DSPy (2 min)
The best AI companies don't write prompts by hand. They generate them programmatically, test them systematically, and optimize them continuously.
- Prompt Engineering Science: I Tested Temperature and Top-P on 1000 Queries (3 min)
Systematic experiments on temperature and top-p sampling parameters across 1000 real queries with empirical data on creativity, coherence, and...
- How RAG Actually Works: Architecture Patterns That Scale (3 min)
Deep dive into RAG architectures: chunking strategies, retrieval methods, embedding optimization, and production patterns with research-backed analysis.
- When Claude Hits Its Limits: Building an AI-to-AI Escalation System (3 min)
Different LLMs have different strengths. Routing tasks to the right model -- like heterogeneous compute -- turns out to be more valuable than using one...
- Two Minds in the Machine: Shared Context Is the Only Thing That Matters (3 min)
I added Gemini to a codebase that already had Claude embedded. The useful discovery was about shared context files, not model capabilities.
- When AI Learns to Write Like You: A Meta-Analysis (2 min)
I asked Claude to analyze my writing style across my blog posts. The patterns it found -- and the ones I didn't know I had -- were genuinely surprising.
- The AI Agent Gold Rush: Why Everyone's Building Picks and Shovels (3 min)
Most AI agent infrastructure is premature. The agents themselves barely work. The industry is selling Formula 1 equipment to people still learning to...