Agents and evals
Agent systems, code-review loops, evaluation infrastructure, and the work required to make AI reliable outside a demo.
The AI work I keep returning to: orchestration, feedback loops, measurable behavior, and where autonomy breaks down.
Start with the posts on operating agents, then move into review, evals, and local context.
Read first
- Orchestrating AI Coding Agents: What I Learned Running Three Autonomous Sessions at Once (5 min)
I ran three concurrent AI coding agents across four repos. They shipped 20+ PRs, wrote 100+ posts, and handled real review and CI work.
- DiffScope: What Happens When You Give a Code Review Agent Real Context (5 min)
Most AI review tools see a diff. DiffScope sees the diff, the callers, the type hierarchy, the team history, and knows when to shut up. Here is how.
- The Evaluation Infrastructure We Need: Why AI Testing Is Fundamentally Broken (3 min)
Existing evaluation infrastructure was built for deterministic software. AI systems are probabilistic, context-dependent, and non-reproducible. The...
- Building Kestrel: A Context-Aware AI Desktop Assistant in One Session (4 min)
How I built a full LittleBird clone with screen context reading, meeting recording, arena mode, and MCP tool support — from scratch to packaged .app in a single coding session.
Everything else
- AI Code Review Is Reasoning, Not Pattern Matching (3 min)
AI code reviewers moved from rules-based checking to reasoning-based analysis. The gap between what they catch and what humans catch is closing fast.
- The Shift to Async Code Gen: What It Means for Developers (2 min)
Async code generation turns development into specification and review. The coding happens in the background. This changes what it means to be a senior...
- Building the HTTP for Agents: A Complete Guide to Agent Infrastructure (4 min)
Autonomous agents need the same infrastructure primitives that web services got a decade ago: identity, policy, and secrets as first-class citizens.
- Beyond Simple Prompts: Production-Grade LLM Techniques with DSPy (2 min)
The best AI companies don't write prompts by hand. They generate them programmatically, test them systematically, and optimize them continuously.
- Prompt Engineering Science: I Tested Temperature and Top-P on 1000 Queries (3 min)
Systematic experiments on temperature and top-p sampling parameters across 1000 real queries with empirical data on creativity, coherence, and...
- How RAG Actually Works: Architecture Patterns That Scale (3 min)
Deep dive into RAG architectures: chunking strategies, retrieval methods, embedding optimization, and production patterns with research-backed analysis.
- When Claude Hits Its Limits: Building an AI-to-AI Escalation System (3 min)
Different LLMs have different strengths. Routing tasks to the right model -- like heterogeneous compute -- turns out to be more valuable than using one...
- Two Minds in the Machine: Shared Context Is the Only Thing That Matters (3 min)
I added Gemini to a codebase that already had Claude embedded. The useful discovery was about shared context files, not model capabilities.
- When AI Learns to Write Like You: A Meta-Analysis (2 min)
I asked Claude to analyze my writing style across my blog posts. The patterns it found -- and the ones I didn't know I had -- were genuinely surprising.
- The AI Agent Gold Rush: Why Everyone's Building Picks and Shovels (3 min)
Most AI agent infrastructure is premature. The agents themselves barely work. The industry is selling Formula 1 equipment to people still learning to...