Jonathan Haas
June 25, 2025·3 min read

When Claude Hits Its Limits: Building an AI-to-AI Escalation System

Different LLMs have different strengths. Routing tasks to the right model -- like heterogeneous compute -- turns out to be more valuable than using one ...

#ai #mcp #claude #gemini #debugging

Filed under Agents and evals. The AI work I keep returning to: orchestration, feedback loops, measurable behavior, and where autonomy earns or loses permission.

I hit a wall debugging a race condition in a distributed system. Claude Code had analyzed 30 files, but the bug spanned microservices with gigabytes of traces. Claude is excellent at surgical code edits, but correlating thousands of trace spans across services requires a model that can hold all that context at once.

So I built an escalation system. Claude calls Gemini when it needs the 1M-token context window. Gemini returns structured analysis, and Claude uses it to make the fix.

The Architecture

The Model Context Protocol (MCP) made this straightforward. An MCP server sits between Claude Code and Gemini. When Claude recognizes it needs help -- too many files, too much trace data, cross-service correlation -- it calls the escalation tool. The server packages relevant context and routes it to Gemini. Gemini returns structured findings. Claude acts on them.
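The flow described above can be sketched in a few lines of Python. This is an illustration only, not the deep-code-reasoning-mcp code: the `EscalationRequest` and `Finding` shapes, the `package_context` and `escalate` helpers, and the `fake_analyzer` stub standing in for a real Gemini call are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EscalationRequest:
    """What Claude hands to the escalation tool (hypothetical shape)."""
    question: str
    files: dict[str, str]  # path -> file contents
    trace_excerpts: list[str] = field(default_factory=list)


@dataclass
class Finding:
    """Structured result handed back to Claude (hypothetical shape)."""
    summary: str
    suspect_files: list[str]


def package_context(request: EscalationRequest) -> str:
    """Flatten the question, files, and traces into one prompt string."""
    parts = [f"QUESTION: {request.question}"]
    for path, body in sorted(request.files.items()):
        parts.append(f"--- FILE {path} ---\n{body}")
    for i, excerpt in enumerate(request.trace_excerpts):
        parts.append(f"--- TRACE {i} ---\n{excerpt}")
    return "\n\n".join(parts)


def escalate(request: EscalationRequest,
             analyze: Callable[[str], Finding]) -> Finding:
    """Package the context and route it to the large-context analyzer."""
    return analyze(package_context(request))


def fake_analyzer(prompt: str) -> Finding:
    """Stub standing in for a real Gemini call; it only lists the files it saw."""
    suspects = [line.split()[2] for line in prompt.splitlines()
                if line.startswith("--- FILE ")]
    return Finding(summary=f"analyzed {len(prompt)} chars",
                   suspect_files=suspects)
```

Passing the analyzer in as a plain callable keeps the packaging logic testable without network access. A real server would swap the stub for an actual model client.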

The interesting addition is conversational tools. Instead of one-shot analysis, Claude and Gemini engage in multi-turn dialogues. Claude asks follow-ups based on Gemini's responses. The models build on each other's insights in ways single-shot analysis can't achieve.
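One way to picture the multi-turn version is a session object that keeps the whole transcript, so each follow-up is answered with the earlier exchange in view. The sketch below assumes a hypothetical `Conversation` class and an `echo_model` stub in place of the real model. It shows the bookkeeping only, not how the actual server implements its conversational tools.

```python
from dataclasses import dataclass, field
from typing import Callable

Turn = tuple[str, str]  # (role, text)


@dataclass
class Conversation:
    """Keeps the full transcript so each follow-up sees earlier answers."""
    respond: Callable[[list[Turn]], str]  # stand-in for the large-context model
    turns: list[Turn] = field(default_factory=list)

    def ask(self, question: str) -> str:
        # Record the question, then let the model see the whole history.
        self.turns.append(("claude", question))
        answer = self.respond(list(self.turns))
        self.turns.append(("gemini", answer))
        return answer


def echo_model(turns: list[Turn]) -> str:
    """Stub model: reports how many questions it has seen so far."""
    asked = sum(1 for role, _ in turns if role == "claude")
    return f"answer {asked} to: {turns[-1][1]}"
```

The point of the design is that the second question is answered with the first answer already in the history, which a one-shot call cannot do.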

What I Learned

Context preparation is 80% of the work. The hardest part isn't the API calls. It's deciding which files, traces, and logs are actually relevant. Too little context and the analysis fails. Too much and you waste tokens and confuse the model. Most development time went into intelligent context selection.
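To make the trade-off concrete, here is a minimal sketch of budgeted context selection. It rests on assumptions that are not from the project: relevance is a simple keyword count, tokens are estimated at roughly four characters each, and the function names are hypothetical. A real selector would likely use better signals (stack traces, import graphs, recent diffs).

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def relevance(text: str, keywords: list[str]) -> int:
    """Count case-insensitive keyword occurrences as a stand-in score."""
    lowered = text.lower()
    return sum(lowered.count(k.lower()) for k in keywords)


def select_context(files: dict[str, str], keywords: list[str],
                   token_budget: int) -> list[str]:
    """Greedily pick the most relevant files that fit in the token budget."""
    scored = [(relevance(body, keywords), path) for path, body in files.items()]
    # Highest score first; the path breaks ties so the result is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))

    chosen: list[str] = []
    used = 0
    for score, path in scored:
        cost = estimate_tokens(files[path])
        # Skip files with no signal at all, and files that would blow the budget.
        if score == 0 or used + cost > token_budget:
            continue
        chosen.append(path)
        used += cost
    return chosen
```

Both failure modes from the paragraph above show up here: a budget that is too small drops relevant files, and a scoring function that is too loose fills the budget with noise.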

The escalation heuristic matters. Not every problem needs Gemini. The MCP server uses thresholds: file count over 50, trace size over 100 MB, more than 3 services involved, more than 5 failed approaches. Getting this wrong in either direction is expensive -- unnecessary escalations waste money, missed escalations waste developer hours.
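The thresholds listed above translate directly into a small check. The numbers come from the post, but the `ProblemStats` shape and function names are hypothetical. The sketch also assumes that crossing any single threshold is enough to escalate, which the post does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemStats:
    """Hypothetical summary of the current debugging session."""
    file_count: int
    trace_size_mb: float
    services_involved: int
    failed_approaches: int


def escalation_reasons(stats: ProblemStats) -> list[str]:
    """Return every threshold the session has crossed."""
    reasons = []
    if stats.file_count > 50:
        reasons.append("file count over 50")
    if stats.trace_size_mb > 100:
        reasons.append("trace size over 100 MB")
    if stats.services_involved > 3:
        reasons.append("more than 3 services involved")
    if stats.failed_approaches > 5:
        reasons.append("more than 5 failed approaches")
    return reasons


def should_escalate(stats: ProblemStats) -> bool:
    """Assumption: any single crossed threshold triggers escalation."""
    return bool(escalation_reasons(stats))
```

Returning the list of reasons rather than a bare boolean makes it easier to log why an escalation happened, which helps when tuning the thresholds against their cost.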

Multi-turn beats single-shot. Single-shot analysis misses nuances. The conversational approach lets models build iteratively, leading to discoveries neither would make alone. This was the biggest surprise -- multi-turn capability was more valuable than raw context window size.

It's not magic. Sometimes the analysis is wrong. Sometimes the context preparation misses the relevant data. Sometimes the bug is in the one file you didn't include. This is a tool that improves your odds, not a guarantee.

The Bigger Point

Treat LLMs like heterogeneous compute resources. Route tasks to the model best equipped to handle them, the same way you'd pick the right database for the right query pattern. Claude for code edits and reasoning. Gemini for massive context analysis. Smaller models for simple classification tasks.
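In its simplest form, this kind of routing is a lookup table from task type to model. The sketch below is illustrative: the `TaskKind` categories mirror the three examples in the paragraph above, and the route labels are placeholders, not real model identifiers.

```python
from enum import Enum


class TaskKind(Enum):
    CODE_EDIT = "code_edit"
    LARGE_CONTEXT_ANALYSIS = "large_context_analysis"
    CLASSIFICATION = "classification"


# Placeholder labels, not real model IDs.
ROUTES: dict[TaskKind, str] = {
    TaskKind.CODE_EDIT: "claude",
    TaskKind.LARGE_CONTEXT_ANALYSIS: "gemini",
    TaskKind.CLASSIFICATION: "small-model",
}


def route(kind: TaskKind) -> str:
    """Pick the model label assigned to this kind of task."""
    return ROUTES[kind]
```

A static table is the database-selection analogy made literal. Anything smarter, such as cost-aware or load-aware routing, can replace the dictionary lookup without changing callers.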

In two years, developers will orchestrate multiple specialized models routinely. The question is whether you're building the orchestration layer now or waiting for someone else to build it for you.

The deep-code-reasoning-mcp server is open source.
