Most AI code review tools are bad agents.
They take a diff, send it to an LLM, and dump whatever comes back into your PR. No context about the codebase. No memory of what your team cares about. No judgment about when to stay quiet. They optimize for coverage -- find as many things as possible -- when they should optimize for signal -- find the things that matter.
The result is predictable: engineers ignore every AI comment. The real findings get buried in noise. The tool becomes a checkbox that nobody reads.
DiffScope is built on a different premise: a review agent should behave like a good senior engineer. It sees the diff in context -- the callers, the type hierarchy, the dependency graph. It learns what your team cares about and suppresses what they don't. And it can prove, with numbers, that it's getting better over time.
The agent loop
DiffScope can operate as a simple pipe (diff in, comments out) or as an iterative agent with --agent-review. In agent mode, the model enters a tool-calling loop:
pub async fn run_agent_loop(
    adapter: &dyn LLMAdapter,
    initial_request: ChatRequest,
    tools: &[Box<dyn ReviewTool>],
    config: &AgentLoopConfig, // max_iterations: 10, token budget
) -> Result<AgentLoopResult>
The model can call tools to look up symbol definitions, fetch file content, query git blame, and traverse the dependency graph. Each tool result gets appended to the conversation, and the model iterates until it's satisfied or hits the budget.
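The shape of that loop can be sketched in a few lines. The types below stand in for DiffScope's `LLMAdapter` and `ReviewTool` machinery and are illustrative, not the actual API:

```rust
// Minimal sketch of a tool-calling loop. `ModelTurn`, `call_model`, and
// `run_tool` are illustrative stand-ins, not DiffScope's real types.
enum ModelTurn {
    ToolCall { name: String, args: String },
    Final(String),
}

fn run_loop(
    mut call_model: impl FnMut(&[String]) -> ModelTurn,
    mut run_tool: impl FnMut(&str, &str) -> String,
    max_iterations: usize,
) -> Option<String> {
    let mut transcript: Vec<String> = Vec::new();
    for _ in 0..max_iterations {
        match call_model(&transcript) {
            // Each tool result is appended and the model iterates.
            ModelTurn::ToolCall { name, args } => {
                let result = run_tool(&name, &args);
                transcript.push(format!("{name}({args}) -> {result}"));
            }
            // The model is satisfied: emit the final review.
            ModelTurn::Final(comments) => return Some(comments),
        }
    }
    None // iteration/token budget exhausted
}
```

The important property is the exit condition: the loop ends when the model decides it has enough context, or when the budget says it has to.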
This is how DiffScope handles complex reviews. A single LLM call with a diff produces surface-level comments. An agent that can look things up, follow references, and build understanding produces comments that sound like they came from someone who read the codebase.
The difference matters most on large changes. A 500-line diff touching auth code needs the model to understand the session management pattern, the middleware chain, and the test coverage. The agent loop lets it explore that context iteratively instead of guessing from the diff alone.
Why context is the whole game
Here's the core insight: the quality of a code review is determined by the reviewer's context, not their intelligence.
A junior engineer who understands the codebase catches more bugs than a senior engineer reading a diff cold. AI code review has the same problem -- the model is smart enough, it just doesn't know enough.
DiffScope has a dual-mode symbol indexer that gives the model the same context a human reviewer would have:
Regex mode (default) walks the repo, applies language-specific patterns (fn/struct/impl in Rust, class/def in Python, func/type in Go), and builds a bidirectional dependency graph -- which files depend on which other files.
LSP mode (opt-in) spawns your actual language server (rust-analyzer, pylsp, gopls) via JSON-RPC for precise go-to-definition and workspace symbol queries.
Both produce a SymbolIndex with dependency tracking. When you change a function, DiffScope traverses reverse dependencies to pull in callers. When you modify a type, it finds every usage. The model sees blast radius, not just the diff.
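The blast-radius traversal is a breadth-first walk over reverse-dependency edges. A minimal sketch, with a plain map standing in for the `SymbolIndex` (the names and shape here are illustrative):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Sketch: given reverse-dependency edges (file -> files that depend on it),
// collect every file transitively affected by a change. Illustrative only;
// DiffScope's SymbolIndex tracks symbols as well as files.
fn blast_radius<'a>(
    reverse_deps: &HashMap<&'a str, Vec<&'a str>>,
    changed: &'a str,
) -> HashSet<String> {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([changed]);
    while let Some(file) = queue.pop_front() {
        for dep in reverse_deps.get(file).into_iter().flatten() {
            if seen.insert(dep.to_string()) {
                queue.push_back(*dep); // follow transitive callers too
            }
        }
    }
    seen
}
```

Change `session.rs` and the walk pulls in the middleware that calls it, then the API handlers that call the middleware: exactly the files a human reviewer would open.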
Recent work pushed this further. Trait contract edges follow interface implementations -- change a trait method and DiffScope pulls in every struct that implements it. Graph-ranked semantic retrieval biases context selection toward structurally related files rather than just textually similar ones. The persistent graph cache means the index rebuilds incrementally instead of from scratch.
Confidence scoring: teaching an agent when to shut up
Every comment starts at a baseline of 0.70. The scoring engine adjusts based on what it detects:
- SQL injection, command injection, XSS: +0.20
- Hardcoded secrets: +0.25
- Auth issues (JWT, CSRF): +0.15 to +0.20
- Weak crypto: +0.15
- CWE reference present: +0.10
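The arithmetic is simple: additive adjustments on the 0.70 baseline, clamped to a valid confidence. A sketch (the tag names and detection logic here are illustrative, not DiffScope's internals):

```rust
// Sketch of additive confidence scoring from a 0.70 baseline, using the
// adjustments listed above. Tags are illustrative; real detection is
// pattern- and context-based.
fn score_comment(tags: &[&str]) -> f64 {
    let mut score: f64 = 0.70;
    for tag in tags {
        score += match *tag {
            "sql-injection" | "command-injection" | "xss" => 0.20,
            "hardcoded-secret" => 0.25,
            "auth" => 0.15, // JWT/CSRF issues: 0.15 to 0.20
            "weak-crypto" => 0.15,
            "cwe-reference" => 0.10,
            _ => 0.0, // style suggestions stay at baseline
        };
    }
    score.min(1.0) // clamp to a valid confidence
}
```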
A SQL injection finding lands at 0.95+. A style suggestion stays at 0.70. You set --min-confidence per path in your config:
paths:
  "src/api/**":
    review_instructions: |
      Prioritize auth and input validation.
    severity_overrides:
      security: error
  "tests/**":
    severity_overrides:
      style: suggestion
The threshold is the product decision. Your security team wants --min-confidence 0.60 on the API layer. Your frontend team wants --min-confidence 0.85 on components. Each team defines its own noise tolerance. An agent that doesn't let you control its verbosity isn't respecting your time.
The feedback loop: an agent that learns
This is what makes DiffScope fundamentally different from a tool that pipes diffs to an LLM.
When a reviewer accepts or rejects a suggestion, DiffScope records the feedback and builds a ConventionStore. The suppression logic uses Wilson score confidence intervals -- a statistical technique that accounts for sample size. Three rejections out of three carry different weight than thirty rejections out of forty.
Patterns with consistently low acceptance get suppressed. Patterns with high acceptance get boosted. The store tracks file-pattern context so suppressions learned from test files don't bleed into API code.
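The Wilson arithmetic itself is short. A sketch, where the z-value (1.96, for a 95% interval) is standard and the suppression cutoff is an illustrative choice, not DiffScope's actual parameter:

```rust
// Upper bound of the 95% Wilson score interval for a pattern's acceptance
// rate, given `accepted` out of `total` reviewer decisions. Suppress only
// when even the optimistic bound is low.
fn wilson_upper(accepted: u32, total: u32) -> f64 {
    if total == 0 {
        return 1.0; // no data: the pattern might still be fine
    }
    let n = total as f64;
    let p = accepted as f64 / n;
    let z = 1.96_f64; // 95% confidence
    let z2 = z * z;
    let center = p + z2 / (2.0 * n);
    let margin = z * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt();
    (center + margin) / (1.0 + z2 / n)
}

fn should_suppress(accepted: u32, total: u32) -> bool {
    wilson_upper(accepted, total) < 0.45 // illustrative cutoff
}
```

With these numbers, zero acceptances out of three leaves an upper bound near 0.56 -- too little data to suppress -- while ten out of forty drops it to roughly 0.40, which is strong evidence the pattern is unwanted.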
Recent work added feedback learning lift measurement -- analytics that quantify the noise reduction. "Your team rejected 47 'consider adding error handling' comments. DiffScope now suppresses that pattern, saving an estimated 12 minutes per review."
Why does this matter? Because every AI agent starts noisy. The question is whether it stays noisy. Most tools do. DiffScope gets quieter over time and can prove it with numbers. That's the difference between a tool teams tolerate and a tool teams trust.
Multi-judge verification
For high-stakes reviews, DiffScope runs a verification pass with multiple LLM judges. Each judge evaluates accuracy, line correctness, and whether the suggestion is technically sound.
Three consensus modes: Any, Majority, All. Verification batches 6 comments per call for cost efficiency.
This matters for the same reason peer review matters in science: one model can hallucinate confidently. Two models hallucinating the same thing in the same way is much rarer. The multi-judge pattern catches the confident-but-wrong findings that a single model would let through.
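The consensus logic reduces to counting verdicts. A sketch mirroring the three modes (the judging itself, and these type names, are illustrative):

```rust
// Sketch: consensus over independent judge verdicts. Mode names mirror the
// Any / Majority / All options; how each judge reaches its verdict is out
// of scope here.
enum Consensus {
    Any,
    Majority,
    All,
}

fn passes(verdicts: &[bool], mode: &Consensus) -> bool {
    let yes = verdicts.iter().filter(|v| **v).count();
    match mode {
        Consensus::Any => yes >= 1,
        Consensus::Majority => 2 * yes > verdicts.len(),
        Consensus::All => yes == verdicts.len(),
    }
}
```

`All` is the strict setting for high-stakes paths; `Any` keeps recall high when you'd rather see a questionable finding than miss a real one.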
The five-stage filter pipeline
Between the LLM and your PR, every comment passes through a DAG of postprocessing stages:
- Deduplication -- merge identical findings across hunks
- Feedback annotation -- mark as previously accepted/rejected by your team
- Suppression -- remove patterns your team has consistently dismissed
- Blast radius assessment -- annotate how many files and functions are affected
- Multi-judge verification -- optional consensus check
Five chances for a bad comment to die before it reaches your PR. The ones that survive are worth reading.
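Structurally, the pipeline is just a chain of stages, each free to drop or annotate comments before passing them on. A sketch with two toy stages (the stage internals and struct fields are illustrative):

```rust
// Sketch of the postprocessing pipeline: each stage takes the surviving
// comments and returns a (possibly smaller) set. Fields and stage logic
// are illustrative, not DiffScope's internals.
struct Comment {
    text: String,
    confidence: f64,
    duplicate: bool,
}

type Stage = fn(Vec<Comment>) -> Vec<Comment>;

fn dedupe(cs: Vec<Comment>) -> Vec<Comment> {
    cs.into_iter().filter(|c| !c.duplicate).collect()
}

fn suppress_low_confidence(cs: Vec<Comment>) -> Vec<Comment> {
    cs.into_iter().filter(|c| c.confidence >= 0.70).collect()
}

fn run_pipeline(stages: &[Stage], mut comments: Vec<Comment>) -> Vec<Comment> {
    for stage in stages {
        comments = stage(comments); // each stage is a chance for a bad comment to die
    }
    comments
}
```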
Model-agnostic: the agent owns the decision
DiffScope supports OpenAI, Anthropic, Ollama, and any OpenAI-compatible endpoint. Multiple model roles -- Primary for review, Weak for triage, Reasoning for deep analysis -- let you use a cheap model for filtering and an expensive one for the findings that survive.
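The role split amounts to routing each phase of the review to a model tier. A sketch (the phase strings here are illustrative labels, not DiffScope flags):

```rust
// Sketch: routing review phases to model roles. The role names come from
// the text; the phase labels are illustrative.
#[derive(Debug, PartialEq)]
enum Role {
    Weak,      // cheap model: triage and filtering
    Primary,   // main review pass
    Reasoning, // deep analysis of survivors
}

fn role_for_phase(phase: &str) -> Role {
    match phase {
        "triage" => Role::Weak,
        "deep-analysis" => Role::Reasoning,
        _ => Role::Primary,
    }
}
```

The economics are the point: most candidate comments die in triage, so the expensive model only ever sees the findings worth its tokens.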
The Ollama support matters for regulated teams. Financial services, healthcare, defense -- your code stays on your hardware. The quality tradeoff (local 7B vs. cloud frontier) is yours to make, per-repo or per-path.
# Cloud
git diff | diffscope review --model claude-sonnet-4-20250514
# Local -- code never leaves your machine
git diff | diffscope review --base-url http://localhost:11434 --model ollama:codellama
Getting started
cargo install diffscope
git diff main | diffscope review
diffscope pr --number 123 --post-comments
For CI, the GitHub Action posts inline comments on every PR:
- uses: evalops/diffscope@v1
  with:
    model: gpt-4o
    openai-api-key: ${{ secrets.OPENAI_API_KEY }}
    post-comments: true
The goal isn't replacing human reviewers. It's catching the obvious stuff -- injections, unchecked errors, missing auth -- before a human spends time on it. Let your senior engineers focus on architecture and design. Let the agent catch the bugs.