Most AI code review tools are bad agents.
They take a diff, send it to an LLM, and dump whatever comes back into your PR. No context about the codebase. No memory of what your team cares about. No judgment about when to stay quiet. They optimize for coverage -- find as many things as possible -- when they should optimize for signal -- find the things that matter.
The result is predictable: engineers ignore every AI comment. The real findings get buried in noise. The tool becomes a checkbox that nobody reads.
DiffScope is built on a different premise: a review agent should behave like a good senior engineer. It sees the diff in context -- the callers, the type hierarchy, the dependency graph. It learns what your team cares about and suppresses what they don't. And it can prove, with numbers, that it's getting better over time.
The agent loop
DiffScope can operate as a simple pipe (diff in, comments out) or as an iterative agent with --agent-review. In agent mode, the model enters a tool-calling loop:
pub async fn run_agent_loop(
    adapter: &dyn LLMAdapter,
    initial_request: ChatRequest,
    tools: &[Box<dyn ReviewTool>],
    config: &AgentLoopConfig, // max_iterations: 10, token budget
) -> Result<AgentLoopResult>
The model can call tools to look up symbol definitions, fetch file content, query git blame, and traverse the dependency graph. Each tool result gets appended to the conversation, and the model iterates until it's satisfied or hits the budget.
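The shape of that loop can be sketched in a few lines. The types below stand in for DiffScope's `LLMAdapter` and `ReviewTool` machinery and are illustrative, not the actual API:

```rust
// Minimal sketch of a tool-calling loop. `ModelTurn`, `call_model`, and
// `run_tool` are illustrative stand-ins, not DiffScope's real types.
enum ModelTurn {
    ToolCall { name: String, args: String },
    Final(String),
}

fn run_loop(
    mut call_model: impl FnMut(&[String]) -> ModelTurn,
    mut run_tool: impl FnMut(&str, &str) -> String,
    max_iterations: usize,
) -> Option<String> {
    let mut transcript: Vec<String> = Vec::new();
    for _ in 0..max_iterations {
        match call_model(&transcript) {
            // Each tool result is appended and the model iterates.
            ModelTurn::ToolCall { name, args } => {
                let result = run_tool(&name, &args);
                transcript.push(format!("{name}({args}) -> {result}"));
            }
            // The model is satisfied: emit the final review.
            ModelTurn::Final(comments) => return Some(comments),
        }
    }
    None // iteration/token budget exhausted
}
```

The important property is the exit condition: the loop ends when the model decides it has enough context, or when the budget says it has to.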
This is how DiffScope handles complex reviews. A single LLM call with a diff produces surface-level comments. An agent that can look things up, follow references, and build understanding produces comments that sound like they came from someone who read the codebase.
The difference matters most on large changes. A 500-line diff touching auth code needs the model to understand the session management pattern, the middleware chain, and the test coverage. The agent loop lets it explore that context iteratively instead of guessing from the diff alone.
Why context is the whole game
Here's the core insight: the quality of a code review is determined by the reviewer's context, not their intelligence.
A junior engineer who understands the codebase catches more bugs than a senior engineer reading a diff cold. AI code review has the same problem -- the model is smart enough, it just doesn't know enough.
DiffScope has a dual-mode symbol indexer that gives the model the same context a human reviewer would have:
Regex mode (default) walks the repo, applies language-specific patterns (fn/struct/impl in Rust, class/def in Python, func/type in Go), and builds a bidirectional dependency graph -- which files depend on which other files.
LSP mode (opt-in) spawns your actual language server (rust-analyzer, pylsp, gopls) via JSON-RPC for precise go-to-definition and workspace symbol queries.
Both produce a SymbolIndex with dependency tracking. When you change a function, DiffScope traverses reverse dependencies to pull in callers. When you modify a type, it finds every usage. The model sees blast radius, not just the diff.
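The blast-radius traversal is a breadth-first walk over reverse-dependency edges. A minimal sketch, with a plain map standing in for the `SymbolIndex` (the names and shape here are illustrative):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Sketch: given reverse-dependency edges (file -> files that depend on it),
// collect every file transitively affected by a change. Illustrative only;
// DiffScope's SymbolIndex tracks symbols as well as files.
fn blast_radius<'a>(
    reverse_deps: &HashMap<&'a str, Vec<&'a str>>,
    changed: &'a str,
) -> HashSet<String> {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([changed]);
    while let Some(file) = queue.pop_front() {
        for dep in reverse_deps.get(file).into_iter().flatten() {
            if seen.insert(dep.to_string()) {
                queue.push_back(*dep); // follow transitive callers too
            }
        }
    }
    seen
}
```

Change `session.rs` and the walk pulls in the middleware that calls it, then the API handlers that call the middleware: exactly the files a human reviewer would open.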
Recent work pushed this further. Trait contract edges follow interface implementations -- change a trait method and DiffScope pulls in every struct that implements it. Graph-ranked semantic retrieval biases context selection toward structurally related files rather than just textually similar ones. The persistent graph cache means the index rebuilds incrementally instead of from scratch.
Confidence scoring: teaching an agent when to shut up
Every comment starts at a baseline of 0.70. The scoring engine adjusts based on what it detects:
- SQL injection, command injection, XSS: +0.20
- Hardcoded secrets: +0.25
- Auth issues (JWT, CSRF): +0.15 to +0.20
- Weak crypto: +0.15
- CWE reference present: +0.10
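The arithmetic is simple: additive adjustments on the 0.70 baseline, clamped to a valid confidence. A sketch (the tag names and detection logic here are illustrative, not DiffScope's internals):

```rust
// Sketch of additive confidence scoring from a 0.70 baseline, using the
// adjustments listed above. Tags are illustrative; real detection is
// pattern- and context-based.
fn score_comment(tags: &[&str]) -> f64 {
    let mut score: f64 = 0.70;
    for tag in tags {
        score += match *tag {
            "sql-injection" | "command-injection" | "xss" => 0.20,
            "hardcoded-secret" => 0.25,
            "auth" => 0.15, // JWT/CSRF issues: 0.15 to 0.20
            "weak-crypto" => 0.15,
            "cwe-reference" => 0.10,
            _ => 0.0, // style suggestions stay at baseline
        };
    }
    score.min(1.0) // clamp to a valid confidence
}
```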
A SQL injection finding lands at 0.95+. A style suggestion stays at 0.70. You set --min-confidence per path in your config:
paths:
  "src/api/**":
    review_instructions: |
      Prioritize auth and input validation.
    severity_overrides:
      security: error
  "tests/**":
    severity_overrides:
      style: suggestion
The threshold is the product decision. Your security team wants --min-confidence 0.60 on the API layer. Your frontend team wants --min-confidence 0.85 on components. Each team defines its own noise tolerance. An agent that doesn't let you control its verbosity isn't respecting your time.
The feedback loop: an agent that learns
This is what makes DiffScope fundamentally different from a tool that pipes diffs to an LLM.
When a reviewer accepts or rejects a suggestion, DiffScope records the feedback and builds a ConventionStore. The suppression logic uses Wilson score confidence intervals -- a statistical technique that accounts for sample size. Three rejections out of three carry different weight than thirty rejections out of forty.
Patterns with consistently low acceptance get suppressed. Patterns with high acceptance get boosted. The store tracks file-pattern context so suppressions learned from test files don't bleed into API code.
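The Wilson arithmetic itself is short. A sketch, where the z-value (1.96, for a 95% interval) is standard and the suppression cutoff is an illustrative choice, not DiffScope's actual parameter:

```rust
// Upper bound of the 95% Wilson score interval for a pattern's acceptance
// rate, given `accepted` out of `total` reviewer decisions. Suppress only
// when even the optimistic bound is low.
fn wilson_upper(accepted: u32, total: u32) -> f64 {
    if total == 0 {
        return 1.0; // no data: the pattern might still be fine
    }
    let n = total as f64;
    let p = accepted as f64 / n;
    let z = 1.96_f64; // 95% confidence
    let z2 = z * z;
    let center = p + z2 / (2.0 * n);
    let margin = z * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt();
    (center + margin) / (1.0 + z2 / n)
}

fn should_suppress(accepted: u32, total: u32) -> bool {
    wilson_upper(accepted, total) < 0.45 // illustrative cutoff
}
```

With these numbers, zero acceptances out of three leaves an upper bound near 0.56 -- too little data to suppress -- while ten out of forty drops it to roughly 0.40, which is strong evidence the pattern is unwanted.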
Recent work added feedback learning lift measurement -- analytics that quantify the noise reduction. "Your team rejected 47 'consider adding error handling' comments. DiffScope now suppresses that pattern, saving an estimated 12 minutes per review."
Why does this matter? Because every AI agent starts noisy. The question is whether it stays noisy. Most tools do. DiffScope gets quieter over time and can prove it with numbers. That's the difference between a tool teams tolerate and a tool teams trust.
Multi-judge verification
For high-stakes reviews, DiffScope runs a verification pass with multiple LLM judges. Each judge evaluates accuracy, line correctness, and whether the suggestion is technically sound.
Three consensus modes: Any, Majority, All. Verification batches 6 comments per call for cost efficiency.
This matters for the same reason peer review matters in science: one model can hallucinate confidently. Two models hallucinating the same thing in the same way is much rarer. The multi-judge pattern catches the confident-but-wrong findings that a single model would let through.
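The consensus logic reduces to counting verdicts. A sketch mirroring the three modes (the judging itself, and these type names, are illustrative):

```rust
// Sketch: consensus over independent judge verdicts. Mode names mirror the
// Any / Majority / All options; how each judge reaches its verdict is out
// of scope here.
enum Consensus {
    Any,
    Majority,
    All,
}

fn passes(verdicts: &[bool], mode: &Consensus) -> bool {
    let yes = verdicts.iter().filter(|v| **v).count();
    match mode {
        Consensus::Any => yes >= 1,
        Consensus::Majority => 2 * yes > verdicts.len(),
        Consensus::All => yes == verdicts.len(),
    }
}
```

`All` is the strict setting for high-stakes paths; `Any` keeps recall high when you'd rather see a questionable finding than miss a real one.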
The five-stage filter pipeline
Between the LLM and your PR, every comment passes through a DAG of postprocessing stages:
- Deduplication -- merge identical findings across hunks
- Feedback annotation -- mark as previously accepted/rejected by your team
- Suppression -- remove patterns your team has consistently dismissed
- Blast radius assessment -- annotate how many files and functions are affected
- Multi-judge verification -- optional consensus check
Five chances for a bad comment to die before it reaches your PR. The ones that survive are worth reading.
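Structurally, the pipeline is just a chain of stages, each free to drop or annotate comments before passing them on. A sketch with two toy stages (the stage internals and struct fields are illustrative):

```rust
// Sketch of the postprocessing pipeline: each stage takes the surviving
// comments and returns a (possibly smaller) set. Fields and stage logic
// are illustrative, not DiffScope's internals.
struct Comment {
    text: String,
    confidence: f64,
    duplicate: bool,
}

type Stage = fn(Vec<Comment>) -> Vec<Comment>;

fn dedupe(cs: Vec<Comment>) -> Vec<Comment> {
    cs.into_iter().filter(|c| !c.duplicate).collect()
}

fn suppress_low_confidence(cs: Vec<Comment>) -> Vec<Comment> {
    cs.into_iter().filter(|c| c.confidence >= 0.70).collect()
}

fn run_pipeline(stages: &[Stage], mut comments: Vec<Comment>) -> Vec<Comment> {
    for stage in stages {
        comments = stage(comments); // each stage is a chance for a bad comment to die
    }
    comments
}
```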
Model-agnostic: the agent owns the decision
DiffScope supports OpenAI, Anthropic, Ollama, and any OpenAI-compatible endpoint. Multiple model roles -- Primary for review, Weak for triage, Reasoning for deep analysis -- let you use a cheap model for filtering and an expensive one for the findings that survive.
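The role split amounts to routing each phase of the review to a model tier. A sketch (the phase strings here are illustrative labels, not DiffScope flags):

```rust
// Sketch: routing review phases to model roles. The role names come from
// the text; the phase labels are illustrative.
#[derive(Debug, PartialEq)]
enum Role {
    Weak,      // cheap model: triage and filtering
    Primary,   // main review pass
    Reasoning, // deep analysis of survivors
}

fn role_for_phase(phase: &str) -> Role {
    match phase {
        "triage" => Role::Weak,
        "deep-analysis" => Role::Reasoning,
        _ => Role::Primary,
    }
}
```

The economics are the point: most candidate comments die in triage, so the expensive model only ever sees the findings worth its tokens.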
The Ollama support matters for regulated teams. Financial services, healthcare, defense -- your code stays on your hardware. The quality tradeoff (local 7B vs. cloud frontier) is yours to make, per-repo or per-path.
# Cloud
git diff | diffscope review --model claude-sonnet-4-20250514
# Local -- code never leaves your machine
git diff | diffscope review --base-url http://localhost:11434 --model ollama:codellama
Getting started
cargo install diffscope
git diff main | diffscope review
diffscope pr --number 123 --post-comments
For CI, the GitHub Action posts inline comments on every PR:
- uses: evalops/diffscope@v1
  with:
    model: gpt-4o
    openai-api-key: ${{ secrets.OPENAI_API_KEY }}
    post-comments: true
The goal isn't replacing human reviewers. It's catching the obvious stuff -- injections, unchecked errors, missing auth -- before a human spends time on it. Let your senior engineers focus on architecture and design. Let the agent catch the bugs.