How AI Reviews Code: Why Verification Beats Diff-Only Review
AI code review works best when the model can inspect context, verify suspicions, and turn vague comments into grounded findings.
The first trap in AI code review is thinking the model’s job is to read a diff and immediately produce findings.
That’s the easy version to build. It is also the version that gets noisy.
When people talk about AI review, they usually mean this:
Take this diff. Find bugs. Return JSON.
That can work. For a small patch, it will catch obvious things: missing awaits, inverted conditions, deleted checks, broken imports.
But after building AtlasEngine, I don’t think that’s what good AI review actually is.
The useful version looks more like this:
1. Start with a clear review scope.
2. Read the diff.
3. Pull in the surrounding code.
4. Form a small suspicion.
5. Check whether that suspicion is real.
6. Only then write the finding.
The important part is step 5.
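The six steps can be sketched as a small pipeline in which verification gates what gets reported. This is a minimal sketch, not AtlasEngine's implementation: `Suspicion`, `VerifiedFinding`, `reviewWithVerification`, and the `verify` callback are hypothetical names, and in a real system both forming and verifying a suspicion would likely be model calls with tool access.

```typescript
// Hypothetical types for illustration; not AtlasEngine's actual schema.
interface Suspicion {
  file: string;
  line: number;
  claim: string;
}

interface VerifiedFinding extends Suspicion {
  evidence: string; // what was checked to confirm the claim
}

// Returns evidence if the suspicion holds, or null if it could not be confirmed.
type Verifier = (s: Suspicion) => string | null;

// Step 5 as a gate: only suspicions that come back with evidence become findings.
function reviewWithVerification(suspicions: Suspicion[], verify: Verifier): VerifiedFinding[] {
  const findings: VerifiedFinding[] = [];
  for (const s of suspicions) {
    const evidence = verify(s);
    if (evidence !== null) {
      findings.push({ ...s, evidence });
    }
  }
  return findings;
}

// Usage with a stubbed verifier that confirms only the cache-key suspicion.
const suspicions: Suspicion[] = [
  { file: "src/findings.ts", line: 7, claim: "cache key uses findingId instead of reviewId" },
  { file: "src/findings.ts", line: 3, claim: "update may throw on missing id" },
];
const stubVerify: Verifier = (s) =>
  s.line === 7 ? "getReviewCacheKey builds keys from reviewId" : null;

const reported = reviewWithVerification(suspicions, stubVerify);
console.log(reported.length); // 1
```

The unverified suspicion is dropped silently here; a real system might keep it in the notes without showing it to the developer.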
[Diagram: noisy path vs. useful path]
Diff-only review is noisy
If the model only sees a patch, it has to infer too much.
It can see that a line changed. It can’t always see the contract around that line.
It doesn’t know who calls the function, whether the field is optional everywhere, whether the defensive check is redundant, or whether the cache key matches the writer.
When the model can’t answer those questions, it guesses. Some guesses are useful. Many become vague comments that a developer has to re-check manually.
That’s where a lot of AI review noise comes from.
The better version is allowed to look around
The stronger pattern is to let the model behave more like a reviewer with a terminal:
Review this branch against main.
Run git diff.
Read changed files.
Follow imports and callers when needed.
Verify each suspected issue before reporting it.
Return only grounded findings.
Yes, this takes longer than one prompt.
But the thing we learned is that review quality usually isn’t limited by how fast the model can navigate files. It’s limited by whether the model can decide if a suspected issue is real.
One benchmark branch we used for Atlas had 67 changed files and roughly 28k lines moved. That kind of change isn’t hard because searching with rg (ripgrep) is slow. It’s hard because the reviewer has to work out which of those changes broke a contract. Letting the model read a few more files is often worth the extra seconds.
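One way to picture "a reviewer with a terminal" is a bounded loop in which the model picks the next action and a harness executes it. The sketch below is an assumption about how such a loop could look, not a description of AtlasEngine: `Action`, `Tools`, `explore`, and the `nextAction` callback (standing in for the model) are hypothetical, and the step budget is just one simple way to cap the extra time.

```typescript
// Hypothetical action set; a real harness would likely expose more tools.
type Action =
  | { kind: "read"; path: string }
  | { kind: "search"; pattern: string }
  | { kind: "done"; notes: string };

interface Tools {
  read(path: string): string;
  search(pattern: string): string[];
}

// Stands in for the model: it sees the transcript so far and picks the next step.
type Policy = (transcript: string[]) => Action;

function explore(diff: string, tools: Tools, nextAction: Policy, maxSteps = 20): string {
  const transcript: string[] = [`DIFF:\n${diff}`];
  for (let step = 0; step < maxSteps; step++) {
    const action = nextAction(transcript);
    if (action.kind === "done") {
      return action.notes;
    }
    if (action.kind === "read") {
      transcript.push(`READ ${action.path}:\n${tools.read(action.path)}`);
    } else {
      transcript.push(`SEARCH ${action.pattern}:\n${tools.search(action.pattern).join("\n")}`);
    }
  }
  // Budget exhausted: return no notes rather than guess.
  return "";
}

// In-memory stand-ins so the loop can run without a real repository.
const files: Record<string, string> = {
  "src/cache.ts": "export const getReviewCacheKey = (reviewId: string) => 'review:' + reviewId;",
};
const memoryTools: Tools = {
  read: (path) => files[path] ?? "",
  search: (pattern) => Object.keys(files).filter((p) => (files[p] ?? "").includes(pattern)),
};

// A scripted "model": search once, read the hit, then stop with notes.
const scripted: Policy = (transcript) => {
  if (transcript.length === 1) return { kind: "search", pattern: "review:" };
  if (transcript.length === 2) return { kind: "read", path: "src/cache.ts" };
  return { kind: "done", notes: "cache keys are built from reviewId" };
};

const notes = explore("(diff text)", memoryTools, scripted);
console.log(notes);
```

With a budget too small to finish, the loop returns empty notes, which matches the idea that an unverified suspicion should not be reported.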
A small example
Say a change does this:
  async function resolveFinding(findingId: string) {
    const finding = await db.finding.update({
      where: { id: findingId },
      data: { status: 'resolved' },
    });
-   await cache.invalidate(`review:${finding.reviewId}`);
+   await cache.invalidate(`review:${findingId}`);
    return finding;
  }
A diff-only review might say:
Possible cache invalidation issue. Check whether the cache key should use reviewId.
That’s better than nothing. But it’s still asking the human to finish the review.
An agentic reviewer can check:
rg "review:" src
And then find something like:
  function getReviewCacheKey(reviewId: string) {
    return `review:${reviewId}`;
  }
If that helper is what writes the cache, the contract is clear. Review data is keyed by reviewId, not by finding id.
Now the comment becomes specific:
Bug: resolveFinding now invalidates review:<findingId>, but review data is cached under review:<reviewId>. After resolving a finding, the review panel can keep showing stale findings until another invalidation path runs.
That’s the difference. The model didn’t just notice a suspicious line. It checked the contract and turned the suspicion into a real finding.
Suspicion: This cache invalidation looks wrong.
Evidence: The writer uses reviewId as the key.
Finding: The UI can show stale findings after resolve.
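The stale-read behavior claimed in the finding can be reproduced with an in-memory stand-in. This is a sketch under assumptions: `MemoryCache`, the sample ids, and the cached payload are invented for illustration, and only `getReviewCacheKey` mirrors the helper shown above.

```typescript
// Minimal in-memory cache standing in for the real one (assumed API: set/get/invalidate).
class MemoryCache {
  private store = new Map<string, string>();
  set(key: string, value: string): void {
    this.store.set(key, value);
  }
  get(key: string): string | undefined {
    return this.store.get(key);
  }
  invalidate(key: string): void {
    this.store.delete(key);
  }
}

function getReviewCacheKey(reviewId: string): string {
  return `review:${reviewId}`;
}

const cache = new MemoryCache();
const reviewId = "r1"; // hypothetical ids
const findingId = "f9";

// The writer caches review data under the review id.
cache.set(getReviewCacheKey(reviewId), "1 open finding");

// Changed code: invalidates a key built from the finding id, which was never written.
cache.invalidate(`review:${findingId}`);
const afterChanged = cache.get(getReviewCacheKey(reviewId)); // entry is still present

// Original code: invalidates the key the writer actually used.
cache.invalidate(getReviewCacheKey(reviewId));
const afterOriginal = cache.get(getReviewCacheKey(reviewId)); // entry is gone

console.log({ afterChanged, afterOriginal });
```

A check like this is the kind of evidence that lets the reviewer say "can keep showing stale findings" instead of "possible cache issue".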
What the model is doing
LLMs don’t review code like compilers. They’re pattern machines with a lot of learned code context.
That’s fine, as long as the review system is designed around it.
The model is strongest when you make it gather evidence:
- This changed line looks suspicious.
- What contract was it supposed to satisfy?
- Where is this value read?
- What state can reach this branch?
- Is there a test that proves the new behavior?
The weak mode is asking it to judge a large patch from partial context and produce a long list.
Verification reduces false positives. It doesn’t remove them. Some contracts live outside the repo: production data, user behavior, team convention, rollout plans. But even with that limit, verified findings are much more useful than first-pass guesses.
Two passes, not one
One implementation detail surprised us: it helps to separate discovery from formatting.
The first pass should feel like a terminal session. Give the model a light prompt, the diff, permission to explore, and space to take notes. If you force strict JSON too early, recall drops. The model spends attention on the shape of the answer while it’s still trying to understand the code.
The second pass can be strict. Take the raw notes and turn them into schema-shaped findings: dedupe them, attach evidence, and drop anything still too weak to show the developer.
The rule is simple: the second pass may structure what discovery found. It shouldn’t invent new findings.
Explore first. Verify as you go. Structure at the end.
That split keeps recall high without flooding the developer with noise. It’s also why “return JSON” works better as the second step than as the first instruction.
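Whether the second pass is a model call or plain code, its contract can be sketched as a function over the discovery notes. This is a minimal sketch with assumed names: the `RawNote` shape, its self-reported `confidence`, the 0.6 threshold, and the dedupe key are illustrative choices, not values from AtlasEngine.

```typescript
// Hypothetical shape of a note taken during the exploratory first pass.
interface RawNote {
  file: string;
  line: number;
  claim: string;
  evidence?: string;
  confidence: number; // 0..1, as reported during discovery
}

interface StructuredFinding {
  file: string;
  line: number;
  claim: string;
  evidence: string;
}

function structureFindings(notes: RawNote[], minConfidence = 0.6): StructuredFinding[] {
  const byKey = new Map<string, RawNote>();
  for (const note of notes) {
    // Drop anything without evidence or still too weak to show the developer.
    if (!note.evidence || note.confidence < minConfidence) continue;
    const key = `${note.file}:${note.line}:${note.claim.trim().toLowerCase()}`;
    const existing = byKey.get(key);
    // Dedupe: keep the higher-confidence copy of the same note.
    if (!existing || note.confidence > existing.confidence) {
      byKey.set(key, note);
    }
  }
  // The output is derived only from input notes; nothing new is invented here.
  return Array.from(byKey.values()).map((n) => ({
    file: n.file,
    line: n.line,
    claim: n.claim,
    evidence: n.evidence ?? "",
  }));
}

const structured = structureFindings([
  { file: "src/findings.ts", line: 7, claim: "Invalidates wrong cache key", evidence: "getReviewCacheKey uses reviewId", confidence: 0.9 },
  { file: "src/findings.ts", line: 7, claim: "invalidates wrong cache key ", evidence: "writer keys by reviewId", confidence: 0.7 },
  { file: "src/findings.ts", line: 3, claim: "update may throw", confidence: 0.8 },
  { file: "src/api.ts", line: 40, claim: "possible race", evidence: "two writers seen", confidence: 0.3 },
]);
console.log(structured.length); // 1
```

Of the four notes, one is a duplicate, one has no evidence, and one is below the threshold, so a single finding survives.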
What AtlasEngine adds
The model still does the reasoning. Atlas gives that reasoning a workflow:
- a branch or PR scope
- a trusted baseline
- path-aware review policy
- persisted findings and outcomes
- reruns when the code changes
That’s the part a chat prompt doesn’t give you. Atlas makes the review repeatable enough to use in an engineering workflow.
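As an illustration of what a path-aware review policy could mean, the sketch below maps path prefixes to a review strictness. Everything here is hypothetical: the `PathRule` shape, the rule values, and `policyFor` are invented for this example and are not AtlasEngine's configuration format.

```typescript
type Strictness = "skip" | "normal" | "strict";

// Hypothetical rule shape: a path prefix and how carefully to review under it.
interface PathRule {
  prefix: string;
  strictness: Strictness;
}

const rules: PathRule[] = [
  { prefix: "src/billing/", strictness: "strict" },
  { prefix: "src/", strictness: "normal" },
  { prefix: "vendor/", strictness: "skip" },
];

// Longest matching prefix wins; unmatched paths fall back to "normal".
function policyFor(path: string, pathRules: PathRule[]): Strictness {
  let best: PathRule | undefined;
  for (const rule of pathRules) {
    if (path.startsWith(rule.prefix) && (!best || rule.prefix.length > best.prefix.length)) {
      best = rule;
    }
  }
  return best ? best.strictness : "normal";
}

console.log(policyFor("src/billing/invoice.ts", rules)); // "strict"
console.log(policyFor("vendor/lib.js", rules)); // "skip"
```

Keeping a policy like this in the repository, rather than in a prompt, is one way the same review can be rerun with the same rules after the code changes.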
The short version
The bad flow is diff in, findings out. Fast, cheap, and often noisy.
The better flow gives the model a scope, lets it inspect context, and makes it verify the thing it thinks it found.
AI reviews code well when it’s allowed to check its own suspicion before showing it to a human.