Kausthub Jadhav

CodeGraph Cut Atlas Review Cost by 68%

We added CodeGraph to AtlasEngine reviews and reran the benchmark across six real OSS pull requests. Mean review cost fell 67.6%, tokens dropped 70.3%, and no repo regressed on cost.

Tags: CodeGraph, Benchmarks, Engineering

In the earlier Atlas review path, the agent used grep for almost every symbol it needed to inspect.

That worked, but it got expensive fast. On a 663k-LOC TypeScript repo, the reviewer burned 532k tokens and made nine grep/read calls before it returned a single useful review.

CodeGraph changes that path. Instead of making the agent search through the repo repeatedly, we give it a semantic index it can query at review time. Ask who calls a function, what a symbol touches, or where a path leads, and the answer comes back through the graph.
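To make the idea concrete, here is a minimal sketch of a "who calls this function" lookup over a reverse call-graph index. It is not CodeGraph's actual API or storage format: the edge list, `build_callers_index`, `who_calls`, and `transitive_callers` are all hypothetical names used for illustration, and a real index would be built by parsing the repo rather than from a hard-coded list.

```python
from collections import defaultdict

# Hypothetical call edges (caller, callee). A real index would come from
# parsing the repository, not from a hand-written list.
CALL_EDGES = [
    ("handle_request", "parse_headers"),
    ("handle_request", "send_response"),
    ("retry_request", "handle_request"),
    ("parse_headers", "normalize_key"),
]


def build_callers_index(edges):
    """Map each callee to the set of functions that call it directly."""
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)
    return callers


def who_calls(index, symbol):
    """Return the direct callers of `symbol`, sorted for stable output."""
    return sorted(index.get(symbol, ()))


def transitive_callers(index, symbol):
    """Return every function that can reach `symbol` through the call graph."""
    seen = set()
    stack = [symbol]
    while stack:
        current = stack.pop()
        for caller in index.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                stack.append(caller)
    return sorted(seen)


index = build_callers_index(CALL_EDGES)
print(who_calls(index, "parse_headers"))
# ['handle_request']
print(transitive_callers(index, "normalize_key"))
# ['handle_request', 'parse_headers', 'retry_request']
```

The point of the sketch is the cost profile: once the index exists, each question is a dictionary lookup or a short graph walk, instead of another grep pass whose output the model has to read.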

We measured the change across six real OSS pull requests in five languages. Mean per-review cost fell 67.6%. Tokens dropped 70.3%. The agent made 74.4% fewer non-graph tool calls. No repo regressed on cost.

Mean cost saved: 67.6% ($2.38 → $0.77 per review across the matrix)
Tokens saved: 70.3% (445k → 132k mean tokens per review)
Fewer tool calls: 74.4% (7.2 → 1.8 non-graph tool calls on average)

The first result was wrong

Earlier, we tested CodeGraph against AtlasEngine itself and called it a no-go.

That run was not fake. It showed what it showed: grep/read calls dropped, but wall-clock barely moved and cost did not improve enough to matter.

The mistake was treating that as the whole story.

We missed two things:

  1. The corpus had one repo in one language family. CodeGraph helps most on larger codebases with broader call graphs, which we had not really tested.
  2. We were watching duration too closely. Duration is noisy. Cost and tokens tell you how much code the model had to drag through the review loop.

So we reran the experiment with a broader matrix.

The benchmark

We picked six OSS pull requests across TypeScript, Python, Rust, Kotlin, and C#. The repos ranged from 31k LOC to 1.06M LOC.

For each PR, we cut two local branches from main, merged the same upstream PR into both, and ran Atlas review once per branch:

atlas-without-codegraph/<repo>-pull-<N>
atlas-with-codegraph/<repo>-pull-<N>

One arm had CodeGraph off. One arm had CodeGraph on. Same PR, same base, same merged diff. The model and prompt stayed fixed.
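The branch setup can be sketched as a small command planner. This is an illustration under stated assumptions, not the script we ran: `branch_names` and `setup_commands` are hypothetical helpers, the remote name `origin` is assumed, and `pull/<N>/head` is the GitHub-style ref for a pull request, so other hosts would need a different fetch step.

```python
def branch_names(repo, pr_number):
    """Return the (without, with) CodeGraph branch names for one PR."""
    suffix = f"{repo}-pull-{pr_number}"
    return (
        f"atlas-without-codegraph/{suffix}",
        f"atlas-with-codegraph/{suffix}",
    )


def setup_commands(repo, pr_number, base="main", remote="origin"):
    """Build the git commands that create both arms from the same base.

    Assumes a GitHub-style remote where the PR head is exposed as
    pull/<N>/head. The commands are returned rather than executed so the
    plan can be reviewed first.
    """
    pr_ref = f"pr-{pr_number}"
    commands = [["git", "fetch", remote, f"pull/{pr_number}/head:{pr_ref}"]]
    for branch in branch_names(repo, pr_number):
        # Cut each arm from the same base, then merge the same PR head.
        commands.append(["git", "switch", "-c", branch, base])
        commands.append(["git", "merge", "--no-edit", pr_ref])
    return commands


for command in setup_commands("requests", 7463):
    print(" ".join(command))
```

Each command list could be passed to `subprocess.run(command, check=True)` inside a checkout of the repo. Because the two merge commits get different hashes, comparing `git rev-parse <branch>^{tree}` for both branches is one way to confirm that the arms contain identical content.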

That part matters. If the patch changes between runs, the benchmark is already compromised.

The results

| Repo | Language | LOC | Cost | Tokens | Time | Tool calls |
|---|---|---|---|---|---|---|
| frigate #23244 | TypeScript | 663k | $2.84 → $0.49 (-82.9%) | -88.0% | -5.7% | 9 → 0 |
| fastapi #15006 | Python | 415k | $6.23 → $1.46 (-76.6%) | -78.5% | -39.1% | 14 → 3 |
| uv #18881 | Rust | 678k | $3.15 → $0.99 (-68.6%) | -68.8% | -54.3% | 16 → 5 |
| okhttp #9447 | Kotlin | 202k | $0.45 → $0.33 (-25.1%) | -8.4% | -44.8% | 0 → 0 |
| PowerShell #26688 | C# | 1.06M | $1.29 → $1.04 (-19.2%) | -21.3% | +1.5% | 3 → 2 |
| requests #7463 | Python | 31k | $0.33 → $0.32 (-2.2%) | -0.3% | -22.6% | 1 → 1 |
| Mean | | | $2.38 → $0.77 (-67.6%) | -70.3% | -27.9% | 7.2 → 1.8 |
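The headline cost and tool-call numbers can be reproduced from the table above. The sketch below uses the rounded dollar figures as printed, so it shows how the aggregates are formed rather than re-deriving them from raw billing data. `ROWS` and `pct_drop` are names introduced here for illustration.

```python
# (repo, cost without, cost with, tool calls without, tool calls with),
# copied from the results table above.
ROWS = [
    ("frigate #23244", 2.84, 0.49, 9, 0),
    ("fastapi #15006", 6.23, 1.46, 14, 3),
    ("uv #18881", 3.15, 0.99, 16, 5),
    ("okhttp #9447", 0.45, 0.33, 0, 0),
    ("PowerShell #26688", 1.29, 1.04, 3, 2),
    ("requests #7463", 0.33, 0.32, 1, 1),
]


def pct_drop(before, after):
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100


n = len(ROWS)
mean_cost_before = sum(row[1] for row in ROWS) / n
mean_cost_after = sum(row[2] for row in ROWS) / n
mean_calls_before = sum(row[3] for row in ROWS) / n
mean_calls_after = sum(row[4] for row in ROWS) / n

cost_drop = pct_drop(mean_cost_before, mean_cost_after)
calls_drop = pct_drop(mean_calls_before, mean_calls_after)

# A different statistic: the average of the six per-repo percentage drops.
mean_of_repo_drops = sum(pct_drop(row[1], row[2]) for row in ROWS) / n

print(f"mean cost: ${mean_cost_before:.2f} -> ${mean_cost_after:.2f} ({cost_drop:.1f}% lower)")
# mean cost: $2.38 -> $0.77 (67.6% lower)
print(f"mean tool calls: {mean_calls_before:.1f} -> {mean_calls_after:.1f} ({calls_drop:.1f}% lower)")
# mean tool calls: 7.2 -> 1.8 (74.4% lower)
print(f"mean of per-repo cost drops: {mean_of_repo_drops:.1f}%")
```

Note that the 67.6% figure is the drop in mean cost, so the expensive reviews dominate it. Averaging the six per-repo percentages instead gives a smaller number (about 46% with these rounded inputs), because the low-cost repos with small savings count equally.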

The shape of the win is pretty clear.

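The largest savings came from the PRs where the baseline review searched the most. Frigate, FastAPI, and uv made 9 to 16 non-graph tool calls without CodeGraph and had the largest cost and token drops. Repo size alone did not predict the win: PowerShell, the largest repo at 1.06M LOC, needed only three such calls and saved 19.2% on cost. Requests, the small-repo control, barely moved on cost because grep was already cheap there.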

Wall-clock time improved on five of the six PRs, but less consistently than cost. uv ran 54.3% faster and frigate only 5.7% faster, while PowerShell saved cost and tokens but was basically flat on time (+1.5%).

That makes sense. CodeGraph reduces the cost of reaching the relevant code. It does not remove the reasoning step. Once the model has the right context, it still has to decide whether the change is safe.

What this does not prove

This was a speed and cost benchmark. It was not a full quality scorecard.

That distinction matters. A cheaper review is only useful if it still finds the right issues. CodeGraph should guide navigation, not replace reading the changed hunks. Before turning this on by default, we still want a finding-overlap pass on the same PRs.
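One way a finding-overlap pass could be scored is set overlap between the two arms' findings. The sketch below is an assumption about how such a pass might look, not something we have run: the finding fields (`file`, `line`, `category`), the helper names, and the sample findings are all made-up placeholders.

```python
def finding_key(finding):
    """Identity used for matching: same file, same line, same category."""
    return (finding["file"], finding["line"], finding["category"])


def finding_overlap(baseline, candidate):
    """Compare the findings of two review runs on the same PR."""
    base = {finding_key(f) for f in baseline}
    cand = {finding_key(f) for f in candidate}
    shared = base & cand
    union = base | cand
    return {
        "shared": len(shared),
        "only_baseline": sorted(base - cand),
        "only_candidate": sorted(cand - base),
        # Jaccard similarity; two empty reviews count as identical.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Placeholder findings for illustration only, not benchmark data.
without_graph = [
    {"file": "api/client.py", "line": 42, "category": "null-check"},
    {"file": "api/client.py", "line": 88, "category": "error-handling"},
    {"file": "core/cache.py", "line": 17, "category": "race-condition"},
]
with_graph = [
    {"file": "api/client.py", "line": 42, "category": "null-check"},
    {"file": "core/cache.py", "line": 17, "category": "race-condition"},
    {"file": "core/cache.py", "line": 63, "category": "naming"},
]

report = finding_overlap(without_graph, with_graph)
print(report["shared"], report["jaccard"])
# 2 0.5
print(report["only_baseline"])
# [('api/client.py', 88, 'error-handling')]
```

Exact file-and-line matching is strict, so a real pass would probably need fuzzier matching, and the `only_baseline` list is the part worth reading by hand: those are the issues the cheaper review may have missed.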

There is also a reps caveat. This matrix used one run per arm per repo. The effect size is large enough that we are comfortable treating the result as real, but a second rep would make the confidence stronger.

So the product decision is conservative: ship it behind a per-project toggle, default off, and keep measuring.

Try it on your project

Open a project in Atlas, go to Settings → Review → Policy file, and enable CodeGraph for that project. Let the index finish, run a review, and check the cost line in review.stats.

If your repo is small, you may not see much movement. If your repo is large enough that review spends real time searching and reading nearby code, CodeGraph should have more room to help.

The short version

Our first CodeGraph benchmark was too narrow.

On one repo, it looked like noise. Across six real OSS PRs and five languages, the result flipped: lower cost, lower tokens, fewer tool calls, and no cost regressions.

The lesson is not “graphs magically review code.” They do not.

The lesson is simpler: if an AI reviewer spends a lot of money finding the right code, a good graph can make the review much cheaper before the reasoning even starts.