Provenant
Attribution-Guided Wiki Indexing

The retrieval layer AI coding agents are missing

Agents read too much and know too little about what they just read. Provenant swaps BM25 over raw files for retrieval over LLM-written wiki pages, measures which retrieved pages were actually cited, and repairs the index when confidence drops, all without replacing Copilot or Cursor.

AI coding agents are now standard infrastructure. GitHub Copilot alone has over 1.8 million paid users. The recurring failure mode is not model quality — it is retrieval: keyword overlap picks files, the agent ingests the whole bundle, and nobody checks which files actually shaped the answer.

$18K
Approximate annual wasted context spend for a 100-engineer org at 50 agent queries/day (GPT-4o pricing) — files retrieved but never used in the answer.
60–65×
Token reduction vs naive full-file reading on Flask and Django evals, with answer quality preserved within judge noise.
§

The problem

Pose a question about middleware ordering in a monorepo with thousands of modules. Retrieval hands back a dozen paths; two matter, the rest are cargo. The agent still tokenizes every file behind those paths, so you pay for the whole bundle, and the noise can steer the model toward a plausible-but-wrong edit.

This is not a tuning problem. It is structural. Developers write issue reports in natural language: behavior, intent, symptoms. Source code is written for compilers: identifiers, control flow, type annotations. BM25 matches tokens, not meaning. Ask why "retries never back off under load" and BM25 may never surface utils/retry_policy.py even when that module owns the bug.

Worse: retrieval rarely learns from outcomes. A bad page that keeps ranking highly stays in rotation. Static indexes do not tell you which context the model ignored — there is no health metric and no repair trigger.

"Provenant is not a replacement for Copilot or Cursor. It is the intelligence underneath them."
§

Why now

Three things converged to make this possible and necessary at the same time.

Long-context models made naive retrieval expensive. At 4K tokens, agents retrieved 3 files. At 128K, they retrieve 15–20 — and bill for all of them. The cost of careless retrieval scaled with the capability improvement.

Agents went from experiment to infrastructure. Organizations run thousands of agent queries per day. At that volume, a 60× token inefficiency is a line item, not a rounding error.

LLMs can now write accurate prose summaries. The wiki generation step — summarizing a source file into useful natural language — required a quality threshold that only became reliably achievable in 2024–2025.

§

What Provenant does

Provenant builds a wiki layer once per repo: each file becomes a short LLM-written summary in plain English. Search runs on that prose, so queries meet descriptions instead of identifier soup.

Each answer carries citations. Provenant compares cited wiki pages to the pages it retrieved — that fraction is attribution confidence. When confidence sags, only the ignored pages are queued for rewrite in the background while the user keeps working.

§

Architecture

Wiki generation

Run provenant init once. Tree-sitter pulls symbols, imports, and layout; DeepSeek-V3.2 turns that skeleton into a wiki entry covering intent, public surface, important types/functions, and neighbours worth knowing about.

Pages live in two indices: SQLite FTS5 for BM25, and LanceDB with 768-dimensional embeddings (nomic-embed-text-v1.5 via Fireworks AI) for semantic search. A 70-page Flask-scale index costs approximately $0.05 and takes under two minutes.
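
A minimal sketch of the keyword half of that setup, using Python's built-in sqlite3 module. The table name, column names, sample pages, and the search helper are assumptions made for illustration, not Provenant's actual schema, and the sketch needs an SQLite build with FTS5 enabled.

```python
import sqlite3

# Hypothetical schema: one row per wiki page (path, title, body).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE wiki USING fts5(path UNINDEXED, title, body)")

# Illustrative wiki pages, invented for this sketch.
pages = [
    ("utils/retry_policy.py", "Retry policy",
     "Controls how failed requests are retried and how backoff grows under load."),
    ("http/client.py", "HTTP client",
     "Sends requests and parses responses."),
]
conn.executemany("INSERT INTO wiki VALUES (?, ?, ?)", pages)


def search(query: str, k: int = 10) -> list[str]:
    """Return up to k page paths ranked by FTS5's built-in bm25()."""
    words = query.split()
    if not words:
        return []
    # Quote each word so punctuation in the question is not parsed as FTS5 syntax.
    match_expr = " OR ".join('"' + w.replace('"', '""') + '"' for w in words)
    rows = conn.execute(
        # bm25() returns lower values for better matches, so ascending order is best-first.
        "SELECT path FROM wiki WHERE wiki MATCH ? ORDER BY bm25(wiki) LIMIT ?",
        (match_expr, k),
    ).fetchall()
    return [row[0] for row in rows]


print(search("retries never back off under load"))
```

The query matches on plain-English words such as "under" and "load" that appear in the summary, which is the point of searching prose instead of identifiers.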

Retrieval: BM25 and HyDE

Day-to-day retrieval is BM25 on wiki titles and bodies. When semantic lift is worth the cost, HyDE asks the model to draft a fake mini-answer, embeds that draft, and pulls vector neighbours. Lists are fused with Reciprocal Rank Fusion (k=60), then a lightweight reranker boosts pages whose wording actually overlaps the question before synthesis sees top-k.
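
The fusion step can be sketched as follows. This is the standard Reciprocal Rank Fusion formula with the k=60 constant mentioned above; the function name and the alphabetical tie-break are assumptions for illustration, not Provenant's code.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each page scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, page in enumerate(ranking, start=1):
            scores[page] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda page: (-scores[page], page))


bm25_hits = ["a", "b", "c"]      # hypothetical BM25 ranking
vector_hits = ["b", "d", "a"]    # hypothetical HyDE/vector ranking
print(rrf([bm25_hits, vector_hits]))
```

Pages that appear in both lists rise to the top even when neither list ranks them first, which is why RRF is a common choice for merging keyword and vector results without calibrating their raw scores.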

HyDE is gated: it runs only when vector similarity crosses a threshold. With the 768-dim embeddings the gate opened on 15 of 500 SWE-bench Verified tasks (~3%), mostly in django, and Coverage@10 improved on the tasks where it fired.

Attribution confidence

After synthesis:

confidence = |cited| ÷ |retrieved|, computed from citations the model already emitted. Zero extra model calls. On 20 Django probes it correlates with judge scores at Pearson r = 0.415.

High confidence: the answer leaned on what you fetched. Low confidence: lots of retrieved pages, few citations — either irrelevant context or muddy summaries. Log it per query and you get a cheap dashboard instead of batch-judging everything.
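
The ratio above is simple enough to sketch directly. The function name, the handling of citations to pages that were never retrieved, and the example paths are assumptions for illustration.

```python
def attribution_confidence(
    retrieved: list[str], cited: list[str]
) -> tuple[float, list[str]]:
    """Return (|cited| / |retrieved|, uncited pages in retrieval order)."""
    unique_retrieved = list(dict.fromkeys(retrieved))  # de-duplicate, keep order
    if not unique_retrieved:
        return 0.0, []
    # Assumption: citations to pages that were never retrieved are ignored.
    cited_retrieved = set(cited) & set(unique_retrieved)
    uncited = [page for page in unique_retrieved if page not in cited_retrieved]
    return len(cited_retrieved) / len(unique_retrieved), uncited


# Hypothetical query: five pages retrieved, one cited in the answer.
conf, uncited = attribution_confidence(
    retrieved=["signals.md", "dispatch.md", "apps.md", "urls.md", "admin.md"],
    cited=["signals.md"],
)
print(conf, uncited)
```

The uncited list is the natural input to a repair step, since those are the pages that were paid for but did not shape the answer.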

Automatic self-healing

Below confidence 0.35, repair runs asynchronously, so the user-facing answer is not blocked. Each retrieved-but-uncited page reloads its source file and gets a targeted rewrite prompt: the page was fetched but unused; tighten scope and align wording with the file on disk. Updates land in both SQLite FTS5 and LanceDB; each page then cools down for 300s so bursts of similar questions do not thrash the same entry.

asyncio.create_task(_bg_repair(uncited))

Repairs never touch cited pages. In our Django run (1,393 wiki pages), the worst batch flagged 10 uncited entries — 0.7% of the index. One cycle cost about two cents, not a re-index bill.
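
A rough sketch of how the threshold, background task, and cooldown could fit together. The 0.35 threshold and 300s cooldown come from the text; rewrite_page, maybe_schedule_repair, and the in-memory cooldown table are hypothetical stand-ins, and the real rewrite would call an LLM and update both indices.

```python
import asyncio
import time
from typing import Optional

REPAIR_THRESHOLD = 0.35   # from the text: repair below this confidence
COOLDOWN_SECONDS = 300    # from the text: per-page cooldown

_last_repaired: dict = {}   # page -> monotonic timestamp of last repair
_background: set = set()    # strong references so pending tasks are not garbage-collected


async def rewrite_page(page: str) -> None:
    """Hypothetical placeholder for the LLM rewrite plus index update."""
    await asyncio.sleep(0)


async def _bg_repair(uncited: list) -> list:
    """Rewrite uncited pages that are not cooling down; return the ones rewritten."""
    now = time.monotonic()
    due = [
        page for page in uncited
        if now - _last_repaired.get(page, float("-inf")) >= COOLDOWN_SECONDS
    ]
    for page in due:
        _last_repaired[page] = now
        await rewrite_page(page)
    return due


def maybe_schedule_repair(confidence: float, uncited: list) -> Optional[asyncio.Task]:
    """Schedule repair without blocking the caller; must run inside an event loop."""
    if confidence >= REPAIR_THRESHOLD or not uncited:
        return None
    task = asyncio.create_task(_bg_repair(uncited))
    _background.add(task)
    task.add_done_callback(_background.discard)
    return task
```

Holding a reference to the task matters because the event loop keeps only weak references to tasks, so a fire-and-forget create_task call can be collected before it finishes.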

§

Benchmark results

Evaluations target SWE-bench Verified file localisation: 500 merged issues across 12 Python codebases, each with gold files from the accepted patch. The question is whether retrieval surfaces a touched file in the top ranks — not patch generation or test runs.

Headline metric: Coverage@k — share of tasks with any gold path in the top-k predictions. MRR is reported where noted.
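
For concreteness, both metrics can be computed as below. Coverage@k follows the definition just given; MRR uses the standard definition (reciprocal rank of the first gold file, zero if none is found), which is an assumption since the text does not spell it out. Function names and the toy data are illustrative only.

```python
def coverage_at_k(predictions: list[list[str]], gold: list[set[str]], k: int) -> float:
    """Share of tasks where any gold path appears in the top-k predictions."""
    hits = sum(1 for pred, g in zip(predictions, gold) if set(pred[:k]) & g)
    return hits / len(gold)


def mean_reciprocal_rank(predictions: list[list[str]], gold: list[set[str]]) -> float:
    """Average of 1/rank of the first gold path per task (0 when none is ranked)."""
    total = 0.0
    for pred, g in zip(predictions, gold):
        for rank, path in enumerate(pred, start=1):
            if path in g:
                total += 1.0 / rank
                break
    return total / len(gold)


# Toy example with three tasks (invented paths, not benchmark data).
preds = [["a.py", "b.py"], ["c.py", "d.py"], ["e.py"]]
gold = [{"b.py"}, {"x.py"}, {"e.py"}]
print(coverage_at_k(preds, gold, k=1), coverage_at_k(preds, gold, k=2))
print(mean_reciprocal_rank(preds, gold))
```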

File localization

| Method | C@5 | C@10 | MRR |
| --- | --- | --- | --- |
| Raw BM25 on source files | 56.2% | 69.0% | 0.404 |
| Provenant BM25-on-wiki | 63.8% | 70.8% | 0.447 |
| + reranker + selective HyDE | 66.2% | 75.2% | 0.454 |

Raw file BM25: 56.2% C@5. Wiki BM25: 63.8% — +7.6 points on the same 500 tasks. C@10 inches up less because the win is ranking: the right file shows up higher, not just inside the top ten.

Per-repository breakdown

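| Repository | N | BM25 wiki C@10 | + rerank + HyDE C@10 | ΔC@10 |
| --- | --- | --- | --- | --- |
| astropy | 22 | 72.7% | 86.4% | +13.7 |
| pytest | 19 | 57.9% | 68.4% | +10.5 |
| pylint | 10 | 60.0% | 70.0% | +10.0 |
| sphinx | 44 | 45.5% | 52.3% | +6.8 |
| django | 231 | 74.0% | 79.2% | +5.2 |
| sklearn | 32 | 53.1% | 56.2% | +3.1 |
| matplotlib | 34 | 85.3% | 88.2% | +2.9 |
| sympy | 75 | 74.7% | 76.0% | +1.3 |
| xarray | 22 | 81.8% | 77.3% | −4.5 |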

Eight of the nine repos listed improve. HyDE fired mostly in django. On astropy (+13.7 C@10) and pytest (+10.5), reranking alone did the work, which is useful if you do not want embedding infra on day one.

xarray regressed (−4.5 C@10): tickets already name DataArray-class symbols; vector search added plausible neighbours that were not in the patch. Per-repo gates are the obvious fix.

Token efficiency

| Repository | Wiki tokens/query | Naive tokens/query | Reduction |
| --- | --- | --- | --- |
| Flask (30 questions) | 1,070 | 69,044 | 64.5× |
| Django (20 questions) | 994 | 59,634 | 60.0× |

Roughly 400 lines of source become a ~150-token wiki page. Blind judge on 20 Django questions: mean score delta −0.15 (wiki vs full files), eight ties. That is within judge noise, not a quality collapse, while input tokens fall 60–65×.

Attribution confidence as a quality signal

| Confidence bucket | N | Avg quality (1–5) |
| --- | --- | --- |
| Low (0.0–0.2) | 4 | 4.50 |
| Mid (0.4–0.6) | 9 | 4.89 |
| High (0.8–1.0) | 7 | 5.00 |

Self-healing: early results

We exercised repair on the four lowest-confidence Django items — 10 pages rewritten (0.7% of index), ~$0.02 total, then re-queried.

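| Query topic | Conf before | Conf after | Quality before | Quality after |
| --- | --- | --- | --- | --- |
| Signal dispatch | 0.20 | 0.40 | 4 | 5 |
| Transactions | 0.20 | 0.20 | 5 | 5 |
| Test client | 0.20 | 0.40 | 5 | 4 |
| F() expressions | 0.20 | 0.20 | 4 | 5 |
| Average | 0.20 | 0.30 | 4.50 | 4.75 |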
§

Why wiki retrieval generalizes

Gains track how far issue prose drifts from symbol names. Documentation-heavy trackers (sphinx, pylint, pytest) jump the most; codebases where tickets already name APIs (sympy, xarray) move less because keyword search was never starving.

Think of three dialects: executable source, human ticket text, and the wiki bridge written for search. Provenant lives in the third dialect so BM25 stops guessing identifiers it never saw in the question.

xarray's −4.5 C@10 slip is the mirror case: shared tokens like DataArray already align tickets and code, so extra semantic retrieval invited look-alike modules that were not in the patch. Repo-specific gates for HyDE should claw that back without dulling django-sized wins.

§

The self-improving index

Pair citation-derived confidence with surgical repair and you get a self-improving codebase index: every question leaves a trace of which summaries helped, and only the ignored summaries get rewritten.

Nightly full re-indexes throw away good pages and scale with repo size. Provenant's loop edits the ~0.7% with evidence of failure, spends cents per cycle, and wakes up only when confidence says something was off.

Hot paths see more queries, more citation signal, and faster polish; cold files stay untouched until someone asks. Usage and retrieval quality drift in the same direction instead of rotting evenly.

§

vs. existing tools

| | Copilot / Cursor | Sourcegraph Cody | Provenant |
| --- | --- | --- | --- |
| Retrieval over | Raw source | Raw source | LLM wiki pages |
| Attribution signal | | | |
| Self-healing index | | | |
| Token efficiency | Baseline | Baseline | 60–65× less |
| Works without embeddings | | | |

Treat Provenant as a plug-in retrieval plane: stack it under Copilot, Cursor, or Cody rather than ripping them out.

§

Cost & scale

| Action | Cost | Time |
| --- | --- | --- |
| Index Flask (70 files) | $0.05 | < 2 min |
| Index Django (1,393 files) | ~$1.00 | ~25 min |
| Per query, Provenant (wiki) | ~$0.002 | |
| Per query, naive full-file | ~$0.13 | |
| Repair cycle (10 pages) | ~$0.02 | background |
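
As a back-of-envelope check on these figures, the one-time indexing cost can be compared against the per-query saving. The numbers are the approximate ones from the table; the break_even_queries helper is a hypothetical name introduced for this sketch.

```python
import math

NAIVE_COST_PER_QUERY = 0.13    # ~$ per query, naive full-file (from the table)
WIKI_COST_PER_QUERY = 0.002    # ~$ per query, Provenant wiki (from the table)


def break_even_queries(index_cost: float) -> int:
    """Queries needed before per-query savings cover the one-time index cost."""
    saving_per_query = NAIVE_COST_PER_QUERY - WIKI_COST_PER_QUERY
    return math.ceil(index_cost / saving_per_query)


print(break_even_queries(1.00))   # Django index, ~$1.00
print(break_even_queries(0.05))   # Flask index, $0.05
```

With a saving of roughly $0.128 per query, the ~$1.00 Django index pays for itself after about 8 queries and the Flask index after the first one.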
§

MCP tools

Provenant exposes capabilities as MCP tools — callable from Claude Code, Cursor, and any MCP-compatible agent.

| Tool | Description |
| --- | --- |
| provenant_ask | Answer a natural language question about the repository |
| provenant_context | Retrieve wiki context for specific files or symbols |
| provenant_search | Search wiki pages by keyword or semantic query |
| provenant_dead_code | Identify unreferenced files and symbols |
| provenant_risk | Assess change risk for a given file |
| provenant_why | Explain why a file was retrieved for a query |
| provenant_overview | High-level repository summary |
| provenant_symbol | Look up a specific function, class, or constant |
§

Limitations

Repair evidence is still thin: one large repo, twenty judged questions. Confidence measures grounding via citations, not correctness. HyDE costs an extra model call on ~3% of traffic. Bad init-time summaries cap retrieval quality until repair catches them.

§

Stack

| Component | Technology |
| --- | --- |
| Parsing | tree-sitter |
| BM25 index | SQLite FTS5 |
| Vector store | LanceDB (768-dim) |
| Embeddings | nomic-embed-text-v1.5 via Fireworks AI |
| Synthesis | DeepSeek-V3.2 |
| Async repair | Python asyncio |
| Web UI | Next.js |
| API | FastAPI |
| LLM routing | LiteLLM |
§

Getting started

# Index a repository
provenant init /path/to/your/repo

# Ask a question
provenant ask "where is upload validation enforced?"

The live prototype runs against pallets/flask — fully indexed and queryable. Provenant ships as an MCP server alongside any agent that supports the Model Context Protocol.

+7.6
pp Coverage@5 over raw BM25 on SWE-bench Verified (500 tasks, 12 repos).
r=0.415
Pearson correlation between attribution confidence and LLM judge quality (Django, n=20).

Use it, measure citations, let the weak pages heal. Token spend falls 60–65× with judge scores barely moving — that is the economic case. The technical case is file localisation, live attribution, and repair that touches less than 1% of the index per cycle.

Provenant
Retrieval that measures what it retrieved, repairs what it got wrong, and stops billing you for files the model never read. Questions or collaboration — reach out on GitHub.