Provenant: Attribution-Guided Wiki Indexing for AI Coding Agents

AI coding agents are now standard infrastructure. GitHub Copilot alone has over 1.8 million paid users. The recurring failure mode is not model quality — it is retrieval: keyword overlap picks files, the agent ingests the whole bundle, and nobody checks which files actually shaped the answer.

$18K

Approximate annual wasted context spend for a 100-engineer org at 50 agent queries/day (GPT-4o pricing) — files retrieved but never used in the answer.

60–65×

Token reduction vs naive full-file reading on Flask (64.5×) and Django (60.0×) evals, with answer quality preserved within judge noise.

The problem

Pose a question about middleware ordering in a monorepo with thousands of modules. Retrieval hands back a dozen paths; two matter, the rest are cargo. The agent still tokenizes every file you sent: you pay for the whole bundle, and the noise can steer the model toward a plausible-but-wrong edit.

This is not a tuning problem. It is structural. Developers write issue reports in natural language: behavior, intent, symptoms. Source code is written for compilers: identifiers, control flow, type annotations. BM25 matches tokens, not meaning. Ask why "retries never back off under load" and BM25 may never surface utils/retry_policy.py even when that module owns the bug.

Worse: retrieval rarely learns from outcomes. A bad page that keeps ranking highly stays in rotation. Static indexes do not tell you which context the model ignored — there is no health metric and no repair trigger.

"Provenant is not a replacement for Copilot or Cursor. It is the intelligence underneath them."

Why now

Three things converged to make this possible and necessary at the same time.

Long-context models made naive retrieval expensive. At 4K tokens, agents retrieved 3 files. At 128K, they retrieve 15–20 — and bill for all of them. The cost of careless retrieval scaled with the capability improvement.

Agents went from experiment to infrastructure. Organizations run thousands of agent queries per day. At that volume, a 60× token inefficiency is a line item, not a rounding error.

LLMs can now write accurate prose summaries. The wiki generation step — summarizing a source file into useful natural language — required a quality threshold that only became reliably achievable in 2024–2025.

What Provenant does

Provenant builds a wiki layer once per repo: each file becomes a short LLM-written summary in plain English. Search runs on that prose, so queries meet descriptions instead of identifier soup.

Each answer carries citations. Provenant compares cited wiki pages to the pages it retrieved — that fraction is attribution confidence. When confidence sags, only the ignored pages are queued for rewrite in the background while the user keeps working.

Architecture

Wiki generation

Run provenant init once. Tree-sitter pulls symbols, imports, and layout; DeepSeek-V3.2 turns that skeleton into a wiki entry covering intent, public surface, important types/functions, and neighbours worth knowing about.

Pages live in two indices: SQLite FTS5 for BM25, and LanceDB with 768-dimensional embeddings (nomic-embed-text-v1.5 via Fireworks AI) for semantic search. A 70-page Flask-scale index costs approximately $0.05 and takes under two minutes.
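To make the BM25 half of the index concrete, here is a minimal sketch using Python's built-in sqlite3 module and an FTS5 virtual table. The table name, column names, and sample pages are illustrative assumptions, not Provenant's actual schema, and the sketch assumes your SQLite build includes FTS5.

```python
import sqlite3

# Hypothetical schema: one row per wiki page, keyed by the source path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE wiki USING fts5(path UNINDEXED, title, body)")

# Illustrative pages only; real entries would come from the wiki generation step.
pages = [
    ("utils/retry_policy.py", "Retry policy",
     "Decides how long to wait between retries and applies exponential backoff under load."),
    ("app/middleware.py", "Middleware ordering",
     "Registers request middleware and defines the order in which handlers run."),
]
conn.executemany("INSERT INTO wiki (path, title, body) VALUES (?, ?, ?)", pages)


def search(query: str, limit: int = 5) -> list[str]:
    """Return page paths ranked by FTS5's built-in BM25, best match first."""
    # Quote each term so punctuation in the question is not parsed as FTS5 syntax.
    terms = " OR ".join('"' + t.replace('"', '""') + '"' for t in query.split())
    rows = conn.execute(
        "SELECT path FROM wiki WHERE wiki MATCH ? ORDER BY rank LIMIT ?",
        (terms, limit),
    ).fetchall()
    return [path for (path,) in rows]


print(search("retries never back off under load"))
```

The question shares the words "retries", "under", and "load" with the prose summary, so the retry module surfaces even though the question never mentions its identifier.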

Retrieval: BM25 and HyDE

Day-to-day retrieval is BM25 on wiki titles and bodies. When semantic lift is worth the cost, HyDE asks the model to draft a fake mini-answer, embeds that draft, and pulls vector neighbours. Lists are fused with Reciprocal Rank Fusion (k=60), then a lightweight reranker boosts pages whose wording actually overlaps the question before synthesis sees top-k.
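The fusion step can be sketched in a few lines. This is a generic Reciprocal Rank Fusion implementation with k=60 as stated above; the page names and the alphabetical tie-break are assumptions made for illustration, and the reranker is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: a page scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, page in enumerate(ranking, start=1):
            scores[page] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by page name so output is deterministic.
    return sorted(scores, key=lambda page: (-scores[page], page))


# Hypothetical result lists from the BM25 and HyDE/vector retrievers.
bm25_hits = ["retry_policy", "http_client", "settings"]
hyde_hits = ["retry_policy", "queue_worker", "http_client"]

fused = reciprocal_rank_fusion([bm25_hits, hyde_hits])
print(fused)
```

Pages that both retrievers agree on float to the top, while a page seen by only one list still keeps a place in the fused ranking.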

HyDE is gated: it runs only when vector similarity crosses a threshold. On 768-dim embeddings the gate opened on 15 of 500 SWE-bench Verified tasks (3%), mostly in django, and Coverage@10 improved on the tasks where it fired.
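One way such a gate could look, as a sketch only: the text does not say which similarity is compared or what the bar is, so this version assumes the gate checks the best cosine similarity between the query embedding and the wiki page embeddings against a hypothetical threshold of 0.6. The toy vectors are 3-dimensional for readability.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def should_run_hyde(
    query_vec: list[float],
    page_vecs: list[list[float]],
    threshold: float = 0.6,  # hypothetical bar, not a value from the text
) -> bool:
    """Gate: pay for HyDE only if the closest wiki page clears the similarity bar."""
    best = max((cosine(query_vec, v) for v in page_vecs), default=0.0)
    return best >= threshold


print(should_run_hyde([1.0, 0.0, 0.0], [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]))
print(should_run_hyde([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]))
```

Keeping the gate as a pure function makes it easy to swap in a different threshold per repository.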

Attribution confidence

After synthesis:

confidence = |cited| ÷ |retrieved|, computed from citations the model already emitted. Zero extra model calls. On 20 Django probes it correlates moderately with judge scores (Pearson r = 0.415).

High confidence: the answer leaned on what you fetched. Low confidence: lots of retrieved pages, few citations — either irrelevant context or muddy summaries. Log it per query and you get a cheap dashboard instead of batch-judging everything.
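The ratio is simple enough to sketch directly. The function name and page names below are hypothetical; the sketch also assumes that citations to pages outside the retrieved set are ignored, which the text does not specify.

```python
def attribution_confidence(retrieved: list[str], cited: list[str]) -> tuple[float, list[str]]:
    """Return (|cited| / |retrieved|, uncited pages) for one answered query."""
    unique_retrieved = list(dict.fromkeys(retrieved))  # de-duplicate, keep rank order
    if not unique_retrieved:
        return 0.0, []
    # Assumption: only citations that point at retrieved pages count.
    cited_set = set(cited) & set(unique_retrieved)
    uncited = [page for page in unique_retrieved if page not in cited_set]
    return len(cited_set) / len(unique_retrieved), uncited


retrieved = ["signals", "dispatch", "receivers", "apps", "checks"]
cited = ["signals"]

confidence, uncited = attribution_confidence(retrieved, cited)
print(confidence, uncited)
```

With five pages retrieved and one cited, confidence is 0.2, and the four uncited pages are exactly the candidates a repair pass would look at.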

Automatic self-healing

Below confidence 0.35, repair runs asynchronously, so the user-facing answer is not blocked. Each retrieved-but-uncited page reloads its source file and gets a targeted rewrite prompt: the page was fetched but unused; tighten scope and align wording with the file on disk. Updates land in SQLite FTS5 and LanceDB; each page cools down for 300 s so bursts of similar questions do not thrash the same entry.

asyncio.create_task(_bg_repair(uncited))

Repairs never touch cited pages. In our Django run (1,393 wiki pages), the worst batch flagged 10 uncited entries — 0.7% of the index. One cycle cost about two cents, not a re-index bill.
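A minimal sketch of how the threshold, the background task, and the cooldown could fit together. The 0.35 threshold and 300 s cooldown come from the text; rewrite_page, maybe_schedule_repair, and the in-memory cooldown table are hypothetical stand-ins, and the real rewrite would call the LLM and update both indices.

```python
import asyncio
import time
from typing import Optional

REPAIR_THRESHOLD = 0.35   # from the text
COOLDOWN_SECONDS = 300.0  # from the text

_last_repaired: dict[str, float] = {}  # page -> time of last repair (assumed in-memory)


async def rewrite_page(page: str) -> None:
    """Hypothetical stand-in for the targeted LLM rewrite and index update."""
    await asyncio.sleep(0)


async def _bg_repair(uncited: list[str]) -> list[str]:
    """Rewrite uncited pages that are not cooling down; return the pages repaired."""
    repaired = []
    now = time.monotonic()
    for page in uncited:
        last = _last_repaired.get(page)
        if last is not None and now - last < COOLDOWN_SECONDS:
            continue  # repaired recently, skip to avoid thrashing
        _last_repaired[page] = now
        await rewrite_page(page)
        repaired.append(page)
    return repaired


def maybe_schedule_repair(confidence: float, uncited: list[str]) -> Optional[asyncio.Task]:
    """Schedule background repair only when confidence is below the threshold."""
    if confidence >= REPAIR_THRESHOLD or not uncited:
        return None
    return asyncio.create_task(_bg_repair(uncited))


async def main():
    first = maybe_schedule_repair(0.20, ["signals", "dispatch"])
    skipped = maybe_schedule_repair(0.80, ["signals"])  # healthy answer, no repair
    repaired_first = await first
    second = maybe_schedule_repair(0.20, ["signals", "cache"])  # "signals" is cooling down
    repaired_second = await second
    return repaired_first, skipped, repaired_second


result = asyncio.run(main())
print(result)
```

The sketch keeps a reference to each task it creates; a fire-and-forget caller should do the same, since the event loop only holds weak references to tasks.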

Benchmark results

Evaluations target SWE-bench Verified file localisation: 500 merged issues across 12 Python codebases, each with gold files from the accepted patch. The question is whether retrieval surfaces a touched file in the top ranks — not patch generation or test runs.

Headline metric: Coverage@k — share of tasks with any gold path in the top-k predictions. MRR is reported where noted.
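For readers who want to reproduce the metrics, here is a small sketch. Coverage@k follows the definition above; MRR is assumed to be the standard form (reciprocal rank of the first gold file, zero when none is retrieved), which the text does not spell out. The toy predictions are illustrative, not benchmark data.

```python
def coverage_at_k(predictions: list[list[str]], gold: list[set[str]], k: int) -> float:
    """Share of tasks with at least one gold path among the top-k predictions."""
    hits = sum(1 for preds, g in zip(predictions, gold) if set(preds[:k]) & g)
    return hits / len(gold)


def mean_reciprocal_rank(predictions: list[list[str]], gold: list[set[str]]) -> float:
    """Mean of 1/rank of the first gold path per task (0 if no gold path is ranked)."""
    total = 0.0
    for preds, g in zip(predictions, gold):
        for rank, path in enumerate(preds, start=1):
            if path in g:
                total += 1.0 / rank
                break
    return total / len(gold)


# Two toy tasks: the first finds its gold file at rank 2, the second misses.
predictions = [["a.py", "b.py", "c.py"], ["d.py", "e.py"]]
gold = [{"b.py"}, {"z.py"}]

print(coverage_at_k(predictions, gold, k=1))
print(coverage_at_k(predictions, gold, k=5))
print(mean_reciprocal_rank(predictions, gold))
```

Coverage@k only asks whether a gold file appears in the top k, while MRR rewards placing it higher, which is why the two can move by different amounts.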

File localization

Method	C@5	C@10	MRR
Raw BM25 on source files	56.2%	69.0%	0.404
Provenant BM25-on-wiki	63.8%	70.8%	0.447
+ reranker + selective HyDE	66.2%	75.2%	0.454

Raw file BM25: 56.2% C@5. Wiki BM25: 63.8% — +7.6 points on the same 500 tasks. C@10 inches up less because the win is ranking: the right file shows up higher, not just inside the top ten.

Per-repository breakdown

Repository	N	BM25 wiki C@10	+rerank +HyDE	ΔC@10
astropy	22	72.7%	86.4%	+13.7
pytest	19	57.9%	68.4%	+10.5
pylint	10	60.0%	70.0%	+10.0
sphinx	44	45.5%	52.3%	+6.8
django	231	74.0%	79.2%	+5.2
sklearn	32	53.1%	56.2%	+3.1
matplotlib	34	85.3%	88.2%	+2.9
sympy	75	74.7%	76.0%	+1.3
xarray	22	81.8%	77.3%	−4.5

Eight of the nine repos listed improve (together they cover 489 of the 500 tasks). HyDE fired mostly in django. On astropy (+13.7 C@10) and pytest (+10.5), reranking alone did the work, which is useful if you do not want embedding infra on day one.

xarray regressed (−4.5 C@10): tickets already name DataArray-class symbols; vector search added plausible neighbours that were not in the patch. Per-repo gates are the obvious fix.

Token efficiency

Repository	Wiki tokens/query	Naive tokens/query	Reduction
Flask (30 questions)	1,070	69,044	64.5×
Django (20 questions)	994	59,634	60.0×

Roughly 400 lines of source become a ~150-token wiki page. Blind judge on 20 Django Qs: mean score delta −0.15 (wiki vs full files), eight ties. That is judge noise, not a quality collapse — while input tokens fall 60–65×.
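The reduction column is just the ratio of the two token counts, which a quick check confirms. The numbers are taken from the table above; the helper name is an illustration.

```python
def reduction(wiki_tokens: int, naive_tokens: int) -> float:
    """How many times fewer input tokens the wiki path uses per query."""
    return naive_tokens / wiki_tokens


# (wiki tokens/query, naive tokens/query) from the token efficiency table.
rows = {"Flask": (1070, 69044), "Django": (994, 59634)}

for repo, (wiki, naive) in rows.items():
    print(f"{repo}: {reduction(wiki, naive):.1f}x fewer input tokens per query")
```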

Attribution confidence as a quality signal

Confidence bucket	N	Avg quality (1–5)
Low (0.0–0.2)	4	4.50
Mid (0.4–0.6)	9	4.89
High (0.8–1.0)	7	5.00

Self-healing: early results

We exercised repair on the four lowest-confidence Django items — 10 pages rewritten (0.7% of index), ~$0.02 total, then re-queried.

Query topic	Conf before	Conf after	Quality before	Quality after
Signal dispatch	0.20	0.40	4	5
Transactions	0.20	0.20	5	5
Test client	0.20	0.40	5	4
F() expressions	0.20	0.20	4	5
Average	0.20	0.30	4.50	4.75

Why wiki retrieval generalizes

Gains track how far issue prose drifts from symbol names. Documentation-heavy trackers (sphinx, pylint, pytest) jump the most; codebases where tickets already name APIs (sympy, xarray) move less because keyword search was never starving.

Think of three dialects: executable source, human ticket text, and the wiki bridge written for search. Provenant lives in the third dialect so BM25 stops guessing identifiers it never saw in the question.

xarray's −4.5 C@10 slip is the mirror case: shared tokens like DataArray already align tickets and code, so extra semantic retrieval invited look-alike modules that were not in the patch. Repo-specific gates for HyDE should claw that back without dulling django-sized wins.

The self-improving index

Pair citation-derived confidence with surgical repair and you get a self-improving codebase index: every question leaves a trace of which summaries helped, and only the ignored summaries get rewritten.

Nightly full re-indexes throw away good pages and scale with repo size. Provenant's loop edits the ~0.7% with evidence of failure, spends cents per cycle, and wakes up only when confidence says something was off.

Hot paths see more queries, more citation signal, and faster polish; cold files stay untouched until someone asks. Usage and retrieval quality drift in the same direction instead of rotting evenly.

vs. existing tools

	Copilot / Cursor	Sourcegraph Cody	Provenant
Retrieval over	Raw source	Raw source	LLM wiki pages
Attribution signal	✗	✗	✓
Self-healing index	✗	✗	✓
Token efficiency	Baseline	Baseline	60–65× fewer input tokens
Works without embeddings	✓	✗	✓

Treat Provenant as a plug-in retrieval plane: stack it under Copilot, Cursor, or Cody rather than ripping them out.

Cost & scale

Action	Cost	Time
Index Flask (70 files)	$0.05	< 2 min
Index Django (1,393 files)	~$1.00	~25 min
Per query — Provenant (wiki)	~$0.002	—
Per query — naive full-file	~$0.13	—
Repair cycle (10 pages)	~$0.02	background

MCP tools

Provenant exposes capabilities as MCP tools — callable from Claude Code, Cursor, and any MCP-compatible agent.

Tool	Description
`provenant_ask`	Answer a natural language question about the repository
`provenant_context`	Retrieve wiki context for specific files or symbols
`provenant_search`	Search wiki pages by keyword or semantic query
`provenant_dead_code`	Identify unreferenced files and symbols
`provenant_risk`	Assess change risk for a given file
`provenant_why`	Explain why a file was retrieved for a query
`provenant_overview`	High-level repository summary
`provenant_symbol`	Look up a specific function, class, or constant
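Under the Model Context Protocol, a client invokes a tool with a JSON-RPC 2.0 request whose method is tools/call. The sketch below builds such a request for provenant_ask; the argument name "question" is a hypothetical placeholder, since the tool's input schema is not documented here. In practice an MCP client library would build and send this for you.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# "question" is an assumed argument name, used here for illustration only.
request = build_tool_call(
    1, "provenant_ask", {"question": "where is upload validation enforced?"}
)
print(request)
```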

Limitations

Repair evidence is still thin: one large repo, twenty judged questions. Confidence measures how much of the retrieved context the answer cited, not whether the answer is correct. HyDE costs an extra model call on ~3% of traffic. Bad init-time summaries cap retrieval until repair catches them.

Stack

Component	Technology
Parsing	tree-sitter
BM25 index	SQLite FTS5
Vector store	LanceDB (768-dim)
Embeddings	nomic-embed-text-v1.5 via Fireworks AI
Synthesis	DeepSeek-V3.2
Async repair	Python asyncio
Web UI	Next.js
API	FastAPI
LLM routing	LiteLLM

Getting started

# Index a repository
provenant init /path/to/your/repo

# Ask a question
provenant ask "where is upload validation enforced?"

The live prototype runs against pallets/flask — fully indexed and queryable. Provenant ships as an MCP server alongside any agent that supports the Model Context Protocol.

+7.6

pp Coverage@5 over raw BM25 on SWE-bench Verified (500 tasks, 12 repos).

r=0.415

Pearson correlation between attribution confidence and LLM judge quality (Django, n=20).

Use it, measure citations, let the weak pages heal. Token spend falls 60–65× with judge scores barely moving — that is the economic case. The technical case is file localisation, live attribution, and repair that touches less than 1% of the index per cycle.
