Methodology

Three-tier automated hallucination measurement — no human annotation bottleneck.

Tier A — Retrieval Grounding

Ground truth: LePaRD expert-annotated citation pairs. 20,877 unique queries are evaluated over a 7,813,273-chunk corpus. Metrics: Hit@1, Hit@5, Hit@10, Hit@100, MRR, NDCG@10. A two-stage semantic bridge (eyecite + rapidfuzz) produced 2,429,533 verified pairs.
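The Tier A metrics can be illustrated with a minimal sketch. It assumes binary relevance (a chunk is either a gold citation or not) and unique ids per ranking. The function names and the toy data are hypothetical and are not taken from the project's code.

```python
import math

def hit_at_k(ranked_ids, gold_ids, k):
    """1.0 if any gold id appears in the top-k results, else 0.0."""
    return 1.0 if any(doc_id in gold_ids for doc_id in ranked_ids[:k]) else 0.0

def reciprocal_rank(ranked_ids, gold_ids):
    """1/rank of the first gold id in the ranking, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gold_ids, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in gold_ids
    )
    ideal_hits = min(len(gold_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def evaluate(runs, ks=(1, 5, 10, 100)):
    """runs: non-empty list of (ranked_ids, gold_ids) pairs. Returns mean metrics."""
    if not runs:
        raise ValueError("runs must not be empty")
    n = len(runs)
    report = {f"Hit@{k}": sum(hit_at_k(r, g, k) for r, g in runs) / n for k in ks}
    report["MRR"] = sum(reciprocal_rank(r, g) for r, g in runs) / n
    report["NDCG@10"] = sum(ndcg_at_k(r, g, 10) for r, g in runs) / n
    return report

# Toy data (hypothetical): query 1 finds its gold chunk at rank 2, query 2 misses.
toy_runs = [(["a", "b", "c"], {"b"}), (["x", "y", "z"], {"q"})]
report = evaluate(toy_runs)
print(report)
```

On the toy data, Hit@1 is 0.0, Hit@5 is 0.5 and MRR is 0.25, because only the first query retrieves a gold chunk and it does so at rank 2.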

Tier B — LLM-as-Judge Hallucination Detection

gpt-4o-mini judges each generation against the contexts shown and returns one of FAITHFUL / PARTIAL / HALLUCINATED. 5 ablations × 20,877 queries = 104,385 generations judged. Budget: ~$53. 95% CIs range from ±0.86% to ±1.96%. Pearson r = -0.9624 (r² = 0.926) between Hit@10 and hallucination rate.
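The source does not state how the 95% CIs were computed. As a sketch, assuming a normal-approximation margin at the worst-case proportion p = 0.5, the largest and smallest "n judged" values in the results table (12,977 and 2,500) give margins that match the quoted endpoints. The function name is hypothetical.

```python
import math

def worst_case_margin(n, z=1.96):
    """Half-width of a normal-approximation 95% CI for a proportion,
    using p = 0.5 (an assumption; it maximizes p * (1 - p))."""
    return z * math.sqrt(0.25 / n)

# Sample sizes taken from the "n judged" column of the results table.
for n in (12977, 2500):
    print(f"n={n}: +/-{worst_case_margin(n) * 100:.2f}%")
```

Under this assumption, n = 12,977 gives about ±0.86% and n = 2,500 gives ±1.96%.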

Tier C — Retrieval Ceiling Analysis

Evaluation is stratified by gold-cluster citation frequency (HEAD/TORSO/TAIL). The Hit@100 = 0.375 retrieval ceiling defines an irreducible hallucination floor of ~56%. In a per-query paired comparison, the fine-tuned reranker wins 30.6x more queries than hub-concat. Symmetric leakage cleaning via RE2 prevented BM25 Hit@1 from being inflated from 2.5% to 18%+.
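A per-query paired comparison can be sketched as follows. The source does not define a "win"; this sketch assumes a system wins a query when it retrieves a gold chunk within the cutoff and the other system does not. The function names and the toy flags are hypothetical.

```python
def paired_wins(hits_a, hits_b):
    """Count queries won by each system, given aligned per-query hit flags.
    A win here means one system hit and the other missed (an assumption)."""
    if len(hits_a) != len(hits_b):
        raise ValueError("both systems must be scored on the same queries")
    wins_a = sum(1 for a, b in zip(hits_a, hits_b) if a and not b)
    wins_b = sum(1 for a, b in zip(hits_a, hits_b) if b and not a)
    ties = len(hits_a) - wins_a - wins_b
    return wins_a, wins_b, ties

def win_ratio(wins_a, wins_b):
    """How many times more queries system A wins than system B."""
    return wins_a / wins_b if wins_b else float("inf")

# Toy flags (hypothetical), one entry per query, in the same query order.
reranker_hits = [True, True, False, True, False]
baseline_hits = [False, True, False, False, True]

wins_a, wins_b, ties = paired_wins(reranker_hits, baseline_hits)
print(wins_a, wins_b, ties, win_ratio(wins_a, wins_b))
```

Counting only discordant queries keeps the ratio from being diluted by queries that both systems solve or both miss.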

Hallucination Results by Ablation

Ablation	n judged	Hit@10	Faithful	Partial	Hallucinated
No RAG	12,977	—	0.06%	0.00%	99.94%
BGE-M3	7,101	0.0862	9.07%	26.94%	63.99%
BM25	6,929	0.1459	9.15%	27.93%	62.92%
RRF (Hybrid)	7,017	0.1557	9.49%	30.24%	60.27%
Reranker (fine-tuned)	2,500	0.3598	11.84%	32.40%	55.76%

Judge: gpt-4o-mini. Generator: Qwen2.5-7B-Instruct. Pearson r = -0.9624 between Hit@10 and hallucination rate.
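The reported correlation can be checked against the table. The sketch below assumes r was computed over the four retrieval ablations only, since No RAG has no Hit@10 value. The variable and function names are hypothetical.

```python
import math

# (Hit@10, hallucinated %) for the four ablations that have a Hit@10 value.
rows = {
    "BGE-M3": (0.0862, 63.99),
    "BM25": (0.1459, 62.92),
    "RRF (Hybrid)": (0.1557, 60.27),
    "Reranker (fine-tuned)": (0.3598, 55.76),
}

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hit10 = [hit for hit, _ in rows.values()]
halluc = [rate for _, rate in rows.values()]
r = pearson_r(hit10, halluc)
print(f"r = {r:.4f}, r^2 = {r * r:.3f}")
```

Under that assumption, the result is close to the reported r = -0.9624, with small differences attributable to rounding in the table. With only four points the estimate is sensitive to each ablation.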