Methodology
Three-tier automated hallucination measurement — no human annotation bottleneck.
Tier A — Retrieval Grounding
LePaRD expert-annotated citation pairs: 20,877 unique queries over a 7,813,273-chunk corpus. Metrics: Hit@1, Hit@5, Hit@10, Hit@100, MRR, NDCG@10. A two-stage semantic bridge (eyecite + rapidfuzz) produced 2,429,533 verified pairs.
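As a minimal sketch of how the listed metrics are conventionally computed for a single query, assuming binary relevance (a chunk is either a gold citation or not) and unique chunk ids per ranking: the function names and the toy ids below are illustrative and are not taken from the project's code. Corpus-level scores would be the mean of these values over all queries.

```python
import math

def hit_at_k(ranked_ids, gold_ids, k):
    """Return 1.0 if any gold id appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in gold_ids for doc in ranked_ids[:k]) else 0.0

def reciprocal_rank(ranked_ids, gold_ids):
    """Return 1/rank of the first gold id, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in gold_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gold_ids, k):
    """Binary-relevance NDCG@k: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked_ids[:k], start=1)
        if doc in gold_ids
    )
    ideal_hits = min(len(gold_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranking for one query; "c2" is the only gold chunk.
ranked = ["c7", "c2", "c9", "c4"]
gold = {"c2"}

print(hit_at_k(ranked, gold, 1))       # gold chunk is not at rank 1
print(hit_at_k(ranked, gold, 5))       # gold chunk is within the top 5
print(reciprocal_rank(ranked, gold))   # first gold chunk sits at rank 2
print(ndcg_at_k(ranked, gold, 10))
```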
Tier B — LLM-as-Judge Hallucination Detection
gpt-4o-mini judges each generation against the contexts shown and returns one of three labels: FAITHFUL, PARTIAL, or HALLUCINATED. 5 ablations × 20,877 queries = 104,385 generations judged. Budget: about $53. 95% confidence intervals range from ±0.86% to ±1.96%. Pearson r = -0.9624 (r² ≈ 0.926) between Hit@10 and hallucination rate.
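The reported interval range can be reproduced with a short calculation. This sketch assumes the intervals are conservative normal-approximation margins (worst-case proportion p = 0.5, z = 1.96) and that the largest and smallest sample sizes are the "n judged" values 12,977 and 2,500 from the results table. The document does not state the interval method, so treat both as assumptions.

```python
import math

def ci_half_width(n, p=0.5, z=1.96):
    """Half-width of a normal-approximation CI for a proportion.

    p=0.5 is the worst case (widest interval) and is an assumption here.
    """
    return z * math.sqrt(p * (1.0 - p) / n)

largest_n = 12_977   # "No RAG" row in the results table
smallest_n = 2_500   # "Reranker (fine-tuned)" row in the results table

# Express the half-widths in percentage points.
print(round(100 * ci_half_width(largest_n), 2))
print(round(100 * ci_half_width(smallest_n), 2))
```

Under these assumptions the two values come out at roughly 0.86 and 1.96 percentage points, which matches the stated range.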
Tier C — Retrieval Ceiling Analysis
Stratified evaluation by gold-cluster citation frequency (HEAD/TORSO/TAIL). The Hit@100 = 0.375 retrieval ceiling implies an irreducible hallucination floor of about 56%. Per-query paired comparison: the fine-tuned reranker wins 30.6× more queries than hub-concat. Symmetric leakage cleaning via RE2 prevented BM25 Hit@1 from being inflated from 2.5% to over 18%.
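One way a per-query paired comparison could be computed is sketched below. It assumes a "win" means one system retrieves a gold chunk for a query where the other does not, with ties ignored. The document does not define the win criterion, and the function name and toy data are hypothetical.

```python
def paired_win_ratio(hits_a, hits_b):
    """Compare two systems on the same queries, ignoring ties.

    hits_a and hits_b are per-query success flags in the same query order.
    Returns (wins for A, wins for B, ratio of A wins to B wins).
    """
    wins_a = sum(1 for a, b in zip(hits_a, hits_b) if a and not b)
    wins_b = sum(1 for a, b in zip(hits_a, hits_b) if b and not a)
    ratio = wins_a / wins_b if wins_b else float("inf")
    return wins_a, wins_b, ratio

# Toy flags for six queries; these are not the study's data.
reranker = [True, True, False, True, False, True]
hub_concat = [False, True, False, False, True, False]

wins_a, wins_b, ratio = paired_win_ratio(reranker, hub_concat)
print(wins_a, wins_b, ratio)
```

Counting only discordant queries keeps the comparison paired: queries that both systems solve, or both miss, say nothing about which system is better.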
Hallucination Results by Ablation
| Ablation | n judged | Hit@10 | Faithful | Partial | Hallucinated |
|---|---|---|---|---|---|
| No RAG | 12,977 | — | 0.06% | 0.00% | 99.94% |
| BGE-M3 | 7,101 | 0.0862 | 9.07% | 26.94% | 63.99% |
| BM25 | 6,929 | 0.1459 | 9.15% | 27.93% | 62.92% |
| RRF (Hybrid) | 7,017 | 0.1557 | 9.49% | 30.24% | 60.27% |
| Reranker (fine-tuned) | 2,500 | 0.3598 | 11.84% | 32.40% | 55.76% |
Judge: gpt-4o-mini. Generator: Qwen2.5-7B-Instruct. Pearson r = -0.9624 between Hit@10 and hallucination rate.
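The correlation can be checked directly against the table. This sketch assumes r was computed over the four retrieval ablations (No RAG has no Hit@10 value and is excluded) using the rounded figures shown above, so the result may differ from the reported value in the last digit.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Rows: BGE-M3, BM25, RRF (Hybrid), Reranker (fine-tuned).
hit_at_10 = [0.0862, 0.1459, 0.1557, 0.3598]
hallucinated_pct = [63.99, 62.92, 60.27, 55.76]

r = pearson_r(hit_at_10, hallucinated_pct)
print(r, r * r)
```

With these inputs r comes out near -0.96 and r² near 0.93, consistent with the reported figures. With only four points, the coefficient is sensitive to any single row.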