Methodology

Three-tier automated hallucination measurement — no human annotation bottleneck.

Tier A — Retrieval Grounding

Gold labels: LePaRD expert-annotated citation pairs. 20,877 unique queries over a 7,813,273-chunk corpus. Metrics: Hit@1, Hit@5, Hit@10, Hit@100, MRR, NDCG@10. A two-stage semantic bridge (eyecite + rapidfuzz) produced 2,429,533 verified pairs.
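The Tier A metrics can be sketched per query as follows. This is a minimal illustration assuming binary relevance (a chunk either belongs to the gold set or it does not); the function names and the toy ids are hypothetical and not taken from the project code.

```python
import math

def hit_at_k(ranked_ids, gold_ids, k):
    """Return 1.0 if any gold id appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in gold_ids for doc in ranked_ids[:k]) else 0.0

def reciprocal_rank(ranked_ids, gold_ids):
    """Return 1/rank of the first gold id, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in gold_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gold_ids, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked_ids[:k], start=1)
        if doc in gold_ids
    )
    ideal_hits = min(len(gold_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def mean(values):
    """Average per-query scores into a corpus-level metric (e.g. MRR)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

# Toy example with hypothetical chunk ids: the gold chunk is ranked second.
ranked = ["c3", "c7", "c1"]
gold = {"c7"}
scores = {
    "hit@1": hit_at_k(ranked, gold, 1),
    "hit@5": hit_at_k(ranked, gold, 5),
    "rr": reciprocal_rank(ranked, gold),
    "ndcg@10": ndcg_at_k(ranked, gold, 10),
}
```

Corpus-level Hit@k, MRR, and NDCG@10 are then the mean of these per-query values over all queries.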

Tier B — LLM-as-Judge Hallucination Detection

gpt-4o-mini judges each generation against the shown contexts and returns one of FAITHFUL / PARTIAL / HALLUCINATED. 5 ablations × 20,877 queries = 104,385 generations judged. Budget: ~$53. 95% CI half-widths: ±0.86% to ±1.96%. Pearson r = -0.9624 (r² = 0.926) between Hit@10 and hallucination rate.
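The text does not say how the confidence intervals were computed. One plausible reading, offered here as an assumption, is the normal-approximation (Wald) margin of error evaluated at the worst case p = 0.5. With the largest and smallest "n judged" values from the results table (12,977 and 2,500), that formula gives margins matching the stated range. The function names below are hypothetical.

```python
import math

def wald_margin(p, n, z=1.96):
    """Half-width of a normal-approximation 95% CI for a proportion p over n samples."""
    return z * math.sqrt(p * (1.0 - p) / n)

def worst_case_margin(n, z=1.96):
    """Conservative half-width: the margin is largest at p = 0.5."""
    return wald_margin(0.5, n, z)

# Assumption: n values are the largest and smallest "n judged" in the results table.
margin_largest_n = worst_case_margin(12_977)   # about 0.0086, i.e. ±0.86%
margin_smallest_n = worst_case_margin(2_500)   # about 0.0196, i.e. ±1.96%
```

If the intervals were instead computed at each ablation's observed rate, `wald_margin(p, n)` with that rate would give slightly narrower values.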

Tier C — Retrieval Ceiling Analysis

Stratified evaluation by gold-cluster citation frequency (HEAD / TORSO / TAIL). The Hit@100 = 0.375 retrieval ceiling defines an irreducible hallucination floor of ~56%. Per-query paired comparison: the fine-tuned reranker wins 30.6× more queries than hub-concat. Symmetric leakage cleaning via RE2 kept BM25 Hit@1 from being inflated from 2.5% to over 18%.
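A per-query paired comparison can be sketched as below. The definition of a "win" is not spelled out in the text, so this assumes a system wins a query when it retrieves a gold chunk within the cutoff and the other system does not; the function name and the toy query ids are hypothetical.

```python
def paired_wins(hits_a, hits_b):
    """Compare two systems query by query.

    hits_a / hits_b map query id -> bool (gold chunk found within the cutoff).
    Only queries present in both mappings are compared.
    Returns (wins_a, wins_b, ties).
    """
    wins_a = wins_b = ties = 0
    for qid in hits_a.keys() & hits_b.keys():
        a, b = hits_a[qid], hits_b[qid]
        if a and not b:
            wins_a += 1
        elif b and not a:
            wins_b += 1
        else:
            ties += 1
    return wins_a, wins_b, ties

# Toy data, not project results.
system_a = {"q1": True, "q2": True, "q3": False, "q4": True, "q5": False}
system_b = {"q1": False, "q2": True, "q3": False, "q4": False, "q5": True}
wins_a, wins_b, ties = paired_wins(system_a, system_b)
win_ratio = wins_a / wins_b if wins_b else float("inf")
```

Counting wins on the same queries, rather than comparing aggregate Hit@k, keeps the comparison insensitive to differences in query difficulty between subsets.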

Hallucination Results by Ablation

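| Ablation | n judged | Hit@10 | Faithful | Partial | Hallucinated |
|---|---|---|---|---|---|
| No RAG | 12,977 | n/a | 0.06% | 0.00% | 99.94% |
| BGE-M3 | 7,101 | 0.0862 | 9.07% | 26.94% | 63.99% |
| BM25 | 6,929 | 0.1459 | 9.15% | 27.93% | 62.92% |
| RRF (Hybrid) | 7,017 | 0.1557 | 9.49% | 30.24% | 60.27% |
| Reranker (fine-tuned) | 2,500 | 0.3598 | 11.84% | 32.40% | 55.76% |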

Judge: gpt-4o-mini. Generator: Qwen2.5-7B-Instruct. Pearson r = -0.9624 between Hit@10 and hallucination rate.
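The correlation can be checked from the table itself. The sketch below assumes it was computed over the four retrieval ablations only (No RAG has no Hit@10 value) and uses the rounded table figures, so the result should land close to, but not exactly on, the reported coefficient. The helper name is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Rounded values from the results table: BGE-M3, BM25, RRF (Hybrid), Reranker.
hit_at_10 = [0.0862, 0.1459, 0.1557, 0.3598]
hallucinated_pct = [63.99, 62.92, 60.27, 55.76]

r = pearson_r(hit_at_10, hallucinated_pct)
r_squared = r * r
```

With only four points the coefficient is descriptive rather than a strong statistical claim; it summarizes the trend that higher Hit@10 accompanies a lower hallucination rate.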