Results Dashboard

MS4 complete. n=20,877 unique queries. Generator: Qwen2.5-7B-Instruct. Judge: gpt-4o-mini.

Pearson r = -0.9624 (r² = 92.6%) between Hit@10 and hallucination rate
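The correlation can be checked by hand from the results table. The sketch below assumes the coefficient was computed over the four architectures that report both Hit@10 and a hallucination rate (BGE-M3, BM25, RRF, fine-tuned reranker); the dashboard does not state which points were used, so that choice is an assumption. With the rounded table values the result lands within rounding of the reported figure.

```python
import math

# (Hit@10, hallucination rate in %) copied from the results table.
# Assumption: only rows that report both metrics enter the correlation.
points = {
    "BGE-M3": (0.0862, 63.99),
    "BM25": (0.1459, 62.92),
    "RRF (BM25+BGE-M3, k=60)": (0.1557, 60.27),
    "Reranker Fine-tuned": (0.3598, 55.76),
}


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


hit10 = [p[0] for p in points.values()]
halluc = [p[1] for p in points.values()]
r = pearson_r(hit10, halluc)
print(f"r = {r:.4f}, r^2 = {r * r:.1%}")
```

With only four points, the coefficient is a descriptive summary rather than strong statistical evidence.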

Architecture                Hit@1   Hit@10  MRR     NDCG@10  Hallucinated  n judged
No RAG (baseline)           -       -       -       -        99.94%        12,977
BGE-M3                      0.0244  0.0862  0.0457  0.0516   63.99%        7,101
BM25                        0.0251  0.1459  0.0642  0.0783   62.92%        6,929
RRF (BM25+BGE-M3, k=60)     0.0391  0.1557  0.0769  0.0894   60.27%        7,017
Reranker Concat (hub)       0.0284  0.1098  0.0575  0.0639   -             -
Reranker MaxP (hub)         0.0426  0.1470  0.0778  0.0881   -             -
Reranker Fine-tuned (best)  0.3069  0.3598  0.3275  0.3349   55.76%        2,500
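For readers unfamiliar with the RRF row, the following is a minimal sketch of standard reciprocal rank fusion with k=60, where each document scores the sum of 1/(k + rank) over the input rankings. The function name `rrf_fuse` and the toy document ids are hypothetical, and this is not necessarily how the project's pipeline implements the fusion.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids (best first) with reciprocal rank fusion.

    Each document receives sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by doc id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy example (hypothetical ids): one lexical and one dense ranking.
bm25_top = ["d3", "d1", "d7"]
dense_top = ["d1", "d9", "d3"]
fused = rrf_fuse([bm25_top, dense_top], k=60)
print(fused)
```

Because only ranks are used, the fusion needs no score normalization between BM25 and the dense retriever.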

Hit@1 gain (fine-tuned vs BM25): +1,120% (0.0251 -> 0.3069)

Hallucination reduction: -44.2pp (no-RAG 99.94% -> fine-tuned reranker 55.76%)
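The two headline numbers use different units: the Hit@1 gain is a relative change in percent, while the hallucination reduction is an absolute difference in percentage points. A short sketch of the arithmetic, using the rounded values from the results table (variable names are illustrative), gives a relative gain of roughly +1,123%, which matches the dashboard's +1,120% at three significant figures.

```python
# Values copied from the results table (already rounded there).
bm25_hit1 = 0.0251
finetuned_hit1 = 0.3069

# Relative change, in percent: (new - old) / old * 100
relative_gain_pct = (finetuned_hit1 - bm25_hit1) / bm25_hit1 * 100

no_rag_halluc = 99.94      # percent of judged answers
finetuned_halluc = 55.76   # percent of judged answers

# Absolute change, in percentage points: old - new
reduction_pp = no_rag_halluc - finetuned_halluc

print(f"Hit@1 relative gain: +{relative_gain_pct:.0f}%")
print(f"Hallucination reduction: -{reduction_pp:.1f}pp")
```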

Retrieval ceiling (Hit@100): 37.5% (irreducible ~56% hallucination floor)

Final Results Summary

final_summary.json SHA-256: 43eec4d3023f9485... | 5 ablations x 20,877 queries = 104,385 LLM judgments | Generator: Qwen2.5-7B-Instruct (greedy, 4x L4) | Judge: gpt-4o-mini (~$53)
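To confirm that a local copy of final_summary.json matches the published artifact, its SHA-256 digest can be compared against the prefix shown above. The sketch below is one way to do that with the standard library. Because the digest is truncated in this dashboard, only the first 16 hex characters can be checked; the helper names and the local file path are assumptions.

```python
import hashlib
from pathlib import Path

# Only the prefix is given in the dashboard; the rest of the digest is elided.
EXPECTED_PREFIX = "43eec4d3023f9485"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path, prefix=EXPECTED_PREFIX):
    """True if the file's digest starts with the given hex prefix."""
    return sha256_of(path).startswith(prefix.lower())


if __name__ == "__main__":
    summary = Path("final_summary.json")  # assumed local path
    if summary.exists():
        print("prefix match:", matches_prefix(summary))
    else:
        print("final_summary.json not found in the working directory")
```

A prefix match is weaker than a full-digest match, so the complete hash should be used where it is available.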

Stratified (HEAD/TORSO/TAIL): for the hub variants, TAIL Hit@10 exceeds HEAD by 1.66x-2.26x. The fine-tuned reranker flips this pattern, with TAIL now the lowest stratum: HEAD=0.3596, TORSO=0.3694, TAIL=0.3292.

W&B: 45 offline runs, 191.64GB DVC/S3.