Results Dashboard
MS4 complete. n=20,877 unique queries. Generator: Qwen2.5-7B-Instruct. Judge: gpt-4o-mini.
Pearson r = -0.9624 (r² = 92.6%) between Hit@10 and hallucination rate
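As a rough check, the correlation can be recomputed from the rounded table values. This sketch assumes the coefficient is taken across the four architectures that report both Hit@10 and a hallucination rate (BGE-M3, BM25, RRF, fine-tuned reranker). The dashboard does not state which points were used, so that selection is an assumption. With those inputs the result lands close to the reported value, within rounding.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Assumption: only rows with both metrics are included, in table order
# (BGE-M3, BM25, RRF, fine-tuned reranker). Values are the rounded table entries.
hit_at_10 = [0.0862, 0.1459, 0.1557, 0.3598]
hallucinated_pct = [63.99, 62.92, 60.27, 55.76]

r = pearson_r(hit_at_10, hallucinated_pct)
print(f"r = {r:.4f}, r^2 = {r * r:.1%}")
```

With only four points, the coefficient is sensitive to any single architecture, so it is best read as a summary of this table rather than a general law.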
| Architecture | Hit@1 | Hit@10 | MRR | NDCG@10 | Hallucinated | n judged |
|---|---|---|---|---|---|---|
| No RAG (baseline) | - | - | - | - | 99.94% | 12,977 |
| BGE-M3 | 0.0244 | 0.0862 | 0.0457 | 0.0516 | 63.99% | 7,101 |
| BM25 | 0.0251 | 0.1459 | 0.0642 | 0.0783 | 62.92% | 6,929 |
| RRF (BM25+BGE-M3, k=60) | 0.0391 | 0.1557 | 0.0769 | 0.0894 | 60.27% | 7,017 |
| Reranker Concat (hub) | 0.0284 | 0.1098 | 0.0575 | 0.0639 | - | - |
| Reranker MaxP (hub) | 0.0426 | 0.1470 | 0.0778 | 0.0881 | - | - |
| Reranker Fine-tuned (best) | 0.3069 | 0.3598 | 0.3275 | 0.3349 | 55.76% | 2,500 |
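For readers unfamiliar with the RRF row, the following is a minimal sketch of reciprocal rank fusion with k=60, together with per-query Hit@k and reciprocal rank (the quantity averaged into MRR). The function names, the toy document IDs, and the tie-breaking rule are hypothetical and chosen for illustration; the project's actual fusion and evaluation code is not shown in this dashboard.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by doc id (an illustrative choice).
    return sorted(scores, key=lambda d: (-scores[d], d))

def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Toy rankings standing in for BM25 and BGE-M3 output (hypothetical IDs).
bm25_ranking = ["d3", "d1", "d7", "d2"]
dense_ranking = ["d1", "d9", "d3", "d4"]

fused = rrf_fuse([bm25_ranking, dense_ranking], k=60)
relevant = {"d9"}
print(fused)
print(hit_at_k(fused, relevant, 1), hit_at_k(fused, relevant, 10))
print(reciprocal_rank(fused, relevant))
```

Averaging `hit_at_k` and `reciprocal_rank` over all queries gives the Hit@k and MRR columns; documents ranked well by both retrievers rise to the top of the fused list.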
Hit@1 gain (fine-tuned vs BM25)
+1,123%
0.0251 -> 0.3069
Hallucination reduction
-44.2pp
no-RAG 99.94% -> 55.76%
Retrieval ceiling (Hit@100)
37.5%
Irreducible ~56% hallucination floor
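The headline figures above mix two kinds of change: a relative gain (percent) for Hit@1 and an absolute difference (percentage points) for hallucination rate. The sketch below recomputes both from the rounded table values, so the last digits may differ slightly from figures computed on unrounded data. Variable names are illustrative.

```python
# Rounded values taken from the results table.
hit1_bm25, hit1_finetuned = 0.0251, 0.3069
halluc_no_rag_pct, halluc_finetuned_pct = 99.94, 55.76

# Relative gain: change divided by the baseline, expressed in percent.
hit1_relative_gain_pct = (hit1_finetuned - hit1_bm25) / hit1_bm25 * 100

# Absolute change in percentage points: a plain difference of two percentages.
halluc_change_pp = halluc_finetuned_pct - halluc_no_rag_pct

# The same hallucination change expressed relative to the no-RAG baseline.
halluc_relative_change_pct = halluc_change_pp / halluc_no_rag_pct * 100

print(f"Hit@1 relative gain: {hit1_relative_gain_pct:+.1f}%")
print(f"Hallucination change: {halluc_change_pp:+.1f} pp")
print(f"Hallucination relative change: {halluc_relative_change_pct:+.1f}%")
```

Keeping percent and percentage points separate matters here: a drop of about 44 pp from a near-100% baseline happens to be close to a 44% relative drop, but the two would diverge for a lower baseline.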
Final Results Summary
final_summary.json SHA-256: 43eec4d3023f9485... | 5 ablations x 20,877 queries = 104,385 LLM judgments | Generator: Qwen2.5-7B-Instruct (greedy, 4x L4) | Judge: gpt-4o-mini (~$53)
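To check a local copy of `final_summary.json` against the digest quoted above, something like the following could be used. Only the prefix shown in the dashboard is compared, because the full digest is truncated here; the helper names are hypothetical, and the file location is an assumption.

```python
import hashlib
from pathlib import Path

def sha256_hex(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_prefix(path, expected_prefix):
    """True if the file's digest starts with the (possibly truncated) expected prefix."""
    return sha256_hex(path).startswith(expected_prefix.lower())

# Assumption: the summary file sits in the current working directory.
summary_path = Path("final_summary.json")
if summary_path.exists():
    print(matches_prefix(summary_path, "43eec4d3023f9485"))
```

A prefix match on 16 hex characters is a useful sanity check but weaker than comparing the full 64-character digest, which should be preferred wherever it is available.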
Stratified (HEAD/TORSO/TAIL): for the hub variants, TAIL Hit@10 exceeds HEAD by 1.66x-2.26x. The fine-tuned reranker reverses this ordering, with HEAD above TAIL: HEAD=0.3596, TORSO=0.3694, TAIL=0.3292.
W&B: 45 offline runs; 191.64 GB in DVC/S3.