Results Dashboard

MS4 complete. n=20,877 unique queries. Generator: Qwen2.5-7B-Instruct. Judge: gpt-4o-mini.

Pearson r = -0.9624 (r² = 0.926) between Hit@10 and hallucination rate
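As a sanity check, the correlation can be recomputed from the rows of the table below that report both Hit@10 and a hallucination rate (BGE-M3, BM25, RRF, fine-tuned reranker). This is a minimal sketch using the rounded table values. The choice of those four rows is an assumption, since the dashboard does not state which points entered the fit. With these inputs r comes out at about -0.9625, within rounding of the reported figure.

```python
from math import sqrt

# (Hit@10, hallucination rate in %) copied from the results table.
# Assumption: only architectures with both values contribute to the fit.
points = {
    "BGE-M3": (0.0862, 63.99),
    "BM25": (0.1459, 62.92),
    "RRF": (0.1557, 60.27),
    "Reranker Fine-tuned": (0.3598, 55.76),
}

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

hit10, halluc = zip(*points.values())
r = pearson_r(hit10, halluc)
print(f"r = {r:.4f}, r^2 = {r * r:.3f}")
```

With only four points the coefficient is sensitive to the fine-tuned reranker row, which sits far from the other three on both axes.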

Architecture	Hit@1	Hit@10	MRR	NDCG@10	Hallucination rate	n judged
No RAG (baseline)	-	-	-	-	99.94%	12,977
BGE-M3	0.0244	0.0862	0.0457	0.0516	63.99%	7,101
BM25	0.0251	0.1459	0.0642	0.0783	62.92%	6,929
RRF (BM25+BGE-M3, k=60)	0.0391	0.1557	0.0769	0.0894	60.27%	7,017
Reranker Concat (hub)	0.0284	0.1098	0.0575	0.0639	-	-
Reranker MaxP (hub)	0.0426	0.1470	0.0778	0.0881	-	-
Reranker Fine-tuned (best)	0.3069	0.3598	0.3275	0.3349	55.76%	2,500
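
The RRF row fuses the BM25 and BGE-M3 rankings with k=60. A minimal sketch of reciprocal rank fusion follows, assuming the standard formulation in which each document scores the sum of 1/(k + rank) over the lists that contain it, with ranks starting at 1. The function name and the toy document IDs are illustrative and not taken from the project code.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with 1-based ranks. Returns doc IDs sorted by descending fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score, breaking ties by doc ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Toy example with hypothetical doc IDs.
bm25_top = ["d3", "d1", "d7"]
dense_top = ["d1", "d9", "d3"]
fused = rrf_fuse([bm25_top, dense_top], k=60)
print(fused)
```

Documents that appear near the top of both lists (here d1 and d3) outrank documents that appear in only one, which is the behaviour the fusion is meant to provide.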

Hit@1 gain (fine-tuned vs BM25)

+1,120%

0.0251 -> 0.3069

Hallucination reduction

-44.2pp

no-RAG 99.94% -> 55.76%
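
The two headline deltas above can be reproduced from the table values. This short sketch assumes the rounded figures as printed. With those inputs the relative Hit@1 gain comes out slightly above the rounded +1,120% shown on the card, and the percentage-point drop matches to one decimal.

```python
# Values copied from the results table (rounded as printed).
hit1_bm25 = 0.0251
hit1_finetuned = 0.3069
halluc_no_rag = 99.94      # percent
halluc_finetuned = 55.76   # percent

# Relative gain in Hit@1, expressed as a percentage of the BM25 value.
relative_gain_pct = (hit1_finetuned - hit1_bm25) / hit1_bm25 * 100

# Absolute change in hallucination rate, in percentage points.
reduction_pp = halluc_finetuned - halluc_no_rag

print(f"Hit@1 gain: {relative_gain_pct:+.0f}%")
print(f"Hallucination change: {reduction_pp:+.1f}pp")
```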

Retrieval ceiling (Hit@100)

37.5%

Irreducible ~56% hallucination floor

Final Results Summary

final_summary.json SHA-256: 43eec4d3023f9485... | 5 ablations x 20,877 queries = 104,385 LLM judgments | Generator: Qwen2.5-7B-Instruct (greedy, 4x L4) | Judge: gpt-4o-mini (~$53)
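
To confirm that a local copy of final_summary.json matches the digest quoted above, something like the following could be used. Only a 16-character hex prefix is shown on the dashboard, so this sketch compares prefixes rather than the full digest. The helper names and the local path are assumptions.

```python
import hashlib
from pathlib import Path

# Prefix as shown on the dashboard; the full digest is elided there.
EXPECTED_PREFIX = "43eec4d3023f9485"

def sha256_hex(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_prefix(path, prefix=EXPECTED_PREFIX):
    """True if the file's digest starts with the given hex prefix."""
    return sha256_hex(path).startswith(prefix.lower())

if __name__ == "__main__":
    summary = Path("final_summary.json")  # assumed local path
    if summary.exists():
        print("digest prefix matches:", matches_prefix(summary))
    else:
        print("final_summary.json not found; nothing to verify")
```

A prefix match is weaker evidence than a full-digest match, so the complete hash should be used wherever it is available.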

Stratified (HEAD/TORSO/TAIL): for the hub variants, TAIL Hit@10 exceeds HEAD Hit@10 by 1.66x-2.26x. The fine-tuned reranker reverses this pattern, with HEAD above TAIL: HEAD=0.3596, TORSO=0.3694, TAIL=0.3292.

W&B: 45 offline runs, 191.64 GB DVC/S3.