Research Question

Which retrieval setup most improves evidence grounding and reduces contradiction and neutral-evidence failures in a legal RAG system built over U.S. federal appellate opinions?

Hypotheses

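H1: The fine-tuned reranker (RRF + bge-reranker-v2-m3, fine-tuned on 7,442 legal hard negatives) achieves significantly higher Hit@10 than BM25 or BGE-M3 alone. CONFIRMED: Hit@10 rises to 0.3598 from 0.1459 (BM25) and 0.0862 (BGE-M3); Hit@1 rises about 12-fold over BM25 (0.0251 → 0.3069, roughly +1,120%).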
H2: Architectures with higher Hit@10 produce a significantly lower hallucination rate, as labeled by a gpt-4o-mini judge (FAITHFUL / PARTIAL / HALLUCINATED). CONFIRMED: Pearson r = −0.9624 (r² = 0.926, i.e. 92.6% of variance) across the 4 RAG ablations.
H3: The RRF hybrid achieves higher Hit@10 than BGE-M3 alone. CONFIRMED: 0.1557 vs. 0.0862 (+80.6%).
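
For readers unfamiliar with RRF, the following is a minimal sketch of reciprocal rank fusion over two ranked lists. It is not the project's code: the function name `rrf_fuse`, the opinion ids, and the smoothing constant `k=60` (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank is 1-based. k=60 is an assumed default, not a value taken
    from the project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]


# Hypothetical opinion ids standing in for BM25 and BGE-M3 results.
bm25_ranking = ["op_17", "op_03", "op_42"]
dense_ranking = ["op_03", "op_99", "op_17"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, which is one plausible reason a hybrid can beat either retriever alone. In a setup like the one described in H1, the fused list would then be passed to the reranker.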

Motivation

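The project is motivated by Mata v. Avianca, Inc. (2023), a documented case of legal hallucination with real-world consequences: attorneys were sanctioned for filing a brief that cited nonexistent cases generated by ChatGPT. The corpus consists of U.S. federal appellate opinions from CourtListener (1,465,484 opinions).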

Key Results (n=20,877 queries)

Retriever	Hit@1	Hit@10	MRR	Hallucinated
BM25	0.0251	0.1459	0.0642	62.92%
BGE-M3	0.0244	0.0862	0.0457	63.99%
RRF (Hybrid)	0.0391	0.1557	0.0769	60.27%
Reranker (fine-tuned)	0.3069	0.3598	0.3275	55.76%
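
The metrics in the table can be computed along the following lines. This is a sketch under stated assumptions, not the project's evaluation script: it assumes exactly one gold opinion per query and an untruncated MRR, neither of which is specified above, and the names `hit_at_k`, `reciprocal_rank`, and `evaluate` are hypothetical.

```python
def hit_at_k(ranked_ids, gold_id, k):
    """Return 1.0 if the gold document is among the top-k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, gold_id):
    """Return 1 / rank of the gold document (1-based), or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0


def evaluate(runs, ks=(1, 10)):
    """Average Hit@k and MRR over (ranked_ids, gold_id) pairs.

    Assumes a single gold document per query (an assumption for this sketch).
    """
    n = len(runs)
    metrics = {f"Hit@{k}": sum(hit_at_k(r, g, k) for r, g in runs) / n for k in ks}
    metrics["MRR"] = sum(reciprocal_rank(r, g) for r, g in runs) / n
    return metrics


# Three toy queries: gold at rank 1, gold at rank 3, gold not retrieved.
runs = [
    (["a", "b", "c"], "a"),
    (["a", "b", "c"], "c"),
    (["a", "b", "c"], "z"),
]
metrics = evaluate(runs)
print(metrics)
```

Under these definitions Hit@1 ≤ MRR ≤ Hit@10 for any retriever evaluated on a top-10 list, which is consistent with every row of the table.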

Pearson r = −0.9624 (r² = 0.926, i.e. 92.6% of variance) between Hit@10 and hallucination rate, computed over the four retrievers in the table (n = 4 data points).
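
The correlation can be checked directly from the four table rows. The sketch below uses the rounded table values, so it should land close to, but not necessarily exactly on, the reported figure; the helper name `pearson_r` is an assumption for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Rows in table order: BM25, BGE-M3, RRF (Hybrid), Reranker (fine-tuned).
hit_at_10 = [0.1459, 0.0862, 0.1557, 0.3598]
hallucinated_pct = [62.92, 63.99, 60.27, 55.76]

r = pearson_r(hit_at_10, hallucinated_pct)
r_squared = r * r
print(f"r = {r:.4f}, r^2 = {r_squared:.3f}")
```

With only four data points the coefficient is sensitive to any single retriever, so it is best read as a summary of the ablation trend rather than a general estimate.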