Research Question
Which retrieval setup most improves evidence grounding and reduces contradiction and neutral-evidence failures in a legal RAG system built over U.S. federal appellate opinions?
Hypotheses
- H1: The fine-tuned reranker (RRF + bge-reranker-v2-m3, fine-tuned on 7,442 legal hard negatives) achieves significantly higher Hit@10 than BM25 or BGE-M3 alone. CONFIRMED: Hit@10 of 0.3598 vs 0.1459 (BM25) and 0.0862 (BGE-M3); Hit@1 rises from 0.0251 (BM25) to 0.3069 (+1,123%).
- H2: Architectures with higher Hit@10 produce significantly lower hallucination rates, as labeled by a gpt-4o-mini judge (FAITHFUL / PARTIAL / HALLUCINATED). CONFIRMED: Pearson r = −0.9624 (r² = 92.6%) across the 4 RAG ablations.
- H3: The RRF hybrid achieves higher Hit@10 than BGE-M3 alone. CONFIRMED: 0.1557 vs 0.0862 (+80.6%).
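For readers unfamiliar with the hybrid in H3, here is a minimal sketch of Reciprocal Rank Fusion. It assumes the commonly used formulation, score(d) = Σ 1 / (k + rank of d in each list), with k = 60. The constant, the function name `rrf_fuse`, and the opinion IDs are illustrative assumptions, not taken from this project's code.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a common default; assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical top-3 lists from a sparse and a dense retriever.
bm25_top = ["op_17", "op_03", "op_42"]
dense_top = ["op_03", "op_99", "op_17"]
fused = rrf_fuse([bm25_top, dense_top])
print(fused)
```

A document ranked well by both retrievers (here `op_03`) outranks one that is first in only a single list, which is the behavior a hybrid relies on.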
Motivation
The project is motivated by Mata v. Avianca, Inc. (S.D.N.Y. 2023), in which attorneys were sanctioned after filing a brief that cited nonexistent cases generated by ChatGPT. The corpus is 1,465,484 U.S. federal appellate opinions from CourtListener.
Key Results (n=20,877 queries)
| Retriever | Hit@1 | Hit@10 | MRR | Hallucinated |
|---|---|---|---|---|
| BM25 | 0.0251 | 0.1459 | 0.0642 | 62.92% |
| BGE-M3 | 0.0244 | 0.0862 | 0.0457 | 63.99% |
| RRF (Hybrid) | 0.0391 | 0.1557 | 0.0769 | 60.27% |
| Reranker (fine-tuned) | 0.3069 | 0.3598 | 0.3275 | 55.76% |
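The retrieval columns above can be computed along the following lines. This is a sketch under the assumption that each query has exactly one gold opinion and that MRR is taken over the full returned list; the function names and toy data are hypothetical.

```python
def hit_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold document appears in the top k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, gold_id):
    """1 / rank of the gold document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k_values=(1, 10)):
    """Average Hit@k and MRR over (ranked_ids, gold_id) pairs."""
    n = len(runs)
    metrics = {
        f"Hit@{k}": sum(hit_at_k(r, g, k) for r, g in runs) / n
        for k in k_values
    }
    metrics["MRR"] = sum(reciprocal_rank(r, g) for r, g in runs) / n
    return metrics

# Toy example: gold at rank 1, gold at rank 3, gold not retrieved.
toy_runs = [
    (["a", "b", "c"], "a"),
    (["a", "b", "c"], "c"),
    (["a", "b", "c"], "z"),
]
toy_metrics = evaluate(toy_runs)
print(toy_metrics)
```

Under these definitions MRR always lies at or above Hit@1, which is consistent with every row of the table.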
Across the four retrievers, Hit@10 and hallucination rate are strongly negatively correlated: Pearson r = −0.9624 (r² = 92.6%).
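The correlation can be rechecked directly from the four table rows. The sketch below uses the rounded values as printed in the table, so it should agree with the reported figure only to about three decimal places. The variable names are illustrative.

```python
import math

# Values copied from the Key Results table, in row order:
# BM25, BGE-M3, RRF (Hybrid), Reranker (fine-tuned).
hit_at_10 = [0.1459, 0.0862, 0.1557, 0.3598]
hallucinated_pct = [62.92, 63.99, 60.27, 55.76]

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(hit_at_10, hallucinated_pct)
print(f"r = {r:.4f}, r^2 = {r * r:.1%}")
```

Because the estimate rests on four data points, it describes the trend across these ablations rather than a general relationship.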