Research Question

Which retrieval setup most improves evidence grounding and reduces contradiction and neutral-evidence failures in a legal RAG system built over U.S. federal appellate opinions?

Hypotheses

  • H1: The fine-tuned reranker (RRF + bge-reranker-v2-m3, fine-tuned on 7,442 legal hard negatives) achieves significantly higher Hit@10 than BM25 or BGE-M3 alone. CONFIRMED: Hit@10 rises to 0.3598, from 0.1459 (BM25) and 0.0862 (BGE-M3). Hit@1 rises from 0.0251 (BM25) to 0.3069, roughly a 12-fold increase (about +1,120%).
  • H2: Architectures with higher Hit@10 produce a significantly lower hallucination rate, as labeled by a gpt-4o-mini judge (FAITHFUL / PARTIAL / HALLUCINATED). CONFIRMED: Pearson r = −0.9624 (r² = 0.926, i.e. 92.6% of variance explained) across the 4 RAG ablations.
  • H3: The RRF hybrid achieves higher Hit@10 than BGE-M3 alone. CONFIRMED: 0.1557 vs. 0.0862 (+80.6%).
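
The hypotheses refer to an RRF hybrid without spelling out the fusion step. The sketch below shows reciprocal rank fusion (RRF) in its usual form, where each document scores the sum of 1 / (k + rank) over the input rankings. It is an illustration under assumptions, not the project's actual code: the constant k = 60 is the commonly used default, the two inputs are assumed to be BM25 and BGE-M3 result lists, and the function name `rrf_fuse` and the opinion IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a common default; an assumption here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 lists from a sparse and a dense retriever.
bm25_top = ["op_17", "op_03", "op_42"]
dense_top = ["op_03", "op_99", "op_17"]

fused = rrf_fuse([bm25_top, dense_top])
print(fused)
```

Documents that both retrievers rank highly rise to the top of the fused list, which is the behavior a downstream reranker would then refine.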

Motivation

This work is motivated by Mata v. Avianca, Inc. (S.D.N.Y. 2023), a documented case of legal hallucination with real-world consequences. The system targets U.S. federal appellate opinions from CourtListener (1,465,484 opinions).

Key Results (n=20,877 queries)

Retriever               Hit@1    Hit@10   MRR      Hallucinated
BM25                    0.0251   0.1459   0.0642   62.92%
BGE-M3                  0.0244   0.0862   0.0457   63.99%
RRF (Hybrid)            0.0391   0.1557   0.0769   60.27%
Reranker (fine-tuned)   0.3069   0.3598   0.3275   55.76%
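
For readers unfamiliar with the retrieval metrics in the table, the following is a minimal sketch of how Hit@k and MRR are conventionally computed per query and then averaged. It assumes each query has a set of relevant opinion IDs and that MRR is computed over the full returned list; the source does not say whether MRR is truncated at a cutoff. The function names and toy data are hypothetical.

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant ID appears in the top k results, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k_values=(1, 10)):
    """Average Hit@k and MRR over (ranked_ids, relevant_ids) pairs."""
    n = len(runs)
    metrics = {
        f"Hit@{k}": sum(hit_at_k(r, rel, k) for r, rel in runs) / n
        for k in k_values
    }
    metrics["MRR"] = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return metrics


# Three toy queries: relevant at rank 1, relevant at rank 3, never retrieved.
toy_runs = [
    (["a", "b", "c"], {"a"}),
    (["x", "y", "z"], {"z"}),
    (["p", "q"], {"missing"}),
]
toy_metrics = evaluate(toy_runs)
print(toy_metrics)
```

Because Hit@1 counts only first-position matches, it is a lower bound on MRR, which in turn cannot exceed Hit@k when MRR is truncated at k. The table rows are consistent with that ordering.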

Pearson r = −0.9624 (r² = 0.926) between Hit@10 and hallucination rate across the four retrievers above.
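
The correlation can be checked directly from the four table rows. The sketch below computes Pearson r from the rounded values printed above, so the result should land close to the reported figure but may differ in the last digit if the original was computed from unrounded data. The helper `pearson_r` is a plain-Python illustration, not the project's code.

```python
import math

# Values copied from the results table, in row order:
# BM25, BGE-M3, RRF (Hybrid), Reranker (fine-tuned).
hit_at_10 = [0.1459, 0.0862, 0.1557, 0.3598]
hallucinated_pct = [62.92, 63.99, 60.27, 55.76]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(hit_at_10, hallucinated_pct)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

Note that the coefficient is estimated from only four points, one per retrieval configuration.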