Architectures

The retrieval configurations in the table below are compared with Qwen2.5-7B-Instruct as the generator (greedy decoding, run locally on 4× NVIDIA L4). Hallucination is judged by gpt-4o-mini, which labels each answer FAITHFUL, PARTIAL, or HALLUCINATED against the shown contexts.

| ID | Architecture | Type | Role | Key Parameters |
|---|---|---|---|---|
| (a) | BM25 | Non-neural baseline | Reference floor | k1=1.5, b=0.75 |
| (b) | BGE-M3 | Dense retriever (CLS pooling) | Primary dense baseline | lr=1e-5, batch=32, epochs=3, 1024-subword chunks |
| (c) | RRF (BM25+BGE-M3) | Lexical + Dense Fusion | Strong hybrid baseline | k=60 (Cormack 2009), top-100 per retriever fused |
| (c2) | Reranker Concat (hub) | CrossEncoder hub | Out-of-domain reranker | bge-reranker-v2-m3, 2-chunk concat, max_length=1024 |
| (c3) | Reranker MaxP (hub) | CrossEncoder hub MaxP | Chunk-level max-pool | bge-reranker-v2-m3, per-chunk MaxP, max_length=1024 |
| (c4) | Reranker Fine-tuned | CrossEncoder fine-tuned on legal hard negatives | Expected strongest (+980% Hit@1) | bge-reranker-v2-m3 + 7,442 legal hard negatives, lr=2e-5, batch=32, epochs=2 |
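
Configurations (c) and (c3) rest on two small aggregation rules: reciprocal rank fusion with k=60 over the top-100 list from each retriever, and MaxP, which scores a document by its best-scoring chunk. The sketch below illustrates both under stated assumptions. The function names (`rrf_fuse`, `maxp`) and the toy IDs are hypothetical rather than taken from the project code, and each retriever is assumed to return a best-first list of IDs.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, depth=100):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; only the top `depth` hits of each list contribute.
        for rank, item_id in enumerate(ranking[:depth], start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def maxp(chunk_scores):
    """MaxP: a document's score is the maximum over its chunk scores."""
    best = {}
    for doc_id, score in chunk_scores:
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return best


bm25_top = ["c3", "c1", "c7"]   # hypothetical BM25 ranking, best first
dense_top = ["c1", "c9", "c3"]  # hypothetical BGE-M3 ranking, best first
fused = rrf_fuse([bm25_top, dense_top])

# Hypothetical (document, reranker score) pairs, one per chunk.
doc_scores = maxp([("op1", 0.2), ("op1", 0.9), ("op2", 0.4)])
```

In this toy case `c1` ends up ahead of `c3` because it sits at ranks 2 and 1 in the two lists, against ranks 1 and 3 for `c3`. RRF uses only ranks, so the BM25 and dense scores never need to be put on a common scale.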

Architecture & Training Summary

Corpus: 7,813,273 chunks (1,024 subwords each with a 128-subword overlap, tokenized with the BAAI/bge-m3 tokenizer) drawn from 1,465,484 federal appellate opinions across 13 circuits.

BM25: the index takes 36 min to build; retrieval takes 110 min at 3.2 queries per second (qps), single-threaded.

BGE-M3: retrieval takes 55 min at 6.3 qps across 4× L4.

Reranker fine-tuning: 7,442 hard-negative pairs, lr=2e-5, effective batch size 32, 2 epochs, 22 GPU-hours on 4× L4 with DDP.

Hard negatives: sampled from RRF ranks 2-100, with at most 2 chunks per cluster and 7 negatives per positive.
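
The chunking and hard-negative settings above can be sketched as follows. This is an illustration under assumptions, not the project's pipeline: the opinion is taken to be already tokenized into a list of subword IDs, sampling is assumed to be uniform at random over the eligible ranks, known positives are assumed to be excluded, and `chunk_tokens`, `sample_hard_negatives`, and the `cluster_of` mapping are hypothetical names.

```python
import random


def chunk_tokens(tokens, size=1024, overlap=128):
    """Sliding-window chunking: consecutive windows share `overlap` tokens."""
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def sample_hard_negatives(fused, positive_ids, cluster_of,
                          n_neg=7, max_per_cluster=2, seed=0):
    """Pick negatives from fused ranks 2..100, capping picks per cluster."""
    # fused[1:100] covers 1-based ranks 2 through 100; rank 1 is skipped.
    candidates = [c for c in fused[1:100] if c not in positive_ids]
    rng = random.Random(seed)
    rng.shuffle(candidates)  # assumption: uniform sampling over candidates
    picked, per_cluster = [], {}
    for chunk_id in candidates:
        cluster = cluster_of[chunk_id]
        if per_cluster.get(cluster, 0) >= max_per_cluster:
            continue
        picked.append(chunk_id)
        per_cluster[cluster] = per_cluster.get(cluster, 0) + 1
        if len(picked) == n_neg:
            break
    return picked


# Toy data: a 2,000-token opinion and a 120-item fused ranking.
chunks = chunk_tokens(list(range(2000)))
fused = [f"chunk{i}" for i in range(120)]
cluster_of = {c: i // 5 for i, c in enumerate(fused)}  # 5 chunks per cluster
negatives = sample_hard_negatives(fused, {"chunk3"}, cluster_of)
```

With a 1,024-token window and a 128-token overlap the stride is 896 tokens, so the 2,000-token toy opinion yields three chunks, the last one shorter than the rest. The per-cluster cap keeps a single opinion cluster from supplying most of the negatives for one query.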