Architectures
Six retrieval configurations are compared under a fixed Qwen2.5-7B-Instruct generator (greedy decoding, run locally on 4× NVIDIA L4 GPUs). Hallucination is judged by gpt-4o-mini, which labels each answer FAITHFUL, PARTIAL, or HALLUCINATED against the contexts shown to the generator.
| ID | Architecture | Type | Role | Key Parameters |
|---|---|---|---|---|
| (a) | BM25 | Non-neural baseline | Reference floor | k1=1.5, b=0.75 |
| (b) | BGE-M3 | Dense retriever (CLS pooling) | Primary dense baseline | lr=1e-5, batch=32, epochs=3, 1024-subword chunks |
| (c) | RRF (BM25+BGE-M3) | Lexical + Dense Fusion | Strong hybrid baseline | k=60 (Cormack 2009), top-100 per retriever fused |
| (c2) | Reranker Concat (hub) | CrossEncoder hub | Out-of-domain reranker | bge-reranker-v2-m3, 2-chunk concat, max_length=1024 |
| (c3) | Reranker MaxP (hub) | CrossEncoder hub MaxP | Chunk-level max-pool | bge-reranker-v2-m3, per-chunk MaxP, max_length=1024 |
| (c4) | Reranker Fine-tuned | CrossEncoder fine-tuned on legal hard negatives | Expected strongest (+980% Hit@1) | bge-reranker-v2-m3 + 7,442 legal hard negatives, lr=2e-5, batch=32, epochs=2 |
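Configuration (c) fuses the BM25 and BGE-M3 rankings with Reciprocal Rank Fusion. A minimal sketch of that fusion step follows, assuming each retriever returns an ordered list of chunk IDs; the function name `rrf_fuse`, the toy chunk IDs, and the alphabetical tie-break are illustrative assumptions, not the project's actual code. Only `k=60` and the top-100 depth come from the table.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, depth=100):
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1 and only the top `depth` of each list used.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking[:depth], start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID (an assumption)
    # so the output is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Toy rankings (hypothetical chunk IDs), best hit first.
bm25_top = ["c3", "c1", "c7"]
dense_top = ["c1", "c9", "c3"]
fused = rrf_fuse([bm25_top, dense_top])
print(fused)
```

Chunks that appear in both lists accumulate two reciprocal-rank terms, so they rise above chunks found by only one retriever, regardless of the retrievers' raw score scales.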
Architecture & Training Summary
Corpus: 7,813,273 chunks (1,024 subwords with 128-subword overlap, BAAI/bge-m3 tokenizer) drawn from 1,465,484 federal appellate opinions across 13 circuits. BM25: index build takes 36 min; retrieval takes 110 min at 3.2 queries/s, single-threaded. BGE-M3: retrieval takes 55 min at 6.3 queries/s across 4× L4. Reranker: fine-tuned on 7,442 hard-negative pairs (lr=2e-5, effective batch size 32, 2 epochs, 22 GPU-hours, 4× L4 with DDP). Hard negatives: sampled from RRF ranks 2-100, at most 2 chunks per cluster, 7 negatives per positive.
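The hard-negative sampling rule can be sketched as below. This is an illustration under stated assumptions, not the pipeline's actual implementation: it assumes candidates arrive as `(chunk_id, cluster_id)` pairs in RRF order, that chunks from the positive's own cluster are excluded, and that sampling is uniform over the remaining pool. The function name and the toy data are hypothetical. Only the rank window (2-100), the cap of 2 chunks per cluster, and the 7 negatives per positive come from the text.

```python
import random
from collections import Counter


def sample_hard_negatives(ranked, positive_cluster, n_neg=7,
                          max_per_cluster=2, lo=2, hi=100, seed=0):
    """Pick hard negatives from RRF ranks lo..hi (1-indexed, inclusive).

    ranked: list of (chunk_id, cluster_id), rank 1 first.
    Assumption: chunks from the positive's cluster are never negatives.
    """
    rng = random.Random(seed)
    # Slice the rank window and drop the positive's own cluster.
    pool = [c for c in ranked[lo - 1:hi] if c[1] != positive_cluster]
    rng.shuffle(pool)  # uniform sampling order (an assumption)

    per_cluster = Counter()
    negatives = []
    for chunk_id, cluster_id in pool:
        if per_cluster[cluster_id] >= max_per_cluster:
            continue  # cap reached for this cluster
        per_cluster[cluster_id] += 1
        negatives.append(chunk_id)
        if len(negatives) == n_neg:
            break
    return negatives


# Toy RRF output: 120 chunks spread over 5 hypothetical clusters.
ranked = [(f"chunk{i}", f"cluster{i % 5}") for i in range(1, 121)]
negatives = sample_hard_negatives(ranked, positive_cluster="cluster1")
print(negatives)
```

Capping chunks per cluster keeps a single long opinion from dominating the negatives for a query, which is presumably the motivation for the 2-chunk limit.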