Reducing Hallucination in Legal RAG Chatbots: A Comparative Study of Deep Learning Retrieval Architectures
COMPSCI 1090B: Data Science 2: Advanced Topics in Data Science
Harvard University · 2025-2026 Spring
Project Group - #43: GitHub
- Alex Oort Alonso
- Allan Korir
- Phong Le
- Brit Biddle
Assigned Group's Teaching Fellow contact:
Zac Sardi-Santos
Milestone 2 Presentation with TF:
Friday, April 10, 2026, at 4:00 PM ET
Milestone 3 Presentation with TF:
Friday, April 24, 2026, at 4:00 PM ET
Milestone 4 — Final Deliverables:
Due: Monday, May 12, 2026
MS4 TF meeting:
Saturday, May 9, 2026, at 7:00 p.m. ET
TF Reviewer Comments & Instructor Notes — Addressed
All Resolved
Concern:
“It is not clear where the human-annotated hallucination rate comes from. The proposal does not mention how embedding methods will be trained to encode legal text.”
Response:
- No human annotation required. Hallucination measurement is fully automated across three tiers:
  - Tier A (retrieval ground truth): LePaRD's 4M+ expert-annotated citation pairs serve as gold-standard retrieval ground truth; evaluation is capped at 10K–50K pairs. Metrics: Hit@k, MRR, NDCG@10.
  - Tier B (LLM-as-judge hallucination measurement): gpt-4o-mini judges each generation (FAITHFUL / PARTIAL / HALLUCINATED) against the contexts shown to the generator. 5 ablations × 20,877 queries = 104,385 judgments. Budget ~$53. 95% CIs ±0.86% to ±1.96%. Pearson r = -0.9624 (r² = 92.6%) between Hit@10 and hallucination rate.
  - Tier C (stratified retrieval ceiling analysis): HEAD/TORSO/TAIL evaluation by gold-cluster citation frequency. The Hit@100 = 0.375 ceiling defines an irreducible hallucination floor of ~56%. Inverted long-tail finding: TAIL Hit@10 exceeds HEAD by 1.66×–2.26× across hub variants (rare precedents have distinctive contexts; constitutional canon has generic ones). Fine-tuning flips the pattern: HEAD ≥ TORSO > TAIL.
- Embedding model training: BGE-M3 is fine-tuned with MultipleNegativesRankingLoss on 500K–1M capped LePaRD pairs (lr=1e-5, warmup=10%, batch=32, epochs=3). CLS pooling is enforced per the BAAI config, with a runtime assertion in model_loader.py; pooling flags are logged to W&B. BM25 requires no training (k1=1.5, b=0.75). All architectures are evaluated with Qwen2.5-7B-Instruct held constant (greedy decoding, local 4× L4); hallucination is judged by gpt-4o-mini.
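The Tier A metrics named above can be illustrated with a minimal sketch. It assumes binary relevance (a retrieved document is either one of the gold citations or not) and a single ranked list per query; the function names are illustrative and are not taken from the project's src/ modules.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if any gold document appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in gold for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first gold document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gold: Set[str], k: int = 10) -> float:
    """NDCG@k with binary gains (assumption: every gold document has gain 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in gold
    )
    # Ideal ranking places all gold documents first, truncated at k.
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging `reciprocal_rank` over all test queries gives MRR; averaging `hit_at_k` gives Hit@k for the chosen k.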
Concern:
“Warning: groups should only consider this project if they have a plan for addressing the concerns regarding human-annotation of hallucinations and training of the embedding model. Without addressing the annotation problem the project will be infeasible.”
Response:
- Human annotation bottleneck eliminated. LePaRD (ACL 2024) provides 4M+ expert-annotated legal citation pairs as gold-standard retrieval ground truth (2,429,533 verified pairs via eyecite + rapidfuzz semantic bridge, 60.74% of 4M). gpt-4o-mini judges hallucination automatically, so no human annotation is needed.
- Embedding model training: BGE-M3 is fine-tuned as the main dense retriever using contrastive learning (MultipleNegativesRankingLoss) on 500K–1M LePaRD citation pairs (lr=1e-5, batch=32, 3 epochs) to produce high-quality legal embeddings for both the standalone dense retriever and the hybrid BM25+BGE-M3+reranker pipeline.
- Compute feasibility confirmed and capped. Training: 500K–1M pairs (not 3.2M full LePaRD). Retrieval eval: 10K–50K queries. Generation eval: 1,000 stratified queries (±2.5pp at 95% CI). Iteration corpus: ~150K opinions (10% subset) for fast iteration; full 1.46M for final runs.
- Infrastructure already operational. 1,465,484 federal appellate opinions downloaded, filtered, sharded (7.6GB); DVC + S3 versioning active; all src/ modules implemented and tested. Environment asserts transformers.__version__ == "4.39.3", torch.cuda.is_bf16_supported(), and get_device_capability()[0] >= 8 at startup.
- Sequential model loading prevents VRAM exhaustion on a single 23.7GB L4 (SLURM-allocated). BGE-M3 (~2.27GB) and the DeBERTa NLI reranker (~1.7GB) are the only GPU-resident models, loaded one phase at a time with explicit DataLoader deletion + torch.cuda.empty_cache() + gc.collect() between phases. Qwen2.5-7B-Instruct loads locally (~15GB) for generation; each ablation runs as its own SLURM job, 4-way query-sharded across 4× L4. The gpt-4o-mini API is used only for post-hoc hallucination judging (~$53 total). Memory stats, CUDA stream sync time, and allow_tf32 state are logged per phase.
- Priority sequencing: LePaRD acquisition → 10–20% subset fast iteration → BM25 + BGE-M3 + Tier A → scale + Tier B/C.
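To make the contrastive objective mentioned above concrete, here is a dependency-free sketch of the in-batch-negatives idea behind MultipleNegativesRankingLoss: for each query, its paired passage is the positive and every other passage in the batch acts as a negative. The sketch assumes cosine similarity with a scale factor of 20 (which is, to my understanding, the sentence-transformers default); the real training runs on GPU tensors, and the function names here are illustrative only.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(
    queries: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    scale: float = 20.0,  # assumed default; tune as needed
) -> float:
    """Mean cross-entropy where positives[i] is the correct match for queries[i]."""
    losses: List[float] = []
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, passage) for passage in positives]
        # Numerically stable log-sum-exp over all passages in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)
```

A larger batch gives each query more negatives per step, which is one reason the batch size (32 here) matters for this loss.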
Pipeline Status
Agile Sprint Plan — Coding Tasks
Sprint 1 — Environment & Data Infrastructure
Mar 24 – Apr 10
- Environment bootstrap: setup.sh, tests passing, coverage verified
- CourtListener: 1,465,484 opinions downloaded, 159 shards, 7.6GB
- DVC + S3 artifact versioning operational
- All src/ modules implemented and tested
- SQLite citation index built via src/extract.py
- ruff + mypy linting configured in pyproject.toml
- pip-audit CVE scan + CycloneDX SBOM generation in CI
Sprint 2 — Data Wrangling & LePaRD Acquisition
Apr 10 – Apr 17
- CourtListener RAG-readiness refinement (Cell 2 — tokenizer-aware chunking 1024 subwords)
- spaCy stripped pipeline setup (exclude=["ner","parser","lemmatizer"]), nlp.max_length set for full appellate opinions
- Citation-aware chunk splits with metadata per chunk: court_id, year, is_precedential, opinion_id, chunk_index
- LePaRD acquisition via HuggingFace — Priority 1 (cap 500K–1M pairs)
- DVC push data shards to S3 cs1090b-hallucinationlegalragchatbots
- Train/val/test split — src/split.py (500K train / 50K val / 10K–50K test)
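The chunking and per-chunk metadata tasks above can be sketched as a sliding window over token IDs. This is a simplified illustration: it assumes the opinion has already been tokenized, uses a 128-subword overlap as a placeholder default, and ignores citation boundaries, which the actual citation-aware splitter would need to respect. The `Chunk` class and `chunk_tokens` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Chunk:
    # Metadata fields mirror the per-chunk list in the Sprint 2 plan.
    opinion_id: str
    court_id: str
    year: int
    is_precedential: bool
    chunk_index: int
    token_ids: List[int]


def chunk_tokens(
    token_ids: Sequence[int],
    *,
    opinion_id: str,
    court_id: str,
    year: int,
    is_precedential: bool,
    max_len: int = 1024,
    overlap: int = 128,  # assumed default for illustration
) -> List[Chunk]:
    """Split one opinion into overlapping windows of at most max_len tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    stride = max_len - overlap
    chunks: List[Chunk] = []
    start = 0
    while start < len(token_ids):
        window = list(token_ids[start:start + max_len])
        chunks.append(
            Chunk(opinion_id, court_id, year, is_precedential, len(chunks), window)
        )
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the opinion
        start += stride
    return chunks
```

Keeping `opinion_id` and `chunk_index` on every chunk lets retrieval hits be mapped back to the source opinion when scoring against LePaRD gold citations.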
Sprint 3 — Index Generation & Model Training
Apr 17 – Apr 24
- BM25 (bm25s) index over pre-chunked payloads from Stage 3
- BGE-M3 FAISS Flat index for validation (CLS pooling, bfloat16)
- BGE-M3 fine-tuning: MultipleNegativesRankingLoss, lr=1e-5, batch=32, epochs=3
- Hybrid: BM25+BGE-M3+bge-reranker-v2-m3 CrossEncoder (top-50→top-10)
- FAISS IVF for full-corpus: index.train() on 100K subset, assert index.is_trained
- Log Hit@k vs nprobe on validation set to justify IVF parameters; log nprobe/nlist to W&B
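One way the BM25 and BGE-M3 candidate lists could be merged before reranking is reciprocal rank fusion (RRF), which the pipeline lineage lists as a stage. The sketch below is a generic RRF implementation, not the project's code; the constant k=60 is a commonly used value and is an assumption here, as is the `top_n=50` default chosen to match the top-50 reranker input.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,      # assumed smoothing constant
    top_n: int = 50,  # assumed size of the candidate list passed to the reranker
) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]
```

Because RRF uses only ranks, it avoids having to calibrate BM25 scores against cosine similarities, which live on different scales.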
Sprint 4 — Evaluation: Tiers A/B/C
Apr 24 – May 5
- Tier A: LePaRD Hit@k, MRR, NDCG@10 on 10K–50K capped test set
- Tier B: gpt-4o-mini LLM-as-judge — FAITHFUL/PARTIAL/HALLUCINATED per generation vs shown contexts (~$53, 104,385 judgments)
- Hard-negative mining: 7,442 train + 391 val queries, 7 negatives each from RRF top-100
- Tier C: Stratified evaluation HEAD/TORSO/TAIL by gold-cluster citation frequency — inverted long-tail finding
- Fine-tune bge-reranker-v2-m3 on legal hard negatives: lr=2e-5, batch=32 (eff.), epochs=2, 4× L4 DDP, 22 GPU-hours
- RAG generation: Qwen2.5-7B-Instruct, 5 ablations × 20,877 queries = 104,385 generations, 4-way query-sharded across 4× L4
- W&B experiment tracking: VRAM, GPU hours, metrics per phase
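The hard-negative mining step above can be illustrated as follows. The sketch assumes the simplest selection strategy, taking the highest-ranked non-gold documents from the fused top-100 pool; the project may sample differently (for example, skipping the very top ranks to avoid false negatives), and `mine_hard_negatives` is a hypothetical name.

```python
from typing import List, Sequence, Set


def mine_hard_negatives(
    fused_ranking: Sequence[str],
    gold_ids: Set[str],
    num_negatives: int = 7,
    pool_size: int = 100,
) -> List[str]:
    """Pick the top-ranked non-gold documents as hard negatives for one query."""
    pool = fused_ranking[:pool_size]
    # Documents the retriever ranks highly but that are not gold citations
    # are "hard": they look relevant to the model yet are wrong.
    negatives = [doc_id for doc_id in pool if doc_id not in gold_ids]
    return negatives[:num_negatives]
```

Each query then yields one training group of (query, gold passage, up to 7 negatives) for reranker fine-tuning.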
Sprint 5 — Analysis, Ablations & Final Deliverables
May 5 – May 12
- Paired bootstrap significance tests (B=10,000), Cohen's d, BH-FDR
- Ablation: BGE-M3 vs Hybrid, w/o reranker, k∈{1,5,10,20}
- Ablation: training size 100K vs 500K vs 1M pairs
- Ablation: chunk overlap 128 vs 64 subwords on 10% subset
- Ablation: Stage 3 normalization on/off
- Ablation: Contradiction vs Neutral vs combined metric sensitivity
- W&B: 45 offline runs, lineage DAG (prep→bm25→bge-m3→rrf→reranker→rag→judge), 191.64GB DVC/S3
- Final report: 2000–2500 words
- Video presentation: 6 minutes
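The significance-testing task above can be sketched with the standard library alone. The bootstrap below is one common variant (resampling mean-centered per-query differences to approximate the null), and the second function is the Benjamini-Hochberg step-up procedure; both are illustrative sketches under those assumptions rather than the project's analysis code.

```python
import random
from typing import List, Sequence


def paired_bootstrap_pvalue(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    num_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for mean(a - b) using per-query paired differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Center the differences so resamples are drawn under the null (mean 0).
    centered = [d - observed for d in diffs]
    rng = random.Random(seed)
    extreme = 0
    for _ in range(num_resamples):
        sample_mean = sum(rng.choices(centered, k=n)) / n
        if abs(sample_mean) >= abs(observed):
            extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (extreme + 1) / (num_resamples + 1)


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, per hypothesis, whether it is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected
```

In practice the p-values from every pairwise ablation comparison would be collected first and passed to `benjamini_hochberg` together, since the correction depends on the total number of tests.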
Course Milestones
Milestone 1: Group Formation & Project Selection
Weight: 2% · Due: March 24, 2026
Select top 5 project choices. Groups of 3–5 students. Staff assigns groups March 27.
Milestone 2: Data Wrangling & Project Redefinition
Weight: 10% · Due: April 10, 2026
Data acquisition, preprocessing, missing data, imbalances, scaling. 10-min presentation.
Milestone 3: EDA, Initial Modeling & Pipeline Development
Weight: 20% · Due: April 24, 2026
EDA, baseline model, training/testing pipeline, evaluation metrics. 10-min presentation.
Milestone 4: Final Modeling & Deliverables
Weight: 68% · Due: May 12, 2026
2000–2500 word report, 6-min video, well-commented Python notebook.
Sprint Timeline
Mar 24 – Apr 10
Sprint 1 — Environment & Data Infrastructure
Apr 10 – Apr 17
Sprint 2 — Data Wrangling & LePaRD Acquisition
Apr 17 – Apr 24
Sprint 3 — Index Generation & Model Training
Apr 24 – May 5
Sprint 4 — Evaluation: Tiers A/B/C
May 5 – May 12
Sprint 5 — Analysis, Ablations & Final Deliverables