Reducing Hallucination in Legal RAG Chatbots: A Comparative Study of Deep Learning Retrieval Architectures

COMPSCI 1090B: Data Science 2: Advanced Topics in Data Science

Harvard University · 2025-2026 Spring

Project Group - #43: GitHub

Alex Oort Alonso
Allan Korir
Phong Le
Brit Biddle

Assigned Group's Teaching Fellow contact:

Zac Sardi-Santos

Milestone 2 Presentation with TF:

Friday, April 10, 2026, at 4:00 PM ET

Presentation slides, Notebook

Milestone 3 Presentation with TF:

Friday, April 24, 2026, at 4:00 PM ET

Milestone 4 — Final Deliverables:

Due: Tuesday, May 12, 2026

Final Report, Code Notebook

MS4 TF meeting:

Saturday, May 9, 2026, at 7:00 PM ET

TF Reviewer Comments & Instructor Notes — Addressed

All Resolved

TF Reviewer · Resolved

Concern:

“It is not clear where the human-annotated hallucination rate comes from. The proposal does not mention how embedding methods will be trained to encode legal text.”

Response:

• No human annotation is required. Hallucination measurement is fully automated across three tiers:
– Tier A — Retrieval ground truth: LePaRD's 4M+ expert-annotated citation pairs serve as the gold standard for retrieval; evaluation is capped at 10K–50K pairs. Metrics: Hit@k, MRR, NDCG@10.
– Tier B — LLM-as-Judge hallucination measurement: gpt-4o-mini labels each generation FAITHFUL, PARTIAL, or HALLUCINATED against the contexts shown to the generator. 5 ablations × 20,877 queries = 104,385 judgments. Budget ~$53. 95% CIs range from ±0.86% to ±1.96%. Pearson r = -0.9624 (r² = 92.6%) between Hit@10 and hallucination rate.
– Tier C — Stratified retrieval ceiling analysis: HEAD/TORSO/TAIL evaluation by gold-cluster citation frequency. The Hit@100 = 0.375 ceiling defines an irreducible hallucination floor of ~56%. Inverted long-tail finding: TAIL Hit@10 exceeds HEAD by 1.66×–2.26× across hub variants (rare precedents have distinctive contexts; constitutional canon has generic ones). Fine-tuning flips the pattern: HEAD ≥ TORSO > TAIL.
• Embedding model training: BGE-M3 is fine-tuned with MultipleNegativesRankingLoss on 500K–1M capped LePaRD pairs (lr=1e-5, warmup=10%, batch=32, epochs=3). CLS pooling is enforced per the BAAI config, with a runtime assertion in model_loader.py and pooling flags logged to W&B. BM25 requires no training (k1=1.5, b=0.75). All architectures are evaluated with the generator, Qwen2.5-7B-Instruct, held constant (greedy decoding, local 4× L4); hallucination is judged by gpt-4o-mini.
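The Tier A metrics named above can be sketched as follows. This is a minimal illustration under assumptions, not the project's evaluation code: relevance is treated as binary (a passage is either a gold citation or not), each ranked list is assumed to contain no duplicate IDs, and the function names are hypothetical.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Return 1.0 if any gold passage appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in gold for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """Return 1/rank of the first gold passage, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gold: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in gold
    )
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: the single gold passage "a" is retrieved at rank 2.
ranked_ids = ["c", "a", "b"]
gold_ids = {"a"}
print(hit_at_k(ranked_ids, gold_ids, 1), reciprocal_rank(ranked_ids, gold_ids))
```

MRR is then the mean of `reciprocal_rank` over all test queries, and Hit@k and NDCG@10 are averaged the same way.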

Instructor · Resolved

Concern:

“Warning: groups should only consider this project if they have a plan for addressing the concerns regarding human-annotation of hallucinations and training of the embedding model. Without addressing the annotation problem the project will be infeasible.”

Response:

• Human annotation bottleneck eliminated. LePaRD (ACL 2024) provides 4M+ expert-annotated legal citation pairs as gold-standard retrieval ground truth (2,429,533 verified pairs via the eyecite + rapidfuzz semantic bridge, 60.74% of 4M). gpt-4o-mini judges hallucination automatically, so no human labeling is needed.
• Embedding model training: BGE-M3, the main dense retriever, is fine-tuned contrastively with MultipleNegativesRankingLoss on 500K–1M LePaRD citation pairs (lr=1e-5, batch=32, 3 epochs). The resulting legal embeddings serve both the standalone dense retriever and the hybrid BM25+BGE-M3+reranker pipeline.
• Compute feasibility confirmed and capped. Training: 500K–1M pairs (not 3.2M full LePaRD). Retrieval eval: 10K–50K queries. Generation eval: 1,000 stratified queries (±2.5pp at 95% CI). Corpus: ~150K opinions (10% subset) for fast iteration; full 1.46M for final runs.
• Infrastructure already operational. 1,465,484 federal appellate opinions downloaded, filtered, and sharded (7.6GB); DVC + S3 versioning active; all src/ modules implemented and tested. At startup the environment asserts transformers.__version__ == "4.39.3", torch.cuda.is_bf16_supported(), and get_device_capability()[0] >= 8.
• Sequential model loading prevents VRAM exhaustion on a single 23.7GB L4 (SLURM-allocated). BGE-M3 (~2.27GB) and the DeBERTa NLI reranker (~1.7GB) are the only GPU-resident models, loaded one phase at a time with explicit DataLoader deletion + torch.cuda.empty_cache() + gc.collect() between phases. Qwen2.5-7B-Instruct loads locally (~15GB) for generation; each ablation runs as its own SLURM job, 4-way query-sharded across 4× L4. The gpt-4o-mini API is used only for post-hoc hallucination judging (~$53 total). Memory stats, CUDA stream sync time, and allow_tf32 state are logged per phase.
• Priority sequencing: LePaRD acquisition → 10–20% subset fast iteration → BM25 + BGE-M3 + Tier A → scale + Tier B/C.
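The between-phase cleanup described above (drop references, then gc.collect() and torch.cuda.empty_cache()) might be wrapped in a context manager like the one below. This is a sketch under assumptions: `gpu_phase`, the phase names, and the in-memory log are hypothetical, and torch is imported optionally so the example also runs on a CPU-only machine.

```python
import gc
from contextlib import contextmanager
from typing import Iterator, List

try:  # torch is optional in this sketch
    import torch
except ImportError:
    torch = None


@contextmanager
def gpu_phase(name: str, log: List[str]) -> Iterator[None]:
    """Run one pipeline phase, then release cached GPU memory before the next."""
    log.append(f"start:{name}")
    try:
        yield
    finally:
        gc.collect()  # collect unreachable Python objects first
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
        log.append(f"end:{name}")


phases: List[str] = []
with gpu_phase("bge-m3-encode", phases):
    embeddings = [0.0] * 4  # stand-in for the real encoding work
    del embeddings  # references must be dropped inside the phase to be freed
with gpu_phase("rerank", phases):
    scores = [0.0] * 4  # stand-in for the real reranking work
    del scores
print(phases)
```

Note that empty_cache() only releases memory no live tensor still references, so the model and DataLoader objects have to be deleted before the phase exits.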

Pipeline Status

Environment Bootstrap

CourtListener Download

DVC + S3

CourtListener RAG Prep

LePaRD Acquisition

Index Generation

Model Training

Evaluation Tiers A/B/C

Experiment Tracking W&B

Agile Sprint Plan — Coding Tasks

Sprint 1 — Environment & Data Infrastructure

Mar 24 – Apr 10

complete

Environment bootstrap: setup.sh, tests passing, coverage verified
CourtListener: 1,465,484 opinions downloaded, 159 shards, 7.6GB
DVC + S3 artifact versioning operational
All src/ modules implemented and tested
SQLite citation index built via src/extract.py
ruff + mypy linting configured in pyproject.toml
pip-audit CVE scan + CycloneDX SBOM generation in CI

Sprint 2 — Data Wrangling & LePaRD Acquisition

Apr 10 – Apr 17

complete

CourtListener RAG-readiness refinement (Cell 2 — tokenizer-aware chunking 1024 subwords)
spaCy stripped pipeline setup (exclude=["ner","parser","lemmatizer"]), nlp.max_length set for full appellate opinions
Citation-aware chunk splits with metadata per chunk: court_id, year, is_precedential, opinion_id, chunk_index
LePaRD acquisition via HuggingFace — Priority 1 (cap 500K–1M pairs)
DVC push data shards to S3 cs1090b-hallucinationlegalragchatbots
Train/val/test split — src/split.py (500K train / 50K val / 10K–50K test)
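A minimal sketch of fixed-window chunking with per-chunk metadata, assuming the opinion has already been tokenized into subword IDs. The 1024-subword window and the metadata fields come from the plan above; the 128-subword overlap is one of the two values listed for the overlap ablation. The `Chunk` and `chunk_opinion` names are hypothetical, and this simplified version does not reproduce the citation-aware split logic.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Chunk:
    opinion_id: str
    chunk_index: int
    court_id: str
    year: int
    is_precedential: bool
    token_ids: List[int]


def chunk_opinion(
    token_ids: Sequence[int],
    opinion_id: str,
    court_id: str,
    year: int,
    is_precedential: bool,
    max_len: int = 1024,
    overlap: int = 128,
) -> List[Chunk]:
    """Split one opinion into overlapping windows of at most max_len subwords."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks: List[Chunk] = []
    start = 0
    while start < len(token_ids):
        window = list(token_ids[start:start + max_len])
        chunks.append(
            Chunk(opinion_id, len(chunks), court_id, year, is_precedential, window)
        )
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the opinion
        start += step
    return chunks


# Toy opinion of 2,000 subword IDs.
example = chunk_opinion(list(range(2000)), "op-1", "ca9", 2001, True)
print([len(c.token_ids) for c in example])
```

Each window starts `max_len - overlap` subwords after the previous one, so consecutive chunks share exactly `overlap` subwords except possibly at the tail.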

Sprint 3 — Index Generation & Model Training

Apr 17 – Apr 24

complete

BM25 (bm25s) index over pre-chunked payloads from Stage 3
BGE-M3 FAISS Flat index for validation (CLS pooling, bfloat16)
BGE-M3 fine-tuning: MultipleNegativesRankingLoss, lr=1e-5, batch=32, epochs=3
Hybrid: BM25+BGE-M3+bge-reranker-v2-m3 CrossEncoder (top-50→top-10)
FAISS IVF for full-corpus: index.train() on 100K subset, assert index.is_trained
Log Hit@k vs nprobe on validation set to justify IVF parameters; log nprobe/nlist to W&B
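The hybrid path (fuse the BM25 and BGE-M3 rankings, rerank the top 50, keep the top 10) can be sketched as below. Reciprocal rank fusion is assumed as the fusion step because the run lineage elsewhere in this plan lists an rrf stage; the constant k=60 is a common default rather than a value stated in this plan, and `rerank_score` is a hypothetical stand-in for the cross-encoder.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_retrieve(
    bm25_ranked: Sequence[str],
    dense_ranked: Sequence[str],
    rerank_score: Callable[[str], float],
    pool: int = 50,
    final: int = 10,
) -> List[str]:
    """Fuse both rankings, rerank the top `pool` candidates, keep `final`."""
    candidates = rrf_fuse([bm25_ranked, dense_ranked])[:pool]
    return sorted(candidates, key=rerank_score, reverse=True)[:final]


# Toy rankings from the two first-stage retrievers.
fused = rrf_fuse([["a", "b", "c"], ["b", "c", "d"]])
print(fused)
```

Because RRF uses only ranks, the BM25 and dense scores never need to be put on a common scale before fusion.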

Sprint 4 — Evaluation: Tiers A/B/C

Apr 24 – May 5

complete

Tier A: LePaRD Hit@k, MRR, NDCG@10 on 10K–50K capped test set
Tier B: gpt-4o-mini LLM-as-judge — FAITHFUL/PARTIAL/HALLUCINATED per generation vs shown contexts (~$53, 104,385 judgments)
Hard-negative mining: 7,442 train + 391 val queries, 7 negatives each from RRF top-100
Tier C: Stratified evaluation HEAD/TORSO/TAIL by gold-cluster citation frequency — inverted long-tail finding
Fine-tune bge-reranker-v2-m3 on legal hard negatives: lr=2e-5, batch=32 (eff.), epochs=2, 4× L4 DDP, 22 GPU-hours
RAG generation: Qwen2.5-7B-Instruct, 5 ablations × 20,877 queries = 104,385 generations, 4-way query-sharded across 4× L4
W&B experiment tracking: VRAM, GPU hours, metrics per phase
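Turning the Tier B judge labels into a hallucination rate with a 95% interval might look like the sketch below. Two assumptions are made that this plan does not state: only HALLUCINATED counts toward the rate (PARTIAL is excluded), and the interval uses the normal approximation for a proportion. The function name is hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

LABELS = ("FAITHFUL", "PARTIAL", "HALLUCINATED")


def hallucination_rate(labels: Iterable[str], z: float = 1.96) -> Tuple[float, float]:
    """Return (rate, margin), where rate ± margin is a normal-approximation CI."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected judge labels: {sorted(unknown)}")
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no judgments provided")
    p = counts["HALLUCINATED"] / n  # assumption: PARTIAL is not counted
    margin = z * math.sqrt(p * (1.0 - p) / n)
    return p, margin


# Toy batch of 100 judgments.
toy_labels = ["HALLUCINATED"] * 25 + ["FAITHFUL"] * 60 + ["PARTIAL"] * 15
rate, margin = hallucination_rate(toy_labels)
print(f"{rate:.3f} ± {margin:.3f}")
```

The margin shrinks with the square root of the number of judgments, which is why per-stratum intervals are wider than whole-ablation ones.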

Sprint 5 — Analysis, Ablations & Final Deliverables

May 5 – May 12

complete

Paired bootstrap significance tests (B=10,000), Cohen's d, BH-FDR
Ablation: BGE-M3 vs Hybrid, w/o reranker, k∈{1,5,10,20}
Ablation: training size 100K vs 500K vs 1M pairs
Ablation: chunk overlap 128 vs 64 subwords on 10% subset
Ablation: Stage 3 normalization on/off
Ablation: Contradiction vs Neutral vs combined metric sensitivity
W&B: 45 offline runs, lineage DAG (prep→bm25→bge-m3→rrf→reranker→rag→judge), 191.64GB DVC/S3
Final report: 2000–2500 words
Video presentation: 6 minutes
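The significance testing listed above (paired bootstrap with B=10,000, followed by Benjamini-Hochberg FDR control) can be sketched with the standard library. This is one common formulation and not necessarily the variant used in the project: the bootstrap centers the per-query differences to impose the null hypothesis, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def paired_bootstrap_pvalue(
    a: Sequence[float], b: Sequence[float], n_resamples: int = 10_000, seed: int = 0
) -> float:
    """Two-sided p-value for a nonzero mean per-query difference a - b."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = sum(diffs) / n
    centered = [d - observed for d in diffs]  # impose a zero mean difference
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        resampled_mean = sum(rng.choices(centered, k=n)) / n
        if abs(resampled_mean) >= abs(observed):
            extreme += 1
    return (extreme + 1) / (n_resamples + 1)  # add-one smoothing avoids p = 0


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return a reject/keep flag per hypothesis, controlling FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank  # largest rank whose p-value clears its threshold
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Toy p-values from four hypothetical pairwise comparisons.
flags = benjamini_hochberg([0.01, 0.02, 0.03, 0.5])
print(flags)
```

In use, `a` and `b` would hold per-query scores (for example Hit@10 indicators) for two systems over the same queries, and the resulting p-values across all pairwise comparisons would be passed to `benjamini_hochberg` together.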

Course Milestones

Milestone 1: Group Formation & Project Selection

Weight: 2%

Due: March 24, 2026

Select top 5 project choices. Groups of 3–5 students. Staff assigns groups March 27.

Milestone 2: Data Wrangling & Project Redefinition

Weight: 10%

Due: April 10, 2026

Data acquisition, preprocessing, missing data, imbalances, scaling. 10-min presentation.

Milestone 3: EDA, Initial Modeling & Pipeline Development

Weight: 20%

Due: April 24, 2026

EDA, baseline model, training/testing pipeline, evaluation metrics. 10-min presentation.

Milestone 4: Final Modeling & Deliverables

Weight: 68%

Due: May 12, 2026

2000–2500 word report, 6-min video, well-commented Python notebook.

Sprint Timeline

Mar 24 – Apr 10
Sprint 1 — Environment & Data Infrastructure
Apr 10 – Apr 17
Sprint 2 — Data Wrangling & LePaRD Acquisition
Apr 17 – Apr 24
Sprint 3 — Index Generation & Model Training
Apr 24 – May 5
Sprint 4 — Evaluation: Tiers A/B/C
May 5 – May 12
Sprint 5 — Analysis, Ablations & Final Deliverables