Reducing Hallucination in Legal RAG Chatbots: A Comparative Study of Deep Learning Retrieval Architectures

COMPSCI 1090B: Data Science 2: Advanced Topics in Data Science

Harvard University · 2025-2026 Spring

Project Group - #43: GitHub

  1. Alex Oort Alonso
  2. Allan Korir
  3. Phong Le
  4. Brit Biddle

Assigned Group's Teaching Fellow contact:

Zac Sardi-Santos


Milestone 2 Presentation with TF:

Friday, April 10, 2026, at 4:00 PM ET

Presentation slides, Notebook


Milestone 3 Presentation with TF:

Friday, April 24, 2026, at 4:00 PM ET


Milestone 4 — Final Deliverables:

Due: Tuesday, May 12, 2026

Final Report, Code Notebook


MS4 TF meeting:

Saturday, May 9, 2026, at 7:00 PM ET

TF Reviewer Comments & Instructor Notes — Addressed

All Resolved
TF Reviewer: Resolved

Concern:

“It is not clear where the human-annotated hallucination rate comes from. The proposal does not mention how embedding methods will be trained to encode legal text.”

Response:

  • No human annotation required. Hallucination measurement is fully automated across three tiers:
  • Tier A (retrieval ground truth): LePaRD's 4M+ expert-annotated citation pairs serve as the gold standard; evaluation is capped at 10K–50K pairs. Metrics: Hit@k, MRR, NDCG@10.
  • Tier B (LLM-as-judge hallucination measurement): gpt-4o-mini labels each generation FAITHFUL, PARTIAL, or HALLUCINATED against the contexts shown to the generator. 5 ablations × 20,877 queries = 104,385 judgments, at a budget of ~$53. 95% CIs range from ±0.86% to ±1.96%. Pearson r = −0.9624 (r² = 92.6%) between Hit@10 and hallucination rate.
  • Tier C (stratified retrieval ceiling analysis): HEAD/TORSO/TAIL evaluation by gold-cluster citation frequency. The Hit@100 = 0.375 ceiling defines an irreducible hallucination floor of ~56%. Inverted long-tail finding: TAIL Hit@10 exceeds HEAD by 1.66×–2.26× across hub variants (rare precedents have distinctive contexts; the constitutional canon has generic ones). Fine-tuning flips the pattern: HEAD ≥ TORSO > TAIL.
  • Embedding model training: BGE-M3 is fine-tuned with MultipleNegativesRankingLoss on a capped set of 500K–1M LePaRD pairs (lr=1e-5, warmup=10%, batch=32, epochs=3). CLS pooling is enforced per the BAAI config, with a runtime assertion in model_loader.py and pooling flags logged to W&B. BM25 requires no training (k1=1.5, b=0.75). All architectures are evaluated with the generator, Qwen2.5-7B-Instruct, held constant (greedy decoding, local 4× L4); hallucination is judged by gpt-4o-mini.
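
The Tier A metrics named above can be illustrated per query with a short sketch. This is a minimal version under stated assumptions: binary relevance (a document is either in the gold set or not), and hypothetical document IDs such as "op_17"; the project's own evaluation code is not shown in this document and may differ. MRR is the mean of reciprocal_rank over all queries.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if any gold document appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in gold for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first gold document (0.0 if none is retrieved)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gold: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in gold
    )
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query with hypothetical IDs: the single gold document sits at rank 3.
ranked_ids = ["op_17", "op_04", "op_99", "op_23"]
gold_ids = {"op_99"}

print(hit_at_k(ranked_ids, gold_ids, k=1))    # 0.0
print(hit_at_k(ranked_ids, gold_ids, k=10))   # 1.0
print(reciprocal_rank(ranked_ids, gold_ids))  # 1/3
print(ndcg_at_k(ranked_ids, gold_ids))        # 0.5
```

Averaging each function over the capped test queries gives the corpus-level Hit@k, MRR, and NDCG@10.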

Instructor: Resolved

Concern:

“Warning: groups should only consider this project if they have a plan for addressing the concerns regarding human-annotation of hallucinations and training of the embedding model. Without addressing the annotation problem the project will be infeasible.”

Response:

  • Human annotation bottleneck eliminated. LePaRD (ACL 2024) provides 4M+ expert-annotated legal citation pairs as gold-standard retrieval ground truth (2,429,533 pairs, 60.74% of 4M, verified via the eyecite + rapidfuzz semantic bridge). gpt-4o-mini judges hallucination automatically, so no human labeling is needed at any stage.
  • Embedding model training: BGE-M3 is fine-tuned as the main dense retriever with a contrastive objective (MultipleNegativesRankingLoss) on 500K–1M LePaRD citation pairs (lr=1e-5, batch=32, 3 epochs). The resulting legal embeddings serve both the standalone dense retriever and the hybrid BM25+BGE-M3+reranker pipeline.
  • Compute feasibility confirmed and capped. Training: 500K–1M pairs (not 3.2M full LePaRD). Retrieval eval: 10K–50K queries. Generation eval: 1,000 stratified queries (±2.5pp at 95% confidence). Iteration corpus: ~150K opinions (10% subset); full 1.46M for final runs.
  • Infrastructure already operational. 1,465,484 federal appellate opinions downloaded, filtered, and sharded (7.6GB); DVC + S3 versioning active; all src/ modules implemented and tested. At startup the environment asserts transformers.__version__ == "4.39.3", torch.cuda.is_bf16_supported(), and torch.cuda.get_device_capability()[0] >= 8.
  • Sequential model loading prevents VRAM exhaustion on a single 23.7GB L4 (SLURM-allocated). BGE-M3 (~2.27GB) and the DeBERTa NLI reranker (~1.7GB) are the only GPU-resident models, loaded one phase at a time with explicit DataLoader deletion + torch.cuda.empty_cache() + gc.collect() between phases. Qwen2.5-7B-Instruct loads locally (~15GB) for generation; each ablation runs as its own SLURM job, 4-way query-sharded across 4× L4. The gpt-4o-mini API is used only for post-hoc hallucination judging (~$53 total). Memory stats, CUDA stream sync time, and allow_tf32 state are logged per phase.
  • Priority sequencing: LePaRD acquisition → 10–20% subset fast iteration → BM25 + BGE-M3 + Tier A → scale + Tier B/C.
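
The ±2.5pp margin quoted for 1,000 generation-eval queries can be sanity-checked with the normal-approximation (Wald) interval for a proportion. The document does not say which interval method the team used, so the Wald formula and the two example rates below are assumptions for illustration only.

```python
import math


def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation (Wald) CI for a proportion p."""
    return z * math.sqrt(p * (1.0 - p) / n)


n_queries = 1_000  # generation-eval sample size stated in the plan

# Assumed hallucination rates, chosen only to show how the margin moves.
for p in (0.2, 0.5):
    margin_pp = 100 * wald_half_width(p, n_queries)
    print(f"p={p:.1f}: +/-{margin_pp:.2f} pp")
# p=0.2: +/-2.48 pp
# p=0.5: +/-3.10 pp
```

Under this approximation a ±2.5pp margin at n = 1,000 corresponds to an observed rate near 20% (or 80%); the worst case at 50% would be about ±3.1pp.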

Pipeline Status

Environment Bootstrap
CourtListener Download
DVC + S3
CourtListener RAG Prep
LePaRD Acquisition
Index Generation
Model Training
Evaluation Tiers A/B/C
Experiment Tracking W&B

Agile Sprint Plan — Coding Tasks

Sprint 1 — Environment & Data Infrastructure

Mar 24 – Apr 10

complete
  • Environment bootstrap: setup.sh, tests passing, coverage verified
  • CourtListener: 1,465,484 opinions downloaded, 159 shards, 7.6GB
  • DVC + S3 artifact versioning operational
  • All src/ modules implemented and tested
  • SQLite citation index built via src/extract.py
  • ruff + mypy linting configured in pyproject.toml
  • pip-audit CVE scan + CycloneDX SBOM generation in CI

Sprint 2 — Data Wrangling & LePaRD Acquisition

Apr 10 – Apr 17

complete
  • CourtListener RAG-readiness refinement (Cell 2: tokenizer-aware chunking at 1024 subwords)
  • spaCy stripped pipeline setup (exclude=["ner","parser","lemmatizer"]), nlp.max_length set for full appellate opinions
  • Citation-aware chunk splits with metadata per chunk: court_id, year, is_precedential, opinion_id, chunk_index
  • LePaRD acquisition via HuggingFace — Priority 1 (cap 500K–1M pairs)
  • DVC push data shards to S3 cs1090b-hallucinationlegalragchatbots
  • Train/val/test split — src/split.py (500K train / 50K val / 10K–50K test)
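
The chunking step above can be sketched as a sliding window over subword IDs that attaches the listed metadata to every chunk. This is a simplified stand-in, not the project's implementation: it uses a fixed stride rather than citation-aware split points, takes already-tokenized input, and the function name chunk_opinion and the sample values ("op_1", "ca9") are hypothetical. The 1024-subword window comes from the plan; the 128-subword overlap is one of the two settings in the overlap ablation.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass
class Chunk:
    """One retrieval unit plus the per-chunk metadata named in the plan."""
    opinion_id: str
    chunk_index: int
    court_id: str
    year: int
    is_precedential: bool
    token_ids: List[int]


def chunk_opinion(
    token_ids: Sequence[int],
    opinion_id: str,
    court_id: str,
    year: int,
    is_precedential: bool,
    max_tokens: int = 1024,
    overlap: int = 128,
) -> Iterator[Chunk]:
    """Yield overlapping fixed-size windows over an opinion's subword IDs."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    stride = max_tokens - overlap
    start, index = 0, 0
    while start < len(token_ids):
        window = list(token_ids[start:start + max_tokens])
        yield Chunk(opinion_id, index, court_id, year, is_precedential, window)
        if start + max_tokens >= len(token_ids):
            break  # this window already reached the end of the opinion
        start += stride
        index += 1


# Toy opinion of 2,500 subword IDs (stand-in for real tokenizer output).
chunks = list(chunk_opinion(list(range(2500)), "op_1", "ca9", 1999, True))
print([len(c.token_ids) for c in chunks])  # [1024, 1024, 708]
```

Consecutive chunks share their boundary tokens, so a citation that straddles a window edge still appears whole in at least one chunk as long as it is shorter than the overlap.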

Sprint 3 — Index Generation & Model Training

Apr 17 – Apr 24

complete
  • BM25 (bm25s) index over pre-chunked payloads from Stage 3
  • BGE-M3 FAISS Flat index for validation (CLS pooling, bfloat16)
  • BGE-M3 fine-tuning: MultipleNegativesRankingLoss, lr=1e-5, batch=32, epochs=3
  • Hybrid: BM25+BGE-M3+bge-reranker-v2-m3 CrossEncoder (top-50→top-10)
  • FAISS IVF for full-corpus: index.train() on 100K subset, assert index.is_trained
  • Log Hit@k vs nprobe on validation set to justify IVF parameters; log nprobe/nlist to W&B
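
For readers unfamiliar with MultipleNegativesRankingLoss, the objective behind the BGE-M3 fine-tuning can be written out in a few lines: each query's paired passage is the positive, and every other passage in the batch acts as a negative. The sketch below is a pure-Python illustration of that idea, not the sentence-transformers implementation; the function name in_batch_negatives_loss is hypothetical, and scale=20.0 is assumed to mirror the library's usual default.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(
    queries: List[Vector], positives: List[Vector], scale: float = 20.0
) -> float:
    """Mean cross-entropy where positives[i] is the target for queries[i]
    and all other positives in the batch serve as negatives."""
    losses = []
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, passage) for passage in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings: aligned pairs give a near-zero loss, swapped pairs do not.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(in_batch_negatives_loss(queries, aligned))
print(in_batch_negatives_loss(queries, swapped))
```

With batch=32, each citing context is contrasted against 31 in-batch negatives per step, which is why no explicit negative sampling is needed for this stage.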

Sprint 4 — Evaluation: Tiers A/B/C

Apr 24 – May 5

complete
  • Tier A: LePaRD Hit@k, MRR, NDCG@10 on 10K–50K capped test set
  • Tier B: gpt-4o-mini LLM-as-judge — FAITHFUL/PARTIAL/HALLUCINATED per generation vs shown contexts (~$53, 104,385 judgments)
  • Hard-negative mining: 7,442 train + 391 val queries, 7 negatives each from RRF top-100
  • Tier C: Stratified evaluation HEAD/TORSO/TAIL by gold-cluster citation frequency — inverted long-tail finding
  • Fine-tune bge-reranker-v2-m3 on legal hard negatives: lr=2e-5, effective batch=32, epochs=2, 4× L4 DDP, 22 GPU-hours
  • RAG generation: Qwen2.5-7B-Instruct, 5 ablations × 20,877 queries = 104,385 generations, 4-way query-sharded across 4× L4
  • W&B experiment tracking: VRAM, GPU hours, metrics per phase
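
The hard-negative mining step above draws negatives from an RRF top-100 list. A minimal sketch of reciprocal rank fusion followed by negative selection is shown below. The constant k=60 is the conventional RRF default and is an assumption here, as is taking the highest-ranked non-gold documents as negatives (the plan states only that 7 negatives come from the RRF top-100). Function names and document IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: score(d) = sum over rankers of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def mine_hard_negatives(
    fused: Sequence[str], gold: Set[str], n_neg: int = 7, pool: int = 100
) -> List[str]:
    """Take the top-ranked non-gold documents from the fused top-`pool` list."""
    return [doc_id for doc_id in fused[:pool] if doc_id not in gold][:n_neg]


# Toy rankings from two retrievers (stand-ins for BM25 and BGE-M3 output).
bm25_ranking = ["a", "b", "c", "d"]
dense_ranking = ["c", "a", "e", "b"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)                                          # ['a', 'c', 'b', 'e', 'd']
print(mine_hard_negatives(fused, {"c"}, n_neg=2))     # ['a', 'b']
```

Documents that both retrievers rank highly but that are not the gold citation make useful reranker negatives because they are plausible yet wrong.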

Sprint 5 — Analysis, Ablations & Final Deliverables

May 5 – May 12

complete
  • Paired bootstrap significance tests (B=10,000), Cohen's d, BH-FDR
  • Ablation: BGE-M3 vs Hybrid, w/o reranker, k∈{1,5,10,20}
  • Ablation: training size 100K vs 500K vs 1M pairs
  • Ablation: chunk overlap 128 vs 64 subwords on 10% subset
  • Ablation: Stage 3 normalization on/off
  • Ablation: Contradiction vs Neutral vs combined metric sensitivity
  • W&B: 45 offline runs, lineage DAG (prep→bm25→bge-m3→rrf→reranker→rag→judge), 191.64GB DVC/S3
  • Final report: 2000–2500 words
  • Video presentation: 6 minutes
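
The significance-testing items in this sprint can be sketched with the standard library alone. The example below shows a paired bootstrap confidence interval for the difference between two systems' per-query hallucination indicators, a paired Cohen's d, and the Benjamini-Hochberg procedure. It is a sketch under stated assumptions: the percentile interval, the d_z variant of Cohen's d (mean difference over the standard deviation of differences), q=0.05, and the toy data are all illustrative choices that the document does not specify.

```python
import random
from statistics import stdev
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    a: Sequence[float], b: Sequence[float],
    n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0,
) -> Tuple[float, float, float]:
    """Observed mean paired difference and its percentile bootstrap CI."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Resample query-level differences with replacement, B times.
    boot = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = boot[int(alpha / 2 * n_boot)]
    hi = boot[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(diffs) / n, lo, hi


def cohens_d_paired(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired effect size d_z = mean(diff) / sd(diff)."""
    diffs = [x - y for x, y in zip(a, b)]
    return (sum(diffs) / len(diffs)) / stdev(diffs)


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Return a reject/keep flag per hypothesis, controlling FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= cutoff
    return rejected


# Toy per-query indicators (1 = hallucinated) for two hypothetical systems.
system_a = [1] * 60 + [0] * 140
system_b = [1] * 40 + [0] * 160

observed, ci_lo, ci_hi = paired_bootstrap_ci(system_a, system_b)
print(observed, ci_lo, ci_hi)
print(cohens_d_paired(system_a, system_b))
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.20]))  # [True, False, False, False]
```

Because the same queries are scored under every ablation, resampling the paired differences (rather than each system independently) keeps query difficulty from inflating the variance.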

Course Milestones

Milestone 1: Group Formation & Project Selection

Weight: 2%

Due: March 24, 2026

Select top 5 project choices. Groups of 3–5 students. Staff assigns groups March 27.

Milestone 2: Data Wrangling & Project Redefinition

Weight: 10%

Due: April 10, 2026

Data acquisition, preprocessing, missing data, imbalances, scaling. 10-min presentation.

Milestone 3: EDA, Initial Modeling & Pipeline Development

Weight: 20%

Due: April 24, 2026

EDA, baseline model, training/testing pipeline, evaluation metrics. 10-min presentation.

Milestone 4: Final Modeling & Deliverables

Weight: 68%

Due: May 12, 2026

2000–2500 word report, 6-min video, well-commented Python notebook.

Sprint Timeline

  1. Mar 24 – Apr 10

    Sprint 1 — Environment & Data Infrastructure

  2. Apr 10 – Apr 17

    Sprint 2 — Data Wrangling & LePaRD Acquisition

  3. Apr 17 – Apr 24

    Sprint 3 — Index Generation & Model Training

  4. Apr 24 – May 5

    Sprint 4 — Evaluation: Tiers A/B/C

  5. May 5 – May 12

    Sprint 5 — Analysis, Ablations & Final Deliverables