Reducing Hallucination in Legal RAG Chatbots
COMPSCI 1090B: Data Science 2: Advanced Topics in Data Science
Harvard University · [email protected] · 2025-2026 Spring
Project Group #43 · GitHub
- PHONG LE
- ...
- ...
TF Reviewer Comments & Instructor Notes — Addressed
All Resolved
Concern:
“It is not clear where the human-annotated hallucination rate comes from. The proposal does not mention how embedding methods will be trained to encode legal text.”
Response:
- No human annotation required. Hallucination measurement is fully automated across three tiers:
  - Tier A — Retrieval ground truth: LePaRD's 4M+ expert-annotated citation pairs serve as gold-standard retrieval ground truth; evaluation is capped at 10K–50K pairs. Metrics: Recall@k, MRR, NDCG@10.
  - Tier B — NLI hallucination measurement: MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli classifies each atomic claim independently against individual retrieved chunks, running fully locally with no API calls. The 512-token limit is handled via repo-certified overflow windowing (return_overflowing_tokens=True, max_length=512, stride=64, use_fast=False); window-level logits are aggregated per chunk. Contradiction rate is normalized by per-query claim count and per 1K tokens. Zero-claim responses are excluded and reported separately. NLI confidence scores are treated as diagnostic indicators only.
  - Tier C — Citation existence verification: a local SQLite index (check_same_thread=False; read-only) provides O(1) citation lookup. NULL → Hard Citation Hallucination logged, NLI skipped. Found with no local NLI support → CitationFound_NoLocalSupport logged. A citation hash (opinion_id + anchor span) is logged per lookup. Windowing: (1) hybrid reranker; (2) keyword/regex; (3) sliding-window fallback.
- Embedding model training: BGE-M3 is fine-tuned with MultipleNegativesRankingLoss on 500K–1M capped LePaRD pairs (lr=1e-5, warmup=10%, batch=32, epochs=3). CLS pooling is enforced per the BAAI config, with a runtime assertion in model_loader.py; pooling flags are logged to W&B. Legal-BERT is an optional domain reference (lr=2e-5, batch=32, epochs=3). BM25 requires no training (k1=1.5, b=0.75). All architectures are evaluated with mistralai/Mistral-7B-Instruct-v0.2 held constant (greedy decoding, chat template enforced).
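The Tier B window-and-aggregate step can be sketched as follows. Hedges: taking the max contradiction probability over windows is an illustrative aggregation choice (the proposal says only that window-level logits are aggregated per chunk), and the sketch assumes a fast tokenizer, where return_overflowing_tokens yields one encoding per window; the pinned use_fast=False slow tokenizer handles overflow differently.

```python
import torch

def contradiction_prob(tok, model, claim: str, chunk: str) -> float:
    """Score one atomic claim against one retrieved chunk, windowing the
    chunk when the (premise, hypothesis) pair exceeds 512 tokens."""
    enc = tok(
        [chunk], [claim],                # premise = retrieved chunk, hypothesis = claim
        truncation="only_first",         # window the chunk; never truncate the claim
        max_length=512,
        stride=64,
        return_overflowing_tokens=True,
        return_tensors="pt",
    )
    enc.pop("overflow_to_sample_mapping", None)  # bookkeeping field, not a model input
    with torch.no_grad():
        logits = model(**enc).logits     # shape: (n_windows, 3)
    probs = logits.softmax(dim=-1)
    contra = int(model.config.label2id.get("contradiction", 2))
    return probs[:, contra].max().item()  # max over windows (illustrative aggregation)

# Usage (downloads the checkpoint on first run):
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
# name = "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"
# tok = AutoTokenizer.from_pretrained(name)
# model = AutoModelForSequenceClassification.from_pretrained(name).eval()
```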
Concern:
“Warning: groups should only consider this project if they have a plan for addressing the concerns regarding human-annotation of hallucinations and training of the embedding model. Without addressing the annotation problem the project will be infeasible.”
Response:
- Human annotation bottleneck eliminated. LePaRD (ACL 2024) provides 4M+ expert-annotated legal citation pairs as gold-standard retrieval ground truth. DeBERTa-v3 NLI runs fully locally on a cluster GPU (bfloat16, ~3GB VRAM); no API calls, no human reviewers, no annotation bottleneck.
- Compute feasibility confirmed and capped. Training: 500K–1M pairs (not the 3.2M full LePaRD). Retrieval eval: 10K–50K queries. Generation eval: 1,000 stratified queries (±2.5pp at 95% CI). Iteration corpus: ~150K opinions (10% subset) for fast iteration; the full 1.46M for final runs.
- Infrastructure already operational. 1,465,484 federal appellate opinions downloaded, filtered, and sharded (7.6GB); DVC + S3 versioning active; all src/ modules implemented and tested. The environment asserts transformers.__version__ == "4.39.3", torch.cuda.is_bf16_supported(), and get_device_capability()[0] >= 8 at startup.
- Sequential model loading prevents VRAM exhaustion. Single 23.7GB L4 (SLURM-allocated). BGE-M3 (~2.27GB), the reranker (~2GB), Mistral (~14–15GB + KV cache), and DeBERTa (~3GB) are loaded one phase at a time; explicit DataLoader deletion + torch.cuda.empty_cache() + gc.collect() between phases; memory stats, CUDA stream sync time, and allow_tf32 state are logged per phase.
- Priority sequencing: LePaRD acquisition → 10–20% subset fast iteration → BM25 + BGE-M3 + Tier A → scale + Tiers B/C.
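The startup environment assertions can be collected into a minimal guard function, run before any model is loaded (assumption: the function name and error messages are illustrative; the pinned values are from the proposal):

```python
import torch
import transformers

def assert_environment() -> None:
    """Fail fast if the cluster environment drifts from the pinned setup."""
    assert transformers.__version__ == "4.39.3", transformers.__version__
    assert torch.cuda.is_available(), "CUDA device required"
    assert torch.cuda.is_bf16_supported(), "bfloat16 support required"
    major, _minor = torch.cuda.get_device_capability()
    assert major >= 8, f"Ampere-or-newer GPU required, got capability {major}.x"
```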
Pipeline Status
Agile Sprint Plan — Coding Tasks
Sprint 1 — Environment & Data Infrastructure
Mar 24 – Apr 10
- Environment bootstrap: setup.sh, tests passing, coverage verified
- CourtListener: 1,465,484 opinions downloaded, 159 shards, 7.6GB
- DVC + S3 artifact versioning operational
- All src/ modules implemented and tested
- SQLite citation index built via src/extract.py
- ruff + mypy linting configured in pyproject.toml
- pip-audit CVE scan + CycloneDX SBOM generation in CI
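The SQLite citation index built in this sprint can be sketched minimally as follows (assumptions: the citations table schema and helper names are illustrative, not the actual src/extract.py implementation; the read-only connection flags match the Tier C description above):

```python
import sqlite3

def build_index(path: str, rows) -> None:
    """Build the citation index; rows are (citation_string, opinion_id) pairs."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS citations "
        "(citation TEXT PRIMARY KEY, opinion_id INTEGER NOT NULL)"
    )
    con.executemany("INSERT OR IGNORE INTO citations VALUES (?, ?)", rows)
    con.commit()
    con.close()

def lookup(path: str, citation: str):
    """Read-only lookup; a NULL result marks a Hard Citation Hallucination."""
    con = sqlite3.connect(f"file:{path}?mode=ro", uri=True, check_same_thread=False)
    try:
        row = con.execute(
            "SELECT opinion_id FROM citations WHERE citation = ?", (citation,)
        ).fetchone()
        return row[0] if row else None
    finally:
        con.close()
```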
Sprint 2 — Data Wrangling & LePaRD Acquisition
Apr 10 – Apr 17
- CourtListener RAG-readiness refinement (Cell 2 — tokenizer-aware chunking, 1024 subwords)
- spaCy stripped pipeline setup (exclude=["ner","parser","lemmatizer"]), nlp.max_length set for full appellate opinions
- Citation-aware chunk splits with metadata per chunk: court_id, year, is_precedential, opinion_id, chunk_index
- LePaRD acquisition via HuggingFace — Priority 1 (cap 500K–1M pairs)
- DVC push of data shards to S3 bucket cs1090b-hallucinationlegalragchatbots
- Train/val/test split — src/split.py (500K train / 50K val / 10K–50K test)
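The tokenizer-aware chunking above can be sketched as a sliding window over subword ids (assumptions: the helper name and decode-based reassembly are illustrative; the real Cell 2 chunker also attaches the per-chunk metadata listed above, and the 64-subword overlap default is one arm of the Sprint 5 ablation):

```python
def chunk_by_subwords(tok, text: str, max_tokens: int = 1024, overlap: int = 64):
    """Split an opinion into chunks of at most max_tokens subwords,
    with `overlap` subwords shared between consecutive chunks."""
    ids = tok(text, add_special_tokens=False)["input_ids"]
    if not ids:
        return []
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(tok.decode(ids[start:start + max_tokens]))
        if start + max_tokens >= len(ids):  # last window already covers the tail
            break
    return chunks
```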
Sprint 3 — Index Generation & Model Training
Apr 17 – Apr 24
- BM25 (bm25s) index over pre-chunked payloads from Stage 3
- BGE-M3 FAISS Flat index for validation (CLS pooling, bfloat16)
- BGE-M3 fine-tuning: MultipleNegativesRankingLoss, lr=1e-5, batch=32, epochs=3
- Hybrid: BM25+BGE-M3+bge-reranker-v2-m3 CrossEncoder (top-50→top-10)
- FAISS IVF for full-corpus: index.train() on 100K subset, assert index.is_trained
- Log recall@k vs nprobe on validation set to justify IVF parameters; log nprobe/nlist to W&B
- Legal-BERT bi-encoder: 512-subword chunks, MultipleNegativesRankingLoss, lr=2e-5, warmup=10%, batch=32, epochs=3 (optional domain-reference)
Sprint 4 — Evaluation: Tiers A/B/C
Apr 24 – May 5
- Tier A: LePaRD Recall@k, MRR, NDCG@10 on 10K–50K capped test set
- Tier B: DeBERTa-v3 NLI classifier — 1,000 stratified queries, contradiction rate
- Log window count distribution per chunk; log window index per label
- Tier C: SQLite citation lookup — Hard Citation Hallucination + CitationFound_NoLocalSupport
- Log citation anchor offset on sliding-window fallback
- Sequential loading: BGE-M3 → Reranker → Mistral-7B → NLI → SQLite
- W&B experiment tracking: VRAM, GPU hours, metrics per phase
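The sequential load → evaluate → free pattern above can be sketched as a context manager (assumption: the wrapper and its stdout logging are illustrative; the actual pipeline logs memory stats to W&B):

```python
import gc
from contextlib import contextmanager

import torch

@contextmanager
def model_phase(load_fn, name: str = "phase"):
    """Load one model, yield it, then free GPU memory before the next phase."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    model = load_fn()
    try:
        yield model
    finally:
        del model                        # drop the reference before collecting
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            peak = torch.cuda.max_memory_allocated() / 2**30
            print(f"{name}: peak VRAM {peak:.2f} GiB")

# Usage: with model_phase(load_bge_m3, "retrieval") as m: ...
# then the reranker, Mistral-7B, and the NLI model, one phase at a time.
```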
Sprint 5 — Analysis, Ablations & Final Deliverables
May 5 – May 12
- Paired bootstrap significance tests (B=10,000), Cohen's d, BH-FDR
- Ablation: BGE-M3 vs Hybrid, w/o reranker, Legal-BERT, k∈{1,5,10,20}
- Ablation: training size 100K vs 500K vs 1M pairs
- Ablation: chunk overlap 128 vs 64 subwords on 10% subset
- Ablation: Stage 3 normalization on/off
- Ablation: Contradiction vs Neutral vs combined metric sensitivity
- wandb_logger.py: full per-phase VRAM, pooling flags, score distributions
- Final report: 2000–2500 words
- Video presentation: 6 minutes
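The paired bootstrap test from the list above can be sketched as follows (assumptions: a two-sided p-value on the mean per-query score difference, resampling mean-centered differences under the null; Cohen's d and BH-FDR would be computed alongside, as listed):

```python
import numpy as np

def paired_bootstrap_p(a: np.ndarray, b: np.ndarray,
                       B: int = 10_000, seed: int = 0) -> float:
    """Two-sided bootstrap p-value that systems a and b differ,
    given paired per-query scores of equal length."""
    rng = np.random.default_rng(seed)
    diff = a - b
    observed = diff.mean()
    centered = diff - observed                 # enforce the null (zero mean)
    idx = rng.integers(0, len(diff), size=(B, len(diff)))
    boot = centered[idx].mean(axis=1)          # B resampled mean differences
    return float((np.abs(boot) >= abs(observed)).mean())
```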
Course Milestones
Milestone 1: Group Formation & Project Selection
Weight: 2% · Due: March 24, 2025
Select top 5 project choices. Groups of 3–5 students. Staff assigns groups March 27.
Milestone 2: Data Wrangling & Project Redefinition
Weight: 10% · Due: April 10, 2025
Data acquisition, preprocessing, missing data, imbalances, scaling. 10-min presentation.
Milestone 3: EDA, Initial Modeling & Pipeline Development
Weight: 20% · Due: April 24, 2025
EDA, baseline model, training/testing pipeline, evaluation metrics. 10-min presentation.
Milestone 4: Final Modeling & Deliverables
Weight: 68% · Due: May 12, 2025
2000–2500 word report, 6-min video, well-commented Python notebook.