Reducing Hallucination in Legal RAG Chatbots

COMPSCI 1090B: Data Science 2: Advanced Topics in Data Science

Harvard University · [email protected] · Spring 2025

Project Group #43 · GitHub

  1. PHONG LE
  2. ...
  3. ...

TF Reviewer Comments & Instructor Notes — Addressed

All Resolved

TF Reviewer: Resolved

Concern:

“It is not clear where the human-annotated hallucination rate comes from. The proposal does not mention how embedding methods will be trained to encode legal text.”

Response:

  • No human annotation required. Hallucination measurement is fully automated across three tiers:
  • Tier A — Retrieval ground truth: LePaRD 4M+ expert-annotated citation pairs serve as gold-standard retrieval ground truth; evaluation capped at 10K–50K pairs. Metrics: Recall@k, MRR, NDCG@10.
  • Tier B — NLI hallucination measurement: MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli classifies each atomic claim independently against individual retrieved chunks, running fully locally with no API calls. 512-token limit handled via repo-certified overflow windowing (return_overflowing_tokens=True, max_length=512, stride=64, use_fast=False); window-level logits aggregated per chunk. Contradiction rate normalized by per-query claim count and per 1K tokens. Zero-claim responses excluded and reported separately. NLI confidence scores treated as diagnostic indicators only.
  • Tier C — Citation existence verification: Local SQLite index (check_same_thread=False; read-only) provides O(1) citation lookup. NULL → Hard Citation Hallucination logged, NLI skipped. Found with no local NLI support → CitationFound_NoLocalSupport logged. Citation hash (opinion_id + anchor span) logged per lookup. Windowing: (1) Hybrid reranker; (2) keyword/regex; (3) sliding-window fallback.
  • Embedding model training: BGE-M3 fine-tuned with MultipleNegativesRankingLoss on 500K–1M capped LePaRD pairs (lr=1e-5, warmup=10%, batch=32, epochs=3). CLS pooling enforced per BAAI config; runtime assertion in model_loader.py; pooling flags logged to W&B. Legal-BERT optional domain-reference (lr=2e-5, batch=32, epochs=3). BM25 requires no training (k1=1.5, b=0.75). All architectures evaluated with mistralai/Mistral-7B-Instruct-v0.2 held constant (greedy decoding, chat template enforced).
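The Tier B aggregation step above can be sketched as follows. This is a minimal illustration, assuming label order [entailment, neutral, contradiction] and max-pooling of window-level probabilities per chunk; the actual logits come from the DeBERTa-v3 NLI model after overflow windowing, and the helper names are hypothetical:

```python
import math

def softmax(logits):
    """Convert one window's NLI logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def aggregate_windows(window_logits):
    """Max-pool entailment/contradiction probability across the overflow
    windows of one retrieved chunk (label order assumed
    [entailment, neutral, contradiction])."""
    probs = [softmax(l) for l in window_logits]
    return {
        "entailment": max(p[0] for p in probs),
        "contradiction": max(p[2] for p in probs),
    }

def contradiction_rate(claim_verdicts, threshold=0.5):
    """Fraction of atomic claims flagged as contradicted; zero-claim
    responses are excluded upstream and reported separately."""
    if not claim_verdicts:
        return None
    flagged = sum(1 for v in claim_verdicts if v["contradiction"] >= threshold)
    return flagged / len(claim_verdicts)
```

The per-query rate would then be further normalized per 1K tokens as described above.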
Instructor: Resolved

Concern:

“Warning: groups should only consider this project if they have a plan for addressing the concerns regarding human-annotation of hallucinations and training of the embedding model. Without addressing the annotation problem the project will be infeasible.”

Response:

  • Human annotation bottleneck eliminated. LePaRD (ACL 2024) provides 4M+ expert-annotated legal citation pairs as gold-standard retrieval ground truth. DeBERTa-v3 NLI runs fully locally on cluster GPU (bfloat16, ~3GB VRAM); no API calls, no human reviewers, no annotation bottleneck.
  • Compute feasibility confirmed and capped. Training: 500K–1M pairs (not 3.2M full LePaRD). Retrieval eval: 10K–50K queries. Generation eval: 1,000 stratified queries (±2.5pp at 95% CI). Iteration corpus: ~150K opinions (10% subset) for fast iteration; full 1.46M for final runs.
  • Infrastructure already operational. 1,465,484 federal appellate opinions downloaded, filtered, sharded (7.6GB); DVC + S3 versioning active; all src/ modules implemented and tested. Environment asserts transformers.__version__ == "4.39.3", torch.cuda.is_bf16_supported(), and get_device_capability()[0] >= 8 at startup.
  • Sequential model loading prevents VRAM exhaustion. Single 23.7GB L4 (SLURM-allocated). BGE-M3 (~2.27GB), reranker (~2GB), Mistral (~14–15GB + KV cache), DeBERTa (~3GB) loaded one phase at a time; explicit DataLoader deletion + torch.cuda.empty_cache() + gc.collect() between phases; memory stats + CUDA stream sync time + allow_tf32 state logged per phase.
  • Priority sequencing: LePaRD acquisition → 10–20% subset fast iteration → BM25 + BGE-M3 + Tier A → scale + Tier B/C.
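The sequential-loading pattern described above can be sketched as a small context manager: one model per phase, with explicit teardown between phases. The `load_fn` callables are placeholders for the real loaders (BGE-M3, reranker, Mistral-7B, DeBERTa NLI):

```python
import contextlib
import gc

@contextlib.contextmanager
def model_phase(name, load_fn):
    """Load exactly one model for the duration of a phase, then free it."""
    model = load_fn()
    try:
        yield model
    finally:
        del model
        gc.collect()
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()  # release cached CUDA blocks
        except ImportError:
            pass  # CPU-only environment

# Usage sketch: each evaluation phase opens exactly one model.
# with model_phase("retriever", load_bge_m3) as retriever:
#     run_retrieval(retriever)
```

Memory stats would be snapshotted and logged to W&B at each phase boundary, as noted above.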

Pipeline Status

Environment Bootstrap
CourtListener Download
DVC + S3
CourtListener RAG Prep
LePaRD Acquisition
Index Generation
Model Training
Evaluation Tiers A/B/C
Experiment Tracking W&B

Agile Sprint Plan — Coding Tasks

Sprint 1 — Environment & Data Infrastructure

Mar 24 – Apr 10

Status: complete
  • Environment bootstrap: setup.sh, tests passing, coverage verified
  • CourtListener: 1,465,484 opinions downloaded, 159 shards, 7.6GB
  • DVC + S3 artifact versioning operational
  • All src/ modules implemented and tested
  • SQLite citation index built via src/extract.py
  • ruff + mypy linting configured in pyproject.toml
  • pip-audit CVE scan + CycloneDX SBOM generation in CI
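The Tier C citation-existence check against the SQLite index can be sketched as below. Table and column names are illustrative; the real schema is built by src/extract.py:

```python
import sqlite3

def build_index(conn, citations):
    """Index (citation string, opinion_id) pairs; the PRIMARY KEY gives
    O(1)-style lookups via the B-tree on the citation string."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citations "
        "(cite TEXT PRIMARY KEY, opinion_id INTEGER)"
    )
    conn.executemany("INSERT OR IGNORE INTO citations VALUES (?, ?)", citations)
    conn.commit()

def check_citation(conn, cite):
    """Return opinion_id, or None -> logged as Hard Citation Hallucination."""
    row = conn.execute(
        "SELECT opinion_id FROM citations WHERE cite = ?", (cite,)
    ).fetchone()
    return row[0] if row else None

# check_same_thread=False matches the proposal's read-only usage pattern.
conn = sqlite3.connect(":memory:", check_same_thread=False)
build_index(conn, [("410 U.S. 113", 101), ("347 U.S. 483", 102)])
```

A found citation would then proceed to the NLI support check; a `None` short-circuits it.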

Sprint 2 — Data Wrangling & LePaRD Acquisition

Apr 10 – Apr 17

Status: in progress
  • CourtListener RAG-readiness refinement (Cell 2: tokenizer-aware chunking, 1024-subword windows)
  • spaCy stripped pipeline setup (exclude=["ner","parser","lemmatizer"]), nlp.max_length set for full appellate opinions
  • Citation-aware chunk splits with metadata per chunk: court_id, year, is_precedential, opinion_id, chunk_index
  • LePaRD acquisition via HuggingFace — Priority 1 (cap 500K–1M pairs)
  • DVC push of data shards to S3 bucket cs1090b-hallucinationlegalragchatbots
  • Train/val/test split — src/split.py (500K train / 50K val / 10K–50K test)
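The tokenizer-aware chunking step can be sketched as a sliding window over subword ids. `encode`/`decode` stand in for a real HF tokenizer's methods, and the function name is hypothetical; the project's actual windows are 1024 subwords with the overlap ablated in Sprint 5:

```python
def chunk_by_tokens(text, encode, decode, max_tokens=1024, overlap=64):
    """Split text into windows of at most max_tokens subwords, carrying
    `overlap` subwords of context across window boundaries."""
    ids = encode(text)
    chunks, start = [], 0
    step = max_tokens - overlap
    while start < len(ids):
        window = ids[start:start + max_tokens]
        chunks.append(decode(window))
        if start + max_tokens >= len(ids):
            break
        start += step
    return chunks
```

Per-chunk metadata (court_id, year, is_precedential, opinion_id, chunk_index) would be attached alongside each decoded window.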

Sprint 3 — Index Generation & Model Training

Apr 17 – Apr 24

Status: pending
  • BM25 (bm25s) index over pre-chunked payloads from Stage 3
  • BGE-M3 FAISS Flat index for validation (CLS pooling, bfloat16)
  • BGE-M3 fine-tuning: MultipleNegativesRankingLoss, lr=1e-5, batch=32, epochs=3
  • Hybrid: BM25+BGE-M3+bge-reranker-v2-m3 CrossEncoder (top-50→top-10)
  • FAISS IVF for full-corpus: index.train() on 100K subset, assert index.is_trained
  • Log recall@k vs nprobe on validation set to justify IVF parameters; log nprobe/nlist to W&B
  • Legal-BERT bi-encoder: 512-subword chunks, MultipleNegativesRankingLoss, lr=2e-5, warmup=10%, batch=32, epochs=3 (optional domain-reference)
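The sparse side of the hybrid retriever uses the bm25s library; the pure-Python sketch below only illustrates the role of the k1=1.5 and b=0.75 parameters named above (term-frequency saturation and length normalization, respectively):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with the BM25 formula
    (Lucene-style idf with the +1 inside the log)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            s += idf * norm
        scores.append(s)
    return scores
```

In the hybrid setup these scores would be fused with BGE-M3 dense scores before the top-50 set is passed to the cross-encoder reranker.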

Sprint 4 — Evaluation: Tiers A/B/C

Apr 24 – May 5

Status: pending
  • Tier A: LePaRD Recall@k, MRR, NDCG@10 on 10K–50K capped test set
  • Tier B: DeBERTa-v3 NLI classifier — 1,000 stratified queries, contradiction rate
  • Log window count distribution per chunk; log window index per label
  • Tier C: SQLite citation lookup — Hard Citation Hallucination + CitationFound_NoLocalSupport
  • Log citation anchor offset on sliding-window fallback
  • Sequential loading: BGE-M3 → Reranker → Mistral-7B → NLI → SQLite
  • W&B experiment tracking: VRAM, GPU hours, metrics per phase
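The Tier A metrics can be sketched as below, assuming a single gold passage per LePaRD query (with binary relevance, NDCG reduces to a discounted reciprocal-rank form; helper names are illustrative):

```python
import math

def recall_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold passage appears in the top k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, gold_id):
    """1/rank of the gold passage; averaged over queries this gives MRR."""
    for i, pid in enumerate(ranked_ids, start=1):
        if pid == gold_id:
            return 1.0 / i
    return 0.0

def ndcg_at_10(ranked_ids, gold_id):
    """Binary relevance: DCG = 1/log2(rank+1), ideal DCG = 1."""
    for i, pid in enumerate(ranked_ids[:10], start=1):
        if pid == gold_id:
            return 1.0 / math.log2(i + 1)
    return 0.0
```

Each metric would be averaged over the 10K–50K capped test queries and logged per phase to W&B.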

Sprint 5 — Analysis, Ablations & Final Deliverables

May 5 – May 12

Status: pending
  • Paired bootstrap significance tests (B=10,000), Cohen's d effect sizes, Benjamini–Hochberg FDR correction
  • Ablation: BGE-M3 vs Hybrid, w/o reranker, Legal-BERT, k∈{1,5,10,20}
  • Ablation: training size 100K vs 500K vs 1M pairs
  • Ablation: chunk overlap 128 vs 64 subwords on 10% subset
  • Ablation: Stage 3 normalization on/off
  • Ablation: Contradiction vs Neutral vs combined metric sensitivity
  • wandb_logger.py: full per-phase VRAM, pooling flags, score distributions
  • Final report: 2000–2500 words
  • Video presentation: 6 minutes
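The paired bootstrap test can be sketched as follows, assuming per-query metric values from two systems on the same stratified query set (function name hypothetical; the plan uses B=10,000):

```python
import random

def paired_bootstrap_p(per_query_a, per_query_b, B=10_000, seed=0):
    """One-sided p-value for 'system A beats system B': the fraction of
    resampled per-query deltas whose mean fails to favor A."""
    rng = random.Random(seed)
    deltas = [a - b for a, b in zip(per_query_a, per_query_b)]
    n = len(deltas)
    failures = 0
    for _ in range(B):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:  # resampled mean does not favor A
            failures += 1
    return failures / B
```

The resulting p-values across the ablation grid would then be corrected with Benjamini–Hochberg before reporting.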

Course Milestones

Milestone 1: Group Formation & Project Selection

Weight: 2%

Due: March 24, 2025

Select top 5 project choices. Groups of 3–5 students. Staff assigns groups March 27.

Milestone 2: Data Wrangling & Project Redefinition

Weight: 10%

Due: April 10, 2025

Data acquisition, preprocessing, missing data, imbalances, scaling. 10-min presentation.

Milestone 3: EDA, Initial Modeling & Pipeline Development

Weight: 20%

Due: April 24, 2025

EDA, baseline model, training/testing pipeline, evaluation metrics. 10-min presentation.

Milestone 4: Final Modeling & Deliverables

Weight: 68%

Due: May 12, 2025

2000–2500 word report, 6-min video, well-commented Python notebook.

Sprint Timeline

  1. Mar 24 – Apr 10

    Sprint 1 — Environment & Data Infrastructure

  2. Apr 10 – Apr 17

    Sprint 2 — Data Wrangling & LePaRD Acquisition

  3. Apr 17 – Apr 24

    Sprint 3 — Index Generation & Model Training

  4. Apr 24 – May 5

    Sprint 4 — Evaluation: Tiers A/B/C

  5. May 5 – May 12

    Sprint 5 — Analysis, Ablations & Final Deliverables