Datasets

Two core datasets, both publicly available; no private data is used.

CourtListener Federal Appellate

✅ Complete

Size: 1,465,484 opinions → 7,813,273 chunks (1,024 subword tokens per chunk, 128-token overlap)

License: CC BY-ND 4.0

Role: Retrieval corpus — 7,813,273 chunks across 1,360,665 unique clusters, 13 federal circuits
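
The 1,024-token window with a 128-token overlap noted under Size can be illustrated with a small sliding-window sketch. This is an illustration only, not the project's chunking code: the function name `chunk_tokens`, the use of integer token ids, and the choice to keep a short final chunk are all assumptions, and the source does not name the tokenizer.

```python
from typing import List, Sequence

CHUNK_SIZE = 1024  # subword tokens per chunk (from the Size line)
OVERLAP = 128      # tokens shared by consecutive chunks (from the Size line)


def chunk_tokens(
    tokens: Sequence[int],
    size: int = CHUNK_SIZE,
    overlap: int = OVERLAP,
) -> List[List[int]]:
    """Split a token-id sequence into overlapping fixed-size windows.

    Hypothetical helper: the last window may be shorter than `size`.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap  # 896 with the defaults
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
        start += stride
    return chunks


if __name__ == "__main__":
    demo = chunk_tokens(list(range(2000)))
    print([len(c) for c in demo])
```

With these defaults each new chunk starts 896 tokens after the previous one, so the last 128 tokens of one chunk reappear as the first 128 tokens of the next.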

LePaRD (ACL 2024)

✅ Complete

Size: 4M pairs → 2,429,533 verified pairs (eyecite + rapidfuzz bridge, 60.74% retained) → 20,877 unique test queries

License: Open research (Mahari et al., ACL 2024)

Role: Hard-negative mining + retrieval evaluation
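
As a quick consistency check on the Size line, the 60.74% figure can be reproduced from the two pair counts. The sketch below assumes "4M" means exactly 4,000,000 pairs, which the source does not state; `retention_pct` is a hypothetical helper name.

```python
TOTAL_PAIRS = 4_000_000     # assumption: "4M" read as exactly 4,000,000
VERIFIED_PAIRS = 2_429_533  # verified pairs, from the Size line


def retention_pct(kept: int, total: int) -> float:
    """Percentage of `total` that survives a filtering step, to 2 decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * kept / total, 2)


if __name__ == "__main__":
    print(retention_pct(VERIFIED_PAIRS, TOTAL_PAIRS))  # 60.74
```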

Dataset & DVC Provenance Summary

CourtListener: 159 shards, 7.6GB, SHA-256 manifest (corpus_manifest_sha: 7e5cbae1...), git_rev: 90a35201

LePaRD: 5.78GB JSONL, SHA-256: abe787c0..., HF revision: 0194f95c, DVC-tracked at repo root (123B pointer)

Verified subset: 2,429,533 pairs (3.6GB); gold_pairs_test: 45,000 pairs → 20,877 unique queries

DVC S3: s3://cs1090b-hallucinationlegalragchatbots, 191.64GB total, 206 objects, sync clean
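
One way to re-check shards against a SHA-256 manifest like the one mentioned above is sketched below, using only the Python standard library. The manifest layout (a mapping from shard filename to hex digest) and the names `sha256_file` and `verify_shards` are assumptions for illustration; the project's actual manifest format is not shown in this section.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_file(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_shards(manifest: Dict[str, str], root: Path) -> List[str]:
    """Return the shard names that are missing or whose digest differs.

    `manifest` maps shard filename -> expected hex digest (assumed layout).
    """
    bad: List[str] = []
    for name, expected in manifest.items():
        shard = root / name
        if not shard.is_file() or sha256_file(shard) != expected.lower():
            bad.append(name)
    return bad
```

An empty list from `verify_shards` means every listed shard exists and matches its recorded digest. Reading in blocks keeps memory use flat even for multi-gigabyte shards.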