Ethics & Limitations

Responsible research practices and known limitations of the system.

Public Datasets Only

CourtListener data is used under CC BY-ND 4.0. LePaRD is open research data. No private or proprietary legal data is used.

No Human Annotation

Hallucination measurement is fully automated: LePaRD gold labels score retrieval (Tier A), and a gpt-4o-mini LLM-as-judge labels each generation FAITHFUL, PARTIAL, or HALLUCINATED (Tier B). No crowdsourced or paid annotators were used.
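As a minimal sketch of how Tier B judge labels could be turned into rates: the three label names come from the text above, while the function name label_rates and the input format (a flat list of label strings) are assumptions for illustration, not the project's actual pipeline.

```python
from collections import Counter

# The three judge labels named in the text.
LABELS = ("FAITHFUL", "PARTIAL", "HALLUCINATED")


def label_rates(judgments):
    """Return the share of each judge label in a list of label strings.

    Hypothetical helper: the input format is assumed, not taken from the project.
    """
    if not judgments:
        raise ValueError("no judgments to aggregate")
    unknown = set(judgments) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(judgments)
    total = len(judgments)
    # Counter returns 0 for labels that never occur, so every label gets a rate.
    return {label: counts[label] / total for label in LABELS}


# Toy input, not real results.
rates = label_rates(["FAITHFUL", "PARTIAL", "HALLUCINATED", "FAITHFUL"])
```

Rejecting unknown labels up front keeps a malformed judge response from silently shrinking the reported hallucination rate.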

Open-Source Generator

Generation uses Qwen2.5-7B-Instruct (open-source, locally deployed on a 4x NVIDIA L4 GPU cluster). No query data is sent to third parties during generation. The LLM-as-judge uses gpt-4o-mini (OpenAI API) for hallucination labeling only: 104,385 judgments at ~$53 total cost (roughly $0.0005 per judgment). Outputs are used strictly for retrieval research under academic supervision.

PII Handling

PII handling follows CourtListener and LePaRD provider redaction practices. No additional PII collection or processing.

Scope Limitations

Retrieval ceiling: Hit@100 = 0.375, so the gold passage is absent from the top 100 results for most queries; the resulting ~56% hallucination floor is irreducible at current SOTA retrieval. System outputs are not legal advice. Academic research prototype only.
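For readers unfamiliar with the metric, Hit@k can be sketched as the fraction of queries whose gold passage appears among the top k retrieved results. The function name hit_at_k, the one-gold-passage-per-query assumption, and the toy data are illustrative only and are not taken from the project's evaluation code.

```python
def hit_at_k(ranked_ids, gold_ids, k=100):
    """Fraction of queries whose gold passage ID is in the top-k ranked IDs.

    ranked_ids: one ranked list of passage IDs per query (best first).
    gold_ids:   one gold passage ID per query (assumed single gold label).
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("need exactly one gold ID per ranked list")
    if not gold_ids:
        raise ValueError("no queries to score")
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


# Toy data, not real results: four queries scored at k=2.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
gold = ["b", "f", "z", "j"]
score = hit_at_k(ranked, gold, k=2)
```

Here the first and fourth queries have their gold passage within the top 2, so the score is 0.5.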

Academic Use Only

This system is a research prototype. Results should not be used for actual legal practice or decision-making.

Ethics & Limitations Summary

Generator: Qwen2.5-7B-Instruct (open-source, local, no data sent to third parties). Judge: gpt-4o-mini (OpenAI API, ~$53, 104,385 judgments). Hallucination floor is ~56% at SOTA retrieval, so the system is not suitable for unsupervised legal deployment. Single judge model (gpt-4o-mini, ~80% FaithBench accuracy): relative differences across ablations are robust, but absolute rates would tighten under multi-judge consensus.
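One simple form the multi-judge consensus mentioned above could take is a majority vote over the labels returned by several judge models. This is a sketch under assumptions, not something the system implements: the function name consensus_label and the tie-breaking rule (falling back to PARTIAL) are hypothetical choices.

```python
from collections import Counter


def consensus_label(votes, fallback="PARTIAL"):
    """Majority label among judge votes; exact ties return `fallback`.

    The fallback rule is an illustrative assumption, not a documented policy.
    """
    if not votes:
        raise ValueError("no votes to combine")
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    # A tie exists when the runner-up label has the same count as the leader.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return fallback
    return top_label


# Toy votes from three hypothetical judges.
agreed = consensus_label(["FAITHFUL", "FAITHFUL", "HALLUCINATED"])
split = consensus_label(["FAITHFUL", "PARTIAL", "HALLUCINATED"])
```

With an odd number of judges and three labels, a three-way split is still possible, which is why an explicit tie rule is needed.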