Methodology

Hallucination is measured automatically in three tiers, so evaluation has no human annotation bottleneck.

Tier A — Retrieval Grounding

Retrieval is evaluated against LePaRD, which provides 4M+ expert-annotated citation pairs. Metrics: Recall@k, MRR, and NDCG@10. The test set is capped at 10K–50K pairs.
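The three retrieval metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the project's evaluation code: it assumes binary relevance, ranked lists without duplicate ids, and hypothetical function names (recall_at_k, reciprocal_rank, ndcg_at_k, evaluate).

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear among the top-k ranked ids."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """NDCG@k with binary gains (assumption: every relevant pair has gain 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def evaluate(runs, k=10):
    """Average the metrics over runs, a list of (ranked_ids, relevant_ids)."""
    if not runs:
        raise ValueError("no queries to evaluate")
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        "ndcg@10": sum(ndcg_at_k(r, rel, 10) for r, rel in runs) / n,
    }
```

MRR is the mean of reciprocal_rank over all test queries, which is what evaluate returns under the "mrr" key.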

Tier B — NLI Hallucination Detection

DeBERTa-v3-large-mnli classifies each atomic claim against the retrieved chunks, over 1,000 stratified queries. The contradiction rate is reported two ways: normalized by claim count and per 1K tokens. Inference is fully local, with no API calls.
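The two normalizations can be illustrated with a short Python sketch. It takes NLI labels as plain strings so it runs without loading the model. Three things here are assumptions, not confirmed by the protocol: the rule for combining per-chunk labels into one claim label (aggregate_claim), the choice of the token count being normalized (n_tokens), and all function names.

```python
def aggregate_claim(chunk_labels):
    """Combine per-chunk NLI labels into one label for a claim.

    Assumed rule: any entailing chunk supports the claim; otherwise any
    contradicting chunk marks it contradicted; otherwise it is neutral.
    """
    if "entailment" in chunk_labels:
        return "entailment"
    if "contradiction" in chunk_labels:
        return "contradiction"
    return "neutral"


def contradiction_rates(claim_labels, n_tokens):
    """Return the contradiction count and its two normalized rates.

    claim_labels: one aggregated label per atomic claim.
    n_tokens: token count of the text the claims came from (assumption).
    """
    n_claims = len(claim_labels)
    n_contra = sum(1 for label in claim_labels if label == "contradiction")
    return {
        "contradictions": n_contra,
        "per_claim": n_contra / n_claims if n_claims else 0.0,
        "per_1k_tokens": 1000.0 * n_contra / n_tokens if n_tokens else 0.0,
    }
```

For example, one contradicted claim out of four in a 500-token answer gives a per-claim rate of 0.25 and a rate of 2.0 per 1K tokens.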

Tier C — Citation Existence Check

Each citation is looked up in a SQLite citation index. NULL (no match) → Hard Citation Hallucination. Found but no NLI support → CitationFound_NoLocalSupport. Windowing follows an anchor-first strategy.
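The lookup-and-label logic might look like the following Python sketch, using the standard sqlite3 module. The table name, column names, demo rows, and the CitationFound_Supported label are hypothetical; only the two outcome labels named above come from the protocol. The NLI result is passed in as a boolean rather than computed here.

```python
import sqlite3

HARD_HALLUCINATION = "Hard Citation Hallucination"
FOUND_NO_SUPPORT = "CitationFound_NoLocalSupport"
FOUND_SUPPORTED = "CitationFound_Supported"  # hypothetical label for the remaining case


def build_demo_index():
    """Create a tiny in-memory index (hypothetical schema, demo rows only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE citations (cite TEXT PRIMARY KEY, case_name TEXT)")
    conn.executemany(
        "INSERT INTO citations VALUES (?, ?)",
        [
            ("347 U.S. 483", "Brown v. Board of Education"),
            ("384 U.S. 436", "Miranda v. Arizona"),
        ],
    )
    return conn


def check_citation(conn, cite, nli_supported):
    """Label a citation from the index lookup and an NLI support flag."""
    row = conn.execute(
        "SELECT case_name FROM citations WHERE cite = ?", (cite,)
    ).fetchone()
    if row is None:
        # No row in the index: the citation does not exist.
        return HARD_HALLUCINATION
    return FOUND_SUPPORTED if nli_supported else FOUND_NO_SUPPORT
```

A parameterized query is used so that citation strings taken from model output cannot alter the SQL.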

API Placeholder

GET /api/methodology: returns evaluation protocol details (endpoint pending).