Methodology
Three-tier automated hallucination measurement — no human annotation bottleneck.
Tier A — Retrieval Grounding
Evaluated on LePaRD (4M+ expert-annotated citation pairs). Metrics: Recall@k, MRR, NDCG@10. The test set is capped at 10K–50K pairs.
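A minimal sketch of the three retrieval metrics, assuming binary relevance and unique passage IDs in each ranking. The function names and the toy IDs are illustrative only and are not taken from the evaluation code.

```python
import math
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant passages that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR: the average reciprocal rank over all (ranking, relevant set) pairs."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """NDCG@k with binary gains (assumption: a passage is relevant or not)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ranking places every relevant passage at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: the single relevant passage is retrieved at rank 2.
ranked = ["p3", "p1", "p7"]
relevant = {"p1"}
print(recall_at_k(ranked, relevant, k=3), reciprocal_rank(ranked, relevant))
```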
Tier B — NLI Hallucination Detection
DeBERTa-v3-large-mnli classifies each atomic claim against the retrieved chunks, over 1,000 stratified queries. The contradiction rate is reported two ways: per claim and per 1K tokens. Runs fully locally, with no API calls.
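The two normalizations can be sketched as follows. This assumes the NLI step has already produced one label per atomic claim; how labels from multiple chunks are combined into a single claim label is not specified here, so the sketch takes the per-claim labels as given. The `ResponseResult` type and the lowercase label strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ResponseResult:
    """Hypothetical container: NLI labels for one response's atomic claims."""
    claim_labels: List[str]  # assumed values: "entailment", "neutral", "contradiction"
    token_count: int         # tokens in the generated response


def contradiction_rates(results: List[ResponseResult]) -> Tuple[float, float]:
    """Return (contradictions per claim, contradictions per 1K tokens)."""
    total_claims = sum(len(r.claim_labels) for r in results)
    total_tokens = sum(r.token_count for r in results)
    contradictions = sum(
        label == "contradiction" for r in results for label in r.claim_labels
    )
    per_claim = contradictions / total_claims if total_claims else 0.0
    per_1k_tokens = 1000 * contradictions / total_tokens if total_tokens else 0.0
    return per_claim, per_1k_tokens


# Toy input: 2 contradictions across 5 claims and 500 tokens.
results = [
    ResponseResult(["entailment", "contradiction", "neutral", "entailment"], 400),
    ResponseResult(["contradiction"], 100),
]
per_claim, per_1k_tokens = contradiction_rates(results)
print(per_claim, per_1k_tokens)
```

Reporting both figures separates two effects: the per-claim rate is insensitive to response length, while the per-1K-token rate penalizes long responses that pack in many contradicted claims.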
Tier C — Citation Existence Check
Each citation is looked up in a SQLite citation index. No match (NULL) → Hard Citation Hallucination. Found but without NLI support → CitationFound_NoLocalSupport. Uses an anchor-first windowing strategy.
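A minimal sketch of the lookup-and-label logic, using Python's standard sqlite3 module. The table name, column name, demo rows, and the `CitationFound_Supported` label are assumptions made for illustration; only the two outcomes named above come from the protocol. The anchor-first windowing step is not shown.

```python
import sqlite3


def build_demo_index() -> sqlite3.Connection:
    """Create a tiny in-memory citation index (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE citations (citation TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO citations (citation) VALUES (?)",
        [("347 U.S. 483",), ("410 U.S. 113",)],
    )
    conn.commit()
    return conn


def classify_citation(
    conn: sqlite3.Connection, citation: str, has_nli_support: bool
) -> str:
    """Label a citation by index lookup first, then by NLI support."""
    row = conn.execute(
        "SELECT 1 FROM citations WHERE citation = ?", (citation,)
    ).fetchone()
    if row is None:
        # Not in the index at all: the citation does not exist.
        return "HardCitationHallucination"
    if not has_nli_support:
        # The citation exists, but no retrieved chunk supports the claim.
        return "CitationFound_NoLocalSupport"
    return "CitationFound_Supported"  # assumed label for the remaining case


conn = build_demo_index()
print(classify_citation(conn, "999 U.S. 999", has_nli_support=False))
print(classify_citation(conn, "347 U.S. 483", has_nli_support=False))
```

Checking existence before NLI support keeps the two failure modes distinct: a fabricated citation is never counted as a mere support failure.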
API Placeholder
GET /api/methodology — returns evaluation protocol details (pending)