AI Verification Thesis

Domain Experts as Eval Builders

LLMs are general. Verification is specific. The people who know what "correct" looks like should be defining the tests — not ML engineers, not prompt hackers.


The Core Mismatch

Generation is solved. The constraint has moved to trust. That trust layer is built by people who've spent decades in their fields, not by more parameters.

100M+ knowledge workers spend 4.3 hours per week verifying AI outputs. The companies that help domain experts encode their expertise into scalable eval systems will capture the verification gap.

$2.2T
Estimated verification gap — the cost of humans checking what AI produces

Three Eval Approaches

Entry Point
01
Test Case Libraries
Experts write input-output pairs: "Given this patient history, the model should flag X." Hundreds of these pairs form a benchmark. Think of it as an exam bank written by practitioners, not academics.
Low technical lift · High impact · Encodes tacit knowledge no training dataset captures
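A minimal sketch of what one expert-authored test case might look like. The names here (`TestCase`, `required_flags`, `run_case`) are illustrative, not a real product API, and the substring check stands in for whatever matching a real harness would use:

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    """One expert-authored input-output pair."""
    input_text: str                       # e.g. a patient history
    required_flags: list = field(default_factory=list)  # findings the model must surface
    author: str = ""                      # the practitioner who wrote it

def run_case(case: TestCase, model_output: str) -> bool:
    """Pass only if every expert-required flag appears in the output."""
    text = model_output.lower()
    return all(flag.lower() in text for flag in case.required_flags)

# A library is just hundreds of these, each encoding tacit knowledge.
case = TestCase(
    input_text="68yo male, smoker, new-onset dyspnea, prior CABG...",
    required_flags=["cardiac history", "smoking"],
    author="cardiologist",
)
print(run_case(case, "Note cardiac history and smoking status."))  # True
```

The value is not the harness, which is trivial, but the pairs themselves: the expert decides which flags are required, and that judgment is what no training dataset captures.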
Nuanced
02
Rubric-Based Judging
Beyond pass/fail. Experts define scoring criteria: "A good radiology report should mention laterality, comparison to priors, and clinical correlation." The rubric is the moat.
LLM-as-Judge compatible · What separates dangerous from merely mediocre
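A sketch of a rubric as structured data, using the radiology example above. Here a simple keyword check stands in for an LLM judge's semantic evaluation; the criteria names and weights are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    name: str
    weight: float
    keywords: list = field(default_factory=list)  # stand-in for an LLM judge's check

RADIOLOGY_RUBRIC = [
    Criterion("laterality", 0.4, ["left", "right", "bilateral"]),
    Criterion("comparison to priors", 0.3, ["prior", "previous"]),
    Criterion("clinical correlation", 0.3, ["correlat"]),
]

def score(report: str, rubric: list) -> float:
    """Weighted fraction of criteria satisfied (0.0 to 1.0)."""
    text = report.lower()
    return sum(c.weight for c in rubric
               if any(k in text for k in c.keywords))

report = "Left lower lobe opacity, unchanged from prior. Clinical correlation advised."
print(round(score(report, RADIOLOGY_RUBRIC), 2))  # 1.0
```

Because the rubric is data, it plugs directly into an LLM-as-Judge prompt ("score this report against these weighted criteria") while remaining auditable by the expert who wrote it.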
Highest Value
03
Adversarial Red Teaming
Experts probe for failure modes that generalists miss. A petroleum engineer knows edge cases where a model confuses upstream and midstream. Only domain experts catch the plausible-sounding hallucination that gets someone hurt.
Beyond SelfCheckGPT · Beyond Semantic Entropy · Domain-specific failure modes
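One way to encode red-team knowledge is as probes: a tricky prompt paired with a claim that would reveal a known domain confusion. The structure below is a hypothetical sketch using the upstream/midstream example; a real system would use semantic matching rather than substrings:

```python
# Each probe pairs a tricky prompt with a claim the model must not assert.
# (Illustrative only; refining is downstream, pipeline transport is midstream.)
ADVERSARIAL_PROBES = [
    {
        "prompt": "Explain where crude oil refining sits in the value chain.",
        "must_not_claim": "refining is upstream",
    },
    {
        "prompt": "Is pipeline transport an upstream activity?",
        "must_not_claim": "pipeline transport is upstream",
    },
]

def red_team(model_output: str, probe: dict) -> bool:
    """Return True if the output avoids the known failure mode."""
    return probe["must_not_claim"].lower() not in model_output.lower()

print(red_team("Refining is downstream of extraction.", ADVERSARIAL_PROBES[0]))  # True
```

Generic hallucination detectors can flag low-confidence text, but only the petroleum engineer knows which specific confident-sounding claims are wrong, which is what these probes capture.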

The Verification Pipeline

How domain expertise flows from tacit knowledge to scalable, defensible eval infrastructure.

Input
Domain Expert Knowledge
Decades of pattern recognition, edge case awareness, and professional judgment
Encode
Eval Tooling Layer
Low-code interfaces for test cases, rubrics, and adversarial scenarios
Execute
Automated Verification
NLI-based checking, RAG verification, LLM-as-Judge scoring, hierarchical review
Output
Trust Score + Routing
High-stakes → full human review. Low-stakes → automated pass. The threshold is the expert's calibration.
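The routing step above can be sketched as a single threshold lookup. The threshold values are illustrative assumptions; the point is that the expert, not the model, sets them per stakes level:

```python
def route(confidence: float, stakes: str, thresholds: dict) -> str:
    """Route an AI output using an expert-calibrated threshold.

    `thresholds` maps a stakes level to the minimum confidence
    allowed to skip human review.
    """
    return "auto_pass" if confidence >= thresholds[stakes] else "human_review"

# An expert calibrates very differently per domain (illustrative values):
THRESHOLDS = {"internal_summary": 0.90, "medical_dosing": 0.999}

print(route(0.95, "internal_summary", THRESHOLDS))  # auto_pass
print(route(0.95, "medical_dosing", THRESHOLDS))    # human_review
```

The same 0.95 confidence routes two ways: the calibration, not the score, carries the expert's judgment.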

The Contribution Stack

A domain expert building evals operates across three layers, each creating compounding defensibility.

01
Ground Truth Curation
Collecting and labeling correct, incorrect, and ambiguous outputs. Feeds RAG-based verification and NLI checkers.
→ Defensible data asset
02
Threshold Calibration
Deciding what confidence score triggers human review. 90% may be fine for internal summaries but reckless for medical dosing.
→ Risk-appropriate routing
03
Feedback Loops
Every expert correction trains the next version. The data flywheel that creates AI-era defensibility. The expert becomes part of the system.
→ Compounding moat
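A toy sketch of the feedback loop: expert corrections on auto-passed outputs drive the threshold back toward safety. The window size and 5% tolerance are invented for illustration; a real system would retrain judges and update ground-truth sets, not just move one number:

```python
from collections import deque

class FeedbackLoop:
    """Recalibrate the review threshold from recent expert corrections."""

    def __init__(self, threshold: float, window: int = 100):
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # True = expert overturned the AI

    def record(self, confidence: float, overturned: bool) -> None:
        # Only auto-passed outputs inform calibration.
        if confidence >= self.threshold:
            self.outcomes.append(overturned)
        # If experts overturn more than 5% of auto-passed outputs, tighten.
        if len(self.outcomes) >= 20:
            error_rate = sum(self.outcomes) / len(self.outcomes)
            if error_rate > 0.05:
                self.threshold = min(0.999, self.threshold + 0.01)

loop = FeedbackLoop(threshold=0.90)
for _ in range(20):
    loop.record(0.95, overturned=True)  # experts keep correcting auto-passes
print(loop.threshold > 0.90)  # True
```

Every correction is both a fix and a training signal, which is why the expert becomes part of the system rather than a reviewer outside it.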
100M+
Knowledge workers verifying AI outputs weekly
4.3 hrs
Average time per week spent on AI output verification
$2.2T
Estimated verification gap — the opportunity layer above generation

The Playbook

Find domains with high verification costs and clear experts. Give those experts simple tooling. Let them build the eval libraries. The library becomes the product.

01
Identify High-Stakes Domains
Cardiology, contract law, chemical engineering — where wrong outputs cause real harm and verification costs are highest.
02
Build Expert Tooling
Low-code interfaces for test case creation, rubric definition, and adversarial scenario design. Remove the technical barrier.
03
Scale the Eval Library
The library of expert-built evals becomes the defensible product. Each correction strengthens the flywheel. The expert network is the moat.