AI Verification Thesis

Domain Experts as Eval Builders

LLMs are general. Verification is specific. The people who know what "correct" looks like should be defining the tests — not ML engineers, not prompt hackers.


The Core Mismatch

Generation is solved. The constraint has moved to trust. That trust layer is built by people who've spent decades in their fields, not by more parameters.

100M+ knowledge workers spend 4.3 hours per week verifying AI outputs. The companies that help domain experts encode their expertise into scalable eval systems will capture the verification gap.

$2.2T
Estimated verification gap — the cost of humans checking what AI produces

Three Eval Approaches

Entry Point
01
Test Case Libraries
Experts write input-output pairs: "Given this patient history, the model should flag X." Hundreds of these pairs form a benchmark. Think of it as an exam bank written by practitioners, not academics.
Low technical lift · High impact · Encodes tacit knowledge no training dataset captures
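A minimal sketch of what one expert-authored test case might look like. The names here (`TestCase`, `required_flags`, `run_case`) are illustrative, not a real product API, and the substring check stands in for whatever matching a real harness would use:

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    """One expert-authored input-output pair."""
    input_text: str                       # e.g. a patient history
    required_flags: list = field(default_factory=list)  # findings the model must surface
    author: str = ""                      # the practitioner who wrote it

def run_case(case: TestCase, model_output: str) -> bool:
    """Pass only if every expert-required flag appears in the output."""
    text = model_output.lower()
    return all(flag.lower() in text for flag in case.required_flags)

# A library is just hundreds of these, each encoding tacit knowledge.
case = TestCase(
    input_text="68yo male, smoker, new-onset dyspnea, prior CABG...",
    required_flags=["cardiac history", "smoking"],
    author="cardiologist",
)
print(run_case(case, "Note cardiac history and smoking status."))  # True
```

The value is not the harness, which is trivial, but the pairs themselves: the expert decides which flags are required, and that judgment is what no training dataset captures.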
Nuanced
02
Rubric-Based Judging
Beyond pass/fail. Experts define scoring criteria: "A good radiology report should mention laterality, comparison to priors, and clinical correlation." The rubric is the moat.
LLM-as-Judge compatible · What separates dangerous from merely mediocre
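A sketch of a rubric as structured data, using the radiology example above. Here a simple keyword check stands in for an LLM judge's semantic evaluation; the criteria names and weights are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    name: str
    weight: float
    keywords: list = field(default_factory=list)  # stand-in for an LLM judge's check

RADIOLOGY_RUBRIC = [
    Criterion("laterality", 0.4, ["left", "right", "bilateral"]),
    Criterion("comparison to priors", 0.3, ["prior", "previous"]),
    Criterion("clinical correlation", 0.3, ["correlat"]),
]

def score(report: str, rubric: list) -> float:
    """Weighted fraction of criteria satisfied (0.0 to 1.0)."""
    text = report.lower()
    return sum(c.weight for c in rubric
               if any(k in text for k in c.keywords))

report = "Left lower lobe opacity, unchanged from prior. Clinical correlation advised."
print(round(score(report, RADIOLOGY_RUBRIC), 2))  # 1.0
```

Because the rubric is data, it plugs directly into an LLM-as-Judge prompt ("score this report against these weighted criteria") while remaining auditable by the expert who wrote it.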
Highest Value
03
Adversarial Red Teaming
Experts probe for failure modes that generalists miss. A petroleum engineer knows edge cases where a model confuses upstream and midstream. Only domain experts catch the plausible-sounding hallucination that gets someone hurt.
Beyond SelfCheckGPT · Beyond Semantic Entropy · Domain-specific failure modes
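One way to encode red-team knowledge is as probes: a tricky prompt paired with a claim that would reveal a known domain confusion. The structure below is a hypothetical sketch using the upstream/midstream example; a real system would use semantic matching rather than substrings:

```python
# Each probe pairs a tricky prompt with a claim the model must not assert.
# (Illustrative only; refining is downstream, pipeline transport is midstream.)
ADVERSARIAL_PROBES = [
    {
        "prompt": "Explain where crude oil refining sits in the value chain.",
        "must_not_claim": "refining is upstream",
    },
    {
        "prompt": "Is pipeline transport an upstream activity?",
        "must_not_claim": "pipeline transport is upstream",
    },
]

def red_team(model_output: str, probe: dict) -> bool:
    """Return True if the output avoids the known failure mode."""
    return probe["must_not_claim"].lower() not in model_output.lower()

print(red_team("Refining is downstream of extraction.", ADVERSARIAL_PROBES[0]))  # True
```

Generic hallucination detectors can flag low-confidence text, but only the petroleum engineer knows which specific confident-sounding claims are wrong, which is what these probes capture.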

The Verification Pipeline

How domain expertise flows from tacit knowledge to scalable, defensible eval infrastructure.

Input
Domain Expert Knowledge
Decades of pattern recognition, edge case awareness, and professional judgment
Encode
Eval Tooling Layer
Low-code interfaces for test cases, rubrics, and adversarial scenarios
Execute
Automated Verification
NLI-based checking, RAG verification, LLM-as-Judge scoring, hierarchical review
Output
Trust Score + Routing
High-stakes → full human review. Low-stakes → automated pass. The threshold is the expert's calibration.
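The routing step above can be sketched as a single threshold lookup. The threshold values are illustrative assumptions; the point is that the expert, not the model, sets them per stakes level:

```python
def route(confidence: float, stakes: str, thresholds: dict) -> str:
    """Route an AI output using an expert-calibrated threshold.

    `thresholds` maps a stakes level to the minimum confidence
    allowed to skip human review.
    """
    return "auto_pass" if confidence >= thresholds[stakes] else "human_review"

# An expert calibrates very differently per domain (illustrative values):
THRESHOLDS = {"internal_summary": 0.90, "medical_dosing": 0.999}

print(route(0.95, "internal_summary", THRESHOLDS))  # auto_pass
print(route(0.95, "medical_dosing", THRESHOLDS))    # human_review
```

The same 0.95 confidence routes two ways: the calibration, not the score, carries the expert's judgment.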

The Contribution Stack

A domain expert building evals operates across three layers, each creating compounding defensibility.

01
Ground Truth Curation
Collecting and labeling correct, incorrect, and ambiguous outputs. Feeds RAG-based verification and NLI checkers.
→ Defensible data asset
02
Threshold Calibration
Deciding what confidence score triggers human review. 90% may be fine for internal summaries but reckless for medical dosing.
→ Risk-appropriate routing
03
Feedback Loops
Every expert correction trains the next version. The data flywheel that creates AI-era defensibility. The expert becomes part of the system.
→ Compounding moat
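A toy sketch of the feedback loop: expert corrections on auto-passed outputs drive the threshold back toward safety. The window size and 5% tolerance are invented for illustration; a real system would retrain judges and update ground-truth sets, not just move one number:

```python
from collections import deque

class FeedbackLoop:
    """Recalibrate the review threshold from recent expert corrections."""

    def __init__(self, threshold: float, window: int = 100):
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # True = expert overturned the AI

    def record(self, confidence: float, overturned: bool) -> None:
        # Only auto-passed outputs inform calibration.
        if confidence >= self.threshold:
            self.outcomes.append(overturned)
        # If experts overturn more than 5% of auto-passed outputs, tighten.
        if len(self.outcomes) >= 20:
            error_rate = sum(self.outcomes) / len(self.outcomes)
            if error_rate > 0.05:
                self.threshold = min(0.999, self.threshold + 0.01)

loop = FeedbackLoop(threshold=0.90)
for _ in range(20):
    loop.record(0.95, overturned=True)  # experts keep correcting auto-passes
print(loop.threshold > 0.90)  # True
```

Every correction is both a fix and a training signal, which is why the expert becomes part of the system rather than a reviewer outside it.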
100M+
Knowledge workers verifying AI outputs weekly
4.3 hrs
Average time per week spent on AI output verification
$2.2T
Estimated verification gap — the opportunity layer above generation

The Playbook

Find domains with high verification costs and clear experts. Give those experts simple tooling. Let them build the eval libraries. The library becomes the product.

01
Identify High-Stakes Domains
Cardiology, contract law, chemical engineering — where wrong outputs cause real harm and verification costs are highest.
02
Build Expert Tooling
Low-code interfaces for test case creation, rubric definition, and adversarial scenario design. Remove the technical barrier.
03
Scale the Eval Library
The library of expert-built evals becomes the defensible product. Each correction strengthens the flywheel. The expert network is the moat.