Verification Is a Budget, Not a Default

A team ships an AI feature, and to make it safe they add a second model to check the first. An LLM judge on every output: grounding, tone, policy, all of it. It works, so they run it everywhere. Then two numbers start climbing. The token bill, because every output now costs a second inference. And the tail latency, because every response waits on a judge before it reaches the user. The email summarizer is paying the same verification tax as the tool that moves money. A lot of that spend is proving things that never needed proving.

Verifying everything is a good instinct applied without a budget. This post is about giving it one, and about the checks that beat an LLM judge on cost and certainty once you do.

The Default Trap

“Verify everything” is a policy that emerged in contexts where verification volume was low enough to allow thoroughness. A compliance officer reviewing ten contracts a week can read each line. A reporter filing one story a day can call every source. Neither model survives contact with modern information volumes.

I lived this at AWS Identity. When you’re designing authentication flows that handle billions of requests, “verify the user” isn’t a binary switch you can afford to flip to maximum for every single transaction. The cost of verification failure (fraud, account takeover) is real. But so is the cost of verification overhead (latency, abandonment, support tickets).

The entire design philosophy behind FIDO2 and WebAuthn, which I worked on through the FIDO Alliance and W3C, is built on this insight: make the right verification cheap enough to be ubiquitous through hardware and cryptographic proof, then reserve expensive human-in-the-loop scrutiny for the cases that actually warrant it.

That is allocation, and identity systems have run on it for years.

Verification as a Budget

The useful mental model is this: your team has a finite verification budget. It’s measured in attention-hours, compute cycles, reviewer capacity, and calendar time. Every verification action you take has a cost, and that cost is paid whether or not the verification catches anything.

In my experience, the return on verification follows a saturation curve. The first hour of review on a high-risk AI output catches real problems. By the fifth hour you are flagging formatting nits, and every hour after that catches nothing while quietly eroding reviewer trust in the process.

The software QA world treats verification as a major line item rather than an afterthought. One software-testing cost model assumes internal testing represents roughly 30 percent of total software development cost. The precise number varies significantly by project, but the broader point holds: verification is a major engineering investment, and it deserves the same allocation discipline as any other budget.

QA lore has a widely cited rule of thumb: a defect gets roughly an order of magnitude more expensive to fix at each stage it survives, often quoted as 1x at requirements, 10x in testing, and 100x in production. The precise figures and their exact origin are debated, so treat this as a directional heuristic, not a settled measurement. One version of the framing appears in Galorath’s ROI research. The shape still maps onto AI governance. Catching a hallucination or a governance failure early, before it is embedded in a downstream decision, is far cheaper than catching it after deployment.

Figure: Relative cost to fix a defect by lifecycle stage: requirements/design ($1), testing phase ($10), production ($100). Widely cited rule of thumb; exact provenance debated. One version: Galorath / Robbins Gioia, Software Test Costs and Return on Investment (ROI) Issues.

The implication: your verification budget should be front-loaded, not evenly distributed.

What Determines the Allocation

When I’m advising teams on where to spend their verification budget, I use three variables:

Stakes: What’s the blast radius if this output is wrong? A customer-facing financial recommendation has different stakes than an internal summary document. The verification depth should reflect the consequence, not the content type.

Reversibility: Can you undo the decision if verification catches an error later? A draft blog post is reversible. A model deployed to production scoring credit applications is not (or at least, not cheaply). Irreversible decisions get more verification budget upfront.

Epistemic uncertainty: How confident are you in the system producing this output? A well-tested, constrained model with guardrails on a narrow domain needs less per-output verification than a general-purpose LLM operating in an open-ended context. I’ve written about this dynamic in the context of guardrails and governance: knowing where your system’s confidence drops is itself a verification input.

Treat these as a triage heuristic rather than a formula. In practice, you can classify most outputs into high/medium/low verification tiers in under a minute if you’ve done the upfront work of defining what “high stakes” and “irreversible” mean for your domain.
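As a rough illustration, the triage can be sketched in a few lines of Python. Everything here is an assumption made for the example: the three-level scale, the `Output` and `assign_tier` names, and the specific rule that high stakes, or a hard-to-undo decision with at least medium stakes, lands in the top tier. Your own thresholds will differ; the point is only that the rule is small enough to write down.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Output:
    name: str
    stakes: Level        # blast radius if the output is wrong
    irreversible: bool   # True if the decision cannot be cheaply undone
    uncertainty: Level   # how little you trust the system producing it


def assign_tier(o: Output) -> Level:
    """Hypothetical triage rule: the worst dimension dominates."""
    if o.stakes == Level.HIGH or (o.irreversible and o.stakes >= Level.MEDIUM):
        return Level.HIGH
    if o.irreversible or o.uncertainty == Level.HIGH or o.stakes == Level.MEDIUM:
        return Level.MEDIUM
    return Level.LOW


# Illustrative classifications, not recommendations for any real system.
summary = Output("internal doc summary", Level.LOW, False, Level.MEDIUM)
credit = Output("credit scoring decision", Level.HIGH, True, Level.MEDIUM)
draft = Output("open-ended LLM draft", Level.LOW, False, Level.HIGH)

for item in (summary, credit, draft):
    print(item.name, "->", assign_tier(item).name)
```

The rule is intentionally coarse. It encodes the definitions you agreed on upfront so that the per-output decision stays fast.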

Figure: Verification Tier Assignment

What This Looks Like in Practice

AI output evaluation. A well-run evaluation program does not spend the same depth on every model output. A model generating internal documentation summaries can clear on automated checks alone: format, factual grounding against source docs, policy compliance. A model whose output informs a human decision in a regulated process needs those automated checks plus structured human review with explicit sign-off. Same model, different verification budget, because the stakes and reversibility differ.

I’ve written more about what it takes to build this kind of pipeline in building an automated model evaluation pipeline. The design bar there was simple: allocate compute and reviewer time to the outputs that justify it, and automate the rest.

Identity systems. A step-up authentication flow is verification budgeting in action. Low-risk session activity (browsing, reading) gets minimal re-verification. High-risk actions (fund transfer, permission change) trigger additional factors. The system isn’t being careless in the low-risk moments; it’s spending its verification budget where the expected loss is highest.

I’ve talked about why platform engineers should care about identity systems before, and this is the core reason: identity is the substrate that makes risk-proportionate verification possible.
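The step-up flow described above reduces to a comparison between expected loss and the cost of interrupting the user. A minimal sketch follows, with hypothetical names (`expected_loss`, `needs_step_up`) and made-up numbers. Real systems estimate the fraud probability from risk signals, which is out of scope here.

```python
def expected_loss(p_fraud: float, loss_if_fraud: float) -> float:
    """Probability-weighted cost of letting a fraudulent action through."""
    return p_fraud * loss_if_fraud


def needs_step_up(p_fraud: float, loss_if_fraud: float, friction_cost: float) -> bool:
    """Ask for another factor only when expected loss exceeds the cost of asking."""
    return expected_loss(p_fraud, loss_if_fraud) > friction_cost


# Illustrative values only: same session risk, very different consequences.
FRICTION_COST = 0.50  # assumed cost of one extra prompt (abandonment, support)
browse = needs_step_up(p_fraud=0.01, loss_if_fraud=5.0, friction_cost=FRICTION_COST)
transfer = needs_step_up(p_fraud=0.01, loss_if_fraud=10_000.0, friction_cost=FRICTION_COST)
print(browse, transfer)
```

With these assumed numbers the browsing action skips the extra factor and the transfer triggers it, even though the estimated fraud probability is identical.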

Knowledge work. When I review my own blog agent’s output (yes, I built an AI agent that writes for this blog), I don’t verify every sentence with equal intensity. Factual claims with citations get the hard look: are they real? Technical specifics get checked for correctness, and the voice gets a read for whether it actually sounds like me. Formatting and grammar get a fast scan and nothing more. The checks I go light on are the ones where a mistake would not cost much anyway.

Figure: Verification Tier Model

Not Every Check Should Be an LLM

Once you accept that verification is a budget, the next question is what you spend it on. Most teams spend it on one thing: a large language model acting as a judge. An LLM judge is flexible and fast to stand up. It scores grounding, tone, completeness, and policy fit from a prompt, with no training data. That flexibility is why it became the default.

It is also the slowest and most expensive option, and it is probabilistic. A judge is a second inference stacked on the first: you pay for its tokens whether or not it catches anything, and every response it inspects waits on it before it reaches the user. That can materially increase both token cost and tail latency. How much depends on judge size, prompt length, retrieved context, and caching. The direction does not change: an extra inference is never free. And there is a deeper problem: an LLM judge shares correlated failure modes with the output it judges, because it is the same class of system. Using an LLM to certify an LLM is asking a probabilistic system to vouch for a probabilistic system.

That may be sufficient when the stakes are low. It becomes insufficient as the sole control when the stakes rise, and the answer is a different substrate, not a bigger judge. Think of it as a ladder you climb only as far as the risk tier demands.

Rung	Substrate	Nature	Cost and latency	Best for
1	LLM as judge	Probabilistic	Highest per check	Fuzzy, subjective checks where a wrong answer is cheap to reverse
2	Small or deterministic models and rules	Deterministic for rules; more reproducible but still probabilistic for small models	Low, fast	Well-scoped checks: PII, schema conformance, numeric ranges
3	Formal methods and automated reasoning	Mathematically verifies a specified property against a formalized policy	Expensive to set up, cheap to run	Specifiable, high-stakes, irreversible policy checks

Rung two is where an earlier finding on this blog lands. In a post on output validation I showed that smaller models at temperature zero produced consistent outputs where a much larger model did not. For a compliance bar, reproducibility is the point, and a small task-specific classifier or a plain deterministic rule beats a frontier judge on cost, latency, and auditability for well-scoped checks. Vendors like Plurai are pushing small-model guardrails in this direction. Benchmark them against a deterministic baseline before you assume they beat a regular expression.
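For a sense of what the deterministic end of rung two looks like, here is a minimal sketch of the three well-scoped checks named in the table: schema conformance, a numeric range, and PII patterns. The field names (`summary`, `risk_score`), the `check_output` function, and the two regular expressions are assumptions for illustration. The patterns are deliberately simple and would miss plenty of real PII.

```python
import json
import re

# Deliberately simple patterns; real PII detection needs far more coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_LIKE_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def is_number(value) -> bool:
    """True for int or float, but not bool (which is an int subclass)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_output(raw: str) -> list:
    """Return a list of failure strings; an empty list means every check passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["format: not valid JSON"]
    if not isinstance(data, dict):
        return ["format: top-level value is not an object"]

    failures = []
    summary = data.get("summary")
    score = data.get("risk_score")

    if not isinstance(summary, str):
        failures.append("schema: summary missing or not a string")
    elif EMAIL_RE.search(summary) or SSN_LIKE_RE.search(summary):
        failures.append("pii: summary contains an email or SSN-like pattern")

    if not is_number(score):
        failures.append("schema: risk_score missing or not a number")
    elif not 0.0 <= score <= 1.0:
        failures.append("range: risk_score outside [0, 1]")

    return failures


print(check_output('{"summary": "Q3 revenue rose.", "risk_score": 0.2}'))
print(check_output('{"summary": "Reach jane@example.com", "risk_score": 1.7}'))
```

The same input always yields the same verdict, and each failure names the rule that fired. That combination is what makes this kind of check cheap to audit.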

Rung three, formal methods, is the one most enterprises have not reached yet.

What Proving It Actually Looks Like

I watched this work before it had anything to do with generative AI. I worked in identity at AWS, and identity is where automated reasoning earned its keep. AWS has a system called Zelkova that takes an access policy, such as an IAM policy or an S3 bucket policy, and translates it into precise mathematical logic. As AWS laid out in proving security at scale with automated reasoning, it hands that logic to SMT solvers (satisfiability modulo theories solvers, the industrial descendants of the SAT solvers you meet in school) and answers questions like “can anyone outside this account read this bucket.”

The kind of answer matters. A test checks the requests you thought to try. Zelkova checks every possible request, because it reasons over the math instead of sampling the behavior. IAM Access Analyzer, the capability many AWS customers use to find resources exposed outside their zone of trust, is Zelkova underneath, and by Amazon’s own account the engine fields more than a billion SMT queries a day. AWS calls it provable security.
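The difference between sampling and checking every request can be shown with a toy. The sketch below is not Zelkova and uses no SMT solver: it brute-forces a tiny, finite request space in plain Python, where a real solver reasons symbolically over a space far too large to enumerate. The policy format, account IDs, and function names are all hypothetical.

```python
from itertools import product

# Hypothetical, tiny request space. "111" is the bucket owner.
ACCOUNTS = ["111", "222", "333"]
ACTIONS = ["GetObject", "PutObject", "ListBucket"]
OWNER = "111"

# Toy allow-only policy; "*" is a wildcard.
POLICY = [
    {"principal": "111", "action": "*"},
    {"principal": "*", "action": "ListBucket"},  # the accidental exposure
]


def allows(policy, account, action):
    """True if any statement in the policy permits this request."""
    return any(
        s["principal"] in ("*", account) and s["action"] in ("*", action)
        for s in policy
    )


def external_access(policy):
    """Every (account, action) pair outside the owner that the policy allows."""
    return [
        (acct, act)
        for acct, act in product(ACCOUNTS, ACTIONS)
        if acct != OWNER and allows(policy, acct, act)
    ]


# A spot check that happens to pick the wrong request sees nothing wrong...
spot_check_passes = not allows(POLICY, "222", "GetObject")
# ...while checking the whole space finds the exposure.
exposed = external_access(POLICY)
print(spot_check_passes, exposed)
```

The spot check passes and the exhaustive check still reports two external accounts that can list the bucket. Enumeration only works because this space has nine requests; the solver-based approach exists for the spaces where it does not.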

Byron Cook, who co-authored the Zelkova work, now leads the AWS group that took the same technique to generative AI. In August 2025 AWS made Automated Reasoning checks in Amazon Bedrock Guardrails generally available. The mechanism is the tell: a language model translates a natural-language policy and a model output into logical formulas, then a constraint solver verifies whether the output is consistent with the policy. AWS reports up to 99% accuracy at identifying verifiable responses, alongside an explanation of which rule was satisfied or violated. The same class of mathematical reasoning used to prove access-control properties is now being used to check model responses against formalized business policies.

The top rung comes with real limits. Automated reasoning works only where you can state the policy precisely. A capital-adequacy rule, a disclosure requirement, an entitlement boundary, a numeric threshold: specifiable, and provable. “Was this answer helpful and on-brand” is not, and no solver will help you. There is a second catch. The first stage still uses a language model to turn natural language into formal logic, and as Byron Cook and Kathleen Fisher discuss, that translation can be imperfect in ways that verifying hand-written code is not. You are proving the output against the formalized policy, which is only as good as the formalization. That is exactly why this substrate belongs at the top of the ladder and not under every check. You reserve it for the tier where the cost of being wrong justifies the cost of formalizing the rule.

Who Gets To Say What Good Means

There is a reason evaluation feels like a data-science problem. The people who can specify what a correct answer looks like are usually not the people who can build an evaluator. A compliance officer knows the disclosure rule cold. A relationship manager knows what a suitable recommendation is. Neither of them writes prompts or SMT formulas, so “what good looks like” gets translated, lossily, by an engineer who does not own the risk.

The most important thing about the Zelkova model was never the solver; it was the interface. AWS did not ask customers to write logic. Its services ask the questions on the customer’s behalf, so a customer gets a mathematical guarantee without knowing what an SMT solver is. Byron Cook makes the same point about the generative-AI version. In a conversation with Werner Vogels he frames the goal as putting automated reasoning in the hands of non-scientists.

That is the design principle an enterprise evaluation platform needs. The business owner authors the policy in language they own. The platform compiles it down to the right substrate for the risk tier: a prompt for the judge, a rule for the deterministic check, a formal policy for the solver. The owner never sees the substrate. They see whether the answer met the bar they set. AI is a business transformation, and evaluation only earns that word when the business can state its own definition of good and trust that the machinery underneath enforces it.

Where Teams Get This Wrong

Verifying cheap things expensively. I’ve seen governance teams spend hours reviewing AI outputs that, if wrong, would cost nothing to fix and affect no one outside the team. The verification cost exceeded the error cost by orders of magnitude. When I ask “what happens if this is wrong?”, the answer is often “well, nothing really.” Then why are we spending reviewer hours on it?

Under-verifying expensive things because the process is uniform. When everything gets the same review, reviewers develop a constant pace. They can’t shift into deep-scrutiny mode for the high-stakes item because they’ve been trained by the process to treat all items equally. The result is flat checking: the same shallow pass everywhere, so the high-stakes item never gets the depth it needed.

Confusing verification with validation. Verification asks “is this output correct?” Validation asks “is this the right output to produce?” Most teams over-invest in verification (tactical, per-item checking) and under-invest in validation (strategic, system-level questioning).

I’ve seen teams meticulously verify every field in a model’s output while never asking whether the model should be producing that output category at all. That’s spending the budget on the wrong ledger entirely. I explored this distinction further in the evaluation stack is not the governance platform.

Treating automation as a verification replacement rather than a budget multiplier. Automation doesn’t eliminate the need for verification budgeting. It changes the cost curve. Automated checks are cheap per-unit, so you can afford to run them broadly. Human review is expensive per-unit, so it should be reserved for the cases automation can’t resolve.

The mistake is thinking “we automated verification, so we’re done.” The right framing is: automation expanded our budget, now we need to reallocate the human portion to higher-value targets.
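One way to picture automation as a budget multiplier is a cascade: cheap automated checks run on everything, and only what they cannot resolve goes to a person. A minimal sketch follows, with hypothetical check functions and a three-valued verdict (pass, fail, cannot decide) that is an assumption of this example.

```python
from typing import Callable, List, Optional

# True = pass, False = fail, None = this check cannot decide
Check = Callable[[str], Optional[bool]]


def not_empty(text: str) -> Optional[bool]:
    """Cheap structural check: always decidable."""
    return bool(text.strip())


def no_unverified_claims(text: str) -> Optional[bool]:
    # A stand-in heuristic: sourced claims need a person to confirm the source.
    return None if "according to" in text.lower() else True


def run_cascade(text: str, checks: List[Check]) -> str:
    """Reject on any failure, escalate on any undecided check, else approve."""
    verdicts = [check(text) for check in checks]
    if any(v is False for v in verdicts):
        return "rejected"
    if any(v is None for v in verdicts):
        return "needs_human_review"
    return "approved"


CHECKS = [not_empty, no_unverified_claims]
for sample in ["", "Meeting moved to Tuesday.", "According to the audit, costs fell."]:
    print(repr(sample), "->", run_cascade(sample, CHECKS))
```

Only the third sample reaches a reviewer. The human portion of the budget is spent on the one item automation could not settle.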

Next Steps

If you’re running an AI governance program or any system where output quality matters, start by making the implicit explicit:

Map your current verification spend. Not in dollars (though that helps), but in hours. Where are your reviewers spending time? What’s the stakes-to-effort ratio for each category? Most teams have never actually measured this, and the answer is usually surprising.

Define your tiers. If you’re at startup scale, three tiers (high/medium/low) with clear criteria for each are plenty. If you’re in a regulated enterprise, you probably need four or five tiers with documented rationale for each boundary. The criteria should be based on stakes, reversibility, and uncertainty, not on content type or team ownership.

Automate the low-tier verification and reinvest the savings. The point of automating routine checks isn’t cost reduction (though you’ll get that). It’s freeing up human attention for the verification tasks that actually require judgment. I’ve written about what this looks like architecturally in from periodic reviews to continuous governance.

Audit quarterly. Verification budgets drift. The model that was high-uncertainty six months ago might now have enough production data to move to a lower tier. The process that was low-stakes might have been promoted to customer-facing. Revisit your allocation on a regular cadence, not just when something fails.
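Mapping spend does not need tooling to start. Here is a minimal sketch, assuming you can export a review log as (category, tier, reviewer hours) rows. The log entries below are invented for illustration, not measurements, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical review log: (category, tier, reviewer_hours)
REVIEW_LOG = [
    ("internal summaries", "low", 12.0),
    ("marketing drafts", "low", 9.0),
    ("credit decisions", "high", 4.0),
    ("internal summaries", "low", 6.0),
]


def hours_by_tier(log):
    """Total reviewer hours spent on each verification tier."""
    totals = defaultdict(float)
    for _category, tier, hours in log:
        totals[tier] += hours
    return dict(totals)


def share_of_hours(log, tier):
    """Fraction of all reviewer hours that went to the given tier."""
    totals = hours_by_tier(log)
    total = sum(totals.values())
    return totals.get(tier, 0.0) / total if total else 0.0


print(hours_by_tier(REVIEW_LOG))
print(round(share_of_hours(REVIEW_LOG, "high"), 2))
```

If the high tier's share of hours is small while low-tier categories absorb most of the time, that is the reallocation to make. Rerunning the same tally each quarter shows whether the allocation has drifted.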

A note on where this one came from. A colleague, Richard Song, asked me recently how I decide what to write and how I get from a blank page to a draft. The honest answer is unglamorous. I record voice notes when I am driving, I keep scratch notes that are half formal and half nonsense, and I collate the fragments until an argument shows up. This post started as one of those fragments, after I watched Byron Cook talk about automated reasoning and realized the thing that wowed me in identity years ago had quietly become the answer to a question my evaluation work kept hitting. Kudos to Richard for the nudge. The best posts usually start as a question someone else asked you.

Verification is a budget. Spend it where the risk is, on the cheapest substrate that clears the bar. The trusted path should be the fastest and cheapest path when the risk does not justify the spend, and a proven one when it does.