Here's something worth saying plainly: we are building extraordinary things on compute that cannot tell you whether it's correct.
Not "might be inaccurate." Not "could contain errors." Cannot tell you. There is no theoretical framework for bounding the error of a free-text generation. The model's internal probability over tokens has nothing to do with factual or logical correctness. This isn't a criticism. It's a property of the system, the same way non-associativity is a property of IEEE 754 floating-point. You don't get mad at floating-point for being non-associative. You learn when it matters and when it doesn't.
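The non-associativity claim is easy to demonstrate. A minimal sketch (the particular constants are just one illustrative choice; any mix of large and small magnitudes works):

```python
# IEEE 754 double-precision addition is not associative: the grouping
# of operations changes the result, because rounding happens at each step.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # exact cancellation first, so the 1.0 survives
right = a + (b + c)  # 1.0 is absorbed into -1e16 by rounding, then cancels

print(left)           # 1.0
print(right)          # 0.0
print(left == right)  # False
```

Neither answer is a bug; both are correct under the standard's rounding rules. That is exactly the point: the error is characterized, so you can learn when it matters.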
The problem is that we don't have a shared vocabulary for doing that with LLM-generated code yet. And right now, those outputs account for 42% of all committed code at companies using AI tools, with developers expecting that to hit 65% by 2027.
So let's talk about what happens when the bottom of the pyramid falls out.
The Pyramid
I've been working on an interactive visualization that frames computational methods as a hierarchy, like Maslow's pyramid but inverted in a way that should make you uncomfortable. The base, where most of the FLOPs are being burned, is the least deterministic compute. The apex, where comparatively little runs, is the most deterministic. Seven tiers, from contract law to confabulation:
1. Fixed-point mainframe arithmetic: every digit exact, every result reproducible across decades.
2. Integer and cryptographic computation: SHA-256 doesn't care what hardware you run it on.
3. Rule-based and symbolic systems: SQL returns the same result set on any conforming database.
4. IEEE 754 scientific computing: deterministic in theory, non-reproducible in practice the moment you parallelize.
5. Probabilistic and statistical methods: intentionally stochastic, but with known distributions.
6. Deterministic DNN inference: a fixed function with no error bounds.
7. Generative AI: temperature-scaled sampling with unquantifiable uncertainty.
The pyramid is not a quality ranking. It's an assurance map. Each tier trades off four properties differently: reproducibility, error characterization, auditability, and well-posedness. A Monte Carlo simulation can tell you exactly how confident it is. Your CFD simulation probably can't. A COBOL COMPUTE statement on a 1985 IBM 3090 and a 2024 z16 produces identical output. An LLM can't reproduce its own output from ten seconds ago.
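The claim that a Monte Carlo method "can tell you exactly how confident it is" is worth making concrete. A minimal sketch, estimating pi by sampling the unit square; the estimator reports its own standard error via the central limit theorem:

```python
# A Monte Carlo estimator carries its own uncertainty: the standard error
# of the sample mean shrinks as 1/sqrt(n). Intentionally stochastic,
# but with a known distribution.
import math
import random

random.seed(42)  # seeding restores reproducibility within the stochastic tier

n = 100_000
hits = [1.0 if random.random() ** 2 + random.random() ** 2 <= 1.0 else 0.0
        for _ in range(n)]

mean = sum(hits) / n                              # fraction inside quarter circle
var = sum((h - mean) ** 2 for h in hits) / (n - 1)
stderr = math.sqrt(var / n)

estimate = 4 * mean
print(f"pi ~= {estimate:.4f} +/- {4 * 1.96 * stderr:.4f} (95% CI)")
```

No CFD solver hands you a 95% confidence interval on its discretization error this cheaply. That asymmetry is what the Error Characterization column below is measuring.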
Computational Assurance Matrix
The pyramid gives a useful hierarchy, but its single axis hides the real trade-offs. The infographic below is the matrix view: four axes that distinguish the tiers, and that a single "determinism score" would flatten.
- Reproducibility: can you get the same answer twice?
- Error characterization: can you bound how wrong the answer might be?
- Auditability: can you explain why you got that answer?
- Well-posedness: does the answer space have a "correct" value?
| Tier | Reproducibility | Error Characterization | Auditability | Well-Posedness |
|---|---|---|---|---|
| Fixed-Point / BCD | 100% (Definitive) | 100% (Definitive) | 100% (Definitive) | 100% (Definitive) |
| Integer / Cryptography | 100% (Definitive) | 100% (Definitive) | 95% (Definitive) | 90% (Strong) |
| Rule-Based / Symbolic | 100% (Definitive) | 85% (Strong) | 100% (Definitive) | 75% (Moderate) |
| IEEE 754 / HPC | 50% (Partial) | 80% (Moderate) | 85% (Strong) | 95% (Definitive) |
| Probabilistic / MCMC | 40% (Weak) | 95% (Definitive) | 75% (Moderate) | 85% (Strong) |
| DNN Inference | 60% (Partial) | 20% (Minimal) | 15% (Minimal) | 80% (Moderate) |
| Generative AI / LLMs | 5% (Absent) | 5% (Absent) | 10% (Minimal) | 25% (Minimal) |
The point is not that one tier is "good" and another is "bad." The point is that each tier fails differently, and those failure modes matter more than a single headline score.
The Razor
Werner Vogels called the growing gap between code generation speed and our ability to verify it "verification debt." The CACM formalized it: technical debt is a bet about future cost of change; verification debt is unknown risk you are running right now. Leo de Moura, creator of Lean, put the mandate plainly: verification must scale with generation.
It hasn't. It won't at the current trajectory. And I'd argue it can't within the non-deterministic tier, because verification requires an oracle, and for free-text generation, the oracle is incomplete by definition.
Here's the razor I'd propose, and I'll be blunt about it:
The Verification Razor: The value of generated output cannot exceed the cost you are willing to pay to verify it. If you cannot afford to verify it, you cannot afford to ship it.
This is not a conservative position. This is arithmetic. METR ran a randomized controlled trial on experienced open-source developers and found that AI tools increased task completion time by 19% on average. Not decreased. Increased. The generation got faster. The verification ate the savings and then some. The developers themselves didn't notice. They estimated AI saved them 20% even as it cost them 19%. The vibes were good. The clock disagreed.
96% of developers don't fully trust AI-generated code to be functionally correct. Only 48% always check it before committing. That's a coin flip on whether anyone verified the code that's running your billing system.
The Tier-Crossing Problem
The real danger isn't using Tier 7 compute. It's using Tier 7 compute and pretending you're in Tier 4.
When an LLM generates engineering parameters that feed directly into a reservoir simulation, you have crossed a tier boundary without a verification gate. The model's unquantifiable uncertainty now contaminates a system that was designed around quantifiable precision. The CFD solver has no way to know its inputs came from something that can't do arithmetic reliably. It will dutifully compute fourteen significant digits of nonsense.
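What a verification gate at that boundary could look like, sketched minimally. The parameter names and bounds here are hypothetical illustrations, not a real reservoir model; the shape is what matters: the Tier 4 side declares what it will accept, and generated values never reach the solver unvalidated.

```python
# Sketch of a verification gate at a tier boundary: LLM-suggested parameters
# are checked against hard physical bounds before they may cross into the
# simulation. All names and ranges below are made up for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Bounds:
    lo: float
    hi: float

# Declared by the deterministic side; the generative side never writes
# directly into the solver's input deck.
PARAM_BOUNDS = {
    "porosity": Bounds(0.0, 0.45),            # pore-volume fraction
    "permeability_md": Bounds(0.01, 10_000),  # millidarcies
}

def gate(llm_params: dict[str, float]) -> dict[str, float]:
    """Reject any generated parameter outside its declared physical bounds."""
    for name, value in llm_params.items():
        b = PARAM_BOUNDS.get(name)
        if b is None:
            raise ValueError(f"unknown parameter {name!r}: refuse by default")
        if not (b.lo <= value <= b.hi):
            raise ValueError(f"{name}={value} outside [{b.lo}, {b.hi}]")
    return llm_params  # only now may the values cross the tier boundary

gate({"porosity": 0.18, "permeability_md": 250.0})  # passes
# gate({"porosity": 1.3})  # would raise: physically impossible
```

A bounds check is the weakest possible gate, and it still beats the status quo of no gate at all. Stronger versions replay the values through an independent model or require a human sign-off.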
When a "vibe coded" web app handles financial transactions, you're running what is functionally BCD-grade arithmetic (Tier 1 requirements) on Tier 7 foundations. That can be fine, if you've verified the arithmetic paths. It's a problem if nobody thought to check because the code "looked right."
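The arithmetic path that needs checking is not exotic. Binary floats cannot represent most decimal fractions, which is precisely why Tier 1 hardware used BCD; in Python, `decimal.Decimal` recovers that behavior:

```python
# Why Tier 1 arithmetic requirements bite: naive float money math drifts,
# while decimal arithmetic is exact and its rounding is explicit.
from decimal import Decimal, ROUND_HALF_UP

print(0.1 + 0.2 == 0.3)                                       # False
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))   # True

# Rounding a payment amount explicitly, the way a ledger must:
amount = Decimal("19.995").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(amount)  # 20.00
```

A generated billing path that uses bare floats will pass a "looks right" review and every small-value test, then drift at scale. That is the check nobody thought to make.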
This isn't hypothetical. It's Tuesday.
Software Is Free Now (So Now What?)
The smartest people in the room are already adapting. At Unprompted Con this week, Rami McCarthy said on stage that software is throwaway now. He doesn't even bother keeping it. OpenAI researchers have been saying software is "free." And on Friday, Maggie Appleton spotted the logical endpoint: OpenAI's new Symphony orchestrator doesn't ship software at all. It ships a spec. The README tells you to prompt your agent to build the software yourself.
They're all right. And this is genuinely exciting.
For a huge class of problems, the artifact doesn't matter. Generate it, use it, throw it away, regenerate it tomorrow when the model is better. Rami is right that scripting, prototyping, internal tooling, and exploratory work have been permanently transformed. Maggie is right that the spec is becoming the durable artifact and the code is the ephemeral instantiation. This is a real paradigm shift, not hype, and fighting it is a waste of energy.
But here's what I keep turning over: if the spec is the code now, then the spec is also where verification lives. And the spec has to know which tier of assurance it requires.
A spec that says "build me a dashboard to visualize sensor data" can tolerate Tier 7 instantiation. Regenerate it every morning. Who cares. A spec that says "build me a pipeline safety interlock system" cannot. The question isn't whether ephemeral software is legitimate. It's whether we have a shared vocabulary for annotating specs with their assurance requirements, so the person (or agent) instantiating them knows which tier they're operating in.
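One possible shape for that missing vocabulary, sketched as data. Everything here is a hypothetical illustration, not a standard: a spec carries its own assurance requirement, and whoever (or whatever) instantiates it can check which tier it is allowed to land in.

```python
# Hypothetical sketch: annotating a spec with the maximum (least
# deterministic) tier its instantiation may occupy.
from dataclasses import dataclass
from enum import IntEnum

class AssuranceTier(IntEnum):
    FIXED_POINT = 1
    INTEGER_CRYPTO = 2
    RULE_BASED = 3
    IEEE_754 = 4
    PROBABILISTIC = 5
    DNN_INFERENCE = 6
    GENERATIVE = 7

@dataclass
class Spec:
    intent: str
    max_tier: AssuranceTier  # highest tier the instantiation may occupy

    def permits(self, implementation_tier: AssuranceTier) -> bool:
        return implementation_tier <= self.max_tier

dashboard = Spec("visualize sensor data", AssuranceTier.GENERATIVE)
interlock = Spec("pipeline safety interlock", AssuranceTier.RULE_BASED)

print(dashboard.permits(AssuranceTier.GENERATIVE))  # True: regenerate daily
print(interlock.permits(AssuranceTier.GENERATIVE))  # False: needs a gate
```

The enum values mirror the pyramid's seven tiers; the single comparison is the whole point. A spec that can say "no instantiation above Tier 3" turns the verification razor into something an agent can enforce mechanically.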
Right now, we don't. And the verification razor still applies: when you ship a spec instead of tested software, you haven't eliminated verification cost. You've distributed it to every consumer. For most use cases, that's a perfectly rational tradeoff, because the verification cost is low (does the dashboard render? good enough). For some use cases, it's an externality that nobody has priced yet.
The pyramid isn't an argument against ephemeral software. It's the taxonomy that tells you which software is allowed to be ephemeral.
The Bold Claim
Here it is: within five years, the ability to classify which tier of compute is appropriate for a given decision, and to enforce verification gates at tier boundaries, will be a regulated professional competence. Not a best practice. Not a framework. A competence, in the same way that a PE stamp means an engineer has attested that the structural calculations meet code.
The EU AI Act is already classifying applications by risk tier. FERC and PHMSA are watching. The SEC has issued guidance on AI-generated disclosures. The question is not whether this regulatory surface expands. The question is whether engineers and organizations build the internal taxonomy before regulators impose one.
The pyramid gives you that taxonomy. The assurance matrix gives you the multi-dimensional view that a single "determinism score" would flatten into uselessness. And the verification razor gives you the decision rule: if you can't verify it at the tier the decision requires, you don't ship it.
Software is free now. Verification never was. The organizations that thrive will be the ones that understand the difference, not the ones that pretend it doesn't exist.