Socrates was executed for asking people to define their terms. The depth estimation community has so far been spared this fate, which may explain why the word "uncertainty" appears in over two hundred papers and means something different in each one.
This is not a complaint about imprecise language. It is a claim that the field is producing a quantity it cannot define, evaluating it against no standard, and shipping it to no consumer. If Socrates wandered into a computer vision lab and asked "what do you mean by uncertainty?", the ensuing dialogue would be instructive, and nobody would enjoy it.
Let us try it anyway.
The First Answer: "It's the Model's Confidence"
This is the answer you get from the first paper you read. The model produces a depth map and a scalar per pixel, often from a sigmoid or softplus output head (sigmoid squashes any real value into (0, 1); softplus maps it to a positive number), that is "higher where the model is less certain." It correlates with error magnitude in visualizations. It looks reasonable. The paper calls it uncertainty and moves on.
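To make the pattern concrete, here is a minimal sketch of the kind of output head this describes, in PyTorch-style code; the class and variable names are illustrative, not taken from any particular paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthWithConfidence(nn.Module):
    """Illustrative head: per-pixel depth plus a per-pixel 'confidence' scalar.
    The sigmoid output lives in (0, 1) but carries no probabilistic semantics
    unless it is explicitly calibrated against observed errors."""
    def __init__(self, feat_channels: int):
        super().__init__()
        self.depth_head = nn.Conv2d(feat_channels, 1, kernel_size=1)
        self.conf_head = nn.Conv2d(feat_channels, 1, kernel_size=1)

    def forward(self, features: torch.Tensor):
        depth = F.softplus(self.depth_head(features))          # positive depth values
        confidence = torch.sigmoid(self.conf_head(features))   # unitless score in (0, 1)
        return depth, confidence
```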
Socrates would ask: what is the unit of this quantity?
It has no unit. It is a learned feature, jointly optimized with the depth prediction, that the training objective pushes to correlate with something. With what, exactly? With the magnitude of the loss at that pixel during training. It is a learned proxy for "this pixel was hard to fit," which is not the same as "this pixel's depth estimate could reasonably be off by this much."
A thermometer reading is not a temperature uncertainty. A difficulty score is not an error bound. Calling a learned scalar "confidence" and then inverting it to get "uncertainty" is a category error so widespread that pointing it out feels rude. But Socrates was never polite.
What this actually is: a learned per-pixel feature with no statistical semantics, no calibration, and no defined relationship to the error distribution. You cannot use it to construct an interval. You cannot state a coverage probability. You cannot combine it with another measurement's uncertainty to assess whether a detected difference is real. It is, in the most precise sense, not an uncertainty estimate. It is a number.
The Second Answer: "It's Aleatoric Uncertainty"
Now we are in Kendall & Gal territory ("What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?", NeurIPS 2017), the canonical reference for decomposing uncertainty into aleatoric (data) and epistemic (model) components. The model predicts depth and a per-pixel log-variance, trained under a negative log-likelihood loss with a Gaussian or Laplacian observation model. The variance is the aleatoric uncertainty: irreducible ambiguity in the data at that pixel. Textureless sky? High variance. Sharp edge with good contrast? Low variance.
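As a sketch of what this looks like in training code, here is a Laplace negative log-likelihood with a predicted log-scale, in the spirit of Kendall & Gal; the function and argument names are illustrative.

```python
import torch

def laplace_nll(pred_depth, pred_log_scale, gt_depth, valid_mask):
    """Per-pixel NLL under a Laplace observation model with predicted scale b.
    Predicting log b keeps the scale positive and the gradients stable.
    Up to an additive constant: -log p(y | mu, b) = |y - mu| / b + log b."""
    abs_err = torch.abs(pred_depth - gt_depth)
    nll = abs_err * torch.exp(-pred_log_scale) + pred_log_scale
    return nll[valid_mask].mean()
```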
This is a real answer. It has formal probabilistic semantics. The negative log-likelihood loss defines an explicit likelihood, and the predicted variance is a parameter of that likelihood. Socrates would nod, briefly.
Then he would ask: is the likelihood correctly specified?
The Gaussian assumption means you are asserting that depth errors are normally distributed with the predicted variance at each pixel, independently across pixels, with no spatial correlation, no range-dependent bias, and no systematic component. None of these things are true. Depth errors are spatially correlated (nearby pixels share features), range-dependent (errors grow with distance), systematically biased (models underpredict at range boundaries), and non-Gaussian (heavy tails from occlusion boundaries and reflections).
And then: have you checked whether the predicted variance matches the observed error distribution?
Almost never. The standard evaluation is to report the negative log-likelihood on a held-out set as a scalar, which tells you the average quality of the probabilistic fit, aggregated over the entire dataset. It does not tell you whether the model's predicted 95% intervals contain 95% of the true values. That is a calibration check, and it is essentially absent from the MDE literature.
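The check itself is not hard to run. A minimal sketch, assuming the model's predicted scale has been converted to a per-pixel standard deviation (array names are illustrative):

```python
import numpy as np

def empirical_coverage(pred_depth, pred_sigma, gt_depth, k=2.0):
    """Fraction of valid pixels whose ground truth lies inside mu +/- k*sigma.
    For a calibrated Gaussian model, k=2 should cover roughly 95% of pixels;
    a large gap between nominal and empirical coverage is the miscalibration
    that the aggregate NLL never shows."""
    inside = np.abs(gt_depth - pred_depth) <= k * pred_sigma
    return float(inside.mean())
```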
What this actually is: a variance parameter in a misspecified likelihood, trained end-to-end, never validated against empirical error distributions, and reported as a dataset-aggregate scalar. It has formal semantics that are violated in practice and no empirical check that would catch the violation. Socrates would call this a belief about uncertainty that has never been tested against reality, which in classical Athens was approximately the definition of opinion rather than knowledge.
The Third Answer: "It's Epistemic Uncertainty"
Now we reach the sophisticated interlocutor. Epistemic uncertainty captures model ignorance: the model has been trained on finite data and has finite capacity, so different plausible models would give different predictions. You approximate this by running multiple forward passes with variation (Monte Carlo dropout, deep ensembles in the style of Lakshminarayanan et al. 2017, test-time augmentation) and measuring the spread.
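For concreteness, a sketch of the Monte Carlo dropout variant, assuming a PyTorch model with dropout layers; the knobs exposed here (dropout rate, number of passes) are exactly what the next question is about.

```python
import torch

def mc_dropout_depth(model, image, n_passes=20):
    """Per-pixel mean and spread over stochastic forward passes.
    Only dropout layers are switched to train mode so batch-norm statistics
    stay frozen. The returned std is a dispersion statistic, not a
    calibrated uncertainty."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        preds = torch.stack([model(image) for _ in range(n_passes)])
    return preds.mean(dim=0), preds.std(dim=0)
```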
Socrates would ask: spread of what, relative to what?
The spread is a function of the approximation method. MC dropout with rate 0.1 gives different spread than rate 0.3. An ensemble of 5 gives different spread than an ensemble of 20. Test-time augmentation with horizontal flips gives different spread than augmentation with color jitter. The "uncertainty" you measure is not a property of the model's ignorance. It is a property of how you chose to probe the model's ignorance.
And: what is the ground truth for epistemic uncertainty?
There is none. You can measure that the model disagrees with itself, but you have no oracle for how much it should disagree. High epistemic uncertainty might mean "this is out of distribution" or "dropout happened to destabilize this particular feature map" or "the augmentation changed the scale in a way that shifts predictions." You cannot distinguish these without additional information that the method does not provide.
What this actually is: a dispersion statistic across multiple stochastic forward passes, with magnitude determined by the approximation method, no ground truth, and no calibration. It is uncertainty about the model's uncertainty, which is either profound or useless depending on your tolerance for recursion.
The Fourth Answer: The One Nobody Gives
Measurement uncertainty, in the sense measurement science uses the term: a result tied to a reference through traceability, with a stated uncertainty and a known error structure under documented conditions, not just numbers in meters.
The VIM (International Vocabulary of Metrology) definition: a non-negative parameter characterizing the dispersion of the quantity values being attributed to a measurand (here, the true depth at a specified pixel under a specified imaging condition), based on the information used.
In practice: you identify every source of error (systematic bias, random noise, model, sensor, environmental), propagate them through a measurement model, combine them, and produce an interval with a stated coverage that is validated empirically. If you say ±8mm at k=2 (expanded uncertainty, roughly a 95% interval), then on independent test data the interval contains the true depth 95% of the time.
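A worked toy example of the combination step, following the GUM's root-sum-square rule for uncorrelated components; the component values here are invented for illustration.

```python
import math

# Hypothetical uncertainty budget for one depth measurement, in millimetres.
components_mm = {
    "sensor_noise": 2.0,        # Type A: spread of repeated measurements
    "model_residual": 3.0,      # Type A: residual spread against reference targets
    "scale_calibration": 1.5,   # Type B: from the calibration certificate
}

# Combined standard uncertainty (assumes the components are uncorrelated).
u_c = math.sqrt(sum(u ** 2 for u in components_mm.values()))
U = 2.0 * u_c  # expanded uncertainty at k=2, ~95% coverage if errors are roughly Gaussian

print(f"u_c = {u_c:.1f} mm, U = {U:.1f} mm (k=2)")
```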
This is what measurement uncertainty means in every other field that measures things: in dimensional metrology, in surveying, in clinical chemistry, in analytical physics. The vocabulary is standardized (the GUM: Guide to the Expression of Uncertainty in Measurement, JCGM 100:2008). The validation methods are standardized (ISO 5725 on accuracy, trueness, and precision). The calibration artifacts are standardized (VDI/VDE 2634 for optical 3D measuring systems).
In monocular depth estimation, none of this exists. Not because it is impossible. Not because the field tried and failed. Because nobody has applied the framing. The metrological vocabulary, the evaluation protocol, the validation methodology — all of it is sitting in measurement science textbooks, waiting for someone to connect it to the learned depth community.
Socrates would nod longer this time. Then he would ask three questions, each worse than the last.
First: you say the interval is validated empirically. Validated on what? Your thirty calibration sites? And you claim this generalizes to unseen infrastructure?
Fair. The empirical validation is only as strong as the test data is representative. You have replaced an unvalidated model assumption (Gaussian likelihood, definition 2) with a validated empirical claim (95% coverage on held-out data). But the empirical claim is itself bounded by the diversity of the calibration corpus. You have traded a theoretical weakness for an empirical one. This is progress — empirical weaknesses can be measured and addressed — but it is not certainty.
Second: how do you know your calibration transfers? A model calibrated on steel bridges in Louisiana: does the coverage hold on concrete cooling towers in Nevada?
This is the calibration transfer problem, and it is genuinely open. The classical metrology answer is that you re-calibrate for each new measurement context. The practical answer for a learned system is that you need enough diversity in the calibration corpus to cover the major axes of variation (infrastructure type, material, scale, sensor configuration, environmental conditions) and then validate on held-out combinations. Whether 30 sites or 300 sites suffices is an empirical question. The honest answer is: we do not know yet, and characterizing the boundaries of calibration transfer is part of what makes this research necessary.
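The "validate on held-out combinations" step can be made concrete as a per-stratum coverage check; a sketch, with array and label names that are illustrative.

```python
import numpy as np

def coverage_by_stratum(abs_errors, pred_sigmas, strata, k=2.0):
    """Empirical k-sigma coverage computed separately for each held-out
    stratum (site, material, range band, ...). Calibration that holds in
    aggregate but collapses in some stratum is exactly the transfer
    failure to look for."""
    results = {}
    for label in np.unique(strata):
        sel = strata == label
        results[label] = float((abs_errors[sel] <= k * pred_sigmas[sel]).mean())
    return results
```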
Third, the one that draws blood: the GUM assumes you understand the measurement model well enough to propagate uncertainty through it analytically. Your measurement model is a neural network with millions of parameters. You can validate the output empirically, but you cannot trace the uncertainty through the model the way a metrologist traces it through a chain of calibrated instruments. You have an opaque instrument that produces intervals with correct coverage on your test set. Is empirical validation without mechanistic traceability sufficient to call it a measurement?
This is the hardest question, and it deserves an honest answer rather than a dodge. Classical metrology was built for instruments you can take apart — a micrometer, a theodolite, a coordinate measuring machine. You understand each component's error contribution. You propagate uncertainty through the measurement equation because you have a measurement equation. A neural network does not offer this. The function exists, but it is not interpretable in a way that permits component-wise uncertainty decomposition.
The honest position is: empirical validation is necessary and sufficient for the user. The engineer stamping the drawing cares that the intervals hold on independent test data under documented conditions. She does not need to trace the uncertainty through layer 47 of a ViT. But empirical validation is insufficient for the metrologist, who would want the mechanistic account: the uncertainty budget with named components and propagation rules.
This is a genuine tension, not a flaw to be papered over. Resolving it, or arguing convincingly that end-to-end empirical validation is a legitimate mode of metrological certification for learned measurement systems, is part of what makes this research interesting rather than routine. The GUM was written for a world of understood instruments. We are building instruments we do not fully understand. The metrological framework must stretch to accommodate that, or it will break.
Socrates would have appreciated the honesty. He still would have kept asking.
What this actually is: the thing uncertainty should mean if you intend for the output to be used as a measurement. The definition is not novel. The application to MDE is. And the application raises a foundational question — whether empirical coverage guarantees are sufficient for metrological legitimacy when the instrument is opaque — that the field has not yet confronted because it has not yet needed to.
Why This Matters Beyond Vocabulary
This is not a pedantic taxonomy. The four definitions produce four different decisions when a downstream system consumes the output:
| Definition | Bridge inspector asking: "Is the clearance above 4.5m?" |
|---|---|
| Definition 1 (learned score) | "The confidence is 0.73." What does that mean? Can I stamp this? |
| Definition 2 (aleatoric variance) | "The predicted variance is 0.04 m²." Is that calibrated? What coverage? Against what reference? |
| Definition 3 (epistemic spread) | "Five forward passes gave depths between 4.48m and 4.67m." Is that spread meaningful? Would ten passes give a different answer? |
| Definition 4 (metrological) | "The depth is 4.58m ± 0.07m (k=2, 95% coverage), validated against NIST-traceable calibration artifacts under documented conditions." Yes, the clearance exceeds 4.5m with quantified confidence. Stamp it. |
Only one of these answers is usable by an engineer making a decision with liability attached. The gap between definitions 1–3 and definition 4 is not a gap of precision. It is a gap of kind. The first three are statements about the model. The fourth is a statement about the physical world.
The Socratic Conclusion
Socrates did not resolve his dialogues. He left his interlocutors aware that they did not know what they thought they knew, which was considered the beginning of wisdom and also the beginning of the end for Socrates.
The conclusion here is similar but less fatal: the depth estimation community has been producing "uncertainty" estimates for nearly a decade without agreeing on what the word means, what the output's semantics are, how to validate it, or who would use it. This is not a criticism of any individual paper. Every paper is internally consistent within its own definition. The problem is that there are four definitions, they are not equivalent, and the field talks about them as if they are.
The metrological definition does not render the other definitions useless. Aleatoric variance is a valid component of a full uncertainty budget. Epistemic uncertainty flags out-of-distribution inputs, which is operationally valuable. Learned confidence scores have their uses in downstream optimization.
But if the claim is that a depth model produces output with quantified uncertainty — language that implies a consumer can make a decision based on the stated bounds — then only the metrological definition delivers on the promise. Everything else is a draft of a promissory note, written in a currency that has not yet been defined.
Socrates would have found this interesting. He also would not have funded it.