Socrates was executed for asking people to define their terms. The depth estimation community has so far been spared this fate, which may explain why the word "uncertainty" appears in over two hundred papers and means something different in each one.
This is not a complaint about imprecise language. It is a claim that the field is producing a quantity it cannot define, evaluating it against no standard, and shipping it to no consumer. If Socrates wandered into a computer vision lab and asked "what do you mean by uncertainty?", the ensuing dialogue would be instructive, and nobody would enjoy it.
Let us try it anyway.
The First Answer: "It's the Model's Confidence"
This is the answer you get from the first paper you read. The model produces a depth map and a scalar per pixel, often from a sigmoid or softplus output head (sigmoid squashes any real value into (0, 1); softplus maps it to a positive number), that is "higher where the model is less certain." It correlates with error magnitude in visualizations. It looks reasonable. The paper calls it uncertainty and moves on.
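To make the pattern concrete, here is a minimal sketch of the kind of output head this describes, in PyTorch-style code; the class and variable names are illustrative, not taken from any particular paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthWithConfidence(nn.Module):
    """Illustrative head: per-pixel depth plus a per-pixel 'confidence' scalar.
    The sigmoid output lives in (0, 1) but carries no probabilistic semantics
    unless it is explicitly calibrated against observed errors."""
    def __init__(self, feat_channels: int):
        super().__init__()
        self.depth_head = nn.Conv2d(feat_channels, 1, kernel_size=1)
        self.conf_head = nn.Conv2d(feat_channels, 1, kernel_size=1)

    def forward(self, features: torch.Tensor):
        depth = F.softplus(self.depth_head(features))          # positive depth values
        confidence = torch.sigmoid(self.conf_head(features))   # unitless score in (0, 1)
        return depth, confidence
```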
Socrates would ask: what is the unit of this quantity?
It has no unit. It is a learned feature, jointly optimized with the depth prediction, that the training objective pushes to correlate with something. With what, exactly? With the magnitude of the loss at that pixel during training. It is a learned proxy for "this pixel was hard to fit," which is not the same as "this pixel's depth estimate could reasonably be off by this much."
A thermometer reading is not a temperature uncertainty. A difficulty score is not an error bound. Calling a learned scalar "confidence" and then inverting it to get "uncertainty" is a category error so widespread that pointing it out feels rude. But Socrates was never polite.
What this actually is: a learned per-pixel feature with no statistical semantics, no calibration, and no defined relationship to the error distribution. You cannot use it to construct an interval. You cannot state a coverage probability. You cannot combine it with another measurement's uncertainty to assess whether a detected difference is real. It is, in the most precise sense, not an uncertainty estimate. It is a number.
The Second Answer: "It's Aleatoric Uncertainty"
Now we are in Kendall & Gal territory ("What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?", NeurIPS 2017), the canonical reference for decomposing uncertainty into aleatoric (data) and epistemic (model) components. The model predicts depth and a per-pixel log-variance, trained under a negative log-likelihood loss with a Gaussian or Laplacian observation model. The variance is the aleatoric uncertainty: irreducible ambiguity in the data at that pixel. Textureless sky? High variance. Sharp edge with good contrast? Low variance.
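As a sketch of what this looks like in training code, here is a Laplace negative log-likelihood with a predicted log-scale, in the spirit of Kendall & Gal; the function and argument names are illustrative.

```python
import torch

def laplace_nll(pred_depth, pred_log_scale, gt_depth, valid_mask):
    """Per-pixel NLL under a Laplace observation model with predicted scale b.
    Predicting log b keeps the scale positive and the gradients stable.
    Up to an additive constant: -log p(y | mu, b) = |y - mu| / b + log b."""
    abs_err = torch.abs(pred_depth - gt_depth)
    nll = abs_err * torch.exp(-pred_log_scale) + pred_log_scale
    return nll[valid_mask].mean()
```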
This is a real answer. It has formal probabilistic semantics. The negative log-likelihood loss defines an explicit likelihood, and the predicted variance is a parameter of that likelihood. Socrates would nod, briefly.
Then he would ask: is the likelihood correctly specified?
The Gaussian assumption means you are asserting that depth errors are normally distributed with the predicted variance at each pixel, independently across pixels, with no spatial correlation, no range-dependent bias, and no systematic component. None of these things are true. Depth errors are spatially correlated (nearby pixels share features), range-dependent (errors grow with distance), systematically biased (models underpredict at range boundaries), and non-Gaussian (heavy tails from occlusion boundaries and reflections).
And then: have you checked whether the predicted variance matches the observed error distribution?
Almost never. The standard evaluation is to report the negative log-likelihood on a held-out set as a scalar, which tells you the average quality of the probabilistic fit, aggregated over the entire dataset. It does not tell you whether the model's predicted 95% intervals contain 95% of the true values. That is a calibration check, and it is essentially absent from the MDE literature.
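The check itself is not hard to run. A minimal sketch, assuming the model's predicted scale has been converted to a per-pixel standard deviation (array names are illustrative):

```python
import numpy as np

def empirical_coverage(pred_depth, pred_sigma, gt_depth, k=2.0):
    """Fraction of valid pixels whose ground truth lies inside mu +/- k*sigma.
    For a calibrated Gaussian model, k=2 should cover roughly 95% of pixels;
    a large gap between nominal and empirical coverage is the miscalibration
    that the aggregate NLL never shows."""
    inside = np.abs(gt_depth - pred_depth) <= k * pred_sigma
    return float(inside.mean())
```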
What this actually is: a variance parameter in a misspecified likelihood, trained end-to-end, never validated against empirical error distributions, and reported as a dataset-aggregate scalar. It has formal semantics that are violated in practice and no empirical check that would catch the violation. Socrates would call this a belief about uncertainty that has never been tested against reality, which in classical Athens was approximately the definition of opinion rather than knowledge.
The Third Answer: "It's Epistemic Uncertainty"
Now we reach the sophisticated interlocutor. Epistemic uncertainty captures model ignorance: the model has been trained on finite data and has finite capacity, so different plausible models would give different predictions. You approximate this by running multiple forward passes with variation (Monte Carlo dropout, deep ensembles in the style of Lakshminarayanan et al. 2017, test-time augmentation) and measuring the spread.
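For concreteness, a sketch of the Monte Carlo dropout variant, assuming a PyTorch model with dropout layers; the knobs exposed here (dropout rate, number of passes) are exactly what the next question is about.

```python
import torch

def mc_dropout_depth(model, image, n_passes=20):
    """Per-pixel mean and spread over stochastic forward passes.
    Only dropout layers are switched to train mode so batch-norm statistics
    stay frozen. The returned std is a dispersion statistic, not a
    calibrated uncertainty."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        preds = torch.stack([model(image) for _ in range(n_passes)])
    return preds.mean(dim=0), preds.std(dim=0)
```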
Socrates would ask: spread of what, relative to what?
The spread is a function of the approximation method. MC dropout with rate 0.1 gives different spread than rate 0.3. An ensemble of 5 gives different spread than an ensemble of 20. Test-time augmentation with horizontal flips gives different spread than augmentation with color jitter. The "uncertainty" you measure is not a property of the model's ignorance. It is a property of how you chose to probe the model's ignorance.
And: what is the ground truth for epistemic uncertainty?
There is none. You can measure that the model disagrees with itself, but you have no oracle for how much it should disagree. High epistemic uncertainty might mean "this is out of distribution" or "dropout happened to destabilize this particular feature map" or "the augmentation changed the scale in a way that shifts predictions." You cannot distinguish these without additional information that the method does not provide.
What this actually is: a dispersion statistic across multiple stochastic forward passes, with magnitude determined by the approximation method, no ground truth, and no calibration. It is uncertainty about the model's uncertainty, which is either profound or useless depending on your tolerance for recursion.
The Fourth Answer: The One Nobody Gives
Measurement uncertainty, in the sense measurement science uses the term: a result tied to a reference through traceability, with a stated uncertainty and a known error structure under documented conditions, not just numbers in meters.
The VIM (International Vocabulary of Metrology) definition: a non-negative parameter characterizing the dispersion of the quantity values being attributed to a measurand (here, the true depth at a specified pixel under a specified imaging condition), based on the information used.
In practice: you identify every source of error (systematic bias, random noise, model, sensor, environmental), propagate them through a measurement model, combine them, and produce an interval with a stated coverage that is validated empirically. If you say ±8mm at k=2 (expanded uncertainty, roughly a 95% interval), then on independent test data the interval contains the true depth 95% of the time.
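A worked toy example of the combination step, following the GUM's root-sum-square rule for uncorrelated components; the component values here are invented for illustration.

```python
import math

# Hypothetical uncertainty budget for one depth measurement, in millimetres.
components_mm = {
    "sensor_noise": 2.0,        # Type A: spread of repeated measurements
    "model_residual": 3.0,      # Type A: residual spread against reference targets
    "scale_calibration": 1.5,   # Type B: from the calibration certificate
}

# Combined standard uncertainty (assumes the components are uncorrelated).
u_c = math.sqrt(sum(u ** 2 for u in components_mm.values()))
U = 2.0 * u_c  # expanded uncertainty at k=2, ~95% coverage if errors are roughly Gaussian

print(f"u_c = {u_c:.1f} mm, U = {U:.1f} mm (k=2)")
```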
This is what measurement uncertainty means in every other field that measures things: in dimensional metrology, in surveying, in clinical chemistry, in analytical physics. The vocabulary is standardized (the GUM: Guide to the Expression of Uncertainty in Measurement, JCGM 100:2008). The validation methods are standardized (ISO 5725 on accuracy, trueness, and precision). The calibration artifacts are standardized (VDI/VDE 2634 for optical 3D measuring systems).
In monocular depth estimation, none of this exists. Not because it is impossible. Not because the field tried and failed. Because nobody has applied the framing. The metrological vocabulary, the evaluation protocol, the validation methodology — all of it is sitting in measurement science textbooks, waiting for someone to connect it to the learned depth community.
Socrates would nod longer this time. Then he would ask three questions, each worse than the last.
First: you say the interval is validated empirically. Validated on what? Your thirty calibration sites? And you claim this generalizes to unseen infrastructure?
Fair. The empirical validation is only as strong as the test data is representative. You have replaced an unvalidated model assumption (Gaussian likelihood, definition 2) with a validated empirical claim (95% coverage on held-out data). But the empirical claim is itself bounded by the diversity of the calibration corpus. You have traded a theoretical weakness for an empirical one. This is progress — empirical weaknesses can be measured and addressed — but it is not certainty.
Second: how do you know your calibration transfers? A model calibrated on steel bridges in Louisiana: does the coverage hold on concrete cooling towers in Nevada?
This is the calibration transfer problem, and it is genuinely open. The classical metrology answer is that you re-calibrate for each new measurement context. The practical answer for a learned system is that you need enough diversity in the calibration corpus to cover the major axes of variation (infrastructure type, material, scale, sensor configuration, environmental conditions) and then validate on held-out combinations. Whether 30 sites or 300 sites suffices is an empirical question. The honest answer is: we do not know yet, and characterizing the boundaries of calibration transfer is part of what makes this research necessary.
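The "validate on held-out combinations" step can be made concrete as a per-stratum coverage check; a sketch, with array and label names that are illustrative.

```python
import numpy as np

def coverage_by_stratum(abs_errors, pred_sigmas, strata, k=2.0):
    """Empirical k-sigma coverage computed separately for each held-out
    stratum (site, material, range band, ...). Calibration that holds in
    aggregate but collapses in some stratum is exactly the transfer
    failure to look for."""
    results = {}
    for label in np.unique(strata):
        sel = strata == label
        results[label] = float((abs_errors[sel] <= k * pred_sigmas[sel]).mean())
    return results
```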
Third, the one that draws blood: the GUM assumes you understand the measurement model well enough to propagate uncertainty through it analytically. Your measurement model is a neural network with millions of parameters. You can validate the output empirically, but you cannot trace the uncertainty through the model the way a metrologist traces it through a chain of calibrated instruments. You have an opaque instrument that produces intervals with correct coverage on your test set. Is empirical validation without mechanistic traceability sufficient to call it a measurement?
This is the hardest question, and it deserves an honest answer rather than a dodge. Classical metrology was built for instruments you can take apart — a micrometer, a theodolite, a coordinate measuring machine. You understand each component's error contribution. You propagate uncertainty through the measurement equation because you have a measurement equation. A neural network does not offer this. The function exists, but it is not interpretable in a way that permits component-wise uncertainty decomposition.
The honest position is: empirical validation is necessary and sufficient for the user. The engineer stamping the drawing cares that the intervals hold on independent test data under documented conditions. She does not need to trace the uncertainty through layer 47 of a ViT. But empirical validation is insufficient for the metrologist, who would want the mechanistic account: the uncertainty budget with named components and propagation rules.
This is a genuine tension, not a flaw to be papered over. Resolving it, or arguing convincingly that end-to-end empirical validation is a legitimate mode of metrological certification for learned measurement systems, is part of what makes this research interesting rather than routine. The GUM was written for a world of understood instruments. We are building instruments we do not fully understand. The metrological framework must stretch to accommodate that, or it will break.
Socrates would have appreciated the honesty. He still would have kept asking.
What this actually is: the thing uncertainty should mean if you intend for the output to be used as a measurement. The definition is not novel. The application to MDE is. And the application raises a foundational question — whether empirical coverage guarantees are sufficient for metrological legitimacy when the instrument is opaque — that the field has not yet confronted because it has not yet needed to.
Why This Matters Beyond Vocabulary
This is not a pedantic taxonomy. The four definitions produce four different decisions when a downstream system consumes the output:
| Definition | Bridge inspector asking: "Is the clearance above 4.5m?" |
|---|---|
| Definition 1 (learned score) | "The confidence is 0.73." What does that mean? Can I stamp this? |
| Definition 2 (aleatoric variance) | "The predicted variance is 0.04 m²." Is that calibrated? What coverage? Against what reference? |
| Definition 3 (epistemic spread) | "Five forward passes gave depths between 4.48m and 4.67m." Is that spread meaningful? Would ten passes give a different answer? |
| Definition 4 (metrological) | "The depth is 4.58m ± 0.07m (k=2, 95% coverage), validated against NIST-traceable calibration artifacts under documented conditions." Yes, the clearance exceeds 4.5m with quantified confidence. Stamp it. |
Only one of these answers is usable by an engineer making a decision with liability attached. The gap between definitions 1–3 and definition 4 is not a gap of precision. It is a gap of kind. The first three are statements about the model. The fourth is a statement about the physical world.
The Socratic Conclusion
Socrates did not resolve his dialogues. He left his interlocutors aware that they did not know what they thought they knew, which was considered the beginning of wisdom and also the beginning of the end for Socrates.
The conclusion here is similar but less fatal: the depth estimation community has been producing "uncertainty" estimates for nearly a decade without agreeing on what the word means, what the output's semantics are, how to validate it, or who would use it. This is not a criticism of any individual paper. Every paper is internally consistent within its own definition. The problem is that there are four definitions, they are not equivalent, and the field talks about them as if they are.
The metrological definition does not render the other definitions useless. Aleatoric variance is a valid component of a full uncertainty budget. Epistemic uncertainty flags out-of-distribution inputs, which is operationally valuable. Learned confidence scores have their uses in downstream optimization.
But if the claim is that a depth model produces output with quantified uncertainty — language that implies a consumer can make a decision based on the stated bounds — then only the metrological definition delivers on the promise. Everything else is a draft of a promissory note, written in a currency that has not yet been defined.
Socrates would have found this interesting. He also would not have funded it.