The computer vision community has spent the last two years making extraordinary progress on monocular depth estimation. Depth Anything V2 scaled the data engine. VGGT won Best Paper at CVPR 2025 by predicting cameras, depth maps, point maps, and 3D tracks in a single feed-forward pass. Depth Anything 3 unified monocular and multi-view geometry under a plain DINOv2 transformer with a depth-ray target. Marigold proved that Stable Diffusion priors produce astonishing zero-shot boundary sharpness. MoGe, Metric3D, UniDepth v2, and Prompt Depth Anything pushed metric accuracy to the point where you can almost reconstruct a room from your phone.
Almost. Because none of these models produce measurements.
They produce predictions. Predictions with impressive RMSE scores on KITTI and NYUv2, evaluated against metrics that tell you how close the median pixel is to the ground truth, but nothing about the uncertainty of any individual pixel, bias as a function of range, reproducibility under varying conditions, or the traceability of the reported scale to any physical reference standard.
This is the gap. And I think it is a publishable, important, underexplored gap.
What follows is the outline of two papers. The first defines what metrological evaluation means for monocular depth and benchmarks every major model against it. The second proposes an architecture that optimizes for metrological properties directly. The first paper creates the evaluation framework. The second fills it.
The Conflation
When the MDE literature says "metric depth," it means depth in real-world units (meters), as opposed to affine-invariant or relative depth. Metric3D has "metric" in the name. All it means is "not affine-invariant."
In measurement science, "metric" means something far more demanding: a measurement result that is traceable to a reference standard, with a stated uncertainty, under documented conditions, with characterized systematic and random error components, such that the result is reproducible by an independent observer following the same protocol.
These are not the same thing. They are not close to the same thing. A model that predicts depth in meters with 5% relative error is producing metric depth. A model that can tell you the depth is 3.42 m ± 0.08 m (k = 2 expanded uncertainty, 95% confidence), with a systematic underestimate of 1.2% at ranges beyond 4 m, characterized against a NIST-traceable calibration artifact under documented illumination and viewpoint conditions, is producing metrological depth.
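As a concrete sketch of the second kind of statement, here is how replicate readings of a single point could be summarized as a measurement result with an expanded uncertainty U = k·u. The readings and the helper name are hypothetical, and the k = 2 coverage convention assumes near-Gaussian errors, per standard GUM-style reporting.

```python
import math

def measurement_result(samples, k=2.0):
    """Summarize replicate depth readings as a metrological result:
    best estimate, standard uncertainty of the mean, and expanded
    uncertainty U = k * u (k=2 ~ 95% coverage for near-Gaussian errors)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)  # sample variance
    u = math.sqrt(var / n)  # standard uncertainty of the mean
    return {"value": mean, "u": u, "U": k * u, "k": k}

# Hypothetical replicate readings of one point, in meters.
readings = [3.44, 3.40, 3.43, 3.41, 3.42]
r = measurement_result(readings)
print(f"{r['value']:.3f} m +/- {r['U']:.3f} m (k={r['k']:.0f})")
```

The point is the shape of the output: a value, a dispersion, and a stated coverage factor, rather than a bare number in meters.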
No model in the current literature does the second thing. No benchmark evaluates for it. This is the opening.
Paper 1: The Benchmark
Paper I — Beyond Metric: A Metrological Evaluation Framework for Monocular Depth Estimation
Working title. The final title should be shorter and punchier.
Type: Benchmark / Evaluation · Compute: Inference only · Timeline: 6 months · Priority: First
Thesis
Existing MDE benchmarks (KITTI, NYUv2, ETH3D, SYNS-Patches) evaluate prediction accuracy: how close is the estimated depth to ground truth, aggregated over a dataset? They do not evaluate measurement quality: can this estimate be used as a measurement in an engineering, inspection, or regulatory context? The distinction matters because downstream applications in infrastructure inspection, autonomous systems, construction surveying, and digital twins with dimensional tolerances need not just accurate depth, but depth with quantified uncertainty, characterized error structure, and traceable calibration.
This paper (a) formalizes the metric/metrological distinction for the MDE community, (b) defines a metrological evaluation framework specifying the properties a depth estimate must satisfy to qualify as a measurement, and (c) evaluates every major MDE model against this framework using both synthetic and physical calibration protocols.
Conceptual Framework (Sections 1–2)
The introduction draws the line between metric and metrological depth with concrete examples. A bridge inspector using monocular depth to estimate clearance needs to know the uncertainty bound, not just the expected error. A deformation monitoring system tracking millimeter-scale displacement over time needs reproducibility guarantees, not dataset-aggregate AbsRel. The existing evaluation paradigm cannot answer these questions because it was never designed to.
The conceptual framework section maps the vocabulary of measurement science (the VIM, the International Vocabulary of Metrology) onto depth estimation.
Key definitions
Measurand: The true depth at a specific pixel location under specified conditions.
Measurement uncertainty: A non-negative parameter characterizing the dispersion of values attributed to the depth measurand, based on the information used.
Metrological traceability: The property of a measurement result whereby it can be related to a reference through a documented, unbroken chain of calibrations.
Systematic error: The component of measurement error that, in replicate measurements, remains constant or varies predictably.
Random error: The component of measurement error that varies unpredictably in replicate measurements.
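The last two definitions have a direct operational form. A minimal sketch, assuming replicate predictions of one point with known true depth (the numbers and helper name are hypothetical): the mean signed error estimates the systematic component, and the spread of errors around it estimates the random component.

```python
import statistics

def error_components(predicted, true_value):
    """Split replicate measurement errors into a systematic component
    (mean bias, correctable) and a random component (spread, which
    does not cancel in a single frame)."""
    errors = [p - true_value for p in predicted]
    systematic = statistics.mean(errors)   # bias
    random_sd = statistics.stdev(errors)   # repeatability spread
    return systematic, random_sd

# Hypothetical replicate predictions of a point whose true depth is 5.00 m.
preds = [4.93, 4.95, 4.91, 4.94, 4.92]
bias, spread = error_components(preds, 5.00)
```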
This is not novel measurement science. It is established metrological vocabulary applied to a field that has never used it. The contribution is the mapping, not the metrology.
Formal Metrological Properties (Section 3)
The paper defines six metrological properties that a depth estimation system can be evaluated against. These are drawn from VDI/VDE 2634 (optical 3D measuring systems) and ISO 5725 (accuracy of measurement methods), adapted for the monocular depth setting:
| Property | Definition | Current Status |
|---|---|---|
| Probing Error | Range of depth values obtained when measuring a single point on a calibration sphere, characterizing local noise | Evaluated in one UniDepth study; not standardized |
| Length Measurement Error | Deviation between measured and calibrated distance between two known points on a ball-bar standard | Same single study; not adopted by the community |
| Planarity Deviation | RMS deviation of measured points from the best-fit plane on a known planar surface | Reported as a metric in MDEC but not in a metrological framework |
| Range-Dependent Bias | Systematic error characterized as a function of ground-truth depth (bias curve) | Not evaluated in any benchmark |
| Uncertainty Calibration | Agreement between predicted confidence intervals and observed error distributions (calibration curve) | Not evaluated; most models produce no uncertainty estimate |
| Reproducibility | Variation in results when the same scene is measured under changed conditions (lighting, viewpoint, stochastic inference) | Not evaluated in any benchmark |
The column "Current Status" is the point. Four of six properties have never been systematically evaluated for MDE models. This is the benchmark gap.
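Two of the missing properties are straightforward to operationalize. A sketch, with hypothetical helper names and data, of planarity deviation (RMS point-to-plane residual against an SVD best-fit plane) and a range-dependent bias curve (mean signed error per ground-truth depth bin):

```python
import numpy as np

def planarity_deviation(points):
    """RMS point-to-plane residual against the least-squares plane,
    whose normal is the direction of least variance of the centered
    (N x 3) point cloud, found via SVD."""
    centered = points - points.mean(axis=0)
    normal = np.linalg.svd(centered, full_matrices=False)[2][-1]
    return float(np.sqrt(np.mean((centered @ normal) ** 2)))

def range_bias_curve(pred, gt, edges):
    """Mean signed error per ground-truth range bin (a bias curve)."""
    bins = np.digitize(gt, edges)
    return {int(i): float(np.mean(pred[bins == i] - gt[bins == i]))
            for i in np.unique(bins)}

# Hypothetical sanity checks: an exact plane has ~zero deviation, and a
# constant +10 cm offset shows up as +0.10 bias in every range bin.
xy = np.stack(np.meshgrid(np.arange(5.0), np.arange(5.0)), -1).reshape(-1, 2)
plane = np.c_[xy, 0.3 * xy[:, 0] - 0.2 * xy[:, 1] + 1.0]
dev = planarity_deviation(plane)

gt = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
curve = range_bias_curve(gt + 0.10, gt, edges=[0.0, 5.0])
```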
Physical Calibration Protocol (Section 4)
The paper specifies a capture protocol using commercially available calibration artifacts: gauge blocks (length standards), ball-bar standards (per VDI/VDE 2634), and planar reference surfaces. These are imaged under controlled conditions (fixed illumination, known camera intrinsics) and in-the-wild conditions (varying illumination, handheld capture) at multiple ranges.
The physical protocol is complemented by a synthetic evaluation using rendered scenes with exact ground truth, enabling characterization of error structure at ranges and geometries that are impractical to calibrate physically.
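As a sketch of how the ball-bar test could be scored, assuming a pinhole camera model and hypothetical pixel coordinates, depths, and intrinsics: backproject the two sphere centers using the model's predicted depths and compare the measured center-to-center distance to the calibrated length.

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Lift pixel (u, v) with predicted depth z to a 3D point under a
    pinhole model: X = (u - cx) z / fx, Y = (v - cy) z / fy, Z = z."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def length_measurement_error(p1, p2, calibrated_length):
    """VDI/VDE 2634-style length error: measured center-to-center
    distance minus the calibrated ball-bar length."""
    return float(np.linalg.norm(p1 - p2) - calibrated_length)

# Hypothetical scene: two sphere centers at known pixels, with depths
# read from the model's depth map; intrinsics fx = fy = 600, cx = cy = 320.
a = backproject(200, 320, 2.00, 600, 600, 320, 320)
b = backproject(440, 320, 2.05, 600, 600, 320, 320)
err = length_measurement_error(a, b, calibrated_length=0.500)
```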
The key design decision: the benchmark evaluates models as "measurement instruments," not as "predictors." The evaluation asks: if I treat this model's output as a measurement, what are the metrological properties of that measurement? This framing is the contribution.
Evaluation Campaign (Section 5)
Every major model gets evaluated. No exceptions. The minimum set:
| Model | Paradigm |
|---|---|
| Depth Anything V2 | Discriminative |
| Depth Anything 3 | Unified |
| Metric3D | Discriminative |
| UniDepth v2 | Discriminative |
| MoGe / MoGe-2 | Discriminative |
| VGGT | Unified |
| Marigold v1.1 | Generative (diffusion) |
| Prompt Depth Anything | Discriminative |
| Depth Pro | Discriminative |
| DAR (2B) | Autoregressive |
For each model, the paper reports all six metrological properties plus the standard accuracy metrics for cross-reference with the existing literature. The analysis section characterizes how paradigm (discriminative vs. generative vs. unified), backbone (ViT scale), and training data (synthetic vs. real-world vs. mixed) correlate with metrological performance. The hypothesis is that metrological properties do not track neatly with prediction accuracy: a model with lower RMSE may have worse uncertainty calibration or higher bias at specific ranges.
Target Venues
Primary: CVPR or ECCV (benchmark/dataset track). Both venues have established precedent for benchmark papers, and the metrological framing is differentiated enough from existing benchmarks to clear the novelty bar.
Alternatives: 3DV (strong fit for geometric evaluation), IEEE Transactions on Instrumentation and Measurement (metrological rigor valued as a primary contribution), and the ISPRS Journal (the photogrammetry and surveying community cares about this natively).
Workshop path: the MDEC workshop at CVPR for early framing, then a full paper to a main conference.
Paper 2: The Method
Paper II — Metrological Monocular Depth Estimation with Calibrated Uncertainty Quantification
Depends on Paper I being published or at least available as a preprint on arXiv. Submitted 3–6 months after Paper I.
Type: Method · Compute: Training required · Timeline: 9–12 months · Priority: Second
Thesis
Given the metrological evaluation framework from Paper I, this paper proposes an MDE architecture designed from the ground up to produce calibrated, metrologically traceable depth estimates. The model does not merely predict depth; it predicts depth with a per-pixel uncertainty estimate that is calibrated against physical ground truth, and it exposes the bias structure of its predictions so that a downstream consumer can apply corrections appropriate to their operating conditions.
The contribution is not "better depth estimation." If the model matches DA3 on RMSE, that is sufficient. The contribution is that the uncertainty estimates are calibrated (predicted prediction intervals match observed error distributions), the bias structure is characterized and correctable, and the full output constitutes a measurement result in the measurement-science sense.
Architecture (Section 3)
The backbone is a DINOv2 encoder (ViT-L or ViT-G), following the DA2/DA3 precedent. The decoder produces two outputs: a depth map and a heteroscedastic aleatoric uncertainty map (log-variance parameterization). The loss function combines:
- Negative log-likelihood (NLL) loss under a Laplacian or Gaussian likelihood, which jointly trains the depth prediction and the uncertainty estimate. This is the mechanism that makes the uncertainty calibrated rather than decorative.
- Scale-invariant gradient loss for edge quality, following Marigold's lesson that boundary sharpness matters for geometric accuracy downstream.
- Normal consistency loss, following DA3's finding that surface normal supervision improves the geometric quality of the teacher model.
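The NLL term is the load-bearing one, so a minimal sketch may help. This assumes the Laplacian variant with a per-pixel log-scale head; the function and argument names are illustrative, not from the paper:

```python
import numpy as np

def laplace_nll(pred_depth, pred_log_b, gt_depth):
    """Per-pixel Laplace negative log-likelihood (NumPy sketch).

    pred_log_b parameterizes the log of the Laplace scale b, keeping b
    positive. Two competing terms: |d - d_hat| / b down-weights depth
    error where the model admits high uncertainty, while log(2b)
    penalizes claiming high uncertainty everywhere. Minimizing both
    jointly is what ties the predicted spread to the actual errors.
    """
    b = np.exp(pred_log_b)
    nll = np.abs(gt_depth - pred_depth) / b + np.log(2.0 * b)
    return nll.mean()
```

In a real pipeline this runs on decoder outputs, masked to pixels with valid ground truth, with the gradient and normal terms added on top under tuned weights.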
The key architectural addition is a calibration head: a lightweight MLP that takes the raw uncertainty output and maps it to calibrated confidence intervals via temperature scaling, trained on a held-out calibration set with physical ground truth (not the same data used for depth training). This is analogous to Platt scaling for classification confidence, applied to regression uncertainty.
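The simplest special case of this head is a single scalar temperature fitted on the held-out set; the MLP generalizes it to an input-dependent correction. A hedged sketch of the scalar case, assuming the Laplace parameterization (names and the grid search are illustrative):

```python
import numpy as np

def fit_temperature(pred_depth, pred_scale, gt_depth, grid=None):
    """Single-scalar post-hoc calibration: find t so that rescaled
    uncertainties t * b minimize the held-out Laplace NLL. A 1-D grid
    search is sufficient for one parameter; an MLP head would replace
    the scalar t with a learned function of the raw uncertainty."""
    if grid is None:
        grid = np.geomspace(0.1, 10.0, 400)
    err = np.abs(gt_depth - pred_depth)
    nlls = [np.mean(err / (t * pred_scale) + np.log(2.0 * t * pred_scale))
            for t in grid]
    return grid[int(np.argmin(nlls))]
```

Because the fit happens on held-out physical ground truth, the depth weights stay frozen and the calibration claim is not circular.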
Design decision
LiDAR conditioning (à la Prompt Depth Anything) is a natural extension but is deliberately excluded from the first version of this paper to keep the contribution clean: the model must produce metrologically characterized output from RGB alone. The LiDAR-conditioned variant is a follow-up experiment or a third paper, not a confound in the method contribution.
Training Regime (Section 4)
Teacher–student distillation following the DA2/DA3 paradigm. The teacher is trained on large-scale labeled data (synthetic + real). The student distills from the teacher on unlabeled data with pseudo-labels. The uncertainty calibration head is trained post hoc on a separate calibration dataset containing physical ground truth from calibration artifacts and high-quality LiDAR scans.
Critical detail: the calibration dataset must be independent of the training data. This is a metrological requirement, not a convenience. A measurement instrument calibrated against the same data used to build it has circular traceability. The paper must demonstrate that the calibration holds on data the model has never seen, captured with equipment the model was not trained on.
Experiments (Section 5)
The experimental section has two halves. The first half is the standard accuracy evaluation on existing benchmarks (KITTI, NYU Depth v2, ETH3D) to demonstrate competitive depth quality. The model does not need to set a new SOTA. It needs to be within the pack.
The second half evaluates against the Paper I benchmark. This is where the contribution lives. The key results:
- Calibration curves (predicted confidence vs. observed coverage) showing that the model's uncertainty estimates are well-calibrated, where existing models either produce no uncertainty or produce uncalibrated confidence.
- Range-dependent bias curves showing characterized systematic error, with a demonstration that the bias can be corrected using the model's own output (a practical feature no existing model offers).
- Probing error and length measurement error on physical calibration artifacts, compared to all Paper I baselines.
- Reproducibility under varied conditions, demonstrating that the model's stochastic variation (if any) is bounded and characterized.
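The coverage computation behind the first of these results is simple to state. A minimal sketch, again assuming Laplace prediction intervals (all names illustrative):

```python
import numpy as np

def empirical_coverage(pred_depth, pred_scale, gt_depth, level=0.95):
    """Fraction of ground truth falling inside the model's central
    prediction interval at the nominal level, under a Laplace error
    model. For Laplace(0, b), P(|err| <= h) = 1 - exp(-h / b), so the
    half-width of the central `level` interval is h = -b * ln(1 - level).
    A calibrated model yields coverage close to the nominal level."""
    half_width = -pred_scale * np.log(1.0 - level)
    inside = np.abs(gt_depth - pred_depth) <= half_width
    return inside.mean()
```

Sweeping `level` over (0, 1) and plotting empirical against nominal coverage yields the calibration curve; systematic deviation from the diagonal is exactly the failure mode the benchmark is built to expose.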
Target Venues
Venue strategy
Primary: CVPR, ECCV, or ICCV main conference. The method paper benefits from Paper I having established the evaluation framework. Reviewers can look up the benchmark.
Alternative: NeurIPS (if the UQ angle is framed as a probabilistic modeling contribution: aleatoric/epistemic uncertainty, calibrated likelihoods), AAAI, or WACV (lower bar, still solid).
Dependency Graph
The two papers are designed to be self-reinforcing but independently publishable. Here is the dependency structure:
Paper I: Benchmark
├── Conceptual framework (metric vs. metrological)
├── Six formal metrological properties
├── Physical calibration protocol
├── Evaluation of 10+ existing models
└── Public benchmark release
│ establishes evaluation framework for
▼
Paper II: Method
├── DINOv2 + heteroscedastic uncertainty decoder
├── NLL loss for calibrated uncertainty
├── Post-hoc calibration head (Platt-style)
├── Standard benchmarks (competitive, not SOTA)
└── Paper I benchmark (this is where you win)
│ enables
▼
[Future] LiDAR-conditioned variant
[Future] Video temporal consistency
[Future] Multi-view metrological fusion
What This Isn't
This is not a plan to beat DA3 on RMSE. I will not beat a ByteDance team with a ViT trained on their internal data engine. That is not the game.
This is a plan to redefine what "good" means for depth estimation in contexts where the output is treated as a measurement rather than a visual prior. The benchmark paper says: you have been evaluating predictions when you should have been evaluating measurements. The method paper says: here is what it looks like when you optimize for measurement quality.
The bet is that as monocular depth moves from research artifact to deployed measurement tool in infrastructure, surveying, construction, autonomous systems, and digital twins, the metrological framing will become necessary rather than optional. The first people to formalize it get to define the terms.
That is a game worth playing, and one where an applied researcher with a measurement science background and a philosophy of rigorous attention has a genuine, non-commoditized advantage.