WESLEY LADD

Associate Director, LSU Center for Internal Auditing & Cybersecurity Risk • CTO, Polaris EcoSystems • Coauthor, “Practical AI for Professionals”


Your Depth Map Is Not a Measurement

On the difference between metric and metrological depth estimation, and a two-paper plan for making the field care about it

By Wesley Ladd • March 2026

Computer Vision · Monocular Depth Estimation · Metrology · Research Strategy

The computer vision community has spent the last two years making extraordinary progress on monocular depth estimation. Depth Anything scaled the teacher–student data engine to tens of millions of images. VGGT won Best Paper at CVPR 2025 by predicting cameras, depth maps, point maps, and 3D tracks in a single pass. Depth Anything 3 unified monocular and multi-view depth under a plain transformer with a single depth-ray target. Marigold proved that diffusion priors produce astonishing boundary sharpness. Depth Pro, Metric3D, MoGe-2, and their peers pushed accuracy to the point where you can almost reconstruct a room from your phone.

Almost. Because none of these models produce measurements.

They produce predictions. Predictions with impressive scores on NYUv2 and KITTI, evaluated against metrics that tell you how close the median pixel is to the ground truth, but nothing about the uncertainty of any individual pixel, bias as a function of range, reproducibility under varying conditions, or the traceability of the reported scale to any physical reference standard.

This is the gap. And I think it is a publishable, important, underexplored gap.

What follows is the outline of two papers. The first defines what metrological evaluation means for monocular depth and benchmarks every major model against it. The second proposes an architecture that optimizes for metrological properties directly. The first paper creates the evaluation framework. The second fills it.

The Conflation

When the MDE literature says "metric depth," it means depth in real-world units (meters), as opposed to relative or affine-invariant depth. Metric3D has "metric" in the name. All it means is "not relative."

In measurement science, "metrological" means something far more demanding: a measurement result that is traceable to a reference standard, with a stated uncertainty, under documented conditions, with characterized systematic and random error components, such that the result is reproducible by an independent observer following the same protocol.

These are not the same thing. They are not close to the same thing. A model that predicts depth in meters with 5% mean absolute relative error is producing metric depth. A model that can tell you the depth is 3.42m ± 0.08m (coverage factor k = 2, 95% confidence) with a systematic underestimate of 1.2% at ranges beyond 4m, characterized against a calibration artifact under documented illumination and viewpoint conditions, is producing metrological depth.
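The shape of that second output can be made concrete in code. Below is a toy numpy sketch of a GUM-style measurement result; every number in it is illustrative, not the output of any real model:

```python
import numpy as np

# A bare prediction is one number. A measurement result carries an error
# budget. All values below are illustrative placeholders.
depth_pred = 3.42                 # metric depth: meters, and nothing else

# Metrological depth: the same number plus characterized uncertainty and bias.
u_random = 0.035                  # standard uncertainty, random component (m)
u_systematic = 0.02               # standard uncertainty of the bias correction (m)
u_combined = np.hypot(u_random, u_systematic)  # root-sum-of-squares combination
k = 2                             # coverage factor for ~95% confidence (GUM)
U = k * u_combined                # expanded uncertainty

bias = -0.012 * depth_pred        # 1.2% systematic underestimate at this range
depth_corrected = depth_pred - bias

print(f"{depth_corrected:.2f} m ± {U:.2f} m (k={k}, ~95% confidence)")
```

The point is the shape of the output: a corrected value, an expanded uncertainty, and the conditions under which both hold.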

No model in the current literature does the second thing. No benchmark evaluates for it. This is the opening.

Paper 1: The Benchmark

Paper I — Beyond Metric: A Metrological Evaluation Framework for Monocular Depth Estimation
Working title. The final title should be shorter and punchier.

Type: Benchmark / Evaluation · Compute: Inference only · Timeline: 6 months · Priority: First

Thesis

Existing MDE benchmarks (NYUv2, KITTI, ETH3D, and their kin) evaluate prediction accuracy: how close is the estimated depth to the ground truth, aggregated over a dataset? They do not evaluate measurement quality: can this estimate be used as a measurement in an engineering, inspection, or regulatory context? The distinction matters because downstream applications in infrastructure inspection, autonomous systems, construction surveying, and digital-twin workflows need not just accurate depth, but depth with quantified uncertainty, characterized error structure, and traceable calibration.

This paper (a) formalizes the metric/metrological distinction for the MDE community, (b) defines a metrological evaluation framework specifying the properties a depth estimate must satisfy to qualify as a measurement, and (c) evaluates every major MDE model against this framework using both synthetic and physical calibration protocols.

Conceptual Framework (Sections 1–2)

The introduction draws the line between metric and metrological depth with concrete examples. A bridge inspector using monocular depth to estimate clearance needs to know the worst-case error bound, not just the expected error. A deformation monitoring system tracking millimeter-scale displacement over time needs repeatability guarantees, not dataset-aggregate accuracy numbers. The existing evaluation paradigm cannot answer these questions because it was never designed to.

The conceptual framework section maps the vocabulary of measurement science (the VIM: International Vocabulary of Metrology) onto depth estimation. Key definitions:


Measurand: The true depth at a specific pixel location under specified conditions.
Measurement uncertainty: A non-negative parameter characterizing the dispersion of values attributed to the depth measurand, based on the information used.
Metrological traceability: The property of a measurement result whereby it can be related to a reference through a documented unbroken chain of calibrations.
Systematic measurement error: The component of measurement error that in replicate measurements remains constant or varies in a predictable manner.
Random measurement error: The component of measurement error that varies unpredictably in replicate measurements.

This is not novel measurement science. It is established metrological vocabulary applied to a field that has never used it. The contribution is the mapping, not the metrology.

Formal Metrological Properties (Section 3)

The paper defines six metrological properties that a depth estimation system can be evaluated against. These are drawn from VDI/VDE 2634 (optical 3D measuring systems) and ISO 5725 (accuracy of measurement methods), adapted for the monocular depth setting:

| Property | Definition | Current Status |
| --- | --- | --- |
| Probing error | Range of depth values obtained when measuring a single point on a calibration sphere, characterizing local noise | Evaluated in one study; not standardized |
| Length measurement error | Deviation between the measured and the calibrated distance between two known points on a standard | Same single study; not adopted by the community |
| Planarity deviation | RMS deviation of measured points from the best-fit plane on a known planar surface | Reported as a flatness metric in isolated papers, but not in a metrological framework |
| Range-dependent bias | Systematic error characterized as a function of ground-truth depth (a bias curve) | Not evaluated in any benchmark |
| Uncertainty calibration | Agreement between predicted and observed error distributions (a reliability curve) | Not evaluated; most models produce no uncertainty estimate |
| Reproducibility | Variation in results when the same scene is measured under changed conditions (lighting, viewpoint, capture hardware) | Not evaluated in any benchmark |

The column "Current Status" is the point. Four of six properties have never been systematically evaluated for MDE models. This is the benchmark gap.
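The first two properties are cheap to compute once the artifacts exist. Here is a hedged numpy sketch with synthetic stand-ins for real captures (the sampling noise and artifact dimensions are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Probing error: spread of repeated depth measurements of one point on a
# calibration sphere. Synthetic stand-in for 25 real captures.
sphere_samples = 2.000 + rng.normal(0, 0.004, size=25)       # meters
probing_error = sphere_samples.max() - sphere_samples.min()  # range statistic

# Length measurement error: measured vs. certified distance between two
# targets on a length standard.
calibrated_length = 0.500                                    # certified (m)
p1 = np.array([0.10, 0.00, 2.00]) + rng.normal(0, 0.003, 3)  # measured (m)
p2 = np.array([0.60, 0.00, 2.00]) + rng.normal(0, 0.003, 3)
length_error = np.linalg.norm(p2 - p1) - calibrated_length

print(f"probing error: {probing_error * 1e3:.1f} mm")
print(f"length measurement error: {length_error * 1e3:+.1f} mm")
```

In the real protocol, `sphere_samples`, `p1`, and `p2` would come from a model's depth output back-projected through known intrinsics; the statistics stay the same.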

Physical Calibration Protocol (Section 4)

The paper specifies a capture protocol using commercially available calibration artifacts: gauge blocks and ball bars (length standards), sphere standards (per VDI/VDE 2634), and planar reference surfaces. These are imaged under controlled conditions (fixed illumination, known camera intrinsics) and in-the-wild conditions (varying illumination, handheld capture) at multiple ranges.

The physical protocol is complemented by a synthetic evaluation using rendered scenes with exact ground truth, enabling characterization of error structure at ranges and geometries that are impractical to calibrate physically.

The key design decision: the benchmark evaluates models as "measurement instruments," not as "predictors." The evaluation asks: if I treat this model's output as a measurement, what are the metrological properties of that measurement? This framing is the contribution.
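Planarity deviation illustrates the instrument framing well. A sketch (synthetic wall, invented noise level) of the RMS distance-to-best-fit-plane computation via SVD:

```python
import numpy as np

# Planarity deviation: RMS distance of measured points to their best-fit
# plane. Synthetic points on a wall at z = 2 m with 2 mm measurement noise.
rng = np.random.default_rng(3)
pts = np.column_stack([
    rng.uniform(-0.5, 0.5, 500),
    rng.uniform(-0.5, 0.5, 500),
    2.0 + rng.normal(0, 0.002, 500),
])

centered = pts - pts.mean(axis=0)
normal = np.linalg.svd(centered)[2][-1]        # smallest right singular vector
rms = np.sqrt(np.mean((centered @ normal) ** 2))
print(f"planarity deviation: {rms * 1e3:.2f} mm RMS")
```

Treated as an instrument property, the interesting question is not this number in isolation but how it varies with range, illumination, and surface texture.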

Evaluation Campaign (Section 5)

Every major model gets evaluated. No exceptions. The minimum set spans the field's paradigms, each model in its latest public release: the feed-forward metric estimators (the Depth Anything line, MoGe-2, Metric3D), the diffusion-based estimators (Marigold), and the unified geometry models.

For each model, the paper reports all six metrological properties plus the standard accuracy metrics for cross-reference with the existing literature. The analysis section characterizes how paradigm (feed-forward vs. diffusion vs. unified geometry), backbone (ViT scale), and training data (real vs. synthetic vs. mixed) correlate with metrological performance. The hypothesis is that metrological properties do not track neatly with prediction accuracy: a model with lower AbsRel may have worse uncertainty calibration or higher bias at specific ranges.
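The range-dependent part of that hypothesis is straightforward to operationalize. A sketch of a bias curve on synthetic predictions with a deliberate 1% scale drift (the drift and noise levels are invented):

```python
import numpy as np

# Range-dependent bias curve: bin signed errors by ground-truth depth and
# take the mean per bin. Synthetic data with a built-in 1% underestimate.
rng = np.random.default_rng(1)
gt = rng.uniform(0.5, 10.0, size=5000)                  # ground-truth depths (m)
pred = gt * 0.99 + rng.normal(0, 0.03, size=gt.size)    # 1% drift + noise

bins = np.linspace(0.5, 10.0, 11)
idx = np.digitize(gt, bins) - 1
bias = np.array([(pred - gt)[idx == b].mean() for b in range(10)])
centers = 0.5 * (bins[:-1] + bins[1:])

for c, b in zip(centers, bias):
    print(f"{c:4.1f} m: mean bias {b * 100:+.2f} cm")
```

A multiplicative drift shows up as bias growing with range, which is exactly the structure that a dataset-aggregate AbsRel number averages away.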

Target Venues

Venue strategy
Primary: CVPR or the NeurIPS Datasets & Benchmarks track. Both venues have established precedent for benchmark papers. The metrological framing is differentiated enough from existing benchmarks to clear the novelty bar.
Alternatives: 3DV (strong fit for geometric evaluation), IEEE Transactions on Instrumentation and Measurement (metrological rigor valued as a primary contribution), ISPRS Journal of Photogrammetry and Remote Sensing (the photogrammetry and surveying community cares about this natively).
Workshop path: a relevant workshop at CVPR for early framing, then the full paper to a main conference.


Paper 2: The Method

Paper II — Metrological Monocular Depth Estimation with Calibrated Uncertainty Quantification
Depends on Paper I being published or at least available as a preprint on arXiv. Submitted 3–6 months after Paper I.

Type: Method · Compute: Training required · Timeline: 9–12 months · Priority: Second

Thesis

Given the metrological evaluation framework from Paper I, this paper proposes an MDE architecture that is designed from the ground up to produce calibrated, metrologically characterized depth estimates. The model does not merely predict depth; it predicts depth with a per-pixel uncertainty estimate that is calibrated against physical ground truth, and it exposes the error structure of its predictions such that a downstream consumer can apply corrections appropriate to their operating conditions.

The contribution is not "better depth estimation." If the model matches DA3 on standard accuracy metrics, that is sufficient. The contribution is that the uncertainty estimates are calibrated (predicted confidence intervals match observed error distributions), the bias structure is characterized and correctable, and the full output constitutes a measurement result in the VIM sense.

Architecture (Section 3)

The backbone is a DINOv2 ViT (base or large scale), following the DA2/DA3 precedent. The decoder produces two outputs: a depth map and a per-pixel uncertainty map (heteroscedastic parameterization). The loss function combines:

  1. A negative log-likelihood (NLL) loss under a heteroscedastic Gaussian model, which jointly trains the depth prediction and the uncertainty estimate. This is the mechanism that makes the uncertainty meaningful rather than decorative.
  2. A gradient-matching loss for edge quality, following Marigold's lesson that boundary sharpness matters for geometric accuracy downstream.
  3. A distillation term following DA3's finding that synthetic-label supervision improves the geometric quality of the teacher model.
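A minimal numpy sketch of the first loss term, assuming the standard heteroscedastic Gaussian NLL with a predicted log-variance (a PyTorch implementation would mirror this line for line):

```python
import numpy as np

# Heteroscedastic Gaussian NLL: the network predicts a log-variance
# s = log(sigma^2) per pixel for numerical stability.
def hetero_nll(depth_pred, log_var_pred, depth_gt):
    s = log_var_pred
    # 0.5 * (residual^2 / sigma^2 + log sigma^2), up to a constant
    return np.mean(0.5 * (np.exp(-s) * (depth_pred - depth_gt) ** 2 + s))

gt = np.array([2.0, 3.0, 5.0])
pred = np.array([2.1, 2.9, 5.4])

# Overstating confidence (tiny predicted variance) on a bad pixel costs far
# more than honestly admitting uncertainty:
overconfident = hetero_nll(pred, np.log(np.full(3, 1e-4)), gt)
honest = hetero_nll(pred, np.log(np.full(3, 0.05)), gt)
print(overconfident > honest)
```

The `exp(-s)` weighting is what lets the network trade residual against claimed confidence, which is the mechanism the text calls "meaningful rather than decorative."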

The key architectural addition is a post-hoc calibration head: a lightweight module that takes the raw uncertainty output and maps it to calibrated prediction intervals via a learned monotonic transform, trained on a calibration set with physical ground truth (not the same data used for depth training). This is analogous to Platt scaling for classification confidence, applied to depth uncertainty.
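What the calibration step does can be approximated with a single scalar, the simplest possible stand-in for a learned monotonic mapping (all data here is simulated; a real head would condition on more than one parameter):

```python
import numpy as np

# Post-hoc calibration on a held-out set with physical ground truth: find a
# scale factor t so the 95% interval (±1.96·t·sigma) covers ~95% of errors.
rng = np.random.default_rng(2)
errors = rng.normal(0, 0.10, size=2000)      # actual errors on calibration set (m)
sigma_raw = np.full(2000, 0.05)              # model's raw sigmas: 2x overconfident

def coverage(t):
    return float(np.mean(np.abs(errors) <= 1.96 * t * sigma_raw))

ts = np.linspace(0.5, 4.0, 200)
t_star = ts[np.argmin(np.abs([coverage(t) - 0.95 for t in ts]))]
print(f"calibration factor t = {t_star:.2f}, coverage {coverage(t_star):.2%}")
```

Because the raw sigmas understate the true error by a factor of two, the fitted factor lands near 2.0, recovering honest intervals without retraining the depth head.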

Design decision
LiDAR conditioning (à la recent prompt-conditioned depth models) is a natural extension but is deliberately excluded from the first version of this paper to keep the contribution clean: the model must produce metrologically characterized output from RGB alone. The LiDAR-conditioned variant is a follow-up experiment or a third paper, not a confound in the method contribution.

Training Regime (Section 4)

Teacher–student distillation following the DA2/DA3 paradigm. The teacher is trained on large-scale labeled data (real + synthetic). The student distills from the teacher on unlabeled data with pseudo-labels. The uncertainty head is trained on a separate calibration dataset containing physical ground truth from calibration artifacts and high-quality LiDAR scans.

Critical detail: the calibration dataset must be independent of the training data. This is a metrological requirement, not a convenience. A measurement instrument calibrated against the same data used to build it has no independent calibration at all. The paper must demonstrate that the calibration holds on data the model has never seen, captured with equipment the model was not trained on.

Experiments (Section 5)

The experimental section has two halves. The first half is the standard accuracy evaluation on existing benchmarks (NYUv2, KITTI, ETH3D) to demonstrate competitive depth quality. The model does not need to set a new state of the art. It needs to be within the pack.

The second half evaluates against the Paper I benchmark. This is where the contribution lives. The key results:

  1. Calibration curves (predicted confidence vs. observed error frequency) showing that the model's uncertainty estimates are well-calibrated, where existing models either produce no uncertainty or produce uncalibrated confidence.
  2. Range-dependent bias curves showing characterized systematic error, with a demonstration that the bias can be corrected using the model's own output (a practical feature no existing model offers).
  3. Probing error and length measurement error on physical calibration artifacts, compared to all Paper I baselines.
  4. Reproducibility under varied conditions, demonstrating that the model's stochastic variation (if any) is bounded and characterized.
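The first of those results is a reliability computation. A sketch using a synthetic model that is perfectly calibrated by construction, so the curve should hug the diagonal (a real model's curve will not):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(4)
sigma = np.full(5000, 0.08)          # predicted per-pixel sigma (m)
errors = rng.normal(0, 0.08, 5000)   # actual errors: calibrated by construction

def z_for(p):
    """Two-sided normal quantile via grid search (stdlib-only, ~1e-3 accurate)."""
    zs = np.linspace(0.0, 4.0, 4001)
    cov = np.array([erf(z / sqrt(2)) for z in zs])
    return zs[np.searchsorted(cov, p)]

# For each nominal confidence level p, the fraction of pixels whose error
# falls inside the predicted p-interval.
observed = {}
for p in (0.50, 0.80, 0.95):
    observed[p] = float(np.mean(np.abs(errors) <= z_for(p) * sigma))
    print(f"nominal {p:.0%} -> observed {observed[p]:.1%}")
```

Plotting nominal against observed coverage across many levels gives the calibration curve; deviation from the diagonal is the headline failure mode of current models.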

Target Venues

Venue strategy
Primary: CVPR, ICCV, or ECCV main conference. The method paper benefits from Paper I having established the evaluation framework. Reviewers can look up the benchmark.
Alternative: NeurIPS (if the UQ angle is framed as a probabilistic modeling contribution: aleatoric/epistemic uncertainty, calibrated likelihoods), AAAI, or WACV (lower bar, still solid).


Dependency Graph

The two papers are designed to be self-reinforcing but independently publishable. Here is the dependency structure:

Paper I: Benchmark
├── Conceptual framework (metric vs. metrological)
├── Six formal metrological properties
├── Physical calibration protocol
├── Evaluation of 10+ existing models
└── Public benchmark release
    │ establishes evaluation framework for
    ▼
Paper II: Method
├── DINOv2 + heteroscedastic uncertainty decoder
├── NLL loss for calibrated uncertainty
├── Post-hoc calibration head (Platt-style)
├── Standard benchmarks (competitive, not SOTA)
└── Paper I benchmark (this is where you win)
    │ enables
    ▼
[Future] LiDAR-conditioned variant
[Future] Video temporal consistency
[Future] Multi-view metrological fusion

What This Isn't

This is not a plan to beat DA3 on the standard benchmarks. I will not beat a team with a foundation model trained on their internal data engine. That is not the game.

This is a plan to redefine what "good" means for depth estimation in contexts where the output is treated as a measurement rather than a visual prior. The benchmark paper says: you have been evaluating predictions when you should have been evaluating measurements. The method paper says: here is what it looks like when you optimize for measurement quality.

The bet is that as monocular depth moves from research artifact to deployed measurement tool in infrastructure, surveying, construction, autonomous systems, and beyond, the metrological framing will become necessary rather than optional. The first people to formalize it get to define the terms.

That is a game worth playing, and one where an applied researcher with a measurement science background and a philosophy of rigorous attention has a genuine, non-commoditized advantage.

© 2026 Wesley Ladd. All rights reserved.

Last updated: 3/22/2026