Metrological depth estimation: a benchmark and a method for turning depth predictions into measurements
Two papers designed as a self-reinforcing pair. Paper I establishes a metrological evaluation framework for monocular depth estimation — formalizing what it means for a depth prediction to qualify as a measurement. Paper II proposes an architecture that optimizes for metrological properties directly, evaluated against the Paper I benchmark.
The papers are independently publishable but strategically sequenced: Paper I creates the evaluation framework and establishes the vocabulary; Paper II fills it with a method that produces calibrated, traceable depth.
Paper I: Benchmark
├── Conceptual framework (metric vs. metrological)
├── Six formal metrological properties
├── Physical calibration protocol
├── Evaluation of 10+ existing models
└── Public benchmark release
│ establishes evaluation framework for
▼
Paper II: Method
├── DINOv2 + heteroscedastic uncertainty decoder
├── NLL loss for calibrated uncertainty
├── Post-hoc calibration head (Platt-style)
├── Standard benchmarks (competitive, not SOTA)
└── Paper I benchmark (this is where you win)
│ enables
▼
[Future] LiDAR-conditioned variant
[Future] Sensor-conditioned multi-modal inference
[Future] Temporal coherence & change detection
[Future] Infrastructure-specific geometric priors

Both papers will be posted to arXiv before or concurrent with conference submission. Paper I must be citable as a preprint before Paper II is submitted, so that reviewers can reference the evaluation framework. Early arXiv posting establishes priority on the metric/metrological distinction for MDE and invites community engagement before peer review.
Primary targets are CVPR, ECCV, and ICCV for both papers. Paper I's benchmark/evaluation framing fits naturally into dataset-track submissions. Paper II's method contribution is a standard main-conference paper. 3DV is a strong alternative venue where geometric measurement work is valued. IEEE T-IM and ISPRS are journal alternatives where metrological rigor is a first-class contribution.
The MDEC Workshop at CVPR provides an early framing opportunity: present the conceptual framework at the workshop, collect feedback, then submit the full Paper I to a main conference the following cycle.
Paper I (benchmark) is submitted first. It requires only inference compute and can be completed in ~6 months. Paper II (method) follows 3–6 months later, referencing the Paper I preprint. This sequencing means reviewers of Paper II can assess the contribution against an evaluation framework that is already public and citable.
Working title
Existing MDE benchmarks (KITTI, NYU Depth v2, ETH3D, SYNS-Patches) evaluate prediction accuracy — how close is the estimated depth to ground truth, aggregated over a dataset. They do not evaluate measurement quality — can this estimate be used as a measurement in an engineering, inspection, or regulatory context? This paper formalizes the metric/metrological distinction, defines six metrological properties a depth estimate must satisfy to qualify as a measurement, and evaluates every major MDE model against this framework.
Draw the line between metric depth (real units) and metrological depth (traceable, uncertainty-quantified measurements). Map the International Vocabulary of Metrology (VIM) onto depth estimation: measurand, measurement uncertainty, traceability, systematic error, random error.
Define six evaluation properties drawn from VDI/VDE 2634 and ISO 5725, adapted for monocular depth: probing error, length measurement error, planarity deviation, range-dependent bias, uncertainty calibration, and reproducibility. Document that four of six have never been systematically evaluated for MDE models.
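As a concrete illustration of one of the six properties (hypothetical function and artifact names, not the benchmark's actual API), length measurement error in the VDI/VDE 2634 sense is the signed difference between measured and calibrated distances across artifact placements:

```python
import numpy as np

def length_measurement_error(pred_points, ref_lengths):
    """Signed length errors for a ball-bar style artifact.

    pred_points : (N, 2, 3) array -- predicted 3D centers of the two
                  spheres for each of N placements in the scene.
    ref_lengths : (N,) array -- calibrated center-to-center distances.

    Returns per-placement signed errors; VDI/VDE 2634-style reporting
    takes the maximum absolute error over all placements.
    """
    measured = np.linalg.norm(pred_points[:, 0] - pred_points[:, 1], axis=-1)
    return measured - ref_lengths

# Toy example: one placement, 0.5 m calibrated bar, ~3 mm overestimate.
pts = np.array([[[0.0, 0.0, 2.0], [0.503, 0.0, 2.0]]])
err = length_measurement_error(pts, np.array([0.5]))
```

The same pattern (predicted geometry vs. calibrated reference) underlies probing error and planarity deviation, with spheres and planes fit to the predicted point cloud instead of pairwise distances.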
Specify a capture protocol using commercially available calibration artifacts: gauge blocks, ball-bar standards, and planar reference surfaces. Complement with synthetic evaluation using rendered scenes with exact ground truth. Evaluate models as "measurement instruments," not "predictors."
Evaluate all major models: Depth Anything V2, Depth Anything 3, Metric3D v2, UniDepth V2, MoGe/MoGe-2, VGGT, Marigold v1.1, Prompt DA, DepthPro, DAR (2B). Report all six metrological properties plus standard accuracy metrics for cross-reference. Analyze how paradigm, backbone scale, and training data correlate with metrological performance.
Test the hypothesis that metrological properties do not track neatly with prediction accuracy: a model with lower RMSE may have worse uncertainty calibration or higher systematic bias at specific ranges. Discuss implications for downstream deployment in infrastructure inspection, autonomous systems, and digital twins.
Release the metrological evaluation toolkit, calibration protocol specification, and evaluation results as a public resource. Establish the framework that Paper II and the broader community will build on.
Post to arXiv before or concurrent with conference submission. The benchmark needs to be citable for Paper II, and early visibility allows the community to engage with the framework. Preprint establishes priority on the metric/metrological distinction for MDE.
Working title
Given the metrological evaluation framework from Paper I, this paper proposes an MDE architecture designed from the ground up to produce calibrated, metrologically traceable depth estimates. The model predicts depth with a per-pixel uncertainty estimate that is calibrated against physical ground truth, and it exposes the systematic error structure of its predictions. The contribution is not "better depth estimation" — it is that the uncertainty estimates are calibrated, the bias structure is characterized and correctable, and the full output constitutes a measurement result in the metrological sense.
Motivate the need for metrological depth in infrastructure inspection, autonomous systems, construction surveying, and digital twins. Reference the Paper I framework as the evaluation standard.
Survey MDE architectures (DA2/DA3, Marigold, VGGT, Metric3D), uncertainty quantification in depth estimation (heteroscedastic models, MC Dropout, ensembles), and calibration methods (temperature scaling, Platt scaling, conformal prediction).
DINOv2 encoder (ViT-L or ViT-G) following DA2/DA3 precedent. Decoder produces two outputs: depth map and heteroscedastic aleatoric uncertainty map (log-variance parameterization). Key addition: a calibration head — lightweight MLP mapping raw uncertainty to calibrated prediction intervals via temperature scaling, trained post-hoc on a held-out calibration set with physical ground truth.
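The post-hoc calibration idea can be sketched in a few lines (a minimal one-parameter form chosen for illustration; the paper's MLP head is more expressive): under a Gaussian observation model, the NLL-optimal temperature for predicted standard deviations has a closed form, fit once on the held-out calibration set and applied at inference.

```python
import numpy as np

def fit_temperature(errors, sigmas):
    """One-parameter temperature scaling for predicted std devs.

    Under a Gaussian model the NLL-optimal temperature satisfies
    T^2 = mean((error / sigma)^2): scale the sigmas so that the
    standardized residuals have unit variance on the held-out set.
    """
    z2 = (errors / sigmas) ** 2
    return np.sqrt(z2.mean())

def calibrate(sigmas, T):
    return T * sigmas

# Held-out set where the raw uncertainty head is overconfident by 2x.
rng = np.random.default_rng(0)
sigma_raw = np.full(10_000, 0.05)           # predicted std: 5 cm
err = rng.normal(0.0, 0.10, size=10_000)    # actual error std: 10 cm
T = fit_temperature(err, sigma_raw)         # recovers T close to 2
sigma_cal = calibrate(sigma_raw, T)
```

The calibration head generalizes this to an input-dependent mapping, but the traceability requirement is the same: the set used to fit it must carry physical ground truth and be disjoint from training data.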
Teacher–student distillation following DA2/DA3 paradigm. NLL loss under Laplacian/Gaussian observation model for joint depth+uncertainty training. Scale-invariant gradient loss for edges. Normal consistency loss for geometric coherence. Critical: calibration dataset independent of training data to avoid circular traceability.
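The joint depth+uncertainty objective above can be sketched as a per-pixel Laplacian NLL with log-scale parameterization (a standard heteroscedastic form; the paper's exact loss weighting and the Laplacian-vs-Gaussian choice are ablated later):

```python
import numpy as np

def laplace_nll(depth_pred, log_b, depth_gt):
    """Per-pixel negative log-likelihood under a Laplace model.

    p(d | d_hat, b) = exp(-|d - d_hat| / b) / (2b), with b = exp(log_b)
    predicted per pixel. Predicting log_b keeps the scale positive and
    lets the network down-weight the residual where it is uncertain,
    at the cost of the log_b penalty term.
    """
    b = np.exp(log_b)
    return np.abs(depth_gt - depth_pred) / b + log_b + np.log(2.0)

# At the pointwise optimum, b equals the absolute residual (here 0.1 m).
nll = laplace_nll(depth_pred=2.0, log_b=np.log(0.1), depth_gt=2.1)
```

Replacing the Laplace density with a Gaussian swaps the absolute residual for a squared one and `log_b` for half the log-variance, which is the ablation axis listed in the experiments.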
Two halves. (A) Standard accuracy evaluation on KITTI, NYU, ETH3D to demonstrate competitive depth quality (within the pack, not necessarily SOTA). (B) Paper I benchmark evaluation: calibration curves, range-dependent bias curves, probing error, length measurement error on physical artifacts, and reproducibility under varied conditions.
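One part-(B) diagnostic, the calibration (coverage) curve, can be sketched as follows (hypothetical helper name; Gaussian prediction intervals assumed): for each nominal confidence level, count how often ground truth falls inside the predicted interval. A calibrated model tracks the diagonal.

```python
import math
import numpy as np

def coverage_curve(errors, sigmas, z_values=(0.5, 1.0, 1.5, 2.0)):
    """Empirical vs. nominal coverage of Gaussian prediction intervals.

    For each z, nominal coverage is P(|N(0,1)| <= z) = erf(z / sqrt(2));
    empirical coverage is the fraction of |error| <= z * sigma.
    Returns a list of (z, nominal, empirical) triples.
    """
    out = []
    for z in z_values:
        nominal = math.erf(z / math.sqrt(2.0))
        empirical = float(np.mean(np.abs(errors) <= z * sigmas))
        out.append((z, nominal, empirical))
    return out

# Synthetic, perfectly calibrated predictions: empirical ~ nominal.
rng = np.random.default_rng(1)
sigma = np.full(50_000, 0.08)
err = rng.normal(0.0, 0.08, size=50_000)
curve = coverage_curve(err, sigma)
```

Plotting empirical against nominal coverage gives the calibration curve; systematic under-coverage (points below the diagonal) is the overconfidence failure mode the benchmark is designed to expose.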
Ablate the calibration head, NLL loss formulation (Laplacian vs. Gaussian), training data composition (synthetic vs. real vs. mixed calibration sets), and backbone scale. Demonstrate that each component contributes to metrological quality independent of prediction accuracy.
Address LiDAR conditioning exclusion (deliberate: keep RGB-only contribution clean). Discuss generalization of calibration across domains. Outline future work: multi-modal sensor conditioning, temporal coherence, infrastructure-specific priors.
Post to arXiv 3–6 months after Paper I, timed so that reviewers of the conference submission can reference the benchmark preprint. The method paper establishes that the metrological framework from Paper I is not merely diagnostic but architecturally actionable.
The two-paper foundation enables a research program that extends into the model attributes described in the model architecture strategy. Each future direction builds on the metrological framework (Paper I) and calibrated architecture (Paper II).
Extend Paper II with optional LiDAR/sparse depth conditioning (à la Prompt Depth Anything). Measure how additional geometric input tightens metrological bounds while maintaining calibration. Deliberately excluded from Paper II to keep the RGB-only contribution clean.
One model, variable input. Accept whatever sensor streams are available, determine achievable precision, produce output at the appropriate fidelity tier. Requires paired multi-modal training corpus.
Metrologically calibrated differencing from multi-temporal captures. "This beam has deflected 12 mm ± 3 mm since last inspection." Requires repeat visits in the calibration corpus.
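A statement of the "12 mm ± 3 mm" form follows from standard uncertainty propagation for the difference of two independent measurements (illustrative numbers; the GUM root-sum-square rule, assuming uncorrelated errors between visits):

```python
import math

def change_with_uncertainty(d1, u1, d2, u2):
    """Difference of two calibrated measurements and its combined
    standard uncertainty, assuming independent errors:
    u_diff = sqrt(u1^2 + u2^2)."""
    return d2 - d1, math.sqrt(u1 ** 2 + u2 ** 2)

# Beam position at two inspections, in metres (toy values).
delta, u = change_with_uncertainty(1.250, 0.002, 1.262, 0.00224)
# delta is 12 mm; u is roughly 3 mm
```

Correlated errors between visits (e.g. a shared systematic bias that cancels in the difference) would shrink `u_diff`, which is one reason the repeat-visit calibration corpus matters.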
Soft, uncertainty-aware priors for infrastructure classes (cylinders, I-beams, lattice structures). Deviation from prior widens uncertainty and flags anomalies. Damage detection as a natural output of reconstruction.
© 2026 Wesley Ladd. All rights reserved.
Last updated: 3/23/2026