Monocular depth estimation has had an extraordinary two years. Depth Anything V2 and Depth Anything 3 scaled the data engine and unified monocular and multi-view geometry. VGGT won Best Paper at CVPR 2025. Marigold, MoGe-2, Metric3D, UniDepth v2, and Prompt Depth Anything pushed metric depth accuracy to the point where you can almost reconstruct a room from your phone.
Almost. Because none of these models produce measurements.
They produce predictions with impressive RMSE scores on KITTI and NYU Depth v2, but nothing about the uncertainty of any individual pixel, bias as a function of range, reproducibility under varying conditions, or the traceability of the reported scale to any physical reference standard. (The word "uncertainty" itself is dangerously overloaded; the field uses it to mean at least four different things, only one of which has the statistical semantics required for a measurement.)
The distinction matters, and it has a name.
Metric vs. Metrological
When the MDE literature says "metric depth," it means depth in real-world units (meters), as opposed to affine-invariant or relative depth. Metric3D has "metric" in the name. All it means is "not affine-invariant."
In measurement science, "metrological" means something far more demanding: a measurement result that is traceable to a reference standard, with a stated uncertainty, under documented conditions, with characterized systematic and random error components, such that the result is reproducible by an independent observer following the same protocol.
These are not the same thing. They are not close to the same thing. A model that predicts depth in meters with 5% relative error is producing metric depth. A model that can tell you the depth is 3.42 m ± 0.08 m (k=2, 95% confidence) with a systematic underestimate of 1.2% at ranges beyond 4 m, characterized against a NIST-traceable calibration artifact under documented illumination and viewpoint conditions, is producing metrological depth.
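To make the contrast concrete, here is a minimal sketch, with illustrative numbers only, of what the second kind of statement carries: a characterized bias correction and an expanded (k=2) coverage interval, which together make a downstream tolerance check well defined. The function name and all values are hypothetical.

```python
def measurement_result(depth_pred_m, rel_bias, u_standard_m, k=2.0):
    """Apply a characterized relative bias and report an expanded uncertainty."""
    corrected = depth_pred_m / (1.0 + rel_bias)  # undo the systematic under/overestimate
    expanded_u = k * u_standard_m                # expanded uncertainty at k=2 (~95%)
    return corrected, expanded_u

# Hypothetical values echoing the prose: a raw prediction near 3.4 m, a -1.2%
# systematic underestimate at this range, and a 0.04 m standard uncertainty.
depth_m, U_m = measurement_result(3.38, rel_bias=-0.012, u_standard_m=0.04)
print(f"depth = {depth_m:.2f} m +/- {U_m:.2f} m (k=2)")

# A clearance decision now has defined semantics: pass only if the whole
# interval clears the required minimum (value below is made up).
required_clearance_m = 3.20
print("clearance verified:", depth_m - U_m > required_clearance_m)
```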
No model in the current literature does the second thing. No benchmark evaluates for it. This is the opening.
A necessary caveat: classical metrology was built for instruments with understood measurement chains, where you can analytically propagate uncertainty through each component. Learned depth models are not that. Their internal representations are opaque, and componentwise traceability in the traditional sense may not be achievable. That does not excuse the total absence of measurement-side evaluation. It means the field needs an adapted validation regime, one that takes the normative target from measurement science (outputs usable as measurement results under documented conditions with validated uncertainty behavior) without pretending that every classical decomposition transplants intact to opaque learned systems. The right framing is metrological aspiration with empirically validated calibration, not a naive demand that neural networks behave like coordinate measuring machines.
What follows is the outline of two papers. The first defines what metrological evaluation means for monocular depth and benchmarks every major model against it. The second proposes an architecture that optimizes for metrological properties directly. The first paper creates the evaluation framework. The second fills it.
Paper 1: The Benchmark
Paper I — Beyond Metric: A Metrological Evaluation Framework for Monocular Depth Estimation
Thesis
Existing MDE benchmarks (KITTI, NYU Depth v2, ETH3D, SYNS-Patches) evaluate prediction accuracy: how close is the estimated depth to ground truth, aggregated over a dataset? They do not evaluate measurement quality: can this estimate be used as a measurement in an engineering, inspection, or regulatory context? The distinction matters because downstream applications in infrastructure inspection, autonomous systems, construction surveying, and digital twins with dimensional tolerances need not just accurate depth, but depth with quantified uncertainty, characterized error structure, and traceable calibration.
This paper (a) formalizes the metric/metrological distinction for the MDE community, (b) defines a metrological evaluation framework specifying the properties a depth estimate must satisfy to qualify as a measurement, and (c) evaluates every major MDE model against this framework using both synthetic and physical calibration protocols.
Conceptual Framework (Sections 1–2)
The introduction draws the line between metric and metrological depth with concrete examples. A bridge inspector using monocular depth to estimate clearance needs to know the uncertainty bound, not just the expected error. A deformation monitoring system tracking millimeter-scale displacement over time needs reproducibility guarantees, not dataset-aggregate absolute relative error. The existing evaluation paradigm cannot answer these questions because it was never designed to.
The conceptual framework section maps the vocabulary of measurement science (VIM: International Vocabulary of Metrology) onto depth estimation.
Key definitions
Measurand: The true depth at a specific pixel location under specified conditions.
Measurement uncertainty: A non-negative parameter characterizing the dispersion of the values attributed to the depth measurand, based on the information used.
Metrological traceability: The property of a measurement result whereby it can be related to a reference through a documented, unbroken chain of calibrations.
Systematic error: The component of measurement error that in replicate measurements remains constant or varies predictably.
Random error: The component of measurement error that varies unpredictably in replicate measurements.
This is not novel measurement science. It is established metrological vocabulary applied to a field that has never used it. The contribution is the mapping, not the metrology.
Formal Metrological Properties (Section 3)
The paper defines six metrological properties against which a depth estimation system can be evaluated. These are drawn from VDI/VDE 2634 (optical 3D measuring systems) and ISO 5725 (accuracy of measurement methods), adapted for the monocular depth setting:
| Property | Definition | Current Status |
|---|---|---|
| Probing Error | Range of depth values obtained when measuring a single point on a calibration sphere, characterizing local noise | Evaluated in one UniDepth study; not standardized |
| Length Measurement Error | Deviation between measured and calibrated distance between two known points on a ball-bar standard | Same single study; not adopted by the community |
| Planarity Deviation | RMS deviation of measured points from the best-fit plane on a known planar surface | Reported as a metric in MDEC but not in a metrological framework |
| Range-Dependent Bias | Systematic error characterized as a function of ground-truth depth (bias curve) | Not part of standard MDE benchmark practice |
| Uncertainty Calibration | Agreement between predicted intervals and observed error distributions (calibration curve) | Not systematically evaluated; most models produce no uncertainty estimate |
| Reproducibility | Variation in results when the same scene is measured under changed conditions (lighting, viewpoint, stochastic inference) | Not systematically evaluated across current benchmark suites |
These six properties are not arbitrary. They jointly cover local repeatability, geometric fidelity over known spans and surfaces, systematic structure of error, validity of uncertainty claims, and stability under operational variation. The column "Current Status" is the point: four of six have never been systematically evaluated for MDE models. This is the benchmark gap.
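As a rough illustration of how three of these properties reduce to code once reference geometry is available, here is a NumPy sketch. The array names are assumptions; the statistics follow the definitions in the table above.

```python
import numpy as np

def probing_error(depths_at_sphere_point):
    """Spread of repeated depth estimates of one calibrated surface point."""
    d = np.asarray(depths_at_sphere_point, dtype=float)
    return float(d.max() - d.min())

def planarity_deviation(points_xyz):
    """RMS point-to-plane residual against the best-fit plane (SVD fit)."""
    p = points_xyz - points_xyz.mean(axis=0)
    normal = np.linalg.svd(p, full_matrices=False)[2][-1]  # least-variance direction
    return float(np.sqrt(np.mean((p @ normal) ** 2)))

def empirical_coverage(depth_pred, sigma, depth_ref, k=2.0):
    """Fraction of reference depths inside the predicted +/- k*sigma interval."""
    inside = np.abs(depth_ref - depth_pred) <= k * sigma
    return float(inside.mean())
```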
Physical Calibration Protocol (Section 4)
The paper specifies a capture protocol using commercially available calibration artifacts: gauge blocks (length standards), ball-bar standards (per VDI/VDE 2634), and planar reference surfaces. These are imaged under controlled conditions (fixed illumination, known camera intrinsics) and in-the-wild conditions (varying illumination, handheld capture) at multiple ranges.
The physical protocol is complemented by a synthetic evaluation using rendered scenes with exact ground truth, enabling characterization of error structure at ranges and geometries that are impractical to calibrate physically.
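For the ball-bar test specifically, a minimal sketch of the length measurement error might look like the following, assuming the points on each sphere have already been back-projected to camera coordinates and segmented; the function names and inputs are hypothetical.

```python
import numpy as np

def unproject(depth_map, K, u, v):
    """Back-project pixel (u, v) with its depth into camera coordinates."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = depth_map[v, u]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def fit_sphere_center(points_xyz):
    """Linear least-squares sphere fit; only the center is needed here."""
    A = np.hstack([2.0 * points_xyz, np.ones((len(points_xyz), 1))])
    b = np.sum(points_xyz ** 2, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:3]

def length_measurement_error(sphere_a_pts, sphere_b_pts, calibrated_length_m):
    """Measured center-to-center distance minus the calibrated ball-bar length."""
    ca = fit_sphere_center(sphere_a_pts)
    cb = fit_sphere_center(sphere_b_pts)
    return float(np.linalg.norm(ca - cb)) - calibrated_length_m
```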
The key design decision: the benchmark evaluates models as "measurement instruments," not as "predictors." The evaluation asks: if I treat this model's output as a measurement, what are the metrological properties of that measurement? This framing is the contribution.
Evaluation Campaign (Section 5)
Every major model gets evaluated. No exceptions. The minimum set:
| Model | Paradigm |
|---|---|
| Depth Anything V2 | Discriminative |
| Depth Anything 3 | Unified |
| Metric3D | Discriminative |
| UniDepth v2 | Discriminative |
| MoGe / MoGe-2 | Discriminative |
| VGGT | Unified |
| Marigold v1.1 | Generative (diffusion) |
| Prompt Depth Anything | Discriminative |
| Depth Pro | Discriminative |
| DAR (2B) | Autoregressive |
For each model, the paper reports all six metrological properties plus the standard accuracy metrics for cross-reference with the existing literature. The analysis section characterizes how paradigm (discriminative vs. generative vs. unified), backbone (ViT scale), and training data (synthetic vs. real vs. mixed) correlate with metrological performance. The hypothesis is that metrological properties do not track neatly with prediction accuracy: a model with lower RMSE may have worse uncertainty calibration or higher systematic bias at specific ranges.
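One piece of that analysis, the range-dependent bias curve, is simple to state: bin the signed error by ground-truth depth instead of averaging it away. A minimal NumPy sketch, with assumed bin edges:

```python
import numpy as np

def bias_curve(depth_pred, depth_ref, edges_m=(0, 2, 4, 8, 16, 32)):
    """Mean signed error per ground-truth range bin (edges are assumptions)."""
    err = depth_pred - depth_ref
    edges = np.asarray(edges_m, dtype=float)
    bins = np.digitize(depth_ref, edges[1:-1])
    return [
        (f"{edges[i]:.0f}-{edges[i + 1]:.0f} m", float(err[bins == i].mean()))
        for i in range(len(edges) - 1)
        if np.any(bins == i)
    ]
```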
Paper 2: The Method
Paper II — Metrological Monocular Depth Estimation with Calibrated Uncertainty Quantification (depends on Paper I establishing the evaluation framework)
Thesis
Given the metrological evaluation framework from Paper I, this paper proposes an MDE architecture designed from the ground up to produce calibrated, metrologically traceable depth estimates. The model does not merely predict depth with a confidence score. It emits a measurement contract: a per-pixel depth estimate, a calibrated prediction interval, a characterized bias correction term, and validity metadata indicating the constraint regime (geometric, learned prior, or extrapolated) under which the estimate was produced.
The contribution is not "better depth estimation." If the model matches DA3 on RMSE, that is sufficient. The contribution is that the output constitutes a measurement result in the metrological sense: uncertainty estimates are calibrated (predicted intervals match observed error distributions), bias structure is characterized and correctable, and the consumer can distinguish geometrically grounded estimates from learned-prior gap fills. The model behaves like an instrument, not a score-producing black box.
Architecture (Section 3)
The backbone is a DINOv2 ViT encoder, following the DA2/DA3 precedent. The decoder produces four outputs: a depth map, a heteroscedastic aleatoric uncertainty map (log-variance parameterization), a predicted bias correction term, and a per-pixel constraint-source classification (geometric, learned prior, or extrapolated). Together these constitute the measurement contract. The loss function combines:
- Negative log-likelihood loss under a Laplacian or Gaussian likelihood, which jointly trains the depth prediction and the uncertainty estimate. This is the mechanism that makes the uncertainty calibrated rather than decorative (a minimal sketch follows this list).
- Scale-invariant gradient loss for edge quality, following Marigold's lesson that boundary sharpness matters for geometric accuracy downstream.
- Normal consistency loss, following DA3's finding that surface-normal supervision improves the geometric quality of the teacher model.
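As referenced above, here is a minimal sketch of the NLL term under a Laplace likelihood, in PyTorch with assumed tensor shapes; the additive log 2 constant is dropped.

```python
import torch

def laplace_nll_loss(depth_pred, log_scale, depth_gt, valid_mask):
    """NLL of depth_gt under Laplace(depth_pred, exp(log_scale)), constant dropped.

    Large errors made with a small predicted scale are penalized hardest, which
    is what ties the uncertainty output to the depth output during training.
    """
    scale = torch.exp(log_scale)
    nll = torch.abs(depth_gt - depth_pred) / scale + log_scale
    return nll[valid_mask].mean()
```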
The key architectural addition is a calibration head: a lightweight MLP that takes the raw uncertainty output and maps it to calibrated prediction intervals via temperature scaling, trained on a held-out calibration set with physical ground truth (not the same data used for depth training). This is analogous to Platt scaling for classification confidence, applied to regression uncertainty.
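In its simplest form the calibration step is a single temperature fitted on the held-out set; the MLP variant conditions that scaling on features. A sketch of the scalar version, assuming detached model outputs and the Laplace likelihood above:

```python
import torch

def fit_temperature(depth_pred, log_scale, depth_gt, steps=200, lr=0.05):
    """Fit a scalar log-temperature by minimizing Laplace NLL on held-out data."""
    depth_pred, log_scale = depth_pred.detach(), log_scale.detach()
    abs_err = torch.abs(depth_gt - depth_pred)
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Temperature-scaled spread: exp(log_scale + log_t)
        nll = (abs_err * torch.exp(-(log_scale + log_t)) + log_scale + log_t).mean()
        nll.backward()
        opt.step()
    return float(log_t.exp())  # multiply predicted scales by this at inference
```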
Design decision
LiDAR conditioning (à la Prompt Depth Anything) is a natural extension but is deliberately excluded from the first version of this paper to keep the contribution clean: the model must produce metrologically characterized output from RGB alone. The LiDAR-conditioned variant is a follow-up experiment or a third paper, not a confound in the method contribution.
Training Regime (Section 4)
Teacher–student distillation following the DA2/DA3 paradigm. The teacher is trained on large-scale labeled data (synthetic + real). The student distills from the teacher on unlabeled data with pseudo-labels. The uncertainty calibration head is trained post hoc on a separate calibration dataset containing physical ground truth from calibration artifacts and high-quality LiDAR scans.
Critical detail: the calibration dataset must be independent of the training data. This is a metrological requirement, not a convenience. A measurement instrument calibrated against the same data used to build it has circular traceability. The paper must demonstrate that the calibration holds on data the model has never seen, captured with equipment the model was not trained on.
Experiments (Section 5)
The experimental section has two halves. The first half is the standard accuracy evaluation on existing benchmarks (KITTI, NYU Depth v2, ETH3D) to demonstrate competitive depth quality. The model does not need to set a new SOTA. It needs to be within the pack.
The second half evaluates against the Paper I benchmark. This is where the contribution lives. The key results:
- Uncertainty calibration curves (predicted confidence vs. observed coverage) showing that the model's uncertainty estimates are well calibrated where existing models either produce no uncertainty or produce uncalibrated confidence.
- Range-dependent bias curves showing characterized systematic error, with a demonstration that the bias can be corrected using the model's own output, a practical feature no existing model offers (see the sketch after this list).
- Probing error and length measurement error on physical calibration artifacts, compared to all Paper I baselines.
- Reproducibility under varied conditions, demonstrating that the model's stochastic variation (if any) is bounded and characterized.
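The self-correction check in the second bullet is mechanically simple; here is a minimal NumPy sketch with hypothetical output names (the decoder's depth and bias maps against reference depth):

```python
import numpy as np

def bias_correction_gain(depth_pred, bias_pred, depth_ref):
    """RMSE before vs. after subtracting the model's own predicted bias term."""
    rmse = lambda e: float(np.sqrt(np.mean(e ** 2)))
    before = rmse(depth_pred - depth_ref)
    after = rmse((depth_pred - bias_pred) - depth_ref)
    return before, after
```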
Dependency Graph
The two papers are designed to be self-reinforcing but independently publishable. Here is the dependency structure:
Paper I: Benchmark
├── Conceptual framework (metric vs. metrological)
├── Six formal metrological properties
├── Physical calibration protocol
├── Evaluation of 10+ existing models
└── Public benchmark release
│ establishes evaluation framework for
▼
Paper II: Method
├── DINOv2 + measurement contract decoder
│ (depth, calibrated interval, bias term, constraint source)
├── NLL loss for calibrated uncertainty
├── Post-hoc calibration head (Platt-style)
├── Standard benchmarks (competitive, not SOTA)
└── Paper I benchmark (this is where you win)
│ enables
▼
[Future] LiDAR-conditioned variant
[Future] Video temporal consistency
[Future] Multi-view metrological fusion
What This Isn't
This is not a plan to beat DA3 on RMSE. I will not beat a ByteDance team with a ViT trained on their internal data engine. That is not the game.
But it is worth asking why RMSE is the game at all. MDE currently ranks predictors as if the relevant consumer were a leaderboard, not an engineer with liability. RMSE is a dataset-aggregate statistic. It collapses all spatial, range-dependent, and conditional variation into a single number. Two models with identical RMSE can have wildly different bias curves, uncertainty profiles, and failure modes. One may be mediocre everywhere. The other may be excellent at close range and catastrophically wrong at distance. The single number cannot distinguish them. The metric was designed for ranking research papers, not for answering "can I use this measurement?"
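A toy numeric illustration of that collapse, with made-up per-bin errors: two error profiles that RMSE cannot tell apart.

```python
import numpy as np

ranges_m = np.array([2.0, 4.0, 8.0, 16.0])     # ground-truth depth bins
err_a = np.array([0.10, 0.10, 0.10, 0.10])     # mediocre everywhere
err_b = np.array([0.02, 0.03, 0.05, 0.19])     # sharp up close, poor at distance

rmse = lambda e: float(np.sqrt(np.mean(e ** 2)))
print(rmse(err_a), rmse(err_b))  # both ~0.10: indistinguishable by the aggregate
```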
This is a plan to redefine what "good" means for depth estimation in contexts where the output is treated as a measurement rather than a visual prior. The benchmark paper says: you have been evaluating predictions when you should have been evaluating measurements. The method paper says: here is what it looks like when you optimize for measurement quality.
The bet is bigger than applied relevance. The claim is that models optimized for metrological properties will produce better RMSE scores as a side effect. Current depth models optimize for pixel-wise prediction accuracy against ground-truth maps: a loss that treats every pixel independently, ignores bias structure, and discards information about how confidently wrong the model is. This is a poor facsimile of what the actual objective should be. An NLL loss that forces the model to jointly predict depth and calibrated uncertainty penalizes confident errors more heavily than uncertain ones, which reshapes the loss landscape in ways that L1 and scale-invariant logarithmic losses cannot. A model that characterizes its own bias curve can self-correct. A model that knows where it is uncertain allocates capacity toward the regions that matter. The metrological framing is not a tradeoff against prediction accuracy; it is a strictly better optimization target that subsumes it.
I am calling this shot explicitly so it can be explicitly falsified. The claim is testable and the conditions for falsification are concrete:
The claim is false if: a model trained with an NLL loss (jointly predicting depth and heteroscedastic uncertainty) produces significantly worse RMSE than the same architecture trained with an L1 or scale-invariant logarithmic loss on the same data, at the same compute budget, across standard benchmarks. "Significantly" means outside the noise floor of training variance: not 0.2% on KITTI, but a consistent degradation across datasets and scales that cannot be closed by bias correction using the model's own uncertainty output.
The claim survives if: metrological optimization produces RMSE within the competitive envelope of point-estimate losses and the bias-corrected output (using the model's characterized error structure) matches or exceeds the uncorrected point-estimate baseline. The uncertainty and calibration are then genuinely free: no accuracy was traded to get them.
The strong form of the claim is confirmed if: metrological optimization produces lower RMSE than point-estimate losses at matched architecture and data, because heteroscedastic weighting and bias self-correction provide a better inductive bias than treating all pixels and all errors as equal.
Paper II is designed to run exactly this experiment. The ablation is straightforward: same backbone, same data, same compute, different loss. If the metrological loss wins on RMSE while also producing calibrated uncertainty, that is a result that reshapes how the field thinks about its optimization targets. If it ties on RMSE while adding calibration, that is still a strong contribution but a weaker thesis. If it loses on RMSE, the claim is wrong and I will say so.
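A minimal sketch of how that verdict could be read off across seeds, with hypothetical inputs (per-seed RMSE lists for the two losses at matched architecture, data, and compute):

```python
import numpy as np

def ablation_verdict(rmse_nll_seeds, rmse_baseline_seeds):
    """Compare the mean RMSE gap with the pooled across-seed spread."""
    gap = float(np.mean(rmse_nll_seeds) - np.mean(rmse_baseline_seeds))
    noise = float(np.sqrt(np.var(rmse_nll_seeds) + np.var(rmse_baseline_seeds)))
    if gap > noise:
        return "claim falsified: the NLL loss is consistently worse on RMSE"
    if gap < -noise:
        return "strong form confirmed: the NLL loss wins on RMSE"
    return "claim survives: RMSE within the competitive envelope"
```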
As monocular depth leaves the leaderboard and enters contexts with tolerances, liability, and traceability requirements, the distinction between metric and metrological depth will stop sounding philosophical and start sounding overdue. The first people to formalize it get to define the terms.