The computer vision community has spent the last two years making extraordinary progress on monocular depth estimation. Depth Anything V2 scaled the data engine. VGGT won Best Paper at CVPR 2025 by predicting cameras, depth maps, point maps, and 3D tracks in a single feed-forward pass. Depth Anything 3 unified monocular and multi-view geometry under a plain DINOv2 transformer with a depth-ray target. Marigold proved that Stable Diffusion priors produce astonishing zero-shot boundary sharpness. MoGe, Metric3D, UniDepth v2, and Prompt Depth Anything pushed metric accuracy to the point where you can almost reconstruct a room from your phone.
Almost. Because none of these models produce measurements.
They produce predictions. Predictions with impressive RMSE scores on KITTI and NYUv2, evaluated against metrics that tell you how close the median pixel is to the ground truth, but nothing about the uncertainty of any individual pixel, bias as a function of range, reproducibility under varying conditions, or the traceability of the reported scale to any physical reference standard.
This is the gap. And I think it is a publishable, important, underexplored gap.
What follows is the outline of two papers. The first defines what metrological evaluation means for monocular depth and benchmarks every major model against it. The second proposes an architecture that optimizes for metrological properties directly. The first paper creates the evaluation framework. The second fills it.
The Conflation
When the MDE literature says "metric depth," it means depth in real-world units (meters), as opposed to affine-invariant or relative depth. Metric3D has "metric" in the name. All it means is "not affine-invariant."
In measurement science, "metric" means something far more demanding: a measurement result that is traceable to a reference standard, with a stated uncertainty, under documented conditions, with characterized systematic and random error components, such that the result is reproducible by an independent observer following the same protocol.
These are not the same thing. They are not close to the same thing. A model that predicts depth in meters with 5% relative error is producing metric depth. A model that can tell you the depth is 3.42 m ± 0.08 m (k = 2 expanded uncertainty, 95% confidence), with a systematic underestimate of 1.2% at ranges beyond 4 m, characterized against a NIST-traceable calibration artifact under documented illumination and viewpoint conditions, is producing metrological depth.
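As a concrete sketch of the second kind of statement, here is how replicate readings of a single point could be summarized as a measurement result with an expanded uncertainty U = k·u. The readings and the helper name are hypothetical, and the k = 2 coverage convention assumes near-Gaussian errors, per standard GUM-style reporting.

```python
import math

def measurement_result(samples, k=2.0):
    """Summarize replicate depth readings as a metrological result:
    best estimate, standard uncertainty of the mean, and expanded
    uncertainty U = k * u (k=2 ~ 95% coverage for near-Gaussian errors)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)  # sample variance
    u = math.sqrt(var / n)  # standard uncertainty of the mean
    return {"value": mean, "u": u, "U": k * u, "k": k}

# Hypothetical replicate readings of one point, in meters.
readings = [3.44, 3.40, 3.43, 3.41, 3.42]
r = measurement_result(readings)
print(f"{r['value']:.3f} m +/- {r['U']:.3f} m (k={r['k']:.0f})")
```

The point is the shape of the output: a value, a dispersion, and a stated coverage factor, rather than a bare number in meters.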
No model in the current literature does the second thing. No benchmark evaluates for it. This is the opening.
Paper 1: The Benchmark
Paper I — Beyond Metric: A Metrological Evaluation Framework for Monocular Depth Estimation
Working title. The final title should be shorter and punchier.
Type: Benchmark / Evaluation · Compute: Inference only · Timeline: 6 months · Priority: First
Thesis
Existing MDE benchmarks (KITTI, NYUv2, ETH3D, SYNS-Patches) evaluate prediction accuracy: how close is the estimated depth to ground truth, aggregated over a dataset? They do not evaluate measurement quality: can this estimate be used as a measurement in an engineering, inspection, or regulatory context? The distinction matters because downstream applications in infrastructure inspection, autonomous systems, construction surveying, and digital twins with dimensional tolerances need not just accurate depth, but depth with quantified uncertainty, characterized error structure, and traceable calibration.
This paper (a) formalizes the metric/metrological distinction for the MDE community, (b) defines a metrological evaluation framework specifying the properties a depth estimate must satisfy to qualify as a measurement, and (c) evaluates every major MDE model against this framework using both synthetic and physical calibration protocols.
Conceptual Framework (Sections 1–2)
The introduction draws the line between metric and metrological depth with concrete examples. A bridge inspector using monocular depth to estimate clearance needs to know the uncertainty bound, not just the expected error. A deformation monitoring system tracking millimeter-scale displacement over time needs reproducibility guarantees, not dataset-aggregate AbsRel. The existing evaluation paradigm cannot answer these questions because it was never designed to.
The conceptual framework section maps the vocabulary of measurement science (the VIM, the International Vocabulary of Metrology) onto depth estimation.
Key definitions
Measurand: The true depth at a specific pixel location under specified conditions.
Measurement uncertainty: A non-negative parameter characterizing the dispersion of values attributed to the depth measurand, based on the information used.
Metrological traceability: The property of a measurement result whereby it can be related to a reference through a documented, unbroken chain of calibrations.
Systematic error: The component of measurement error that, in replicate measurements, remains constant or varies predictably.
Random error: The component of measurement error that varies unpredictably in replicate measurements.
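The last two definitions have a direct operational form. A minimal sketch, assuming replicate predictions of one point with known true depth (the numbers and helper name are hypothetical): the mean signed error estimates the systematic component, and the spread of errors around it estimates the random component.

```python
import statistics

def error_components(predicted, true_value):
    """Split replicate measurement errors into a systematic component
    (mean bias, correctable) and a random component (spread, which
    does not cancel in a single frame)."""
    errors = [p - true_value for p in predicted]
    systematic = statistics.mean(errors)   # bias
    random_sd = statistics.stdev(errors)   # repeatability spread
    return systematic, random_sd

# Hypothetical replicate predictions of a point whose true depth is 5.00 m.
preds = [4.93, 4.95, 4.91, 4.94, 4.92]
bias, spread = error_components(preds, 5.00)
```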
This is not novel measurement science. It is established metrological vocabulary applied to a field that has never used it. The contribution is the mapping, not the metrology.
Formal Metrological Properties (Section 3)
The paper defines six metrological properties that a depth estimation system can be evaluated against. These are drawn from VDI/VDE 2634 (optical 3D measuring systems) and ISO 5725 (accuracy of measurement methods), adapted for the monocular depth setting:
| Property | Definition | Current Status |
|---|---|---|
| Probing Error | Range of depth values obtained when measuring a single point on a calibration sphere, characterizing local noise | Evaluated in one UniDepth study; not standardized |
| Length Measurement Error | Deviation between measured and calibrated distance between two known points on a ball-bar standard | Same single study; not adopted by the community |
| Planarity Deviation | RMS deviation of measured points from the best-fit plane on a known planar surface | Reported as a metric in MDEC but not in a metrological framework |
| Range-Dependent Bias | Systematic error characterized as a function of ground-truth depth (bias curve) | Not evaluated in any benchmark |
| Uncertainty Calibration | Agreement between predicted confidence intervals and observed error distributions (calibration curve) | Not evaluated; most models produce no uncertainty estimate |
| Reproducibility | Variation in results when the same scene is measured under changed conditions (lighting, viewpoint, stochastic inference) | Not evaluated in any benchmark |
The column "Current Status" is the point. Four of six properties have never been systematically evaluated for MDE models. This is the benchmark gap.
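Two of the missing properties are straightforward to operationalize. A sketch, with hypothetical helper names and data, of planarity deviation (RMS point-to-plane residual against an SVD best-fit plane) and a range-dependent bias curve (mean signed error per ground-truth depth bin):

```python
import numpy as np

def planarity_deviation(points):
    """RMS point-to-plane residual against the least-squares plane,
    whose normal is the direction of least variance of the centered
    (N x 3) point cloud, found via SVD."""
    centered = points - points.mean(axis=0)
    normal = np.linalg.svd(centered, full_matrices=False)[2][-1]
    return float(np.sqrt(np.mean((centered @ normal) ** 2)))

def range_bias_curve(pred, gt, edges):
    """Mean signed error per ground-truth range bin (a bias curve)."""
    bins = np.digitize(gt, edges)
    return {int(i): float(np.mean(pred[bins == i] - gt[bins == i]))
            for i in np.unique(bins)}

# Hypothetical sanity checks: an exact plane has ~zero deviation, and a
# constant +10 cm offset shows up as +0.10 bias in every range bin.
xy = np.stack(np.meshgrid(np.arange(5.0), np.arange(5.0)), -1).reshape(-1, 2)
plane = np.c_[xy, 0.3 * xy[:, 0] - 0.2 * xy[:, 1] + 1.0]
dev = planarity_deviation(plane)

gt = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
curve = range_bias_curve(gt + 0.10, gt, edges=[0.0, 5.0])
```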
Physical Calibration Protocol (Section 4)
The paper specifies a capture protocol using commercially available calibration artifacts: gauge blocks (length standards), ball-bar standards (per VDI/VDE 2634), and planar reference surfaces. These are imaged under controlled conditions (fixed illumination, known camera intrinsics) and in-the-wild conditions (varying illumination, handheld capture) at multiple ranges.
The physical protocol is complemented by a synthetic evaluation using rendered scenes with exact ground truth, enabling characterization of error structure at ranges and geometries that are impractical to calibrate physically.
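As a sketch of how the ball-bar test could be scored, assuming a pinhole camera model and hypothetical pixel coordinates, depths, and intrinsics: backproject the two sphere centers using the model's predicted depths and compare the measured center-to-center distance to the calibrated length.

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Lift pixel (u, v) with predicted depth z to a 3D point under a
    pinhole model: X = (u - cx) z / fx, Y = (v - cy) z / fy, Z = z."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def length_measurement_error(p1, p2, calibrated_length):
    """VDI/VDE 2634-style length error: measured center-to-center
    distance minus the calibrated ball-bar length."""
    return float(np.linalg.norm(p1 - p2) - calibrated_length)

# Hypothetical scene: two sphere centers at known pixels, with depths
# read from the model's depth map; intrinsics fx = fy = 600, cx = cy = 320.
a = backproject(200, 320, 2.00, 600, 600, 320, 320)
b = backproject(440, 320, 2.05, 600, 600, 320, 320)
err = length_measurement_error(a, b, calibrated_length=0.500)
```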
The key design decision: the benchmark evaluates models as "measurement instruments," not as "predictors." The evaluation asks: if I treat this model's output as a measurement, what are the metrological properties of that measurement? This framing is the contribution.
Evaluation Campaign (Section 5)
Every major model gets evaluated. No exceptions. The minimum set:
| Model | Paradigm |
|---|---|
| Depth Anything V2 | Discriminative |
| Depth Anything 3 | Unified |
| Metric3D | Discriminative |
| UniDepth v2 | Discriminative |
| MoGe / MoGe-2 | Discriminative |
| VGGT | Unified |
| Marigold v1.1 | Generative (diffusion) |
| Prompt Depth Anything | Discriminative |
| Depth Pro | Discriminative |
| DAR (2B) | Autoregressive |
For each model, the paper reports all six metrological properties plus the standard accuracy metrics for cross-reference with the existing literature. The analysis section characterizes how paradigm (discriminative vs. generative vs. unified), backbone (ViT scale), and training data (synthetic vs. real-world vs. mixed) correlate with metrological performance. The hypothesis is that metrological properties do not track neatly with prediction accuracy: a model with lower RMSE may have worse uncertainty calibration or higher bias at specific ranges.
Target Venues
Primary: CVPR or ECCV (benchmark/dataset track). Both venues have established precedent for benchmark papers, and the metrological framing is differentiated enough from existing benchmarks to clear the novelty bar.
Alternatives: 3DV (strong fit for geometric evaluation), IEEE Transactions on Instrumentation and Measurement (metrological rigor valued as a primary contribution), and the ISPRS Journal (the photogrammetry and surveying community cares about this natively).
Workshop path: the MDEC workshop at CVPR for early framing, then a full paper to a main conference.
Paper 2: The Method
Paper II — Metrological Monocular Depth Estimation with Calibrated Uncertainty Quantification
Depends on Paper I being published or at least available as a preprint on arXiv. Submitted 3–6 months after Paper I.
Type: Method · Compute: Training required · Timeline: 9–12 months · Priority: Second
Thesis
Given the metrological evaluation framework from Paper I, this paper proposes an MDE architecture designed from the ground up to produce calibrated, metrologically traceable depth estimates. The model does not merely predict depth; it predicts depth with a per-pixel uncertainty estimate that is calibrated against physical ground truth, and it exposes the bias structure of its predictions so that a downstream consumer can apply corrections appropriate to their operating conditions.
The contribution is not "better depth estimation." If the model matches DA3 on RMSE, that is sufficient. The contribution is that the uncertainty estimates are calibrated (predicted prediction intervals match observed error distributions), the bias structure is characterized and correctable, and the full output constitutes a measurement result in the measurement-science sense.
Architecture (Section 3)
The backbone is a DINOv2 encoder (ViT-L or ViT-G), following the DA2/DA3 precedent. The decoder produces two outputs: a depth map and a heteroscedastic aleatoric uncertainty map (log-variance parameterization). The loss function combines:
- Negative log-likelihood (NLL) loss under a Laplacian or Gaussian likelihood, which jointly trains the depth prediction and the uncertainty estimate. This is the mechanism that makes the uncertainty calibrated rather than decorative.
- Scale-invariant gradient loss for edge quality, following Marigold's lesson that boundary sharpness matters for geometric accuracy downstream.
- Normal consistency loss, following DA3's finding that surface normal supervision improves the geometric quality of the teacher model.
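The NLL term is the load-bearing one, so a minimal sketch may help. This assumes the Laplacian variant with a per-pixel log-scale head; the function and argument names are illustrative, not from the paper:

```python
import numpy as np

def laplace_nll(pred_depth, pred_log_b, gt_depth):
    """Per-pixel Laplace negative log-likelihood (NumPy sketch).

    pred_log_b parameterizes the log of the Laplace scale b, keeping b
    positive. Two competing terms: |d - d_hat| / b down-weights depth
    error where the model admits high uncertainty, while log(2b)
    penalizes claiming high uncertainty everywhere. Minimizing both
    jointly is what ties the predicted spread to the actual errors.
    """
    b = np.exp(pred_log_b)
    nll = np.abs(gt_depth - pred_depth) / b + np.log(2.0 * b)
    return nll.mean()
```

In a real pipeline this runs on decoder outputs, masked to pixels with valid ground truth, with the gradient and normal terms added on top under tuned weights.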
The key architectural addition is a calibration head: a lightweight MLP that takes the raw uncertainty output and maps it to calibrated confidence intervals via temperature scaling, trained on a held-out calibration set with physical ground truth (not the same data used for depth training). This is analogous to Platt scaling for classification confidence, applied to regression uncertainty.
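The simplest special case of this head is a single scalar temperature fitted on the held-out set; the MLP generalizes it to an input-dependent correction. A hedged sketch of the scalar case, assuming the Laplace parameterization (names and the grid search are illustrative):

```python
import numpy as np

def fit_temperature(pred_depth, pred_scale, gt_depth, grid=None):
    """Single-scalar post-hoc calibration: find t so that rescaled
    uncertainties t * b minimize the held-out Laplace NLL. A 1-D grid
    search is sufficient for one parameter; an MLP head would replace
    the scalar t with a learned function of the raw uncertainty."""
    if grid is None:
        grid = np.geomspace(0.1, 10.0, 400)
    err = np.abs(gt_depth - pred_depth)
    nlls = [np.mean(err / (t * pred_scale) + np.log(2.0 * t * pred_scale))
            for t in grid]
    return grid[int(np.argmin(nlls))]
```

Because the fit happens on held-out physical ground truth, the depth weights stay frozen and the calibration claim is not circular.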
Design decision
LiDAR conditioning (à la Prompt Depth Anything) is a natural extension but is deliberately excluded from the first version of this paper to keep the contribution clean: the model must produce metrologically characterized output from RGB alone. The LiDAR-conditioned variant is a follow-up experiment or a third paper, not a confound in the method contribution.
Training Regime (Section 4)
Teacher–student distillation following the DA2/DA3 paradigm. The teacher is trained on large-scale labeled data (synthetic + real). The student distills from the teacher on unlabeled data with pseudo-labels. The uncertainty calibration head is trained post hoc on a separate calibration dataset containing physical ground truth from calibration artifacts and high-quality LiDAR scans.
Critical detail: the calibration dataset must be independent of the training data. This is a metrological requirement, not a convenience. A measurement instrument calibrated against the same data used to build it has circular traceability. The paper must demonstrate that the calibration holds on data the model has never seen, captured with equipment the model was not trained on.
Experiments (Section 5)
The experimental section has two halves. The first half is the standard accuracy evaluation on existing benchmarks (KITTI, NYU Depth v2, ETH3D) to demonstrate competitive depth quality. The model does not need to set a new SOTA. It needs to be within the pack.
The second half evaluates against the Paper I benchmark. This is where the contribution lives. The key results:
- Calibration curves (predicted confidence vs. observed coverage) showing that the model's uncertainty estimates are well-calibrated, where existing models either produce no uncertainty or produce uncalibrated confidence.
- Range-dependent bias curves showing characterized systematic error, with a demonstration that the bias can be corrected using the model's own output (a practical feature no existing model offers).
- Probing error and length measurement error on physical calibration artifacts, compared to all Paper I baselines.
- Reproducibility under varied conditions, demonstrating that the model's stochastic variation (if any) is bounded and characterized.
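The coverage computation behind the first of these results is simple to state. A minimal sketch, again assuming Laplace prediction intervals (all names illustrative):

```python
import numpy as np

def empirical_coverage(pred_depth, pred_scale, gt_depth, level=0.95):
    """Fraction of ground truth falling inside the model's central
    prediction interval at the nominal level, under a Laplace error
    model. For Laplace(0, b), P(|err| <= h) = 1 - exp(-h / b), so the
    half-width of the central `level` interval is h = -b * ln(1 - level).
    A calibrated model yields coverage close to the nominal level."""
    half_width = -pred_scale * np.log(1.0 - level)
    inside = np.abs(gt_depth - pred_depth) <= half_width
    return inside.mean()
```

Sweeping `level` over (0, 1) and plotting empirical against nominal coverage yields the calibration curve; systematic deviation from the diagonal is exactly the failure mode the benchmark is built to expose.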
Target Venues
Venue strategy
Primary: CVPR, ECCV, or ICCV main conference. The method paper benefits from Paper I having established the evaluation framework. Reviewers can look up the benchmark.
Alternative: NeurIPS (if the UQ angle is framed as a probabilistic modeling contribution: aleatoric/epistemic uncertainty, calibrated likelihoods), AAAI, or WACV (lower bar, still solid).
Dependency Graph
The two papers are designed to be self-reinforcing but independently publishable. Here is the dependency structure:
Paper I: Benchmark
├── Conceptual framework (metric vs. metrological)
├── Six formal metrological properties
├── Physical calibration protocol
├── Evaluation of 10+ existing models
└── Public benchmark release
│ establishes evaluation framework for
▼
Paper II: Method
├── DINOv2 + heteroscedastic uncertainty decoder
├── NLL loss for calibrated uncertainty
├── Post-hoc calibration head (Platt-style)
├── Standard benchmarks (competitive, not SOTA)
└── Paper I benchmark (this is where you win)
│ enables
▼
[Future] LiDAR-conditioned variant
[Future] Video temporal consistency
[Future] Multi-view metrological fusion
What This Isn't
This is not a plan to beat DA3 on RMSE. I will not beat a ByteDance team with a ViT trained on their internal data engine. That is not the game.
This is a plan to redefine what "good" means for depth estimation in contexts where the output is treated as a measurement rather than a visual prior. The benchmark paper says: you have been evaluating predictions when you should have been evaluating measurements. The method paper says: here is what it looks like when you optimize for measurement quality.
The bet is that as monocular depth moves from research artifact to deployed measurement tool in infrastructure, surveying, construction, autonomous systems, and digital twins, the metrological framing will become necessary rather than optional. The first people to formalize it get to define the terms.
That is a game worth playing, and one where an applied researcher with a measurement science background and a philosophy of rigorous attention has a genuine, non-commoditized advantage.