"SELF-SUPERVISED LEARNING FROM SCRATCH."
That's the title of your paper. Your NeurIPS spotlight presentation. The thing you put on your CV. Let me ask you something: At what point during the 50,000 iterations of training did you plan to mention that your "from scratch" model starts with a ResNet-50 backbone pretrained on 1.4 million images that humans spent YEARS labeling?
Oh, what's that? It's "standard practice"? It's "just initialization"?
THAT'S NOT INITIALIZATION. THAT'S YOUR ENTIRE WORLDVIEW. Those weights encode whether skies are blue, whether cars have wheels, whether faces have eyes in particular configurations. That's not a random seed. That's 2,048-dimensional semantic knowledge distilled from human annotation, and you're calling it "unsupervised" because you didn't use labels in the LAST part of training.
That's like inheriting a billion dollars, buying a lottery ticket with it, winning $20, and writing a memoir called "Self-Made: How I Manifested Wealth Through Positive Visualization."
But it gets BETTER. Let's talk about your "bootstrapping from motion and geometry" approach. You're using optical flow to find correspondences, right? Cool, cool. And that optical flow is computed how, exactly?
Oh, with RAFT? Which was trained on FlyingChairs and FlyingThings3D, then fine-tuned on Sintel? Which are SYNTHETIC datasets created by HUMANS who decided what "realistic motion" looks like?
Or maybe you're using classical optical flow? Lucas-Kanade? Which assumes BRIGHTNESS CONSTANCY - a human-designed prior about how the world works? Which assumes LOCAL SMOOTHNESS - another human prior? Which uses GAUSSIAN PYRAMIDS for multi-scale processing - human-designed architecture?
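The whole Lucas-Kanade stack fits in a few lines, and every one of those lines is a human-designed assumption. A minimal single-window sketch (pyramids omitted; the window size and the toy quadratic images are purely illustrative):

```python
import numpy as np

def lucas_kanade_flow(I0, I1, y, x, win=7):
    """Single-window Lucas-Kanade: solve Ix*u + Iy*v = -It by least
    squares over a small patch. Each ingredient is a human prior:
    brightness constancy (intensity is conserved between frames) and
    local smoothness (the whole window shares one motion). Real
    pipelines add Gaussian pyramids on top -- a third human design."""
    Iy, Ix = np.gradient(I0.astype(float))    # spatial gradients
    It = I1.astype(float) - I0.astype(float)  # temporal gradient
    r = win // 2
    sl = (slice(y - r, y + r + 1), slice(x - r, x + r + 1))
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)  # (win*win, 2)
    b = -It[sl].ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# Toy check: a quadratic intensity bowl shifted one pixel in x.
yy, xx = np.mgrid[0:32, 0:32].astype(float)
I0 = (xx - 16) ** 2 + (yy - 16) ** 2
I1 = (xx - 17) ** 2 + (yy - 16) ** 2  # same pattern, moved +1 px in x
u, v = lucas_kanade_flow(I0, I1, 16, 16)
print(f"u={u:.2f}, v={v:.2f}")  # recovers the shift: u≈1.00, v≈0.00
```

Notice that the only "learning" here is a least-squares solve; the rest is assumptions someone wrote down in 1981.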
You didn't bootstrap SHIT. You inherited a 60-year-old pipeline from computer vision researchers and slapped a transformer on the end. That's not scientific revolution. That's intellectual gentrification. You're Starbucks moving into the neighborhood and calling it "artisanal coffee culture."
Here's what ACTUALLY happened in your "self-supervised" training run:
Iteration 1: Load ImageNet weights (1.4M human labels worth of knowledge)
Iterations 2-10,000: Learn that objects that move together are probably the same thing (a prior you inherited from Gestalt psychology via whoever designed your architecture)
Iterations 10,001-50,000: Overfit to the specific motion statistics of internet videos uploaded by humans using cameras designed for human vision
Congratulations. You've discovered that YouTube videos contain objects. Should we alert the Nobel committee?
But let's talk about what really ENRAGES me. The hallucination problem. The thing you REFUSE to acknowledge.
Your monocular depth network outputs a depth value for every pixel. Every. Single. Pixel. At 100 meters. At 1 kilometer. At THE MOON if I point the camera up. It always gives me a number. Always confident. Always dense.
Let me tell you about information theory. Actually, fuck it, let me tell you about PHYSICS, since apparently we need to go back that far.
With a 7.5cm baseline stereo camera:
- At 10m: ±1.7m uncertainty
- At 50m: ±42m uncertainty
- At 100m: ±167m uncertainty
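Those numbers fall straight out of the standard triangulation error model, sigma_z = z² · sigma_d / (b · f). The 7.5cm baseline is from the text; the 800 px focal length and 1 px disparity noise are assumptions I'm making here, chosen because they reproduce the quoted figures:

```python
def stereo_depth_sigma(z_m, baseline_m=0.075, focal_px=800.0, disp_noise_px=1.0):
    """Standard deviation of triangulated depth at range z_m.
    Disparity d = b*f/z shrinks as 1/z while the pixel noise on d
    stays constant, so depth error grows as z squared."""
    return (z_m ** 2) * disp_noise_px / (baseline_m * focal_px)

for z in (10, 50, 100):
    d = 0.075 * 800.0 / z  # disparity in pixels at this range
    print(f"At {z:>3}m: disparity {d:.1f}px, depth ±{stereo_depth_sigma(z):.1f}m")
```

Run it and look at the disparity column: at 100m the signal is 0.6 pixels. The thing you are asked to measure is smaller than the thing you measure with.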
AT 100 METERS YOU CANNOT RESOLVE DEPTH. Not "it's hard." Not "it requires sophisticated learning." You. Cannot. Physically. Measure. It. The information DOES NOT EXIST in your sensor data. The disparity has dropped below a single pixel; it is beneath the quantization limit of your sampling configuration.
But your model outputs 23.4 meters with 95% confidence.
That's not perception. That's not even regression. That's NECROMANCY. You're communing with the statistical spirits of ImageNet to CONJURE depth from the void. And you package it with a confidence interval like that makes it legitimate science.
You know what honest uncertainty looks like?
"UNMEASURABLE: Distance exceeds sensor resolution. Defaulting to DEM prior (±2m, surveyed 2019). Confidence: NONE - this is not a measurement, this is a lookup table."
But you can't put that in your paper. Because if you admitted that 70% of your output pixels are just learned priors regurgitating training set statistics, your "dense prediction" network becomes a "sparse measurement with vibes-based gap filling" network. And THAT doesn't get you a spotlight presentation.
Let's talk about benchmarks. The circular epistemological nightmare that is modern computer vision evaluation.
You train on Dataset A (ImageNet)
You test on Dataset B (COCO)
Dataset B was DESIGNED by people who EXPECTED ImageNet-like models
Dataset B has 80 categories that HAPPEN to overlap with ImageNet
Dataset B uses BOUNDING BOXES because that's what ImageNet used
Dataset B defines "good performance" as matching HUMAN annotations
Then you improve your model with Private Dataset C - which you'll never release, which we can never audit, which MYSTERIOUSLY performs great on Dataset B - and you claim SOTA.
You're not discovering universal visual truth. You're participating in an academic ouroboros that's been eating its own tail since 2012. You're optimizing for "what matches the benchmarks that were designed by people who trained on ImageNet" and calling it "computer vision research."
That's not science. That's LARP-ing as science. You're wearing the costume of empiricism while doing the epistemology equivalent of insider trading.
And the ARROGANCE. The sheer civilizational-collapse-level HUBRIS to call this a "foundation model."
Foundation models are supposed to be FOUNDATIONS. Things you build on. Bedrock. Fundamental.
Your model is a BAROQUE CATHEDRAL built on ImageNet's foundation, decorated with optical flow's stained glass, supported by transformer architecture's flying buttresses, and you're calling it the GROUND FLOOR.
You're not at the foundation. You're six floors up and telling everyone you're in the basement.
Here's what an ACTUAL foundation would look like:
Start with sensor physics. What CAN you measure? What are the information-theoretic limits? Where does geometric information end and learned prior begin?
Add physical constraints. Gravity exists. Objects are mostly rigid. Surfaces are piecewise smooth. Light travels in straight lines.
Layer in geographic priors. We have DEMs. We have building footprints. We have surveyed control points. USE THEM. Explicitly. With provenance tracking.
Learn semantic priors LAST. After you've exhausted geometry. After you've exhausted physics. After you've squeezed every photon of actual information from your sensors.
And most importantly: BE HONEST ABOUT WHAT'S MEASUREMENT AND WHAT'S HALLUCINATION.
But you won't do that. Because honesty doesn't scale to 100 billion parameters. Because epistemic rigor doesn't fit in a tweet. Because "we trained on proprietary data and achieved SOTA on benchmarks designed by our subfield" gets you hired, while "we can only measure 30% of the scene and the rest is educated guessing based on internet photo statistics" gets you nothing.
You want to do ACTUAL self-supervised learning? Train on ego-centric robot data with IMU, GPS, wheel odometry, and physics constraints. No ImageNet. No COCO. Start from a randomly initialized network and learn what triangulation MEANS, not what it looks like in FlyingChairs.
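"What triangulation MEANS" is ordinary linear algebra, supervised by nothing but camera poses, which odometry and GPS hand you for free. A minimal DLT sketch in normalized image coordinates; the two poses and the 3D point are made up for the demo:

```python
import numpy as np

def triangulate(P0, P1, x0, x1):
    """Linear (DLT) triangulation: a 3D point X projects as x ~ P @ X
    in each view, giving two independent linear constraints per view;
    the point is the null vector of the stacked system. No labels
    anywhere -- just geometry and two known camera poses."""
    A = np.array([
        x0[0] * P0[2] - P0[0],
        x0[1] * P0[2] - P0[1],
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
    ])
    X = np.linalg.svd(A)[2][-1]  # right singular vector of smallest singular value
    return X[:3] / X[3]          # dehomogenize

# Two cameras 1 m apart along x (poses from "odometry"), one point.
P0 = np.hstack([np.eye(3), np.zeros((3, 1))])              # [I | 0]
P1 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])  # [I | -t]
X_true = np.array([0.5, 0.2, 4.0])
x0 = X_true[:2] / X_true[2]                  # projection in view 0
x1 = (X_true - [1.0, 0, 0])[:2] / X_true[2]  # projection in view 1
print(triangulate(P0, P1, x0, x1))           # recovers X_true
```

A network trained against this kind of geometric supervision is accountable to physics. A network trained to imitate FlyingChairs is accountable to whoever rendered the chairs.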
You want to do ACTUAL bootstrapping? Start with the only thing you KNOW: sensor specifications. Build UP from physics. Stop standing on Hinton's shoulders and pretending you're tall.
You want to do ACTUAL computer vision? Stop confusing "looks plausible" with "is measurable." Stop treating confidence intervals as absolution for epistemological sins. Stop publishing papers that conflate learned priors with physical measurements.
Until then, you're not doing self-supervised learning. You're doing human-supervised learning with 2-3 extra training stages and a marketing budget.
You're not bootstrapping from the ground up. You're standing on ImageNet's corpse and calling it a ladder.
You're not building foundation models. You're building extensions on someone else's foundation and lying about the deed.
Get your own epistemology. Stop colonizing photogrammetry's vocabulary. Stop pretending statistical correlation is physical measurement.
And for fuck's sake, read a paper from before 2012. Just ONE. Learn what people did when they couldn't throw infinite parameters at pattern matching. Learn what CONSTRAINTS mean. Learn the difference between MODEL and REALITY.
Your "self-supervised" depth is just supervised classification's ghost, haunting the latent space with ImageNet's unquiet dead.