RIGHTEOUS GAMBIT


The State Space of Research

Toward a Phenomenology of Perception (the Machine Kind)

By Wesley Ladd • March 24, 2026

Computer Vision · Research Methods · Philosophy of Science

There is a shape to how research actually works — not the idealized version from philosophy-of-science seminars, but the operational version that determines whether your paper gets into CVPR or lands in a desk drawer. Understanding this shape matters, because the process is the epistemology. The way we structure benchmarks, loss functions, and ablation studies doesn't just validate knowledge — it determines what kinds of knowledge are even possible to produce.

What I'm attempting here is something like a phenomenology of scientific research — not a methodology (how you should do it) or a sociology (how institutions shape it), but a description of what the process looks like from the inside, as experienced by the researcher who is navigating it. What are the signals you actually attend to? What makes one result feel significant and another feel hollow? What does it mean to understand something, as opposed to merely achieving a number? Husserl wanted to describe the structures of consciousness; I want to describe the structures of inquiry. The ambition is the same: to make the implicit visible before arguing about whether it's justified.

This post lays out that process as I understand it, viewed through the lens of computer vision, which is the field I work in. The goal is to make the implicit machinery of research legible, especially for people who are entering the field or adjacent to it and trying to understand why certain results matter and others don't.

Step 1: Problem Formulation

Everything starts with defining what "good" means — mathematically. This is the step most people skip when they describe research, and it's the one that determines everything downstream.

In computer vision, problem formulation means carving out a task: object detection, semantic segmentation, monocular depth estimation, image generation, pose estimation. Each of these is a precise mathematical statement about what the model should produce given what it receives. Detection means producing bounding boxes and class labels. Segmentation means producing a per-pixel class assignment. Depth estimation means producing a per-pixel scalar in metric or relative units.

The formulation constrains the solution space. If you define depth estimation as producing relative depth (ordinal rankings between pixels), you've excluded metrological claims about absolute distance. If you define detection as producing axis-aligned bounding boxes, you've excluded oriented objects. These choices feel technical, but they're epistemic — they determine what your field can know.
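The exclusion is mechanical, not rhetorical. A minimal sketch (hypothetical values, using median scaling, a common alignment step when depth is defined as relative): a prediction whose every absolute distance is wrong by a factor of two can still evaluate as perfect, because the formulation aligns scale away before scoring.

```python
import numpy as np

# Hypothetical ground-truth depths (meters) and a prediction that is
# wrong by a constant factor of 2 -- every absolute distance is off by 100%.
gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = gt * 2.0

# Median scaling: when the task is defined as *relative* depth, the
# prediction's median is aligned to the ground truth's before scoring.
scale = np.median(gt) / np.median(pred)
aligned = pred * scale

abs_rel_raw = np.mean(np.abs(pred - gt) / gt)      # 1.0 -> metrologically terrible
abs_rel_aligned = np.mean(np.abs(aligned - gt) / gt)  # 0.0 -> "perfect"
print(abs_rel_raw, abs_rel_aligned)
```

Under the relative formulation, the second number is the only one the benchmark ever sees.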

Step 2: Benchmarks and Datasets

Once you've formulated the problem, you need a way to measure progress. This is where benchmarks come in, and the distinction between a benchmark and a dataset matters more than most people realize.

A benchmark is an evaluation protocol: which metrics to compute, on which data splits, under which constraints. A dataset conforms to a benchmark — it provides the actual images, annotations, and splits. A single benchmark can have multiple datasets (KITTI and NYU Depth V2 both benchmark monocular depth, but with different sensors, scenes, and evaluation protocols). A single dataset can serve multiple benchmarks (COCO supports detection, segmentation, keypoint estimation, and captioning).

The benchmark is the contract. It says: "If you want to claim progress on X, here is how you must measure it." The dataset is the evidence. Good benchmarks have several properties:

  • Agreed-upon metrics — mAP for detection, IoU for segmentation, RMSE/AbsRel for depth. These aren't arbitrary; they encode what the community has decided matters.
  • Standard splits — Train/val/test partitions that everyone uses, so results are comparable.
  • Held-out test sets — Ideally with evaluation servers that prevent overfitting to the test set (though this is honored more in principle than in practice).
  • Diversity — Scene variety, sensor variety, difficulty gradients. A benchmark that only covers well-lit indoor scenes tells you nothing about robustness.
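These metrics are concrete functions, not abstractions. IoU for axis-aligned boxes, for instance, is a few lines (sketch assumes the common `[x1, y1, x2, y2]` corner format):

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes in [x1, y1, x2, y2] format."""
    # Intersection rectangle: max of top-left corners, min of bottom-right corners.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 2x2 boxes overlapping by half their width: intersection 2, union 6.
print(box_iou([0, 0, 2, 2], [1, 0, 3, 2]))  # 0.333...
```

Everything a detection benchmark "knows" about localization quality passes through a function like this one.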

The choice of benchmark shapes the field more than any individual paper. When ImageNet became the benchmark for image classification, it channeled a decade of research toward a specific distribution of 1000 object categories photographed from the internet. When COCO became the benchmark for detection, it defined "good detection" as performing well on 80 common object categories at a specific IoU threshold. These choices aren't wrong, but they are choices, and they have consequences.

Step 3: Baseline Establishment

Before you can claim an improvement, you need a credible control. This means reproducing existing state-of-the-art (SOTA) results under identical conditions — same dataset, same splits, same evaluation protocol, same compute budget where possible.

This step is where a lot of papers quietly fail. Comparing your method against numbers copy-pasted from someone else's paper, trained on a different data split or with a different augmentation pipeline, is not a valid comparison. The strongest papers re-implement baselines or use official codebases with identical training configurations.

Baseline establishment is also where the compute/data confound lives. A persistent problem in CV research is distinguishing whether improvements come from the method or from more compute and data. If your model trains for 3x longer on 2x the data, beating the baseline doesn't mean your architecture is better — it means your budget is bigger. Good papers control for this: same FLOPs, same data, same training schedule.

Step 4: Method Design and Hypothesis

Now you propose something new — an architectural change, a novel loss function, a training procedure, a data augmentation strategy, a post-processing step. The contribution should be motivated by a clear hypothesis about why it should work, not just that it does.

"We added attention and the numbers went up" is not a hypothesis. "Cross-attention between depth and surface-normal features should improve depth estimation in textureless regions because surface normals provide geometric constraints where photometric matching fails" is a hypothesis. The difference matters because the hypothesis is what makes the result generalizable. Without it, you have a recipe, not knowledge.

The best CV papers have a tight loop between problem analysis (where does the current SOTA fail?), hypothesis (why does it fail there?), and method design (what architectural or training change addresses that failure mode?). The weakest papers skip the analysis and go straight to architecture search.

Step 5: Training and Evaluation

You minimize loss during training, then evaluate on held-out test sets using the benchmark's prescribed metrics. The gap between your method and baselines is your claimed contribution.

A note on the language: you minimize loss. Loss is the error signal — it measures how far your model's predictions are from the ground truth. Lower is better. Equivalently, you can say you maximize a performance metric like accuracy, mAP, or IoU. These are two descriptions of the same optimization, but the distinction matters when reading papers, because "maximizing loss" would mean making your model worse on purpose.
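The duality is easy to see in code. A toy sketch (hypothetical two-class predictions, not from any real model): the model with lower cross-entropy loss is also the one with higher accuracy, because the two quantities point in opposite directions by construction.

```python
import math

def cross_entropy(probs, label):
    # Negative log-probability of the true class: lower is better.
    return -math.log(probs[label])

def accuracy(all_probs, labels):
    # Fraction of examples where the argmax class matches: higher is better.
    correct = sum(max(range(len(p)), key=p.__getitem__) == y
                  for p, y in zip(all_probs, labels))
    return correct / len(labels)

# Two hypothetical models scored on the same three examples (true class = 0).
confident = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
hesitant  = [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]
labels = [0, 0, 0]

loss_c = sum(cross_entropy(p, y) for p, y in zip(confident, labels)) / 3
loss_h = sum(cross_entropy(p, y) for p, y in zip(hesitant, labels)) / 3
print(loss_c < loss_h)              # True: lower loss...
print(accuracy(confident, labels))  # 1.0: ...and higher accuracy
```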

The evaluation protocol should match the benchmark exactly. Same crop sizes, same depth cap, same IoU thresholds. Small differences in evaluation protocol can produce surprisingly large differences in reported numbers, which is why benchmark standardization matters.

Step 6: Ablation Studies

This is where rigor lives. Ablation studies systematically remove or swap components of your method to isolate which parts contribute to the improvement. If your method has three novel components (a new backbone, a new loss term, and a new post-processing step), strong ablations test each independently:

  • Full method vs. method without the new backbone
  • Full method vs. method without the new loss term
  • Full method vs. method without the post-processing step
  • Every pairwise combination

This answers the question "is it the attention module or the new loss function?" — which is the question reviewers actually care about. A paper that reports only the full method's performance, without decomposing the contributions, is a paper that hasn't demonstrated understanding of its own results.

Strong ablations are what separate a rigorous paper from a "we threw things at the wall" paper. They are the primary mechanism by which peer reviewers assess whether a claimed contribution is real.
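The combinatorics of a full ablation grid follow directly from the component list. A sketch with hypothetical component names: three binary on/off choices yield eight configurations, which is also eight full training runs — the budget a complete grid demands.

```python
from itertools import product

# Hypothetical novel components of a method (names are illustrative).
components = ["new_backbone", "new_loss", "postproc"]

# Enumerate every on/off configuration: the full method, each
# single-component ablation, every pairwise combination, and the baseline.
configs = []
for flags in product([True, False], repeat=len(components)):
    enabled = [c for c, on in zip(components, flags) if on]
    configs.append(enabled)

print(len(configs))  # 8 training runs for a complete grid
for cfg in configs:
    print(cfg or ["baseline"])
```

In practice, papers often run only the leave-one-out rows; the grid above is what "every pairwise combination" actually costs.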

Step 7: Statistical Significance and Reproducibility

Ablations alone don't establish significance if you've only run the experiment once. The field increasingly expects:

  • Multiple runs with different random seeds — to show that results are stable, not artifacts of initialization.
  • Confidence intervals or standard deviations — to quantify how much variance exists in the results.
  • Reproducibility artifacts — code releases, configuration files, training logs.

This step is historically underemphasized in CV compared to, say, clinical trials or social science. But the replication crisis has reached ML, and reviewers are catching up. A result that falls within the variance of the baseline is not a result.
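The "within the variance" check is simple arithmetic. A sketch with made-up mAP scores (not real results): report mean ± standard deviation per method, and at minimum ask whether the delta exceeds the run-to-run noise — a proper analysis would use a statistical test such as Welch's t-test, which this crude comparison only gestures at.

```python
from statistics import mean, stdev

# Hypothetical mAP scores from five seeds each (illustrative numbers).
baseline = [41.2, 41.5, 40.9, 41.3, 41.1]
method   = [41.6, 41.4, 41.8, 41.5, 41.7]

m_b, s_b = mean(baseline), stdev(baseline)
m_m, s_m = mean(method), stdev(method)
delta = m_m - m_b

# Crude sanity check: does the improvement exceed the combined
# seed-to-seed noise of the two methods?
noise = s_b + s_m
print(f"{m_m:.2f} +/- {s_m:.2f} vs {m_b:.2f} +/- {s_b:.2f}, delta {delta:.2f}")
print(delta > noise)
```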

Step 8: Analysis and Failure Cases

The best papers don't just report numbers — they show you where the model succeeds and where it fails, and they explain why.

  • Qualitative results — Visualizations of predictions vs. ground truth, chosen deliberately to show both strengths and weaknesses rather than only the flattering cases.
  • Error analysis — Breaking down performance by scene type, difficulty, or object category to understand failure modes.
  • Interpretability tools — GradCAM, attention maps, feature visualizations that show what the model learned.
  • Failure case analysis — Honest examination of where the method breaks. This shapes future work and is often the most valuable section of a paper for other researchers.

Step 9: Generalization Testing

Does your improvement on COCO transfer to OpenImages? Does your depth model trained on NYU work on KITTI? Cross-dataset evaluation separates genuine methodological contributions from benchmark overfitting.

A model that achieves SOTA on one benchmark but collapses on related benchmarks has likely learned dataset-specific artifacts, not the underlying visual task. The strongest claims of contribution come with cross-dataset validation, even when the benchmark doesn't require it.

Step 10: Peer Review and Adoption

The final validation is external. Peer review provides the initial filter — typically 3-4 reviewers assess novelty, rigor, significance, and clarity. But peer review is a noisy signal. The stronger signal of significance is what happens after publication:

  • Independent replication — Other groups reproduce your results using your code or their own implementation.
  • Adoption as a baseline — Subsequent papers compare against your method, which means the community considers it a credible reference point.
  • Methodological influence — Your technique gets incorporated into other architectures or applied to other tasks.
  • Citation patterns — A lagging and imperfect proxy, but over time, highly-cited papers tend to be ones that changed how people think about a problem, not just ones that achieved a number.

The Leaderboard Failure Mode

There's a well-known pathology in the field worth naming explicitly: the leaderboard chase. This is when incremental improvements (+0.1 mAP, +0.02 IoU) get published without meaningful ablations, insight, or analysis. The paper's entire contribution is "we are 0.1% better than the previous SOTA," achieved through hyperparameter tuning, larger models, or training tricks that don't generalize.

This is why ablation rigor and hypothesis-driven design are so valued by strong reviewers — they distinguish papers that understand something from papers that merely achieved something. The history of the field is written by the former, not the latter.

Why This Matters

The process I've described isn't bureaucracy. It's the mechanism by which a field builds reliable knowledge. Each step exists because, at some point, its absence led to false claims that wasted years of subsequent work. Benchmarks exist because "it looks good" wasn't rigorous enough. Ablations exist because "our full method works" didn't explain why. Reproducibility requirements exist because too many results evaporated when other groups tried to use them.

Understanding this process — not just following it, but understanding why each step exists — is what separates a researcher from someone who trains models. The state space of research is constrained by these structures, and working within them effectively is how you produce knowledge that survives contact with other people's scrutiny.


The Signals: A Teleological Map of the Research Arc

The arc described above isn't just a sequence of steps. It's a sequence of signals — each one guiding the researcher toward a particular end, each one encoding a particular kind of information about whether you're getting closer to something true or drifting into noise. These signals are the teleological skeleton of research. They're what make the process directional rather than random.

What follows is an attempt to name each signal, trace its role in the arc, and then — because I think it's clarifying in a way that modern methodology rarely is — analyze each through Aristotle's four causes. The four causes (material, formal, efficient, final) aren't a historical curiosity. They're a framework for asking: what is this thing made of, what shape does it take, what produces it, and what is it for? This is, in a sense, the phenomenological method applied to the instruments of research themselves: before we can evaluate whether a signal is reliable, we need to describe what it is — what it's constituted from, what form it takes, what produces it, and what end it serves. The bracketing comes first. The judgment comes after.

Signal 1: The Loss Gradient

The loss gradient is the most fundamental signal in the arc. It's what makes training possible at all — the directional information that tells the optimizer which way to push the weights. Every forward pass produces a prediction, every comparison to ground truth produces an error, and backpropagation converts that error into a gradient that flows through every parameter in the network. The entire training process is gradient descent on a loss surface, and the shape of that surface — its valleys, plateaus, saddle points — determines what the model can learn.

The loss gradient is also the signal most removed from human judgment. No reviewer reads it. No committee evaluates it. It operates in a space of millions of dimensions that no one visualizes or fully understands. And yet it is the mechanism by which the model acquires whatever capability it ends up having. Everything downstream — benchmark scores, ablation deltas, reviewer assessments — is a consequence of what the gradient did during training.

The Four Causes of the Loss Gradient

Material cause (hyle): The gradient is constituted from partial derivatives of the loss function with respect to every learnable parameter in the network. Its substance is numerical — tensors of floating-point values, typically float32 or float16, computed via automatic differentiation through the computational graph. The material also includes the training data itself: every gradient is conditioned on a specific mini-batch of images and annotations.

Formal cause (eidos): The gradient takes the form of a vector in parameter space — a direction and magnitude for each weight. Its structure is determined by the architecture (which parameters exist and how they connect) and the loss function (which errors are penalized and how). An L1 loss produces gradients of constant magnitude that push on every error with equal urgency, regardless of its size. An L2 loss produces gradients proportional to the error, penalizing large errors quadratically. A cross-entropy loss produces gradients shaped by the log of predicted probabilities. The formal cause is the loss function's definition: it determines the geometry of the optimization landscape the gradient navigates.
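The L1/L2 contrast can be written out by hand, since both derivatives are elementary. A sketch over per-pixel depth errors (hypothetical values): the L1 gradient is the sign of the error, so the outlier pulls no harder than anything else; the L2 gradient scales with the error, so the outlier dominates.

```python
import numpy as np

# Per-pixel depth errors (prediction minus ground truth), in meters.
err = np.array([-0.1, 0.05, 2.0])  # one large outlier

# Gradient of the loss with respect to the prediction, per pixel:
grad_l1 = np.sign(err)   # d|e|/de: constant magnitude, every error pushes equally
grad_l2 = 2.0 * err      # d(e^2)/de: grows with the error, the outlier dominates

print(grad_l1)  # [-1.  1.  1.]
print(grad_l2)  # [-0.2  0.1  4. ]
```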

Efficient cause (kinoun): The gradient is produced by the chain rule, applied mechanically through backpropagation. The efficient cause is the algorithm itself — automatic differentiation — triggered by each forward pass through the network. In practice, this means the optimizer (SGD, Adam, AdamW), the learning rate schedule, the batch size, and the hardware (GPU memory constrains batch size, which constrains gradient noise) all shape the gradient that actually gets applied. Two identical architectures with different optimizers traverse different paths through the same loss landscape.

Final cause (telos): The gradient exists to minimize loss — to move the model's parameters toward a configuration that produces predictions closer to ground truth. But "closer to ground truth" is defined entirely by the loss function, and the loss function is a human choice. The final cause of the gradient is whatever the researcher decided to optimize for, which may or may not correspond to what they actually care about. A model that minimizes L1 depth error may produce smooth, plausible depth maps that are metrologically useless. The telos of the gradient is alignment with the loss, not alignment with truth.

Signal 2: The Benchmark Metric

The benchmark metric is the signal that makes research communicable. Where the gradient operates in private (inside a training loop, on one researcher's GPU cluster), the benchmark metric operates in public. It's the number you put in your paper. It's the number reviewers compare against other papers. It's the basis for claims like "state-of-the-art" and "significant improvement."

Metrics are reductions. They take a complex, high-dimensional prediction (a depth map, a set of bounding boxes, a segmentation mask) and compress it into a scalar or small set of scalars. mAP, IoU, RMSE, AbsRel, F1 — each one throws away information in a specific way, and the information it throws away is the information the field has implicitly decided doesn't matter.

The Four Causes of the Benchmark Metric

Material cause (hyle): The metric is constituted from predictions and ground-truth annotations. Its raw material is the output of a trained model evaluated on a held-out set of images with known labels. The quality of the ground truth is therefore a material constraint on the metric's meaning — a depth benchmark with noisy LiDAR ground truth cannot produce metric values more precise than the sensor that generated the annotations.

Formal cause (eidos): The metric takes the form of a mathematical function that maps a set of (prediction, ground-truth) pairs to a scalar. Its form determines what counts as "better." mAP rewards precision and recall jointly, averaged over a range of IoU overlap thresholds. RMSE penalizes large errors quadratically, making it sensitive to outliers. AbsRel normalizes by ground-truth depth, making nearby errors matter more than distant ones. The formal cause is the equation — and the equation encodes a value judgment about which errors are tolerable and which are not.
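That the equation encodes a value judgment is demonstrable: two metrics over the same predictions can rank two models in opposite orders. A constructed sketch (hypothetical depths): model A has moderate error everywhere, model B is accurate up close and sloppy far away. RMSE prefers A; AbsRel prefers B.

```python
import numpy as np

def rmse(pred, gt):
    # Root-mean-square error: quadratic penalty, outlier-sensitive.
    return np.sqrt(np.mean((pred - gt) ** 2))

def abs_rel(pred, gt):
    # Absolute relative error: normalized by depth, near errors weigh more.
    return np.mean(np.abs(pred - gt) / gt)

gt = np.array([1.0, 10.0])           # one near pixel, one far pixel
model_a = gt + np.array([0.5, 0.5])  # moderate error everywhere
model_b = gt + np.array([0.1, 1.0])  # accurate up close, sloppy far away

print(rmse(model_a, gt) < rmse(model_b, gt))        # True: RMSE ranks A better
print(abs_rel(model_b, gt) < abs_rel(model_a, gt))  # True: AbsRel ranks B better
```

Neither metric is wrong; they answer different questions about which errors matter.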

Efficient cause (kinoun): The metric is produced by running inference on a test set and computing the prescribed function. In practice, evaluation scripts, data loaders, image preprocessing (resize, crop, normalization), and even numerical precision can affect the result. The efficient cause includes the entire evaluation pipeline, which is why two implementations of "the same metric" can produce different numbers.

Final cause (telos): The metric exists to enable comparison — to answer "is method A better than method B?" But "better" is defined by the metric's form, not by the task's actual requirements. A depth model with lower RMSE is "better" by that metric, but may be worse for a downstream task that requires accurate edges rather than globally smooth predictions. The telos of the metric is commensurability across papers, which is valuable, but it can diverge from the telos of the research program, which is understanding.

Signal 3: The Ablation Delta

The ablation delta is the difference in performance between your full method and a version with one component removed. It's the signal that answers causality questions: did this specific change cause the improvement? Without it, a paper can only claim correlation between the method and the result. With it, you can attribute specific performance gains to specific architectural or training decisions.

The ablation delta is arguably the most epistemically important signal in the arc, because it's the closest thing to a controlled experiment that ML research produces. It holds everything constant except one variable and measures the effect.

The Four Causes of the Ablation Delta

Material cause (hyle): The ablation delta is constituted from two metric evaluations — the full method's score and the ablated variant's score — and the arithmetic difference between them. Its material is therefore parasitic on the benchmark metric: whatever the metric measures, the ablation delta measures the change in that measurement attributable to one component.

Formal cause (eidos): The delta takes the form of a signed scalar (or a vector of signed scalars across multiple metrics). Its structure is a comparison — a subtraction. The form implies a counterfactual: "what would have happened if this component were absent?" The formal cause also includes the experimental design: which component was removed, what replaced it (nothing? a simpler alternative? a random initialization?), and whether the ablated model was retrained from scratch or fine-tuned.

Efficient cause (kinoun): The ablation delta is produced by training and evaluating multiple model variants under controlled conditions. This is expensive — each ablation requires a full training run — which is why many papers cut corners here. The efficient cause is computational budget, and insufficient budget produces incomplete ablations that leave causal questions unanswered.

Final cause (telos): The ablation delta exists to establish causal attribution — to answer "which parts of your method matter?" Its telos is mechanistic understanding. A paper with strong ablations doesn't just show that the method works; it shows why it works, which components are essential, and which are incidental. This is what makes a result transferable: if you know why something works, you can apply the insight to other problems. If you only know that it works, you can only copy the recipe.

Signal 4: The Replication Result

The replication result is the signal that separates science from anecdote. It's what happens when someone other than the original authors runs the method — using the released code, or their own reimplementation — and reports whether they get the same numbers. Replication is the immune system of research: it catches errors, overfitting, unreported hyperparameter tuning, and outright fraud.

In CV, replication is complicated by the sensitivity of deep learning to training details. Random seeds, data augmentation order, learning rate warmup schedules, and even GPU hardware can produce different results. This doesn't mean replication is impossible — it means that a "replication" needs to be understood as "results within a reasonable variance of the original," not "identical numbers."

The Four Causes of the Replication Result

Material cause (hyle): The replication result is constituted from an independent training run (or set of runs) using the same method, data, and evaluation protocol as the original paper. Its material includes the code (released or reimplemented), the dataset, the hardware, and the researcher's time. Unlike the original result, the material also includes the original paper itself — the replicator's understanding of the method is mediated by the paper's description, which may be incomplete.

Formal cause (eidos): The replication takes the form of a comparison: original reported metrics vs. independently obtained metrics. The form includes a judgment of "close enough" — how much deviation is acceptable before a replication is considered a failure? There is no universal standard for this. Some fields use statistical tests; CV mostly relies on informal norms (within a few percent is fine; off by 10% suggests a problem).
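The informal norm can be sketched as a function, with the caveat that the thresholds below are illustrative placeholders, not a community standard — no such standard exists, which is exactly the point.

```python
def replication_verdict(reported, reproduced, soft=0.03, hard=0.10):
    """Informal sketch of the CV norm: judge a replication by its
    relative deviation from the reported number. Thresholds are
    illustrative assumptions, not an agreed-upon standard."""
    dev = abs(reproduced - reported) / abs(reported)
    if dev <= soft:
        return "replicates"
    if dev <= hard:
        return "marginal"
    return "suspect"

print(replication_verdict(41.6, 41.1))  # within a few percent
print(replication_verdict(41.6, 36.5))  # off by more than 10 percent
```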

Efficient cause (kinoun): The replication is produced by an independent researcher or group investing time and compute to reproduce the result. The efficient cause is therefore partly social: replications happen when someone needs the baseline (to compare against in their own paper) or doubts the result (to verify before building on it). Replications rarely happen for their own sake — there is little career incentive to publish "we got the same number they did."

Final cause (telos): The replication exists to establish reliability — to answer "can anyone get this result, or just the authors?" Its telos is trust. A replicated result can be built upon. An unreplicated result is provisional. The asymmetry between the cost of producing a result (one group, one paper) and the cost of replicating it (another group, usually unpublished) is one of the structural weaknesses of the field.

Signal 5: The Peer Review Decision

The peer review decision is the gatekeeping signal — the binary (accept/reject) or graded (scores, meta-reviews) judgment that determines whether a result enters the published literature. It is the most overtly social signal in the arc, and the one most contaminated by factors orthogonal to truth: reviewer expertise, reviewer mood, conference quotas, author reputation, and the stochastic matching of papers to reviewers.

And yet it is indispensable. Without peer review, there is no filter between "I trained a model and it got good numbers" and "the field considers this a contribution." The filter is noisy, biased, and slow, but the alternative — no filter — is worse.

The Four Causes of the Peer Review Decision

Material cause (hyle): The review is constituted from the submitted paper (text, figures, tables, supplementary material), the reviewer's expertise and biases, the review form (which asks specific questions: novelty, clarity, significance, experimental rigor), and the meta-reviewer's synthesis. The material is heterogeneous — part artifact, part human judgment, part institutional process.

Formal cause (eidos): The decision takes the form of scores (typically 1-10 on multiple axes), free-text feedback, and a final accept/reject recommendation. The formal structure varies by venue: some use single-blind review (reviewer knows author), some use double-blind (neither knows the other), some use open review (everything is public). The form shapes the content — double-blind review reduces reputation bias but increases the incentive to signal identity through self-citation patterns.

Efficient cause (kinoun): The review is produced by volunteer reviewers — typically other researchers in the field — working under time pressure with variable motivation. The efficient cause is the review process itself: area chairs assign papers, reviewers read and score them, meta-reviewers adjudicate disagreements. The quality of the efficient cause is highly variable. A careful reviewer who spends eight hours on a paper produces a different signal than a distracted reviewer who skims it in forty minutes.

Final cause (telos): The review exists to filter — to separate contributions that advance the field from submissions that don't. But "advance the field" is defined by the reviewers' understanding of the field, which is a function of the current paradigm. The telos of peer review is quality control, but the mechanism through which it operates is consensus — and consensus can suppress genuinely novel work that doesn't fit existing frameworks. The most important papers in the history of CV (AlexNet, ResNet, the original transformer) all had contentious reviews.

Signal 6: The Citation

The citation is the long-tail signal — the one that operates on the timescale of years rather than months. It's what happens after publication: how many subsequent papers reference yours, how they reference it (as a baseline to beat? as a methodological foundation? as a historical curiosity?), and whether your ideas get absorbed into the field's default toolkit.

Citations are the closest thing research has to a market signal. They aggregate thousands of independent decisions by other researchers about what's worth building on. Like market signals, they're informative but distorted — by Matthew effects (famous papers get cited because they're famous), by recency bias, by the structure of reference lists (you cite what you've read, and you read what's already well-cited).

The Four Causes of the Citation

Material cause (hyle): The citation is constituted from a reference in another paper's bibliography and (usually) an in-text mention explaining the relationship. Its material is textual — a bibliographic entry — but its significance is relational: it encodes a judgment that the cited work is relevant to the citing work.

Formal cause (eidos): The citation takes the form of a directed edge in a graph — from citing paper to cited paper. Aggregated, citations form a network whose structure reveals influence patterns, intellectual lineages, and community boundaries. The formal cause also includes the context: a citation in the related-work section ("Smith et al. also studied X") carries different weight than a citation in the method section ("We build on the architecture proposed by Smith et al.").
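The graph form makes the aggregate signal trivial to compute: a citation count is just the in-degree of a node. A toy sketch with invented paper names:

```python
# A toy citation graph: directed edges from citing paper to cited paper.
edges = [
    ("paper_C", "paper_A"),  # C cites A
    ("paper_D", "paper_A"),
    ("paper_D", "paper_B"),
    ("paper_E", "paper_A"),
]

# Citation count is in-degree: how many edges point at each paper.
in_degree = {}
for citing, cited in edges:
    in_degree[cited] = in_degree.get(cited, 0) + 1

print(in_degree)  # {'paper_A': 3, 'paper_B': 1}
```

Note what the in-degree discards: every distinction between a related-work mention and a method-section foundation collapses into the same integer.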

Efficient cause (kinoun): The citation is produced by the citing author's decision to reference the work — a decision influenced by relevance, visibility, reviewer expectations (reviewers often demand citations to specific prior work), and social dynamics (citing your advisor's papers, citing papers from the group you want to collaborate with). The efficient cause is partly epistemic and partly sociological, which is why citation counts are a noisy proxy for impact.

Final cause (telos): The citation exists to establish intellectual lineage — to answer "where does this work come from, and what does it build on?" Its telos is continuity: connecting new results to the existing body of knowledge. But citations also serve a signaling function (demonstrating that the author knows the literature), a political function (acknowledging collaborators and gatekeepers), and an evaluative function (citation counts are used in hiring, tenure, and funding decisions). The telos is therefore plural and often conflicting.


The Crack in the Foundations: Failure Modes, Falsification, and the Ill-Posed Nature of Research in an AI Age

Everything described above — the signals, their causes, the arc they trace — rests on an assumption so foundational that it rarely gets stated: that the process, followed carefully, converges on truth. That benchmarks measure something real. That ablations isolate something causal. That peer review filters for something valuable. That the teleology is justified — that the signals are actually pointing somewhere worth going.

The uncomfortable question is: what if they aren't?

This isn't nihilism. It's the question that philosophy of science has been wrestling with for a century, and the answer is less reassuring than working scientists tend to assume. Let me trace the argument through two thinkers who defined the terms, and then explain why the AI age makes the problem worse, not better.

Popper: The Falsification Problem

Karl Popper's contribution was to point out that science doesn't prove theories true — it proves them false. A theory is scientific if and only if it makes predictions that could, in principle, be shown to be wrong. The value of an experiment is not that it confirms a hypothesis, but that it could have refuted it and didn't. Knowledge accumulates not by verification but by the survival of theories that have withstood attempts at falsification.

Applied to the CV research arc, Popper's framework asks: what would falsify your contribution? And here the signals start to crack.

A benchmark result doesn't falsify the baseline — it just shows that your method scores higher on this metric, on this dataset, under these conditions. The baseline isn't "wrong"; it's just lower. An ablation delta doesn't falsify the removed component — it shows the component contributes to this metric, but says nothing about whether the component captures a real property of the visual world or an artifact of the dataset. A successful replication doesn't falsify the alternative hypothesis that the method is overfit to a narrow distribution — it just shows the overfitting is reproducible.

The problem is that the entire arc is structured around optimization, not falsification. We're not trying to prove our method wrong. We're trying to show it's better than the last method. This is a fundamentally different epistemic activity, and Popper would say it's not science — it's engineering. The distinction matters because engineering is about what works, and science is about what's true, and those can diverge for a very long time before anyone notices.

Consider: the entire field of monocular depth estimation spent a decade optimizing for scale-invariant metrics that explicitly throw away absolute scale information. The methods got better and better at predicting relative depth. The benchmarks showed consistent progress. The ablations were sound. The papers were well-reviewed. And none of it produced a measurement — because the problem formulation (Step 1) had defined the task in a way that excluded measurement by construction. No individual signal in the arc could detect this failure, because every signal was downstream of the formulation. The teleology was internally consistent and externally vacuous.
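
The scale-invariance point is concrete enough to show in a few lines. Here is a minimal sketch, assuming NumPy, of an Eigen-et-al.-style scale-invariant log loss (with the scale penalty weight set to 1): because the loss is just the variance of the per-pixel log-difference, multiplying every predicted depth by any constant leaves it unchanged.

```python
import numpy as np

def scale_invariant_loss(pred, gt):
    # Scale-invariant log loss (lambda = 1): the variance of the
    # per-pixel log-difference. A global scale factor on `pred`
    # shifts every d_i by the same constant, and variance is
    # shift-invariant, so the factor drops out entirely.
    d = np.log(pred) - np.log(gt)
    return np.mean(d ** 2) - np.mean(d) ** 2

rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=1000)        # "true" metric depths
pred = gt * rng.normal(1.0, 0.05, size=1000)  # noisy relative prediction

print(scale_invariant_loss(pred, gt))        # small positive number
print(scale_invariant_loss(2.0 * pred, gt))  # identical value
```

The two printed values agree to floating-point precision: a model that is wrong about absolute depth by a factor of two everywhere scores exactly as well as one that isn't, which is the "excluded by construction" failure in miniature.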

Kuhn: The Paradigm Problem

Thomas Kuhn's contribution was to point out that science doesn't progress continuously — it alternates between periods of "normal science" (puzzle-solving within an accepted framework) and "revolutionary science" (framework replacement). During normal science, the paradigm defines what counts as a problem, what counts as a solution, and what counts as evidence. Anomalies that don't fit the paradigm are ignored, explained away, or treated as errors — until enough of them accumulate that the paradigm collapses and is replaced.

The CV research arc is textbook normal science. The paradigm is: deep learning works, benchmarks measure progress, architectural innovation drives improvement. Within this paradigm, the signals function perfectly. You train, you evaluate, you ablate, you publish. The leaderboard goes up. The field advances.

But Kuhn's warning is that normal science is blind to its own assumptions. The benchmark defines what progress looks like, so you can't use the benchmark to question whether the definition of progress is correct. The peer review process is staffed by practitioners of the current paradigm, so it systematically filters out work that challenges the paradigm's foundations. The citation network reinforces whatever the field is already doing, because you can only cite work that exists within the current framework.

The leaderboard failure mode I described earlier is a symptom of Kuhnian normal science running out of genuine problems: when the paradigm has been exploited to the point where further progress comes from ever-larger models and longer training schedules rather than from new ideas, the signals still work — the numbers still go up — but knowledge production has stopped. The field is spinning its wheels inside a local optimum of the paradigm, and the signals can't tell you that, because the signals are defined by the paradigm.

The AI Age: When the Signals Lose Their Referent

Now add the complication that makes all of this urgent rather than merely philosophical: AI systems are now capable of executing the research arc autonomously or semi-autonomously. LLMs can write papers. Automated ML systems can search architectures, tune hyperparameters, run ablations, and produce results that pass peer review. The signals I've described — gradients, metrics, ablation deltas, review decisions, citations — can all be generated, optimized, and gamed by systems that have no understanding of what the signals mean.

This is the ill-posed problem at the heart of research in the AI age: the signals that guide research are all proxies, and AI systems are better at optimizing proxies than they are at understanding what the proxies point at.

When a human researcher optimizes a benchmark metric, they (usually) understand that the metric is a proxy for a visual capability, and they (usually) have intuitions about when the proxy breaks down. When an automated system optimizes the same metric, it has no such understanding. It will exploit evaluation bugs, overfit to test-set statistics, find adversarial inputs that inflate scores, and produce methods that achieve state-of-the-art numbers without learning anything about the visual world.

This isn't a failure of the AI system. It's a failure of the signal. The benchmark metric was designed to be a sufficient guide for human researchers operating in good faith within a shared paradigm. It was never designed to be robust to optimization by systems with superhuman capability and zero understanding.

The same applies to every signal in the arc:

  • Loss gradients can be optimized by systems that don't understand what the loss measures, producing models that minimize error on the training distribution without learning generalizable visual features.
  • Ablation deltas can be manufactured by systems that search over component combinations without hypotheses, producing "causal" attributions that are actually artifacts of the search process.
  • Peer review can be gamed by systems that learn to produce papers matching the stylistic and structural expectations of reviewers, without the underlying work being novel or significant.
  • Citations can be inflated by systems that produce large volumes of incremental work citing each other, creating the appearance of impact without the reality.

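The ablation point in particular is easy to demonstrate. A toy sketch, assuming NumPy (the component names are made up): in a null world where no component has any effect and every evaluation returns pure noise, an exhaustive search over component subsets that reports only the best-looking configuration still produces a clean positive "ablation delta".

```python
import itertools
import numpy as np

rng = np.random.default_rng(42)

def evaluate(combo):
    # Null world: `combo` is ignored because no component has any
    # real effect. Every "benchmark run" is a fixed baseline score
    # plus evaluation noise.
    return 50.0 + rng.normal(0.0, 0.3)

# Hypothetical components, for illustration only.
components = ["aux_loss", "attn_block", "mixup", "ema", "cosine_lr"]
baseline = evaluate(())

# Search all 2^5 - 1 non-empty subsets and keep the best-looking one,
# as an automated ablation search would.
best_combo, best_score = None, float("-inf")
for r in range(1, len(components) + 1):
    for combo in itertools.combinations(components, r):
        score = evaluate(combo)
        if score > best_score:
            best_combo, best_score = combo, score

# The reported "delta" (best_score - baseline) is positive with
# near-certainty, even though every component is inert: it is the
# maximum of 31 noise draws measured against a single draw.
print(f"baseline {baseline:.2f}, best {best_score:.2f} with {best_combo}")
```

Scale the search from 31 subsets to 10,000, and the selected delta only grows; a reviewer who sees only the final ablation table has no way to distinguish it from a real effect.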
Each signal, when examined through its four causes, reveals the same vulnerability: the final cause — the telos, the thing the signal is for — is not encoded in the signal itself. The signal is a proxy, and the relationship between the proxy and its purpose is maintained by human judgment, shared norms, and good faith. Remove any of those, and the signal becomes a target rather than a guide.

Goodhart's Law, All the Way Down

This is Goodhart's Law applied to the entire epistemology of a field: when a measure becomes a target, it ceases to be a good measure. What's new in the AI age is not that the signals can be gamed — they always could — but that the speed and scale of gaming has increased by orders of magnitude, while the capacity for human oversight has not.
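
The regressional form of Goodhart's Law can be made quantitative with a toy model (assuming NumPy; nothing here is specific to vision): if a proxy metric is true quality plus an equally variable exploitable-artifact term, then selecting hard on the proxy systematically buys less quality than the proxy reports — at the selected optimum, roughly half the apparent gain is artifact.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: each candidate method has a true quality and an equally
# variable exploitable-artifact component; the proxy sees their sum.
n = 100_000
quality = rng.normal(0.0, 1.0, size=n)
artifact = rng.normal(0.0, 1.0, size=n)
proxy = quality + artifact

# Select the top 1% by proxy -- heavy optimization pressure.
top = proxy > np.quantile(proxy, 0.99)

print(f"mean proxy of selected:   {proxy[top].mean():.2f}")
print(f"mean quality of selected: {quality[top].mean():.2f}")
# With equal variances, E[quality | proxy] = proxy / 2, so the
# selected set's true quality is about half its proxy score:
# the other half is pure artifact.
```

In this symmetric toy the split stays near fifty-fifty however hard you select; with a heavier-tailed artifact term, the extremes of the proxy tend to be dominated by artifact, which is closer to what automated search against a fixed benchmark produces.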

The research arc I described in the first half of this post is a product of centuries of epistemic engineering. Each signal exists because a previous failure mode was identified and a corrective was designed. Benchmarks corrected for subjective evaluation. Ablations corrected for confounded contributions. Replication corrected for irreproducible results. Peer review corrected for unchecked claims.

But each corrective was designed for a world where the primary failure mode was human error — sloppiness, bias, self-deception, occasional fraud. The failure mode of the AI age is different: it's systematic proxy optimization at superhuman speed. The correctives aren't calibrated for this. A peer reviewer who spends eight hours on a paper cannot detect that the ablations were generated by an automated system that searched over 10,000 component combinations and reported only the ones that told a clean story. An evaluation server cannot detect that a model's high benchmark score reflects exploitation of annotation artifacts rather than visual understanding.

The Challenge

There is, I think, little reason to favor these tools and modes of thinking in themselves. They have no intrinsic epistemic authority. Benchmarks are conventions. Metrics are value judgments. Ablations are controlled experiments only to the extent that the controls are actually controlled. Peer review is expert opinion aggregated under time pressure. Citations are popularity weighted by network effects.

What made them work — what gave them their epistemic force — was the social and intellectual context in which they operated: small communities of researchers who knew each other, who shared tacit knowledge about what mattered and what didn't, who could distinguish a genuine insight from a leaderboard trick because they had spent years developing intuition about the problem. The signals were never self-sufficient. They were scaffolding for human judgment.

The question for the field — not just CV, but all of empirical science — is what happens when the scaffolding has to bear weight it was never designed for. When the volume of papers exceeds the capacity of human review. When the complexity of methods exceeds the capacity of human understanding. When the optimization pressure on every signal exceeds the robustness of the signal's design.

I don't have an answer. I'm not sure anyone does yet. But I think the first step is to stop treating the research arc as a neutral methodology and start treating it as what it is: a historically contingent set of social and mathematical conventions that worked reasonably well under specific conditions, and whose continued adequacy under radically different conditions is an open — and genuinely important — question.

Popper told us that science progresses by falsification. Kuhn told us that falsification only works within a paradigm, and paradigms change for reasons that aren't fully rational. The AI age is telling us something neither of them anticipated: that the entire apparatus of signal, benchmark, and review can be operated — fluently, at scale — by systems that have no access to the thing the apparatus was built to find.

What that means for how we should do research is, I think, the most important methodological question of the next decade. And it's one that no benchmark can answer.

That's why I started with phenomenology rather than prescription. Before we can fix the apparatus, we have to see it clearly — not as a neutral methodology, but as a lived structure of attention, judgment, and trust that was built by and for human minds. A phenomenology of scientific research doesn't tell you what to do next. It tells you what you're actually doing now, which turns out to be the harder and more necessary question.

© 2026 Wesley Ladd. All rights reserved.

Last updated: 3/24/2026