Why More Robot Data Isn't Making Better Robots

Teams keep scaling the hours they log, yet a policy that aced the lab still falls over in a new room. The bottleneck in Physical AI was never volume — it's coverage.

There is a quiet assumption baked into most robot-learning roadmaps: that the road to a more capable policy runs through more data. More hours. More episodes. More teleoperation. Scale the collection, the thinking goes, and generalization will follow.

It mostly doesn't. Teams routinely log tens of thousands of hours and still watch a policy that aced the lab face-plant the moment the floor is a different color or the bin sits two inches to the left. The constraint was never the size of the dataset. It is what the dataset actually covers.

The difference between seeing more and seeing differently

In 2018, a research team led by John Zech tested a deep-learning model that read chest X-rays for signs of pneumonia. On scans from the hospital it had trained on, it was excellent. Pointed at a different hospital, its accuracy slipped, sometimes sharply.

When the team looked closer, they found the model hadn't really learned to read lungs. It had learned to read hospitals. Different sites use different scanners, position patients differently, and bake department-specific markers into the image. Because pneumonia was more common at some sites than others, the cheapest way to score well on the training set was to recognize which hospital a scan came from and bet accordingly. The model was a brilliant hospital-detector wearing a pneumonia-detector's badge.

This is not a one-off. The textbook version is a classifier that separated huskies from wolves with suspicious confidence — until researchers showed it was keying on the snow in the background, because nearly every wolf photo happened to be snowy. Same failure, different costume: the model grabbed whatever happened to correlate with the label, not the thing anyone actually cared about.
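
To see how little it takes for a shortcut to win, here is a toy sketch in Python. Everything in it is invented for illustration: the `snow` and `shape` features, the row counts, and the one-feature "stump" learner are assumptions, not the setup of either study above. In the training rows the background tracks the label perfectly while the real signal is right only 90% of the time, so a learner that simply maximizes training accuracy picks the background.

```python
def make_rows(snow_tracks_label):
    """Build 100 toy rows. label: 1 = wolf, 0 = husky (hypothetical data)."""
    rows = []
    for i in range(100):
        label = i % 2
        # "shape" is the real signal, but it is wrong on 10 of the 100 rows.
        shape = 1 - label if i % 10 == 0 else label
        # "snow" either copies the label or follows a pattern unrelated to it.
        snow = label if snow_tracks_label else (i // 2) % 2
        rows.append({"label": label, "shape": shape, "snow": snow})
    return rows


def stump_accuracy(rows, feature):
    """Accuracy of the rule 'predict the label to be whatever this feature is'."""
    return sum(r[feature] == r["label"] for r in rows) / len(rows)


def fit_stump(rows, features):
    """Pick the single feature that best matches the label on the given rows."""
    return max(features, key=lambda f: stump_accuracy(rows, f))


train = make_rows(snow_tracks_label=True)    # every wolf photo is snowy
deploy = make_rows(snow_tracks_label=False)  # snow no longer tracks the label

chosen = fit_stump(train, ["shape", "snow"])
print(chosen)                           # snow
print(stump_accuracy(train, chosen))    # 1.0
print(stump_accuracy(deploy, chosen))   # 0.5
print(stump_accuracy(deploy, "shape"))  # 0.9
```

The shortcut scores perfectly where it was trained and drops to coin-flip accuracy once the background stops cooperating, while the feature that was slightly worse in training would have held up.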

Models learn the conditions, not the task

A model only knows the slice of the world its training data occupies: the region where examples actually exist. Statisticians call it the support of the distribution. Inside that region, models interpolate beautifully. Step outside it and they extrapolate, which is a polite word for guessing.

The trouble is that the world at deployment rarely matches the world at collection. Lighting drifts, backgrounds change, the operator is new, the object is a slightly different shade of red. This gap, where the distribution of inputs a model meets in the wild pulls away from the one it trained on even though the task itself is unchanged, is covariate shift, and it is exactly where confident systems quietly break. A policy that only ever saw one warehouse has no way to know which features were essential (the geometry of the grasp) and which were incidental (the reflection off that particular floor). So it leans on all of them.

[Figure: A dense training region with several data points falling outside its boundary, illustrating covariate shift.] Inside its training region a model interpolates cleanly; the points that fall outside (new lighting, new rooms, new operators) are where it has to guess.
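
A crude way to make "outside the support" concrete is to ask how far a new input sits from its nearest training example, compared with how far apart training examples sit from one another. The sketch below is a minimal illustration under loud assumptions: the two-dimensional condition features, the `slack` factor, and the helper names are all hypothetical, and real systems would work on learned embeddings rather than two hand-picked numbers.

```python
import math


def nearest_distance(point, training_points):
    """Euclidean distance from `point` to its closest training example."""
    return min(math.dist(point, t) for t in training_points)


def support_radius(training_points):
    """Largest nearest-neighbour distance among the training points themselves.

    A rough stand-in for how far apart examples sit inside the support.
    """
    radius = 0.0
    for i, p in enumerate(training_points):
        others = training_points[:i] + training_points[i + 1:]
        radius = max(radius, nearest_distance(p, others))
    return radius


def is_outside_support(point, training_points, slack=1.5):
    """Flag a point that is much farther from the data than the data is from itself."""
    limit = slack * support_radius(training_points)
    return nearest_distance(point, training_points) > limit


# Hypothetical condition features: (lighting level, floor brightness), both in [0, 1].
lab_episodes = [(0.50, 0.50), (0.52, 0.48), (0.49, 0.53), (0.51, 0.51), (0.47, 0.50)]

familiar = (0.50, 0.49)  # looks like the lab
new_room = (0.90, 0.10)  # brighter lights, darker floor

print(is_outside_support(familiar, lab_episodes))  # False
print(is_outside_support(new_room, lab_episodes))  # True
```

A check like this cannot say what the policy will do in the new room. It only says the policy has no nearby evidence to lean on.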

Low-entropy data builds confidently brittle robots

This reframes what a “big” dataset even means. Ten thousand episodes from a single cell — same lighting rig, same bin, the same handful of operators moving the same way — carry far less information than the raw count suggests. The variation that teaches a model what to ignore simply is not there. Call it low-entropy data: high volume, low variety.

Five hundred episodes deliberately spread across lighting, surfaces, operators, object instances, and failure modes can be worth more than ten thousand from one cell. The diversity forces the model to find the pattern that holds across all of them — the invariant — instead of memorizing the conditions. Low-entropy data does something more dangerous than fail outright: it produces models that are confidently brittle. High scores, high certainty, wrong assumptions. The metrics look great right up until the day of deployment.

[Figure: A tight dense cluster of identical points beside a sparse but evenly spread set of points.] Ten thousand near-identical episodes (left) can carry less information than a few hundred that deliberately span the space (right).
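
One rough way to put a number on this is the Shannon entropy of the condition tags attached to each episode. The sketch below assumes every episode carries a (lighting, surface, operator) tag; those tag names and the episode counts are hypothetical, and entropy over metadata is only a proxy for variety in the sensor data itself.

```python
import math
from collections import Counter
from itertools import product


def condition_entropy_bits(conditions):
    """Shannon entropy, in bits, of the empirical distribution of condition tags."""
    counts = Counter(conditions)
    total = len(conditions)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Ten thousand episodes from one cell: same rig, same surface, same operator.
single_cell = [("rig_A", "steel_table", "op_1")] * 10_000

# Five hundred episodes spread evenly over 5 x 5 x 4 = 100 condition combinations.
lighting = ["rig_A", "window", "dim", "overhead", "mixed"]
surfaces = ["steel_table", "wood", "conveyor", "carpet", "tile"]
operators = ["op_1", "op_2", "op_3", "op_4"]
spread = list(product(lighting, surfaces, operators)) * 5

print(len(single_cell), condition_entropy_bits(single_cell))    # 10000 0.0
print(len(spread), round(condition_entropy_bits(spread), 2))    # 500 6.64
```

By this measure the larger dataset says nothing about which conditions matter, because it only ever shows one. The smaller one spans a hundred combinations.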

The fix is representative, not bigger

If variety is the bottleneck, then collection has to be designed for variety, not just throughput. A few principles we build around:

Sample for coverage, not count. Decide up front which conditions matter — lighting, scene, operator, object, time of day — and deliberately span them, instead of hoping diversity falls out of sheer volume.
Borrow from domain randomization. Sim-to-real research has shown that aggressively varying simulation (textures, lighting, physics) can yield policies that transfer to reality. The same logic applies to real-world capture: vary on purpose.
Keep the failures. A grasp that slips and recovers can carry more signal than a clean success, because it shows the policy what recovery looks like. Yet many pipelines delete exactly the episodes a policy most needs to learn from.
Measure the distribution. Without metadata on every episode — where, who, under what conditions — you cannot tell a dataset from a pile of files. You cannot see which regions are oversampled, which are missing, or where a failure originated. Throughput-optimized pipelines drift toward sameness by default; only measurement catches it.
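
Sampling for coverage can be sketched as a budgeting step: enumerate the condition combinations you care about, then split a fixed episode budget across them before anyone starts recording. The axes, values, and budget below are placeholders, and an even split is only one simple choice; a real plan might weight cells by how often they occur at deployment.

```python
from itertools import product


def plan_collection(axes, budget):
    """Split `budget` episodes as evenly as possible across all condition cells.

    axes: dict mapping an axis name to the list of values to span.
    Returns a dict mapping each cell (a tuple of values) to its episode quota.
    """
    cells = list(product(*axes.values()))
    base, extra = divmod(budget, len(cells))
    # The first `extra` cells absorb the remainder, one episode each.
    return {cell: base + (1 if i < extra else 0) for i, cell in enumerate(cells)}


# Hypothetical axes for one pick-and-place task.
axes = {
    "lighting": ["dim", "overhead", "window"],
    "surface": ["steel", "wood"],
    "operator": ["op_1", "op_2"],
}

plan = plan_collection(axes, budget=500)
print(len(plan))                                 # 12
print(min(plan.values()), max(plan.values()))    # 41 42
print(sum(plan.values()))                        # 500
```

The point of writing the plan down is that throughput pressure then has something to be checked against.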

[Figure: A grid with points spread to cover many cells across two axes, with some cells left empty.] Designing collection for coverage: sampling deliberately across conditions, and being able to see which regions are still empty.
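
Measuring the distribution can start as a plain tally over episode metadata. The sketch below assumes each episode is logged with the conditions it was recorded under; the field names, values, and the `coverage_report` helper are hypothetical, not a description of any particular pipeline.

```python
from collections import Counter
from itertools import product


def coverage_report(episodes, axes):
    """Tally episodes per condition cell and list the planned cells with none.

    episodes: list of dicts holding one metadata value per axis name.
    axes: dict mapping an axis name to the list of values the plan should span.
    """
    names = list(axes)
    counts = Counter(tuple(ep[name] for name in names) for ep in episodes)
    planned = list(product(*(axes[name] for name in names)))
    empty = [cell for cell in planned if counts[cell] == 0]
    return {
        "covered_fraction": (len(planned) - len(empty)) / len(planned),
        "empty_cells": empty,
        "most_sampled": counts.most_common(1)[0],
    }


grid = {"lighting": ["dim", "bright"], "operator": ["op_1", "op_2"]}

logged = (
    [{"lighting": "dim", "operator": "op_1"}] * 6
    + [{"lighting": "bright", "operator": "op_1"}]
    + [{"lighting": "dim", "operator": "op_2"}]
)

report = coverage_report(logged, grid)
print(report["covered_fraction"])  # 0.75
print(report["empty_cells"])       # [('bright', 'op_2')]
print(report["most_sampled"])      # (('dim', 'op_1'), 6)
```

Eight episodes look like progress until the tally shows six of them are the same cell and one planned cell has never been recorded.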

What this means for Physical AI

Robotics is leaving the era where the headline number was hours collected. The number that matters next is how much of the deployment world those hours actually represent. A humanoid or vision-language-action (VLA) policy does not need more of the same loop. It needs range: many operators, many environments, many ways a task can go right and wrong, all labeled in context.

That is the bet behind how we collect at Manukriya. Not the same lab loop repeated until the counter looks impressive, but real people doing real work across environments, with the slips and corrections left in and the conditions written down. The goal was never a bigger dataset. It is a dataset shaped like the world the robot has to work in.

The next constraint in Physical AI is not access to more data. It is access to data that looks like the world your robot will actually meet.