A 60-page field survey of every deep-tech model shipping in production right now. Where the moats actually are. Where they aren't. What the next twelve months will and won't do.
The defining shift of 2025 was not a new model release. It was the slow, mostly unannounced movement of physical-AI systems out of demos and into production lines, hospital wards, energy substations, and warehouses. We tracked 412 such systems. Thirty-eight are now load-bearing — meaning a real customer pays a real bill that depends on the model not failing in physical space.
This report is the synthesis of twelve months of fieldwork. We sat in on hundreds of integration meetings, ran our own benchmark across every model whose weights we could obtain, interviewed 84 operators, and pressure-tested the headline claims of every shipping vendor we could get on a call. The picture that emerges is sharper, and more uneven, than the trade press has suggested.
Three observations frame everything else in the report.
The model is no longer the moat. The cost of training a competitive base model in any given physical domain dropped between 2.4× and 11× over the year, depending on category. Where defensibility now lives — without exception, in every operator interview — is in the data-collection rig, the calibration recipe, the deployment harness, and the liability framework. The bits between the model and the world are now the bits worth paying for.
The deployment gap is widening, not narrowing. Of the 412 models we tracked, 89 reported in-domain benchmark performance within 5% of state-of-the-art. Only 38 have a real customer in production. The gap between "passes the benchmark" and "doesn't break the customer" is the dominant cost in every program we examined — and it grew, year-on-year, in all five physical-AI categories.
The simulation layer is the new platform fight. Whoever owns the simulators that produce training and evaluation data for physical AI will own the substrate of the next decade in the way that whoever owned PyTorch and CUDA owned the last one. We count seven serious contenders for that layer right now. Three are open. Four are not. The next eighteen months will likely halve that field.
If 2024 was the year the demos got real, 2025 was the year the integration costs got real with them. 2026 is the year the survivors find out which is harder.
We publish the Atlas every February for one reason: every founder, operator, and investor we work with reaches roughly the same conclusion in roughly the same order, and we can shorten that loop by writing it down. The Atlas is what we wish we had read when we were last starting from scratch.
It is not a market map. We have no opinion about which logos belong in which quadrant of which graphic. It is a field report — the things our researchers and venture partners learned by taking the meetings, reproducing the benchmarks, and watching the systems break in the wild.
Read it as one input, not five. The numbers are precise where they need to be. The judgements are ours.
// methodology, dataset definitions, inclusion criteria, and the full annotated bibliography are published in §02 and §10 of this document. The companion machine-readable dataset is at uplabs.lab/atlas-26/data under CC-BY-SA.
A field report is only as good as its inclusion criteria. Ours were strict, and we threw out roughly a third of the models we initially shortlisted. Here is how we built the dataset and what we explicitly chose to leave out.
To enter the Atlas a model had to satisfy five conditions; the full inclusion criteria are set out in §02.
Four hundred and twelve models passed all five gates. Roughly two hundred more were excluded — mostly for failing condition 4 (no 2025 evidence), with a long tail of failures on condition 3 (no non-foundation-model baseline) and condition 5 (eval not reproducible).
The Atlas is a survey of physical AI. We deliberately left out three large adjacent categories; the category boundaries are defined in §02.
We also excluded all systems where the only evidence was a press release or a cinematic demo video. That criterion alone excluded more candidates than all the others combined.
For every model in the Atlas we attempted at least one of the following pressure tests, in priority order: independent benchmark reproduction, operator interview about real-world performance, paper-trail audit of training data, and physical-deployment site visit. We managed at least one for 387 of the 412 models. The remaining 25 are flagged in the dataset as UNVERIFIED and excluded from any quantitative claim in the body of this report.
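The exclusion rule above is mechanical enough to sketch in a few lines. A minimal illustration in Python, assuming a hypothetical row shape (the field names `model_id` and `flags` are our invention for illustration, not the published schema of the companion dataset):

```python
# Hypothetical sketch of the UNVERIFIED exclusion rule described above.
# Field names are assumptions; consult the companion dataset for the
# real schema.

def quantitative_subset(rows):
    """Keep only models that passed at least one pressure test."""
    return [r for r in rows if "UNVERIFIED" not in r["flags"]]

toy_atlas = [
    {"model_id": "m-001", "flags": []},
    {"model_id": "m-002", "flags": ["UNVERIFIED"]},
    {"model_id": "m-003", "flags": []},
]

verified = quantitative_subset(toy_atlas)
# In the real dataset this filter keeps 387 of 412 models;
# here it keeps 2 of the 3 toy rows.
```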
Where our results disagreed with the original authors' published numbers, we wrote to the authors. Forty-one teams responded. In most cases the discrepancy was a configuration difference, which we documented and then averaged. In seven cases the discrepancy was unresolved; we report both numbers.
// the full pressure-test log, including the seven unresolved disagreements, is in the companion dataset.
The Atlas sorts the 412 models into five categories. The categories are deliberately use-case-shaped, not architecture-shaped, because the architectural diversity within each category is much smaller than the people inside it want to admit, and the deployment diversity is much larger than the press has acknowledged.
Models that produce control signals for a robot, vehicle, or actuator. Within this we split locomotion (legged, wheeled, aerial) from manipulation (arm-and-gripper) — they share less than the literature suggests and diverged architecturally as the year went on.
The defining shift of 2025: the move from per-task imitation-learning models to multi-task models trained on cross-embodiment data. The seven serious players in this category have all converged on roughly the same recipe — vision-language-action backbones at 3B to 30B params, trained on a mix of teleoperation, simulation, and YouTube — and the differences between them are now almost entirely on the data-collection side.
Models that consume sensor streams and emit either a labelled scene, an occupancy grid, or a predictive rollout. The split inside this category is whether the model is asked to recognise or to predict — and the predictive subset is where almost all the 2025 progress was. Recognition has been a solved-shaped problem since 2023; the thing that changed in 2025 is that you can now ask a model "what is going to happen in the next 6 seconds in this scene" and get an answer that is good enough to act on, in narrow domains.
Models that approximate an expensive simulator — fluid dynamics, molecular dynamics, plasma physics, weather. This is the category with the largest quantitative gains in 2025 and the smallest deployment count. Surrogates are 100 to 10,000× faster than first-principles simulation and now within engineering tolerance for an expanding list of problems. The bottleneck is regulatory, not technical: the customer wants a simulation result with a regulatory pedigree, and the surrogate doesn't have one yet.
Models that produce candidate designs — molecules, antibodies, materials, mechanical parts, circuit layouts — conditioned on a target spec. The wave of 2024 was generative chemistry; the wave of 2025 was generative engineering, with serious work shipping in mechanical CAD, ASIC layout, and packaging-engineering. The defensibility here is in the eval: a generated molecule means nothing without a wet-lab to test it, and the labs are the moat.
Models that predict an operational variable in a physical system — grid demand, traffic flow, hospital admissions, supply-chain lead times. The dullest category and the largest by deployment count. These models have been shipping for years; what changed in 2025 was the move from per-customer hand-tuned models to general-purpose foundation models, which finally crossed the cost-to-deploy threshold for mid-market customers.
| category | count | in production | headline ∆ '25 | where the moat lives |
|---|---|---|---|---|
| embodied controllers | 112 | 4 | +38% benchmark | data rig |
| perception & world | 97 | 11 | +22% predictive | sensor stack |
| scientific surrogates | 71 | 3 | up to 10,000× | regulatory |
| generative design | 69 | 6 | +44% hit-rate | wet-lab |
| operational forecasting | 63 | 14 | +11% MAPE | integration |
The architectural diversity inside each category is smaller than its proponents admit; the deployment diversity is larger than its critics admit. — Atlas '26, §03 framing note
Each category gets a chapter of its own deeper in the report, with the dataset, the live deployments, the open problems, and our best read of the eighteen-month trajectory.
The single most reliable predictor that a physical-AI program would still be running in twelve months was not its benchmark score, its parameter count, its founding team's pedigree, or its capital raised. It was whether the team owned its own data-collection rig.
We tested this claim three ways. We looked at the 38 models in production and asked which of them had bespoke collection infrastructure: 34. We looked at the dead programs — there are 47 of those in our dataset — and asked the same: 6. And we looked at the funded-but-dormant middle, where the answer was almost exactly even.
The point is not that owning a data rig causes survival. The point is that the willingness to invest two years of capex into a sensor harness, a simulator, or a teleop fleet correlates extremely tightly with the operational seriousness needed to ship a physical-AI product at all.
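The strength of the association can be put on one line as an odds ratio, using only the counts reported above (the funded-but-dormant middle is left out, since the section notes it was roughly even):

```python
# Back-of-envelope odds ratio for rig ownership vs. program survival,
# computed from the counts in this section: 34 of 38 production models
# owned a data rig; 6 of 47 dead programs did.

in_production = {"with_rig": 34, "without_rig": 4}
dead_programs = {"with_rig": 6, "without_rig": 41}

odds_if_alive = in_production["with_rig"] / in_production["without_rig"]  # 8.5
odds_if_dead = dead_programs["with_rig"] / dead_programs["without_rig"]   # ~0.15
odds_ratio = odds_if_alive / odds_if_dead
# ~58: a very strong association, which, as noted above, is not causation.
```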
Every shipping embodied controller we tracked had at least one of: a teleoperation fleet of more than 100 units, a domain-specific simulator with calibration loops back to the physical world, or a long-running sensor deployment producing labelled data at rate. The cost of this infrastructure is in the high seven figures and the time-to-stand-up is twelve to eighteen months. That is the moat. It is not glamorous and it is not transferable.
The unsexy infrastructure that wraps the model: the watchdogs, the fallback policies, the human-in-the-loop interfaces, the failure-mode taxonomy, the on-call rota, the simulator-replay system that lets you debug last Thursday's incident at 10× speed. None of this is in the paper. All of it is in the moat. The teams that survived 2025 had built this; the teams that didn't, hadn't.
Especially in surrogate models for engineering and in any clinical or safety-critical setting. A model that has a paper trail through the regulator is worth meaningfully more than one that doesn't, because the customer's compliance department is spending months they don't have on every new vendor. We watched a competitively inferior model win a major hospital deal on this axis alone.
For generative-design especially: whoever owns the wet lab, the test track, the wind tunnel, the synthesis pipeline, owns the velocity of the design loop. The model is the cheap part. The closing-the-loop infrastructure is not.
One of the leading mobile-manipulation programs we tracked has spent more on its teleoperation fleet over the last two years than on model training. The fleet — roughly 280 units across three sites — is the data-generation engine and the deployment-validation surface at once. The model, in their telling, is the artefact that falls out of the fleet, not the other way around. The competitor that invested instead in a larger model and a simulator-only data pipeline is now losing every customer comparison on long-tail reliability and is actively re-architecting around their own data rig, eighteen months too late.
Three forces collapsed model-level differentiation across the year:
The model is the cheap part. Everything around the model is the expensive, defensible part. This is the reverse of the picture five years ago and most of the strategy you read still hasn't caught up.
Equal weight has to be given to the inverse: the things everyone is treating as defensible that aren't, and the moats that have already collapsed without the trade press fully noticing. We list five.
Two years ago this was the most-discussed axis in the field. It is now an afterthought, and rightfully so. Within every category in the Atlas we found at least one model under 7B params that was within deployment tolerance of the 70B-plus state-of-the-art. The relationship between parameter count and customer outcomes was, to a first approximation, zero.
The benchmark you read on the leaderboard is, with high probability, contaminated, gameable, or both. We saw the same six tricks used so consistently across leaderboard submissions that we have stopped weighting them in any of our own diligence. None of them are intentional; all of them are structural. We discuss the six tricks, with examples, in the appendix.
Several deals announced in 2024 as "exclusive partnerships with a frontier lab" did not survive contact with 2025. Either the exclusive expired, the partner shipped a competing first-party product, or the partnership turned out to be a customer relationship dressed in marketing language. We count nine such partnerships announced in 2024; six of them are no longer load-bearing.
The "built for X" pitch — a model architecturally specialised for a single domain — was a viable position in 2022. By 2025 it had decayed into a marketing line. The general-purpose backbones now win at most domain tasks once they have access to the same data, and the data is what's actually scarce. We saw three companies built around this pitch fold or pivot during the year.
Worth noting because it is the most painful one. Several teams we admire built their data rig in 2021–22 and have been overtaken in 2025 by teams that started two years later, learned from public failure modes, and are running on cheaper infra. Being early is a moat only when paired with the willingness to keep tearing your own rig down.
The graveyard of physical-AI startups in 2025 is mostly populated by teams that confused their head start with a moat. — Atlas '26, §05 closing
Briefly, the six most common ways a leaderboard number ends up not predicting deployment outcomes:
None of these is a fraud claim. All of them are structural — the result of a fast-moving field with under-enforced eval norms. The Atlas weights every reported number against our pressure-test of which of these six modes the eval is exposed to.
Eighty-nine of the 412 models in the Atlas are within 5% of state-of-the-art on their headline benchmark. Thirty-eight are running in production. The shortfall — fifty-one models that pass the bar on paper but not in the field — is the single most important number in this report.
We spent more diligence time on this gap than on any other section. The pattern that emerges is consistent enough across categories that we feel confident in a tentative claim: the deployment gap is structural, not residual. It is not a problem that gets smaller with more model effort. It gets smaller with more environment effort, and the field is structurally underinvested in environment.
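The gap metric itself is simple enough to state in code. A sketch using this edition's headline figures (the same ratio is what call #10 in the predictions section tracks):

```python
# The deployment-gap metric: production count over near-SOTA count.
# Numbers are this edition's headline figures.

near_sota = 89      # within 5% of state-of-the-art on the headline benchmark
in_production = 38  # a paying customer depends on the model not failing

gap_ratio = in_production / near_sota  # ~0.43
shortfall = near_sota - in_production  # 51 models that pass on paper only
```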
Across the 51 not-shipped-but-good-enough models, the most common cause of the deployment shortfall, in order:
Note that only the first of these is something the model team could address by being better at modelling. The remaining four are environment-engineering, regulation, and integration problems. Most of the funded teams we tracked have a head of research and no head of deployment.
The 38 models in production crossed the gap one of two ways. The split is almost even.
Take a domain you can fence in tightly. A specific warehouse layout, a specific protein family, a specific weather product. Build the data rig and the deployment harness for that fenced-in domain. Ship. Expand later. Twenty-one of the 38 production models followed this pattern. They are smaller in scope than their press suggests; they are more reliable than their competitors.
Take a general model and wrap it in enough environment-engineering — guardrails, fallback policies, human-in-the-loop, replay-debug — that the long tail is contained operationally rather than technically. Seventeen of the 38 followed this pattern. They are riskier than the narrowed competitors but cover much wider use-cases per deploy.
Neither pattern is "correct". The teams that fail are the teams that pick neither — that ship a general-purpose model into a wide deployment with no environment harness and rely on the model's average-case performance to carry them. The model is good on average. Production is a long-tail business.
The bar for shipping is not the average case. It is the worst weekday of the worst week of the year. — Atlas '26, §06 closing
If you are a customer evaluating a physical-AI vendor, the questions in the Atlas RFP appendix (§10.3) are the ones that actually predict shipping outcomes. None of them are about the model. All of them are about the environment around the model. We have included a clean copy you can adapt.
Twelve specific predictions for what the next twelve months will and will not do, with our confidence level for each. Past Atlases have landed roughly sixty-five percent of their calls. §08 of this report scores last year's twelve.
| # | call | conf |
|---|---|---|
| 01 | The number of physical-AI models in production crosses 100 by Feb '27 (currently 38). | 0.78 |
| 02 | At least two of the four leading embodied-controller programs converge on the same teleop-data format and standardise it publicly. | 0.55 |
| 03 | A scientific-surrogate model becomes the default in at least one regulated engineering process (likely fluid dynamics in aerospace). | 0.62 |
| 04 | Generative-design hit-rates plateau at the model layer; the year's gains come from wet-lab throughput. | 0.71 |
| 05 | Open-weight robotics models approach within 10% of frontier closed models on standard manipulation benchmarks. | 0.66 |
| 06 | The simulation-platform fight (§01) collapses from 7 contenders to 3 by year-end. | 0.52 |
| 07 | At least one major hospital system standardises on a foundation-model triage product. | 0.74 |
| 08 | Per-token training costs in robotics fall another 3× year-on-year. | 0.61 |
| 09 | A high-profile physical-AI failure mode causes a sector-wide insurance-product re-rate. | 0.58 |
| 10 | The deployment gap (§06) does not close — measured as production count over near-SOTA count. | 0.69 |
| 11 | An open eval suite for physical generalisation — multi-lab, regulator-trusted — ships and is adopted by ≥3 of the top 10 programs. | 0.41 |
| 12 | One of the 38 production models in this Atlas is no longer in production by next year's edition. | 0.85 |
Confidence levels are our subjective probabilities based on the diligence behind each claim. Anything below 0.5 we don't recommend acting on, but we publish it anyway: we owe the reader our actual uncertainty rather than the marketing version of it.
The full reasoning behind each call — the operator interviews, the precedents, the priors — is in the companion document. The appendix is longer than this report.
Every call is logged with its confidence the day the Atlas ships. Twelve months later we score the set against a simple Brier loss — the squared distance between our stated probability and the realised outcome. We publish the raw scores for every call we have made since 2022. This is not a sophisticated calibration system; it is a public commitment to be embarrassed by our own track record. So far the year-on-year Brier score has trended down. We expect it to stop trending down at some point and we'd rather you watch us catch ourselves than hide it.
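The scoring rule is a few lines. A minimal sketch with illustrative numbers (these pairs are invented for the example, not last year's actual calls):

```python
# Brier loss over a set of calls: mean squared distance between the
# stated probability and the realised outcome (1 = happened, 0 = not).

def brier(calls):
    return sum((p - y) ** 2 for p, y in calls) / len(calls)

# Illustrative (probability, outcome) pairs, not a real Atlas year.
scored = [(0.78, 1), (0.55, 0), (0.62, 1), (0.85, 1)]

loss = brier(scored)  # ~0.129; always answering 0.5 scores 0.25
```

Lower is better: a perfectly calibrated, perfectly confident forecaster scores 0, and a coin-flip forecaster scores 0.25, which is the baseline to beat.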
Twelve months ago we made twelve calls in the AI Atlas '25. Eight landed, three didn't, one is still ambiguous. The three misses, in detail, with what we should have weighted differently — because the failure modes are more useful than the hits.
We were wrong. We had it at 0.74 confidence. We underestimated the speed at which a coalition of academic labs and one well-capitalised commercial release would close the gap. The lesson: in fast-moving categories, "X won't exist" calls degrade much faster than "Y will dominate" calls. We are pulling the same shape of call out of this year's set as a result.
They didn't quite. We had it at 0.66. The actual gain was around 1.4×, depending on how you slice the benchmark. We over-extrapolated from a strong Q4 '24 paper that turned out to depend on an unusually clean wet-lab pipeline that was not widely reproducible. Lesson: a single strong paper at the end of the year is often the high-water mark of that year's recipe, not the start of next year's.
None did, in the calendar year. We had it at 0.59. One has now begun an orderly wind-down in Q1 '26, which we think tells us something interesting about timing-of-call rather than direction-of-call. Lesson: distinguish between confidence in the direction and confidence in the timing. We mis-merged the two last year.
The eight calls we got right are listed in the appendix without commentary; the misses are more useful, so they get the words.
The misses are more useful than the hits. The hits tell you which priors were already correct. The misses tell you which priors are still wrong. — Atlas '26, §08 framing
The Atlas is a snapshot of a moving target. By the time you read this it is already a few weeks out of date, and by mid-summer at least three of the load-bearing claims in §03 will need to be revised. We will revise them, in public, and the revisions will live at the same URL as the report. The errata are part of the document.
If the report has a single bias to declare, it is this: we are practitioners, not analysts. We run a research lab and a small venture studio. We invest, we get hired, we ship code, we sit on boards. That gives us access we could not get as outside observers, and it gives us conflicts of interest we have to declare. The disclosure list is in the appendix. Read the report against it.
Where we have skin in the game, we have tried to weight our claims toward what we have actually shipped, not what we hope will happen. Where we have skin in the opposite game, we have tried to flag it. None of this protects you fully from our priors. It is the best we can do; the rest is on the reader.
The Atlas exists because eighty-four operators, researchers, and engineers spent time on the phone with us when they had no obligation to. The full credits are in the appendix. A particular thanks to the seven teams who shared unpublished negative results — the data that doesn't make it into a paper is, increasingly, the data that makes the picture true.
We do not pay for participation and we do not accept paid placement. The Atlas is funded entirely by the lab's research budget and by the venture-studio cohort. If you find it useful, the most useful thing you can do is push back where we are wrong. The errata page is open.
If 2024 was the year the demos got real and 2025 was the year the integration costs got real, 2026 is the year the survivors find out which is harder.
We will report back in twelve months.
— uplabs lab team
istanbul · q1 2026
The full machine-readable dataset of all 412 models, including pressure-test logs, our reproduction notes, and the seven unresolved discrepancies, is published at uplabs.lab/atlas-26/data under CC-BY-SA 4.0. The dataset is versioned; cite the specific version you read.
We maintain a public errata log at uplabs.lab/atlas-26/errata. Corrections are dated and signed. The errata are part of the document.
UPLABS holds equity in 6 of the 412 models in the dataset (cohort '20 — '25). These are flagged in the dataset with the UPL_HOLDING tag and excluded from any quantitative claim that ranks models against each other. Where excluded, we have noted the exclusion and the result with and without the held entries. The full list of holdings, board seats, and advisory relationships is in the dataset's disclosure.csv.
@report{uplabs_atlas_2026,
title = {AI Atlas '26 — the year physical AI ate the roadmap},
author = {Okafor, M. and Liu, J. and Hartwell, R. and others},
year = {2026},
org = {UPLABS Ltd},
dispatch = {060},
url = {uplabs.lab/atlas-26}
}
— end of document — UPLABS / DISPATCH 060 / FEB 2026 — 60pp web edition — version 2026.02.04 —
The 412-model dataset, the disclosure log, and the errata feed are public — CC-BY-SA. Send us a note if you want the briefing as a deck, or to talk to one of the authors.