Research pillar

Physical Intelligence Research

Foundation models for the physical world: sensorimotor learning and cross-embodiment transfer.

The bet is that the same scaling trajectory that produced general-purpose language models is now producing general-purpose physical policies. Foundation models trained on the union of internet, video, and cross-embodiment robot data are beginning to transfer across tasks and bodies in the way that language models transferred across tasks and domains. The technical questions split into four. The first is whether the pretraining recipe — what data mixture, what objective, what architecture — produces policies that transfer. The second is whether the action representation supports cross-embodiment transfer at the rate the cross-task transfer suggests. The third is whether the inference latency profile is compatible with the physical-control loops that real robots have to close. The fourth is whether the evaluation methodology is honest about what the policies do under deployment-realistic conditions, where curation is impossible. We work on the substrate: the action representations that make transfer possible, the inference systems that make it real-time, the cross-embodiment datasets that make it general, and the evaluation regimes that keep claims honest in a field where curated demos are abundant and real deployment is rare.

The four questions are different

The phrase “physical intelligence” gets used to mean at least four distinct things, and the conflation has cost the field clarity. The first is the foundation-model claim: that vision-language-action (VLA) pretraining at scale produces policies whose capability transfers to held-out tasks at rates that justify the term “foundation model.” The second is the cross-embodiment claim: that policies pretrained on a mixture of robot platforms transfer to held-out platforms at rates that justify shared training rather than per-platform training. The third is the deployment-latency claim: that the inference latency of foundation policies is compatible with the high-frequency control loops that physical robots have to close, on hardware that is constrained by power, thermal, and weight budgets. The fourth is the evaluation-honesty claim: that the field’s evaluation methodology, in particular the practice of reporting cherry-picked successful demonstrations, is incompatible with the deployment context, and that a correction toward uncurated-deployment-realistic evaluation is overdue.

The four claims are independent. A program can succeed at foundation-model pretraining and fail at cross-embodiment transfer; this is the present state of the field for many published policies. A program can succeed at both and fail at deployment latency; this is the state for academic-scale models that are too large to run at control-loop frequencies on edge hardware. A program can succeed at all three and fail at evaluation honesty, which is the structural risk that the literature’s curated-demo culture currently embodies. The relevant question for the program is how to address all four simultaneously, with the methodological discipline that comes from recognizing that the failure modes of each compound rather than cancel.

What foundation models have demonstrated

The 2023 RT-2 paper by Brohan and colleagues at Google DeepMind demonstrated that a single vision-language-action model trained on a mixture of robot trajectories and web-scale image-text data exhibits non-trivial transfer to held-out tasks [1]. The 2023 Open X-Embodiment collaboration, a cross-laboratory effort assembling a unified action-space dataset across multiple robot platforms, showed that cross-embodiment training on unified action spaces produces policies that transfer across robot platforms at rates that single-embodiment training does not match [2]. Physical Intelligence (the company) has shipped a series of policies (π0, π0.5, π0.7) that operate at high frequency on a range of embodiments and that demonstrate the open-vocabulary skill acquisition that VLA pretraining promises [3]. OpenVLA released open weights that have made the line of work reproducible at academic scale [4]. The 2025 FAST paper by Pertsch and colleagues established that action tokenization through frequency-space transforms produces representations amenable to autoregressive modelling [5]. The 2023 Diffusion Policy paper by Chi and colleagues serves as the default visuomotor architecture for a wide class of manipulation tasks, and represents the principal architectural alternative to autoregressive-token VLA designs [6].

The technical sources of difficulty are the ones the language-model literature already taught us to expect, plus several that are unique to embodiment. Action spaces are not unified across embodiments; tokenization is non-trivial. Real-time inference budgets are tight; a delay that goes unnoticed in a chat interface is a balance failure on a humanoid. Evaluations are easily Goodharted by curation. And the data-collection cost remains high relative to language data, even with pooled efforts.

There is also a representational concern that is specific to physical control. Language models operate on a discrete token space with a well-defined vocabulary and a clean autoregressive structure. Physical action does not have a corresponding canonical representation. Joint-space, Cartesian-space, and frequency-space representations each have advantages, and the choice between them is bound up with the choice of training objective and the choice of inference architecture. The community has not yet converged on a winner, and may not. The pretraining recipe that works depends on the representation, and the representation that works depends on the deployment surface — which means the field is doing several entangled experiments at once, and the program treats this entanglement as a feature of the current research moment rather than as a problem to be resolved before further work proceeds.

What the physical-intelligence program is, technically

We organize this work along four sub-strands.

Sensorimotor foundation models

The pretraining recipe is being actively worked out. Mixtures of internet-scale image-text data, simulation rollouts, real-robot demonstrations, and human-video data each contribute differently. RT-2 [1] and π0 [3] are the leading public references for what the recipe currently looks like. We work on the data-mixing problem (how the contribution of each source scales) and on the architectural choices that make the resulting policies amenable to interpretability and verification. The connection to mechanistic interpretability work in AI Safety is direct: a sensorimotor foundation model is no easier to interpret than a language model, and is doing more consequential things. SIMA [7] is an instructive reference for how scaling across simulated environments produces transfer, and we treat the simulator-to-physical-policy gap as a central methodological question rather than an implementation detail.

The discipline points include explicit data-source-attribution (so that the contribution of each data source to the resulting policy can be characterized rather than assumed), explicit pretraining-distribution-shift evaluation (so that the gap between training data and deployment data is measured rather than assumed away), and a preference for architectures whose internal representations are amenable to sparse-autoencoder-style interpretability work. The honest summary is that the current generation of VLA models has not been characterized with anything like the interpretability rigor that comparable language models have received, and the program treats closing this gap as a research priority.
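As a minimal sketch of what explicit data-source attribution could look like at the sampling stage, the snippet below draws training batches from a weighted mixture and reports the realized share of each source. The source names, the weights, and the function names are hypothetical placeholders, not the program's actual recipe.

```python
import random
from collections import Counter

# Hypothetical source names and mixture weights (illustrative values only).
MIXTURE = {
    "web_image_text": 0.50,
    "sim_rollouts": 0.20,
    "robot_demos": 0.20,
    "human_video": 0.10,
}


def sample_batch(mixture, batch_size, rng):
    """Draw one batch of source labels according to the mixture weights."""
    sources = list(mixture)
    weights = [mixture[s] for s in sources]
    return rng.choices(sources, weights=weights, k=batch_size)


def attribution_report(batches):
    """Return the realized fraction of samples drawn from each source."""
    counts = Counter(label for batch in batches for label in batch)
    total = sum(counts.values())
    return {source: counts[source] / total for source in counts}


rng = random.Random(0)  # fixed seed so the audit is repeatable
batches = [sample_batch(MIXTURE, 256, rng) for _ in range(100)]
report = attribution_report(batches)
for source, fraction in sorted(report.items()):
    print(f"{source}: target={MIXTURE[source]:.2f} realized={fraction:.3f}")
```

Logging the realized mixture alongside the target makes the contribution of each source something that is measured per run rather than assumed from the configuration.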

Action tokenization

The choice of action representation determines what transfer is possible. Naive joint-space tokenization is high-frequency but fails to transfer across embodiments. End-effector pose tokenization transfers better but loses contact information. The 2025 FAST paper proposes frequency-space tokenization that compresses action sequences into autoregressively-predictable tokens with quantifiable reconstruction loss [5]. We work on action representations that admit unified pretraining across embodiments while preserving the contact and force information that contact-rich manipulation requires. The choice between autoregressive token-prediction architectures and continuous-action diffusion architectures (Diffusion Policy [6]) is closely entangled with the tokenization question, and we expect the field to settle on hybrids rather than on a single winner.
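To make "quantifiable reconstruction loss" concrete, here is a small sketch of a frequency-space round trip: an action chunk is transformed with a discrete cosine transform, scaled and rounded to integers, then inverted, and the error is measured. The DCT is an assumed stand-in for a frequency-space transform; this is not the FAST implementation, and the `scale` parameter and test trajectory are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix of shape (n, n)."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)
    return m


def encode(chunk, scale):
    """Map a (T, D) action chunk to integer frequency coefficients."""
    coeffs = dct_matrix(chunk.shape[0]) @ chunk
    return np.round(coeffs * scale).astype(np.int64)


def decode(tokens, scale):
    """Invert encode() up to the rounding error."""
    return dct_matrix(tokens.shape[0]).T @ (tokens / scale)


def reconstruction_rmse(chunk, scale):
    """Root-mean-square error of the encode/decode round trip."""
    recon = decode(encode(chunk, scale), scale)
    return float(np.sqrt(np.mean((chunk - recon) ** 2)))


# Hypothetical smooth two-dimensional trajectory, 50 timesteps.
t = np.linspace(0.0, 1.0, 50)
chunk = np.stack([np.sin(2 * np.pi * t), 0.5 * np.cos(np.pi * t)], axis=1)

for scale in (10.0, 100.0):
    tokens = encode(chunk, scale)
    print(scale, np.count_nonzero(tokens), reconstruction_rmse(chunk, scale))
```

Because the transform is orthonormal, the round-trip RMSE is bounded by `0.5 / scale`, so the fidelity bar can be set in advance and checked rather than eyeballed. For smooth trajectories most high-frequency coefficients round to zero, which is what makes the integer sequence compressible.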

The discipline points include explicit reconstruction-loss characterization (the tokenization is measured against a fidelity bar rather than chosen by aesthetic preference), explicit cross-embodiment transfer evaluation (the tokenization is tested on its ability to support transfer rather than only on its ability to support training), and a preference for tokenizations whose information content is bounded and characterizable. Tokenizations that work but cannot be characterized are research debt rather than research results, and the program is explicit about this preference.

Real-time inference at the edge

A foundation model that requires a hundred milliseconds per inference is a balance failure on a humanoid. The deployment frontier requires policy inference at frequencies of tens to hundreds of Hertz on hardware that is constrained by power, thermal, and weight budgets. We work on quantization, distillation, and architectural choices that move foundation-model inference into the latency budgets that physical control requires. The connection to Cognitive Computing is direct: edge-inference architectures, including neuromorphic and in-memory-compute substrates, are part of the design space.
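The arithmetic behind the latency constraint can be sketched in a few lines. The example assumes a fixed-rate control loop and, as an added assumption not discussed above, the option of emitting several actions per inference call (action chunking) with the next chunk computed while the current one executes. All function names and numbers are hypothetical.

```python
def control_period_ms(control_hz):
    """Time available per control step, in milliseconds."""
    return 1000.0 / control_hz


def required_inference_hz(control_hz, chunk_len=1):
    """Inference calls per second when each call emits chunk_len actions."""
    return control_hz / chunk_len


def fits_budget(inference_ms, control_hz, chunk_len=1):
    """True if one inference call finishes before its chunk is used up."""
    return inference_ms <= chunk_len * control_period_ms(control_hz)


# Hypothetical numbers: a 100 ms policy driving a 50 Hz control loop.
print(control_period_ms(50))           # 20.0
print(fits_budget(100, 50))            # False
print(fits_budget(100, 50, 10))        # True
print(required_inference_hz(50, 10))   # 5.0
```

Chunking relaxes the inference rate but means the robot acts without fresh observations for the length of the chunk, so it trades reactivity for throughput rather than removing the constraint.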

We are also interested in hierarchical inference (a fast, light policy at the high-frequency control layer, refreshed asynchronously by a slower, heavier policy that handles task-level reasoning) as a pragmatic way to use available silicon. The hierarchical-inference architecture borrows from the operating-systems literature on real-time scheduling and from the control-theory literature on cascaded controllers; the integration with foundation-policy pretraining is the open engineering question. The discipline points include explicit latency-budget allocation across the hierarchy (so that the latency budgets of the high-frequency and low-frequency layers are each bounded), and explicit failure-mode characterization at each level (so that the failure of one layer does not cascade into uncharacterized failure of the others).
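One way to reason about the latency budget of such a hierarchy is to simulate how stale the slow layer's output is when the fast layer consumes it. The sketch below is a deterministic tick simulation under simplifying assumptions: the planner restarts as soon as it finishes, and its result reflects the observation taken when it started. The rates, latencies, and names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HierarchyConfig:
    control_hz: int = 200          # fast-layer rate (hypothetical; must divide 1000)
    planner_latency_ms: int = 100  # slow-layer time per inference (hypothetical)


def simulate(config, duration_ms):
    """Return the age in ms of the plan used by the fast layer at each tick.

    The age is None until the first plan has arrived.
    """
    tick_ms = 1000 // config.control_hz
    plan_obs_time = None  # time of the observation behind the active plan
    pending_start = 0     # time at which the in-flight planner call began
    ages = []
    for now in range(0, duration_ms, tick_ms):
        if now - pending_start >= config.planner_latency_ms:
            plan_obs_time = pending_start  # the planner result lands
            pending_start = now            # the next planner call begins
        ages.append(None if plan_obs_time is None else now - plan_obs_time)
    return ages


ages = simulate(HierarchyConfig(), duration_ms=1000)
valid = [a for a in ages if a is not None]
print(min(valid), max(valid))
```

Under these assumptions the fast layer always acts on a plan that is between one and almost two planner latencies old, and it has no plan at all during the first planner call. Both facts are budget items: the fast policy has to remain safe on stale plans, and it needs a defined behavior before the first plan arrives.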

Cross-embodiment transfer

The RT-X collaboration [2] demonstrated that pooled cross-embodiment data produces policies that transfer better than single-embodiment training. The RoboCat line of work [8] extended this with self-improvement across embodiments. We work on the transfer problem in regimes where embodiments are substantially different (bipedal humanoids versus tabletop arms versus mobile manipulators) and on the evaluation question of how to measure transfer without inflating it through favorable task selection. The honest version of the question is whether transfer scales with the diversity of pretraining embodiments, or whether the returns flatten well before the embodiment space is covered. The empirical answer is not yet in, and the program treats the question as load-bearing for the field’s deployment trajectory.
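A leave-one-embodiment-out protocol is one assumed way to measure transfer without selecting favorable tasks. The sketch below builds the splits and computes a per-task success-rate difference between a pooled policy and a single-embodiment baseline, refusing to proceed if the two task lists differ. The episode records, task names, and success rates are hypothetical.

```python
def leave_one_embodiment_out(episodes):
    """Yield (held_out, train, test) splits, one per embodiment.

    episodes: list of dicts, each with at least an "embodiment" key.
    """
    embodiments = sorted({ep["embodiment"] for ep in episodes})
    for held_out in embodiments:
        train = [ep for ep in episodes if ep["embodiment"] != held_out]
        test = [ep for ep in episodes if ep["embodiment"] == held_out]
        yield held_out, train, test


def transfer_gain(pooled_success, baseline_success):
    """Per-task success-rate difference over every task in the protocol.

    Raises if the task sets differ, so tasks cannot be dropped silently.
    """
    if set(pooled_success) != set(baseline_success):
        raise ValueError("task lists differ; report every pre-registered task")
    return {
        task: pooled_success[task] - baseline_success[task]
        for task in sorted(pooled_success)
    }


# Hypothetical episode records and success rates.
EPISODES = [
    {"embodiment": "tabletop_arm", "task": "pick"},
    {"embodiment": "tabletop_arm", "task": "place"},
    {"embodiment": "humanoid", "task": "pick"},
    {"embodiment": "mobile_manipulator", "task": "place"},
]
splits = list(leave_one_embodiment_out(EPISODES))
gains = transfer_gain({"pick": 0.75, "place": 0.25}, {"pick": 0.5, "place": 0.5})
print(gains)
```

Reporting the gain for every task, including the negative ones, is what keeps the headline number from being inflated by task selection.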

Definitional bounds

Before moving to the open problems, four exclusions are worth being explicit about.

Physical intelligence does not mean general embodied AGI. The program works on foundation-policy substrates that support cross-task and cross-embodiment transfer, not on embodied artificial general intelligence. The popular-science framings of “AI in the physical world” as imminent embodied AGI are not the program’s research substrate.

Physical intelligence does not mean single-foundation-policy convergence. The architectural diversity of the field — autoregressive-token VLAs, diffusion policies, hybrid architectures, hierarchical systems — is a feature of the current research moment, not a problem to be solved by premature convergence. The program funds work across the architectural alternatives rather than betting on a single winner.

Physical intelligence does not mean evaluation is solved. The evaluation-honesty concern is real and unsolved. The field’s curated-demo culture is incompatible with the deployment context, and the methodological correction is overdue. The program is explicit about this and treats uncurated-deployment-realistic evaluation as a load-bearing research investment.

Physical intelligence does not mean cloud inference is sufficient. The deployment context — humanoids, mobile manipulators, autonomous vehicles — requires edge inference at high frequency on power-and-thermal-constrained hardware. Policies that work only in the cloud are research demos rather than deployable products, and the program treats edge-inference work as load-bearing rather than as an afterthought.

Open problems

  1. Action-space unification across embodiments. No single tokenization is dominant. Joint-space, end-effector, frequency-space, and learned-codebook approaches each transfer well in some regimes and badly in others. We do not have a principled selection rule.
  2. Real-time inference latency budgets. The product of inference frequency and policy size is bounded by hardware. The Pareto frontier of policy capability versus inference latency on edge hardware is moving, but slowly relative to model-scale increases on the cloud side.
  3. Online adaptation. Foundation policies are pretrained and then largely frozen at deployment. The adaptation regime — light-touch finetuning, adapter modules, in-context skill acquisition — works in some cases and not others, and the conditions under which it works are not characterized.
  4. Modular skill libraries. The Voyager-style skill-library approach is attractive but has not been demonstrated at humanoid scale. The composition rules under which acquired skills compose into longer-horizon behaviors are open.
  5. Goodhart-resistant evaluation. Most public robot-learning evaluations are easily over-fit by curation. Real evaluation requires uncurated deployment, which is rare. The methodological problem is open.
  6. Contact and force representation. Vision-only policies underperform on contact-rich manipulation. Tactile and force-torque modalities improve performance but are not yet integrated into pretraining at scale.
  7. Mechanistic interpretability of VLA models. The interpretability work that has been done on language models has not been extended to VLA models at scale. The program treats this as a load-bearing research investment.
  8. Verifiable physical-action policies. The verified-envelope work in AI Safety and Project Aegis needs to be extended to physical-action policies, where the action grammars are higher-bandwidth and the side effects are physical.
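The claim in problem 2 that hardware bounds the product of inference frequency and policy size can be illustrated with a back-of-envelope estimate. The sketch assumes the common approximation of about two floating-point operations per parameter per processed token for a dense forward pass, and an assumed fraction of peak compute actually achieved. It ignores memory bandwidth, which often dominates autoregressive decoding, so it is an optimistic upper bound. All numbers are hypothetical.

```python
def max_inference_hz(params, tokens_per_call, peak_flops, utilization=0.3):
    """Rough upper bound on policy inference rate for a dense model.

    params:          number of model parameters
    tokens_per_call: tokens processed per inference call
    peak_flops:      accelerator peak throughput in FLOP/s
    utilization:     assumed fraction of peak actually achieved
    """
    flops_per_call = 2.0 * params * tokens_per_call
    return peak_flops * utilization / flops_per_call


# Hypothetical: 3B-parameter policy, 300 tokens per call, 100 TFLOP/s device.
hz = max_inference_hz(params=3e9, tokens_per_call=300, peak_flops=100e12)
print(round(hz, 1))
```

In this approximation, halving the parameter count doubles the achievable rate on the same device, which is the frequency-times-size trade the open problem describes.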

Three risk scenarios

Scenario A — Curation collapse

The first failure mode is the curation-collapse scenario. The field publishes increasingly impressive curated demonstrations, the deployment context relies on the implicit promise of those demonstrations, and the gap between curated-demo capability and uncurated-deployment capability widens to the point where the deployment context fails. The mitigation is methodological: insistence on uncurated-deployment-realistic evaluation, with explicit failure-rate reporting, explicit task-distribution characterization, and explicit pre-registration of evaluation protocols.
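Explicit failure-rate reporting can be sketched as a summary over every attempted trial, with an interval that reflects how few trials robot evaluations usually have. The example uses the Wilson score interval, which is a standard choice for binomial proportions at small sample sizes. The function names and trial counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def report(outcomes):
    """Summarize every attempted trial, successes and failures alike.

    outcomes: list of booleans, one per trial, in the order they were run.
    """
    trials = len(outcomes)
    successes = sum(outcomes)
    low, high = wilson_interval(successes, trials)
    return {
        "trials": trials,
        "success_rate": successes / trials,
        "failure_rate": 1 - successes / trials,
        "ci95": (low, high),
    }


# Hypothetical run: 12 successes in 20 uncurated trials.
summary = report([True] * 12 + [False] * 8)
print(summary)
```

With only twenty trials the interval is wide, which is itself a finding worth reporting: it shows how little a handful of successful demonstrations constrains the deployment failure rate.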

Scenario B — Inference-latency wall

The second failure mode is the inference-latency-wall scenario. Foundation-policy capability grows with model scale, but edge-inference latency grows with model scale at a rate that physical control loops cannot accommodate, and the deployable policy capability stalls at a level well below the cloud-inference capability. The mitigation is hierarchical inference, aggressive distillation, and architectural co-design of the policy and the inference hardware. The program treats edge-inference work as load-bearing rather than as a downstream optimization.

Scenario C — Successful cross-embodiment foundation policies

The third scenario, which we treat as the base case if the technical and methodological work is competent, is staged deployment in which cross-embodiment foundation policies are validated on increasingly diverse embodiments, the evaluation methodology corrects toward uncurated-deployment-realistic protocols, the edge-inference latency budgets are met through architectural and hardware co-design, and the verifiable-physical-action work extends the safety substrate to physical control. This is the trajectory the program is aiming at.

What technical work bears on this

This pillar is the foundation-model layer that Humanoid Robotics and Autonomous Agents build on. The edge-inference work connects directly to Cognitive Computing, where neuromorphic and in-memory-compute substrates expand the deployable inference frontier. The systems-engineering view sits at Physical Intelligence (engineering). Verification of physical-control policies is a non-trivial extension of the work in AI Safety: the action grammars are higher-bandwidth, the side effects are physical, and the verified envelopes from Project Aegis require physics-aware extensions. The connection to ENERA is through the autonomous-assembly applications: kilometre-scale orbital arrays cannot be assembled by human-supervised means, and the foundation-policy substrate is the deployable autonomy that makes the assembly tractable.

Where to read further

Humanoid Robotics treats the embodied platform that physical-intelligence policies are deployed on. Autonomous Agents treats the multi-agent extension. Cognitive Computing treats the edge-inference substrate. AI Safety treats the verification framework. Project Aegis treats the deployable-policy verification substrate.

Footnotes

  1. Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, et al. (Google DeepMind), “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control”, arXiv 2023.

  2. Open X-Embodiment Collaboration, “Open X-Embodiment: Robotic Learning Datasets and RT-X Models”, arXiv 2023.

  3. Kevin Black, Noah Brown, Danny Driess, et al. (Physical Intelligence), “π0: A Vision-Language-Action Flow Model for General Robot Control”, arXiv 2024.

  4. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, et al., “OpenVLA: An Open-Source Vision-Language-Action Model”, arXiv 2024.

  5. Karl Pertsch, et al., “FAST: Efficient Action Tokenization for Vision-Language-Action Models”, arXiv 2025.

  6. Cheng Chi, Siyuan Feng, Yilun Du, et al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion”, arXiv 2023 (RSS 2023).

  7. SIMA Team (DeepMind), “Scaling Instructable Agents Across Many Simulated Worlds”, arXiv 2024.

  8. Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, et al. (DeepMind), “RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation”, arXiv 2023.

FAQ

Common questions

  • What is physical intelligence?

    Physical intelligence is foundation-model-style competence applied to the physical world: perceiving, predicting, acting, and learning across embodiments and environments. The research aim is to do for sensorimotor control what frontier text models did for language: produce one model that generalises across many tasks, devices, and surfaces.

  • How do action tokenisations generalise across embodiments?

    A useful action tokeniser maps continuous motor commands into discrete tokens that preserve task-relevant structure across very different bodies. The current research frontier is whether the right tokenisation is geometric, dynamics-aware, or learned end-to-end. Our bet is that hybrid schemes win, with body-specific decoders below a shared planner.

  • What inference latency budgets are tolerable for closed-loop control?

    It depends on the task: on the order of 100 ms for whole-body locomotion, 10 ms for fine manipulation, and sub-millisecond for some force-control regimes. Foundation-model inference does not natively meet those budgets, so the architecture has to combine a slow, large planner with fast, small controllers. We are characterising that hierarchy.

  • How do you evaluate physical foundation models without easy Goodharting?

    By evaluating on long, multi-stage tasks in environments with sparse, semantically meaningful success criteria, not on short tasks that admit shortcut policies. We are building evaluation infrastructure where task variation is structured enough to compare runs but diverse enough to make memorisation infeasible.

Get involved

We welcome collaborators on this pillar. Write to research@apiksystems.com with a short note about what you'd like to work on.
