The case for the humanoid form factor is narrow and specific: human environments are already shaped for human bodies, and rebuilding them is more expensive than building bodies that can use them. The humanoid bet is therefore not a generality bet about form factors; it is a deployment bet about the most direct route to general-purpose physical labor in human-shaped spaces: warehouses, kitchens, hospitals, homes. The technical questions split into four. The first is locomotion: bipedal walking and running on unstructured terrain at the reliability bar that real deployment requires. The second is dexterous manipulation: contact-rich, in-hand, two-handed, articulated-object manipulation at rates that justify the platform’s cost. The third is whole-body integration: closing the control loops at kilohertz rates while running foundation-policy inference at tens to hundreds of hertz, with an explicit hierarchical decomposition that respects both bandwidth budgets. The fourth is the deployment-and-evaluation discipline: distinguishing the curated-demo capability that the field publishes from the uncurated-deployment capability that the platforms actually have.
The four questions are different
Three years ago, humanoid robotics was a research field with a small set of expensive bipedal platforms and a long list of unsolved problems. It is now an industry, with multiple commercial programs at scale: Boston Dynamics’ Atlas in its electric incarnation, Figure’s 03 platform with the Helix end-to-end vision-language-action model, Tesla’s Optimus generations, 1X’s Neo, Apptronik’s Apollo, and a growing list of Chinese and European entrants. The pace is real, but the gap between demo and deployment is the thing that matters. Most demos are heavily curated, run on flat indoor surfaces with known objects, and rely on remote teleoperation or per-task scripted policies under the hood.
The locomotion claim is the cleanest of the four. Bipedal locomotion on real terrain requires control loops that close in milliseconds while integrating proprioception, vision, and inertial measurement under partial occlusion. Berkeley’s and ETH Zürich’s work on learned locomotion controllers has demonstrated that reinforcement-learning policies trained in simulation can transfer to hardware with reasonable robustness. The 2020 Lee, Hwangbo, Wellhausen, Koltun, and Hutter paper in Science Robotics on quadrupedal locomotion over challenging terrain is the cleanest reference for what learned controllers can achieve.[1] The same techniques struggle on the manipulation side, where contact dynamics and object diversity are much harder to model and simulate than the foot-ground contact that locomotion largely depends on.
The manipulation claim is harder. RT-2’s vision-language-action model showed that internet-scale pretraining transfers usefully into the action space.[2] The Open X-Embodiment dataset pooled the demonstration data of more than thirty research labs to produce a single dataset large enough to make foundation pretraining tractable.[3] OpenVLA’s open-weights release made the line of work reproducible at academic scale.[4] π0 from Physical Intelligence demonstrated cross-embodiment manipulation policies that operate at high frequency on a range of robot bodies.[5] Diffusion policies have become the default for visuomotor learning.[6] But the long tail (manipulating deformable objects, handling unexpected contact, recovering from contact-mode errors mid-trajectory) remains hard, and per-task data requirements remain high.
There is a second-order issue beneath the per-task data problem. Robot demonstrations are produced by humans operating teleoperation interfaces, and the throughput of high-quality demonstration data is bounded by human-operator hours. The Open X-Embodiment dataset is a cooperative achievement, not an industrial inevitability; the data-economics of robot learning is qualitatively different from the data-economics of language modelling, and the deployment timeline depends on whether the gap closes through synthetic data, simulation, or new modes of human supervision.
The reason this work matters: the value created by physical labor in human-shaped environments is large, the demographic pressure on its supply is rising, and the technical trajectory is now real. The rest of this page describes how we work the problem.
What the humanoid-robotics program is, technically
We organize this work along four sub-strands.
Locomotion
Humanoid locomotion is at the threshold where learned controllers reliably outperform hand-tuned model-predictive control on a widening set of terrains. The Lee et al. 2020 work on learned legged locomotion in unstructured environments[1] and ETH Zürich’s broader portfolio of learned-controller research established the basic recipe: massive simulation, domain randomization, and policy distillation onto the physical platform. Boston Dynamics’ Atlas demonstrates the high end of what is achievable with a hybrid model-based and learned stack. We work on locomotion that integrates manipulation context (the controller’s foot placement and momentum management depend on what the upper body is doing) and on graceful degradation when the visual or inertial input degrades.
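The domain-randomization step of that recipe can be sketched as follows. This is a minimal illustration, not the program’s training code: the parameter names, units, and ranges are hypothetical, and a real pipeline would pass each sampled set to a simulator before every episode.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsParams:
    """Per-episode simulator parameters (hypothetical names and units)."""
    ground_friction: float       # Coulomb friction coefficient
    payload_mass_kg: float       # extra mass attached to the torso
    motor_strength_scale: float  # multiplier applied to commanded torque
    sensor_latency_ms: float     # delay applied to observations


# Hypothetical ranges; real ranges would come from hardware identification.
RANGES = {
    "ground_friction": (0.4, 1.2),
    "payload_mass_kg": (0.0, 5.0),
    "motor_strength_scale": (0.8, 1.2),
    "sensor_latency_ms": (0.0, 20.0),
}


def sample_params(rng: random.Random) -> PhysicsParams:
    """Draw one parameter set, sampling each field uniformly from its range."""
    return PhysicsParams(**{k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()})


# One parameter set per training episode, seeded for reproducibility.
rng = random.Random(0)
episodes = [sample_params(rng) for _ in range(1000)]
```

Uniform sampling over fixed ranges is the simplest choice; which phenomena such ranges cover well, and which they do not, is exactly the mapping question this page treats as open.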
The discipline points include simulation-to-real-gap characterization (the classes of phenomena that domain randomization handles cleanly versus the classes it does not are not yet well-mapped), explicit failure-mode characterization on hardware (the rate at which the locomotion controller produces unsafe states under realistic-deployment perturbations is not currently reported in academic publications, and is the relevant safety datum), and an architectural preference for hybrid controllers (where model-based control handles the well-characterized regime and learned control extends the operating envelope into regimes the model does not capture).
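One way to report the unsafe-state rate mentioned above is as a proportion with a confidence interval rather than a bare percentage. The sketch below uses the Wilson score interval; the function name and the choice of interval are our assumptions for illustration, not a reporting standard the source prescribes.

```python
import math


def wilson_interval(events: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an event rate (z = 1.96 gives roughly 95%).

    `events` is the number of trials that ended in an unsafe state and
    `trials` is the number of perturbation trials run on hardware.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= events <= trials:
        raise ValueError("events must lie in [0, trials]")
    p = events / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Zero unsafe states in 100 trials still leaves an upper bound of a few percent.
low, high = wilson_interval(0, 100)
```

The point of the interval is that a short run with no observed failures does not demonstrate a low failure rate; the upper bound only tightens as trials accumulate.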
Dexterous manipulation
The manipulation problem has been transformed by the arrival of vision-language-action foundation models. RT-2,[2] the Open X-Embodiment dataset and its RT-X models,[3] OpenVLA,[4] and π0[5] collectively established that pretrained policies transfer across embodiments with non-trivial efficiency. We work on the long tail of contact-rich manipulation: insertion, in-hand reorientation, two-handed manipulation of articulated objects. Diffusion policy[6] is one of our default architectures for the visuomotor layer; we are interested in extending it with explicit contact modelling. The Franka research platforms remain useful for tabletop work where the cost of a humanoid is unjustified.
The contact-rich manipulation frontier is the most consequential research surface for deployable humanoid robotics. Free-space manipulation — reaching, grasping, stacking — is approximately solved at the demo level by the foundation-model substrate. Contact-rich manipulation — insertion, force-controlled assembly, in-hand reorientation, deformable-object handling — remains hard, the data is harder to collect, the simulation-to-real gap is wider, and the failure modes are harder to recover from. The discipline points include explicit force-and-tactile-sensor integration into the pretraining substrate (vision-only policies systematically underperform on contact-rich tasks), explicit failure-mode-recovery training (the policy must recover from contact-mode errors rather than relying on the demonstrator never having entered such states), and explicit task-distribution evaluation (the published successes are typically on a curated task distribution, and the unrecovered failure rate on the broader distribution is the relevant metric).
Sensor fusion
A humanoid is a multi-modal sensor platform: stereo and depth cameras, joint encoders, IMUs, tactile sensors at the fingertips, force-torque sensors at the wrists. Fusing these modalities under tight latency budgets is a real engineering problem, and one where the gap between simulator and hardware is unforgiving. We work on representations that maintain calibrated uncertainty across modalities, and on fallback behaviors when individual sensors degrade. NVIDIA Isaac Sim is one of our reference simulation environments; we build on it where its physics fidelity is sufficient and around it where it is not.
The discipline points include explicit calibration-uncertainty representation (so that the policy can reason about when sensor inputs are unreliable rather than treating them as oracle-level reliable), explicit modality-failure characterization (so that the platform’s behavior under sensor degradation is characterized rather than assumed), and a preference for sensor-fusion architectures that produce interpretable outputs (so that the operator can understand why the platform did what it did, particularly under failure).
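A minimal sketch of uncertainty-aware fusion with a degradation fallback, assuming scalar estimates with independent errors and a known variance per sensor. The function name and the variance threshold are hypothetical; a real stack would fuse full state vectors, typically inside a filter.

```python
from typing import Optional


def fuse(estimates: list[tuple[float, float]],
         max_variance: float) -> Optional[tuple[float, float]]:
    """Inverse-variance weighted fusion of (value, variance) estimates.

    Estimates whose variance exceeds `max_variance` (or is non-positive)
    are treated as degraded and dropped. Returns (fused_value,
    fused_variance), or None when no usable sensor remains so the caller
    can switch to a fallback behavior.
    """
    usable = [(x, v) for x, v in estimates if 0.0 < v <= max_variance]
    if not usable:
        return None
    weights = [1.0 / v for _, v in usable]
    total = sum(weights)
    value = sum(w * x for w, (x, _) in zip(weights, usable)) / total
    return value, 1.0 / total


# Two equally trusted sensors average; a degraded one is ignored.
both = fuse([(1.0, 1.0), (3.0, 1.0)], max_variance=10.0)
one_degraded = fuse([(1.0, 1.0), (100.0, 50.0)], max_variance=10.0)
all_degraded = fuse([(1.0, 50.0)], max_variance=10.0)
```

Returning the fused variance alongside the value is what lets a downstream policy reason about reliability instead of treating the input as an oracle, and the explicit None makes the all-sensors-degraded case a characterized behavior rather than an assumed one.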
Whole-body control
Whole-body control is the layer that turns task-level goals into joint-level torques while respecting balance, contact, and kinematic constraints. Classical optimization-based whole-body controllers solve a quadratic program at each control step; learned controllers replace some or all of that solve with a policy. The interesting research question is the layering: which parts of the stack benefit from being learned, which from being analytically constrained, and how the two layers communicate in the presence of unmodelled dynamics.
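For reference, the quadratic program solved at each control step commonly takes a form like the one below. This is a generic textbook-style formulation written under our own notation, not the specific controller used by any platform named on this page. Here q is the generalized configuration (with \dot{q} standing in for the generalized velocity), \tau the joint torques, f the contact forces, M the mass matrix, h the Coriolis and gravity terms, S the actuated-joint selection matrix, J_i and J_c the task and contact Jacobians, w_i the task weights, and \mathcal{K} a linearized friction cone.

```latex
\begin{aligned}
\min_{\ddot{q},\,\tau,\,f}\quad
  & \sum_{i} w_i \,\bigl\| J_i \ddot{q} + \dot{J}_i \dot{q} - \ddot{x}_i^{\mathrm{des}} \bigr\|^2 \\
\text{s.t.}\quad
  & M(q)\,\ddot{q} + h(q,\dot{q}) = S^{\top}\tau + J_c^{\top} f
    && \text{(floating-base dynamics)} \\
  & J_c \ddot{q} + \dot{J}_c \dot{q} = 0
    && \text{(stationary contacts)} \\
  & f \in \mathcal{K}
    && \text{(linearized friction cones)} \\
  & \tau_{\min} \le \tau \le \tau_{\max}
    && \text{(actuator limits)}
\end{aligned}
```

In this framing, the layering question is which terms stay analytic (typically the dynamics and limit constraints) and which are supplied by a learned component (for example the desired task accelerations \ddot{x}_i^{\mathrm{des}}).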
A practical concern is that the bandwidth of the foundation-model layer and the bandwidth of the whole-body control layer differ by orders of magnitude: the policy may run at tens of hertz while the controller runs at kilohertz rates. The design of the interface between them determines which disturbances the controller handles autonomously and which propagate to the policy for revision, and it is one of the load-bearing engineering decisions in any humanoid stack. Our internal work focuses on whole-body control that integrates cleanly with the foundation-model layer described in Physical Intelligence. The discipline points include explicit bandwidth allocation across the layers, explicit disturbance-rejection-mode characterization, and a preference for hierarchical-controller architectures over single-monolithic-controller architectures.
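The bandwidth split can be illustrated with a toy two-rate loop. This is a sketch under stated assumptions: a one-dimensional unit mass, a PD law standing in for the whole-body controller, a step function standing in for policy inference, and hypothetical rates and gains. It is not a model of any real humanoid stack.

```python
CONTROL_HZ = 1000  # assumed inner-loop rate
POLICY_HZ = 50     # assumed policy inference rate
TICKS_PER_POLICY_STEP = CONTROL_HZ // POLICY_HZ


def policy(t: float) -> float:
    """Stand-in for foundation-policy inference: returns a position target."""
    return 1.0 if t >= 0.1 else 0.0


def run(duration_s: float = 2.0, kp: float = 400.0, kd: float = 40.0):
    """Simulate a unit mass tracked by a fast PD loop under a slow policy."""
    dt = 1.0 / CONTROL_HZ
    pos, vel, target = 0.0, 0.0, 0.0
    policy_calls = 0
    steps = int(round(duration_s * CONTROL_HZ))
    for tick in range(steps):
        # Slow layer: refresh the target once every TICKS_PER_POLICY_STEP ticks.
        if tick % TICKS_PER_POLICY_STEP == 0:
            target = policy(tick / CONTROL_HZ)
            policy_calls += 1
        # A one-tick push at t = 1.0 s that the slow layer never observes.
        disturbance = 50.0 if tick == CONTROL_HZ else 0.0
        # Fast layer: PD law on the held target, semi-implicit Euler step.
        force = kp * (target - pos) - kd * vel + disturbance
        vel += force * dt
        pos += vel * dt
    return pos, policy_calls, steps


final_pos, policy_calls, control_steps = run()
```

In this toy the push is rejected entirely by the fast loop while the policy is called twenty times less often; deciding which disturbances should instead be escalated to the policy is the interface decision described above.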
Definitional bounds
Before moving to the open problems, four exclusions are worth being explicit about.
Humanoid robotics does not mean humanoids are universally optimal. The humanoid form factor is the right form factor for human-shaped environments, where the cost of rebuilding the environment exceeds the cost of building a body that fits. It is not the right form factor for purpose-built automated warehouses (where wheeled platforms with task-specific manipulators dominate), for outdoor terrain (where wheeled or tracked platforms dominate), or for very heavy industrial work (where fixed-position robotics with high-stiffness actuation dominates). The program targets the human-shaped-environment niche specifically.
Humanoid robotics does not mean general physical AGI. The platform is for general-purpose physical labor in human-shaped spaces, not for embodied artificial general intelligence. Popular-science framings of humanoid robotics as imminent embodied AGI are not the program’s research substrate.
Humanoid robotics does not mean deployment is solved. The deployment context — uncurated environments, contact-rich manipulation, multi-hour autonomous operation, recovery from unexpected failures — is hard, and the field’s curated-demo culture systematically overstates the deployment readiness of current platforms. The program is explicit about this and treats uncurated-deployment-realistic evaluation as a load-bearing research investment.
Humanoid robotics does not mean replacement of humans. The deployment context for humanoids is augmentation of human physical labor in environments where the work is physically demanding, dangerous, or short of qualified workers (an aging-population reality in much of the developed world). The program is explicit about the augmentation framing and does not endorse the replacement framing as either accurate or useful, politically or economically.
Open problems
- Dexterous-manipulation scaling. Per-task data requirements remain high, and skills do not yet transfer across object geometries with the efficiency that pretraining suggests is possible. The data efficiency frontier is the binding constraint on practical deployment.
- Contact-rich manipulation. Tasks involving deliberate, controlled contact — assembly, in-hand reorientation, deformable manipulation — remain harder than free-space manipulation by a substantial margin.
- Real-time whole-body control. Closing the loop fast enough to handle disturbances at humanoid scale, while running a foundation-model policy with non-trivial inference latency, is an unsolved systems problem. The right hierarchical decomposition is open.
- Simulation-to-real. Sim-to-real has improved substantially, but it remains the dominant source of deployment surprise. The classes of phenomena that domain randomization handles cleanly, and the classes it does not, are not yet well-mapped.
- Energy-efficient actuation. Continuous operation on battery is bounded by actuator efficiency. The thermodynamic and mechanical limits here are not loose, and the trade-off between torque density and efficiency constrains what tasks a humanoid can sustain.
- In-context tool acquisition. Humans pick up unfamiliar tools and learn them on the spot. Robots do not, reliably. In-context skill acquisition for novel tools is open, and it is the property that most clearly separates a useful humanoid from a curated demo.
- Safety in shared spaces. Humanoids deployed in environments shared with humans require safety behavior that is robust to unexpected human movement, deliberate adversarial behavior, and sensor-degradation conditions. The deployment-safety literature is at an early stage relative to the capability literature.
- Failure-mode-recovery training. Policies trained on demonstrations of successful task completions do not, in general, learn to recover from the failure modes that real deployment produces. Explicit failure-mode-recovery training is an open methodological question.
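The energy-efficient actuation item in the list above can be made concrete with a small worked calculation of resistive loss while holding a static load. This is an illustrative sketch only: the function name and every number are hypothetical, and gearbox friction and drive-electronics losses are ignored.

```python
def holding_loss_w(joint_torque_nm: float,
                   torque_constant_nm_per_a: float,
                   winding_resistance_ohm: float,
                   gear_ratio: float) -> float:
    """Resistive (I^2 * R) loss while holding a static joint torque.

    Holding a load does no mechanical work, yet the motor still draws
    current in proportion to the torque it must produce.
    """
    motor_torque_nm = joint_torque_nm / gear_ratio
    current_a = motor_torque_nm / torque_constant_nm_per_a
    return winding_resistance_ohm * current_a ** 2


# Hypothetical joint: 60 N*m held, 0.1 N*m/A torque constant, 0.2 ohm winding.
loss_low_ratio = holding_loss_w(60.0, 0.1, 0.2, gear_ratio=10.0)   # about 720 W
loss_high_ratio = holding_loss_w(60.0, 0.1, 0.2, gear_ratio=20.0)  # about 180 W
```

Doubling the gear ratio quarters the holding loss in this simplified model, but higher ratios generally cost backdrivability and add reflected inertia and friction, which is one face of the torque-density versus efficiency trade-off.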
Three risk scenarios
Scenario A — Deployment-without-recovery
The first failure mode is deployment without failure-mode recovery. Policies trained on curated successful demonstrations never encounter the failure modes that real deployment produces; deployment proceeds before failure-mode-recovery training catches up; and the resulting platforms meet unrecovered failures in the field at higher rates than the demos suggested. The mitigation is explicit failure-mode-recovery training and explicit uncurated-deployment-realistic evaluation.
Scenario B — Safety incident in shared environment
The second failure mode is the safety-incident-in-shared-environment scenario. A humanoid deployed in a workplace or home environment causes injury to a human, the regulatory response constrains deployment across the entire industry, and the deployment trajectory of the entire field is delayed by years. The mitigation is conservative deployment posture (deploying initially in environments where humans are not present), explicit safety-system-redundancy (multiple independent safety systems with explicit failure-mode characterization), and engagement with the regulatory community on operating-envelope characterization.
Scenario C — Successful staged deployment
The third scenario, which we treat as the base case provided the engineering and safety work is done competently, is staged deployment. Humanoid platforms are deployed first in environments where the deployment-safety bar is easiest to meet (warehouses, dedicated industrial sites, training facilities); the deployment envelope is widened gradually as the platforms accumulate operating hours and the failure-mode characterization deepens; and shared-environment deployment is the endpoint of a multi-year validation trajectory. This is the trajectory the program is aiming at.
What technical work bears on this
This pillar connects most directly to Physical Intelligence, which provides the foundation-model substrate that our manipulation policies build on, and to Autonomous Agents, which extends single-robot questions to fleets of embodied agents. The systems-engineering view of the platform sits at Physical Intelligence (engineering). The verification techniques developed in AI Safety and instantiated in Project Aegis apply to humanoid platforms with adapted action grammars. The connection to ENERA is through the autonomous-assembly applications: kilometre-scale orbital arrays cannot be assembled by human-supervised means, and the humanoid platforms (in microgravity-adapted variants) are part of the deployable autonomy that makes the assembly tractable. The connection to INTEGRITISSUE is through the autonomous-medical-manipulation applications: bedside bioprinting and autonomous-surgical capability in remote habitats are humanoid-platform applications.
Where to read further
Physical Intelligence treats the foundation-policy substrate. Autonomous Agents treats the multi-agent extension. AI Safety treats the verification framework. Project Aegis treats the deployable-policy verification substrate.
Footnotes
1. Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter, “Learning quadrupedal locomotion over challenging terrain”, Science Robotics 5, no. 47 (2020). Learned locomotion on unstructured terrain.
2. Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, et al. (Google DeepMind), “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control”, arXiv 2023.
3. Open X-Embodiment Collaboration, “Open X-Embodiment: Robotic Learning Datasets and RT-X Models”, arXiv 2023.
4. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, et al., “OpenVLA: An Open-Source Vision-Language-Action Model”, arXiv 2024.
5. Kevin Black, Noah Brown, Danny Driess, et al. (Physical Intelligence), “π0: A Vision-Language-Action Flow Model for General Robot Control”, arXiv 2024.
6. Cheng Chi, Siyuan Feng, Yilun Du, et al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion”, arXiv 2023.