Zero target-task robot demonstrations
PhysDex, post-trained only with Human-as-Humanoid converted labels, deploys on seven real high-DoF manipulation tasks (ring placement, magic-cube packing, water pouring, cup stacking, and more), several of them without any target-task robot demonstrations.
Cup Stacking
Ring Toss
Water Pouring
Light-Bulb Installation
Temperature Sensing
Why human video can become robot action
Vision-language-action (VLA) models require high-quality observation–action supervision, yet scaling such data is difficult, especially for high-DoF humanoids. Teleoperation provides controller-aligned supervision, while human egocentric videos capture diverse bimanual manipulation but do not directly provide executable robot actions. We introduce Human-as-Humanoid, a human-to-humanoid supervision framework that enables near-real-time action generation, making human demonstrations usable for high-DoF VLA training by jointly aligning the robot embodiment, the sensing setup, and the action-label interface. Built on PrimeU, a human-aligned 60-DoF upper-body humanoid, the framework uses synchronized ego-exo videos to pair deployment-aligned egocentric observations with exocentric motion recovery, retargets recovered human motion through staged IK into controller-aligned 60-DoF action chunks, and trains PhysDex with FK-aware supervision to preserve wrist and fingertip task-space geometry. Experiments validate the conversion chain at the motion-recovery, robot-action-space, and real-robot deployment levels. Human-as-Humanoid yields a 4.8–7.2× raw demonstration-throughput gain over teleoperation, and policies post-trained only with converted human labels generalize to real-robot deployment without target-task robot demonstrations.
Four coupled requirements
Converting human video to robot-executable supervision exposes four tightly coupled requirements that Human-as-Humanoid addresses jointly.
Embodiment Alignment
Robot morphology and sensing layout stay compatible with human demonstrations, reducing retargeting error from differences in body scale, reachable workspace, and viewpoints.
Observation–Motion Compatibility
Egocentric streams provide deployment-aligned policy inputs; exocentric views support robust upper-body and hand recovery under occlusion.
Action-Interface Alignment
Converted labels respect the robot's joint ordering, URDF convention, joint limits, and controller interface — not just task-space intent.
Joint–Task Consistency
Executable joint commands preserve the wrist and fingertip geometry that contact-rich manipulation depends on.
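As a rough illustration of the Action-Interface Alignment requirement, the sketch below maps retargeted joint angles, keyed by name, into a fixed controller ordering and clamps each value to its joint limits. The joint names, limits, and the `to_controller_vector` helper are hypothetical placeholders, not PrimeU's actual URDF or interface; a real specification would cover all 60 DoF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JointSpec:
    name: str
    lower: float  # lower joint limit, radians
    upper: float  # upper joint limit, radians


def to_controller_vector(named_angles, joint_specs):
    """Order named joint angles as the controller expects and clamp to limits."""
    missing = [s.name for s in joint_specs if s.name not in named_angles]
    if missing:
        raise KeyError(f"missing joints: {missing}")
    return [min(max(named_angles[s.name], s.lower), s.upper) for s in joint_specs]


# Hypothetical three-joint excerpt; names and limits are assumptions.
SPECS = [
    JointSpec("l_shoulder_pitch", -3.0, 1.0),
    JointSpec("l_elbow", 0.0, 2.5),
    JointSpec("l_wrist_roll", -1.5, 1.5),
]

# Retargeting output arrives in arbitrary order; the wrist value exceeds its limit.
retargeted = {"l_wrist_roll": 2.0, "l_elbow": 1.2, "l_shoulder_pitch": -0.4}
command = to_controller_vector(retargeted, SPECS)
print(command)  # [-0.4, 1.2, 1.5]
```

Failing loudly on a missing joint, rather than filling in a default, keeps a silent ordering mismatch from turning into a plausible-looking but wrong action label.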
PrimeU: A human-aligned humanoid
Instead of treating human-to-robot transfer as a purely post-hoc retargeting problem, Human-as-Humanoid starts from the embodiment. PrimeU's upper body follows standard adult-male proportions in the dimensions that matter for manipulation, narrowing the workspace and sensing gap before any learning algorithm is applied.
| Dimension | Human (cm) | PrimeU (cm) | Ratio |
|---|---|---|---|
| Shoulder breadth | 41.5 | 40.4 | 0.97 |
| Shoulder-to-head height | 31.5 | 37.1 | 1.18 |
| Shoulder-to-middle-fingertip reach | 78.6 | 80.3 | 1.02 |
| Hand length | 19.3 | 19.3 | 1.00 |
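The Ratio column reads as PrimeU divided by Human, rounded to two decimals. The short check below recomputes it from the table's own values; the variable and function names are illustrative only.

```python
# (dimension, human_cm, primeu_cm) copied from the table above.
ROWS = [
    ("Shoulder breadth", 41.5, 40.4),
    ("Shoulder-to-head height", 31.5, 37.1),
    ("Shoulder-to-middle-fingertip reach", 78.6, 80.3),
    ("Hand length", 19.3, 19.3),
]


def ratio(human_cm, robot_cm):
    """Robot-to-human size ratio, rounded to two decimals as in the table."""
    return round(robot_cm / human_cm, 2)


ratios = {name: ratio(human, robot) for name, human, robot in ROWS}
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

Three of the four dimensions fall within 3% of the human reference; shoulder-to-head height is the outlier at 1.18.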
Near-real-time human-to-humanoid conversion
Synchronized ego-exo human videos are converted into executable 60-DoF humanoid action labels at collection time. The pipeline runs at ~20 FPS. PhysDex is trained in the same robot action space with FK-aware supervision (DS-HKC) that preserves wrist and fingertip geometry.
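To make the idea of FK-aware supervision concrete, here is a minimal sketch on a toy planar two-link arm. It is not the DS-HKC formulation, whose details are not given here; it only shows the general pattern of adding a task-space term, computed through forward kinematics, to a joint-space loss. The link lengths, the weight, and all function names are assumptions.

```python
import math


def fk_planar(joint_angles, link_lengths):
    """Return the (x, y) tip position of a planar serial chain."""
    x = y = 0.0
    heading = 0.0
    for angle, length in zip(joint_angles, link_lengths):
        heading += angle  # each joint rotates everything downstream of it
        x += length * math.cos(heading)
        y += length * math.sin(heading)
    return x, y


def fk_aware_loss(pred, target, link_lengths, task_weight=1.0):
    """Joint-space MSE plus squared tip-position error through FK."""
    joint_term = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    px, py = fk_planar(pred, link_lengths)
    tx, ty = fk_planar(target, link_lengths)
    task_term = (px - tx) ** 2 + (py - ty) ** 2
    return joint_term + task_weight * task_term


LINKS = [0.30, 0.25]          # hypothetical link lengths in metres
target = [0.5, 0.8]           # target joint angles in radians
shoulder_off = [0.6, 0.8]     # 0.1 rad error at the proximal joint
elbow_off = [0.5, 0.9]        # 0.1 rad error at the distal joint

loss_shoulder = fk_aware_loss(shoulder_off, target, LINKS)
loss_elbow = fk_aware_loss(elbow_off, target, LINKS)
print(loss_shoulder > loss_elbow)  # True
```

Both predictions have the same joint-space error, but the proximal error displaces the tip further, so the FK term penalizes it more. A plain per-joint loss cannot make that distinction.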
Human-derived labels are robot-compatible
A discrete action tokenizer trained only on human-derived robot-action chunks is evaluated on held-out real-robot windows it never saw during fitting. Low cross-domain reconstruction error confirms that converted human actions occupy a manifold close to real PrimeU demonstrations. FK-aware supervision (DS-HKC) further couples the 60 action dimensions through the robot kinematic chain, measuring whether induced wrist and fingertip poses remain consistent with task-space manipulation geometry.
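A "mean / p95" summary of reconstruction error can be computed along the following lines. This is a sketch under assumptions: the normalization used here (absolute error divided by each joint's range) and the linearly interpolated percentile are illustrative choices, not necessarily the ones behind the reported numbers, and all names are hypothetical.

```python
def percentile(values, q):
    """Linearly interpolated percentile of values, with q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty sequence")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def window_norm_mae(original, reconstructed, joint_ranges):
    """Mean absolute error of one window, each dimension scaled by its range."""
    errors = [
        abs(o - r) / span
        for o, r, span in zip(original, reconstructed, joint_ranges)
    ]
    return sum(errors) / len(errors)


def summarize(per_window_errors):
    """Return (mean, p95) over a list of per-window error values."""
    mean = sum(per_window_errors) / len(per_window_errors)
    return mean, percentile(per_window_errors, 95)


# Toy data: two-dimensional actions with assumed joint ranges of 2 and 4 rad.
RANGES = [2.0, 4.0]
windows = [
    ([0.0, 1.0], [0.1, 0.8]),
    ([0.5, 2.0], [0.5, 2.0]),
    ([1.0, 3.0], [0.8, 3.4]),
]
errors = [window_norm_mae(o, r, RANGES) for o, r in windows]
mean_err, p95_err = summarize(errors)
print(f"{mean_err:.4f} / {p95_err:.4f}")
```

Reporting p95 alongside the mean shows whether a low average hides a tail of poorly reconstructed windows.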
| Diagnostic | Training data | Eval. data | Norm. MAE (mean / p95) | EE error, mm (mean / p95) |
|---|---|---|---|---|
| Cross-domain | Human only | Robot | 0.0080 / 0.0097 | 5.34 / 12.67 |
| In-domain baseline | Robot only | Robot | 0.0099 / 0.0117 | 4.09 / 6.84 |
| Mixed-domain | Robot + human data | Robot | 0.0096 / 0.0114 | 4.86 / 9.11 |
Limitations
Pose-estimation quality bounds downstream retargeting quality; IK quality bounds policy quality since human-derived data inherits the robot model, joint limits, and calibration. The pipeline is tied to PrimeU's URDF and joint convention — transferring to a new embodiment requires re-retargeting. Human-derived actions also capture kinematics more directly than contact forces, so robot data remains important for anchoring, evaluation, and contact-rich refinement. The zero-shot claim refers to target-task deployment without target-task robot demonstrations, not to eliminating all robot-specific modeling assumptions.
BibTeX
@misc{humanashumanoid2026,
title = {Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning
from Ego-Exo Human Videos with Human-Aligned Embodiments},
author = {Xiaopeng Lin and Ruoqi Yang and Shijie Lian and Zhaolong Shen and Bin Yu and Changti Wu and Haibao Liu and Yuxiang Zhang and Hong Li and Qiyuan Su and Haochen Liu and Xuguo He and Yukun Shi and Cong Huang and Zhirui Zhang and Bojun Cheng and Kai Chen},
year = {2026}
}