HKUST(GZ) DeepCybo Zhongguancun Academy ZGCI HUST Beihang University

Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning from Ego-Exo Human Videos with Human-Aligned Embodiments

A human-to-humanoid supervision framework that converts synchronized ego-exo videos into controller-aligned 60-DoF action chunks near video capture rate — enabling VLA training from human demonstrations without target-task robot data.

Xiaopeng Lin1,2,*, Ruoqi Yang2,*, Shijie Lian3,6,*, Zhaolong Shen3,7,*, Bin Yu3,5,*, Changti Wu3, Haibao Liu2, Yuxiang Zhang2, Hong Li2, Qiyuan Su2, Haochen Liu2, Xuguo He2, Yukun Shi4, Cong Huang3,4, Zhirui Zhang2, Bojun Cheng1,†, Kai Chen2,3,4,†

1HKUST (GZ)   2DeepCybo   3ZGCA   4ZGCI   5Harbin Institute of Technology   6HUST   7Beihang University

* Equal contribution    † Corresponding author

Human-as-Humanoid replaces teleoperation-bound data collection with a near-real-time ego-exo human video conversion chain that generates executable 60-DoF humanoid action labels on PrimeU — yielding a 4.8–7.2× raw throughput gain.

4.8–7.2× Throughput Gain (vs. motion-capture teleoperation)
60 DoF Action Space (arms · hands · neck · waist)
7 Downstream Tasks (real-robot deployment)
1,500 hr Pretraining Data (ego-exo human demonstrations)

Real-Robot Deployment

Zero target-task robot demonstrations

PhysDex, post-trained only with Human-as-Humanoid converted labels, deploys on seven real high-DoF manipulation tasks — ring placement, magic-cube packing, water pouring, cup stacking, and more — several without any target-task robot demonstrations.

Task videos (paired ego and exo views): Magic-Cube Packing, Cup Stacking, Ring Toss, Water Pouring, Light-Bulb Installation, Temperature Sensing.

Abstract

Why human video can become robot action

Vision-language-action (VLA) models require high-quality observation–action supervision, yet scaling such data is difficult, especially for high-DoF humanoids. Teleoperation provides controller-aligned supervision, while human egocentric videos capture diverse bimanual manipulation but do not directly provide executable robot actions. We introduce Human-as-Humanoid, a human-to-humanoid supervision framework that enables near-real-time action generation, making human demonstrations usable for high-DoF VLA training by jointly aligning the robot embodiment, the sensing setup, and the action-label interface. Built on PrimeU, a human-aligned 60-DoF upper-body humanoid, the framework uses synchronized ego-exo videos to pair deployment-aligned egocentric observations with exocentric motion recovery, retargets recovered human motion through staged IK into controller-aligned 60-DoF action chunks, and trains PhysDex with FK-aware supervision to preserve wrist and fingertip task-space geometry. Experiments validate the conversion chain at the motion-recovery, robot-action-space, and real-robot deployment levels. Human-as-Humanoid yields a 4.8–7.2× raw demonstration-throughput gain over teleoperation, and policies post-trained only with converted human labels generalize to real-robot deployment without target-task robot demonstrations.

Core Challenge

Four coupled requirements

Converting human video to robot-executable supervision exposes four tightly coupled requirements that Human-as-Humanoid addresses jointly.

i. Embodiment Alignment

Robot morphology and sensing layout stay compatible with human demonstrations, reducing retargeting error from differences in body scale, reachable workspace, and viewpoints.

ii. Observation–Motion Compatibility

Egocentric streams provide deployment-aligned policy inputs; exocentric views support robust upper-body and hand recovery under occlusion.

iii. Action-Interface Alignment

Converted labels respect the robot's joint ordering, URDF convention, joint limits, and controller interface, not just task-space intent.

iv. Joint–Task Consistency

Executable joint commands preserve the wrist and fingertip geometry that contact-rich manipulation depends on.
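
As a minimal sketch of what action-interface alignment can look like in practice, the check below validates one converted action frame against a controller joint ordering and per-joint limits. The names `JointSpec`, `validate_frame`, and `clamp_frame`, and the three example joints with their limits, are hypothetical placeholders; they are not PrimeU's actual URDF values or the authors' tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JointSpec:
    name: str
    lower: float  # lower limit in radians
    upper: float  # upper limit in radians


# Hypothetical three-joint excerpt. A real table would list all 60 joints
# in the controller's order, with limits read from the robot's URDF.
SPECS = [
    JointSpec("left_shoulder_pitch", -3.0, 3.0),
    JointSpec("left_elbow", 0.0, 2.5),
    JointSpec("neck_yaw", -1.2, 1.2),
]


def validate_frame(names, values, specs=SPECS, tol=1e-6):
    """Return a list of violations; an empty list means the frame passes."""
    expected = [s.name for s in specs]
    if list(names) != expected:
        return ["joint ordering differs from controller order"]
    if len(values) != len(specs):
        return [f"expected {len(specs)} values, got {len(values)}"]
    problems = []
    for spec, v in zip(specs, values):
        if v < spec.lower - tol or v > spec.upper + tol:
            problems.append(
                f"{spec.name}={v:.3f} outside [{spec.lower}, {spec.upper}]"
            )
    return problems


def clamp_frame(values, specs=SPECS):
    """Clamp each value into its joint limits (one simple repair policy)."""
    return [min(max(v, s.lower), s.upper) for v, s in zip(values, specs)]
```

Clamping is only one possible repair; rejecting or re-solving an out-of-limit frame may be preferable when the violation signals a retargeting failure.
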

Embodiment

PrimeU: A human-aligned humanoid

Instead of treating human-to-robot transfer as a purely post-hoc retargeting problem, Human-as-Humanoid starts from the embodiment. PrimeU's upper body follows standard adult-male manipulation proportions, closing the workspace and sensing gap before any learning algorithm is applied.

Figure: PrimeU human-aligned humanoid embodiment. PrimeU has two 7-DoF arms, two 20-DoF Wuji dexterous hands, a 3-DoF neck, and a 3-DoF waist (60 DoF total). Head- and wrist-view Intel RealSense D435 cameras match the viewpoint structure used by the deployed VLA policy.

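
The DoF breakdown in the caption can be written as a flat action-vector layout. The sketch below is illustrative only: the group sizes come from the caption, but the ordering of groups inside the vector and the names `GROUPS`, `build_layout`, and `split_action` are assumptions, since the actual controller ordering is not given here.

```python
# Group sizes follow the caption: 2 x 7 (arms), 2 x 20 (hands), 3 (neck), 3 (waist).
# The ORDER of the groups is an assumption made for illustration.
GROUPS = [
    ("left_arm", 7),
    ("right_arm", 7),
    ("left_hand", 20),
    ("right_hand", 20),
    ("neck", 3),
    ("waist", 3),
]


def build_layout(groups=GROUPS):
    """Map each group name to its slice of the flat action vector."""
    layout, start = {}, 0
    for name, size in groups:
        layout[name] = slice(start, start + size)
        start += size
    return layout, start


LAYOUT, TOTAL_DOF = build_layout()


def split_action(action):
    """Split one flat action frame into named joint groups."""
    if len(action) != TOTAL_DOF:
        raise ValueError(f"expected {TOTAL_DOF} values, got {len(action)}")
    return {name: action[sl] for name, sl in LAYOUT.items()}
```

With these sizes `TOTAL_DOF` evaluates to 60, matching the total stated in the caption.
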
Table: Action-relevant anthropometric scale alignment. Human values: 50th-percentile male, ANSUR II. PrimeU values: from the URDF kinematic tree.

Dimension                             Human (cm)   PrimeU (cm)   Ratio
Shoulder breadth                      41.5         40.4          0.97
Shoulder-to-head height               31.5         37.1          1.18
Shoulder-to-middle-fingertip reach    78.6         80.3          1.02
Hand length                           19.3         19.3          1.00

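
The Ratio column in the table above is consistent with the PrimeU value divided by the human value, rounded to two decimals. A short check, using only the numbers from the table (the names `ROWS` and `scale_ratios` are illustrative):

```python
# (dimension, human_cm, primeu_cm) copied from the table.
ROWS = [
    ("Shoulder breadth", 41.5, 40.4),
    ("Shoulder-to-head height", 31.5, 37.1),
    ("Shoulder-to-middle-fingertip reach", 78.6, 80.3),
    ("Hand length", 19.3, 19.3),
]


def scale_ratios(rows=ROWS):
    """Return PrimeU / human for each dimension, rounded to two decimals."""
    return {name: round(robot / human, 2) for name, human, robot in rows}
```

On these rows the largest deviation from 1.00 is the shoulder-to-head height, while the reach and hand-length rows stay within about 2% of the human reference.
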
Validation

Near-real-time human-to-humanoid conversion

Synchronized ego-exo human videos are converted into executable 60-DoF humanoid action labels at collection time. The pipeline runs at ~20 FPS. PhysDex is trained in the same robot action space with FK-aware supervision (DS-HKC) that preserves wrist and fingertip geometry.

Figure: Human-as-Humanoid and PhysDex pipeline. (Top) Ego-exo human videos → tracking → mesh-aware motion recovery → staged IK → controller-aligned 60-DoF robot action labels. (Bottom) PhysDex: a flow-matching DiT conditioned on PhysBrain VLM tokens, supervised with DS-HKC differentiable FK constraints on wrist poses and fingertip positions.

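
The page does not spell out the stages of the staged IK. As a hedged illustration of the general idea, the toy planar example below first solves the arm joints for a wrist target and then, with the arm fixed, solves a finger joint for a fingertip target. The arm-then-finger split, the link lengths, the finger limits, and all function names are assumptions for this sketch and do not describe PrimeU's actual solver.

```python
import math


def fk(q, links):
    """Planar FK for q = (shoulder, elbow, finger); returns (wrist_xy, tip_xy)."""
    l1, l2, l3 = links
    a1 = q[0]
    a2 = a1 + q[1]
    a3 = a2 + q[2]
    wrist = (l1 * math.cos(a1) + l2 * math.cos(a2),
             l1 * math.sin(a1) + l2 * math.sin(a2))
    tip = (wrist[0] + l3 * math.cos(a3), wrist[1] + l3 * math.sin(a3))
    return wrist, tip


def solve_arm(wrist_target, l1, l2):
    """Stage 1: analytic two-link IK for the wrist position (elbow >= 0 branch)."""
    x, y = wrist_target
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    c2 = max(-1.0, min(1.0, c2))  # out-of-reach targets saturate instead of failing
    q2 = math.acos(c2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


def solve_finger(wrist, tip_target, forearm_angle, limits):
    """Stage 2: with the arm fixed, aim the finger at the fingertip target."""
    desired = math.atan2(tip_target[1] - wrist[1], tip_target[0] - wrist[0])
    q3 = desired - forearm_angle
    q3 = math.atan2(math.sin(q3), math.cos(q3))  # wrap into (-pi, pi]
    return min(max(q3, limits[0]), limits[1])    # respect joint limits


def staged_ik(wrist_target, tip_target,
              links=(0.30, 0.25, 0.10), finger_limits=(-0.2, 1.6)):
    """Solve arm joints first, then the finger joint, and return (q1, q2, q3)."""
    l1, l2, _ = links
    q1, q2 = solve_arm(wrist_target, l1, l2)
    wrist, _ = fk((q1, q2, 0.0), links)
    q3 = solve_finger(wrist, tip_target, q1 + q2, finger_limits)
    return q1, q2, q3
```

Solving the proximal joints first keeps the wrist anchored while the distal joint absorbs the remaining fingertip error, which is one common motivation for staging; a full solver would handle 3D poses, redundancy, and all joint limits.
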
Validation · Action Space & FK-Aware Training

Human-derived labels are robot-compatible

A discrete action tokenizer trained only on human-derived robot-action chunks is evaluated on held-out real-robot windows it never saw during fitting. Low cross-domain reconstruction error confirms that converted human actions occupy a manifold close to real PrimeU demonstrations. FK-aware supervision (DS-HKC) further couples the 60 action dimensions through the robot kinematic chain, measuring whether induced wrist and fingertip poses remain consistent with task-space manipulation geometry.
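
To illustrate why coupling action dimensions through the kinematic chain matters, the toy example below adds a fingertip-position term, computed through forward kinematics, to a plain joint-space loss. It is a sketch under stated assumptions, not DS-HKC itself: the planar three-joint chain, the link lengths, the weight, and the function names are hypothetical, wrist-pose terms are omitted, and real training would compute this inside an autodiff framework.

```python
import math

LINKS = (0.30, 0.25, 0.10)  # hypothetical link lengths in meters


def fingertip(q, links=LINKS):
    """Planar forward kinematics: joint angles -> fingertip (x, y)."""
    x = y = angle = 0.0
    for qi, li in zip(q, links):
        angle += qi
        x += li * math.cos(angle)
        y += li * math.sin(angle)
    return x, y


def joint_loss(pred, target):
    """Mean squared error in joint space."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def task_loss(pred, target):
    """Squared fingertip position error after passing both through FK."""
    px, py = fingertip(pred)
    tx, ty = fingertip(target)
    return (px - tx) ** 2 + (py - ty) ** 2


def fk_aware_loss(pred, target, weight=10.0):
    """Joint-space loss plus a weighted task-space term."""
    return joint_loss(pred, target) + weight * task_loss(pred, target)


target = (0.4, 0.9, 0.5)
shoulder_off = (0.5, 0.9, 0.5)  # 0.1 rad error at the proximal joint
finger_off = (0.4, 0.9, 0.6)    # 0.1 rad error at the distal joint
```

Both perturbed predictions have the same joint-space loss, but the shoulder error moves the fingertip much further than the finger error does, so only the FK-aware loss distinguishes them.
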

Table: Action-interface compatibility diagnostics. 100 real-robot evaluation windows; lower is better.

Diagnostic           Training data        Eval.   Norm. MAE (mean / p95)   EE error, mm (mean / p95)
Cross-domain         Human only           Robot   0.0080 / 0.0097          5.34 / 12.67
In-domain baseline   Robot only           Robot   0.0099 / 0.0117          4.09 / 6.84
Mixed-domain         Robot + human data   Robot   0.0096 / 0.0114          4.86 / 9.11

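
The table reports each metric as mean / p95 over evaluation windows. The sketch below shows one way such a summary could be computed. The exact normalization behind "Norm. MAE" is not specified on this page, so dividing each joint error by that joint's range is an assumption, as are the function names.

```python
import math


def normalized_mae(pred, target, lower, upper):
    """MAE over one window, with each joint error divided by its joint range.

    pred and target are lists of frames; each frame is a list of joint values.
    Range normalization is an assumption, not the page's stated definition.
    """
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t, lo, hi in zip(p_frame, t_frame, lower, upper):
            total += abs(p - t) / (hi - lo)
            count += 1
    return total / count


def percentile(values, pct):
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def summarize(per_window_errors):
    """Return (mean, p95) over per-window error values."""
    mean = sum(per_window_errors) / len(per_window_errors)
    return mean, percentile(per_window_errors, 95)
```

Reporting p95 alongside the mean exposes tail behavior: in the table, the cross-domain end-effector error has a p95 well above its mean, which a mean-only summary would hide.
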
Figure: Training-loss comparison, FK-aware vs. joint-only. FK-aware supervision (PhysDex, red) reaches lower loss under the same budget, with the PhysBrain-initialized model improving further beyond step 6k.

Discussion

Limitations

Pose-estimation quality bounds downstream retargeting quality; IK quality bounds policy quality since human-derived data inherits the robot model, joint limits, and calibration. The pipeline is tied to PrimeU's URDF and joint convention — transferring to a new embodiment requires re-retargeting. Human-derived actions also capture kinematics more directly than contact forces, so robot data remains important for anchoring, evaluation, and contact-rich refinement. The zero-shot claim refers to target-task deployment without target-task robot demonstrations, not to eliminating all robot-specific modeling assumptions.

Cite

BibTeX

@misc{humanashumanoid2026,
  title     = {Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning
               from Ego-Exo Human Videos with Human-Aligned Embodiments},
  author    = {Xiaopeng Lin and Ruoqi Yang and Shijie Lian and Zhaolong Shen and Bin Yu and Changti Wu and Haibao Liu and Yuxiang Zhang and Hong Li and Qiyuan Su and Haochen Liu and Xuguo He and Yukun Shi and Cong Huang and Zhirui Zhang and Bojun Cheng and Kai Chen},
  year      = {2026}
}