Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning from Ego-Exo Human Videos

Real-Robot Deployment

Zero target-task robot demonstrations

PhysDex, post-trained only with Human-as-Humanoid converted labels, deploys on seven real high-DoF manipulation tasks, including ring placement, magic-cube packing, water pouring, and cup stacking, several of them without any target-task robot demonstrations.

Task videos (each shown in Ego and Exo views): Magic-Cube Packing, Cup Stacking, Ring Toss, Water Pouring, Light-Bulb Installation, Temperature Sensing.

Abstract

Why human video can become robot action

Vision-language-action (VLA) models require high-quality observation–action supervision, yet scaling such data is difficult, especially for high-DoF humanoids. Teleoperation provides controller-aligned supervision, while human egocentric videos capture diverse bimanual manipulation but do not directly provide executable robot actions. We introduce Human-as-Humanoid, a human-to-humanoid supervision framework that enables near-real-time action generation, making human demonstrations usable for high-DoF VLA training by jointly aligning the robot embodiment, the sensing setup, and the action-label interface. Built on PrimeU, a human-aligned 60-DoF upper-body humanoid, the framework uses synchronized ego-exo videos to pair deployment-aligned egocentric observations with exocentric motion recovery, retargets recovered human motion through staged IK into controller-aligned 60-DoF action chunks, and trains PhysDex with FK-aware supervision to preserve wrist and fingertip task-space geometry. Experiments validate the conversion chain at the motion-recovery, robot-action-space, and real-robot deployment levels. Human-as-Humanoid yields a 4.8–7.2× raw demonstration-throughput gain over teleoperation, and policies post-trained only with converted human labels generalize to real-robot deployment without target-task robot demonstrations.

Core Challenge

Four coupled requirements

Converting human video to robot-executable supervision exposes four tightly coupled requirements that Human-as-Humanoid addresses jointly.

i. Embodiment Alignment

Robot morphology and sensing layout stay compatible with human demonstrations, reducing retargeting error from differences in body scale, reachable workspace, and viewpoints.

ii. Observation–Motion Compatibility

Egocentric streams provide deployment-aligned policy inputs; exocentric views support robust upper-body and hand recovery under occlusion.

iii. Action-Interface Alignment

Converted labels respect the robot's joint ordering, URDF convention, joint limits, and controller interface, not just task-space intent.

iv. Joint–Task Consistency

Executable joint commands preserve the wrist and fingertip geometry that contact-rich manipulation depends on.
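
Requirement iii can be made concrete with a small check. The sketch below is illustrative only: `JointSpec`, the joint names, and the limit values are hypothetical and are not taken from PrimeU's URDF. It shows one way a converted label might be reordered into a controller's joint order and checked against joint limits before it is used as supervision.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class JointSpec:
    """Hypothetical per-joint metadata; real values would come from the URDF."""
    name: str
    lower: float  # radians
    upper: float  # radians


def limit_violations(named: Dict[str, float], specs: Sequence[JointSpec],
                     tol: float = 1e-6) -> List[str]:
    """Return the names of joints whose commanded angle lies outside its limits."""
    return [s.name for s in specs
            if not (s.lower - tol <= named[s.name] <= s.upper + tol)]


def to_controller_order(named: Dict[str, float],
                        specs: Sequence[JointSpec]) -> List[float]:
    """Reorder a name->angle mapping into controller order, clamping to limits."""
    missing = [s.name for s in specs if s.name not in named]
    if missing:
        raise KeyError(f"missing joints: {missing}")
    return [min(max(named[s.name], s.lower), s.upper) for s in specs]


if __name__ == "__main__":
    # Assumed controller order and limits, for illustration only.
    specs = [JointSpec("shoulder_pitch", -2.0, 2.0),
             JointSpec("elbow", 0.0, 2.5),
             JointSpec("wrist_yaw", -1.5, 1.5)]
    label = {"wrist_yaw": 1.8, "shoulder_pitch": 0.4, "elbow": 1.0}
    print(limit_violations(label, specs))     # joints that exceed their limits
    print(to_controller_order(label, specs))  # ordered, clamped command
```

Clamping is only one possible policy; a pipeline could equally reject or re-solve a label that violates limits.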

Embodiment

PrimeU: A human-aligned humanoid

Instead of treating human-to-robot transfer as a purely post-hoc retargeting problem, Human-as-Humanoid starts from the embodiment. PrimeU's upper body follows standard adult-male manipulation proportions, closing the workspace and sensing gap before any learning algorithm is applied.

Figure: PrimeU human-aligned humanoid embodiment. **PrimeU** has two 7-DoF arms, two 20-DoF Wuji dexterous hands, a 3-DoF neck, and a 3-DoF waist (60 DoF total). Head- and wrist-view Intel RealSense D435 cameras match the viewpoint structure used by the deployed VLA policy.
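
The DoF breakdown above can be written down as a layout for a flat action vector. The group sizes come from the text; the ordering of the groups and the names `GROUP_DOF`, `build_slices`, and `split_action` are assumptions made for this sketch, since the page does not state the controller's joint order.

```python
from typing import Dict, List, Sequence, Tuple

# Group sizes are from the text; the ORDER of the groups is an assumption.
GROUP_DOF: List[Tuple[str, int]] = [
    ("left_arm", 7),
    ("right_arm", 7),
    ("left_hand", 20),
    ("right_hand", 20),
    ("neck", 3),
    ("waist", 3),
]


def build_slices(groups: Sequence[Tuple[str, int]]) -> Tuple[Dict[str, slice], int]:
    """Assign each joint group a contiguous slice of the flat action vector."""
    slices: Dict[str, slice] = {}
    start = 0
    for name, dof in groups:
        slices[name] = slice(start, start + dof)
        start += dof
    return slices, start


SLICES, TOTAL_DOF = build_slices(GROUP_DOF)


def split_action(action: Sequence[float]) -> Dict[str, List[float]]:
    """Split one flat action into per-group joint lists."""
    if len(action) != TOTAL_DOF:
        raise ValueError(f"expected {TOTAL_DOF} values, got {len(action)}")
    return {name: list(action[s]) for name, s in SLICES.items()}


if __name__ == "__main__":
    print(TOTAL_DOF)  # 2*7 + 2*20 + 3 + 3
    parts = split_action([0.0] * TOTAL_DOF)
    print({name: len(values) for name, values in parts.items()})
```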

Action-relevant anthropometric scale alignment. Human values: 50th-percentile male, ANSUR II. PrimeU values from URDF kinematic tree. Ratio is the PrimeU value divided by the human value.
Dimension	Human (cm)	PrimeU (cm)	Ratio
Shoulder breadth	41.5	40.4	0.97
Shoulder-to-head height	31.5	37.1	1.18
Shoulder-to-middle-fingertip reach	78.6	80.3	1.02
Hand length	19.3	19.3	1.00
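
As a quick arithmetic check, the Ratio column can be recomputed from the two measurement columns. The values below are copied from the table; the helper name `scale_ratio` is introduced here for illustration.

```python
from typing import List, Tuple

# (dimension, human cm, PrimeU cm), copied from the table above.
ROWS: List[Tuple[str, float, float]] = [
    ("Shoulder breadth", 41.5, 40.4),
    ("Shoulder-to-head height", 31.5, 37.1),
    ("Shoulder-to-middle-fingertip reach", 78.6, 80.3),
    ("Hand length", 19.3, 19.3),
]


def scale_ratio(human_cm: float, robot_cm: float) -> float:
    """Robot-to-human scale ratio; 1.0 means identical size."""
    if human_cm <= 0:
        raise ValueError("human measurement must be positive")
    return robot_cm / human_cm


if __name__ == "__main__":
    for name, human_cm, robot_cm in ROWS:
        ratio = scale_ratio(human_cm, robot_cm)
        print(f"{name}: {ratio:.2f} ({(ratio - 1.0) * 100:+.1f}% vs human)")
```

On these numbers the largest deviation is the shoulder-to-head height, where the robot is about 18% larger than the human reference; the other three dimensions are within about 3%.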

Validation

Near-real-time human-to-humanoid conversion

Synchronized ego-exo human videos are converted into executable 60-DoF humanoid action labels at collection time. The pipeline runs at ~20 FPS. PhysDex is trained in the same robot action space with FK-aware supervision (DS-HKC) that preserves wrist and fingertip geometry.
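
The page does not spell out what the stages of the staged IK are. As a rough illustration of the general idea (solve the coarse chain first, then refine the distal joints with the coarse solution fixed), the toy below uses a planar arm with two arm joints and one finger joint. The geometry, link lengths, joint limits, and all function names are assumptions for this sketch, not the authors' solver.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def clamp(value: float, lower: float, upper: float) -> float:
    return max(lower, min(upper, value))


def forward(q: Tuple[float, float, float],
            lengths: Tuple[float, float, float]) -> Tuple[Point, Point]:
    """Planar FK: return (wrist, fingertip) positions for joint angles q."""
    q1, q2, q3 = q
    l1, l2, l3 = lengths
    wx = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    wy = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    tx = wx + l3 * math.cos(q1 + q2 + q3)
    ty = wy + l3 * math.sin(q1 + q2 + q3)
    return (wx, wy), (tx, ty)


def arm_ik(wrist: Point, l1: float, l2: float) -> Tuple[float, float]:
    """Stage 1: analytic 2-link IK for the wrist (elbow angle >= 0 branch)."""
    x, y = wrist
    cos_q2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    q2 = math.acos(clamp(cos_q2, -1.0, 1.0))  # unreachable targets snap to the boundary
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


def finger_ik(wrist: Point, tip: Point, forearm_angle: float,
              limits: Tuple[float, float]) -> float:
    """Stage 2: with the arm fixed, aim the finger joint at the fingertip target."""
    raw = math.atan2(tip[1] - wrist[1], tip[0] - wrist[0]) - forearm_angle
    wrapped = math.atan2(math.sin(raw), math.cos(raw))  # wrap to [-pi, pi]
    return clamp(wrapped, limits[0], limits[1])


def staged_ik(wrist_target: Point, tip_target: Point,
              lengths: Tuple[float, float, float],
              finger_limits: Tuple[float, float] = (-1.5, 1.5)) -> Tuple[float, float, float]:
    q1, q2 = arm_ik(wrist_target, lengths[0], lengths[1])
    # Stage 2 starts from the wrist that stage 1 actually reached.
    wrist, _ = forward((q1, q2, 0.0), lengths)
    q3 = finger_ik(wrist, tip_target, q1 + q2, finger_limits)
    return q1, q2, q3


if __name__ == "__main__":
    lengths = (0.30, 0.25, 0.10)  # assumed link lengths in metres
    wrist, tip = forward((0.3, 0.8, 0.2), lengths)
    print(staged_ik(wrist, tip, lengths))
```

Solving the wrist first keeps the arm solution independent of finger noise, and the finger stage then only has to correct local geometry relative to the wrist.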

Figure: Human-as-Humanoid and PhysDex pipeline. **(Top)** Ego-exo human videos → tracking → mesh-aware motion recovery → staged IK → controller-aligned 60-DoF robot action labels. **(Bottom)** PhysDex: a flow-matching DiT conditioned on PhysBrain VLM tokens, supervised with DS-HKC differentiable FK constraints on wrist poses and fingertip positions.
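
To see why supervising through forward kinematics matters, consider a toy planar chain. The sketch below is not DS-HKC; it is a generic joint-plus-task-space loss with an assumed weighting (`task_weight`) and assumed unit link lengths, evaluated as a plain scalar rather than inside a differentiable training graph. It shows that two predictions with identical joint-space error can differ substantially in end-effector error.

```python
import math
from typing import Sequence, Tuple


def planar_fk(q: Sequence[float], links: Sequence[float]) -> Tuple[float, float]:
    """End-effector position of a planar serial chain with relative joint angles."""
    x = y = theta = 0.0
    for angle, length in zip(q, links):
        theta += angle
        x += length * math.cos(theta)
        y += length * math.sin(theta)
    return x, y


def joint_mse(q_pred: Sequence[float], q_true: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(q_pred, q_true)) / len(q_true)


def task_error(q_pred: Sequence[float], q_true: Sequence[float],
               links: Sequence[float]) -> float:
    """Squared end-effector distance between predicted and reference poses."""
    px, py = planar_fk(q_pred, links)
    tx, ty = planar_fk(q_true, links)
    return (px - tx) ** 2 + (py - ty) ** 2


def fk_aware_loss(q_pred: Sequence[float], q_true: Sequence[float],
                  links: Sequence[float], task_weight: float = 1.0) -> float:
    """Joint-space MSE plus a weighted task-space term computed through FK."""
    return joint_mse(q_pred, q_true) + task_weight * task_error(q_pred, q_true, links)


if __name__ == "__main__":
    links = [1.0, 1.0]        # assumed link lengths
    q_true = [0.0, 0.0]
    compensating = [0.1, -0.1]  # joint errors partly cancel along the chain
    compounding = [0.1, 0.1]    # joint errors add up along the chain
    for name, q in (("compensating", compensating), ("compounding", compounding)):
        print(name, joint_mse(q, q_true), task_error(q, q_true, links))
```

A joint-only loss scores both predictions the same, while the FK term penalizes the one whose errors accumulate along the chain, which is the kind of coupling between action dimensions the text describes.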

Validation · Action Space & FK-Aware Training

Human-derived labels are robot-compatible

A discrete action tokenizer trained only on human-derived robot-action chunks is evaluated on held-out real-robot windows it never saw during fitting. Low cross-domain reconstruction error indicates that converted human actions occupy a manifold close to real PrimeU demonstrations. FK-aware supervision (DS-HKC) further couples the 60 action dimensions through the robot kinematic chain, so the loss measures whether the induced wrist and fingertip poses remain consistent with task-space manipulation geometry.

Action-interface compatibility diagnostics. 100 real-robot evaluation windows; lower is better.
Diagnostic	Training data	Eval.	Norm. MAE mean / p95	EE error (mm) mean / p95
Cross-domain	Human only	Robot	0.0080 / 0.0097	5.34 / 12.67
In-domain baseline	Robot only	Robot	0.0099 / 0.0117	4.09 / 6.84
Mixed-domain	Robot + human data	Robot	0.0096 / 0.0114	4.86 / 9.11
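
The table reports a mean and a 95th percentile over evaluation windows. The page does not define how the MAE is normalized, so the sketch below assumes normalization by each joint's range and a nearest-rank percentile; both choices, and the function names, are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def normalized_mae(pred: Sequence[float], target: Sequence[float],
                   lower: Sequence[float], upper: Sequence[float]) -> float:
    """Mean absolute error with each joint's error divided by its range (assumed)."""
    errors = [abs(p - t) / (u - l)
              for p, t, l, u in zip(pred, target, lower, upper)]
    return sum(errors) / len(errors)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of an empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def summarize(window_errors: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, p95) over per-window errors, matching the table's column layout."""
    mean = sum(window_errors) / len(window_errors)
    return mean, percentile(window_errors, 95.0)


if __name__ == "__main__":
    # Synthetic per-window errors, for illustration only (not the paper's data).
    synthetic: List[float] = [0.001 * (i % 10 + 1) for i in range(100)]
    print(summarize(synthetic))
```

Reporting p95 alongside the mean exposes tail behaviour: two conditions with similar mean error can still differ in how badly their worst windows reconstruct.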

Figure: Training-loss comparison, FK-aware vs joint-only. FK-aware supervision (PhysDex, red) reaches lower loss under the same budget, with the PhysBrain-initialized model improving further beyond step 6k.

Discussion

Limitations

Pose-estimation quality bounds downstream retargeting quality; IK quality bounds policy quality since human-derived data inherits the robot model, joint limits, and calibration. The pipeline is tied to PrimeU's URDF and joint convention — transferring to a new embodiment requires re-retargeting. Human-derived actions also capture kinematics more directly than contact forces, so robot data remains important for anchoring, evaluation, and contact-rich refinement. The zero-shot claim refers to target-task deployment without target-task robot demonstrations, not to eliminating all robot-specific modeling assumptions.

Cite

BibTeX

@misc{humanashumanoid2026,
  title     = {Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning
               from Ego-Exo Human Videos with Human-Aligned Embodiments},
  author    = {Xiaopeng Lin and Ruoqi Yang and Shijie Lian and Zhaolong Shen and Bin Yu and Changti Wu and Haibao Liu and Yuxiang Zhang and Hong Li and Qiyuan Su and Haochen Liu and Xuguo He and Yukun Shi and Cong Huang and Zhirui Zhang and Bojun Cheng and Kai Chen},
  year      = {2026}
}