Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception. Most VLMs, however, are trained on third-person data, creating a viewpoint mismatch for humanoid robots; large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. We propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling construction of the E2E-3M dataset at scale. An egocentric-aware embodied brain, PhysBrain, trained on E2E-3M exhibits substantially improved egocentric understanding for planning tasks, provides a sample-efficient initialization for VLA fine-tuning, and achieves a 53.9% average success rate on SimplerEnv, demonstrating effective transfer from human egocentric supervision to downstream robot control.
Human egocentric videos encode rich embodied experience, including action progression, hand-object interaction, and task-level structure. However, this experience is not directly usable for training embodied brains. Raw videos lack explicit structure, free-form language annotations are unstable, and unconstrained generation often introduces temporal ambiguity or hallucinated interactions. Our key idea is to translate egocentric human data into structured and verifiable supervision that captures the hierarchical structure of embodied behavior, spanning action semantics, temporal organization, interaction dynamics, and task-level reasoning. To this end, we design a schema-driven, rule-validated egocentric VQA data engine that systematically converts raw egocentric human videos into multi-level supervision aligned with embodied planning and interaction reasoning.
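
To make this concrete, the sketch below illustrates the kind of schema and rule checks such a data engine can apply. It is a minimal sketch under simplified assumptions; the class and field names (`EgoVQASample`, `EvidenceSpan`, `validate_sample`) are illustrative and do not correspond to the actual E2E-3M engine interface.

```python
# Minimal sketch of a schema-driven, rule-validated VQA generation step.
# All names here (EgoVQASample, EvidenceSpan, validate_sample, ...) are
# illustrative assumptions, not the actual E2E-3M data engine.
from dataclasses import dataclass
from typing import List, Optional

LEVELS = {"action", "temporal", "interaction", "task"}  # multi-level supervision


@dataclass
class EvidenceSpan:
    start_s: float  # evidence start time (seconds) within the clip
    end_s: float    # evidence end time (seconds) within the clip


@dataclass
class EgoVQASample:
    level: str                   # which supervision level the question targets
    question: str
    answer: str
    evidence: EvidenceSpan       # every answer must be grounded in a clip span
    mentioned_objects: List[str]


def validate_sample(sample: EgoVQASample,
                    clip_len_s: float,
                    visible_objects: List[str]) -> Optional[str]:
    """Rule-based validation; returns a rejection reason, or None if accepted."""
    if sample.level not in LEVELS:
        return f"unknown level: {sample.level}"
    # Evidence grounding: the cited span must lie inside the clip, in order.
    ev = sample.evidence
    if not (0.0 <= ev.start_s < ev.end_s <= clip_len_s):
        return "evidence span outside clip or temporally inverted"
    # Hallucination filter: answers may only mention annotated visible objects.
    unknown = [o for o in sample.mentioned_objects if o not in visible_objects]
    if unknown:
        return f"hallucinated objects: {unknown}"
    if not sample.question.strip() or not sample.answer.strip():
        return "empty question or answer"
    return None


# Usage: generated candidates that fail any rule are dropped before training.
candidate = EgoVQASample(
    level="interaction",
    question="Which object does the left hand grasp before opening the drawer?",
    answer="The mug on the counter.",
    evidence=EvidenceSpan(start_s=2.4, end_s=4.1),
    mentioned_objects=["mug", "drawer"],
)
assert validate_sample(candidate, clip_len_s=12.0,
                       visible_objects=["mug", "drawer", "counter"]) is None
```

Candidates that fail any rule, for example an evidence span that falls outside the clip or an answer that mentions an object not annotated as visible, would be rejected before entering the training set.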

We evaluate the transferability of egocentric gains under two widely adopted VLA paradigms: PhysGR00T (GR00T-style) and PhysPI (Pi-style), keeping the action expert lightweight and consistent across both.
(a) PhysGR00T follows the dual-system design in GR00T N1.5: the VLM plays the role of System 2 to produce high-level multimodal representations, while a Flow-Matching (FM) action expert serves as System 1 to generate continuous actions. PhysGR00T uses the last-layer VLM hidden states as the conditioning signal, with the FM expert implemented as a diffusion transformer (DiT) that denoises an action trajectory by cross-attending to VLM features. (b) PhysPI, in the spirit of π0, more tightly couples the VLM backbone with the action expert through layer-wise cross-attention conditioning. Instead of only using the last VLM layer, PhysPI conditions the DiT blocks with multiple VLM layers, injecting them layer-wise into the action expert. This stronger coupling allows egocentric improvements distributed across VLM layers to be more effectively utilized for control.
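
The sketch below contrasts the two conditioning schemes under simplified assumptions; `DiTBlock`, `ActionExpert`, and the `layerwise` flag are illustrative names, and the actual PhysGR00T/PhysPI action experts differ in detail.

```python
# Minimal sketch of last-layer (PhysGR00T-style) vs. layer-wise (PhysPI-style)
# VLM conditioning of a flow-matching DiT action expert. Names and shapes are
# illustrative assumptions, not the actual implementation.
import torch
import torch.nn as nn


class DiTBlock(nn.Module):
    """One DiT block: self-attention over action tokens, cross-attention to a
    VLM conditioning sequence, then an MLP."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, cond):
        x = x + self.self_attn(self.n1(x), self.n1(x), self.n1(x))[0]
        x = x + self.cross_attn(self.n2(x), cond, cond)[0]
        return x + self.mlp(self.n3(x))


class ActionExpert(nn.Module):
    """Flow-matching action expert: denoises an action chunk conditioned on
    VLM hidden states (last layer only, or one VLM layer per DiT block)."""
    def __init__(self, act_dim: int, dim: int = 512, depth: int = 6,
                 layerwise: bool = False):
        super().__init__()
        self.layerwise = layerwise
        self.in_proj = nn.Linear(act_dim + 1, dim)   # noisy action + FM time t
        self.blocks = nn.ModuleList(DiTBlock(dim) for _ in range(depth))
        self.out_proj = nn.Linear(dim, act_dim)      # predicted velocity field

    def forward(self, noisy_actions, t, vlm_states):
        # noisy_actions: (B, horizon, act_dim); t: (B,); vlm_states: list of
        # (B, seq, dim) hidden states taken from several VLM layers.
        t_tok = t[:, None, None].expand(-1, noisy_actions.size(1), 1)
        x = self.in_proj(torch.cat([noisy_actions, t_tok], dim=-1))
        for i, blk in enumerate(self.blocks):
            if self.layerwise:                      # PhysPI-style coupling
                cond = vlm_states[i % len(vlm_states)]
            else:                                   # PhysGR00T-style coupling
                cond = vlm_states[-1]               # last-layer features only
            x = blk(x, cond)
        return self.out_proj(x)


# Toy usage with random tensors standing in for VLM features.
B, horizon, act_dim, dim = 2, 16, 7, 512
vlm_states = [torch.randn(B, 64, dim) for _ in range(4)]   # 4 VLM layers
noisy, t = torch.randn(B, horizon, act_dim), torch.rand(B)
v_groot = ActionExpert(act_dim, dim, layerwise=False)(noisy, t, vlm_states)
v_pi = ActionExpert(act_dim, dim, layerwise=True)(noisy, t, vlm_states)
print(v_groot.shape, v_pi.shape)   # both: torch.Size([2, 16, 7])
```

The only difference between the two variants is which VLM hidden states each DiT block cross-attends to: the last layer alone (PhysGR00T-style) or one layer per block (PhysPI-style), which is what lets egocentric improvements distributed across VLM layers reach the action expert.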
Results on the EgoThink benchmark evaluating the egocentric understanding capabilities of VLMs. Best results in bold.
| Method | Activity | Forecast | Localization | Object | Planning | Reasoning | Average |
|---|---|---|---|---|---|---|---|
| General VLM | |||||||
| GPT-4 (Achiam et al., 2023) | **70.5** | **61.5** | **88.5** | **79.0** | 35.5 | **65.3** | **67.4** |
| MiniGPT-4-7B (Zhu et al., 2023) | 50.0 | 15.5 | 59.0 | 48.0 | 13.0 | 32.0 | 36.8 |
| LLaVA-1.5-7B (Liu et al., 2024) | 39.5 | 50.0 | 74.0 | 62.0 | 25.5 | 51.0 | 51.2 |
| LLaMA-3.2-11B (Dubey et al., 2024) | 33.5 | 50.0 | 59.0 | 64.0 | 41.0 | 48.7 | 50.4 |
| Qwen-2.5-VL-7B (Bai et al., 2025c) | 56.5 | 54.0 | 71.5 | 64.7 | 32.0 | 60.0 | 57.3 |
| Embodied Brain | |||||||
| VST-RL-7B (Yang et al., 2025a) | 53.0 | 56.0 | 70.5 | 67.7 | 17.0 | 63.7 | 56.2 |
| RoboBrain2.0-7B (Team et al., 2025) | 36.0 | 49.5 | 78.0 | 61.3 | 37.0 | 52.7 | 53.1 |
| PhysBrain (ours) | 70.0 | 53.5 | 77.0 | 65.3 | **64.5** | 58.0 | 64.3 |
PhysBrain achieves 64.3% average on EgoThink, ranking second only to GPT-4 (67.4%) while significantly outperforming all other 7B-scale models, especially on Planning (64.5% vs next best 41.0%).
Results of VLA models evaluated with the WidowX robot in the SimplerEnv simulation environment. Each VLM backbone is fine-tuned under the VLA paradigm following the PhysGR00T architecture.
| Method | Put Spoon on Towel | Put Carrot on Plate | Stack Green Block on Yellow Block | Put Eggplant in Yellow Basket | Average |
|---|---|---|---|---|---|
| VLA Baselines | |||||
| RT-1-X (Brohan et al., 2024) | 0.0 | 4.2 | 0.0 | 0.0 | 1.1 |
| Octo-Base (Team Octo et al., 2024) | 15.8 | 12.5 | 0.0 | 41.7 | 17.5 |
| Octo-Small (Team Octo et al., 2024) | 41.7 | 8.2 | 0.0 | 56.7 | 26.7 |
| OpenVLA (Kim et al., 2024) | 4.2 | 0.0 | 0.0 | 12.5 | 4.2 |
| OpenVLA-OFT (Kim et al., 2025) | 12.5 | 4.2 | 4.2 | 72.5 | 23.4 |
| RoboVLM (Huang et al., 2024) | 50.0 | 37.5 | 0.0 | 83.3 | 42.7 |
| TraceVLA (Zhang et al., 2025) | 12.5 | 16.6 | 16.6 | 65.0 | 27.7 |
| SpatialVLA (Li et al., 2025) | 20.8 | 20.8 | 25.0 | 70.8 | 34.4 |
| CogACT (Zhao et al., 2024) | 71.7 | 50.8 | 15.0 | 67.5 | 51.3 |
| VideoVLA (Wang et al., 2025) | 75.0 | 20.8 | 45.8 | 70.8 | 53.1 |
| π0 (Yang et al., 2024) | 29.1 | 0.0 | 16.6 | 62.5 | 27.1 |
| π0-FAST (Yang et al., 2025) | 29.1 | 21.9 | 10.8 | 66.6 | 48.3 |
| VLM Baselines | |||||
| Qwen2.5-VL-7B (Bai et al., 2025c) | 59.2 | 30.8 | 3.3 | 44.2 | 34.4 |
| RoboBrain2.0-7B (Team et al., 2025) | 30.8 | 24.7 | 2.5 | 93.3 | 37.8 |
| VST-RL-7B (Yang et al., 2025a) | 57.7 | 41.7 | 16.7 | 50.0 | 41.3 |
| Spatial-SSRL-7B (Zhang et al., 2025) | 56.3 | 44.8 | 6.2 | 72.9 | 45.1 |
| PhysBrain (ours) | 65.6 | 37.5 | 33.3 | 79.2 | 53.9 |