
PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence

* Equal contribution    † Corresponding author
1The Hong Kong University of Science and Technology (Guangzhou)    2Zhongguancun Academy    3Zhongguancun Institute of Artificial Intelligence
4DeepCybo    5Harbin Institute of Technology    6Huazhong University of Science and Technology

Abstract

Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception. Most vision-language models (VLMs) are trained on third-person data, creating a viewpoint mismatch for humanoid robots, whereas large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. We propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling construction of the E2E-3M dataset at scale. PhysBrain, an egocentric-aware embodied brain trained on E2E-3M, exhibits substantially improved egocentric understanding for planning tasks, provides a sample-efficient initialization for vision-language-action (VLA) fine-tuning, and achieves a 53.9% average success rate on SimplerEnv, demonstrating effective transfer from human egocentric supervision to downstream robot control.

Egocentric2Embodiment Pipeline & E2E-3M Dataset

Human egocentric videos encode rich embodied experience, including action progression, hand-object interaction, and task-level structure. However, this experience is not directly usable for training embodied brains. Raw videos lack explicit structure, free-form language annotations are unstable, and unconstrained generation often introduces temporal ambiguity or hallucinated interactions. Our key idea is to translate egocentric human data into structured and verifiable supervision that captures the hierarchical structure of embodied behavior, spanning action semantics, temporal organization, interaction dynamics, and task-level reasoning. To this end, we design a schema-driven, rule-validated egocentric VQA data engine that systematically converts raw egocentric human videos into multi-level supervision aligned with embodied planning and interaction reasoning.
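As a concrete illustration of the rule-validation step, the minimal Python sketch below checks a single generated VQA record for schema membership, evidence grounding, and temporal consistency. All identifiers (VQARecord, validate_record) and the listed mode names are hypothetical placeholders for illustration, not the released E2E-3M data engine.

# Minimal sketch of a schema-driven, rule-validated VQA record check.
# Names (VQARecord, validate_record, the mode list) are illustrative
# assumptions, not the released E2E-3M implementation.
from dataclasses import dataclass, field

VQA_MODES = {
    "action_sequence", "hand_object_interaction", "state_change",
    "temporal_order", "task_planning", "affordance", "failure_reasoning",
}  # placeholder names standing in for the seven VQA modes

@dataclass
class VQARecord:
    mode: str                      # which schema the QA pair instantiates
    question: str
    answer: str
    evidence_frames: list = field(default_factory=list)  # frame indices cited as grounding
    clip_num_frames: int = 0

def validate_record(rec: VQARecord) -> list:
    """Return a list of rule violations; an empty list means the record is kept."""
    errors = []
    # Schema check: a generated QA pair must declare one of the allowed modes.
    if rec.mode not in VQA_MODES:
        errors.append(f"unknown VQA mode: {rec.mode}")
    # Evidence grounding: the answer must cite frames that exist in the clip.
    if not rec.evidence_frames:
        errors.append("answer cites no evidence frames")
    if any(not (0 <= f < rec.clip_num_frames) for f in rec.evidence_frames):
        errors.append("evidence frame index outside the clip")
    # Temporal consistency: cited evidence must be non-decreasing in time,
    # so ordered answers ("first ... then ... finally") cannot jump backwards.
    if rec.evidence_frames != sorted(rec.evidence_frames):
        errors.append("evidence frames are not temporally ordered")
    return errors

if __name__ == "__main__":
    rec = VQARecord(
        mode="action_sequence",
        question="What is the sequence of actions performed by the hands?",
        answer="First the right hand reaches toward the equipment, then both hands position it.",
        evidence_frames=[2, 5, 7],
        clip_num_frames=8,
    )
    print(validate_record(rec))  # [] -> record passes the rule filters

Records that fail any rule are discarded or regenerated, which is what keeps the resulting supervision verifiable rather than free-form.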

[Figure: Egocentric2Embodiment translation pipeline overview. The resulting E2E-3M corpus contains ~3M VQA pairs built from 3 data sources, spanning 7 VQA modes and generated with 4 VLM engines; accompanying panels show the task genre and vocabulary distributions.]

Example VQA pair (over eight sampled frames):
Question: What is the sequence of actions performed by the hands in this video clip?
Answer: First, the right hand reaches toward the equipment on the lab bench. Then, the left hand moves to support the apparatus. Finally, both hands coordinate to position the equipment correctly.

Framework Architecture

We evaluate the transferability of egocentric gains under two widely adopted VLA paradigms: PhysGR00T (GR00T-style) and PhysPI (Pi-style), keeping the action expert lightweight and consistent across both.

PhysBrain Framework

(a) PhysGR00T follows the dual-system design of GR00T N1.5: the VLM plays the role of System 2, producing high-level multimodal representations, while a flow-matching (FM) action expert serves as System 1 and generates continuous actions. PhysGR00T uses the last-layer VLM hidden states as the conditioning signal, with the FM expert implemented as a diffusion transformer (DiT) that denoises an action trajectory by cross-attending to the VLM features. (b) PhysPI, in the spirit of π0, couples the VLM backbone more tightly with the action expert through layer-wise cross-attention conditioning. Instead of using only the last VLM layer, PhysPI conditions the DiT blocks on features from multiple VLM layers, injected layer-wise into the action expert. This stronger coupling allows egocentric improvements distributed across VLM layers to be exploited more effectively for control.
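To make the two conditioning schemes concrete, the minimal PyTorch sketch below shows a DiT-style action expert that either cross-attends only to the last-layer VLM hidden states (PhysGR00T-style) or to a different VLM layer at each block (PhysPI-style). The module names (DiTBlock, ActionExpert), dimensions, and the block-to-layer mapping are illustrative assumptions, not the released implementation; the flow-matching objective and denoising loop are omitted.

# Sketch contrasting last-layer vs. layer-wise VLM conditioning of a DiT
# action expert. Names, sizes, and wiring are assumptions for illustration.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One action-expert block: self-attention over the noisy action
    trajectory, plus cross-attention into VLM features."""
    def __init__(self, d_act=256, d_vlm=1024, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_act, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(
            d_act, n_heads, kdim=d_vlm, vdim=d_vlm, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_act, 4 * d_act), nn.GELU(),
                                 nn.Linear(4 * d_act, d_act))

    def forward(self, act_tokens, vlm_feats):
        act_tokens = act_tokens + self.self_attn(act_tokens, act_tokens, act_tokens)[0]
        act_tokens = act_tokens + self.cross_attn(act_tokens, vlm_feats, vlm_feats)[0]
        return act_tokens + self.mlp(act_tokens)

class ActionExpert(nn.Module):
    def __init__(self, n_blocks=4, layerwise=False):
        super().__init__()
        self.layerwise = layerwise
        self.blocks = nn.ModuleList([DiTBlock() for _ in range(n_blocks)])

    def forward(self, act_tokens, vlm_layer_feats):
        # vlm_layer_feats: list of per-layer VLM hidden states, shallow -> deep
        for i, blk in enumerate(self.blocks):
            if self.layerwise:   # PhysPI-style: each block reads a different VLM layer
                feats = vlm_layer_feats[i % len(vlm_layer_feats)]
            else:                # PhysGR00T-style: every block reads the last layer only
                feats = vlm_layer_feats[-1]
            act_tokens = blk(act_tokens, feats)
        return act_tokens

if __name__ == "__main__":
    acts = torch.randn(2, 16, 256)                      # noisy action-trajectory tokens
    vlm = [torch.randn(2, 77, 1024) for _ in range(4)]  # hidden states from 4 VLM layers
    print(ActionExpert(layerwise=False)(acts, vlm).shape)  # torch.Size([2, 16, 256])
    print(ActionExpert(layerwise=True)(acts, vlm).shape)

In the layer-wise variant, features from intermediate VLM layers reach the action expert directly, which is how improvements distributed across VLM layers can influence control rather than only those surfacing in the final layer.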

Experimental Results

Egocentric VLM and VLA Performance

Egocentric Understanding Evaluation (EgoThink)

Egocentric understanding of VLMs evaluated on the EgoThink benchmark; all scores are percentages.

Method                                 Activity  Forecast  Localization  Object  Planning  Reasoning  Average
General VLM
GPT-4 (Achiam et al., 2023)               70.5      61.5        88.5       79.0     35.5      65.3      67.4
MiniGPT-4-7B (Zhu et al., 2023)           50.0      15.5        59.0       48.0     13.0      32.0      36.8
LLaVA-1.5-7B (Liu et al., 2024)           39.5      50.0        74.0       62.0     25.5      51.0      51.2
LLaMA-3.2-11B (Dubey et al., 2024)        33.5      50.0        59.0       64.0     41.0      48.7      50.4
Qwen-2.5-VL-7B (Bai et al., 2025c)        56.5      54.0        71.5       64.7     32.0      60.0      57.3
Embodied Brain
VST-RL-7B (Yang et al., 2025a)            53.0      56.0        70.5       67.7     17.0      63.7      56.2
RoboBrain2.0-7B (Team et al., 2025)       36.0      49.5        78.0       61.3     37.0      52.7      53.1
PhysBrain (ours)                          70.0      53.5        77.0       65.3     64.5      58.0      64.3

PhysBrain reaches a 64.3% average score on EgoThink, second only to GPT-4 (67.4%) and substantially ahead of all other 7B-scale models, especially on Planning (64.5% vs. the next best 41.0%).

SimplerEnv Evaluation (PhysGR00T Architecture)

Results of evaluating VLA models on WidowX robot manipulation tasks in the SimplerEnv simulation environment (success rates in %). The VLM backbone is fine-tuned under the VLA paradigm following the PhysGR00T architecture.

Method                                 Put Spoon   Put Carrot  Stack Green Block  Put Eggplant       Average
                                       on Towel    on Plate    on Yellow Block    in Yellow Basket
VLA Baselines
RT-1-X (Brohan et al., 2024)               0.0         4.2            0.0               0.0             1.1
Octo-Base (Team Octo et al., 2024)        15.8        12.5            0.0              41.7            17.5
Octo-Small (Team Octo et al., 2024)       41.7         8.2            0.0              56.7            26.7
OpenVLA (Kim et al., 2024)                 4.2         0.0            0.0              12.5             4.2
OpenVLA-OFT (Kim et al., 2025)            12.5         4.2            4.2              72.5            23.4
RoboVLM (Huang et al., 2024)              50.0        37.5            0.0              83.3            42.7
TraceVLA (Zhang et al., 2025)             12.5        16.6           16.6              65.0            27.7
SpatialVLA (Li et al., 2025)              20.8        20.8           25.0              70.8            34.4
CogACT (Zhao et al., 2024)                71.7        50.8           15.0              67.5            51.3
VideoVLA (Wang et al., 2025)              75.0        20.8           45.8              70.8            53.1
π0 (Yang et al., 2024)                    29.1         0.0           16.6              62.5            27.1
π0-FAST (Yang et al., 2025)               29.1        21.9           10.8              66.6            48.3
VLM Baselines
Qwen2.5-VL-7B (Bai et al., 2025c)         59.2        30.8            3.3              44.2            34.4
RoboBrain2.0-7B (Team et al., 2025)       30.8        24.7            2.5              93.3            37.8
VST-RL-7B (Yang et al., 2025a)            57.7        41.7           16.7              50.0            41.3
Spatial-SSRL-7B (Zhang et al., 2025)      56.3        44.8            6.2              72.9            45.1
PhysBrain (ours)                          65.6        37.5           33.3              79.2            53.9
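For reference, the Average entry for PhysBrain is the unweighted mean of its four task success rates: (65.6 + 37.5 + 33.3 + 79.2) / 4 = 53.9%, matching the headline number reported in the abstract.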

Citation

@article{lin2025physbrain,
  title={PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence},
  author={Lin, Xiaopeng and Lian, Shijie and Yu, Bin and Yang, Ruoqi and Wu, Changti and Miao, Yuzhuo and Jin, Yurun and Shi, Yukun and Huang, Cong and Cheng, Bojun and Chen, Kai},
  journal={arXiv preprint},
  year={2025}
}