
PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence

* Equal contribution    † Corresponding author
1The Hong Kong University of Science and Technology (Guangzhou)    2Zhongguancun Academy    3Zhongguancun Institute of Artificial Intelligence
4DeepCybo    5Harbin Institute of Technology    6Huazhong University of Science and Technology

Abstract

Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception. Most vision-language models (VLMs) are trained on third-person data, creating a viewpoint mismatch for humanoid robots, whereas large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. We propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling construction of the E2E-3M dataset at scale. PhysBrain, an egocentric-aware embodied brain trained on E2E-3M, exhibits substantially improved egocentric understanding for planning tasks, provides a sample-efficient initialization for vision-language-action (VLA) fine-tuning, and achieves a 53.9% average success rate on SimplerEnv, demonstrating effective transfer from human egocentric supervision to downstream robot control.

Egocentric2Embodiment Pipeline & E2E-3M Dataset

Human egocentric videos encode rich embodied experience, including action progression, hand-object interaction, and task-level structure. However, this experience is not directly usable for training embodied brains. Raw videos lack explicit structure, free-form language annotations are unstable, and unconstrained generation often introduces temporal ambiguity or hallucinated interactions. Our key idea is to translate egocentric human data into structured and verifiable supervision that captures the hierarchical structure of embodied behavior, spanning action semantics, temporal organization, interaction dynamics, and task-level reasoning. To this end, we design a schema-driven, rule-validated egocentric VQA data engine that systematically converts raw egocentric human videos into multi-level supervision aligned with embodied planning and interaction reasoning.
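As a concrete illustration of the rule-validation step, the minimal Python sketch below checks a single generated VQA record for schema membership, evidence grounding, and temporal consistency. All identifiers (VQARecord, validate_record) and the listed mode names are hypothetical placeholders for illustration, not the released E2E-3M data engine.

# Minimal sketch of a schema-driven, rule-validated VQA record check.
# Names (VQARecord, validate_record, the mode list) are illustrative
# assumptions, not the released E2E-3M implementation.
from dataclasses import dataclass, field

VQA_MODES = {
    "action_sequence", "hand_object_interaction", "state_change",
    "temporal_order", "task_planning", "affordance", "failure_reasoning",
}  # placeholder names standing in for the seven VQA modes

@dataclass
class VQARecord:
    mode: str                      # which schema the QA pair instantiates
    question: str
    answer: str
    evidence_frames: list = field(default_factory=list)  # frame indices cited as grounding
    clip_num_frames: int = 0

def validate_record(rec: VQARecord) -> list:
    """Return a list of rule violations; an empty list means the record is kept."""
    errors = []
    # Schema check: a generated QA pair must declare one of the allowed modes.
    if rec.mode not in VQA_MODES:
        errors.append(f"unknown VQA mode: {rec.mode}")
    # Evidence grounding: the answer must cite frames that exist in the clip.
    if not rec.evidence_frames:
        errors.append("answer cites no evidence frames")
    if any(not (0 <= f < rec.clip_num_frames) for f in rec.evidence_frames):
        errors.append("evidence frame index outside the clip")
    # Temporal consistency: cited evidence must be non-decreasing in time,
    # so ordered answers ("first ... then ... finally") cannot jump backwards.
    if rec.evidence_frames != sorted(rec.evidence_frames):
        errors.append("evidence frames are not temporally ordered")
    return errors

if __name__ == "__main__":
    rec = VQARecord(
        mode="action_sequence",
        question="What is the sequence of actions performed by the hands?",
        answer="First the right hand reaches toward the equipment, then both hands position it.",
        evidence_frames=[2, 5, 7],
        clip_num_frames=8,
    )
    print(validate_record(rec))  # [] -> record passes the rule filters

Records that fail any rule are discarded or regenerated, which is what keeps the resulting supervision verifiable rather than free-form.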

[Figure: Egocentric2Embodiment translation pipeline overview. The resulting E2E-3M corpus contains ~3M VQA pairs built from 3 data sources, spanning 7 VQA modes and generated with 4 VLM engines; accompanying panels show the task genre and vocabulary distributions.]

Example VQA pair (over eight sampled frames):
Question: What is the sequence of actions performed by the hands in this video clip?
Answer: First, the right hand reaches toward the equipment on the lab bench. Then, the left hand moves to support the apparatus. Finally, both hands coordinate to position the equipment correctly.

Framework Architecture

We evaluate the transferability of egocentric gains under two widely adopted VLA paradigms: PhysGR00T (GR00T-style) and PhysPI (Pi-style), keeping the action expert lightweight and consistent across both.

PhysBrain Framework

(a) PhysGR00T follows the dual-system design of GR00T N1.5: the VLM plays the role of System 2, producing high-level multimodal representations, while a flow-matching (FM) action expert serves as System 1 and generates continuous actions. PhysGR00T uses the last-layer VLM hidden states as the conditioning signal, with the FM expert implemented as a diffusion transformer (DiT) that denoises an action trajectory by cross-attending to the VLM features. (b) PhysPI, in the spirit of π0, couples the VLM backbone more tightly with the action expert through layer-wise cross-attention conditioning. Instead of using only the last VLM layer, PhysPI conditions the DiT blocks on features from multiple VLM layers, injected layer-wise into the action expert. This stronger coupling allows egocentric improvements distributed across VLM layers to be exploited more effectively for control.
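To make the two conditioning schemes concrete, the minimal PyTorch sketch below shows a DiT-style action expert that either cross-attends only to the last-layer VLM hidden states (PhysGR00T-style) or to a different VLM layer at each block (PhysPI-style). The module names (DiTBlock, ActionExpert), dimensions, and the block-to-layer mapping are illustrative assumptions, not the released implementation; the flow-matching objective and denoising loop are omitted.

# Sketch contrasting last-layer vs. layer-wise VLM conditioning of a DiT
# action expert. Names, sizes, and wiring are assumptions for illustration.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One action-expert block: self-attention over the noisy action
    trajectory, plus cross-attention into VLM features."""
    def __init__(self, d_act=256, d_vlm=1024, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_act, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(
            d_act, n_heads, kdim=d_vlm, vdim=d_vlm, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_act, 4 * d_act), nn.GELU(),
                                 nn.Linear(4 * d_act, d_act))

    def forward(self, act_tokens, vlm_feats):
        act_tokens = act_tokens + self.self_attn(act_tokens, act_tokens, act_tokens)[0]
        act_tokens = act_tokens + self.cross_attn(act_tokens, vlm_feats, vlm_feats)[0]
        return act_tokens + self.mlp(act_tokens)

class ActionExpert(nn.Module):
    def __init__(self, n_blocks=4, layerwise=False):
        super().__init__()
        self.layerwise = layerwise
        self.blocks = nn.ModuleList([DiTBlock() for _ in range(n_blocks)])

    def forward(self, act_tokens, vlm_layer_feats):
        # vlm_layer_feats: list of per-layer VLM hidden states, shallow -> deep
        for i, blk in enumerate(self.blocks):
            if self.layerwise:   # PhysPI-style: each block reads a different VLM layer
                feats = vlm_layer_feats[i % len(vlm_layer_feats)]
            else:                # PhysGR00T-style: every block reads the last layer only
                feats = vlm_layer_feats[-1]
            act_tokens = blk(act_tokens, feats)
        return act_tokens

if __name__ == "__main__":
    acts = torch.randn(2, 16, 256)                      # noisy action-trajectory tokens
    vlm = [torch.randn(2, 77, 1024) for _ in range(4)]  # hidden states from 4 VLM layers
    print(ActionExpert(layerwise=False)(acts, vlm).shape)  # torch.Size([2, 16, 256])
    print(ActionExpert(layerwise=True)(acts, vlm).shape)

In the layer-wise variant, features from intermediate VLM layers reach the action expert directly, which is how improvements distributed across VLM layers can influence control rather than only those surfacing in the final layer.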

Experimental Results

Egocentric VLM and VLA Performance

Egocentric Understanding Evaluation (EgoThink)

Egocentric understanding of VLMs evaluated on the EgoThink benchmark; all scores are percentages.

Method                                 Activity  Forecast  Localization  Object  Planning  Reasoning  Average
General VLM
GPT-4 (Achiam et al., 2023)               70.5      61.5        88.5       79.0     35.5      65.3      67.4
MiniGPT-4-7B (Zhu et al., 2023)           50.0      15.5        59.0       48.0     13.0      32.0      36.8
LLaVA-1.5-7B (Liu et al., 2024)           39.5      50.0        74.0       62.0     25.5      51.0      51.2
LLaMA-3.2-11B (Dubey et al., 2024)        33.5      50.0        59.0       64.0     41.0      48.7      50.4
Qwen-2.5-VL-7B (Bai et al., 2025c)        56.5      54.0        71.5       64.7     32.0      60.0      57.3
Embodied Brain
VST-RL-7B (Yang et al., 2025a)            53.0      56.0        70.5       67.7     17.0      63.7      56.2
RoboBrain2.0-7B (Team et al., 2025)       36.0      49.5        78.0       61.3     37.0      52.7      53.1
PhysBrain (ours)                          70.0      53.5        77.0       65.3     64.5      58.0      64.3

PhysBrain reaches a 64.3% average score on EgoThink, second only to GPT-4 (67.4%) and substantially ahead of all other 7B-scale models, especially on Planning (64.5% vs. the next best 41.0%).

SimplerEnv Evaluation (PhysGR00T Architecture)

Results of evaluating VLA models on WidowX robot manipulation tasks in the SimplerEnv simulation environment (success rates in %). The VLM backbone is fine-tuned under the VLA paradigm following the PhysGR00T architecture.

Method                                 Put Spoon   Put Carrot  Stack Green Block  Put Eggplant       Average
                                       on Towel    on Plate    on Yellow Block    in Yellow Basket
VLA Baselines
RT-1-X (Brohan et al., 2024)               0.0         4.2            0.0               0.0             1.1
Octo-Base (Team Octo et al., 2024)        15.8        12.5            0.0              41.7            17.5
Octo-Small (Team Octo et al., 2024)       41.7         8.2            0.0              56.7            26.7
OpenVLA (Kim et al., 2024)                 4.2         0.0            0.0              12.5             4.2
OpenVLA-OFT (Kim et al., 2025)            12.5         4.2            4.2              72.5            23.4
RoboVLM (Huang et al., 2024)              50.0        37.5            0.0              83.3            42.7
TraceVLA (Zhang et al., 2025)             12.5        16.6           16.6              65.0            27.7
SpatialVLA (Li et al., 2025)              20.8        20.8           25.0              70.8            34.4
CogACT (Zhao et al., 2024)                71.7        50.8           15.0              67.5            51.3
VideoVLA (Wang et al., 2025)              75.0        20.8           45.8              70.8            53.1
π0 (Yang et al., 2024)                    29.1         0.0           16.6              62.5            27.1
π0-FAST (Yang et al., 2025)               29.1        21.9           10.8              66.6            48.3
VLM Baselines
Qwen2.5-VL-7B (Bai et al., 2025c)         59.2        30.8            3.3              44.2            34.4
RoboBrain2.0-7B (Team et al., 2025)       30.8        24.7            2.5              93.3            37.8
VST-RL-7B (Yang et al., 2025a)            57.7        41.7           16.7              50.0            41.3
Spatial-SSRL-7B (Zhang et al., 2025)      56.3        44.8            6.2              72.9            45.1
PhysBrain (ours)                          65.6        37.5           33.3              79.2            53.9
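For reference, the Average entry for PhysBrain is the unweighted mean of its four task success rates: (65.6 + 37.5 + 33.3 + 79.2) / 4 = 53.9%, matching the headline number reported in the abstract.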

Citation

@article{lin2025physbrain,
  title={PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence},
  author={Lin, Xiaopeng and Lian, Shijie and Yu, Bin and Yang, Ruoqi and Wu, Changti and Miao, Yuzhuo and Jin, Yurun and Shi, Yukun and Huang, Cong and Cheng, Bojun and Chen, Kai},
  journal={arXiv preprint},
  year={2025}
}