
PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence

* Equal contribution    † Corresponding author
1The Hong Kong University of Science and Technology (Guangzhou)    2Zhongguancun Academy    3Zhongguancun Institute of Artificial Intelligence
4DeepCybo    5Harbin Institute of Technology    6Huazhong University of Science and Technology

Abstract

Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception. Most VLMs, however, are trained on third-person data, creating a viewpoint mismatch for humanoid robots; large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. We propose an Egocentric2Embodiment translation pipeline that systematically converts first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding, egocentric consistency, and temporal-logic validation, enabling construction of the E2E-3M dataset at scale. PhysBrain, an egocentric-aware embodied brain trained on E2E-3M, shows substantial improvements in egocentric understanding, particularly on planning tasks, provides a sample-efficient initialization for VLA fine-tuning, and achieves strong results on the SimplerEnv and RoboCasa benchmarks, demonstrating effective transfer from human egocentric supervision to downstream robot control.

Egocentric2Embodiment Pipeline & E2E-3M Dataset

Human egocentric videos encode rich embodied experience, including action progression, hand-object interaction, and task-level structure. However, this experience is not directly usable for training embodied brains. Raw videos lack explicit structure, free-form language annotations are unstable, and unconstrained generation often introduces temporal ambiguity or hallucinated interactions. Our Egocentric2Embodiment Translation Pipeline addresses this by systematically converting human egocentric videos into structured and verifiable supervision that captures the hierarchical structure of embodied behavior. The pipeline employs: (1) scenario-aware segmentation to chunk episodes into temporal clips with contextual priors; (2) seven schema-driven VQA modes (temporal, spatial, attribute, mechanics, reasoning, summary, trajectory) with customized question-answer generation; (3) deterministic rule validation to enforce evidence grounding, egocentric consistency, and mode-specific temporal logic; and (4) structured output compilation into the E2E-3M dataset, aggregating ~3M VQA instances from three complementary domains: Ego4D (open-world household), BuildAI (industrial workflows), and EgoDex (laboratory manipulation).
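Step (3), deterministic rule validation, can be made concrete with a small sketch. This is an illustrative reconstruction, not the released pipeline code: the names (`QAItem`, `validate`, the egocentric term list) and the specific checks are assumptions based on the three stated criteria (evidence grounding, egocentric consistency, mode-specific temporal logic).

```python
# Hypothetical sketch of deterministic rule validation for generated VQA
# items; all names and thresholds here are illustrative assumptions.
from dataclasses import dataclass

MODES = {"temporal", "spatial", "attribute", "mechanics",
         "reasoning", "summary", "trajectory"}

@dataclass
class QAItem:
    mode: str              # one of the seven schema-driven VQA modes
    question: str
    answer: str
    evidence_frames: list  # frame indices the answer cites as evidence
    clip_frames: tuple     # (start, end) frame range of the source clip

EGO_TERMS = ("left hand", "right hand", "both hands", "the camera wearer")

def validate(item: QAItem) -> bool:
    """Reject items that fail schema, grounding, or consistency rules."""
    if item.mode not in MODES:
        return False
    # Evidence grounding: the answer must cite frames, and every cited
    # frame must lie inside the source clip.
    start, end = item.clip_frames
    if not item.evidence_frames:
        return False
    if any(f < start or f > end for f in item.evidence_frames):
        return False
    # Egocentric consistency: the answer must describe the first-person
    # actor rather than an external observer.
    if not any(t in item.answer.lower() for t in EGO_TERMS):
        return False
    # Mode-specific temporal logic: temporal answers must cite evidence
    # in chronological order.
    if item.mode == "temporal":
        if item.evidence_frames != sorted(item.evidence_frames):
            return False
    return True
```

Because every rule is deterministic, rejected items can be regenerated or dropped without a human in the loop, which is what lets the pipeline scale to ~3M validated instances.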

Egocentric2Embodiment Translation Pipeline
(Figure) Pipeline overview: a ~3M-instance VQA corpus built from 3 data sources with 7 VQA modes and 4 VLM engines; distribution panels show task genre, vocabulary, noun, and verb distributions. Example VQA over a sequence of frames. Q: What is the sequence of actions performed by the hands in this video clip? A: First, the right hand reaches toward the equipment on the lab bench. Then, the left hand moves to support the apparatus. Finally, both hands coordinate to position the equipment correctly.

Framework Architecture

PhysBrain is a VLM backbone specialized for egocentric understanding, obtained through supervised fine-tuning on E2E-3M mixed with general vision-language data; this enhances first-person understanding, reasoning, and planning. For robotic control, we instantiate PhysVLA, which follows a dual-system design: PhysBrain serves as System 2 (high-level reasoning), while a Flow-Matching diffusion action expert acts as System 1 (action generation).

PhysBrain Framework

PhysVLA follows a dual-system design inspired by GR00T N1.5. PhysBrain acts as System 2, providing high-level multimodal representations from its last-layer hidden states. A Flow-Matching (FM) action expert, implemented as a diffusion transformer (DiT), serves as System 1 and generates continuous actions by denoising action trajectories: the action expert cross-attends to PhysBrain's last-layer hidden states, with VLM features as keys/values and action tokens as queries. Under the rectified-flow parameterization, the model predicts a velocity field that transports noise to the target action chunk with a simple regression objective, and uses only 8 denoising steps at inference to generate 16-step action chunks. This lightweight design provides a controlled setting for examining how informative the egocentric VLM representation is for action prediction.
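The inference loop of a rectified-flow action expert can be sketched in a few lines. In the sketch below, an oracle velocity field stands in for the trained DiT (which in PhysVLA would be conditioned on PhysBrain's features); the dimensions follow the paper's 16-step chunks and 8 denoising steps, but all function names are illustrative.

```python
# Toy rectified-flow sampler: Euler integration of a velocity field from
# Gaussian noise to an action chunk. The oracle velocity below stands in
# for the trained DiT action expert; names are illustrative assumptions.
import numpy as np

ACTION_DIM, CHUNK, STEPS = 7, 16, 8   # 16-step chunks, 8 denoising steps

def oracle_velocity(x_t, t, x1):
    """Velocity of the straight-line (rectified) flow toward target x1.
    In PhysVLA this field is predicted by a DiT cross-attending to the
    VLM's last-layer hidden states."""
    return (x1 - x_t) / (1.0 - t)

def sample_actions(x1, steps=STEPS, seed=0):
    """Denoise a random chunk into the target by Euler integration."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((CHUNK, ACTION_DIM))  # start from pure noise
    for i in range(steps):
        t = i / steps                              # current flow time
        x = x + (1.0 / steps) * oracle_velocity(x, t, x1)  # Euler step
    return x
```

With the straight-line oracle, the residual error shrinks by a factor of (steps - i - 1)/(steps - i) per step and reaches the target exactly on the last step; a learned velocity network only approximates this field, which is why a small number of steps (here 8) suffices in practice.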

Experimental Results

PhysBrain delivers significant improvements across multiple benchmarks. On EgoThink, PhysBrain-8B achieves a 69.7% average with exceptional planning capabilities, substantially exceeding all baselines. On EgoPlan, PhysBrain-8B reaches 47.4/46.9 on Benchmark1/Benchmark2, outperforming Qwen3-VL-8B by +3.1/+6.4 points. When fine-tuned as PhysVLA for robotic control, PhysBrain-8B achieves a 67.4% success rate on SimplerEnv, comparable to the state-of-the-art RoboBrain2.5 (67.6%) while requiring no massive cross-embodiment robot data. Real-world validation on a Franka Research 3 further confirms effectiveness (20/30 vs. 16/30 for the baseline). These results validate that large-scale human egocentric data effectively bridges VLMs to physical intelligence.

Egocentric VLM and VLA Performance

Comparison on EgoPlan and EgoThink Benchmarks

The detailed sub-task columns belong exclusively to the EgoThink benchmark; its overall average is reported in the final column.

Method | EgoPlan-B1 (Acc.) | EgoPlan-B2 (Acc.) | EgoThink: Act. | Fore. | Loc. | Obj. | Asst. | Nav. | Reas. | Avg.
General VLM
GPT-4o (Hurst et al., 2024a) 39.5 41.0 73.0 66.0 89.0 78.6 28.0 12.0 66.0 66.4
MiniGPT-4-7B (Zhu et al., 2023) 28.1 24.5 45.5 36.5 61.5 48.0 30.0 12.0 36.7 41.6
LLaVA-1.5-7B (Liu et al., 2024) 27.8 25.4 35.0 43.5 76.0 65.3 33.0 26.0 53.0 51.6
LLaMA-3.2-11B (Dubey et al., 2024) 24.3 25.1 34.0 49.5 57.5 62.7 42.0 22.0 47.7 48.4
Qwen-3-VL-4B (Yang et al., 2025a) 42.2 34.6 63.5 65.0 82.5 72.6 46.0 35.0 71.0 66.7
Qwen-3-VL-8B (Yang et al., 2025a) 44.3 40.5 68.0 66.5 86.0 72.3 41.0 39.0 61.7 65.9
Embodied Brain
VST-RL-7B (Yang et al., 2025b) 40.8 28.7 55.0 56.5 69.5 67.3 15.0 22.0 62.3 56.2
RoboBrain2.0-7B (Team et al., 2025a) 38.6 23.3 35.0 47.0 77.5 60.7 44.0 38.0 52.3 52.8
RoboBrain2.5-8B (Tan et al., 2026) 45.9 45.2 57.5 56.5 81.0 70.3 40.0 28.0 68.3 62.4
PhysBrain-4B (ours) 43.9 39.3 68.0 64.5 85.5 76.3 66.0 44.0 66.0 69.4
PhysBrain-8B (ours) 47.4 46.9 69.0 69.0 86.5 76.0 65.0 42.0 64.0 69.7

SimplerEnv Evaluation (PhysVLA Architecture)

Results of evaluating the VLA models with the WidowX robot in the SimplerEnv simulation environment. PhysBrain is fine-tuned under the VLA paradigm following the PhysVLA architecture, where the VLM backbone provides high-level representations to a Flow-Matching action expert.

Method | Put Spoon on Towel | Put Carrot on Plate | Stack Green Block on Yellow Block | Put Eggplant in Yellow Basket | Average
VLA Baselines
RT-1-X (O'Neill et al., 2024) 0.0 4.2 0.0 0.0 1.1
Octo-Base (Team et al., 2024) 15.8 12.5 0.0 41.7 17.5
Octo-Small (Team et al., 2024) 41.7 8.2 0.0 56.7 26.7
OpenVLA (Kim et al., 2024) 4.2 0.0 0.0 12.5 4.2
OpenVLA-OFT (Kim et al., 2025) 12.5 4.2 4.2 72.5 23.4
RoboVLM (Li et al., 2024b) 50.0 37.5 0.0 83.3 42.7
TraceVLA (Zheng et al., 2025) 12.5 16.6 16.6 65.0 27.7
SpatialVLA (Qu et al., 2025) 20.8 20.8 25.0 70.8 34.4
CogACT (Li et al., 2024) 71.7 50.8 15.0 67.5 51.3
VideoVLA (Shen et al., 2025) 75.0 20.8 45.8 70.8 53.1
π0 (Black et al., 2024) 29.1 0.0 16.6 62.5 27.1
π0.5 (Black et al., 2025) 49.3 64.7 44.7 69.7 57.1
Isaac-GR00T-N1.6-Bridge (Team et al., 2025b) 64.5 65.5 5.5 93.0 57.1
VLM Baselines
Qwen2.5-VL-7B (Bai et al., 2025b) 68.7 35.4 25.0 75.0 51.0
Qwen3-VL-4B (Yang et al., 2025a) 87.5 50.0 29.2 54.2 55.2
Qwen3-VL-8B (Yang et al., 2025a) 68.7 38.5 30.2 87.9 56.3
VST-RL-7B (Yang et al., 2025b) 57.7 41.7 16.7 50.0 41.3
RoboBrain2.0-7B (Team et al., 2025a) 30.8 24.7 2.5 93.3 37.8
RoboBrain2.5-8B (Tan et al., 2026) 75.0 55.5 40.1 100.0 67.6
PhysBrain-4B (ours) 90.3 58.3 34.7 80.6 65.9
PhysBrain-8B (ours) 77.8 62.5 34.8 94.8 67.4

RoboCasa Tabletop Evaluation (PhysVLA Architecture)

Results of evaluating the VLA models with the GR1 robot in the RoboCasa Tabletop simulation environment. PhysBrain is fine-tuned under the VLA paradigm following the PhysVLA architecture, demonstrating consistent improvements across 24 complex manipulation tasks.

Task | Isaac-GR00T N1.6 | QwenGR00T + Qwen3VL | QwenOFT + Qwen3VL | QwenFAST + Qwen3VL | PhysBrain-4B | PhysBrain-8B
PnP Bottle To Cabinet Close 51.5 46.0 30.0 38.0 74.0 70.0
PnP Can To Drawer Close 13.0 80.0 76.0 44.0 68.0 74.0
PnP Cup To Drawer Close 8.5 54.0 44.0 56.0 42.0 46.0
PnP Milk To Microwave Close 14.0 48.0 44.0 44.0 54.0 60.0
PnP Potato To Microwave Close 41.5 28.0 32.0 14.0 24.0 34.0
PnP Wine To Cabinet Close 16.5 46.0 36.0 14.0 54.0 40.0
PnP Novel From Cuttingboard To Basket 58.0 48.0 50.0 54.0 62.0 54.0
PnP Novel From Cuttingboard To Cardboardbox 46.5 40.0 40.0 42.0 44.0 56.0
PnP Novel From Cuttingboard To Pan 68.5 68.0 70.0 58.0 56.0 72.0
PnP Novel From Cuttingboard To Pot 65.0 52.0 54.0 58.0 58.0 74.0
PnP Novel From Cuttingboard To Tieredbasket 46.5 56.0 38.0 40.0 40.0 44.0
PnP Novel From Placemat To Basket 58.5 42.0 32.0 36.0 42.0 58.0
PnP Novel From Placemat To Bowl 57.5 44.0 58.0 38.0 56.0 56.0
PnP Novel From Placemat To Plate 63.0 48.0 52.0 42.0 80.0 62.0
PnP Novel From Placemat To Tieredshelf 28.5 18.0 24.0 18.0 14.0 28.0
PnP Novel From Plate To Bowl 57.0 60.0 60.0 52.0 54.0 70.0
PnP Novel From Plate To Cardboardbox 43.5 50.0 50.0 30.0 50.0 54.0
PnP Novel From Plate To Pan 51.0 54.0 66.0 48.0 68.0 56.0
PnP Novel From Plate To Plate 78.7 70.0 68.0 50.0 78.0 60.0
PnP Novel From Tray To Cardboardbox 51.5 38.0 44.0 28.0 40.0 52.0
PnP Novel From Tray To Plate 71.0 56.0 56.0 34.0 66.0 60.0
PnP Novel From Tray To Pot 64.5 50.0 62.0 46.0 52.0 70.0
PnP Novel From Tray To Tieredbasket 57.0 36.0 54.0 36.0 50.0 48.0
PnP Novel From Tray To Tieredshelf 31.5 16.0 30.0 16.0 22.0 28.0
Average 47.6 47.8 48.8 39.0 49.75 55.25

Citation

@article{lin2025physbrain,
  title   = {PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence},
  author  = {Lin, Xiaopeng and Lian, Shijie and Yu, Bin and Yang, Ruoqi and Shen, Zhaolong and Wu, Changti and Miao, Yuzhuo and Jin, Yurun and Shi, Yukun and He, Jiyan and Huang, Cong and Cheng, Bojun and Chen, Kai},
  journal = {arXiv preprint},
  year    = {2025}
}