Robotic generalization relies on physical intelligence—the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception. While most VLMs are trained on third-person data, creating a viewpoint mismatch for humanoid robots, large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. We propose an Egocentric2Embodiment translation pipeline that systematically converts first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding, egocentric consistency, and temporal logic validation, enabling the construction of the E2E-3M dataset at scale. An egocentric-aware embodied brain, PhysBrain, trained on E2E-3M, delivers substantial improvements in egocentric understanding—particularly for planning tasks—provides a sample-efficient initialization for VLA fine-tuning, and achieves strong performance on the SimplerEnv and RoboCasa benchmarks, demonstrating effective transfer from human egocentric supervision to downstream robot control.
Human egocentric videos encode rich embodied experience, including action progression, hand-object interaction, and task-level structure. However, this experience is not directly usable for training embodied brains. Raw videos lack explicit structure, free-form language annotations are unstable, and unconstrained generation often introduces temporal ambiguity or hallucinated interactions. Our Egocentric2Embodiment Translation Pipeline addresses this by systematically converting human egocentric videos into structured and verifiable supervision that captures the hierarchical structure of embodied behavior. The pipeline employs: (1) scenario-aware segmentation to chunk episodes into temporal clips with contextual priors; (2) seven schema-driven VQA modes (temporal, spatial, attribute, mechanics, reasoning, summary, trajectory) with customized question-answer generation; (3) deterministic rule validation to enforce evidence grounding, egocentric consistency, and mode-specific temporal logic; and (4) structured output compilation into the E2E-3M dataset, aggregating ~3M VQA instances from three complementary domains: Ego4D (open-world household), BuildAI (industrial workflows), and EgoDex (laboratory manipulation).
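The deterministic rule-validation stage (step 3) can be illustrated with a small sketch. All names, rules, and thresholds below are hypothetical assumptions for illustration, not the pipeline's actual implementation; they show the flavor of the three checks: evidence grounding, egocentric consistency, and mode-specific temporal logic.

```python
# Hypothetical sketch of deterministic rule validation for generated VQA
# instances; rules and field names are illustrative assumptions only.
import re
from dataclasses import dataclass

THIRD_PERSON = re.compile(r"\b(he|she|they|the person)\b")

@dataclass
class VQAInstance:
    mode: str                            # one of the seven schema modes
    question: str
    answer: str
    evidence: list                       # list of (start_s, end_s) spans
    clip_start: float
    clip_end: float

def validate(inst: VQAInstance) -> list:
    """Return a list of rule violations; an empty list means the instance passes."""
    errors = []
    # Evidence grounding: every cited span must lie inside the clip.
    for s, e in inst.evidence:
        if not (inst.clip_start <= s < e <= inst.clip_end):
            errors.append(f"evidence span ({s}, {e}) outside clip bounds")
    if not inst.evidence:
        errors.append("no evidence spans cited")
    # Egocentric consistency: answers should keep first-person framing.
    if THIRD_PERSON.search(inst.answer.lower()):
        errors.append("third-person phrasing in egocentric answer")
    # Mode-specific temporal logic: temporal answers cite chronologically
    # ordered evidence.
    if inst.mode == "temporal":
        starts = [s for s, _ in inst.evidence]
        if starts != sorted(starts):
            errors.append("temporal evidence not in chronological order")
    return errors
```

Because the checks are deterministic, an instance either passes all rules or is rejected with an explicit reason, which keeps the filtered supervision auditable.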

PhysBrain is an egocentric-centered VLM backbone obtained through supervised fine-tuning on E2E-3M (mixed with general vision-language data), enhancing first-person understanding, reasoning, and planning capabilities. For robotic control, we instantiate PhysVLA, which follows a dual-system design: PhysBrain serves as System 2 (high-level reasoning), while a Flow-Matching diffusion action expert acts as System 1 (action generation).
Inspired by GR00T N1.5, PhysBrain provides high-level multimodal representations through its last-layer hidden states, while the Flow-Matching (FM) action expert, implemented as a diffusion transformer (DiT), generates continuous actions by iteratively denoising action trajectories. The action expert cross-attends to these representations, using the VLM features as keys and values and the action tokens as queries. Under the rectified-flow parameterization, the model predicts a velocity field that transports noise to the target action chunk with a simple regression objective, and uses only 8 denoising steps at inference to generate 16-step action chunks. This lightweight design provides a controlled setting for examining how informative the egocentric VLM representation is for action prediction.
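The rectified-flow training and sampling loop described above can be sketched in a few lines. The toy `velocity` function below is a stand-in assumption for the DiT action expert, and the shapes (7-DoF actions, 32-dim context) are illustrative, not PhysVLA's actual configuration; what the sketch shows is the constant-velocity regression target and the 8-step Euler integration that produces a 16-step action chunk.

```python
# Minimal numpy sketch of a rectified-flow action head; the linear "network"
# and all dimensions are illustrative assumptions, not the PhysVLA model.
import numpy as np

CHUNK, DOF, STEPS = 16, 7, 8   # 16-step action chunk, 8 denoising steps

def velocity(x_t, t, ctx, W):
    """Toy stand-in for the DiT action expert: linear map of state + time + context."""
    feats = np.concatenate([x_t.ravel(), [t], ctx])
    return (W @ feats).reshape(CHUNK, DOF)

def fm_loss(actions, ctx, W, rng):
    """Rectified-flow regression: predict v = x1 - x0 at a random time t."""
    x0 = rng.standard_normal(actions.shape)   # noise sample
    t = rng.uniform()
    x_t = (1 - t) * x0 + t * actions          # linear interpolation path
    v_target = actions - x0                   # constant velocity along the path
    v_pred = velocity(x_t, t, ctx, W)
    return np.mean((v_pred - v_target) ** 2)

def sample(ctx, W, rng):
    """Euler integration of the learned velocity field over 8 steps."""
    x = rng.standard_normal((CHUNK, DOF))
    for k in range(STEPS):
        t = k / STEPS
        x = x + velocity(x, t, ctx, W) / STEPS
    return x
```

The simple regression objective (no score matching, no noise schedule) and the short 8-step integration are what make the action expert lightweight relative to standard diffusion policies.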
PhysBrain demonstrates significant improvements across multiple benchmarks. On EgoThink, PhysBrain-8B achieves a 69.7% average with particularly strong planning scores, substantially exceeding all baselines. On EgoPlan, PhysBrain-8B reaches 47.4/46.9 on Benchmark1/Benchmark2, outperforming Qwen3-VL-8B by +3.1/+6.4 points. When fine-tuned as PhysVLA for robotic control, PhysBrain-8B achieves a 67.4% success rate on SimplerEnv, comparable to the state-of-the-art RoboBrain2.5 (67.6%) but without requiring massive cross-embodiment robot data. Real-world validation on a Franka Research 3 further confirms effectiveness (20/30 vs. 16/30 for the baseline). These results validate that large-scale human egocentric data effectively bridges VLMs to physical intelligence.
The per-task columns (Act. through Reas.) belong to the EgoThink benchmark; its overall average is reported in the final column.
| Method | EgoPlan-B1 Acc. | EgoPlan-B2 Acc. | Act. | Fore. | Loc. | Obj. | Asst. | Nav. | Reas. | EgoThink Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| General VLM | ||||||||||
| GPT-4o (Hurst et al., 2024a) | 39.5 | 41.0 | 73.0 | 66.0 | 89.0 | 78.6 | 28.0 | 12.0 | 66.0 | 66.4 |
| MiniGPT-4-7B (Zhu et al., 2023) | 28.1 | 24.5 | 45.5 | 36.5 | 61.5 | 48.0 | 30.0 | 12.0 | 36.7 | 41.6 |
| LLaVA-1.5-7B (Liu et al., 2024) | 27.8 | 25.4 | 35.0 | 43.5 | 76.0 | 65.3 | 33.0 | 26.0 | 53.0 | 51.6 |
| LLaMA-3.2-11B (Dubey et al., 2024) | 24.3 | 25.1 | 34.0 | 49.5 | 57.5 | 62.7 | 42.0 | 22.0 | 47.7 | 48.4 |
| Qwen-3-VL-4B (Yang et al., 2025a) | 42.2 | 34.6 | 63.5 | 65.0 | 82.5 | 72.6 | 46.0 | 35.0 | 71.0 | 66.7 |
| Qwen-3-VL-8B (Yang et al., 2025a) | 44.3 | 40.5 | 68.0 | 66.5 | 86.0 | 72.3 | 41.0 | 39.0 | 61.7 | 65.9 |
| Embodied Brain | ||||||||||
| VST-RL-7B (Yang et al., 2025b) | 40.8 | 28.7 | 55.0 | 56.5 | 69.5 | 67.3 | 15.0 | 22.0 | 62.3 | 56.2 |
| RoboBrain2.0-7B (Team et al., 2025a) | 38.6 | 23.3 | 35.0 | 47.0 | 77.5 | 60.7 | 44.0 | 38.0 | 52.3 | 52.8 |
| RoboBrain2.5-8B (Tan et al., 2026) | 45.9 | 45.2 | 57.5 | 56.5 | 81.0 | 70.3 | 40.0 | 28.0 | 68.3 | 62.4 |
| PhysBrain-4B (ours) | 43.9 | 39.3 | 68.0 | 64.5 | 85.5 | 76.3 | 66.0 | 44.0 | 66.0 | 69.4 |
| PhysBrain-8B (ours) | 47.4 | 46.9 | 69.0 | 69.0 | 86.5 | 76.0 | 65.0 | 42.0 | 64.0 | 69.7 |
Results of evaluating the VLA models with the WidowX robot in the SimplerEnv simulation environment. PhysBrain is fine-tuned under the VLA paradigm following the PhysVLA architecture, where the VLM backbone provides high-level representations to a Flow-Matching action expert.
| Method | Put Spoon on Towel | Put Carrot on Plate | Stack Green Block on Yellow Block | Put Eggplant in Yellow Basket | Average |
|---|---|---|---|---|---|
| VLA Baselines | |||||
| RT-1-X (O'Neill et al., 2024) | 0.0 | 4.2 | 0.0 | 0.0 | 1.1 |
| Octo-Base (Team et al., 2024) | 15.8 | 12.5 | 0.0 | 41.7 | 17.5 |
| Octo-Small (Team et al., 2024) | 41.7 | 8.2 | 0.0 | 56.7 | 26.7 |
| OpenVLA (Kim et al., 2024) | 4.2 | 0.0 | 0.0 | 12.5 | 4.2 |
| OpenVLA-OFT (Kim et al., 2025) | 12.5 | 4.2 | 4.2 | 72.5 | 23.4 |
| RoboVLM (Li et al., 2024b) | 50.0 | 37.5 | 0.0 | 83.3 | 42.7 |
| TraceVLA (Zheng et al., 2025) | 12.5 | 16.6 | 16.6 | 65.0 | 27.7 |
| SpatialVLA (Qu et al., 2025) | 20.8 | 20.8 | 25.0 | 70.8 | 34.4 |
| CogACT (Li et al., 2024) | 71.7 | 50.8 | 15.0 | 67.5 | 51.3 |
| VideoVLA (Shen et al., 2025) | 75.0 | 20.8 | 45.8 | 70.8 | 53.1 |
| π0 (Black et al., 2024) | 29.1 | 0.0 | 16.6 | 62.5 | 27.1 |
| π0.5 (Black et al., 2025) | 49.3 | 64.7 | 44.7 | 69.7 | 57.1 |
| Isaac-GR00T-N1.6-Bridge (Team et al., 2025b) | 64.5 | 65.5 | 5.5 | 93.0 | 57.1 |
| VLM Baselines | |||||
| Qwen2.5-VL-7B (Bai et al., 2025b) | 68.7 | 35.4 | 25.0 | 75.0 | 51.0 |
| Qwen3-VL-4B (Yang et al., 2025a) | 87.5 | 50.0 | 29.2 | 54.2 | 55.2 |
| Qwen3-VL-8B (Yang et al., 2025a) | 68.7 | 38.5 | 30.2 | 87.9 | 56.3 |
| VST-RL-7B (Yang et al., 2025b) | 57.7 | 41.7 | 16.7 | 50.0 | 41.3 |
| RoboBrain2.0-7B (Team et al., 2025a) | 30.8 | 24.7 | 2.5 | 93.3 | 37.8 |
| RoboBrain2.5-8B (Tan et al., 2026) | 75.0 | 55.5 | 40.1 | 100.0 | 67.6 |
| PhysBrain-4B (ours) | 90.3 | 58.3 | 34.7 | 80.6 | 65.9 |
| PhysBrain-8B (ours) | 77.8 | 62.5 | 34.8 | 94.8 | 67.4 |
Results of evaluating the VLA models with the GR1 robot in the RoboCasa Tabletop simulation environment. PhysBrain is fine-tuned under the VLA paradigm following the PhysVLA architecture, demonstrating consistent improvements across 24 complex manipulation tasks.
| Task | Isaac-GR00T N1.6 | QwenGR00T + Qwen3VL | QwenOFT + Qwen3VL | QwenFAST + Qwen3VL | PhysBrain-4B | PhysBrain-8B |
|---|---|---|---|---|---|---|
| PnP Bottle To Cabinet Close | 51.5 | 46.0 | 30.0 | 38.0 | 74.0 | 70.0 |
| PnP Can To Drawer Close | 13.0 | 80.0 | 76.0 | 44.0 | 68.0 | 74.0 |
| PnP Cup To Drawer Close | 8.5 | 54.0 | 44.0 | 56.0 | 42.0 | 46.0 |
| PnP Milk To Microwave Close | 14.0 | 48.0 | 44.0 | 44.0 | 54.0 | 60.0 |
| PnP Potato To Microwave Close | 41.5 | 28.0 | 32.0 | 14.0 | 24.0 | 34.0 |
| PnP Wine To Cabinet Close | 16.5 | 46.0 | 36.0 | 14.0 | 54.0 | 40.0 |
| PnP Novel From Cuttingboard To Basket | 58.0 | 48.0 | 50.0 | 54.0 | 62.0 | 54.0 |
| PnP Novel From Cuttingboard To Cardboardbox | 46.5 | 40.0 | 40.0 | 42.0 | 44.0 | 56.0 |
| PnP Novel From Cuttingboard To Pan | 68.5 | 68.0 | 70.0 | 58.0 | 56.0 | 72.0 |
| PnP Novel From Cuttingboard To Pot | 65.0 | 52.0 | 54.0 | 58.0 | 58.0 | 74.0 |
| PnP Novel From Cuttingboard To Tieredbasket | 46.5 | 56.0 | 38.0 | 40.0 | 40.0 | 44.0 |
| PnP Novel From Placemat To Basket | 58.5 | 42.0 | 32.0 | 36.0 | 42.0 | 58.0 |
| PnP Novel From Placemat To Bowl | 57.5 | 44.0 | 58.0 | 38.0 | 56.0 | 56.0 |
| PnP Novel From Placemat To Plate | 63.0 | 48.0 | 52.0 | 42.0 | 80.0 | 62.0 |
| PnP Novel From Placemat To Tieredshelf | 28.5 | 18.0 | 24.0 | 18.0 | 14.0 | 28.0 |
| PnP Novel From Plate To Bowl | 57.0 | 60.0 | 60.0 | 52.0 | 54.0 | 70.0 |
| PnP Novel From Plate To Cardboardbox | 43.5 | 50.0 | 50.0 | 30.0 | 50.0 | 54.0 |
| PnP Novel From Plate To Pan | 51.0 | 54.0 | 66.0 | 48.0 | 68.0 | 56.0 |
| PnP Novel From Plate To Plate | 78.7 | 70.0 | 68.0 | 50.0 | 78.0 | 60.0 |
| PnP Novel From Tray To Cardboardbox | 51.5 | 38.0 | 44.0 | 28.0 | 40.0 | 52.0 |
| PnP Novel From Tray To Plate | 71.0 | 56.0 | 56.0 | 34.0 | 66.0 | 60.0 |
| PnP Novel From Tray To Pot | 64.5 | 50.0 | 62.0 | 46.0 | 52.0 | 70.0 |
| PnP Novel From Tray To Tieredbasket | 57.0 | 36.0 | 54.0 | 36.0 | 50.0 | 48.0 |
| PnP Novel From Tray To Tieredshelf | 31.5 | 16.0 | 30.0 | 16.0 | 22.0 | 28.0 |
| Average | 47.6 | 47.8 | 48.8 | 39.0 | 49.75 | 55.25 |