ActiveMimic

Egocentric Video Pretraining with Active Perception

¹Fudan University, ²Shanghai Innovation Institute, ³Current Robotics

Abstract

Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.

Video

Overview of ActiveMimic

Active Perception as the Missing Signal

ActiveMimic treats egocentric camera motion as a viewpoint action rather than noise, allowing the model to learn how humans actively reposition their view while manipulating objects. During everyday tasks, humans continuously adjust their gaze through head and body movements to gather visual evidence before acting. ActiveMimic captures this behavior by extracting both camera motion and wrist action from in-the-wild egocentric video and encoding them into a unified 27-dimensional action representation. This unified representation enables the model to jointly learn the coupled dynamics of viewpoint motion and manipulation. After pretraining, the acquired active perception capability transfers to a real-world humanoid robot, which actively repositions its viewpoint during task execution.
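As a rough illustration of what a unified 27-dimensional action could look like, the sketch below packs three poses (camera, left wrist, right wrist) into one flat vector. The layout is an assumption made only for this example: each pose contributes 3 translation values plus a 6D rotation (the first two columns of its rotation matrix), giving 3 × 9 = 27 values. The page does not specify the actual composition of the action vector, and the names `pose_to_9d` and `pack_action` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector3 = Sequence[float]
Matrix3 = Sequence[Sequence[float]]
Pose = Tuple[Vector3, Matrix3]  # (translation, 3x3 rotation matrix)


def pose_to_9d(translation: Vector3, rotation: Matrix3) -> List[float]:
    """Encode one pose as 3 translation values plus a 6D rotation.

    ASSUMPTION: the 6D rotation is the first two columns of the rotation
    matrix; the source does not state which rotation encoding is used.
    """
    col0 = [rotation[r][0] for r in range(3)]
    col1 = [rotation[r][1] for r in range(3)]
    return list(translation) + col0 + col1


def pack_action(camera: Pose, left_wrist: Pose, right_wrist: Pose) -> List[float]:
    """Concatenate camera and wrist poses into one flat action vector.

    ASSUMPTION: the ordering (camera, left wrist, right wrist) is
    hypothetical and chosen only for illustration.
    """
    action: List[float] = []
    for translation, rotation in (camera, left_wrist, right_wrist):
        action.extend(pose_to_9d(translation, rotation))
    if len(action) != 27:
        raise ValueError(f"expected 27 values, got {len(action)}")
    return action


IDENTITY: Matrix3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

example_action = pack_action(
    camera=([0.1, 0.0, 0.0], IDENTITY),       # viewpoint action
    left_wrist=([0.0, 0.2, 0.0], IDENTITY),   # manipulation action
    right_wrist=([0.0, 0.0, 0.3], IDENTITY),
)
```

Placing the viewpoint and wrist components in a single vector is what lets one prediction head learn their coupled dynamics, rather than treating camera motion as a separate or discarded signal.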

From Egocentric Video to Robot Control

From a single body-worn RGB camera, ActiveMimic recovers synchronized camera and wrist trajectories using off-the-shelf vision models, without requiring any additional sensors or controlled capture conditions. Camera motion and wrist motion are inherently coupled in egocentric video, because wrist poses are observed from a camera that is itself moving. ActiveMimic resolves this coupling by re-expressing all poses in a common reference frame and encoding them as a unified 27-dimensional action vector. A model is then pretrained on this structured action to jointly predict future viewpoint and wrist motion from egocentric observations. Finally, the pretrained model is adapted to the target robot with robot-specific demonstrations, aligning pretraining and deployment around the same underlying behavior: actively acquiring visual information to support manipulation.
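The idea of re-expressing poses in a common reference frame can be sketched with plain rigid-transform algebra. The example below assumes, purely for illustration, that the reference frame is the camera pose at some earlier time step, that camera poses are available in an arbitrary world frame, and that wrist poses are estimated relative to the current camera. None of these choices is confirmed by the page, and the function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # 4x4 homogeneous rigid transform


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t: Matrix) -> Matrix:
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[c][r] for c in range(3)] for r in range(3)]
    p = [t[r][3] for r in range(3)]
    new_p = [-sum(r_t[r][k] * p[k] for k in range(3)) for r in range(3)]
    return [r_t[0] + [new_p[0]],
            r_t[1] + [new_p[1]],
            r_t[2] + [new_p[2]],
            [0.0, 0.0, 0.0, 1.0]]


def to_reference_frame(world_from_ref: Matrix,
                       world_from_cam: Matrix,
                       cam_from_wrist: Matrix) -> Tuple[Matrix, Matrix]:
    """Express the current camera and wrist poses in the reference frame.

    ASSUMPTION: the reference frame is an earlier camera pose; the source
    only says that all poses share a common reference frame.
    """
    ref_from_world = invert_rigid(world_from_ref)
    ref_from_cam = matmul(ref_from_world, world_from_cam)    # viewpoint motion
    ref_from_wrist = matmul(ref_from_cam, cam_from_wrist)    # wrist motion
    return ref_from_cam, ref_from_wrist


def translation(x: float, y: float, z: float) -> Matrix:
    """Build a pure-translation transform (helper for the example)."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


# The camera rises 0.5 along z while the wrist sits 0.2 along y in front of it.
ref_from_cam, ref_from_wrist = to_reference_frame(
    world_from_ref=translation(1.0, 0.0, 0.0),
    world_from_cam=translation(1.0, 0.0, 0.5),
    cam_from_wrist=translation(0.0, 0.2, 0.0),
)
```

Once both poses live in the same frame, the wrist trajectory no longer changes merely because the camera moved, and the camera's own displacement becomes an explicit quantity that can be supervised as a viewpoint action.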

Experiments

Task 1: Restocking

The robot crouches to pick up a water bottle from the table, stands and looks up to scan the shelf, and then places the bottle into an empty slot.

Task 2: Reaching

The robot reaches over an obstacle to grasp a target object, requiring coordinated head, torso, and arm motion.

Task 3: Finding

The robot actively searches for the target object by moving its viewpoint before selecting the correct arm to grasp it.

Task 4: Pouring

The robot coordinates both arms to transfer liquid between two containers while maintaining close-range visual feedback.

Real-world Results

ActiveMimic surpasses all baselines across the four real-world tasks, achieving 90.1% on Restocking, 88.9% on Reaching, 91.7% on Finding, and 93.3% on Pouring. The gains over wrist-only and SFT-only variants show that camera motion supervision is the key differentiating signal, while the strong results against pi0 suggest that egocentric video pretraining can match robot-data pretraining and provide unique active-perception advantages.

Conclusion

We introduce ActiveMimic, an active-perception-aware pretraining framework for in-the-wild egocentric video. Across real-world tasks, ActiveMimic consistently surpasses baselines pretrained on human video, confirming active perception as the key missing ingredient for unlocking egocentric human video pretraining. We further provide evidence that active perception originates from egocentric pretraining and that camera motion supervision facilitates representational transfer from human perception to robot control.