en.Wedoany.com Reported - A new study proposes a generative pipeline for training humanoid robots in mobile manipulation. The pipeline generates large-scale paired training data without manual annotation.
To achieve perception-driven mobile manipulation, humanoid robots need to link their own observations and task instructions with whole-body motion. Learning this mapping requires synchronized egocentric images, language commands, and robot-compatible kinematic trajectories, but existing data sources cannot provide such complete tuples at scale. The research team addresses this bottleneck by synthetically generating vision-language-kinematic (VLK) supervision in reconstructed scenes.
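As a rough illustration of what one such tuple could look like in code, the sketch below defines a hypothetical `VLKSample` container. The class name, field names, and the `horizon_targets` helper are assumptions made for this example and are not taken from the study; images are represented by string identifiers and poses by plain lists of floats to keep the sketch self-contained.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class VLKSample:
    """Hypothetical vision-language-kinematic tuple (names are illustrative only)."""
    instruction: str                  # language command for the task
    frames: Sequence[str]             # egocentric image identifiers, one per timestep
    poses: Sequence[Sequence[float]]  # whole-body kinematic state, one per timestep

    def __post_init__(self) -> None:
        if not self.instruction.strip():
            raise ValueError("instruction must be non-empty")
        # "Synchronized" is modeled here simply as one pose per frame.
        if len(self.frames) != len(self.poses):
            raise ValueError("frames and poses must have the same length")
        if len({len(p) for p in self.poses}) > 1:
            raise ValueError("every pose must have the same dimensionality")


def horizon_targets(sample: VLKSample, t: int, horizon: int) -> List[Sequence[float]]:
    """Return up to `horizon` poses that follow timestep `t`.

    This mimics a short-horizon prediction target; the result is shorter
    than `horizon` near the end of the trajectory.
    """
    return list(sample.poses[t + 1 : t + 1 + horizon])
```

Rejecting mismatched lengths at construction time is one simple way to guarantee that every image has a corresponding kinematic state; a real dataset would likely carry timestamps and image tensors instead.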
The pipeline uses 3D Gaussian Splatting to reconstruct indoor environments at metric scale, draws on privileged scene information to synthesize navigation and object-interaction trajectories, and renders the paired egocentric observations post hoc. Without human intervention, the researchers generated 48,000 paired trajectories and trained a VLK policy that predicts short-horizon whole-body kinematic trajectories. A whole-body tracker then converts these predictions into actions on a physical humanoid robot.
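The order of operations described above (look up privileged scene information, synthesize a trajectory, then render observations afterwards) can be sketched with toy stand-ins. Everything in this example is hypothetical: `Scene`, `synthesize_trajectory`, `render_egocentric`, and `generate_pair` are invented names, the straight-line planner is a placeholder for whatever trajectory synthesis the study actually uses, and the "renderer" returns a text label rather than an image from a Gaussian Splatting reconstruction.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class Scene:
    """Toy stand-in for a reconstructed scene with known object positions."""
    name: str
    objects: Dict[str, Point]  # privileged information: object name -> (x, y) in metres


def synthesize_trajectory(start: Point, goal: Point, steps: int) -> List[Point]:
    """Placeholder planner: evenly spaced waypoints on a straight line."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [
        (start[0] + (goal[0] - start[0]) * i / steps,
         start[1] + (goal[1] - start[1]) * i / steps)
        for i in range(steps + 1)
    ]


def render_egocentric(scene: Scene, position: Point) -> str:
    """Placeholder renderer: returns a label where a real system would return an image."""
    return f"{scene.name}@({position[0]:.2f},{position[1]:.2f})"


def generate_pair(scene: Scene, start: Point, target: str, steps: int = 4) -> dict:
    """Build one instruction/observation/trajectory record without manual labels."""
    goal = scene.objects[target]                      # privileged lookup, no perception
    trajectory = synthesize_trajectory(start, goal, steps)
    # Observations are rendered only after the trajectory exists (post hoc).
    frames = [render_egocentric(scene, p) for p in trajectory]
    return {
        "instruction": f"walk to the {target}",
        "frames": frames,
        "trajectory": trajectory,
    }
```

Because the trajectory is fixed before any image is produced, each rendered frame is paired with its kinematic state by construction, which is the property that removes the need for manual annotation in this toy version.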
To validate the method, the research team ran navigation and single-object transportation tasks on a physical Unitree G1 humanoid robot. The results show that synthetic interactions generated in reconstructed scenes can provide effective supervision for sim-to-real, perception-driven mobile manipulation on humanoid robots. The project website is publicly available.









