Goal
Enable MuJoCo agents to follow open-vocabulary natural-language instructions without vision by combining motion-language alignment with hierarchical reinforcement learning.
Research notes
High-level overview
Use a MotionGPT-style pipeline: a VQ-VAE tokenizes motion into discrete codes, and motion-language alignment connects text to those motion-token sequences.
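A minimal sketch of the tokenization step, assuming a PyTorch VQ-VAE with a 1D-convolutional encoder over per-frame pose features. The pose dimension, codebook size, and layer widths are placeholders, not MotionGPT's actual architecture, and the training losses (reconstruction, codebook, and commitment terms) are omitted:

```python
import torch
import torch.nn as nn

class MotionVQVAE(nn.Module):
    def __init__(self, pose_dim=69, latent_dim=64, codebook_size=512):
        super().__init__()
        # 1D convs over time; pose_dim is the per-frame feature size (assumed).
        self.encoder = nn.Sequential(
            nn.Conv1d(pose_dim, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, latent_dim, kernel_size=4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(128, pose_dim, kernel_size=4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def quantize(self, z):
        # z: (batch, latent_dim, time) -> nearest codebook entry per timestep.
        z_flat = z.permute(0, 2, 1).reshape(-1, z.shape[1])    # (B*T, D)
        dists = torch.cdist(z_flat, self.codebook.weight)      # (B*T, K)
        tokens = dists.argmin(dim=1)                           # discrete motion tokens
        z_q = self.codebook(tokens).view(z.shape[0], z.shape[2], -1).permute(0, 2, 1)
        # Straight-through estimator: gradients bypass the argmin.
        z_q = z + (z_q - z).detach()
        return z_q, tokens.view(z.shape[0], -1)

    def forward(self, motion):
        # motion: (batch, time, pose_dim) mocap clip.
        z = self.encoder(motion.permute(0, 2, 1))
        z_q, tokens = self.quantize(z)
        recon = self.decoder(z_q).permute(0, 2, 1)
        return recon, tokens

clip = torch.randn(2, 64, 69)          # two 64-frame mocap clips (assumed shape)
recon, tokens = MotionVQVAE()(clip)    # tokens: (2, 16) discrete motion codes
```

The resulting token sequences are what the language model is aligned against, so a text instruction can be decoded into (or scored against) motion tokens rather than raw poses.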
A high-level policy selects skills conditioned on the language instruction; a low-level controller executes atomic motions learned from mocap data.
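A rough sketch of that two-level decomposition, assuming a frozen language encoder supplies text_emb, a discrete skill set, and proprioceptive state only. All sizes and network shapes here are illustrative placeholders, not a specific published design:

```python
import torch
import torch.nn as nn

class HighLevelPolicy(nn.Module):
    """Selects a discrete skill from the instruction embedding + state."""
    def __init__(self, text_dim=512, state_dim=45, num_skills=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + state_dim, 256), nn.ReLU(),
            nn.Linear(256, num_skills),
        )

    def forward(self, text_emb, state):
        logits = self.net(torch.cat([text_emb, state], dim=-1))
        return torch.distributions.Categorical(logits=logits)

class LowLevelController(nn.Module):
    """Executes one atomic motion: maps (state, skill) to joint torques."""
    def __init__(self, state_dim=45, num_skills=32, action_dim=17):
        super().__init__()
        self.skill_emb = nn.Embedding(num_skills, 64)
        self.net = nn.Sequential(
            nn.Linear(state_dim + 64, 256), nn.Tanh(),
            nn.Linear(256, action_dim),
        )

    def forward(self, state, skill):
        return self.net(torch.cat([state, self.skill_emb(skill)], dim=-1))

# One control step; the high level would re-plan every k low-level steps.
text_emb = torch.randn(1, 512)   # frozen language-encoder output (assumed)
state = torch.randn(1, 45)       # proprioception only, no vision
skill = HighLevelPolicy()(text_emb, state).sample()
action = LowLevelController()(state, skill)
```

Temporal abstraction is the point of the split: the low level can be pretrained on mocap before language enters, and the high level then only has to learn a mapping from instructions to a small skill vocabulary.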
Evaluate on the MuJoCo Humanoid, HalfCheetah, and Ant environments plus manipulation tasks, using the motion-language alignment score as the reward signal.
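One plausible form of that reward, assuming a CLIP-style shared text/motion embedding space: cosine similarity between the fixed instruction embedding and an encoding of the agent's recent motion window. The MotionEncoder below is a stand-in; a trained one would come out of the alignment stage above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionEncoder(nn.Module):
    """Stand-in: mean-pools a pose window into the shared embedding space
    (a real encoder would be trained jointly with the text encoder)."""
    def __init__(self, pose_dim=69, emb_dim=512):
        super().__init__()
        self.proj = nn.Linear(pose_dim, emb_dim)

    def forward(self, pose_window):        # (batch, time, pose_dim)
        return self.proj(pose_window).mean(dim=1)

def alignment_reward(motion_encoder, text_emb, pose_window):
    # Dense reward: cosine similarity between the instruction embedding
    # and the embedding of the agent's most recent motion window.
    with torch.no_grad():
        motion_emb = motion_encoder(pose_window)
    return F.cosine_similarity(motion_emb, text_emb, dim=-1)

# Each env step: score a sliding window of recent poses against the
# fixed instruction embedding.
enc = MotionEncoder()
text_emb = torch.randn(1, 512)       # e.g. an embedded "walk forward slowly" (assumed)
recent_poses = torch.randn(1, 30, 69)  # last 30 frames of proprioception
reward = alignment_reward(enc, text_emb, recent_poses)
```

Because the similarity is computed over a sliding window rather than a full episode, it gives the RL agent a dense, shaped signal instead of a single terminal score.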