Problem
Many language-conditioned control methods rely on rendering the simulator and computing rewards with vision-language models (e.g., CLIP). That render-to-CLIP pipeline is expensive and makes real-time human-in-the-loop control impractical.
Research paper
Replacing visual rewards with motion-language alignment to enable fast, real-time instruction following in MuJoCo, with no vision pipeline at all.
What the project is about (and what it achieved).
Compute reward directly from joint trajectories: use motion-language similarity as the reward signal, bypassing visual rendering entirely.
Convert MuJoCo motion features to a HumanML3D-style representation and score how well the motion matches the instruction using MotionGPT’s pretrained motion encoder.
Train locomotion policies with PPO using a simple hierarchy: a high-level policy outputs target joint positions, and a low-level PD controller executes stable torques.
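The reward computation in the steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_encoder` is a hypothetical stand-in for MotionGPT's pretrained motion encoder, and the HumanML3D-style feature conversion is omitted.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-8
    return float(np.dot(a, b) / denom)

def motion_language_reward(joint_traj, text_emb, encode_motion):
    """Score a joint trajectory against an instruction embedding.

    joint_traj:    (T, D) array of per-step motion features.
    text_emb:      (E,) embedding of the language instruction.
    encode_motion: maps the trajectory to an (E,) motion embedding
                   (MotionGPT's pretrained encoder in the project).
    """
    motion_emb = encode_motion(joint_traj)
    return cosine_similarity(motion_emb, text_emb)

def toy_encoder(traj):
    """Hypothetical stand-in encoder: mean-pool features over time."""
    return traj.mean(axis=0)
```

Because no frames are rendered and no vision model runs, each reward reduces to one encoder forward pass plus a dot product, which is what makes sub-millisecond reward times plausible.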
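The low-level controller can be sketched as a standard PD law. The gains and the unit-mass toy dynamics below are illustrative assumptions, not the project's tuned values:

```python
import numpy as np

def pd_torques(q, qdot, q_target, kp=50.0, kd=10.0):
    """PD control: pull joints toward target positions, damp velocity."""
    return kp * (q_target - q) - kd * qdot

def settle(q_target, steps=2000, dt=0.002):
    """Integrate a unit-mass joint under PD control (semi-implicit Euler)."""
    q = np.zeros_like(q_target)
    qdot = np.zeros_like(q_target)
    for _ in range(steps):
        tau = pd_torques(q, qdot, q_target)
        qdot = qdot + tau * dt   # unit mass: acceleration equals torque
        q = q + qdot * dt
    return q
```

In the hierarchy, the high-level PPO policy emits `q_target` while this loop runs at the simulator rate, which keeps the executed torques stable.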
Motion-language rewards compute in ~0.52 ms (≈1,938 rewards/s) versus ~14.85 ms for CLIP with rendering: roughly a 28× speedup, with ~32.5× lower GPU memory usage.
The speedup enables live instruction following: users can issue natural language commands and observe immediate agent responses at interactive rates.
Policies trained on a single instruction retain high motion-language similarity on paraphrases such as “sprint forward” and “dash forward”, indicating semantic generalization rather than rote memorization of one command.
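A minimal way to probe this paraphrase robustness is to compare one motion embedding against several instruction embeddings. The vectors below are toy stand-ins for MotionGPT text and motion features, chosen so that paraphrases point in nearly the same direction:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank_instructions(motion_emb, text_embs):
    """Return instruction names sorted by similarity to the motion, best first."""
    scored = {name: cosine(motion_emb, emb) for name, emb in text_embs.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Toy embeddings: the two "forward" paraphrases are nearly parallel.
motion = np.array([1.0, 0.1, 0.0])
texts = {
    "run forward":  np.array([0.9, 0.2, 0.0]),
    "dash forward": np.array([0.95, 0.15, 0.1]),
    "stand still":  np.array([0.0, 0.0, 1.0]),
}
```

If the trained motion scores high against both paraphrases but low against an unrelated command, the policy's behavior tracks instruction meaning rather than surface wording.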
Tested on Humanoid, Ant, HalfCheetah, Walker2d, and Hopper. Alignment is strong for planar and quadruped locomotion; Humanoid and Hopper are limited by stability and morphology mismatch.