Zhen Wu 「吴臻」

| Email | LinkedIn |

I am an Applied Scientist Intern at Amazon FAR (Frontier AI & Robotics), working with Angjoo Kanazawa, Guanya Shi, Rocky Duan, and Pieter Abbeel. Prior to that, I earned my Master's degree from Stanford University, advised by C. Karen Liu. I earned my Bachelor's degree from Peking University.

My research interests are in humanoid robotics and character animation. I am particularly interested in how humans and robots perceive and interact with the world, with a focus on achieving human-level agility and dexterity.

I'm always open to discussion and collaboration—feel free to drop me an email if you're interested.

Email: zhenwu [AT] stanford.edu


  News
  • [09/2025] Learning to Ball has been accepted to SIGGRAPH Asia 2025 (Journal Track)!
  • [07/2025] The code for HIHI has been released in this repo.
  • [06/2025] HIHI has been accepted to ICCV 2025! See you in Hawaii!

  Publications

OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction
Lujie Yang*, Xiaoyu Huang*, Zhen Wu*, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C. Karen Liu, Rocky Duan, Guanya Shi
In Submission

webpage | abstract | bibtex | dataset

A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning (RL) policies. However, existing retargeting pipelines often struggle with the significant embodiment gap between humans and robots, producing physically implausible artifacts like foot-skating and penetration. More importantly, common retargeting methods neglect the rich human-object and human-environment interactions essential for expressive locomotion and loco-manipulation. To address this, we introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh that explicitly models and preserves the crucial spatial and contact relationships between an agent, the terrain, and manipulated objects. By minimizing the Laplacian deformation between the human and robot meshes while enforcing kinematic constraints, OmniRetarget generates kinematically feasible trajectories. Moreover, preserving task-relevant interactions enables efficient data augmentation, from a single demonstration to different robot embodiments, terrains, and object configurations. We comprehensively evaluate OmniRetarget by retargeting motions from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 9 hours of trajectories that achieve better kinematic constraint satisfaction and contact preservation than widely used baselines. Such high-quality data enables proprioceptive RL policies to successfully execute long-horizon (up to 30 seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained with only 5 reward terms and simple domain randomization shared by all tasks, without any learning curriculum.
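The Laplacian-deformation idea mentioned in the abstract follows the classic Laplacian mesh editing recipe: keep each vertex's differential (Laplacian) coordinates close to the source mesh while pinning a few vertices to new positions. The sketch below is a minimal, generic illustration of that recipe on a toy 4-vertex mesh, not the OmniRetarget implementation; the mesh, neighbor graph, constraint targets, and penalty weight are all invented for the example.

```python
import numpy as np

# Toy "mesh": 4 vertices of a unit square, each connected to two neighbors.
V = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

# Uniform graph Laplacian: L = I - D^{-1} A.
n = len(V)
L = np.eye(n)
for i, nbrs in neighbors.items():
    for j in nbrs:
        L[i, j] -= 1.0 / len(nbrs)
delta = L @ V  # differential (Laplacian) coordinates of the source mesh

# Pin two vertices to new positions (hypothetical retargeting constraints).
constrained = {0: np.array([0.0, 0.0]), 2: np.array([2.0, 2.0])}

# Solve min ||L V' - delta||^2 subject to the pins, here via soft penalty rows.
w = 1000.0
A_rows, b_rows = [L], [delta]
for i, p in constrained.items():
    row = np.zeros((1, n))
    row[0, i] = w
    A_rows.append(row)
    b_rows.append(w * p[None, :])
A = np.vstack(A_rows)
b = np.vstack(b_rows)
V_new, *_ = np.linalg.lstsq(A, b, rcond=None)  # deformed vertex positions
```

The unconstrained vertices move to keep the local shape (the Laplacian coordinates) as close to the original as the pins allow; a real retargeting pipeline would add kinematic feasibility constraints on top of this least-squares core.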

  

Learning to Ball: Composing Policies for Long-Horizon Basketball Moves
Pei Xu, Zhen Wu, Ruocheng Wang, Vishnu Sarukkai, Kayvon Fatahalian, Ioannis Karamouzas, Victor Zordan, C. Karen Liu
ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2025)

webpage | pdf | abstract | bibtex | code | bilibili

Learning a control policy for a multi-phase, long-horizon task, such as basketball maneuvers, remains challenging for reinforcement learning approaches due to the need for seamless policy composition and transitions between skills. A long-horizon task typically consists of distinct subtasks with well-defined goals, separated by transitional subtasks with unclear goals but critical to the success of the entire task. Existing methods like the mixture of experts and skill chaining struggle with tasks where individual policies do not share significant commonly explored states or lack well-defined initial and terminal states between different phases. In this paper, we introduce a novel policy integration framework to enable the composition of drastically different motor skills in multi-phase long-horizon tasks with ill-defined intermediate states. Based on that, we further introduce a high-level soft router to enable seamless and robust transitions between the subtasks. We evaluate our framework on a set of fundamental basketball skills and challenging transitions. Policies trained by our approach can effectively control the simulated character to interact with the ball and accomplish the long-horizon task specified by real-time user commands, without relying on ball trajectory references.
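The "high-level soft router" described above can be pictured as a gating network that blends the actions of several skill policies with softmax weights, so handoffs between skills are gradual rather than hard switches. The sketch below is a generic illustration of that pattern, not the paper's actual architecture; the two skill policies and the router are stubbed with simple placeholder functions, and all names are invented.

```python
import numpy as np

def dribble_policy(state):
    # Stand-in for a trained low-level skill policy (4-D action).
    return np.tanh(state[:4])

def shoot_policy(state):
    # Stand-in for a second, very different skill policy.
    return np.tanh(-state[:4])

def router_logits(state):
    # Stand-in for the learned high-level gating network.
    return np.array([state.sum(), -state.sum()])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def act(state):
    w = softmax(router_logits(state))            # soft routing weights, sum to 1
    actions = np.stack([dribble_policy(state),   # one candidate action per skill
                        shoot_policy(state)])
    return w @ actions                           # weighted blend of skill actions

rng = np.random.default_rng(0)
state = rng.standard_normal(8)
action = act(state)
```

Because the routing weights vary smoothly with the state, the blended action interpolates between skills near a transition instead of jumping, which is the intuition behind a "soft" router.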

@article{basketball,
  author    = {Xu, Pei and Wu, Zhen and Wang, Ruocheng and Sarukkai, Vishnu and Fatahalian, Kayvon and Karamouzas, Ioannis and Zordan, Victor and Liu, C. Karen},
  title     = {Learning to Ball: Composing Policies for Long-Horizon Basketball Moves},
  journal   = {ACM Transactions on Graphics},
  publisher = {ACM New York, NY, USA},
  year      = {2025},
  volume    = {44},
  number    = {6},
  doi       = {10.1145/3763367}
}
  

Human-Object Interaction from Human-Level Instructions
Zhen Wu, Jiaman Li, Pei Xu, C. Karen Liu
ICCV 2025

webpage | pdf | abstract | bibtex | code

Intelligent agents must autonomously interact with their environments to perform daily tasks based on human-level instructions. They need a foundational understanding of the world to accurately interpret these instructions, along with precise low-level movement and interaction skills to execute the derived actions. In this work, we propose the first complete system for synthesizing physically plausible, long-horizon human-object interactions for object manipulation in contextual environments, driven by human-level instructions. We leverage large language models (LLMs) to interpret the input instructions into detailed execution plans. Unlike prior work, our system is capable of generating detailed finger-object interactions, in seamless coordination with full-body movements. We also train a policy to track generated motions in physics simulation via reinforcement learning (RL) to ensure physical plausibility of the motion. Our experiments demonstrate the effectiveness of our system in synthesizing realistic interactions with diverse objects in complex environments, highlighting its potential for real-world applications.
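The "instruction to execution plan" stage described above can be pictured as turning a natural-language command into a structured list of steps that a downstream motion-synthesis module consumes. The sketch below only illustrates that data flow: the real system uses an LLM to produce the plan, whereas here the LLM call is stubbed with a fixed plan for one example instruction, and every name (`Step`, `plan_from_instruction`, the action vocabulary) is invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str      # e.g. "walk_to", "grasp", "place" (illustrative vocabulary)
    target: str      # object or location the step refers to
    hand: str = "right"

def plan_from_instruction(instruction: str) -> list[Step]:
    # Stand-in for the LLM interpretation step: returns a hand-written
    # plan for one example instruction instead of calling a model.
    if "mug" in instruction and "shelf" in instruction:
        return [
            Step("walk_to", "table"),
            Step("grasp", "mug"),
            Step("walk_to", "shelf"),
            Step("place", "shelf"),
        ]
    return []

plan = plan_from_instruction("Put the mug on the shelf")
```

Each `Step` would then drive full-body and finger motion generation, with an RL tracking policy enforcing physical plausibility in simulation.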

@article{wu2024human,
  title   = {Human-Object Interaction from Human-Level Instructions},
  author  = {Wu, Zhen and Li, Jiaman and Xu, Pei and Liu, C. Karen},
  journal = {arXiv preprint arXiv:2406.17840},
  year    = {2024}
}
  

Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors
Yuke Lou*, Yiming Wang*, Zhen Wu, Rui Zhao, Wenjia Wang, Mingyi Shi, Taku Komura
In Submission

pdf | abstract | bibtex

Human-object interaction (HOI) synthesis is important for various applications, ranging from virtual reality to robotics. However, acquiring 3D HOI data is challenging due to its complexity and high cost, limiting existing methods to the narrow diversity of object types and interaction patterns in training datasets. This paper proposes a novel zero-shot HOI synthesis framework without relying on end-to-end training on currently limited 3D HOI datasets. The core idea of our method lies in leveraging extensive HOI knowledge from pre-trained Multimodal Models. Given a text description, our system first obtains temporally consistent 2D HOI image sequences using image or video generation models, which are then uplifted to 3D HOI milestones of human and object poses. We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images. Our estimation method is adaptive to various object templates obtained from text-to-3D models or online retrieval. A physics-based tracking of the 3D HOI kinematic milestone is further applied to refine both body motions and object poses, yielding more physically plausible HOI generation results. The experimental results demonstrate that our method is capable of generating open-vocabulary HOIs with physical realism and semantic diversity.

@article{lou2025zero,
  title   = {Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors},
  author  = {Lou, Yuke and Wang, Yiming and Wu, Zhen and Zhao, Rui and Wang, Wenjia and Shi, Mingyi and Komura, Taku},
  journal = {arXiv preprint arXiv:2503.20118},
  year    = {2025}
}
  

  Teaching
CS224R: Deep Reinforcement Learning - Spring 2025
CS229: Machine Learning - Winter 2025
CS248B: Fundamentals of Computer Graphics: Animation and Simulation - Fall 2025



Website template from here and here