My research interests broadly lie in computer vision, machine learning, and robotics. In particular, my current research focuses on Human-Centered 3D Vision, Generative Models, and Embodied AI.
I am looking for collaborators and friends. Feel free to contact me if you are interested in these fantastic topics!
arXiv 2505
One Policy but Many Worlds: A Scalable Unified Policy for Versatile Humanoid Locomotion
Yahao Fan, Tianxiang Gui, Kaiyang Ji, and 6 more authors
Humanoid locomotion faces a critical scalability challenge: traditional reinforcement learning (RL) methods require task-specific rewards and struggle to leverage growing datasets, even as more training terrains are introduced. We propose DreamPolicy, a unified framework that enables a single policy to master diverse terrains and generalize zero-shot to unseen scenarios by systematically integrating offline data and diffusion-driven motion synthesis. At its core, DreamPolicy introduces Humanoid Motion Imagery (HMI): future state predictions synthesized by an autoregressive terrain-aware diffusion planner, trained on data curated by aggregating rollouts from specialized policies across distinct terrains. Unlike human motion datasets that require laborious retargeting, our data directly captures humanoid kinematics, enabling the diffusion planner to synthesize "dreamed" trajectories that encode terrain-specific physical constraints. These trajectories act as dynamic objectives for our HMI-conditioned policy, bypassing manual reward engineering and enabling cross-terrain generalization. Crucially, DreamPolicy addresses the scalability limitations of prior methods: while traditional RL fails to exploit growing datasets, our framework scales seamlessly with more offline data. As the dataset expands, the diffusion prior learns richer locomotion skills, which the policy leverages to master new terrains without retraining. Experiments demonstrate that DreamPolicy achieves an average success rate of 90% in training environments and an average of 20% higher success on unseen terrains than prevailing methods. It also generalizes to perturbed and composite scenarios where prior approaches collapse. By unifying offline data, diffusion-based trajectory synthesis, and policy optimization, DreamPolicy overcomes the "one task, one policy" bottleneck, establishing a paradigm for scalable, data-driven humanoid control.
@article{fan2025one,
  title={One Policy but Many Worlds: A Scalable Unified Policy for Versatile Humanoid Locomotion},
  author={Fan, Yahao and Gui, Tianxiang and Ji, Kaiyang and Ding, Shutong and Zhang, Chixuan and Gu, Jiayuan and Yu, Jingyi and Wang, Jingya and Shi, Ye},
  journal={arXiv preprint arXiv:2505.18780},
  year={2025},
}
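A minimal sketch of the HMI-conditioned control loop described in the abstract above, assuming a pretrained terrain-aware diffusion planner and a small policy network. The class name, the environment/planner interfaces, and the replanning interval are illustrative assumptions, not the DreamPolicy release.

import torch
import torch.nn as nn

class HMIPolicy(nn.Module):
    """Policy conditioned on proprioception plus a flattened 'dreamed' future-state plan."""
    def __init__(self, obs_dim, hmi_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + hmi_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, hmi):
        return self.net(torch.cat([obs, hmi], dim=-1))

@torch.no_grad()
def rollout(env, planner, policy, horizon=8, replan_every=4, max_steps=1000):
    obs, terrain = env.reset()                    # hypothetical environment interface
    plan = planner.sample(obs, terrain, horizon)  # (horizon, state_dim) dreamed future states
    for t in range(max_steps):
        if t % replan_every == 0:                 # autoregressive replanning from the latest state
            plan = planner.sample(obs, terrain, horizon)
        action = policy(obs, plan.flatten())      # dreamed trajectory as the dynamic objective
        obs, terrain, done = env.step(action)
        if done:
            break

Replanning every few steps is one simple way to keep the dreamed trajectory consistent with the terrain actually encountered; the autoregressive planner in the abstract plays this role.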
ICCV 2025 Highlight
Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis
Kaiyang Ji, Ye Shi, Zichen Jin, and 5 more authors
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025
Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real-time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to interaction partners’ movements while avoiding artifacts like foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including a virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration.
@inproceedings{ji2025interaction,
  title={Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis},
  author={Ji, Kaiyang and Shi, Ye and Jin, Zichen and Chen, Kangyi and Xu, Lan and Ma, Yuexin and Yu, Jingyi and Wang, Jingya},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025},
}
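A rough sketch of the real-time action-reaction loop described in the Human-X abstract above: a reaction planner auto-regressively predicts the next kinematic reaction frame from the partner's recent motion, and an actor-aware tracking policy executes it in a physics simulator. The names (sim, planner, tracker) and their interfaces are assumptions for illustration, not the paper's released code.

import torch

@torch.no_grad()
def interact(sim, planner, tracker, window=16, max_steps=600):
    """Real-time loop: plan a kinematic reaction, then track it physically."""
    state = sim.reset()                    # physical state of the reacting character
    history = []                           # recent partner (actor) motion frames
    for _ in range(max_steps):
        actor_frame = sim.observe_actor()  # streamed partner motion, e.g. from VR tracking
        history = (history + [actor_frame])[-window:]
        # Auto-regressive reaction planner: predict the next kinematic reaction frame,
        # conditioned on the partner's recent motion and the character's own state.
        reaction_ref = planner.predict_next(torch.stack(history), state)
        # Actor-aware tracking policy turns the kinematic reference into simulator
        # actions while adapting to the partner, discouraging foot sliding and penetration.
        action = tracker(state, reaction_ref, actor_frame)
        state = sim.step(action)
    return state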
arXiv 2503
Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy
Zekai Deng, Ye Shi, Kaiyang Ji, and 3 more authors
Human-object interaction (HOI) synthesis is crucial for applications in animation, simulation, and robotics. However, existing approaches either rely on expensive motion capture data or require manual reward engineering, limiting their scalability and generalizability. In this work, we introduce the first unified physics-based HOI framework that leverages Vision-Language Models (VLMs) to enable long-horizon interactions with diverse object types, including static, dynamic, and articulated objects. We introduce VLM-Guided Relative Movement Dynamics (RMD), a fine-grained spatio-temporal bipartite representation that automatically constructs goal states and reward functions for reinforcement learning. By encoding structured relationships between human and object parts, RMD enables VLMs to generate semantically grounded, interaction-aware motion guidance without manual reward tuning. To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans. Extensive experiments demonstrate that our framework outperforms existing methods in synthesizing natural, human-like motions across both simple single-task and complex multi-task scenarios.
@article{deng2025human,
  title={Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy},
  author={Deng, Zekai and Shi, Ye and Ji, Kaiyang and Xu, Lan and Huang, Shaoli and Wang, Jingya},
  journal={arXiv preprint arXiv:2503.18349},
  year={2025},
}
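To make the idea of VLM-constructed rewards in the abstract above concrete, here is a minimal sketch of turning a staged plan of relative part-pair goals into a dense RL reward. The JSON schema, function names, and exponential shaping are illustrative assumptions, not the paper's exact RMD format.

import json
import numpy as np

def parse_rmd_plan(vlm_response: str):
    """Parse a staged plan, e.g. [[{"human_part": "right_hand", "object_part": "handle",
    "offset": [0.0, 0.0, 0.1]}], ...], with one list of part-pair goals per stage."""
    return json.loads(vlm_response)

def rmd_stage_reward(stage, human_parts, object_parts, scale=5.0):
    """Dense reward: how closely current relative part positions match the stage's goals.
    human_parts / object_parts map part names to 3D positions (numpy arrays)."""
    error = 0.0
    for goal in stage:
        rel = human_parts[goal["human_part"]] - object_parts[goal["object_part"]]
        error += float(np.linalg.norm(rel - np.asarray(goal["offset"])))
    return float(np.exp(-scale * error / max(len(stage), 1)))

The exponential shaping keeps the reward bounded and dense, which is a common choice for goal-reaching terms; the paper's actual reward construction may differ.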
CVPR 2024
A Unified Diffusion Framework for Scene-Aware Human Motion Estimation from Sparse Signals
Jiangnan Tang, Jingya Wang, Kaiyang Ji, and 3 more authors
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Estimating full-body human motion via sparse tracking signals from head-mounted displays and hand controllers in 3D scenes is crucial to applications in AR/VR. One of the biggest challenges to this task is the one-to-many mapping from sparse observations to dense full-body motions, which entails inherent ambiguities. To help resolve this ambiguity, we introduce a new framework that combines rich contextual information provided by scenes to benefit full-body motion tracking from sparse observations. To estimate plausible human motions given sparse tracking signals and 3D scenes, we develop S2Fusion, a unified framework fusing Scene and sparse Signals with a conditional difFusion model. S2Fusion first extracts the spatio-temporal relations residing in the sparse signals via a periodic autoencoder, and then produces time-aligned feature embeddings as additional inputs. Subsequently, by drawing initial noisy motion from a pre-trained prior, S2Fusion utilizes conditional diffusion to fuse scene geometry and sparse tracking signals to generate full-body scene-aware motions. The sampling procedure of S2Fusion is further guided by a specially designed scene-penetration loss and phase-matching loss, which effectively regularize the motion of the lower body even in the absence of any tracking signals, making the generated motion much more plausible and coherent. Extensive experimental results demonstrate that our S2Fusion outperforms the state-of-the-art in terms of estimation quality and smoothness.
@inproceedings{tang2024unified,
  title={A Unified Diffusion Framework for Scene-Aware Human Motion Estimation from Sparse Signals},
  author={Tang, Jiangnan and Wang, Jingya and Ji, Kaiyang and Xu, Lan and Yu, Jingyi and Shi, Ye},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={21251--21262},
  year={2024},
}
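The guided sampling procedure described in the S2Fusion abstract above can be pictured as loss-guided reverse diffusion: at each denoising step, gradients of differentiable scene-penetration and phase-matching losses nudge the update. The denoiser interface, the reverse_step method, and the single guidance weight are illustrative assumptions, not the S2Fusion implementation.

import torch

def guided_sample(denoiser, sparse_sig, scene_feat, x_init, timesteps,
                  penetration_loss, phase_loss, guide_w=1.0):
    """Reverse diffusion over a motion sequence with loss-based guidance."""
    x = x_init                                # noisy motion drawn from a pretrained prior
    for t in timesteps:                       # ordered from most noisy to clean
        with torch.enable_grad():
            x_in = x.detach().requires_grad_(True)
            # Denoiser conditioned on sparse tracking signals and scene geometry features.
            x0_hat = denoiser(x_in, t, sparse_sig, scene_feat)
            # Differentiable penalties regularize lower-body motion even where no signal exists.
            loss = penetration_loss(x0_hat, scene_feat) + phase_loss(x0_hat, sparse_sig)
            grad = torch.autograd.grad(loss, x_in)[0]
        # One reverse-diffusion update (placeholder), nudged down the guidance gradient.
        x = denoiser.reverse_step(x.detach(), x0_hat.detach(), t) - guide_w * grad
    return x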