Kaiyang Ji

2-505, VDI, SIST

393 Middle Huaxia Road, 201210

Shanghai, China

About Me

I am Kaiyang Ji, a second-year master's student at the Visual & Data Intelligence Center (VDI) at ShanghaiTech University, advised by Prof. Jingya Wang and Prof. Ye Shi. Previously, I graduated from ShanghaiTech University with a major in computer science, advised by Prof. Jingya Wang and Prof. Jingyi Yu.

Research Interests

My research interests lie broadly in computer vision, machine learning, and robotics. In particular, my current research focuses on Human-Centered 3D Vision, Generative Models, and Embodied AI.

I am looking for collaborators and friends. Feel free to contact me if you are interested in these fantastic topics!

Note: I am actively seeking Fall 2027 CS Ph.D. opportunities in Embodied AI, Generative Models and Human-Centered 3D Vision.

Email / Google Scholar / GitHub

News

May 01, 2026	Our paper DiscoForcing has been accepted by ICML 2026!
Jan 27, 2026	Our paper VLM-RMD has been accepted by ICLR 2026!
Jul 29, 2025	We organized the ICCV 2025 Workshop Challenge “Human-Robot-Scene Interaction and Collaboration”!
Jun 26, 2025	Our paper Human-X has been accepted by ICCV 2025 as a Highlight!
Sep 01, 2024	I joined VDI in Fall 2024 as a CS master's student!
Feb 27, 2024	Our paper S2Fusion has been accepted by CVPR 2024!

Selected Publications

ICML 2026
DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing

Kaiyang Ji^*, Bingsheng Qian^*, Binghuan Wu, and 3 more authors

In Forty-third International Conference on Machine Learning (ICML), 2026

We study real-time audio-responsive character control as a deployment-faithful problem: strictly causal, bounded-latency streaming that must generate coherent full-body motion at interactive frame rates while the audio condition can change abruptly (tempo shifts, drops, or user edits). Prior music-to-motion systems are largely optimized for offline generation with global context, and degrade in streaming rollouts where conditioning history becomes stale or unreliable. We introduce DiscoForcing, a streaming audio-driven diffusion framework that combines a causal music encoder that captures rhythmic structure and phase dynamics with a diffusion-forcing sequence model trained under heterogeneous noise levels across the temporal horizon. Building on this, we design a hybrid temporal schedule and a history-guided streaming sampler to explicitly trade off responsiveness against long-horizon consistency under non-stationary audio. Implemented in an end-to-end real-time interactive system with online avatar playback and humanoid deployment workflows, DiscoForcing delivers more stable long-horizon rollouts and sharper audio-motion alignment than prior baselines under matched causality and latency constraints while maintaining real-time throughput.
@inproceedings{ji2026discoforcing,
  title = {DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing},
  author = {Ji, Kaiyang and Qian, Bingsheng and Wu, Binghuan and Chen, Kangyi and Shi, Ye and Wang, Jingya},
  booktitle = {Forty-third International Conference on Machine Learning (ICML)},
  year = {2026},
}

ACM MM 2026
ARFlow: Real-time Human Action-Reaction Synthesis with Reprojection Guidance

Wentao Jiang, Jingya Wang, Kaiyang Ji, and 3 more authors

In Proceedings of the 34th ACM International Conference on Multimedia (ACM MM), 2026

Human action-reaction synthesis, a fundamental challenge in modeling causal human interactions, plays a critical role in applications ranging from virtual reality to social robotics. While diffusion-based models have demonstrated promising performance, they exhibit two key limitations for interaction synthesis: reliance on complex noise-to-reaction generators with intricate conditional mechanisms, and frequent physical violations in generated motions. To address these issues, we propose Action-Reaction Flow Matching (ARFlow), a novel framework that establishes direct action-to-reaction mappings, eliminating the need for complex conditional mechanisms. Our approach introduces a physical guidance mechanism specifically designed for Flow Matching (FM) that effectively prevents body penetration artifacts during sampling. Moreover, we identify a bias in the traditional flow matching sampling algorithm and employ a reprojection method to correct the sampling direction of FM. To further enhance reaction diversity, we incorporate randomness into the sampling process. Extensive experiments on the NTU120, Chi3D, and InterHuman datasets demonstrate that ARFlow not only outperforms existing methods in terms of Fréchet Inception Distance and motion diversity but also significantly reduces body collisions, as measured by our new Intersection Volume and Intersection Frequency metrics.
@inproceedings{jiang2025arflow,
  title = {ARFlow: Real-time Human Action-Reaction Synthesis with Reprojection Guidance},
  author = {Jiang, Wentao and Wang, Jingya and Ji, Kaiyang and Jia, Baoxiong and Huang, Siyuan and Shi, Ye},
  booktitle = {Proceedings of the 34th ACM International Conference on Multimedia (ACM MM)},
  year = {2026},
}

ICCV 2025 Highlight
Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis

Kaiyang Ji, Ye Shi, Zichen Jin, and 5 more authors

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real-time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to interaction partners’ movements while avoiding artifacts like foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including a virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration.
@inproceedings{ji2025towards,
  title = {Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis},
  author = {Ji, Kaiyang and Shi, Ye and Jin, Zichen and Chen, Kangyi and Xu, Lan and Ma, Yuexin and Yu, Jingyi and Wang, Jingya},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  pages = {10173--10183},
  year = {2025},
}

ICLR 2026
Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy

Zekai Deng, Ye Shi, Kaiyang Ji, and 3 more authors

In The Fourteenth International Conference on Learning Representations (ICLR), 2026

Human-object interaction (HOI) synthesis is crucial for applications in animation, simulation, and robotics. However, existing approaches either rely on expensive motion capture data or require manual reward engineering, limiting their scalability and generalizability. In this work, we introduce the first unified physics-based HOI framework that leverages Vision-Language Models (VLMs) to enable long-horizon interactions with diverse object types, including static, dynamic, and articulated objects. We introduce VLM-Guided Relative Movement Dynamics (RMD), a fine-grained spatio-temporal bipartite representation that automatically constructs goal states and reward functions for reinforcement learning. By encoding structured relationships between human and object parts, RMD enables VLMs to generate semantically grounded, interaction-aware motion guidance without manual reward tuning. To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans. Extensive experiments demonstrate that our framework outperforms existing methods in synthesizing natural, human-like motions across both simple single-task and complex multi-task scenarios.
@inproceedings{deng2025human,
  title = {Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy},
  author = {Deng, Zekai and Shi, Ye and Ji, Kaiyang and Xu, Lan and Huang, Shaoli and Wang, Jingya},
  booktitle = {The Fourteenth International Conference on Learning Representations (ICLR)},
  year = {2026},
}

CVPR 2024
A unified diffusion framework for scene-aware human motion estimation from sparse signals

Jiangnan Tang, Jingya Wang, Kaiyang Ji, and 3 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Estimating full-body human motion via sparse tracking signals from head-mounted displays and hand controllers in 3D scenes is crucial to applications in AR/VR. One of the biggest challenges in this task is the one-to-many mapping from sparse observations to dense full-body motions, which introduces inherent ambiguities. To help resolve this ambiguity, we introduce a new framework that combines rich contextual information provided by scenes to benefit full-body motion tracking from sparse observations. To estimate plausible human motions given sparse tracking signals and 3D scenes, we develop S2Fusion, a unified framework fusing Scene and sparse Signals with a conditional difFusion model. S2Fusion first extracts the spatial-temporal relations residing in the sparse signals via a periodic autoencoder, and then produces time-alignment feature embeddings as additional inputs. Subsequently, by drawing initial noisy motion from a pre-trained prior, S2Fusion utilizes conditional diffusion to fuse scene geometry and sparse tracking signals to generate full-body scene-aware motions. The sampling procedure of S2Fusion is further guided by a specially designed scene-penetration loss and phase-matching loss, which effectively regularize the motion of the lower body even in the absence of any tracking signals, making the generated motion much more plausible and coherent. Extensive experimental results demonstrate that S2Fusion outperforms the state of the art in terms of estimation quality and smoothness.
@inproceedings{tang2024unified,
  title = {A unified diffusion framework for scene-aware human motion estimation from sparse signals},
  author = {Tang, Jiangnan and Wang, Jingya and Ji, Kaiyang and Xu, Lan and Yu, Jingyi and Shi, Ye},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages = {21251--21262},
  year = {2024},
}