en.Wedoany.com Reported - On June 8, EvoPhys-World, a 5D world model developed by the EvoPhys team at Peking University, ranked first in the "World Generation" track of Stanford University's WorldScore public benchmark. The human-centric model is designed for scene-level controllable generation and physical interaction tasks, and its native training was completed entirely on Moore Threads MTT S5000 GPUs with the MUSA software stack.
EvoPhys-World aims to move world models from "generating viewable scenes" to "generating interactive, controllable, and evolving scene systems." According to the project page, the model builds a human-centric world twin from first-person interaction data and scene memory, then adds controllable interaction and self-evolution mechanisms, so that a single scene state can yield different predicted futures under different action branches. The core model takes two forms: World Engine, which emphasizes universal digital twinning and physical interactivity, and World Policy, which focuses on world predictability and action selection. Together they form a closed loop that runs from scene generation through state prediction and action prediction to feedback evolution. For embodied intelligence, robot training, virtual simulation, and complex scene generation, the value of such models lies in enabling AI not only to understand spatial relationships in images but also to grasp the connections among actions, causality, physical feedback, and task outcomes.
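The idea of one scene state branching into several predicted futures, with a policy choosing among them, can be illustrated with a toy sketch. Everything below is hypothetical and written only for illustration: the `SceneState` class, the three actions, and the distance-based selection rule are assumptions, not EvoPhys-World's actual interfaces or method, which the project page does not specify at this level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneState:
    """Toy scene (assumption): a cup at a horizontal position and a height."""
    cup_x: float
    cup_height: float


# Hypothetical action set; each action is a fixed (dx, dheight) displacement.
ACTIONS = {
    "push_left": (-1.0, 0.0),
    "push_right": (1.0, 0.0),
    "lift": (0.0, 1.0),
}


def predict_next(state, action):
    """Stand-in for an 'engine' role: predict the next state for one action."""
    dx, dh = ACTIONS[action]
    return SceneState(state.cup_x + dx, state.cup_height + dh)


def branch_futures(state):
    """One scene state, one predicted future per action branch."""
    return {action: predict_next(state, action) for action in ACTIONS}


def choose_action(state, goal):
    """Stand-in for a 'policy' role: pick the branch that ends nearest the goal."""
    def distance(s):
        return abs(s.cup_x - goal.cup_x) + abs(s.cup_height - goal.cup_height)

    futures = branch_futures(state)
    return min(futures, key=lambda action: distance(futures[action]))


def run_closed_loop(state, goal, max_steps=10):
    """Alternate action selection and state prediction until the goal is reached."""
    trajectory = [state]
    for _ in range(max_steps):
        if state == goal:
            break
        state = predict_next(state, choose_action(state, goal))
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    start = SceneState(0.0, 0.0)
    goal = SceneState(2.0, 1.0)
    for step in run_closed_loop(start, goal):
        print(step)
```

In this toy the prediction step is a fixed arithmetic rule; a real world model would replace it with learned generation of the next scene, while the loop structure of predicting per-action futures and feeding the chosen outcome back in stays the same.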
WorldScore is a unified evaluation benchmark for world generation tasks. It measures how well 3D, 4D, and video models generate worlds as instructed, with key metrics covering controllability, quality, and dynamic performance. The public leaderboard shows EvoPhys-World ranking near the top on indicators such as WorldScore-Static.
The result also draws attention to how well Chinese GPUs and software stacks can handle the training of cutting-edge models. World model training places heavy demands on long-sequence data throughput, distributed training stability, multimodal spatiotemporal modeling, operator support, and hardware-software co-efficiency. Because EvoPhys-World's native training was completed entirely on Moore Threads MTT S5000 GPUs with the MUSA software stack, the development team did not merely use domestic computing power for inference or later-stage adaptation; it validated the entire training pipeline, from hardware and software stack to model workflow. For China's AI infrastructure industry, such cases are more complex than simply running language model inference, because world models involve diverse workloads such as video generation, physical interaction, state prediction, and action strategies, and these set a higher bar for GPU clusters, communication efficiency, and training framework compatibility.
EvoPhys-World's intended applications also sit close to the physical world. The project page showcases scenarios including human hand operations, desktop interactions, moving cups, warehousing, chemical plants, cities, and ancient towns, indicating that the model aims to cover multi-level generation tasks ranging from localized hand movements to large-scale scene navigation, and from object contact to task reasoning. If this trajectory continues, world models are expected to become a crucial foundation for embodied intelligence training, giving robots low-cost, highly controllable virtual training environments that can evolve repeatedly before real-world deployment. They can also be applied to industrial simulation, digital twins, complex task rehearsal, and human-machine collaboration validation.
Going forward, the impact of EvoPhys-World will depend on how openly its capabilities are made available, how its developer ecosystem grows, what further real-world task validations show, and whether China's GPU software stack remains stable in larger-scale training. Topping the WorldScore leaderboard at least demonstrates that a Chinese university team has reached the forefront of international public evaluations for world models, and it offers an observable example of domestic AI computing power supporting the training of cutting-edge multimodal models.
This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com