en.Wedoany.com Reported - On June 24, China's Qwen officially released Qwen-AgentWorld, a native language world model, together with AgentWorldBench, an evaluation benchmark covering seven major domains. Both the model and the benchmark are openly available on Hugging Face and ModelScope and target scenarios such as AI agent environment simulation, task training, and capability evaluation.
Qwen-AgentWorld is positioned as a "language world model" rather than an ordinary conversational large model. It simulates state changes in an agent's environment through language: given the agent's action and the history of prior interactions, it predicts the next piece of environmental feedback. For AI agents, a model of this type provides a virtual interactive space for repeated trial and error, which can be used to train and evaluate an agent's planning, execution, and error-correction capabilities on complex tasks.
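The interaction pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SimulatedEnvironment` class, the `ACTION:`/`OBSERVATION:` prompt layout, and the `toy_predictor` stand-in are all hypothetical and are not taken from the Qwen-AgentWorld release, whose actual prompt format and inference interface are not described here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SimulatedEnvironment:
    """Wraps a text-in/text-out predictor so it can stand in for an environment."""

    # Hypothetical hook: any function that maps a prompt to predicted feedback.
    predict: Callable[[str], str]
    # Past (action, observation) pairs, oldest first.
    history: List[Tuple[str, str]] = field(default_factory=list)

    def build_prompt(self, action: str) -> str:
        """Serialize the interaction history plus the new action (assumed layout)."""
        lines = []
        for past_action, past_observation in self.history:
            lines.append(f"ACTION: {past_action}")
            lines.append(f"OBSERVATION: {past_observation}")
        lines.append(f"ACTION: {action}")
        lines.append("OBSERVATION:")
        return "\n".join(lines)

    def step(self, action: str) -> str:
        """Predict the next observation and record it in the history."""
        observation = self.predict(self.build_prompt(action))
        self.history.append((action, observation))
        return observation


def toy_predictor(prompt: str) -> str:
    """Stand-in for a real model call; returns canned terminal-like output."""
    action_lines = [l for l in prompt.splitlines() if l.startswith("ACTION: ")]
    command = action_lines[-1][len("ACTION: "):]
    if command == "pwd":
        return "/home/agent"
    return f"sh: {command}: command not found"
```

In practice `toy_predictor` would be replaced by a call to whatever serves the world model; the agent only ever sees `step`, so the same agent code could in principle run against a simulated or a real environment.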
The released Qwen-AgentWorld covers seven major agent interaction domains: MCP tool invocation, search, terminal, software engineering, Android, web, and operating systems. These domains span text-based environments as well as graphical interface and software operation environments, and together cover the common task entry points for current AI agents. The model can simulate the results of terminal command execution, feedback from web operations, mobile application interface changes, progress on software engineering tasks, and environmental responses after tool invocation.
According to official information, Qwen-AgentWorld-35B-A3B is trained from Qwen3.5-35B-A3B-Base, with 35B total parameters, of which approximately 3B are activated, and supports a context length of 262K. Its training process has three stages: continued pre-training, supervised fine-tuning, and reinforcement learning. The goal is to focus on environment modeling from the early training stages, rather than adding simulation capabilities after the fact on top of a general language model.
The concurrently released AgentWorldBench evaluates the simulation quality of language world models across different interactive environments. The benchmark scores a model's predicted environmental observations along five dimensions: format, factuality, consistency, realism, and quality, which helps researchers compare how different models perform on environment simulation tasks. The Hugging Face page shows that the AgentWorldBench dataset is released as a test set containing approximately 2,170 samples.
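As a rough illustration of how per-sample scores on these five dimensions might be rolled up for model comparison, consider the sketch below. The numeric score scale, the dictionary layout, and the unweighted "overall" average are assumptions made for this example; the article does not state how AgentWorldBench actually aggregates its scores.

```python
from statistics import mean
from typing import Dict, List

# The five dimensions named for AgentWorldBench.
DIMENSIONS = ("format", "factuality", "consistency", "realism", "quality")


def summarize(scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each dimension over all samples, then average the dimensions.

    `scores` holds one dict per evaluated sample (hypothetical layout).
    The unweighted mean is an assumption, not the benchmark's documented rule.
    """
    if not scores:
        raise ValueError("no scored samples")
    for index, sample in enumerate(scores):
        missing = [d for d in DIMENSIONS if d not in sample]
        if missing:
            raise ValueError(f"sample {index} is missing dimensions: {missing}")
    summary = {d: mean(sample[d] for sample in scores) for d in DIMENSIONS}
    summary["overall"] = mean(summary[d] for d in DIMENSIONS)
    return summary
```

Reporting the per-dimension averages alongside any overall figure keeps trade-offs visible, for example a model that formats observations well but is weak on factuality.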
This type of model has direct significance for AI agent research and development. Agent training currently faces a practical problem: invoking real environments is costly, task states are complex, and API, web, terminal, and mobile application environments are difficult to reproduce stably at scale. If a language world model can accurately simulate environmental feedback, researchers can let agents go through many rounds of trial and error in a virtual environment and then transfer the acquired strategies to real tasks.
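The trial-and-error idea can be sketched as a filter that runs candidate action plans against a simulator and keeps only those that reach a goal. Everything here is hypothetical: `rollout`, `select_plans`, and the tiny rule-based `toy_simulate` function are illustrative stand-ins, with `toy_simulate` taking the place of a learned world model.

```python
from typing import Callable, List, Sequence, Tuple

History = List[Tuple[str, str]]            # past (action, observation) pairs
Simulator = Callable[[History, str], str]  # predicts the next observation


def rollout(plan: Sequence[str], simulate: Simulator,
            is_success: Callable[[str], bool]) -> bool:
    """Run one plan in simulation; stop early once an observation meets the goal."""
    history: History = []
    for action in plan:
        observation = simulate(history, action)
        history.append((action, observation))
        if is_success(observation):
            return True
    return False


def select_plans(candidates: Sequence[Sequence[str]], simulate: Simulator,
                 is_success: Callable[[str], bool]) -> List[Sequence[str]]:
    """Keep only the plans that succeed in the simulated environment."""
    return [plan for plan in candidates if rollout(plan, simulate, is_success)]


def toy_simulate(history: History, action: str) -> str:
    """Rule-based stand-in for a world model: tracks directories made by mkdir."""
    created = {a.split()[1] for a, _ in history if a.startswith("mkdir ")}
    if action.startswith("mkdir "):
        return ""
    if action == "ls":
        return " ".join(sorted(created))
    return "command not found"
```

For example, with the goal "a listing shows a directory named out", the plan `["mkdir out", "ls"]` survives the filter while `["ls"]` alone does not. Plans that pass in simulation are only candidates; they would still need to be checked against the real environment before being trusted.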
The release of Qwen-AgentWorld also indicates that competition among large models is shifting from "answering questions" to "understanding the environment and predicting environmental changes." In the past, large models competed primarily on knowledge, reasoning, and generation capabilities. In the agent era, the emphasis falls more on judging the consequences of actions across multi-turn interactions. The value of a world model lies in building a trainable, evaluable, and scalable simulation bridge between actions and their results.
However, language world models still cannot replace real environments. Web pages, operating systems, mobile applications, and tool invocations are all affected by versions, permissions, network status, and changes in external services, so simulation results must be validated in real-world scenarios. Qwen-AgentWorld is better suited as infrastructure for agent training and evaluation, used to reduce trial-and-error costs, expand environmental coverage, and identify agent weaknesses, rather than as a direct substitute for operating real systems.
With the model and benchmark openly released together, developers can run their own evaluations and fine-tuning for scenarios such as terminals, software engineering, mobile applications, search, and tool invocation. For AI agents to move from demonstration to practical usability, they need more stable environment simulation, reproducible evaluation standards, and a training loop oriented towards real tasks. Qwen-AgentWorld provides a new tooling foundation for closing this gap.
This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com









