China's Alibaba Launches Qwen-Robot Series to Advance Embodied AI Model Deployment
2026-06-16 14:04

en.Wedoany.com reported: On June 16, Alibaba released the Qwen-Robot series of embodied intelligence large models, comprising three models: the VLA manipulation model Qwen-RobotManip, the VLN navigation model Qwen-RobotNav, and the world model Qwen-RobotWorld. This is the first complete series of embodied intelligence models within the Qwen large model family; the three models target robot manipulation, mobile navigation, and environmental understanding, respectively. They can be deployed individually or operate collaboratively, providing a universal model foundation for various forms of robots to enter real-world scenarios.

The key to embodied intelligence is enabling AI not only to understand and generate text, images, and videos but also to interact with the physical world. To operate in real environments, robots must combine several capabilities at once: seeing objects, understanding tasks, planning paths, controlling actions, and judging outcomes. The Qwen-Robot series splits manipulation, navigation, and world modeling into three separate models, indicating Alibaba's intention to extend general large model capabilities into the robot action chain, rather than confining them to dialogue or visual recognition.

Qwen-RobotManip is a VLA manipulation model. VLA stands for Vision-Language-Action, and the model primarily addresses the robot's "hand" problem. When a robot encounters desktop objects, tools, parts, or everyday items, it needs to identify targets, understand instructions, and then generate executable actions such as grasping, moving, placing, switching, and organizing. Traditional robot control relies on fixed programs and structured environments; once object positions, backgrounds, lighting, or the wording of a task changes, performance tends to degrade. The value of the VLA model lies in integrating visual perception, language instructions, and action control into a single framework, enabling robots to generate action strategies based on natural language and real-time visual scenes.

Qwen-RobotNav is a VLN navigation model. VLN stands for Vision-Language Navigation model, primarily addressing the robot's "leg" problem. When service robots, inspection robots, quadruped robots, and mobile platforms enter office buildings, factories, warehouses, parks, or home environments, they must understand "where to go, how to get there, what to avoid, and what to do upon arrival." Mobile navigation involves not only path planning but also spatial semantic understanding, obstacle avoidance, multi-step instruction following, and task location confirmation. The VLN model allows robots to map language goals to visual environments, thereby completing navigation tasks in more complex open environments.
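To make "mapping a language goal to an environment" concrete, here is a toy sketch in Python. It is not Qwen-RobotNav's interface or method, which the announcement does not describe. The grid map, the `PLACES` labels, and the keyword-based `ground_goal` function are all hypothetical stand-ins, and breadth-first search is swapped in for whatever planning the real model performs.

```python
from collections import deque

# Hypothetical floor map: '.' is free space, '#' is an obstacle,
# and letters mark named places. None of this comes from Qwen-RobotNav.
GRID = [
    "D.#.",
    ".##.",
    "...K",
]
PLACES = {"dock": "D", "kitchen": "K"}  # hypothetical semantic labels


def find(symbol):
    """Return the (row, col) of a symbol on the map."""
    for r, row in enumerate(GRID):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} is not on the map")


def ground_goal(instruction):
    """Crude stand-in for language grounding: match a known place name."""
    text = instruction.lower()
    for name, symbol in PLACES.items():
        if name in text:
            return find(symbol)
    raise ValueError("no known place mentioned in the instruction")


def shortest_path(start, goal):
    """Breadth-first search over free cells; returns a list of cells or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])
            if inside and GRID[nr][nc] != "#" and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None


goal = ground_goal("Go to the kitchen and wait there")
path = shortest_path(find("D"), goal)
print(path)
```

In this sketch the instruction is first resolved to a map location and only then handed to a planner that routes around obstacles. A real VLN model would work from camera input rather than a hand-labeled grid.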

Qwen-RobotWorld assumes the role of a world model, primarily addressing the robot's "brain" problem. The world model is used to understand object relationships, spatial structures, action consequences, and environmental changes, helping robots predict and plan before execution. If a robot can only execute single-step actions based on instructions, it struggles to handle unexpected situations in the real world; the world model enables the system to anticipate "what will happen after doing this" and adjust strategies during tasks. For industrial, logistics, commercial service, and home service scenarios, this capability is crucial for robots to transition from demonstration-based tasks to continuous operations.
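The idea of anticipating "what will happen after doing this" can be illustrated with a toy planner that tries candidate action sequences inside a predictive model before committing to one. The sketch below is an assumption-laden illustration, not Qwen-RobotWorld's design. The one-dimensional gripper world, the `predict` function, and the exhaustive search are all hypothetical and chosen only for brevity.

```python
from itertools import product

# Hypothetical toy world: a gripper on a track with cells 0..4.
# State = (gripper_position, holding_object, object_position).
ACTIONS = ("left", "right", "grasp", "release")


def predict(state, action):
    """Toy 'world model': the state expected after taking one action."""
    pos, holding, obj = state
    if action == "left":
        pos = max(0, pos - 1)
    elif action == "right":
        pos = min(4, pos + 1)
    elif action == "grasp" and pos == obj:
        holding = True
    elif action == "release":
        holding = False
    if holding:
        obj = pos  # a held object moves with the gripper
    return (pos, holding, obj)


def rollout(state, plan):
    """Apply the model step by step to predict where a plan ends up."""
    for action in plan:
        state = predict(state, action)
    return state


def plan_with_world_model(start, goal_obj_pos, max_len=6):
    """Try shorter plans first; return the first one whose predicted
    outcome leaves the object released at the goal cell."""
    for length in range(1, max_len + 1):
        for plan in product(ACTIONS, repeat=length):
            _, holding, obj = rollout(start, plan)
            if obj == goal_obj_pos and not holding:
                return list(plan)
    return None


start = (0, False, 1)  # gripper at cell 0, object resting at cell 1
plan = plan_with_world_model(start, goal_obj_pos=3)
print(plan)  # ['right', 'grasp', 'right', 'right', 'release']
```

Every candidate plan is evaluated inside the model, so nothing is executed until a sequence with the desired predicted outcome is found. Adjusting strategy mid-task would amount to calling the planner again from the newly observed state.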

Alibaba had previously conducted research in the Qwen-VLA direction. Official technical documentation for Qwen-VLA indicates that the model brings manipulation, navigation, and trajectory prediction into a single framework for predicting actions and trajectories, adapting to different robot platforms through embodied perception prompts. Related research emphasizes that a unified model can serve multiple embodied platforms without requiring individually designed output heads for each platform. With the release of the Qwen-Robot series, the Qwen embodied intelligence roadmap has moved further from a research framework toward a productized model system.

From an industry perspective, the release of the Qwen-Robot series occurs against the backdrop of accelerating deployment of humanoid robots, mobile robots, and industrial intelligent agents. Robot companies generally face a common challenge: while hardware bodies are advancing rapidly, general task capabilities, scene generalization, and data loops remain bottlenecks. Different robot forms vary significantly in sensors, joints, actuators, and control methods. If each product requires training models from scratch, costs are high, development cycles are long, and cross-platform capabilities are difficult to accumulate. The goal of embodied intelligence large models is to provide reusable perception, understanding, planning, and action generation capabilities for different robots.

For Alibaba, the Qwen-Robot series also completes a link in the Qwen large model's journey from language, multimodality, and agents to physical world interaction. General large models are transitioning from online task execution to real-world scenario execution, while robots require stronger task understanding and action planning capabilities from large models. Whether embodied models can truly be deployed will depend on robot hardware interfaces, training data scale, simulation-to-real transfer, action safety boundaries, and industry scenario adaptation. The model release is just the starting point; subsequent validation results in warehousing, inspection, manufacturing, commercial services, and home services will determine the models' industrial value.

The significance of the Qwen-Robot series lies in Alibaba's entry into the core of embodied intelligence with a complete model combination. VLA addresses manipulation, VLN addresses navigation, and the world model addresses environmental understanding and planning. With their synergy, robots have the opportunity to transition from executing single skills to handling multi-step tasks. As embodied intelligence moves from laboratories to real operational environments, universal model foundations, hardware adaptation capabilities, and scenario data loops will become key variables in the competitive landscape of the robotics industry.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com