Alibaba's Tongyi Lab Launches FIPO Algorithm, 32B Model Inference Performance Surpasses o1-mini
2026-04-08 10:21

en.Wedoany.com Reported - Alibaba's Tongyi Lab Qwen Pilot team launched a new algorithm, FIPO (Future-KL Influenced Policy Optimization), on April 7, 2026. In a pure reinforcement learning (Pure RL) setting with a 32B-scale model, its performance surpassed OpenAI's o1-mini and the similarly sized DeepSeek-Zero-MATH. According to a paper the team submitted to arXiv on March 20, 2026, when evaluated on Qwen2.5-32B, FIPO extended the average chain-of-thought length from about 4,000 tokens to over 10,000 tokens. Pass@1 accuracy on AIME 2024 improved from 50.0% to a peak of 58.0%, converging at approximately 56.0%. This also exceeds DeepSeek-R1-Zero-Math-32B (approx. 47.0%) and o1-mini (approx. 56.0%).

Traditional GRPO-style reinforcement learning relies on outcome-based rewards (ORM), distributing a single global advantage uniformly across every token in the trajectory. This coarse-grained credit assignment fails to distinguish critical logical pivots from trivial tokens, causing reasoning trajectories to stagnate at intermediate lengths. FIPO instead introduces discounted future KL divergence into policy updates, reweighting each token by its influence on subsequent trajectory behavior. The result is a dense, token-level advantage signal in place of the uniform trajectory-level one. The team also introduced the signed log-probability difference (Δlog p) as a new observation dimension to capture the direction of optimization, replacing the industry-common but less precise metrics of entropy and KL divergence for identifying key tokens.
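The core idea described above can be illustrated with a minimal sketch. Note that the function name, the discounted backward accumulation, and the normalization step below are all assumptions for illustration; the paper's exact formulation of the future-KL influence weight is not given in this article.

```python
# Hypothetical sketch: turning one trajectory-level (ORM) advantage into
# dense per-token advantages using discounted future KL influence.
import numpy as np

def fipo_token_advantages(global_advantage, per_token_kl, gamma=0.95):
    """Reweight a trajectory-level advantage per token.

    global_advantage: scalar outcome-based advantage for the whole trajectory.
    per_token_kl: KL-divergence contribution at each token position.
    gamma: discount applied to future KL when measuring a token's influence.
    """
    T = len(per_token_kl)
    # Discounted sum of future KL at each position: a token whose successors
    # diverge strongly from the reference policy is treated as more influential.
    influence = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = per_token_kl[t] + gamma * running
        influence[t] = running
    # Normalize so the weights average to 1, preserving the total
    # advantage mass of the trajectory while redistributing it per token.
    weights = influence / (influence.mean() + 1e-8)
    return global_advantage * weights
```

In this sketch, contrast with GRPO amounts to replacing `weights` with an all-ones vector: every token would then receive the same share of the outcome reward, which is exactly the coarse credit assignment the paragraph above criticizes.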

FIPO was tested on the foundation model Qwen2.5-32B-Base, which had no prior exposure to any long chain-of-thought synthetic data. Research statistics show that the model's probability of self-misleading in long reasoning chains (nearly 3%) is roughly three times its probability of achieving insight (about 1%). The root cause lies in the global uniform reward mechanism's inability to distinguish critical logical nodes from redundant reflection. FIPO incorporates three robustness mechanisms—extreme value filtering, soft decay windows, and influence weight clipping—to ensure training stability. The related paper, code, and model have been open-sourced, with the training system built on the verl framework.
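The three stability mechanisms named above might look roughly like the following sketch. The function name, thresholds, and the specific form of each step (z-score outlier replacement, exponential tail decay, hard clipping) are illustrative assumptions, not the open-sourced implementation.

```python
# Hypothetical sketch of FIPO's three robustness mechanisms applied to
# per-token influence weights before the policy update.
import numpy as np

def stabilize_weights(weights, clip_lo=0.2, clip_hi=5.0,
                      decay_window=64, z_max=4.0):
    w = np.asarray(weights, dtype=float)
    # 1. Extreme value filtering: replace statistical outliers (|z| > z_max)
    # with the median weight so a single spike cannot dominate the batch.
    z = (w - w.mean()) / (w.std() + 1e-8)
    w = np.where(np.abs(z) > z_max, np.median(w), w)
    # 2. Soft decay window: exponentially taper the last `decay_window`
    # weights so late tokens in a very long chain fade out smoothly
    # rather than being cut off abruptly.
    T = len(w)
    if T > decay_window:
        tail = np.arange(T - decay_window, T)
        w[tail] *= np.exp(-(tail - (T - decay_window)) / decay_window)
    # 3. Influence weight clipping: bound every weight to a safe range,
    # analogous to the clipped ratio in PPO-style objectives.
    return np.clip(w, clip_lo, clip_hi)
```

Each step is independent, so in practice the thresholds would be tuned jointly with the learning rate: tighter clipping trades some credit-assignment precision for smoother gradients.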

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com