Chinese Team Completes Full-Parameter Post-Training of Trillion-Parameter Model on Domestic Computing Power

2026-06-09 13:54

en.Wedoany.com Reported - The AI Training Platform Project Team of Shenzhen Hetao College, together with Harbin Institute of Technology (Shenzhen), the Shenzhen Institute of Big Data, and Huawei GTS (Global Technical Services), has carried out joint research on large-model training using domestic computing power. Working on an Ascend 910C domestic computing cluster, the team completed stable full-parameter continued training and SFT (Supervised Fine-Tuning) of DeepSeek-V4-Pro within one month. Training ran for more than 1,500 steps, the model's MFU (Model FLOPs Utilization) exceeded 30%, and the efficiency of key training operators improved by approximately 14%.

This marks the industry's first full-parameter post-training of DeepSeek-V4-Pro carried out by a third-party institution on a domestic computing cluster. It signals that domestic AI infrastructure is moving beyond inference deployment and lightweight fine-tuning to full-parameter post-training of ultra-large models.

DeepSeek-V4-Pro is an open-source flagship MoE (Mixture of Experts) model with 1.6 trillion parameters. It uses new mechanisms such as CSA+HCA hybrid sparse attention and mHC connections, and it places higher demands on domestic training frameworks than its predecessors, DeepSeek-V3 and R1.

The joint team ran the full-parameter post-training of DeepSeek-V4-Pro on a thousand-card Ascend 910C cluster. Training passed 1,500 steps with no skipped steps and no NaN anomalies. Key training operators ran approximately 14% more efficiently than in the initial version, MFU stabilized at 34.9%, and single-step training time stabilized at 27 seconds. The team also established the complete pipeline for full-parameter continued training and SFT of DeepSeek-V4-Flash.
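For readers unfamiliar with the metric, MFU is the ratio of the FLOPs per second the model's computation actually requires to the aggregate peak FLOPs per second of the hardware. The following is a minimal sketch of that ratio, not the team's measurement code. The function name `mfu` and every input value in the example are hypothetical placeholders chosen as round numbers; only the 27-second step time and the thousand-card scale echo the report.

```python
def mfu(model_flops_per_step: float, step_time_s: float,
        num_devices: int, peak_flops_per_device: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s over aggregate peak FLOP/s."""
    if step_time_s <= 0 or num_devices <= 0 or peak_flops_per_device <= 0:
        raise ValueError("step time, device count and peak must be positive")
    achieved_flops_per_s = model_flops_per_step / step_time_s
    aggregate_peak = num_devices * peak_flops_per_device
    return achieved_flops_per_s / aggregate_peak


# Illustrative placeholders only: the per-step FLOP count and the per-device
# peak are invented round numbers, not Ascend 910C or DeepSeek-V4-Pro figures.
example = mfu(
    model_flops_per_step=2.7e18,   # hypothetical
    step_time_s=27.0,              # step time quoted in the report
    num_devices=1000,              # "thousand-card" scale, rounded
    peak_flops_per_device=4.0e14,  # hypothetical
)
print(f"MFU = {example:.1%}")
```

With these placeholder inputs the ratio comes out at about one quarter; the real figure depends on the actual per-step FLOP count and hardware peak, which the report does not state.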

The project outcomes demonstrate that stable training of trillion-parameter MoE models on domestic computing power is reproducible and deliverable as an engineering capability. The work also completed closed-loop validation in industrial-grade automated operations research modeling scenarios, indicating that domestic computing power can support specialized enhancement training for industry-specific large models within a short cycle and at low cost.

On the technical front, the project reports three major breakthroughs. First, it built a distributed deployment scheme covering weights, gradients, activations, and optimizer states, so that data parallelism, tensor parallelism, pipeline parallelism, and expert parallelism operate in coordination. Second, it optimized the MoE routing and sparse attention operators and established an expert load balancing mechanism that alleviates communication congestion and load imbalance. Third, it built a monitoring system for long-running stability with all indicators visualized; no loss divergence or NaN values were observed during multi-day continuous training.
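The report does not describe how the load balancing mechanism works. As a rough illustration of what "load imbalance" means for MoE routing, the sketch below counts how many tokens a router sends to each expert and reports the busiest expert's load relative to a perfectly even split. The function `expert_load_stats` and the sample routing list are hypothetical, and this is a diagnostic only, not the project's balancing method.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def expert_load_stats(assignments: Sequence[int],
                      num_experts: int) -> Tuple[List[int], float]:
    """Return per-expert token counts and the max-to-mean load ratio.

    A ratio of 1.0 means tokens are spread evenly; larger values mean the
    busiest expert receives proportionally more tokens than average.
    """
    if num_experts <= 0:
        raise ValueError("num_experts must be positive")
    counts = Counter(assignments)
    loads = [counts.get(e, 0) for e in range(num_experts)]
    mean_load = len(assignments) / num_experts
    imbalance = max(loads) / mean_load if mean_load > 0 else 0.0
    return loads, imbalance


# Hypothetical routing of 8 tokens across 4 experts (expert id per token).
loads, imbalance = expert_load_stats([0, 0, 0, 1, 2, 3, 0, 1], num_experts=4)
print(loads, imbalance)  # prints [4, 2, 1, 1] 2.0
```

In this toy case expert 0 receives twice the average load, which in a real expert-parallel setup would make the device hosting it a straggler and concentrate all-to-all traffic on it.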

In the capability validation phase, the project designed an experiment to strengthen the mathematical modeling ability of large models. The team built a production workflow for SFT modeling data and generated 3,000 high-quality SFT samples for mathematical modeling tasks, covering 4 types of target tasks and 3 problem forms. In training, the model's LM Loss converged to 0.2056 and its MTP 1 Loss converged to 0.2538, with a stable gradient curve. Benchmark evaluations show improvement across all four of the model's core indicators, with ORGEval WL rising by more than 5 percentage points, indicating significantly stronger complex reasoning and modeling capabilities.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com