NVIDIA's Open-Source Solution Boosts MoE Fine-Tuning Throughput by Up to 3.7x

2026-06-26 13:54

en.Wedoany.com Reported - NVIDIA has released the NeMo AutoModel open-source solution, achieving a 3.4x to 3.7x improvement in training throughput during Mixture of Experts (MoE) fine-tuning, while reducing GPU memory usage by 29% to 32%.

NeMo AutoModel is compatible with the Hugging Face Transformers v5 API, and users only need to add a single import line to accelerate MoE model fine-tuning. On a single node with 8 NVIDIA H100 80GB GPUs, taking the Qwen3-30B-A3B model as an example, the solution raises per-GPU throughput (TPS/GPU) from 3075 to 11340, a 3.69x increase.
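The quoted speedup can be reproduced from the two per-GPU figures. The short Python check below is only an arithmetic sketch. It assumes TPS means tokens per second, and that node-level throughput is per-GPU throughput multiplied by the GPU count, which the article does not state.

```python
# Figures quoted in the article for Qwen3-30B-A3B on one node of 8x H100 80GB.
baseline_tps_per_gpu = 3075     # Hugging Face Transformers v5
automodel_tps_per_gpu = 11340   # NeMo AutoModel
num_gpus = 8

# Ratio of the two per-GPU throughput figures.
speedup = automodel_tps_per_gpu / baseline_tps_per_gpu

# Assumption: aggregate throughput scales as per-GPU throughput x GPU count.
node_tps = automodel_tps_per_gpu * num_gpus

print(f"speedup: {speedup:.2f}x")
print(f"estimated node throughput: {node_tps}")
```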

The MoE architecture has become the mainstream choice for cutting-edge models, but the engineering challenges it introduces—such as expert parallelism, communication fusion, and kernel optimization—require supporting infrastructure. NVIDIA's solution builds on Transformers v5 by incorporating three key technologies: Expert Parallelism (EP), DeepEP, and TransformerEngine.

Expert Parallelism distributes expert weights across multiple GPUs, reducing the memory pressure on each individual GPU. For example, with 8 GPUs and ep_size=8, each GPU holds only one-eighth of the expert weights. For the Qwen3-30B-A3B model, this reduces peak memory from 68.2 GiB to 48.1 GiB, a 29% decrease. For the Nemotron 3 Nano 30B-A3B model, memory usage drops from 62.1 GiB to 42.5 GiB, a 32% decrease. The freed-up memory can be used to train with larger batch sizes and longer sequences.
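As a rough illustration of the sharding idea, the Python sketch below assigns contiguous blocks of experts to expert-parallel ranks and recomputes the quoted memory reductions. The function names and the count of 128 experts per layer are hypothetical, chosen for illustration. This is not NeMo AutoModel's actual placement logic.

```python
def shard_experts(num_experts: int, ep_size: int) -> dict:
    """Assign a contiguous block of expert ids to each expert-parallel rank."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    return {
        rank: list(range(rank * per_rank, (rank + 1) * per_rank))
        for rank in range(ep_size)
    }


def reduction_pct(before_gib: float, after_gib: float) -> float:
    """Relative peak-memory reduction in percent."""
    return 100.0 * (before_gib - after_gib) / before_gib


NUM_EXPERTS = 128  # hypothetical expert count per MoE layer
placement = shard_experts(NUM_EXPERTS, ep_size=8)
experts_on_rank0 = len(placement[0])   # one-eighth of the experts

qwen3_pct = reduction_pct(68.2, 48.1)  # figures quoted in the article
nano_pct = reduction_pct(62.1, 42.5)

print(experts_on_rank0, round(qwen3_pct), round(nano_pct))
```

The overall peak falls by about 30% rather than by a factor of eight, presumably because only the expert weights are sharded while other contributions to peak memory, such as activations and non-expert parameters, are not divided by ep_size.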

DeepEP fuses computation with communication. In a conventional setup, sending tokens to their experts and gathering the results back adds communication cost around the expert computation. DeepEP implements these token dispatch and combine operations in optimized GPU kernels, allowing communication to overlap with expert computation.
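To make the dispatch and combine steps concrete, here is a toy single-process Python sketch of their semantics. It does not use DeepEP and shows none of the kernel fusion or overlap. All names, the routing table, and the stand-in expert function are hypothetical.

```python
from collections import defaultdict


def dispatch(routes, num_experts, ep_size):
    """Group (token_id, expert_id, weight) routes by the rank owning the expert."""
    per_rank = num_experts // ep_size
    buckets = defaultdict(list)
    for token_id, expert_id, weight in routes:
        buckets[expert_id // per_rank].append((token_id, expert_id, weight))
    return buckets


def combine(expert_outputs, num_tokens):
    """Sum weighted expert outputs back into the original token order."""
    out = [0.0] * num_tokens
    for token_id, weight, value in expert_outputs:
        out[token_id] += weight * value
    return out


def toy_expert(expert_id, x):
    """Stand-in for an expert MLP: scales the input by (expert_id + 1)."""
    return x * (expert_id + 1)


tokens = [1.0, 2.0, 3.0]
# Hypothetical router output: (token_id, expert_id, gate weight).
routes = [(0, 0, 0.5), (0, 3, 0.5), (1, 1, 1.0), (2, 2, 0.25), (2, 3, 0.75)]

buckets = dispatch(routes, num_experts=4, ep_size=2)
outputs = []
for rank in sorted(buckets):  # on real hardware each rank runs in parallel
    for token_id, expert_id, weight in buckets[rank]:
        outputs.append((token_id, weight, toy_expert(expert_id, tokens[token_id])))

result = combine(outputs, len(tokens))
print(result)  # [2.5, 4.0, 11.25]
```

In a multi-GPU run, the two grouping steps correspond to all-to-all exchanges between ranks, which is the communication that the article says DeepEP overlaps with expert computation.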

TransformerEngine kernels accelerate operations such as fused attention, linear layers, and RMSNorm, benefiting both MoE layers and standard Transformer layers.
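For reference, RMSNorm rescales each vector by the reciprocal of its root mean square and then applies a learned per-channel gain. The unfused pure-Python version below is only meant to show what the operation computes. It is not TransformerEngine code, and the function name and eps default are illustrative.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Unfused reference RMSNorm: x / sqrt(mean(x^2) + eps) * gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]


# With unit gain and eps=0, the output has a root mean square of 1.
y = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
mean_sq_out = sum(v * v for v in y) / len(y)
print(y, mean_sq_out)
```

A fused kernel performs the reduction, scaling, and gain multiply in a single GPU pass instead of several separate ones, which is the kind of saving the paragraph above attributes to TransformerEngine.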

Experiments on the Qwen3-30B-A3B and Nemotron 3 Nano 30B-A3B models show that, compared to Transformers v5, this solution increases training throughput by 3.4x to 3.7x while reducing memory consumption by 29% to 32%. NVIDIA also released results for full-parameter fine-tuning of the Nemotron 3 Ultra 550B A55B model on a 16-node H100 cluster with 128 GPUs, achieving a TPS/GPU of 815, TFLOP/s/GPU of approximately 293, and peak memory of 58.2 GiB. NVIDIA stated that Transformers v5 cannot run at this scale due to memory exhaustion.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com