NVIDIA Accelerates DiffusionGemma Speed by 4x

2026-06-11 10:25

en.Wedoany.com Reported - Google DeepMind has released an experimental open model called DiffusionGemma, designed for extremely fast text generation. NVIDIA has optimized this model to run faster on NVIDIA GeForce RTX GPUs, NVIDIA RTX PRO platforms, and NVIDIA DGX Spark systems, covering environments from local PCs to the cloud.

Unlike traditional text generation methods that produce tokens one at a time, DiffusionGemma generates multiple tokens in parallel and outputs entire blocks of text. The model combines a diffusion head with Google's Gemma 4 architecture, a mixture-of-experts design with 26 billion total parameters of which only 3.8 billion are active per step. On local hardware, DiffusionGemma can generate text up to 4x faster than equivalent autoregressive models. As an open model, DiffusionGemma is released with weights under the permissive Apache 2.0 license and runs entirely locally on RTX and DGX Spark without cloud dependency, with launch support in Hugging Face Transformers, vLLM, and Unsloth. Users can also test DiffusionGemma for free through the NVIDIA-hosted API on build.nvidia.com.

Most large language models (LLMs) in widespread use today are autoregressive: they produce one token at a time, with each new token depending on the tokens before it. DiffusionGemma, based on the Gemma 4 26B mixture-of-experts architecture, generates text the way diffusion models generate images: it starts from noise and refines an entire block of text at once. At each step, the model denoises up to 256 tokens in parallel. For latency-sensitive single-user tasks such as interactive chat, agent loops, or on-device assistants, this parallelism delivers responses fast enough to keep pace with interactive development and iteration.
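To make the difference in generation pattern concrete, the following minimal sketch counts model forward passes for each approach. The 256-token block size comes from the figures above; the number of denoising steps per block (`steps_per_block`) is a hypothetical value chosen only for illustration, since the article does not state it. A block-parallel pass also costs more than a single-token pass, so the ratio of pass counts is not itself the speedup.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """One forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(
    num_tokens: int,
    block_size: int = 256,      # tokens denoised in parallel (from the article)
    steps_per_block: int = 16,  # hypothetical: not stated in the article
) -> int:
    """Each block of up to `block_size` tokens is refined over several passes."""
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block


if __name__ == "__main__":
    n = 1024
    print("autoregressive passes:", autoregressive_passes(n))
    print("block-diffusion passes:", block_diffusion_passes(n))
```

Under these assumptions, a 1024-token response takes 1024 sequential passes autoregressively, but only 4 blocks times 16 steps when generated block by block.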

Autoregressive LLMs, which generate one token at a time, are often limited by memory bandwidth, leaving much of the GPU's compute underused. DiffusionGemma instead processes complete token blocks in parallel through the Transformer, producing a compute-intensive workload that makes fuller use of NVIDIA GPUs. According to the reported figures, DiffusionGemma reaches 1,000 tokens/sec on a single NVIDIA H100 Tensor Core GPU and 150 tokens/sec on NVIDIA DGX Spark, with the fastest local inference on NVIDIA DGX Station, approximately 4x faster than equivalent autoregressive models in the same single-user scenario.
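The memory-bandwidth limit can be illustrated with a rough back-of-the-envelope estimate: if every forward pass must read all active weights from memory, the pass rate cannot exceed bandwidth divided by bytes read. In the sketch below, the 3.8 billion active parameters come from the article, while the bandwidth (250 GB/s) and weight precision (2 bytes per parameter) are hypothetical round numbers, not specifications of any NVIDIA product. The estimate ignores KV-cache traffic, compute time, and the fact that a block of tokens in a mixture-of-experts model may touch more experts than a single token does.

```python
ACTIVE_PARAMS = 3.8e9  # active parameters per step (from the article)


def bandwidth_bound_tokens_per_sec(
    active_params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_sec: float,
    tokens_per_pass: float = 1.0,
) -> float:
    """Upper bound on tokens/sec when each pass reads all active weights once.

    `tokens_per_pass` is the average number of tokens finalized per weight
    read: 1 for autoregressive decoding, larger when tokens are refined
    in parallel.
    """
    if min(active_params, bytes_per_param, bandwidth_bytes_per_sec) <= 0:
        raise ValueError("all inputs must be positive")
    bytes_per_pass = active_params * bytes_per_param
    passes_per_sec = bandwidth_bytes_per_sec / bytes_per_pass
    return passes_per_sec * tokens_per_pass


if __name__ == "__main__":
    # Hypothetical hardware: 250 GB/s memory bandwidth, 16-bit weights.
    ceiling = bandwidth_bound_tokens_per_sec(ACTIVE_PARAMS, 2.0, 250e9)
    print(f"single-token ceiling: {ceiling:.1f} tokens/sec")  # about 32.9
```

Raising `tokens_per_pass` lifts the ceiling proportionally, which is the intuition behind block-parallel generation: the same weight traffic is amortized over more tokens, so the workload shifts from bandwidth-bound toward compute-bound.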

This performance advantage spans NVIDIA's product line: the DGX Spark desktop personal AI supercomputer, powered by the NVIDIA GB10 Grace Blackwell Superchip with 128GB of unified memory; the RTX PRO 6000 workstation, which gives developers ample headroom to run the model locally; the DGX Station, which offers inference speeds of up to 800 tokens/sec with 748GB of coherent memory; and GeForce RTX GPUs, with llama.cpp support upcoming.

Hugging Face Transformers is the fastest way to get started with DiffusionGemma on a GeForce RTX 5090 or DGX Spark. For higher-throughput inference, vLLM offers serving support at launch. Users can fine-tune the model for specific tasks or domains using the Unsloth and NVIDIA NeMo frameworks. For more technical details, refer to the NVIDIA technical blog and Google DeepMind's official announcement.
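As a rough sketch of what querying a locally served model might look like, the example below builds a request for vLLM's OpenAI-compatible `/v1/completions` endpoint using only the Python standard library. It assumes a vLLM server is already running on the default `localhost:8000` and that DiffusionGemma is exposed through this endpoint; the article does not confirm the exact serving interface. The model identifier is a placeholder, because the published model ID is not given here.

```python
import json
import urllib.request

# Placeholder: the actual model ID is not given in the article.
MODEL_ID = "MODEL_ID_PLACEHOLDER"


def build_completion_request(
    prompt: str,
    model: str = MODEL_ID,
    max_tokens: int = 256,
    base_url: str = "http://localhost:8000",  # assumed local vLLM server
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response (requires a live server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send(build_completion_request("Hello"))` would return the decoded JSON response; building the request is kept separate so it can be inspected without network access.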

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com