Google Open-Sources Text Diffusion Model DiffusionGemma

2026-06-11 08:51


en.Wedoany.com reports: On June 10, Google released DiffusionGemma, an experimental open-source model. It uses a text diffusion architecture, is released under the Apache 2.0 license, and is aimed at researchers and developers exploring fast local inference, interactive text generation, and low-concurrency applications. On dedicated GPUs, it generates text up to 4 times faster than traditional autoregressive large language models.

DiffusionGemma takes a different technical approach from common large language models. Traditional autoregressive models typically generate tokens one by one from left to right, so the longer the text, the more noticeable the wait for each next piece of output. DiffusionGemma instead drafts a whole block of text at once and then refines it over multiple iterative steps. Google designed it as a Mixture of Experts model with 26B total parameters, of which approximately 3.8B are activated during inference. With quantization, it can fit on high-end consumer GPUs with 18GB of VRAM. For local developers, this means the model is not limited to large-scale cloud deployment: it can also handle tasks such as rapid editing, code completion, text rearrangement, and experimental generation on a single high-performance graphics card.
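The memory figures above can be sanity-checked with rough arithmetic. The sketch below estimates the size of the weights alone at several bit widths. The 26B, 3.8B, and 18GB figures come from the article; the bit widths are illustrative assumptions, since the quantization format Google targets is not stated here, and the estimate ignores activations, caches, and runtime overhead. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 26e9    # total parameters (from the article)
ACTIVE_PARAMS = 3.8e9  # parameters activated during inference (from the article)
VRAM_GB = 18           # VRAM figure quoted in the article

# Bit widths are assumptions for illustration, not a published spec.
for bits in (16, 8, 4):
    gb = weight_memory_gb(TOTAL_PARAMS, bits)
    print(f"{bits}-bit weights: {gb:.1f} GB, under {VRAM_GB} GB: {gb < VRAM_GB}")

# Share of the weights used per forward pass in the Mixture of Experts design.
print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions, only the 4-bit case (about 13 GB) leaves headroom within 18GB. In Mixture of Experts models generally, all experts usually need to be resident in memory even though only a fraction is active per pass, which is why the estimate uses the 26B total rather than the 3.8B active count.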

The model's speed advantage stems primarily from its parallel generation mechanism. Each forward pass of DiffusionGemma can generate 256 tokens in parallel, so tokens within an output block can attend to each other and be refined further in subsequent iterations. This structure is well suited to inline editing, code completion, non-linear text structures, mathematical diagrams, and tasks in which the output must satisfy joint constraints from the surrounding context. Google disclosed that DiffusionGemma can output more than 1,000 tokens per second on a single NVIDIA H100 and more than 700 tokens per second on an NVIDIA GeForce RTX 5090.
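To make the contrast with token-by-token decoding concrete, here is a toy sketch of block-parallel iterative decoding. It follows one common formulation of text diffusion (start from a fully masked block, propose every position in parallel, commit the most confident proposals, repeat). It is not DiffusionGemma's actual sampler, which the article does not describe. The `toy_denoiser` function is a hypothetical stand-in for a model forward pass, and unlike the real model this sketch never revises a token once committed.

```python
import random

MASK = None  # marks a position that has not been decided yet

def toy_denoiser(block, target, rng):
    """Stand-in for one forward pass: proposes a (token, confidence) pair
    for every masked position at once. A real model would predict these
    from the prompt and the current block; this toy just reads `target`."""
    return {i: (target[i], rng.random())
            for i, tok in enumerate(block) if tok is MASK}

def diffusion_decode(target, steps, seed=0):
    """Fill a block in roughly `steps` parallel passes instead of one pass per token."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    per_step = -(-len(target) // steps)  # ceiling division
    passes = 0
    while MASK in block:
        proposals = toy_denoiser(block, target, rng)
        passes += 1
        # Commit the most confident proposals; the rest stay masked.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:per_step]:
            block[i] = proposals[i][0]
    return block, passes

target = [f"tok{i}" for i in range(256)]  # 256-token block, as in the article
block, passes = diffusion_decode(target, steps=8)
print(passes, "parallel passes vs", len(target), "sequential steps")
```

In this toy run the 256-token block is filled in 8 passes, where left-to-right decoding would need 256. The step count of 8 is an arbitrary choice for illustration; the real trade-off between the number of refinement steps and output quality is not covered by the article.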

However, it is not a replacement for Gemma 4.

Google's positioning for DiffusionGemma is clear: it is an experimental model for speed-sensitive, interactive local workflows, and its overall output quality is lower than that of the standard Gemma 4. For applications that demand the highest generation quality, stability, and production-grade delivery, Google still recommends the standard Gemma 4. DiffusionGemma's advantages also do not carry over to every deployment environment. In high-concurrency cloud services, autoregressive models can already make full use of the available compute through batching, so parallel decoding in text diffusion brings less benefit there and may even increase serving costs. In other words, the model is better suited to low-to-medium batch sizes, local single-user setups, and development or experimental environments than to directly replacing mainstream cloud-based large model architectures.

The release is still significant for the information and communication technology sector and the AI development ecosystem. Until now, diffusion models have been known mainly for image and video generation, while text generation has long been dominated by autoregressive architectures. DiffusionGemma combines the text diffusion approach with the open Gemma model ecosystem, giving developers an alternative, speed-focused experimental platform. As demand grows for local AI, personal workstations, AI PCs, and edge devices, developers increasingly need fast generation, real-time modification, and privacy-sensitive processing that does not rely on remote cloud infrastructure. The open-source license also makes it easier for research institutions, tool vendors, and developers to experiment further with model architecture, inference engines, quantization schemes, and fine-tuning methods.

The impact on the industry chain will be concentrated in local AI inference, consumer GPUs, developer tools, and model service platforms. DiffusionGemma's weights are already available on Hugging Face, and the model can be used with tools such as MLX, vLLM, and Hugging Face Transformers. Google is also working with NVIDIA to optimize performance across the hardware stack, covering RTX consumer graphics cards as well as the RTX PRO, Hopper, and Blackwell enterprise platforms. Developments to watch include how well developer fine-tuning works, progress on ecosystem support such as llama.cpp, the model's real-world performance in code completion and real-time editing, and whether the text diffusion architecture can keep narrowing the output quality gap with high-quality autoregressive models. If this path continues to mature, local AI applications may gain faster generation responses, and the open model ecosystem may gain a new technical branch.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com