Google Unveils DiffusionGemma: 26B Open Model That Generates Text 4x Faster

Google's AI division has launched DiffusionGemma, a 26B Mixture of Experts (MoE) model that achieves up to 4x faster text output speeds on dedicated GPUs compared to traditional autoregressive architectures. It maintains compatibility with consumer-grade hardware for efficient text generation.
Breaking the Autoregressive Mold
How does DiffusionGemma differ from conventional text generation models? It processes entire text blocks simultaneously using advanced diffusion techniques, activating only 3.8B parameters during inference out of a total of 26B. This model can deliver over 1,000 tokens per second on NVIDIA H100 GPUs, revolutionizing text generation speed.
This shift from memory-bound to compute-bound operations optimizes hardware utilization, especially for local deployments where latency is critical for real-time text generation.
Technical Breakthroughs in AI Architecture
Innovative Architecture Features
What makes the architecture of DiffusionGemma innovative? Key features include:
- Bidirectional Attention Mechanism: Allows tokens to reference both past and future context during generation, enhancing output quality.
- Hybrid Attention Modes: Switches between causal attention (for prompt ingestion) and bidirectional attention (for refinement), optimizing performance.
- Block Autoregressive Diffusion Technique: Combines parallel processing with sequential stability for long-form outputs, improving efficiency.
Optimized Hardware Performance
How does DiffusionGemma perform on hardware? Quantized to 18GB VRAM, it achieves:
- 700+ tokens/sec on RTX 5090 consumer GPUs, showcasing powerful hardware optimization.
- Native support for NVIDIA's NVFP4 4-bit floating-point format enhances computational efficiency.
- Day-one compatibility with vLLM, Transformers, and Unsloth frameworks ensures seamless integration.
- The output quality of DiffusionGemma remains lower than that of standard Gemma 4.
- Speed gains of DiffusionGemma peak in low-concurrency local deployments.
- For cloud production workloads, it is still recommended to use autoregressive models.