On June 10, 2026, Google DeepMind released DiffusionGemma 26B-A4B-it — a model that replaces the standard one-token-at-a-time autoregressive inference loop with a text diffusion approach that generates 15–20 tokens per forward pass. The result: 1100+ tokens per second on an H100 in FP8, which is roughly 4x faster than a comparably-sized autoregressive model on the same hardware.
It’s open-weight, Apache 2.0, fits in 18GB of VRAM when quantized to NVFP4, and has day-zero support in vLLM, Transformers, MLX, and Unsloth.
The trade-off is real: DiffusionGemma scores substantially lower than Gemma 4 on reasoning, math, and coding. Google is explicitly calling this release experimental and recommending Gemma 4 for quality-critical production workloads. For builders, the question is whether the speed profile makes sense for your specific use case — and in several cases, it might.
What Text Diffusion Actually Means (and Why It’s Fast)
Standard autoregressive language models generate one token at a time. Each token depends on the previous tokens, so the process is inherently sequential. You can batch multiple requests, but you can’t parallelize the generation of a single response.
Text diffusion works differently. The model starts with a “canvas” of 256 masked tokens and iteratively denoises them — predicting the most likely tokens, committing the high-confidence ones, and re-running the diffusion pass on the remainder. Each denoising step can update 15–20 tokens simultaneously. With up to 48 maximum denoising steps and adaptive stopping when predictions stabilize, a typical 512-token response might require 20–30 denoising passes instead of 512 autoregressive steps.
The architecture uses an encoder-decoder design: an autoregressive encoder processes and caches the prompt context, then a bidirectional-attention decoder runs the diffusion passes over the generation canvas. The bidirectional attention in the decoder is what allows parallel token prediction — unlike autoregressive models, the decoder can “see” the entire canvas position at once, not just positions to the left.
This is the same principle that made diffusion models dominant in image generation (where every pixel can be denoised in parallel). DiffusionGemma applies it to discrete text tokens.
Model Specs
| Property | Value |
|---|---|
| Total parameters | 25.2B |
| Active parameters | 3.8B |
| Expert configuration | 8 active / 128 total + 1 shared |
| Layers | 30 |
| Context window | 256K tokens |
| Canvas size | 256 tokens per pass |
| Vocabulary | 262K tokens |
| Vision encoder | ~550M parameters |
| Supported inputs | Text, image, video (≤60s @ 1fps) |
| Supported languages | Pre-trained on 140+, strong on 35+ |
| Training data cutoff | January 2025 |
| License | Apache 2.0 |
The MoE design keeps the active compute low. At 3.8B active parameters per forward pass (despite 25.2B total weights), the inference cost per token batch is closer to a 4B dense model than a 26B dense model. This is why the VRAM requirement is manageable: you need room to store the 25.2B weights plus activation memory, but the compute per step is light.
Benchmark Results
Be honest with yourself about these numbers before deploying.
Reasoning and Knowledge
| Benchmark | DiffusionGemma | Gemma 4 |
|---|---|---|
| MMLU Pro | 77.6% | 82.6% |
| AIME 2026 (no tools) | 69.1% | 88.3% |
| GPQA Diamond | 73.2% | 82.3% |
| BigBench Extra Hard | 47.6% | 64.8% |
The AIME gap is the most telling: 69.1% vs 88.3%. On competition math, the quality cost of diffusion-based generation is significant. Gemma 4 is clearly better for tasks requiring multi-step reasoning.
Multimodal
| Benchmark | DiffusionGemma | Gemma 4 |
|---|---|---|
| MMMU Pro | 54.3% | 73.8% |
| MATH-Vision | 70.5% | 82.4% |
| OmniDocBench 1.5 | 0.319 | 0.149 |
The OmniDocBench number is the interesting exception: DiffusionGemma outperforms Gemma 4 on document parsing. OCR, layout-aware extraction, and dense document understanding are tasks where the bidirectional attention in the diffusion decoder provides a structural advantage. If document processing is your primary workload, this score is worth taking seriously.
Inference Speed
| Hardware | Tokens per second (FP8 / NVFP4) |
|---|---|
| H100 | 1100+ |
| RTX 5090 | 700+ |
These numbers are for batch inference. Single-request latency depends on the diffusion pass count, which varies with output length and content complexity.
Hardware Requirements
| Configuration | VRAM | Notes |
|---|---|---|
| NVFP4 quantized | ~18GB | RTX 5090 viable (700+ tok/s) |
| BF16 full precision | ~50GB+ | Multi-GPU setup required |
| FP8 | ~28GB | Single H100 preferred |
The NVFP4 path is the practical consumer-GPU option. An RTX 5090 with 32GB VRAM can run the quantized model. RTX 4090 (24GB) is marginal — you may be able to run quantized inference, but with limited headroom for context.
NVIDIA has published an NVFP4-quantized model (nvidia/diffusiongemma-26B-A4B-it-NVFP4) specifically for this hardware path.
Deployment
vLLM (recommended for production)
pip install vllm
vllm serve "google/diffusiongemma-26B-A4B-it"
Once running, the server exposes an OpenAI-compatible endpoint. Any client that uses the openai Python library can redirect with two env variable changes:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="google/diffusiongemma-26B-A4B-it",
messages=[{"role": "user", "content": "Summarize this document..."}],
max_tokens=1024
)
Transformers (for experimentation)
from transformers import DiffusionGemmaForBlockDiffusion, AutoProcessor
MODEL_ID = "google/diffusiongemma-26B-A4B-it"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = DiffusionGemmaForBlockDiffusion.from_pretrained(
MODEL_ID,
dtype="auto",
device_map="auto",
)
messages = [{"role": "user", "content": "Why is the sky blue?"}]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
text = processor.decode(output[0], skip_special_tokens=False)
Note: DiffusionGemmaForBlockDiffusion is a new class specific to this model. Standard AutoModelForCausalLM will not work — this model’s generation loop is different from autoregressive models.
Image input
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://example.com/document.jpg"},
{"type": "text", "text": "Extract all text from this document."}
]
},
]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
Image placement matters: Google’s documentation is explicit — place images before text in your prompts for best performance.
Token budgets for image resolution: 70, 140, 280, 560, 1120 tokens. Lower budgets (70–280) for classification and captioning; higher budgets (560–1120) for OCR and document parsing where detail matters.
NVFP4 via NVIDIA build (for RTX 5090)
# Pull from NVIDIA's hosted endpoint (no local download required)
docker model run hf.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4
Or pull and run locally using the HuggingFace weights directly.
Diffusion Sampling Settings
The model has configurable denoising parameters. Google’s recommended defaults:
| Parameter | Value |
|---|---|
| Maximum denoising steps | 48 |
| Temperature start | 0.8 |
| Temperature end | 0.4 |
| Temperature schedule | Linear decay |
| Entropy bound | 0.1 |
| Entropy threshold | 0.005 |
Adaptive stopping kicks in when per-canvas entropy drops below 0.005 and predictions stabilize. For short factual responses, this typically terminates well before 48 steps. For longer, more complex outputs, expect the full step count.
You can trade quality for speed by reducing max_denoising_steps — fewer passes means faster generation but more noise in the output. Experiment on your specific workload to find the right balance.
Thinking Mode
DiffusionGemma includes a thinking mode that produces internal reasoning before a final answer:
# Enable thinking with system prompt
messages = [
{"role": "system", "content": "<|think|>Work through the problem step by step before answering."},
{"role": "user", "content": "What's the most efficient way to process 10,000 PDFs?"}
]
Thinking output is wrapped in <|channel>thought\n[reasoning]<channel|>[final answer] markers. The reasoning token stream is separate from the output and can be surfaced to users or discarded.
Important: thinking mode adds denoising passes for the reasoning canvas before the answer canvas. It will slow down overall generation relative to non-thinking mode. For high-throughput use cases, profile carefully before enabling thinking by default.
When to Use DiffusionGemma (and When Not To)
Strong use cases
Document extraction and OCR. The OmniDocBench 1.5 result (0.319 vs Gemma 4’s 0.149) is the clearest signal that DiffusionGemma has a genuine advantage in document-parsing workflows. If you’re processing PDFs, invoices, forms, or dense structured documents, the combination of speed and this specific quality advantage makes it worth evaluating seriously.
High-throughput batch inference. Summarizing thousands of short texts, generating product descriptions at scale, or any batch workflow where throughput matters more than frontier reasoning quality. At 1100 tokens/second on a single H100, you’re covering substantially more ground per GPU-hour.
Interactive applications requiring real-time streaming. The parallel generation means first-token latency is different from autoregressive models — but steady-state throughput is high. For chat-adjacent interfaces where users are watching tokens appear, the speed profile can feel noticeably snappier.
Multimodal pipelines with image/video input. The vision encoder handles object detection, captioning, chart reading, and video frame analysis. At 256K context, you can process a long document with many embedded images in a single inference call.
Cases where Gemma 4 is better
Multi-step reasoning. The AIME 69.1% vs 88.3% gap is not noise. If your workflow involves complex mathematical reasoning, extended logic chains, or any task that requires consistent multi-step coherence, use Gemma 4.
Coding. Google’s own model card notes weaker coding evaluation performance relative to Gemma 4. For anything beyond simple code generation or syntax correction, the quality difference is likely to show in production.
Research or analysis tasks. BigBench Extra Hard (47.6% vs 64.8%) suggests that tasks requiring extended, structured reasoning will produce meaningfully lower quality outputs.
Comparison to Alternatives
| Use Case | Best Choice | Why |
|---|---|---|
| Fastest open-weight text generation | DiffusionGemma | 4x faster, Apache 2.0 |
| Highest quality open-weight general reasoning | Gemma 4 (27B) | +5–19 points across benchmarks |
| Best open-weight agentic coding | North Mini Code (Cohere) | Purpose-built, SWE-Bench 61% pass@1 |
| Document OCR and extraction | DiffusionGemma | OmniDocBench best-in-class |
| Consumer GPU (24GB VRAM) | Gemma 4 27B FP8 | DiffusionGemma NVFP4 marginal on 4090 |
| Consumer GPU (32GB VRAM) | DiffusionGemma NVFP4 | RTX 5090 viable, 700+ tok/s |
| No GPU budget | Gemini 3.5 Flash API | Speed + quality at hosted API rates |
Limitations to Know Before Deploying
Training cutoff is January 2025. The model’s world knowledge stops there. For tasks requiring up-to-date information, you need retrieval augmentation.
Canvas size constrains coherence. Each 256-token canvas is denoised semi-independently. Very long structured outputs — particularly those requiring strict formatting or referencing content from many pages back — can show incoherence at canvas boundaries. This is an architectural constraint, not something you can tune away.
Diffusion != autoregressive for all tasks. The generation mechanics change what “temperature” and “sampling” mean. Some prompting techniques that work reliably for autoregressive models (few-shot examples, specific chain-of-thought formats) behave differently under text diffusion. Plan for a re-evaluation pass if you’re porting an existing prompt library.
Experimental positioning. Google’s own release notes describe DiffusionGemma as experimental and do not recommend it for production workloads where quality is the primary constraint. That’s a meaningful signal from the team that built it.
Builder Checklist
- Confirm your use case: speed-critical (batch/throughput) OR document extraction → evaluate DiffusionGemma; reasoning/coding → use Gemma 4
- Hardware path: NVFP4 for RTX 5090 (700+ tok/s), FP8 for H100 (1100+ tok/s), BF16 for multi-GPU research
- Use
DiffusionGemmaForBlockDiffusionclass in Transformers — standardAutoModelForCausalLMwill not work - Place images before text in multimodal prompts
- Set image token budget based on detail level: 70–280 for captioning/classification, 560–1120 for OCR/documents
- Benchmark your specific workload — don’t rely only on MMLU numbers for your decision
- Profile thinking mode overhead before enabling by default in high-throughput pipelines
- Plan for canvas boundary effects in very long structured outputs
- Implement retrieval augmentation if your task requires post-January-2025 world knowledge
- Review Apache 2.0 license terms for your commercial deployment (permissive; confirm with legal for regulated industries)
Model weights: google/diffusiongemma-26B-A4B-it NVFP4 variant: nvidia/diffusiongemma-26B-A4B-it-NVFP4 Official announcement: blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/ NVIDIA build endpoint: build.nvidia.com/google/diffusiongemma-26b-a4b-it