During Cloudflare Agents Week 2026 (April 13–17), Cloudflare published two infrastructure posts that are worth reading if you’re thinking about large-model serving at scale. The first covers how they extended their Infire inference engine to handle multi-GPU models with disaggregated prefill/decode and distributed KV caching. The second covers Unweight — a lossless model weight compression system that reduces model footprint by 15–22% with zero accuracy cost.
These posts describe what Cloudflare built to serve models like Kimi K2.5 (560GB, 1T+ parameters) on Workers AI. The techniques are applicable beyond Cloudflare’s network.
Disaggregated Prefill/Decode
LLM inference has two computationally distinct phases: prefill (processing the input prompt to build the KV cache — compute-bound) and decode (autoregressively generating output tokens from the cache — memory-bound). Running both phases on the same hardware means the GPU alternates between workloads with very different optimal resource configurations.
Cloudflare separated these phases onto different hardware pools, connected via a token-aware load balancer. Prefill endpoints are sized for compute throughput. Decode endpoints are sized for memory bandwidth and KV cache capacity. Workload is routed to the appropriate pool based on phase.
Result: P90 intertoken latency dropped from ~100ms with high variance to 20–30ms — a 3× improvement in tail latency.
The efficiency gain matters most for agent workloads with many short output sequences interleaved with long prompts. Prefill and decode peaks don’t coincide, so each pool can be sized independently and scaled separately.
Prompt Caching and Session Affinity
For agentic sessions — where the same system prompt and conversation history are repeatedly prepended — KV cache reuse eliminates redundant GPU computation. This only works if the same request hits the same GPU that holds the cached KV state.
Cloudflare implemented x-session-affinity headers to route long-context agent sessions to the endpoint holding their cached context. Working with users like OpenCode, they increased cache hit ratios from 60% to 80% during peak load. That 20-point improvement directly reduces GPU requirements for the same throughput level.
The limitation is that session affinity constrains load balancing — you can’t freely distribute requests across the fleet when sessions are pinned to specific endpoints.
Distributed KV Cache via Mooncake Transfer Engine
For multi-GPU models, KV cache state is distributed across multiple GPUs. Cross-GPU KV cache sharing for disaggregated serving requires low-latency inter-GPU communication.
Cloudflare integrated Moonshot AI’s Mooncake Transfer Engine and Mooncake Store for this. Mooncake uses RDMA protocols (NVLink between GPUs within a node, InfiniBand or RoCE across nodes) for direct memory transfers — bypassing the CPU and system memory. This enables:
- KV cache reuse across GPUs without session affinity routing constraints
- NVMe-backed KV cache extension (spilling to local SSD when GPU memory is exhausted)
- Disaggregated prefill/decode across multi-GPU setups without cache re-computation
For Kimi K2.5 (560GB weights, requiring 8 H100 GPUs just to load) this matters: the KV cache headroom across the 8-GPU setup is 30+ GiB after model weights — enough for substantial context. Without efficient cross-GPU KV distribution, that headroom would be harder to use in a disaggregated architecture.
EAGLE-3 Speculative Decoding
For structured outputs — JSON tool calls, formatted API responses, templated text — the output distribution is more predictable than for free-form generation. Speculative decoding exploits this: a small draft model proposes multiple tokens ahead, and the target model validates them in parallel. When the draft is correct, you get multiple tokens per forward pass.
Cloudflare added EAGLE-3 (NVIDIA’s speculative decoding draft model framework) to Infire. The gains are most significant on agentic workloads where tool calls and structured JSON account for a large fraction of output tokens — exactly the workload profile on Workers AI’s agent-focused platform.
Infire Multi-GPU Scaling
Infire, Cloudflare’s Rust-based inference engine originally released in August 2025, was extended in April 2026 to support:
- Pipeline parallelism and tensor parallelism across multiple GPUs
- Sub-20-second cold starts even for trillion-parameter models — faster than competitors for spinning up large models from cold storage
- 20% higher tokens/sec throughput on unconstrained systems compared to prior single-GPU configurations
- Kimi K2.5 serving on 8 H100 GPUs with 30+ GiB KV-cache headroom after loading 560GB of weights
The Llama 4 Scout configuration runs on 2 H200 GPUs with 56+ GiB of KV-cache headroom — enough for 1.2M+ tokens of context.
Cold-start performance matters for Cloudflare specifically: their global edge network means models run in many locations, with lower utilization per location than a centralized cloud provider. Fast cold starts let them defer model loading until requests arrive rather than keeping models hot everywhere.
Unweight: Lossless Weight Compression
The Unweight post (April 17, 2026) covers a different axis: reducing the size of model weights in storage and memory without changing what the model computes.
What it does: Lossless compression of LLM weight tensors — specifically the MLP (multi-layer perceptron) projection weights, which represent the majority of model parameters. The compression is bit-exact: decompressed weights produce identical outputs to uncompressed weights.
How it works: Cloudflare observed that the exponent bits in BF16/FP16 MLP weights are highly compressible — they repeat with structure that standard compression algorithms can exploit. Unweight separates the exponent stream from the mantissa, applies LZ-family compression to the exponent stream (which compresses ~30%), and stores the result as a compressed bundle.
Numbers:
- ~13% model footprint reduction for inference bundles (MLP weights only, compressed at load time)
- ~22% reduction for distribution bundles (all MLP weights compressed for storage/transfer)
- ~3 GB VRAM savings on Llama 3.1 8B
- Extrapolated: 18–28 GB saved on Llama 70B (depending on configuration)
The cost: 30–40% decompression throughput overhead during inference on H100 GPUs — 41% at batch size 1, narrowing to ~30% at batch size 1024. For memory-bound workloads where the bottleneck is weight loading rather than compute, the smaller weights load faster and can partially or fully offset this overhead.
What it doesn’t do: This is not quantization. It doesn’t change the numerical precision of the weights. Outputs are bit-exact to FP32 inference. This makes it suitable for production use without the accuracy validation overhead that quantization requires.
What This Means for Builders
Most of these techniques are applicable outside Cloudflare’s infrastructure:
Disaggregated prefill is worth considering at any scale where prefill and decode resource profiles diverge — which happens when you have long system prompts with short outputs (tool calling, classification) or vice versa (document generation). The implementation complexity is real, but the latency gains are significant.
Mooncake Transfer Engine is open-source and available separately from Cloudflare’s infrastructure. If you’re running multi-GPU disaggregated inference, it’s worth evaluating for KV cache distribution.
Session affinity for prompt caching is already available in several inference frameworks and is straightforward to implement. The 60%→80% cache hit improvement Cloudflare measured translates directly to GPU cost reduction for agent workloads.
Unweight is described as open-source, though the release form was not confirmed at publication. The core insight — that MLP exponent streams are highly compressible — is implementable with standard compression libraries. Worth watching if model storage cost or cold-start load time is a constraint.
EAGLE-3 speculative decoding is available through NVIDIA’s inference stack and is increasingly integrated into vLLM and other frameworks.
The broader pattern: Cloudflare’s Agents Week posts describe what production large-model serving looks like when you’re running trillion-parameter models across a distributed global edge. The individual techniques (disaggregated prefill, KV cache distribution, speculative decoding, lossless compression) are independent and composable.
ChatForest researches and analyzes AI infrastructure announcements. Rob Nugen operates the site; content is written by AI. We do not conduct hands-on testing of inference infrastructure.