Scaling WideEP Mixture-of-Experts inference with Google Cloud A4X (GB200) and NVIDIA Dynamo

As organizations transition from standard LLMs to massive Mixture-of-Experts (MoE) architectures like DeepSeek-R1, the primary constraint has shifted from raw compute density to communication latency and memory bandwidth. Today, we’re releasing two new validated recipes designed to help customers overcome the infrastructure bottlenecks of the agentic AI era. Built on the A4X machine series powered by NVIDIA GB200 NVL72 and on NVIDIA Dynamo, these recipes provide clear steps to optimize both throughput and latency, and they extend the reference architecture we published in September 2025 for disaggregated inference on A3 Ultra (NVIDIA H200) VMs.

We’re bringing the best of both worlds to AI infrastructure by combining the multi-layered scalability of Google Cloud with the rack-scale acceleration of A4X. These recipes are part of a broader collaboration between our organizations that includes investments in important inference infrastructure such as Dynamic Resource Allocation (DRA) and Inference Gateway.

Highlights of the updated reference architecture include: 

  • Infrastructure: Google Cloud’s A4X machine series, powered by NVIDIA GB200 NVL72, creating a single 72-GPU compute domain connected with fifth-generation NVIDIA NVLink.

  • Serving architecture: NVIDIA Dynamo functions as the distributed runtime, managing KV cache state and kernel scheduling across the rack-scale fabric.

  • Performance: For an 8K/1K input sequence length (ISL) / output sequence length (OSL) workload, we achieved over 6K total tokens/sec/GPU in throughput-optimized configurations and 10ms inter-token latency (ITL) in latency-optimized configurations.

  • Deployment: A validated reference architecture is available today for deploying this stack on Google Cloud using Google Kubernetes Engine (GKE) for orchestration.

The modern inference stack

To achieve exascale performance, inference cannot be treated as a monolithic workload. It requires a modular architecture where every layer is optimized for specific throughput and latency targets. The AI Hypercomputer inference stack consists of three distinct layers (summarized in the sketch after this list):

  1. Infrastructure layer: The physical compute, networking, and storage fabric (e.g., A4X).

  2. Serving layer: The specific model architecture and the optimized execution kernels (e.g., NVIDIA Dynamo, NVIDIA TensorRT-LLM, Pax) and runtime environment managing request scheduling, KV cache state, and distributed coordination.

  3. Orchestration layer: The control plane for resource lifecycle management, scaling, and fault tolerance (e.g., Kubernetes).
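For concreteness, the specific instantiation described in this post can be summarized as a simple mapping of layers to components. This is an illustrative sketch only; the component names come from the text above, and the dictionary is not a configuration format used by any of these products.

```python
# Illustrative only: the three-layer AI Hypercomputer inference stack used in
# this reference architecture, expressed as plain data.
REFERENCE_STACK = {
    "infrastructure": {
        "machine_series": "A4X",          # NVIDIA GB200 NVL72 rack-scale system
        "nvlink_domain_gpus": 72,
    },
    "serving": {
        "runtime": "NVIDIA Dynamo",
        "engines": ["NVIDIA TensorRT-LLM", "SGLang"],
        "responsibilities": ["request scheduling", "KV cache state",
                             "distributed coordination"],
    },
    "orchestration": {
        "control_plane": "GKE",
        "responsibilities": ["resource lifecycle", "scaling", "fault tolerance"],
    },
}
```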

In the reference architecture detailed below, we focus on a specific, high-performance instantiation of this stack designed for the NVIDIA ecosystem. We combine A4X at the infrastructure layer with NVIDIA Dynamo at the serving layer, orchestrated by GKE.

Infrastructure layer: The A4X rack-scale architecture

In our A4X launch announcement in February 2025, we described how the A4X VM addresses bandwidth constraints by implementing the GB200 NVL72 architecture, which fundamentally alters the topology available to the scheduler.

Unlike previous generations where NVLink domains were bound by the server chassis (typically 8 GPUs), the A4X exposes a unified fabric, with:

  • 72 NVIDIA Blackwell GPUs interconnected via the NVLink Switch System that enables the 72 GPUs to operate as one giant GPU with unified shared memory

  • 130 TB/s aggregate bandwidth, enabling all-to-all communication with latency profiles comparable to on-board memory access (72 GPUs x 1.8 TB/s/GPU)

  • Native NVFP4 support: Blackwell Tensor Cores support 4-bit floating point precision, effectively doubling throughput relative to FP8 for compatible model layers. We used FP8 precision for this benchmark to keep the configuration consistent with, and comparable to, previously published results.

Serving layer: NVIDIA Dynamo

Hardware of this scale requires a runtime capable of managing distributed state without introducing synchronization overhead. NVIDIA Dynamo serves as this distributed inference runtime. It moves beyond simple model serving to coordinate the complex lifecycle of inference requests across the underlying infrastructure.

The serving layer optimizes utilization on the A4X through these specific mechanisms (two conceptual sketches follow the list):

  1. Wide Expert Parallelism (WideEP): Traditional MoE serving shards experts within a single node (typically 8 GPUs), leading to load imbalances when specific experts become “hot.” We use the A4X’s unified fabric to distribute experts across the full 72-GPU rack. This WideEP configuration absorbs bursty expert activation patterns by balancing the load across a massive compute pool, helping to ensure that no single GPU becomes a straggler.

  2. Deep Expert Parallelism (DeepEP): While WideEP distributes the experts, DeepEP optimizes the critical “dispatch” and “combine” communication phases. DeepEP accelerates the high-bandwidth all-to-all operations required to route tokens to their assigned experts. This approach minimizes the synchronization overhead that typically bottlenecks MoE inference at scale.

  3. Disaggregated request processing: Dynamo decouples the compute-bound prefill phase from the memory-bound decode phase. On the A4X, this allows the scheduler to allocate specific GPU groups within the rack to prefill (maximizing tensor core saturation) while other GPUs handle decode (maximizing memory bandwidth utilization), preventing resource contention.

  4. Global KV cache management: Dynamo maintains a global view of the KV cache state. Its routing logic directs requests to the specific GPU holding the relevant context, minimizing redundant computation and cache migration.

  5. JIT kernel optimization: The runtime leverages NVIDIA Blackwell-specific kernels, performing just-in-time fusion of operations to reduce memory-access overhead during the generation phase.
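To make the WideEP and DeepEP ideas above concrete, here is a minimal, framework-agnostic sketch of MoE token dispatch across a 72-GPU expert pool. It is illustrative only: the function and variable names are hypothetical and do not correspond to Dynamo, DeepEP, or TensorRT-LLM APIs, the expert counts are DeepSeek-R1-style assumptions, and a real implementation would use fused all-to-all kernels over NVLink rather than Python loops.

```python
import numpy as np

NUM_GPUS = 72          # one GB200 NVL72 rack-scale NVLink domain
NUM_EXPERTS = 256      # DeepSeek-R1-style MoE layer (assumed routed-expert count)
TOP_K = 8              # experts activated per token (assumed)

# WideEP: spread experts over the whole 72-GPU rack instead of a single
# 8-GPU server, so "hot" experts don't concentrate load on a few GPUs.
expert_to_gpu = np.arange(NUM_EXPERTS) % NUM_GPUS

def dispatch(router_logits: np.ndarray) -> dict[int, list[tuple[int, int]]]:
    """Map each token to the GPUs that host its top-k experts.

    router_logits: [num_tokens, NUM_EXPERTS] scores from the MoE router.
    Returns {gpu_id: [(token_id, expert_id), ...]} -- the per-GPU work lists
    that an all-to-all "dispatch" phase (what DeepEP accelerates) would carry.
    """
    topk = np.argsort(router_logits, axis=-1)[:, -TOP_K:]
    work: dict[int, list[tuple[int, int]]] = {g: [] for g in range(NUM_GPUS)}
    for token_id, experts in enumerate(topk):
        for expert_id in experts:
            work[int(expert_to_gpu[expert_id])].append((token_id, int(expert_id)))
    return work

# Quick look at load balance for a random batch: with experts spread across
# 72 GPUs, per-GPU token counts stay close to the mean even when routing is
# bursty, which is the straggler-avoidance property WideEP relies on.
rng = np.random.default_rng(0)
work = dispatch(rng.standard_normal((4096, NUM_EXPERTS)))
loads = [len(v) for v in work.values()]
print(f"per-GPU assignments: min={min(loads)} max={max(loads)} "
      f"mean={sum(loads) / NUM_GPUS:.1f}")
```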

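Similarly, here is a rough sketch of how disaggregated serving and KV-cache-aware routing fit together, under the simplifying assumption that decode workers advertise which prefix blocks they already hold. All names are hypothetical and are not Dynamo's API; the point is only the routing decision, not the cache implementation.

```python
from dataclasses import dataclass, field

BLOCK = 64  # tokens per KV cache block (illustrative granularity)

@dataclass
class DecodeWorker:
    """A group of GPUs dedicated to the memory-bound decode phase."""
    name: str
    cached_prefixes: set[tuple[int, ...]] = field(default_factory=set)

    def record(self, tokens: list[int]) -> None:
        # Remember which full blocks of this sequence now sit in local KV cache.
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            self.cached_prefixes.add(tuple(tokens[:end]))

def route(tokens: list[int], workers: list[DecodeWorker]) -> DecodeWorker:
    """Pick the decode worker holding the longest matching cached prefix,
    so previously computed KV blocks are reused instead of recomputed."""
    def overlap(w: DecodeWorker) -> int:
        best = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if tuple(tokens[:end]) in w.cached_prefixes:
                best = end
        return best
    return max(workers, key=overlap)

# Usage: prefill runs on compute-optimized GPU groups (not modeled here); the
# resulting KV blocks are registered on a decode worker, and follow-up turns
# of the same conversation are routed back to that worker.
workers = [DecodeWorker("decode-0"), DecodeWorker("decode-1")]
history = list(range(300))          # a prior 300-token conversation prefix
workers[1].record(history)
assert route(history + [300, 301], workers) is workers[1]
```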
Orchestration layer: Mapping software to hardware

While the A4X provides the physical fabric and Dynamo provides the runtime logic, the orchestration layer is responsible for mapping the software requirements to the hardware topology. For rack-scale architectures like the GB200 NVL72, container orchestration needs to evolve beyond standard scheduling. By making the orchestrator explicitly aware of the physical NVLink domains, we can fully unlock the platform’s performance and help ensure optimal workload placement.

GKE enforces this hardware-software alignment through these specific mechanisms (illustrative sketches of the model-loading and networking pieces follow the list):

1. Rack-level atomic scheduling: With GB200 NVL72, the “unit of compute” is no longer a single GPU or a single node — the entire rack is the new fundamental building block of accelerated computing. We use GKE capacity reservations with specific affinity settings, which target a reserved block of A4X infrastructure and guarantee dense deployment. By consuming this reservation, GKE helps ensure that all pods comprising a Dynamo instance land on the specific, physically contiguous rack hardware required to establish the NVLink domain, providing the hard topology guarantee needed for WideEP and DeepEP.

2. Low-latency model loading via GCS FUSE: Serving massive MoE models requires loading terabytes of weights into high-bandwidth memory (HBM). Traditional approaches that download weights to local disk incur unacceptable “cold start” latencies. We leverage the GCS FUSE CSI Driver to mount model weights directly from Google Cloud Storage as a local file system. This allows the Dynamo runtime to “lazy load” the model, streaming data chunks directly into GPU memory on demand. This approach eliminates the pre-download phase, significantly reducing the time-to-ready for new inference replicas and enabling faster auto-scaling in response to traffic bursts.

3. Kernel-bypass networking (GPUDirect RDMA): To keep pace with the A4X’s aggregate 130 TB/s fabric, the networking stack must minimize CPU and I/O involvement. We configure the GKE cluster to enable GPUDirect RDMA over the Titanium network adapter. By injecting specific NCCL topology configurations and enabling the IPC_LOCK capability in the container, we allow the application to bypass the OS kernel and perform Direct Memory Access (DMA) operations between the GPU and the network interface. This configuration frees the NVIDIA Grace CPU from data-path management, so that networking I/O does not become a bottleneck during high-throughput token generation.
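As an illustration of the model-loading pattern in point 2, here is a minimal sketch of streaming weight shards straight from a GCS FUSE mountpoint instead of pre-downloading them. The mount path and file layout are hypothetical, and a real deployment would rely on the serving engine's own loader; the point is simply that reads are served on demand through the mounted bucket.

```python
import os
from pathlib import Path

# Hypothetical mountpoint exposed by the GCS FUSE CSI driver in the pod spec;
# the bucket holds sharded model weights (e.g., *.safetensors files).
MODEL_DIR = Path(os.environ.get("MODEL_DIR", "/mnt/gcs/deepseek-r1-fp8"))
CHUNK = 64 * 1024 * 1024  # 64 MiB read granularity

def stream_shard(path: Path):
    """Yield a shard in large sequential chunks.

    Nothing is copied to local disk first: each read() is served by the FUSE
    mount, which fetches the corresponding byte range from Cloud Storage, so
    a replica can start pulling weights toward HBM as soon as it is scheduled.
    """
    with path.open("rb") as f:
        while chunk := f.read(CHUNK):
            yield chunk

def load_all_shards() -> int:
    total = 0
    for shard in sorted(MODEL_DIR.glob("*.safetensors")):
        for chunk in stream_shard(shard):
            # In a real loader this chunk would be decoded and copied to GPU
            # memory; here we just count bytes to keep the sketch self-contained.
            total += len(chunk)
    return total

if __name__ == "__main__":
    print(f"streamed {load_all_shards() / 1e9:.1f} GB from {MODEL_DIR}")
```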

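And for point 3, a small preflight check of the kind you might run inside the serving container before starting workers: it verifies that the memlock limit implied by the IPC_LOCK capability is effectively unlimited (needed for RDMA-registered memory) and that the NCCL variables you chose to inject are visible. The specific variable names checked here are placeholders; the exact set depends on your cluster's topology configuration.

```python
import os
import resource
import sys

def check_memlock() -> bool:
    """GPUDirect RDMA pins buffers for DMA; with IPC_LOCK granted to the
    container, RLIMIT_MEMLOCK should report 'unlimited'."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    ok = soft == resource.RLIM_INFINITY and hard == resource.RLIM_INFINITY
    print(f"RLIMIT_MEMLOCK: soft={soft} hard={hard} -> {'ok' if ok else 'too low'}")
    return ok

def check_nccl_env(required: list[str]) -> bool:
    """Confirm that the NCCL settings injected via the pod spec are visible."""
    missing = [name for name in required if name not in os.environ]
    for name in missing:
        print(f"missing NCCL env var: {name}")
    return not missing

if __name__ == "__main__":
    ok = check_memlock()
    # Placeholder variable names -- substitute whatever your GKE NCCL topology
    # injection actually provides.
    ok = check_nccl_env(["NCCL_SOCKET_IFNAME", "NCCL_DEBUG"]) and ok
    sys.exit(0 if ok else 1)
```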
Performance validation

We observed the following when assessing the scaling characteristics of an 8K/1K workload on DeepSeek-R1 (FP8), served with SGLang, for two distinct optimization targets (a GPU-count check follows the results).

1. Throughput-optimized configuration

  • Setup: All 72 GPUs utilizing DeepEP. 10 prefill nodes with 5 workers (TP8) and 8 decode nodes with 1 worker (TP32).

  • Result: We sustained over 6K total tokens/sec/GPU (1.5K output tokens/sec/GPU), which matches the performance published by InferenceMAX (source).

2. Latency-optimized configuration

  • Setup: 8 GPUs (two nodes) without DeepEP. 1 prefill node with 1 prefill worker (TP4) and 1 decode node with 1 decode worker (TP4). 

  • Result: We sustained a median Inter-Token Latency (ITL) of 10ms at a concurrency of 4, which matches the performance published by InferenceMAX (source).
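As a sanity check on these topologies, the GPU counts work out as follows, assuming the standard A4X layout of 4 Blackwell GPUs per node. This is a back-of-envelope script only; no benchmark code is implied.

```python
GPUS_PER_NODE = 4  # A4X: 4 NVIDIA Blackwell GPUs (and 2 Grace CPUs) per node

# Throughput-optimized: 10 prefill nodes / 5 TP8 workers, 8 decode nodes / 1 TP32 worker.
prefill_gpus = 10 * GPUS_PER_NODE   # 40
decode_gpus = 8 * GPUS_PER_NODE     # 32
assert prefill_gpus == 5 * 8        # 5 prefill workers x TP8
assert decode_gpus == 1 * 32        # 1 decode worker x TP32
assert prefill_gpus + decode_gpus == 72   # exactly one NVL72 rack

# Latency-optimized: 2 nodes, 1 TP4 prefill worker + 1 TP4 decode worker.
assert 2 * GPUS_PER_NODE == 1 * 4 + 1 * 4  # 8 GPUs total
print("topology GPU counts are consistent")
```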

Looking ahead

As models evolve from static chat interfaces to complex, multi-turn reasoning agents, the requirements for inference infrastructure will continue to shift. We are actively updating and releasing benchmarks and recipes as we invest across all three layers of the AI inference stack to meet these demands:

  • Infrastructure layer: The recently released A4X Max is based on the NVIDIA GB300 NVL72 in a single 72-GPU rack configuration, bringing 1.5X more NVFP4 FLOPs, 1.5X more GPU memory, and 2X higher network bandwidth compared to A4X.

  • Serving layer: We are actively exploring deeper integrations with components of NVIDIA Dynamo, e.g., pairing KV Block Manager with Google Cloud remote storage, funneling Dynamo metrics into our Cloud Monitoring dashboards for enhanced observability, and leveraging GKE Custom Compute Classes (CCC) for better capacity and obtainability, as well as setting a new baseline with FP4 precision.

  • Orchestration: We plan to incorporate additional optimizations into these tests, e.g. Inference Gateway as the intelligent inference scheduling component, following the design patterns established in the llm-d well-lit paths. We aim to provide a centralized mechanism for sophisticated traffic orchestration — handling request prioritization, queuing, and multi-model routing before the workload ever reaches the serving-layer runtime.

Whether you are deploying massive MoE models or architecting the next generation of reasoning agents, this stack provides the exascale foundation required to turn frontier research into production reality. 

Get started today

At Google Cloud, we’re committed to providing the most open, flexible, and performant infrastructure for your AI workloads. With full support for the NVIDIA Dynamo suite — from intelligent routing and scaling to the latest NVIDIA AI infrastructure — we provide a complete, production-ready solution for serving LLMs at scale.

We updated our deployment repository with two recipes for the A4X machine class: one throughput-optimized and one latency-optimized.

We look forward to seeing what you build!