Introducing Ironwood TPUs and new innovations in AI Hypercomputer

Today’s innovation isn’t born in a lab or at a drafting board; it’s built on the bedrock of AI infrastructure. AI workloads have new and unique demands. Addressing these requires a finely crafted combination of hardware and software for performance and efficiency at scale, and the ease-of-use and flexibility to access this infrastructure however it’s needed. At Google Cloud, we deliver this through AI Hypercomputer.

AI Hypercomputer is an integrated supercomputing system that’s distilled from more than a decade of Google’s expertise in AI. AI Hypercomputer underpins nearly every AI workload running on Google Cloud; it runs under the hood when you’re using Vertex AI, or you can get fine-grained control of your infrastructure with direct access to AI Hypercomputer’s performance-optimized hardware, open software, and flexible consumption models, all designed to deliver more intelligence at a consistently low price for training and serving AI workloads. This integrated systems approach is differentiated in the market, and is one of the reasons Gemini Flash 2.0 can achieve 24x higher intelligence per dollar than GPT-4o and 5x higher than DeepSeek-R1¹.

Today we’re introducing new innovations throughout the AI Hypercomputer stack that are designed in concert to deliver the highest intelligence per dollar for AI workloads.

Advances in performance-optimized hardware

We continue to expand our portfolio of performance-optimized hardware to offer a wide range of options for compute, networking, and storage.

Ironwood, our 7th generation TPU: Ironwood is built for inference, offering 5x more peak compute capacity and 6x the high-bandwidth memory (HBM) capacity compared to the prior generation, Trillium. Ironwood comes in two configurations: 256 chips or 9,216 chips, each available as a single scale-up pod, with the larger pod delivering a staggering 42.5 exaFLOPS of compute. Ironwood achieves this while being 2x more power efficient than Trillium, delivering significantly more value per watt. Developers can access Ironwood through our optimized stack, across PyTorch and JAX. Learn more about this groundbreaking TPU here.
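To put the pod-level figure in per-chip terms, the arithmetic can be sketched as below. This is a back-of-the-envelope calculation, not an official specification: it assumes peak compute scales linearly with chip count and that the 42.5 exaFLOPS figure applies at the same numeric precision for every chip. The function names are illustrative.

```python
# Figures quoted in the text above.
LARGE_POD_EXAFLOPS = 42.5   # peak compute of the 9,216-chip pod
LARGE_POD_CHIPS = 9_216
SMALL_POD_CHIPS = 256


def per_chip_petaflops(pod_exaflops: float, chips: int) -> float:
    """Peak petaFLOPS per chip implied by a pod-level exaFLOPS figure."""
    return pod_exaflops * 1_000 / chips


def pod_exaflops(chip_petaflops: float, chips: int) -> float:
    """Pod-level exaFLOPS, assuming linear scaling with chip count."""
    return chip_petaflops * chips / 1_000


chip_pflops = per_chip_petaflops(LARGE_POD_EXAFLOPS, LARGE_POD_CHIPS)
small_pod = pod_exaflops(chip_pflops, SMALL_POD_CHIPS)

print(f"Implied peak per chip: {chip_pflops:.2f} petaFLOPS")
print(f"Implied 256-chip pod:  {small_pod:.2f} exaFLOPS")
```

Under those assumptions, each chip contributes roughly 4.6 petaFLOPS of peak compute, and the 256-chip configuration works out to a little under 1.2 exaFLOPS.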

A4 and A4X VMs: Google Cloud was the first hyperscaler to offer both NVIDIA B200 and GB200 NVL72 GPUs with A4 and A4X VMs. We announced general availability of A4 VMs (NVIDIA B200) last month at NVIDIA GTC, and A4X VMs (NVIDIA GB200) are now in preview. Learn more about A4 and A4X here.

Enhanced networking: To support the ultra-low latency required for AI workloads, our new 400G Cloud Interconnect and Cross-Cloud Interconnect offer up to 4x the bandwidth of our 100G Cloud Interconnect and Cross-Cloud Interconnect, providing connectivity from on-premises or other cloud environments to Google Cloud. Read more in today’s networking blog here.
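As a rough illustration of what 4x the bandwidth means for moving data into Google Cloud, the sketch below compares idealized transfer times. The 10 TB dataset size is a hypothetical value, and the calculation assumes the link sustains its full line rate with no protocol overhead, so real transfers will take longer.

```python
def transfer_seconds(dataset_terabytes: float, link_gbps: float) -> float:
    """Idealized time to move a dataset over a link at full line rate."""
    bits = dataset_terabytes * 1e12 * 8      # decimal terabytes to bits
    return bits / (link_gbps * 1e9)          # gigabits per second to bits per second


DATASET_TB = 10.0  # hypothetical dataset size, for illustration only

t_100g = transfer_seconds(DATASET_TB, 100)
t_400g = transfer_seconds(DATASET_TB, 400)

print(f"100G link: {t_100g:.0f} s")
print(f"400G link: {t_400g:.0f} s")
print(f"Speedup:   {t_100g / t_400g:.1f}x")
```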

Hyperdisk Exapools: Block storage with the highest performance and capacity per AI cluster of any hyperscaler, so you can provision up to exabytes of block storage capacity and many TB/s of throughput in a single, thinly-provisioned pool.

Rapid Storage: A new Cloud Storage zonal bucket that enables you to colocate your primary storage with your TPUs or GPUs for optimal utilization. It provides up to 20x faster random-read data loading than a Cloud Storage regional bucket.

Cloud Storage Anywhere Cache: A new, strongly consistent read cache that works with existing regional buckets to cache data within a selected zone. Anywhere Cache enables responsive and real-time inference interactions by keeping data close to accelerators, reducing latency by 70%.

Open software capabilities for training and inference

The real value of hardware is unlocked by co-designed software. AI Hypercomputer’s software layer helps AI practitioners and engineers move faster with open and popular ML frameworks and libraries such as PyTorch, JAX, vLLM, and Keras. For infrastructure teams, that translates to faster delivery times and more cost-efficient resource utilization. We’ve made significant advances in software for both AI training and inference.

Pathways on Cloud: Pathways, developed by Google DeepMind, is the distributed runtime that powers Google’s internal large-scale training and inference infrastructure, now available for the first time on Google Cloud. For inference, it includes features like disaggregated serving, which allows dynamic scaling of the prefill and decode stages of inference workloads on separate compute units, each independently scaling to deliver ultra-low latency and high throughput. It is available to customers through JetStream, our high-throughput and low-latency inference library. Pathways also enables elastic training, allowing your training workloads to automatically scale down on failure and scale up on recovery while providing continuity. To learn more about Pathways on Cloud, including additional use cases for the Pathways architecture, read the documentation.
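The idea behind disaggregated serving can be sketched with a small capacity-planning example. This is not the Pathways or JetStream API; the function, the per-replica capacities, and the traffic numbers are all hypothetical. It only shows why sizing the prefill and decode stages separately lets each follow its own load.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolPlan:
    prefill_replicas: int
    decode_replicas: int


def plan_pools(
    prompt_tokens_per_s: float,
    output_tokens_per_s: float,
    prefill_capacity: float,   # hypothetical prompt tokens/s one prefill replica handles
    decode_capacity: float,    # hypothetical output tokens/s one decode replica handles
) -> PoolPlan:
    """Size the prefill and decode pools independently from their own load."""
    prefill = max(1, math.ceil(prompt_tokens_per_s / prefill_capacity))
    decode = max(1, math.ceil(output_tokens_per_s / decode_capacity))
    return PoolPlan(prefill, decode)


# A prompt-heavy workload (long documents, short answers) ...
summarization = plan_pools(40_000, 2_000, prefill_capacity=10_000, decode_capacity=1_000)
# ... versus a generation-heavy workload (short prompts, long answers).
chat = plan_pools(4_000, 9_000, prefill_capacity=10_000, decode_capacity=1_000)

print(summarization)
print(chat)
```

In a monolithic deployment both stages would have to scale together; here the prompt-heavy workload grows only its prefill pool and the generation-heavy workload grows only its decode pool.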

Train models with high performance and reliability

Training workloads are highly synchronized jobs that run across thousands of nodes. A single degraded node has the potential to disrupt an entire job, resulting in longer time-to-market and higher costs. To provision a cluster quickly you need VMs tuned for specific model architectures located in close proximity. You also need the ability to predict and troubleshoot node failures quickly and ensure workload continuity in the event of a failure.

Cluster Director for GKE and Cluster Director for Slurm. Cluster Director (formerly Hypercompute Cluster) lets you deploy and manage a group of accelerators as a single unit with physically colocated VMs, targeted workload placement, advanced cluster maintenance controls, and topology-aware scheduling. Today we are announcing new updates for Cluster Director, coming later this year:

  • Cluster Director for Slurm, a fully-managed Slurm offering with simplified UI and APIs to provision and operate Slurm clusters, including blueprints for common workloads with pre-configured software to make deployments reliable and repeatable.

  • 360° observability features including dashboards for visibility over cluster utilization, health, and performance, plus advanced features like AI Health Predictor and Straggler Detection to proactively detect and remediate failures, down to individual nodes. 

  • Job continuity capabilities like end-to-end automated health checks that continuously monitor the fleet and preemptively replace unhealthy nodes. The result is uninterrupted training even in degraded clusters, with multi-tier checkpointing for faster save and retrieval.

Cluster Director for GKE will natively support new Cluster Director features as they become available. Cluster Director for Slurm will be available in the coming months, including support for both GPUs and TPUs. Register for early access.
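To make the straggler idea concrete, here is a minimal sketch of one simple way to flag slow nodes in a synchronized job: compare each node's step time against the fleet median. This is not how Cluster Director's Straggler Detection is implemented; the function name, the threshold, and the node names are assumptions for illustration.

```python
from statistics import median


def find_stragglers(step_times: dict[str, float], slowdown: float = 1.2) -> list[str]:
    """Return nodes whose step time exceeds `slowdown` times the fleet median."""
    if not step_times:
        return []
    baseline = median(step_times.values())
    return sorted(
        node for node, seconds in step_times.items() if seconds > slowdown * baseline
    )


# Hypothetical per-node step times (seconds) for one training step.
observed = {"node-0": 1.00, "node-1": 1.02, "node-2": 0.99, "node-3": 1.45}

print(find_stragglers(observed))
```

Because every node in a synchronized step waits for the slowest one, a single flagged node like this sets the pace for the whole job, which is why detection down to individual nodes matters.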

Run inference workloads efficiently at any scale

AI inference has evolved rapidly over the last year. Longer and highly variable context windows are resulting in more sophisticated interactions; reasoning and multi-step inferencing are shifting the incremental demand for compute, and therefore cost, from training to inference time (test-time scaling). To enable useful AI applications for end-users, you need software that can efficiently serve today’s and tomorrow’s interactions.

Announcing AI inference capabilities in GKE: Inference Gateway and Inference Quickstart. 

  • GKE Inference Gateway offers intelligent scaling and load-balancing capabilities, helping you handle request scheduling and routing with gen AI model-aware scaling and load-balancing techniques.

  • With GKE Inference Quickstart, you can choose an AI model and your desired performance, and GKE will configure the right infrastructure, accelerators, and Kubernetes resources to match. 

Both features are available in preview today. Together, they reduce serving costs by over 30% and tail latency by 60%, and increase throughput by up to 40%, compared to other managed and open-source Kubernetes offerings.
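Model-aware load balancing differs from round-robin in that it looks at the state of each model server before routing a request. The sketch below is one plausible illustration, not the GKE Inference Gateway implementation: the `Replica` fields, the utilization cutoff, and the tie-breaking rule are all assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Replica:
    name: str
    queue_depth: int              # requests waiting on this model server
    kv_cache_utilization: float   # fraction of KV cache in use, 0.0 to 1.0


def pick_replica(replicas: Sequence[Replica], max_kv_utilization: float = 0.9) -> Replica:
    """Prefer replicas with KV-cache headroom, then the shortest queue."""
    if not replicas:
        raise ValueError("no replicas available")
    # Skip replicas that are nearly out of KV cache, unless all of them are.
    with_headroom = [r for r in replicas if r.kv_cache_utilization < max_kv_utilization]
    candidates = with_headroom or list(replicas)
    return min(candidates, key=lambda r: (r.queue_depth, r.kv_cache_utilization))


fleet = [
    Replica("server-a", queue_depth=1, kv_cache_utilization=0.95),
    Replica("server-b", queue_depth=3, kv_cache_utilization=0.40),
    Replica("server-c", queue_depth=3, kv_cache_utilization=0.60),
]

print(pick_replica(fleet).name)
```

A queue-length-only balancer would send the request to `server-a`, which is close to exhausting its KV cache; taking model state into account steers it to a replica that can actually make progress.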

vLLM support for TPUs: vLLM is well known for being a fast and efficient library for inference. Starting today, you can easily run inference on TPUs with vLLM and get their price-performance benefits without changing your software stack, with only a few configuration changes. vLLM is supported in Compute Engine, GKE, Vertex AI, and Dataflow. And with GKE custom compute classes, you can use TPUs and GPUs in tandem within the same vLLM deployment.

Making consumption even more flexible

Dynamic Workload Scheduler (DWS) is a resource management and job scheduling platform that helps you get easy and affordable access to accelerators. Today we’re announcing expanded accelerator support in DWS, including for TPU v5e, Trillium, A3 Ultra (NVIDIA H200), and A4 (NVIDIA B200) VMs in preview via Flex Start mode, with Calendar mode support for TPUs coming later this month. Additionally, Flex Start mode now supports a new provisioning method in which resources can be provisioned immediately and scaled dynamically, making it suitable for long-running inference workloads and a wider range of training workloads. This is in addition to the queued provisioning method of Flex Start mode that requires all nodes to be provisioned simultaneously.
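The difference between the two Flex Start provisioning methods can be sketched as a toy model. This is not the DWS API; the function names and node counts are hypothetical and only capture the all-or-nothing versus start-now-and-scale behavior described above.

```python
def queued_provisioning(requested: int, available: int) -> int:
    """All-or-nothing: nodes are granted only when the full request fits."""
    return requested if available >= requested else 0


def immediate_provisioning(requested: int, available: int) -> int:
    """Start with whatever is available now, up to the request, and scale later."""
    return min(requested, available)


REQUESTED_NODES = 8  # hypothetical request

for free_nodes in (3, 8):
    queued = queued_provisioning(REQUESTED_NODES, free_nodes)
    immediate = immediate_provisioning(REQUESTED_NODES, free_nodes)
    print(f"{free_nodes} free: queued starts with {queued}, immediate starts with {immediate}")
```

The queued method suits tightly synchronized training jobs that cannot run on a partial allocation, while the immediate method suits workloads such as inference that can begin serving on fewer nodes and grow as capacity appears.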

Learn about AI Hypercomputer at Next ’25

Don’t miss the action. Tune in for all of our announcements and deep-dives on the event website. Start with What’s next in compute and AI infrastructure, then check out the following breakouts:


1. arXiv (LMArena), Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference, Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, Ion Stoica, 2024. Accurate as of Mar 19, 2025. This benchmark compares model output quality (as judged by human reviewers) to the price/1M tokens required to generate the output, creating an efficiency comparison. We define ‘intelligence’ as a human’s perception of model output quality.
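The efficiency comparison described in this footnote can be read as a simple ratio of quality to price. The sketch below shows one way to compute it; the exact formula behind the published figures is not given here, so treat this as an assumption, and note that the scores and prices are placeholder values, not benchmark data.

```python
def intelligence_per_dollar(quality_score: float, price_per_million_tokens: float) -> float:
    """Quality score obtained per dollar spent on one million tokens."""
    if price_per_million_tokens <= 0:
        raise ValueError("price must be positive")
    return quality_score / price_per_million_tokens


# Placeholder values for two hypothetical models, NOT real benchmark numbers.
model_a = intelligence_per_dollar(quality_score=100.0, price_per_million_tokens=0.5)
model_b = intelligence_per_dollar(quality_score=100.0, price_per_million_tokens=2.0)

print(f"model_a delivers {model_a / model_b:.1f}x the intelligence per dollar of model_b")
```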