Cloud Computing

Boosting LLM Performance with Tiered KV Cache on Google Kubernetes Engine

Large Language Models (LLMs) are powerful, but their performance can be bottlenecked by the immense NVIDIA GPU memory footprint of the Key-Value (KV) cache. This cache, which speeds up LLM inference by storing the Key (K) and Value (V) matrices computed for previous tokens, directly impacts context length, concurrency, and overall system throughput. Our primary goal is to maximize […]
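The excerpt hinges on how quickly that cache grows with context length and concurrency. As a rough illustration (a minimal sketch; the model shape below is an assumed Llama-style configuration, not one named in the post), the standard sizing formula is 2 × layers × KV heads × head dimension × bytes per element, per token:

```python
# Back-of-the-envelope KV cache sizing: per token, each transformer layer
# stores one K and one V vector per KV head. The model dimensions below are
# illustrative assumptions, not figures from the post.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, dtype_bytes: int = 2) -> int:
    """Total KV cache size in bytes for a dense transformer decoder."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
    return per_token * seq_len * batch_size

# Example: 32 layers, 8 KV heads (grouped-query attention), head_dim of 128,
# fp16 activations, a 32k-token context, and 16 concurrent requests.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=32_768, batch_size=16)
print(f"{size / 2**30:.1f} GiB")  # 64.0 GiB -- more than a single GPU's HBM
```

At these sizes the cache, not the weights, dominates memory, which is why the tiered approach in the title (spilling colder cache blocks to cheaper memory tiers) is what lets a GPU serve longer contexts and higher concurrency.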


Unlock 2x better price-performance with Axion-based N4A VMs, now in preview

Decision makers and builders today face a constant challenge: managing rising cloud costs while delivering the performance their customers demand. As applications evolve to use scale-out microservices and handle ever-growing data volumes, organizations need maximum efficiency from their underlying infrastructure to support their growing general-purpose workloads. To meet this need, we’re excited to announce our […]


Announcing Ironwood TPUs General Availability and new Axion VMs to power the age of inference

Today’s frontier models, including Google’s Gemini, Veo, and Imagen, and Anthropic’s Claude, are trained and served on Tensor Processing Units (TPUs). For many organizations, the focus is shifting from training these models to powering useful, responsive interactions with them. Constantly shifting model architectures, the rise of agentic workflows, and near-exponential growth in demand for compute define this […]


Announcing Axion C4A metal: Arm-based Axion VMs for specialized use cases

Today, we are thrilled to announce C4A metal, our first bare metal instance running on Google Axion processors, available in preview soon. C4A metal is designed for specialized workloads that require direct hardware access and Arm®-native compatibility. Now, organizations running environments such as Android development, automotive simulation, CI/CD pipelines, security workloads, and custom hypervisors can […]


From silicon to softmax: Inside the Ironwood AI stack

As machine learning models continue to scale, a specialized, co-designed hardware and software stack is no longer optional; it’s critical. Ironwood, our latest-generation Tensor Processing Unit (TPU), is the cutting-edge hardware behind advanced models like Gemini and Nano Banana, spanning everything from massive-scale training to high-throughput, low-latency inference. This blog details the core components of Google’s […]
