Openness without compromises for your Apache Iceberg lakehouse

Today, at the Apache Iceberg Summit in San Francisco, we are announcing the preview of read and write interoperability between BigQuery and Iceberg-compatible engines, including Trino, Spark, and others in Apache Iceberg tables in Google-managed Iceberg REST Catalog. With this new capability, you get the benefits of enterprise-grade native storage for your lakehouse without sacrificing Iceberg’s openness and flexibility.

Why it matters: If you’re building a lakehouse today, you’re probably using Apache Iceberg, which has gained massive popularity among data platform teams that need to support multiple compute engines (like Spark and BigQuery) that access the same data for different workloads. However, we consistently hear from customers that achieving openness often requires compromises.

Compared to using enterprise storage, there’s often price-performance overhead on using Iceberg, wiping out the cost benefits of a single-copy architecture. In order to make Iceberg work for all production use cases, data teams have to invest in custom infrastructure to handle real-time streaming, build complex pipelines to replicate operational data, and navigate fragmented governance across different compute engines. Ultimately, these limitations become bottlenecks to innovation.

Over the years, Google has purpose-built storage infrastructure to solve these exact challenges at scale, powered by highly scalable, real-time metadata, unified governance, and deep vertical integration across Cloud Storage, metadata, and various query engines. We are making this infrastructure available directly in Iceberg.

This enables access to BigQuery’s advanced runtime, automatic table management, partitioning, multi-statement transactions, and change data replication for Google-managed Iceberg REST catalog tables. These features will be available in preview for Google-managed Iceberg REST catalog tables and will be generally available (GA) for BigQuery-managed Iceberg tables, coming next month.

Write and read interoperability across engines

Previously, customers building lakehouses chose between Iceberg tables in the Google-managed Iceberg REST catalog or tables managed by BigQuery based on their primary ETL engine. That means that customers relying on Apache Spark for ETL to Iceberg REST Catalog tables couldn’t write through BigQuery or use its storage management features.

With this preview, you can create, update, and query Iceberg tables in the Google serverless Iceberg REST catalog with BigQuery or other Iceberg-compatible engines such as Spark, Flink, Trino and others. This two-way read and write interoperability enables data teams to implement multi-engine use cases on a single table type in a fully open manner, using native Iceberg libraries.

Additionally, Iceberg REST Catalog offers table-level access controls using credential vending for uniform governance across BigQuery, Spark and other compute engines that query or modify your Iceberg tables.

Google Cloud also supports a robust ecosystem of partners integrated with the Iceberg REST Catalog across data platforms and engines, transformation and ingestion services, and governance platforms. We work closely with the Iceberg ecosystem to strengthen these partnerships with many more to come.

Improved price-performance with BigQuery and Spark

Automate table management

Achieving strong query performance on Apache Iceberg tables out of the box can be hard. You need to choose the optimal target file size (which tends to be different for different compute engines), data organization strategy (partitioning and sort-order choices have their tradeoffs), and take care of table management to avoid small files problems and metadata bloat.

Apache Iceberg lakehouse customers can now offload table maintenance — compaction and garbage collection — to Google Cloud BigLake, which optimizes performance for you. In addition to Iceberg tables in BigQuery, it will be available for Google-managed Iceberg REST catalog tables in preview, coming next month. You can opt-in to table management by setting a single property, and improve your BigQuery performance on the industry standard TPC-DS 10T benchmark by ~40%.

Improve BigQuery price-performance with advanced runtime

BigQuery advanced runtime offers a set of performance enhancements designed to automatically accelerate analytical workloads without requiring user action or code changes. In particular, it extends the vectorized query execution enhancements in BigQuery to open table formats. Advanced runtime will be available in preview for Google-managed Iceberg REST catalog tables and in GA for BigQuery-managed Iceberg tables, coming next month. According to an internal TCP-DS 10T benchmark, advanced runtime can help additionally accelerate BigQuery query performance on Iceberg tables, providing 2x faster performance vs. a self-managed approach based on internal benchmarking.

Chart based on benchmarks from internal data and testing.

Accelerate Spark performance with Lightning Engine

Apache Spark is a leading compute engine for Apache Iceberg lakehouses, for use cases ranging from ETL to feature engineering. However, achieving high performance and cost efficiency for Spark workloads at scale can be challenging. Lightning Engine accelerates Apache Spark query performance by over 4 times compared to open source Spark (based on a TPCH-like benchmark).

Optimize table layout with BigQuery partitioning and clustering

Many open-source libraries and engines rely on Iceberg table partitioning for effective data pruning. BigQuery time-based partitioning will be available in preview for Google-managed Iceberg REST catalog tables and will be generally available (GA) for BigQuery-managed Iceberg tables, coming next month. Additionally, when you are creating Iceberg tables in BigQuery, you can define clustering columns to organize data in Parquet files, helping to achieve optimal query performance and avoiding common issues with partitioning such as high-cardinality columns, small partition inefficiencies, and multiple filter columns. For example, one common pattern is to combine time-based table partitioning with clustering on other dimensions that are frequently used for query filtering, such as region, store, etc.

Advanced analytics with Apache Iceberg

Streaming with Apache Iceberg

To operationalize real-time analytics with Iceberg, you can leverage BigQuery’s Vortex streaming infrastructure for high-throughput ingestion with zero-read latency. This removes the need for bespoke infrastructure, addresses small file issues, and lets you query data immediately from the streaming buffer to achieve near-zero read latency. This feature is generally available for BigQuery-managed Iceberg tables and will be available in preview for Google-managed Iceberg REST catalog tables, coming next month.

Replicate data from operational databases to Iceberg tables with Datastream

You can now easily replicate data from a variety of operational datastores, including MySQL, Postgres, SQLserver, Oracle, Salesforce, and MongoDB , into managed Iceberg tables in BigQuery using Datastream integration (GA).

Illustration of Datastream creation to replicate MySQL data to managed Iceberg tables in BigQuery.

Incremental processing with change data capture ingestion to Iceberg tables

The BigQuery storage write API’s change data replication feature lets you stream insert, update, and delete changes from OLTP databases to Iceberg tables in real time, removing the need for complex MERGE-based ETL pipelines. This feature will be available in preview for Google-managed Iceberg REST catalog tables and generally available (GA) for BigQuery-managed Iceberg tables, coming next month.

Illustration of change data capture ingestion to a managed Iceberg table in BigQuery.

Multi-statement transactions

Many analytics workloads require transactions that span multiple tables to commit or roll back changes atomically. This provides consistency across logical groups of tables, synchronizes dimensions and fact tables, and simplifies multi-stage ETLs. You can now leverage BigQuery multi-statement transactions to radically simplify complex multi-table processing with Iceberg. This feature will be available in preview for Google-managed Iceberg REST catalog tables and generally available (GA) for BigQuery-managed Iceberg tables, coming next month.

Illustration of a multi-statement transaction in a managed Iceberg table in BigQuery.

Get started

With bidirectional interoperability across BigQuery and other Iceberg-compatible engines on Google-managed Iceberg REST catalog tables, you can modernize your lakehouse with Apache Iceberg without compromising on performance, governance, or advanced analytics.

Ready to start building today? Learn more about our lakehouse capabilities and explore our quickstart guides.

Openness without compromises for your Apache Iceberg lakehouse

Write and read interoperability across engines

Improved price-performance with BigQuery and Spark

Advanced analytics with Apache Iceberg

Get started

Related Posts