To effectively operate and troubleshoot applications, developers and site reliability engineers (SREs) need to understand the full context of their system’s behavior, typically as part of their logging and observability tooling. Today, we’re excited to announce a variety of new capabilities in our Google Cloud Observability suite:
- Log Analytics is now Observability Analytics.
- Trace data within Observability Analytics is generally available (GA).
- The Observability API for management and configuration is GA.
Together, these capabilities unify logs and traces in a single experience. They help you go from viewing high-level trends to deep, contextual root-cause analysis for agentic as well as traditional workloads, and to configure and manage those workloads programmatically as part of observability buckets.
Further, support for SQL in Cloud Trace is an important new tool in your toolbelt. You can, for instance, write a single SQL query that joins your application logs with your distributed trace spans to find checkout requests that took longer than 5 seconds and see which internal microservice spent the most time processing them. Or, for AI agents, you can analyze telemetry across thousands of runs to identify which tool calls fail most frequently, or calculate the aggregated P95 response time for all external tool executions to pinpoint performance bottlenecks.
In this blog post, let’s take a closer look at Observability Analytics and a few key use cases that combine traces and logs, so you can put these new capabilities to work in your environment right away.
What is Observability Analytics?
Observability Analytics, formerly Log Analytics, brings the power of BigQuery and SQL to your telemetry data directly within Cloud Observability. It allows you to run complex analytical queries that join high-volume log and trace data to identify patterns, troubleshoot issues, and generate insights into the health and performance of your agents and applications, without having to move or duplicate data. This brings a number of important benefits:
- Unified telemetry: Run SQL queries to analyze and JOIN high-volume log and trace data in a single place.
- Business correlation: Join your observability datasets with business-critical data stored in BigQuery (e.g., conversion rates, revenue, operational costs) to quantify the business impact of technical issues.
- In-place analysis: Analyze your data where it’s already stored (in Cloud Logging and Cloud Trace), reducing duplicate export storage costs and complexity.
For instance, with Cloud Observability, you can analyze how application latency impacts conversion rates or identify the financial implications of service outages, transforming raw telemetry into actionable business intelligence.
Unlock deeper insights with traces and logs
Correlating logs and traces in a single analytics view breaks down data silos and accelerates troubleshooting. You can now analyze performance trends from trace data and directly correlate them with corresponding application or infrastructure logs to understand the “why” behind the “what.” Let’s take a couple of examples.
Use case 1: AI agent optimization (analyzing tool failures and latency at scale)
AI agents often perform complex, multi-step tasks by executing various external tools (e.g., database queries, web searches, API calls). When optimizing agents at scale, inspecting individual trace graphs in a UI often isn’t enough. You need to answer systemic questions like “Which tools are failing most frequently?” and “Which ones are causing latency bottlenecks?”
With Observability Analytics, you can run aggregate queries across millions of span events to calculate failure rates and latency percentiles (like P95) for every tool in your system.
Example query: Rank agent tools by failure rate and 95th percentile latency over the last 7 days.
```sql
SELECT
  JSON_VALUE(attributes, '$."agent.tool.name"') AS tool_name,
  COUNT(span_id) AS total_calls,
  -- Calculate failure rate (status.code = 2 represents ERROR in OpenTelemetry)
  SAFE_DIVIDE(COUNTIF(status.code = 2), COUNT(span_id)) * 100 AS failure_rate_percentage,
  -- Calculate P95 latency in milliseconds
  APPROX_QUANTILES(duration_nano / 1000000, 100)[OFFSET(95)] AS p95_latency_ms
FROM
  `YOUR_PROJECT_ID.us._Trace.Spans._AllSpans`
WHERE
  name = 'Agent.executeTool' -- Filter for spans representing tool execution
  AND start_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) AND CURRENT_TIMESTAMP()
GROUP BY
  tool_name
ORDER BY
  failure_rate_percentage DESC, p95_latency_ms DESC
LIMIT 10
```
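To make the two aggregates in the query above concrete, here is a minimal Python sketch that computes the same per-tool failure rate and 95th percentile over a handful of in-memory spans. The `Span` record, the function names, and the sample values are hypothetical and exist only for illustration. The percentile here uses an exact nearest-rank method, whereas `APPROX_QUANTILES` in BigQuery returns an approximation, so numbers on real data may differ slightly.

```python
from collections import defaultdict
from dataclasses import dataclass

ERROR = 2  # OpenTelemetry status code for ERROR, as noted in the query above


@dataclass
class Span:
    """Hypothetical in-memory stand-in for one row of the spans view."""
    tool_name: str
    status_code: int
    duration_nano: int


def p95_nearest_rank(values):
    """Exact 95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based rank
    return ordered[rank - 1]


def rank_tools(spans):
    """Per-tool stats, sorted by failure rate and then P95 latency, both descending."""
    by_tool = defaultdict(list)
    for span in spans:
        by_tool[span.tool_name].append(span)

    rows = []
    for tool, group in by_tool.items():
        failures = sum(1 for s in group if s.status_code == ERROR)
        rows.append({
            "tool_name": tool,
            "total_calls": len(group),
            "failure_rate_percentage": failures / len(group) * 100,
            "p95_latency_ms": p95_nearest_rank(
                [s.duration_nano / 1_000_000 for s in group]
            ),
        })
    rows.sort(
        key=lambda r: (r["failure_rate_percentage"], r["p95_latency_ms"]),
        reverse=True,
    )
    return rows


# Made-up sample data: one flaky tool and one healthy tool.
sample = [
    Span("WebSearchTool", 0, 120_000_000),
    Span("WebSearchTool", 2, 900_000_000),
    Span("DatabaseQueryTool", 0, 300_000_000),
    Span("DatabaseQueryTool", 0, 450_000_000),
]
for row in rank_tools(sample):
    print(row)
```

The sketch is only useful for reasoning about what the SQL returns; at the scale of millions of spans, the aggregation belongs in the query itself.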
With the above query, you can:
- Spot bottlenecks: Instantly see if a tool like DatabaseQueryTool has a P95 latency of 8 seconds, indicating you need to optimize database indexes or connections.
- Identify flaky tools: Discover if a specific API tool has a 15% failure rate, suggesting API rate limits or integration bugs.
- Drill down to the prompt: Once you identify a flaky tool, you can write a follow-up query joining these trace spans with application logs to extract the exact LLM prompt and reasoning that led to the failures. Here’s that SQL query:
```sql
SELECT
  t.name AS tool_name,
  l.timestamp,
  -- Retrieve the agent's thoughts and the prompt from application logs
  JSON_VALUE(l.json_payload.agent_thoughts) AS agent_reasoning,
  JSON_VALUE(l.json_payload.llm_prompt) AS prompt_sent_to_llm
FROM
  `YOUR_PROJECT_ID.us._Trace.Spans._AllSpans` t
JOIN
  `YOUR_PROJECT_ID.us._Default._AllLogs` l
ON
  t.trace_id = SPLIT(l.trace, '/')[SAFE_OFFSET(3)]
  AND t.span_id = l.spanId
WHERE
  t.name = 'Agent.executeTool'
  AND JSON_VALUE(t.attributes, '$."agent.tool.name"') = 'NameOfFlakyTool'
  AND t.status.code = 2 -- Filter for failed tool calls
  AND l.severity = 'ERROR'
```
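The join condition above depends on the shape of the `trace` field in a log entry, which Cloud Logging typically stores as a resource name of the form `projects/PROJECT_ID/traces/TRACE_ID`. Splitting on `/` and taking offset 3 yields the bare trace ID that matches `trace_id` on the span side. Here is a minimal Python sketch of that same extraction; the function name and sample values are illustrative only, and like `SAFE_OFFSET` it returns an empty result (`None`) instead of failing when the field has too few segments.

```python
from typing import Optional


def trace_id_from_log(trace_field: str) -> Optional[str]:
    """Mirror SPLIT(l.trace, '/')[SAFE_OFFSET(3)] from the query above."""
    parts = trace_field.split("/")
    # SAFE_OFFSET yields NULL when the index is out of range; use None here.
    return parts[3] if len(parts) > 3 else None


# Illustrative values, not real identifiers.
print(trace_id_from_log("projects/my-project/traces/0123456789abcdef0123456789abcdef"))
print(trace_id_from_log(""))
```

Log entries written without trace context have an empty `trace` field, so they simply drop out of the inner join.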
Use case 2: Identify latency impact on specific customers (business context)
If you don’t propagate user or customer identifiers in your trace attributes (e.g., for privacy or technical reasons), but you do log them in your application access logs, you can join traces and logs to identify which customers are experiencing the worst performance.
Example query: Find the top 10 customers experiencing the highest 95th percentile latency.
```sql
SELECT
  JSON_VALUE(l.json_payload.customer_id) AS customer_id,
  AVG(t.duration_nano / 1000000) AS avg_latency_ms,
  APPROX_QUANTILES(t.duration_nano / 1000000, 100)[OFFSET(95)] AS p95_latency_ms,
  COUNT(t.span_id) AS total_requests
FROM
  `YOUR_PROJECT_ID.us._Trace.Spans._AllSpans` AS t
JOIN
  `YOUR_PROJECT_ID.us._Default._AllLogs` AS l
ON
  t.trace_id = SPLIT(l.trace, '/')[SAFE_OFFSET(3)]
  AND t.span_id = l.spanId
WHERE
  t.start_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) AND CURRENT_TIMESTAMP()
  AND t.kind.name = 'SPAN_KIND_SERVER'
  AND JSON_VALUE(l.json_payload.customer_id) IS NOT NULL
GROUP BY
  customer_id
ORDER BY
  p95_latency_ms DESC
LIMIT 10
```
You can find more trace query examples in this GitHub repo.
Observability Analytics page vs. log and trace explorers
Cloud Logging and Trace will both continue to offer log and trace explorers — tools that are optimized for finding and inspecting individual log entries and traces, making them ideal for investigating a specific issue.
Observability Analytics, in contrast, is designed for aggregations and in-depth analysis. Think of it as your tool for answering broad questions about your services, such as “What is the 95th percentile latency for my checkout service over the last week?” or “Which API endpoints have the highest error rate after our last deployment?”
Enabling AI agents to query traces and logs using SQL
Finally, with the rapid growth of agentic assistants, you need to be able to access your telemetry programmatically. The Observability API lets you create linked BigQuery datasets for your observability buckets, making the data available to query from the BigQuery ecosystem. Now, your AI agents or analytical workloads can query this data directly via standard BigQuery APIs and tooling.
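As a rough illustration of what querying through standard BigQuery tooling could look like, the Python sketch below builds a P95 latency query and runs it with the `google-cloud-bigquery` client library. Several things here are assumptions rather than documented behavior: the linked dataset name `my_linked_traces`, the view name `_AllSpans` inside that dataset, and the helper function names are all placeholders, and the column names are borrowed from the example queries earlier in this post. Check the names in your own linked dataset before relying on any of it.

```python
import re

# Conservative allow-list for identifiers interpolated into the table path.
_IDENTIFIER = re.compile(r"[A-Za-z0-9_\-]+")


def build_p95_query(project: str, dataset: str, view: str) -> str:
    """Build a P95-latency-by-span-name query against a linked dataset."""
    for part in (project, dataset, view):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"unexpected characters in identifier: {part!r}")
    return (
        "SELECT\n"
        "  name,\n"
        "  APPROX_QUANTILES(duration_nano / 1000000, 100)[OFFSET(95)] AS p95_latency_ms\n"
        f"FROM `{project}.{dataset}.{view}`\n"
        "WHERE start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)\n"
        "GROUP BY name\n"
        "ORDER BY p95_latency_ms DESC\n"
        "LIMIT 10"
    )


def run_query(project: str, sql: str):
    """Run the query with the BigQuery client (requires google-cloud-bigquery)."""
    # Imported lazily so the query builder can be used without the package.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    return [dict(row.items()) for row in client.query(sql).result()]


# Placeholder names; replace them with your project and linked dataset.
sql = build_p95_query("my-project", "my_linked_traces", "_AllSpans")
print(sql)
# rows = run_query("my-project", sql)  # needs credentials and a real dataset
```

Keeping the query builder separate from the client call means an agent can generate and review the SQL before anything is executed against your data.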
Get started today
You can start analyzing your trace data in Observability Analytics today. Simply navigate to the Observability Analytics page in the Google Cloud console to begin exploring it. Make sure you have enabled the Observability API to unlock configuration and management capabilities.


