How Google SRE Is Using Agentic AI to Improve Operations

A Google Cloud Blog piece by Stevan Malesevic and Christopher Heiser details how Google's Site Reliability Engineering (SRE) team is adopting agentic AI, which it calls SRE AI, as a "force multiplier."

Why Now?

Three drivers are pushing SRE toward AI:

  • Microservice complexity — modern systems have exploded in scale and interdependency.
  • Regulatory requirements — compliance demands are growing faster than manual processes can handle.
  • AI code generation — AI tools produce orders of magnitude more code to monitor, debug, and maintain.

Core Principles

Google's approach to SRE AI is grounded in clear principles:

  • Augment existing non-AI automation rather than replacing it.
  • Hold AI agents to the same security and safety bar as human operators.
  • Enforce strong identity and access control for AI agents.
  • Prefer transparency over black-box behavior: AI decisions must be auditable.
  • Maintain business continuity plans for AI failures.
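
To make the identity, access-control, and auditability principles concrete, here is a minimal sketch of an agent that carries its own identity and allow-list, with every authorization decision written to an audit log. It is an illustration only, not Google's implementation: `AgentIdentity`, `AuditLog`, `authorize`, and the action names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentIdentity:
    """Hypothetical agent identity: a name plus the actions it may perform."""
    name: str
    allowed_actions: frozenset


@dataclass
class AuditLog:
    """Hypothetical in-memory audit trail; a real system would persist this."""
    entries: list = field(default_factory=list)

    def record(self, agent_name, action, allowed, reason):
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": agent_name,
            "action": action,
            "allowed": allowed,
            "reason": reason,
        })


def authorize(agent, action, reason, log):
    """Allow the action only if it is on the agent's allow-list.

    Both allowed and denied requests are logged, so every decision
    can be reviewed afterwards.
    """
    allowed = action in agent.allowed_actions
    log.record(agent.name, action, allowed, reason)
    return allowed


log = AuditLog()
agent = AgentIdentity("triage-agent", frozenset({"restart_task"}))
print(authorize(agent, "restart_task", "task is crash-looping", log))   # True
print(authorize(agent, "delete_cluster", "free up capacity", log))      # False
```

Recording the agent's stated reason next to each decision is one simple way to keep its behavior reviewable rather than opaque.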

Infrastructure Stack

The SRE AI system is built on:

  • Gemini models (including custom fine-tuned variants)
  • Vertex AI Agent Platform
  • Agent Development Kit (ADK)
  • MCP servers for tool integration
  • BigQuery AI/ML for data analysis

Application Areas

  1. Reliability Design — AI agents monitor and improve runbooks, automatically generating playbooks from incident data.
  2. Anomaly Detection — The TimesFM model augments static threshold-based alerting with time-series forecasting.
  3. Incident Management (IMAG) — AI handles communication management, handoff documentation, postmortems, and stakeholder communications.
  4. Incident Investigation — Specialized sub-agents work through playbooks, alerting data, and anomaly detection in parallel.
  5. AI Insights System — Embedding models combined with vector databases continuously review past incidents to surface patterns.
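
Item 2 can be illustrated with a small sketch of how a forecast-based check might sit alongside a static threshold instead of replacing it. This does not call TimesFM and is not Google's implementation: a seasonal-naive forecast (the average of past values at the same point in the cycle) stands in for the model, and `check_metric`, its parameters, and the sample data are hypothetical.

```python
from statistics import mean, stdev


def check_metric(history, observed, period, static_limit, k=3.0):
    """Evaluate one new observation against two independent checks.

    history: past values, oldest first, sampled at a fixed interval
    observed: the new value to evaluate
    period: samples per seasonal cycle (for example 24 for hourly data)
    static_limit: the existing fixed alert threshold
    k: width of the forecast band, in standard deviations
    """
    # Past values at the same position in the cycle as the new observation.
    phase = len(history) % period
    same_phase = history[phase::period]
    if len(same_phase) < 2:
        raise ValueError("need at least two past values at this phase")

    expected = mean(same_phase)  # stand-in point forecast (assumption)
    spread = stdev(same_phase)   # how much this phase normally varies

    return {
        "static": observed > static_limit,
        "forecast": abs(observed - expected) > k * spread,
    }


# Three cycles of a metric that is low at the start of each cycle.
history = [10, 50, 90, 50, 12, 52, 88, 48, 11, 49, 91, 51]
print(check_metric(history, 60, period=4, static_limit=100))
# {'static': False, 'forecast': True}
```

In this example a value of 60 is well under the static limit of 100, yet it is far from the roughly 11 expected at that point in the cycle, so only the forecast-based check fires. Keeping both results separate reflects the idea of augmenting, not replacing, the threshold alert.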

A key part of the approach is an autonomy-levels framework that tracks how autonomous the AI systems actually are, giving an honest assessment rather than aspirational labeling.
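
One way to picture the gap between a claimed level and an observed one is the sketch below. The source does not list the actual levels, so the four-step `AutonomyLevel` scale, the `ActionRecord` fields, and the rule of taking the lowest level seen across recorded actions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class AutonomyLevel(IntEnum):
    # Hypothetical scale; the source does not define the real levels.
    SUGGEST_ONLY = 0      # agent proposes, a human carries out the action
    HUMAN_APPROVES = 1    # agent acts only after explicit approval
    HUMAN_NOTIFIED = 2    # agent acts, a human is informed
    FULLY_AUTONOMOUS = 3  # agent acts with no human involvement


@dataclass(frozen=True)
class ActionRecord:
    """Hypothetical record of what happened for one agent action."""
    executed_by_agent: bool
    human_approved: bool
    human_notified: bool


def level_of(record):
    """Classify a single action by how much human involvement it had."""
    if not record.executed_by_agent:
        return AutonomyLevel.SUGGEST_ONLY
    if record.human_approved:
        return AutonomyLevel.HUMAN_APPROVES
    if record.human_notified:
        return AutonomyLevel.HUMAN_NOTIFIED
    return AutonomyLevel.FULLY_AUTONOMOUS


def assess(claimed, records):
    """Compare a claimed level with the lowest level seen in practice."""
    if not records:
        raise ValueError("no evidence to assess")
    observed = min(level_of(r) for r in records)
    return {
        "claimed": claimed.name,
        "observed": observed.name,
        "overstated": claimed > observed,
    }


records = [
    ActionRecord(executed_by_agent=True, human_approved=False, human_notified=True),
    ActionRecord(executed_by_agent=True, human_approved=True, human_notified=True),
]
print(assess(AutonomyLevel.FULLY_AUTONOMOUS, records))
# {'claimed': 'FULLY_AUTONOMOUS', 'observed': 'HUMAN_APPROVES', 'overstated': True}
```

Taking the minimum is a deliberately conservative choice: a system is rated by its least autonomous recorded behavior, which guards against labeling it by its best case.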


Source: Google Cloud Blog — How Google SRE Is Using Agentic AI to Improve Operations