How Google SRE Is Using Agentic AI to Improve Operations

A Google Cloud Blog post by Stevan Malesevic and Christopher Heiser describes how Google's Site Reliability Engineering (SRE) team is adopting agentic AI, an effort it calls SRE AI, as a "force multiplier."

Why Now?

Three drivers are pushing SRE toward AI:

Microservice complexity: modern systems have exploded in scale and interdependency.
Regulatory requirements: compliance demands are growing faster than manual processes can handle.
AI code generation: AI-assisted development produces orders of magnitude more code to monitor, debug, and maintain.

Core Principles

Google's approach to SRE AI is grounded in clear principles:

Augment existing non-AI automation rather than replacing it.
Hold AI agents to the same security and safety bar as human operators.
Enforce strong identity and access control for AI agents.
Prefer transparency over black-box behavior: AI decisions must be auditable.
Maintain business continuity plans for AI failures.
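
To make the access-control, auditability, and continuity principles concrete, here is a minimal Python sketch. It is not Google's implementation: `AuditedAgent`, `with_fallback`, and the tool names are hypothetical names chosen for illustration. The idea is that every tool call is checked against the agent's grant and logged, and that a failed AI step hands over to automation that already exists.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AuditedAgent:
    """Hypothetical agent wrapper: each tool call is permission-checked and logged."""

    identity: str
    allowed_tools: frozenset
    audit_log: list = field(default_factory=list)

    def call(self, tool: str, action: Callable[[], str], reason: str) -> str:
        permitted = tool in self.allowed_tools
        # Log before acting, so denied attempts are auditable too.
        self.audit_log.append(
            {"agent": self.identity, "tool": tool, "reason": reason, "permitted": permitted}
        )
        if not permitted:
            raise PermissionError(f"{self.identity} is not allowed to call {tool}")
        return action()


def with_fallback(ai_step: Callable[[], str], existing_automation: Callable[[], str]) -> str:
    """If the AI path fails for any reason, run the existing non-AI automation."""
    try:
        return ai_step()
    except Exception:
        return existing_automation()


agent = AuditedAgent("sre-agent-1", frozenset({"read_dashboard"}))

# Allowed: a read-only tool within the agent's grant.
status = agent.call("read_dashboard", lambda: "error rate 4%", reason="triage alert")

# Denied: the restart is outside the grant, so the existing automation runs instead.
outcome = with_fallback(
    lambda: agent.call("restart_service", lambda: "restarted", reason="high error rate"),
    lambda: "paged on-call via existing automation",
)
```

Both the permitted and the denied call end up in `agent.audit_log`, which is what lets a reviewer reconstruct what the agent tried to do and why.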

Infrastructure Stack

The SRE AI system is built on:

Gemini models (including custom fine-tuned variants)
Vertex AI Agent Platform
Agent Development Kit (ADK)
Model Context Protocol (MCP) servers for tool integration
BigQuery AI/ML for data analysis

Application Areas

Reliability Design: AI agents monitor and improve runbooks, automatically generating playbooks from incident data.
Anomaly Detection: The TimesFM model augments static threshold-based alerting with time-series forecasting.
Incident Management (IMAG): AI handles incident and stakeholder communications, handoff documentation, and postmortems.
Incident Investigation: Specialized sub-agents work through playbooks, alerting data, and anomaly detection in parallel.
AI Insights System: Embedding models combined with vector databases continuously review past incidents to surface patterns.
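
The difference between static thresholds and forecast-based alerting in the anomaly detection item can be sketched in a few lines of Python. This sketch does not use TimesFM: as a stand-in it uses a seasonal-naive forecast (the value one period earlier), and the function names, the `k` multiplier, and the `min_sigma` floor are all assumptions made for illustration.

```python
import statistics


def static_threshold_alerts(series, threshold):
    """Classic alerting: flag every point above a fixed limit."""
    return [i for i, value in enumerate(series) if value > threshold]


def forecast_band_alerts(series, period, k=3.0, min_sigma=1.0):
    """Flag points that stray more than k residual std devs from a seasonal-naive forecast."""
    alerts = []
    for i in range(2 * period, len(series)):
        # Errors the forecast made on the history seen so far.
        residuals = [series[j] - series[j - period] for j in range(period, i)]
        sigma = max(statistics.pstdev(residuals), min_sigma)
        forecast = series[i - period]  # same point one period earlier
        if abs(series[i] - forecast) > k * sigma:
            alerts.append(i)
    return alerts


# Four cycles of a repeating load pattern, with an unusual value in a normally quiet slot.
series = [10, 50, 90, 50] * 4
series[12] = 60

static_alerts = static_threshold_alerts(series, threshold=100)
forecast_alerts = forecast_band_alerts(series, period=4)
```

Here the value 60 never crosses the fixed limit of 100, so the static check stays silent, while the forecast-based check flags index 12 because the quiet slot normally sits near 10. A seasonal-naive baseline is crude: an anomaly becomes the forecast one period later and can trigger a second alert, which is one reason to prefer a learned forecasting model.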

A key part of the approach is an autonomy-levels framework that tracks how autonomous each AI system actually is, giving an honest assessment rather than an aspirational label.
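
One way to picture "honest assessment" is to derive a system's level from its recorded behavior instead of from its label. The Python sketch below does that under loud assumptions: the three levels, the event fields (`executed_by`, `human_approved`), and the rule of reporting the most conservative level observed are all hypothetical, since the source summary does not enumerate the levels or say how they are measured.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    # Hypothetical levels for illustration; the source does not enumerate them.
    SUGGEST_ONLY = 0       # agent proposes, a human executes
    ACT_WITH_APPROVAL = 1  # agent executes after a human approves
    ACT_UNSUPERVISED = 2   # agent executes with no human in the loop


def level_of(event):
    """Classify one recorded action by how much human involvement it needed."""
    if event["executed_by"] == "human":
        return AutonomyLevel.SUGGEST_ONLY
    if event["human_approved"]:
        return AutonomyLevel.ACT_WITH_APPROVAL
    return AutonomyLevel.ACT_UNSUPERVISED


def observed_level(events):
    """Report the most conservative level seen, so one assisted action lowers the rating."""
    if not events:
        return AutonomyLevel.SUGGEST_ONLY
    return min(level_of(e) for e in events)


claimed = AutonomyLevel.ACT_UNSUPERVISED
events = [
    {"executed_by": "agent", "human_approved": False},
    {"executed_by": "agent", "human_approved": True},
    {"executed_by": "agent", "human_approved": False},
]
observed = observed_level(events)
overstated = observed < claimed
```

In this example the system is labeled as unsupervised, but one of its three actions needed human approval, so the observed level is `ACT_WITH_APPROVAL` and the label is flagged as overstated.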

Source: Google Cloud Blog — How Google SRE Is Using Agentic AI to Improve Operations