How Google SRE Is Using Agentic AI to Improve Operations
A Google Cloud Blog piece by Stevan Malesevic and Christopher Heiser details how Google's Site Reliability Engineering (SRE) team is adopting agentic AI, under the name SRE AI, as a "force multiplier."
Why Now?
Three drivers are pushing SRE toward AI:
- Microservice complexity — modern systems have exploded in scale and interdependency.
- Regulatory requirements — compliance demands are growing faster than manual processes can handle.
- AI code generation — AI tooling produces orders of magnitude more code to monitor, debug, and maintain.
Core Principles
Google's approach to SRE AI is grounded in clear principles:
- Don't replace existing non-AI automation — augment it.
- Meet the same security and safety bar as human operators.
- Enforce strong identity and access control for AI agents.
- Favor transparency over black-box behavior — AI decisions must be auditable.
- Maintain business continuity plans for AI failures.
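To make these principles concrete, here is a minimal Python sketch of how access control, auditability, and a continuity fallback might fit around a single agent action. It is not Google's implementation: `AgentIdentity`, `AuditLog`, `run_action`, and the handler callables are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentIdentity:
    """Hypothetical identity record: who the agent is and what it may do."""
    name: str
    allowed_actions: frozenset


@dataclass
class AuditLog:
    """Append-only record of every decision, so agent behavior stays auditable."""
    entries: list = field(default_factory=list)

    def record(self, agent, action, outcome, reason):
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": agent,
            "action": action,
            "outcome": outcome,
            "reason": reason,
        })


def run_action(identity, action, ai_handler, fallback_handler, log):
    """Run one action for an agent with an allowlist check and a non-AI fallback."""
    # Access control: the agent may only perform actions it was explicitly granted.
    if action not in identity.allowed_actions:
        log.record(identity.name, action, "denied", "action not in allowlist")
        raise PermissionError(f"{identity.name} may not perform {action}")
    try:
        result = ai_handler(action)
        log.record(identity.name, action, "ai_ok", "AI path succeeded")
        return result
    except Exception as exc:
        # Continuity: the existing non-AI automation stays in place as the fallback.
        result = fallback_handler(action)
        log.record(identity.name, action, "fallback", f"AI path failed: {exc}")
        return result
```

In this sketch the AI path augments the scripted path instead of replacing it, and every outcome (allowed, denied, or fallen back) leaves an audit entry that a human can review later.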
Infrastructure Stack
The SRE AI system is built on:
- Gemini models (including custom fine-tuned variants)
- Vertex AI Agent Platform
- Agent Development Kit (ADK)
- MCP servers for tool integration
- BigQuery AI/ML for data analysis
Application Areas
- Reliability Design — AI agents monitor and improve runbooks, automatically generating playbooks from incident data.
- Anomaly Detection — The TimesFM model augments static threshold-based alerting with time-series forecasting.
- Incident Management (IMAG) — AI handles communication management, handoff documentation, postmortems, and stakeholder communications.
- Incident Investigation — Specialized sub-agents work through playbooks, alerting data, and anomaly detection in parallel.
- AI Insights System — Embedding models combined with vector databases continuously review past incidents to surface patterns.
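The Anomaly Detection item contrasts static thresholds with forecast-based alerting. The Python sketch below illustrates that contrast under loud assumptions: it does not call TimesFM, and a seasonal-naive forecast (predict the value from one season ago) stands in for the real model. The function names, the tolerance factor `k`, and the sample data are all hypothetical.

```python
from statistics import stdev


def static_alert(value, threshold):
    """Classic alerting: fire only when the value crosses a fixed line."""
    return value > threshold


def seasonal_naive_forecast(history, season_length):
    """Stand-in forecaster (NOT TimesFM): predict the value seen one season ago."""
    return history[-season_length]


def forecast_alert(history, value, season_length, k=3.0):
    """Fire when the observed value strays too far from the forecast.

    The tolerance is k standard deviations of the forecaster's past errors,
    so it adapts to how predictable the series has been.
    Requires at least season_length + 2 points of history.
    """
    residuals = [
        history[i] - history[i - season_length]
        for i in range(season_length, len(history))
    ]
    expected = seasonal_naive_forecast(history, season_length)
    tolerance = k * stdev(residuals)
    return abs(value - expected) > tolerance


# Hypothetical metric with a repeating 4-step cycle (trough, mid, peak, mid).
history = [10, 50, 90, 50, 12, 49, 91, 51, 11, 50, 89, 50]

# The next point falls on a trough, where about 11 is expected.
observed = 60
print(static_alert(observed, threshold=100))              # below the fixed line
print(forecast_alert(history, observed, season_length=4)) # far from the forecast
```

A value of 60 sits well under the fixed threshold of 100, so the static rule stays silent, yet it is far from the roughly 11 the cycle predicts at that point, so the forecast-based rule fires. That is the kind of gap forecasting is meant to close.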
A key part of the approach is an autonomy-levels framework that tracks how autonomous each AI system actually is, providing an honest assessment rather than aspirational labels.
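One way to picture such a framework is to derive a system's level from what it was observed doing, not from what it is declared to be. This summary does not list the actual levels, so the three levels, the record fields, and the function names in the Python sketch below are hypothetical.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """Hypothetical levels; the real framework's levels are not given here."""
    ASSISTIVE = 0    # AI only suggests; humans execute every action
    SUPERVISED = 1   # AI executes, but some actions needed human approval
    AUTONOMOUS = 2   # AI executes with no per-action human approval


def observed_level(actions):
    """Infer the level from action records.

    Each record is a dict with two booleans:
    'executed_by_ai' and 'human_approved'.
    """
    ai_actions = [a for a in actions if a["executed_by_ai"]]
    if not ai_actions:
        return AutonomyLevel.ASSISTIVE
    if any(a["human_approved"] for a in ai_actions):
        return AutonomyLevel.SUPERVISED
    return AutonomyLevel.AUTONOMOUS


def assess(declared, actions):
    """Return the observed level and whether the declared label overstates it."""
    observed = observed_level(actions)
    return observed, declared > observed
```

For example, a system declared `AUTONOMOUS` whose logs show humans approving some of its actions would be assessed as `SUPERVISED`, with the overstatement flagged.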
Source: Google Cloud Blog — How Google SRE Is Using Agentic AI to Improve Operations