How do you ensure your AI applications are reliable, performant, and cost-effective once they hit production? In this session, Santosh Kumar Perumal breaks down the essential pillars of operational observability using the Google Cloud Operations Suite. As AI systems become more complex, monitoring "the brain" isn't enough; you need a full-stack view of your infrastructure, logs, and latency. Santosh explores how to use native Google Cloud tools to maintain visibility and high availability for AI-driven workloads.

🕒 Timestamps
00:00 - Introduction to Operational Observability
05:40 - Google Cloud Operations Suite: The Integrated Toolset
12:15 - Monitoring: Metrics, Dashboards, and Intelligent Alerting
25:30 - Logging: Best Practices for Cloud Logging & Data Privacy
40:20 - Cloud Trace: Visualizing Request Flows and "Spans"
55:10 - Managing Reliability with SLOs and Error Budgets
1:10:00 - Q&A Session

🧠 Key Topics Covered
Full-Stack Monitoring: Learn how to manage application-level and infrastructure metrics, build custom dashboards, and set up uptime checks to catch issues before your users do.
Native Cloud Logging: Discover why querying logs natively in Cloud Logging (with the Logging query language, LQL) avoids egress costs and keeps sensitive data in place, compared to shipping logs to external tools.
Cloud Trace & Latency: Understand how to follow a request through its "spans" to pinpoint latency bottlenecks in complex AI function calls.
Service Level Objectives (SLOs): A deep dive into using SLOs and "error budgets" to balance the speed of innovation against the necessity of system reliability.

👤 About the Speaker
Santosh Kumar Perumal is a seasoned expert in cloud architecture and operations. He specializes in helping organizations build resilient systems on Google Cloud, focusing on the intersection of DevOps, SRE, and modern AI observability.
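To give a flavor of the native log querying covered in the "Native Cloud Logging" segment, here is a minimal Logging query language filter. This is an illustrative sketch, not a query from the session; the resource type, service name, and timestamp are placeholders for your own workload:

```
resource.type="cloud_run_revision"
resource.labels.service_name="my-ai-service"
severity>=ERROR
timestamp>="2024-01-01T00:00:00Z"
```

Running a filter like this directly in the Logs Explorer keeps the data inside Google Cloud, which is the egress-cost and privacy argument made in the talk.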
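The SLO and error-budget idea from the talk boils down to simple arithmetic: whatever reliability your SLO does not promise is your budget for failures. This Python snippet is an illustrative calculation (not tooling from the session); the 30-day window is a common but assumed choice:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO target.

    The error budget is the complement of the SLO: a 99.9% target
    leaves 0.1% of the window as budget to "spend" on incidents.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

# A 99.9% availability SLO over 30 days leaves about 43.2 minutes of budget.
print(round(error_budget_minutes(0.999), 1))
```

When the budget is mostly unspent, a team can ship faster; when it is exhausted, the SRE convention is to slow releases and focus on reliability, which is the innovation-versus-reliability trade-off the session closes on.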