73% of organizations experience outages from alerts they IGNORED. Not because they lacked monitoring, but because they had too much of it.

Subscribe for weekly platform engineering insights: https://www.youtube.com/@PlatformEngineeringPlaybook
Full show notes: https://platformengineeringplaybook.com/podcasts/00076-alert-fatigue-signal-driven-observability

In this episode, we break down:

THE ALERT FATIGUE PARADOX
- Teams receive 2000+ alerts weekly, only 3% need action
- 27% of alerts are simply ignored
- 83% of software engineers report burnout
- $5,600/minute cost of unplanned downtime

WHAT'S CAUSING THE NOISE
- Static thresholds: set once, never tuned, drift into irrelevance
- Compound rule blind spots: "CPU less than 80 AND memory less than 90" misses real issues
- Alert storms: one root cause triggers 50+ cascading alerts
- The golden signal fallacy: no baseline = arbitrary thresholds

SLO-DRIVEN OBSERVABILITY
- Error budgets over thresholds
- Multi-window, multi-burn-rate alerting patterns
- 14.4x burn rate = page now, 3x burn rate = create ticket
- Align alerting with actual user impact

AIOPS THAT ACTUALLY WORKS
- Anomaly detection: learn what "normal" looks like
- Event correlation: topology-aware alert grouping
- Root cause acceleration: 40% reduction in investigation time
- The "Alert Fatigue 2.0" problem and how to avoid it

PRACTICAL MIGRATION PATH
- Week 1-2: SLO audit of top 5 services
- Week 3-4: Pilot on highest-fatigue service
- Week 5-8: Measure and iterate toward 30-50% actionable rate

Key Statistics:
- 73% outages from ignored alerts (Splunk 2025)
- 48% OpenTelemetry adoption
- 64% AI-driven observability in production
- 80% alert noise reduction achievable

Resources Mentioned:
- Google SRE Workbook: Alerting on SLOs
- Pyrra and Sloth for SLO management
- OpenTelemetry documentation
- Grafana, Datadog, Nobl9 for SLO platforms

#PlatformEngineering #DevOps #SRE #Observability #AlertFatigue #SLO #AIOps #Monitoring #CloudNative #Prometheus #Grafana #OpenTelemetry

See a mistake or have insights to add? This podcast is community-driven - open a PR on GitHub!
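
The burn-rate thresholds listed under SLO-DRIVEN OBSERVABILITY (14.4x = page, 3x = ticket) can be made concrete with a small sketch. Burn rate here is the observed error ratio divided by the error budget (1 minus the SLO target), and an alert fires only when both a long and a short window exceed the threshold. The two thresholds come from the episode notes; the window pairs (1h/5m and 1d/2h), the 99.9% target, and every name in the code (`Window`, `RULES`, `burn_rate`, `evaluate`) are illustrative assumptions, not something the episode specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window (hypothetical type)."""
    errors: int
    requests: int


# (severity, burn-rate threshold, long window, short window)
# Thresholds are from the episode notes; the window pairs are assumptions.
RULES = [
    ("page", 14.4, "1h", "5m"),
    ("ticket", 3.0, "1d", "2h"),
]


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    if window.requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (window.errors / window.requests) / error_budget


def evaluate(windows: dict[str, Window], slo_target: float) -> str:
    """Return the first severity whose long AND short windows both exceed
    the rule's threshold, or "none" if no rule matches."""
    for severity, threshold, long_w, short_w in RULES:
        long_rate = burn_rate(windows[long_w], slo_target)
        short_rate = burn_rate(windows[short_w], slo_target)
        # Requiring the short window too lets the alert clear soon after recovery.
        if long_rate >= threshold and short_rate >= threshold:
            return severity
    return "none"


# Usage: a 2% error ratio against a 99.9% SLO burns budget at roughly 20x.
incident = {
    "1h": Window(errors=200, requests=10_000),
    "5m": Window(errors=20, requests=1_000),
    "1d": Window(errors=300, requests=200_000),
    "2h": Window(errors=210, requests=20_000),
}
severity = evaluate(incident, slo_target=0.999)  # "page"
```

Rules are checked from most to least severe, so a fast burn pages before the slower ticket rule is considered. A production setup would normally express the same logic as recording and alerting rules in the monitoring system rather than in application code.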