Inspiration

As LLM-powered applications move into production, teams are realizing that traditional observability is not enough. Latency spikes, token explosions, silent failures, and safety blocks don't always show up as standard infrastructure issues, yet they directly impact user experience and cost. The inspiration for LLM Health Guardian came from a simple gap: there is no clear, unified way to observe, score, and respond to LLM health in real time. We wanted to treat an LLM system like a first-class production service, with proactive monitoring, meaningful alerts, and automated incident response, not just logs and hope.

What it does

LLM Health Guardian is an end-to-end observability and incident-response system for an LLM application powered by Google Vertex AI / Gemini, built entirely on Datadog. It continuously monitors:

1. Performance (p95 latency, request volume)
2. Reliability (error-rate spikes)
3. Cost behavior (token and cost anomalies)
4. Safety signals (blocked or rejected requests)

All signals are visualized in a single dashboard, and critical detections automatically trigger a Datadog Incident, giving AI engineers immediate context and next steps. At a glance, LLM Health Guardian answers: is my LLM healthy right now, and if not, what should I do?

How we built it

1. Built an LLM API service hosted on Google Cloud Run
2. Used Vertex AI / Gemini as the LLM provider
3. Emitted custom LLM telemetry (latency, errors, tokens, cost, safety blocks)
4. Streamed all metrics and logs into Datadog
5. Designed a single-pane-of-glass dashboard for LLM health

We created focused monitors for:

1. Error-rate spikes
2. Latency regression (p95)
3. Token anomalies
4. Cost anomalies
5. Safety blocks

We then wired a critical monitor directly into Datadog Incident Management, enabling:

1. Automatic incident creation
2. Clear triage instructions for AI engineers

The result is a fully automated Monitor → Detect → Incident → Respond workflow. (Illustrative sketches of the telemetry, monitor, and health-score pieces appear at the end of this write-up.)

Challenges we ran into

1. Designing monitors that stay useful even with low traffic
2. Avoiding alert fatigue while still catching real LLM failures
3. Normalizing very different signals (latency, cost, safety) into one coherent system
4. Ensuring incidents provide actionable context, not just alerts

Accomplishments that we're proud of

1. Built a production-grade LLM observability system, not just a demo
2. Successfully wired monitors directly into incident response
3. Created a composite LLM health score to summarize system health
4. Demonstrated how Datadog can be used beyond infrastructure, for AI reliability

What we learned

1. LLM systems require new observability primitives
2. p95 latency is far more meaningful than averages for LLM UX
3. Cost and tokens behave like reliability signals, not just billing data
4. Automated incident creation dramatically shortens response time
5. Datadog is powerful enough to serve as an AI operations platform, not just monitoring

What's next for LLM Health Guardian

1. Adaptive thresholds based on traffic patterns
2. Multi-model comparison (Gemini vs. other providers)
3. Fine-grained safety category tracking
4. Auto-generated remediation suggestions using LLMs
5. SLO-based LLM health scoring
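Appendix: illustrative sketches

The snippets below are minimal sketches rather than the project's actual code. The first shows one way the custom LLM telemetry could be emitted from the Cloud Run service: wrap each Gemini call, time it, and push latency, error, token, cost, and safety-block metrics to Datadog through DogStatsD (the datadog Python package). The call_gemini helper, metric names, tags, and cost rate are assumptions made for illustration, and a Datadog Agent with DogStatsD is assumed to be reachable locally.

```python
# Minimal sketch, not the project's actual code. Assumptions: a hypothetical
# call_gemini() helper wraps the Vertex AI / Gemini request and returns the
# reply text, token counts, and a safety-block flag; metric names, tags, and
# the per-token cost rate are illustrative; DogStatsD listens on localhost.
import time

from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

TAGS = ["service:llm-health-guardian", "model:gemini"]


def observed_llm_call(prompt: str, call_gemini, cost_per_1k_tokens: float = 0.0005):
    """Call the LLM and emit latency, error, token, cost, and safety metrics."""
    start = time.monotonic()
    try:
        response = call_gemini(prompt)
    except Exception:
        # Failed requests feed the error-rate monitor.
        statsd.increment("llm.requests.errors", tags=TAGS)
        raise
    finally:
        # Latency and request volume are recorded for every attempt.
        statsd.histogram("llm.request.latency_ms", (time.monotonic() - start) * 1000, tags=TAGS)
        statsd.increment("llm.requests.total", tags=TAGS)

    total_tokens = response["input_tokens"] + response["output_tokens"]
    statsd.histogram("llm.tokens.total", total_tokens, tags=TAGS)
    statsd.histogram("llm.cost.usd", total_tokens / 1000 * cost_per_1k_tokens, tags=TAGS)
    if response.get("safety_blocked"):
        # Safety blocks are tracked as their own signal, separate from errors.
        statsd.increment("llm.safety.blocked", tags=TAGS)
    return response
```

The p95 latency, request volume, error rate, token, cost, and safety widgets on the dashboard can then all be driven by these few metrics.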
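The focused monitors can be created in the Datadog UI; the sketch below shows roughly how the error-rate spike monitor could instead be defined in code with the official datadog-api-client package. The query, threshold, tags, and message text are assumptions based on the metric names above, and routing the monitor's notifications into Incident Management is configured separately on the Datadog side.

```python
# Minimal sketch of defining the error-rate spike monitor in code with the
# official datadog-api-client package. Query, threshold, and message are
# assumptions; DD_API_KEY / DD_APP_KEY are read from the environment.
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_type import MonitorType

body = Monitor(
    name="LLM Health Guardian - error rate spike",
    type=MonitorType("metric alert"),
    # Alert when more than 5% of LLM requests errored over the last 5 minutes.
    query=(
        "sum(last_5m):sum:llm.requests.errors{service:llm-health-guardian}.as_count() "
        "/ sum:llm.requests.total{service:llm-health-guardian}.as_count() > 0.05"
    ),
    message=(
        "LLM error rate is above 5%. Check recent deploys, Vertex AI status, "
        "and quota or safety rejections before escalating."
    ),
    tags=["service:llm-health-guardian"],
)

configuration = Configuration()
with ApiClient(configuration) as api_client:
    created = MonitorsApi(api_client).create_monitor(body=body)
    print(created.id)
```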
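Finally, the composite LLM health score mentioned under accomplishments can be thought of as a weighted roll-up of the four signal families. The function below only illustrates that idea; the weights and budgets are assumed values, not the project's actual configuration.

```python
# Illustration of a composite LLM health score: fold the four signal families
# into a single 0-100 number. Weights and budgets are assumed values.
def llm_health_score(
    p95_latency_ms: float,
    error_rate: float,
    cost_per_request_usd: float,
    safety_block_rate: float,
    latency_budget_ms: float = 4000,
    error_budget: float = 0.02,
    cost_budget_usd: float = 0.01,
    safety_budget: float = 0.01,
) -> float:
    def penalty(value: float, budget: float) -> float:
        # 0 while the signal stays within budget, 1 once it reaches 2x the budget.
        return min(max(value - budget, 0.0) / budget, 1.0)

    weights = {"latency": 0.30, "errors": 0.35, "cost": 0.15, "safety": 0.20}
    penalties = {
        "latency": penalty(p95_latency_ms, latency_budget_ms),
        "errors": penalty(error_rate, error_budget),
        "cost": penalty(cost_per_request_usd, cost_budget_usd),
        "safety": penalty(safety_block_rate, safety_budget),
    }
    score = 100.0 * (1.0 - sum(weights[k] * penalties[k] for k in weights))
    return round(max(score, 0.0), 1)


# Example: a slow p95 and a mild cost overrun drag the score down to 88.0.
print(llm_health_score(
    p95_latency_ms=5200, error_rate=0.01,
    cost_per_request_usd=0.012, safety_block_rate=0.0,
))
```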