75% of executives recognize that AI is critical to their business, but few understand how to run it reliably in production. LLM-powered applications are non-uniform and expensive, and they behave differently from traditional microservices. This #InfoQ talk by Sally O’Malley gives senior software developers, architects, and engineering leaders the exact open-source observability stack they need to run business-critical AI workloads with full transparency. Dive into a live demo where we build an end-to-end monitoring solution using vLLM, Llama Stack, Prometheus, Tempo, and Grafana on Kubernetes. Learn which unique signals (cost, performance, quality) you must track for RAG, agentic, and multi-turn applications; a small scraping sketch after the links below shows what those raw signals look like.

⏱️ Video Timestamps (For Navigation)
0:00 - Introduction: Why AI Observability Matters Now
1:15 - Live Demo Preview: RAG with Llama Stack & Safety Features
4:30 - The State of AI in Enterprise: Moving from Research to Business-Critical
6:55 - Unique Monitoring Challenges Posed by LLMs
9:15 - Prefill vs. Decode: The Core Difference in LLM Serving Patterns
12:05 - Building the Open-Source Stack: Prometheus, Grafana, Tempo, and OTel
15:00 - Kubernetes Deep Dive: ServiceMonitors Explained
18:45 - Deploying the Model: Using llm-d for vLLM Quick Start
22:10 - Configuring Tracing with Llama Stack and OTel Sidecars
27:50 - Critical Signals to Monitor: Performance, Cost, and Quality
32:00 - Live Demo: Analyzing GPU Usage, vLLM Dashboards & Traces in Grafana
37:45 - Q&A: Open-Source Cost, Langfuse, and Actionable Analytics for Different Personas

🔗 Transcript available on InfoQ: https://bit.ly/43BuI9r

#Observability #vLLM #LLMs
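
For readers who want a first taste of the metrics side before watching: below is a minimal sketch, assuming a vLLM server running locally with its Prometheus-format metrics endpoint enabled. The URL, port, and the exact vllm:-prefixed metric names are assumptions that can vary by deployment and vLLM version; in the talk's setup, Prometheus scrapes this same endpoint via a Kubernetes ServiceMonitor instead.

import urllib.request

# Assumed local endpoint; vLLM serves Prometheus exposition text at /metrics.
METRICS_URL = "http://localhost:8000/metrics"

# Metric families worth watching for LLM serving (names may vary by version):
# queue depth, KV-cache pressure, and latency signals like time-to-first-token.
WATCH_PREFIXES = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
    "vllm:time_to_first_token_seconds",
)

def scrape_vllm_signals(url: str = METRICS_URL) -> dict[str, str]:
    """Fetch the raw Prometheus exposition text and keep the lines we care about."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = resp.read().decode("utf-8")
    signals = {}
    for line in body.splitlines():
        # Exposition lines look like: name{labels} value
        if line.startswith(WATCH_PREFIXES):
            name, _, value = line.rpartition(" ")
            signals[name] = value
    return signals

if __name__ == "__main__":
    for name, value in scrape_vllm_signals().items():
        print(f"{name} = {value}")

In a real deployment you would let Prometheus scrape and store these series and chart them in the Grafana dashboards shown in the demo; this snippet is only a quick way to see that the signals exist and what their raw form is.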