Running LLM inference on Kubernetes? Your GPU might be saturated, but which model is causing it? In this episode we build a full observability stack for AI inference on Kubernetes:

- NVIDIA DCGM Exporter — GPU utilization, memory, temperature & power per pod
- vLLM / Gateway API Inference Extension — inference-aware metrics: KV cache usage, queue depth, token throughput, TTFT (time to first token)
- OpenTelemetry Collector — scrapes both layers and enriches the metrics with Kubernetes metadata (see the config sketch at the end of this description)
- Dynatrace — correlates GPU pressure with model-level bottlenecks in real time

See exactly how the Endpoint Picker exposes pool-level routing metrics and how to wire it all into your OTel pipeline.

📁 Full tutorial, collector configs & dashboards → GitHub: https://github.com/isItObservable/K8s-LLM

Tags: Kubernetes, OpenTelemetry, LLM, GPU Monitoring, vLLM, Dynatrace, DCGM, Inference Observability, CloudNative, CNCF, IsItObservable
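
For reference, here's a minimal sketch of the Collector wiring described above: a Prometheus receiver scraping the DCGM Exporter and the vLLM pods, the k8sattributes processor adding pod/namespace/node metadata, and an OTLP/HTTP exporter pointed at Dynatrace. The pod labels, ports, and tenant URL are assumptions; the repo linked above has the full collector configs.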
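
```yaml
# Minimal OpenTelemetry Collector sketch for this stack. Assumes the Collector
# runs in-cluster with RBAC to list pods/endpoints; pod labels, ports, and the
# Dynatrace tenant URL are placeholders, so adapt them to your cluster.
receivers:
  prometheus:
    config:
      scrape_configs:
        # GPU layer: NVIDIA DCGM Exporter (serves /metrics on port 9400 by default)
        - job_name: dcgm-exporter
          kubernetes_sd_configs:
            - role: endpoints
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              regex: nvidia-dcgm-exporter   # assumed pod label
              action: keep
        # Inference layer: vLLM servers (expose Prometheus metrics on /metrics)
        - job_name: vllm
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app]
              regex: vllm                   # assumed pod label
              action: keep
        # Routing layer: the Gateway API Inference Extension Endpoint Picker also
        # exposes pool-level metrics; add a third job here pointing at its
        # metrics port once it is deployed.

processors:
  # Enrich every scraped series with Kubernetes metadata so GPU and
  # model-level metrics can be joined per pod/namespace downstream.
  k8sattributes:
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.node.name

exporters:
  otlphttp:
    endpoint: https://YOUR_TENANT.live.dynatrace.com/api/v2/otlp  # assumed tenant URL
    headers:
      Authorization: "Api-Token ${env:DT_API_TOKEN}"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [k8sattributes]
      exporters: [otlphttp]
```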
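
With this in place, one pipeline carries both layers: GPU gauges from DCGM (e.g. DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_POWER_USAGE) alongside vLLM's inference metrics (vllm:num_requests_waiting for queue depth, vllm:gpu_cache_usage_perc for KV cache, vllm:time_to_first_token_seconds for TTFT), which is what lets Dynatrace correlate GPU pressure with model-level bottlenecks per pod.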