At Ray Summit 2025, Sunny Hwang and Raja Jadeja from Google share how teams can overcome the layered challenges of scaling Ray on Kubernetes: reducing operational toil, improving observability, and running on high-performance infrastructure built for large distributed workloads.

They begin with a common pain point: operational overhead. Many platform teams install and update the Ray operator by hand, a fragile and time-consuming process. Sunny and Raja introduce the KubeRay GKE Addon, a fully managed, auto-updating component that takes over this work, so teams can scale Ray without constant operator maintenance.

The speakers then tackle the observability black hole. When a Ray job fails, debugging often becomes guesswork: is the problem in the Ray application, the Kubernetes layer, or the underlying infrastructure? They unveil the new RayJob observability dashboard in Google Cloud Logging & Monitoring, which brings Ray logs, metrics, pod events, and cluster signals into a single pane of glass to speed up root-cause analysis.

Liked this video? Check out other Ray Summit breakout session recordings: https://www.youtube.com/playlist?list=PLzTswPQNepXllnU0C36WtkC0dqkAoDulh

Subscribe to our YouTube channel to stay up to date on the future of AI: https://www.youtube.com/c/anyscale

🔗 Connect with us:
LinkedIn: https://www.linkedin.com/company/joinanyscale/