This video walks you through building a fully self-hosted AI inference platform on Kubernetes, giving your organization the ability to run large language models on infrastructure you control. If you're in healthcare, finance, government, or any other field where data privacy and regulatory compliance matter, sending prompts through third-party APIs may not be an option — this guide shows you the alternative. The video covers why inference (as opposed to training or fine-tuning) is the critical piece for most teams, examines the current landscape of open-weight models, including the rapid rise of Chinese models like Qwen and DeepSeek, and honestly addresses the trade-offs of self-hosting versus using commercial APIs.

From there, the video moves into a hands-on build using Crossplane and Kubernetes with GPU nodes on AWS. You'll see how to define simple custom resources that let any team in your company provision a GPU-enabled cluster and deploy a model — without needing to understand the underlying complexity of EKS node groups, NVIDIA GPU Operators, or vLLM configuration.

By the end, you'll have a working Inference-as-a-Service platform serving an OpenAI-compatible API endpoint, fully contained within your own network. The video also lays out the architecture and sets the stage for future topics like disaggregated inference, KV-cache routing, autoscaling, and multi-cluster patterns.
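As a rough illustration of the "simple custom resource" idea, a Crossplane-style claim for such a platform might look like the sketch below. The API group, kind, and field names are hypothetical, not the exact schema used in the video:

```yaml
# Hypothetical Crossplane claim — group, kind, and fields are illustrative only.
apiVersion: example.devopstoolkit.live/v1alpha1
kind: InferenceService
metadata:
  name: team-a-llm
spec:
  model: qwen2.5-7b-instruct   # open-weight model to serve
  gpu:
    type: nvidia-l4            # GPU type for the EKS node group
    count: 1                   # number of GPU nodes
```

Behind a claim like this, a Crossplane Composition would do the heavy lifting: provisioning the EKS node group, installing the NVIDIA GPU Operator, and deploying vLLM to expose an OpenAI-compatible endpoint — so the team applying the resource never touches any of that directly.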
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
Sponsor: Kilo Code 🔗 https://kilo.ai
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

#SelfHostedAI #KubernetesInference #InferenceAsAService

Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join

▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬
➡ Transcript and commands: https://devopstoolkit.live/kubernetes/building-inference-as-a-service-on-kubernetes
🔗 Crossplane: https://crossplane.io
🎬 Why Self-Hosting AI Models Is a Bad Idea: https://youtu.be/pWtDTkfNaUU

▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬
If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬
➡ BlueSky: https://vfarcic.bsky.social
➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬
🎤 Podcast: https://www.devopsparadox.com/
💬 Live streams: https://www.youtube.com/c/DevOpsParadox

▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬
00:00 AI Inference (Self-Managed)
01:24 Kilo Code (sponsor)
02:53 Self-Hosted AI Inference Explained
13:43 GPU Kubernetes Cluster Setup
15:35 Deploy and Serve LLMs
18:48 Inference Platform Architecture