A production cluster's etcd disk has filled. The kube-apiserver is returning errors, and the on-call engineer has fifteen minutes before the cluster goes fully read-only. This episode walks through the high-stakes etcd recovery procedure: the storage quota and alarm state, auto-compaction configuration, taking a snapshot before any intervention, the manual compact and defrag sequence, safe disk expansion, and the multi-master rolling restart that preserves quorum. Every step shown here must be validated in a non-production environment before use on a live cluster. Watch the next video for the Kubernetes upgrade rollback procedure.

▶ Watch next: Kubernetes Upgrade Rollback: Recovering When the Upgrade Fails Midway
https://www.youtube.com/watch?v=TGl6Sn8Kgaw

Chapters:
0:00 The etcd Storage Quota and the Alarm State
2:05 Auto-Compaction: Periodic vs Revision Mode
4:08 etcdctl snapshot save: The Pre-Recovery Insurance
6:15 etcdctl compact and defrag: The Manual Recovery
8:19 Disk Expansion on the Underlying Volume
10:26 Multi-Master Rolling Restart and Quorum Maintenance
12:26 Quiz Time

#Kubernetes #K8s #DevOps

---
Disclosure

The avatars and voices in this video are AI-generated. All content -- research, scripts, lesson design, and the custom video engine -- is created by a CISSP, CISM, and PMP certified professional with a Master's in Project Management, a B.S. in Information Technology, and a Doctorate in Business Administration in progress. This channel exists to make learning accessible and straightforward. This channel is not affiliated with, endorsed by, or sponsored by the Cloud Native Computing Foundation (CNCF), The Linux Foundation, Red Hat, SUSE, Mirantis, AWS, Google Cloud, Microsoft Azure, DigitalOcean, or any Kubernetes distribution vendor.
All Kubernetes mechanics, kubectl commands, controller behaviors, CNI plugin specifics, and reproduction steps are sourced from the upstream Kubernetes documentation at kubernetes.io, the official kubectl reference, CNCF working-group output, named-outlet reporting, and engineering blog posts from production operators, and are provided for educational purposes only. Production cluster behavior varies significantly across versions, distributions, network plugins, storage drivers, cloud providers, and workload patterns. Commands shown in any episode that mutate cluster state — kubectl delete, kubectl apply, kubectl drain, helm upgrade, etcd snapshots, node-level systemctl restarts, iptables rules — should never be run directly against production from a tutorial. Always reproduce in a non-production environment, capture diffs, peer-review the change, and follow your organization's change-management process. Upstream Kubernetes documentation: kubernetes.io/docs | kubectl reference: kubernetes.io/docs/reference/kubectl | CNCF: cncf.io | Kubernetes Slack for community help: slack.k8s.io.