Why does a forty-year-old protocol keep taking down billion-dollar infrastructure? The October 2025 AWS outage lasted fifteen hours because of a DNS race condition. Kubernetes defaults create 5x query amplification. We investigate how DNS really works in modern platforms (CoreDNS plugin chains, the ndots:5 trap, GSLB failover) and deliver the five-layer defensive playbook to prevent your platform from becoming the next postmortem.

🔗 Full episode page: https://platformengineeringplaybook.com/podcasts/00023-dns-platform-engineering

📝 See a mistake or have insights to add? This podcast is community-driven; open a PR on GitHub!

Summary:
• CoreDNS plugin-based architecture: a middleware → backend chain in which the kubernetes plugin watches the API server and generates answers on the fly for cluster.local, while the forward plugin handles external queries (example Corefile after the summary)
• ndots:5 trap creates 5x DNS query amplification: a lookup for api.stripe.com tries four search domains before the absolute query; fix it by lowering ndots to 1, using FQDNs with a trailing dot, and adding app-level caching (see the dnsConfig sketch after the summary)
• AWS October 19-20, 2025 outage: two DNS Enactors racing in DynamoDB's DNS automation, a cleanup that deleted all IPs for the regional endpoint, and 15+ hours of cascading failures (DynamoDB → dependent services → Slack/Atlassian/Snapchat)
• Five-layer defensive playbook: (1) optimize: fix ndots, tune the CoreDNS cache to 10K records/30s, alert when lookup latency exceeds 100 ms; (2) failover: GSLB with health checks, TTLs of 60-300s for backends; (3) security: DNSSEC plus DoH with internal resolvers; (4) monitoring: track p95 latency, error rates by type, and top requesters (sample alert rules after the summary); (5) testing: DNS failure game days, kill CoreDNS pods, inject latency, model failover scenarios
• TTL balancing trade-off: low TTLs (60-300s) enable fast failover but increase query load; high TTLs (3600-86400s) improve performance but delay failover; there is no perfect answer, it depends on your SLO
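To make the ndots fix concrete, here is a minimal sketch of a Pod template using the standard Kubernetes dnsConfig field; the Deployment name and image are placeholders, not anything from the episode:

```yaml
# Hypothetical Deployment snippet: lowering ndots so dotted external names
# like api.stripe.com are resolved absolutely instead of walking the search path.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api            # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "1"          # default is 5; 1 stops search-domain expansion for names with a dot
      containers:
        - name: app
          image: payments-api:latest   # placeholder image
```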
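On the CoreDNS side, a sketch of the plugin chain and cache tuning discussed above, written as the kube-system ConfigMap most clusters ship; the cache capacity and TTL mirror the 10K records/30s numbers from the episode and are starting points, not universal defaults:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        # kubernetes plugin answers cluster.local from its API-server watch
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        # everything else goes to the upstream resolvers
        forward . /etc/resolv.conf
        # cache answers for up to 30s, ~10K successful entries
        cache 30 {
            success 10000 30
            denial 5000 30
        }
        loop
        reload
        loadbalance
    }
```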
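For the monitoring layer, a hedged sketch of alert rules for p95 latency and error rate. It assumes the Prometheus Operator CRDs are installed and CoreDNS's prometheus plugin is enabled; metric names follow recent CoreDNS releases and thresholds are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: coredns-slo            # placeholder name
  namespace: kube-system
spec:
  groups:
    - name: coredns
      rules:
        - alert: CoreDNSSlowLookups
          # p95 lookup latency over 5 minutes, warning threshold at 100 ms
          expr: |
            histogram_quantile(0.95,
              sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le)) > 0.1
          for: 10m
          labels:
            severity: warning
        - alert: CoreDNSServfailRate
          # share of SERVFAIL responses; tune the 1% threshold to your SLO
          expr: |
            sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
              / sum(rate(coredns_dns_responses_total[5m])) > 0.01
          for: 10m
          labels:
            severity: warning
```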