CloudNative Wisdom – AWS US-EAST-1 Outage Explained
With Richard Simon (CloudTherapist) & Saim: a short forensic recap and lessons for architects, platform engineers, and SREs.

In this episode, we rewind to October 20, 2025 and unpack the major AWS us-east-1 outage: what failed, why DynamoDB (and DNS) were central to the cascade, how AWS responded, and the practical resilience trade-offs every engineering leader should consider. We dig past the headlines and give actionable guidance for teams hit by cloud outages. Two short code sketches at the end of these notes illustrate the retry feedback loop and the enactor race condition.

Chapters / Timecodes
0:00 Intro
0:31 Why DynamoDB mattered: service dependencies & ripple effects
1:02 DNS resolution failure & the retry feedback loop
2:02 Richard Simon explains the "enactors" and the race condition
3:58 How a second enactor overwrote partial DNS state – the cascade begins
5:28 AWS response, mitigations, and proposed fixes to the automation
7:44 Community reactions
10:19 Multi-region HA vs. cost, and practical options
13:34 Expectations for AWS re:Invent: what to watch for
14:45 Closing

Links & Sources
- Official AWS post-event summary (US-EAST-1, Oct 20, 2025): https://aws.amazon.com/message/101925/
- Reuters summary & timeline: https://www.reuters.com/business/retail-consumer/amazons-cloud-unit-reports-outage-several-websites-down-2025-10-20/?utm_source=chatgpt.com
- The Verge coverage of the AWS post-mortem: https://www.theverge.com/news/805904/amazon-breaks-down-the-dynamodb-dns-problem-that-took-down-aws-on-monday?utm_source=chatgpt.com
- ThousandEyes technical analysis: https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025

If this episode helped you think differently about resilience, please like, subscribe, and leave a comment with your outage story; Richard and I may reach out and feature it on a future show. Want to collaborate or share an incident? Drop a link in the comments or DM us.
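Bonus sketch 1: the retry feedback loop (1:02). A minimal, hypothetical illustration of the mitigation pattern we discuss, capped retries with jittered exponential backoff; this is not AWS or SDK code, and the function and parameter names are ours.

```python
# Illustrative only: a generic client-side retry helper.
# Tight, unbounded retries amplify an outage; capping attempts and adding
# jittered backoff spreads load so a recovering dependency is not re-saturated.
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.2, max_delay=10.0):
    """Run `operation`, retrying transient failures with capped, jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:  # stand-in for a transient failure such as a DNS error
            if attempt == max_attempts:
                raise  # give up instead of retrying forever and feeding the storm
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))  # "full jitter" sleep


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Hypothetical call that fails twice, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("name resolution failed")
        return "ok"

    print(call_with_backoff(flaky))  # prints "ok" after two backed-off retries
```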
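Bonus sketch 2: the enactor race (2:02–3:58). A toy model, not AWS's actual system: two automation workers ("enactors") apply DNS plans for the same endpoint, and without a freshness check a delayed worker can overwrite newer state with a stale plan. The class and field names are ours.

```python
# Toy illustration of last-writer-wins vs. a monotonic version guard.
class DnsPlan:
    def __init__(self, version, records):
        self.version = version    # monotonically increasing plan generation
        self.records = records    # addresses the endpoint should resolve to


class DnsStore:
    """Shared DNS state that competing enactors write to."""

    def __init__(self):
        self.applied = None

    def apply_unsafe(self, plan):
        # Last writer wins: a slow enactor holding an old plan clobbers newer state.
        self.applied = plan

    def apply_guarded(self, plan):
        # Reject any plan that is not strictly newer than what is already applied.
        if self.applied is not None and plan.version <= self.applied.version:
            return False
        self.applied = plan
        return True


if __name__ == "__main__":
    newer = DnsPlan(version=2, records=["10.0.0.2"])
    stale = DnsPlan(version=1, records=["10.0.0.1"])

    store = DnsStore()
    store.apply_unsafe(newer)
    store.apply_unsafe(stale)                      # delayed enactor wins the race
    print("unsafe:", store.applied.version)        # -> 1 (stale plan applied)

    store = DnsStore()
    store.apply_guarded(newer)
    print("stale accepted?", store.apply_guarded(stale))  # -> False
    print("guarded:", store.applied.version)              # -> 2 (newer plan preserved)
```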