In this video, I build a real-world AI-powered cloud monitoring system from scratch — an AI agent automatically diagnoses infrastructure failures and fires a Slack alert before you even open your laptop. No fake demos: the app is intentionally broken, and the AI catches it.

Check out Kestra → https://kestra.io/

──────────────────────────────────
WHAT YOU WILL LEARN
──────────────────────────────────
• How to provision multi-cloud infrastructure (AWS + GCP) with Terraform modules
• How to deploy a Flask app on an AWS EC2 instance and a GCP VM as a systemd service
• How to inject intentional chaos into the app to simulate real production failures
• How to set up Kestra as a workflow orchestrator for scheduled monitoring
• How to use the Kestra AI Agent task powered by Google Gemini 2.5 Flash
• How the Kestra errors block triggers tasks only on failure
• How the AI sits between failure detection and Slack alerting
• How to stream live app logs to AWS CloudWatch with the CloudWatch Agent
• How to read app logs over HTTP without SSH via a Flask /logs endpoint
• How to send AI-generated incident diagnoses directly to a Slack channel

──────────────────────────────────
TECH STACK
──────────────────────────────────
• Terraform — multi-cloud infrastructure as code (AWS + GCP providers)
• AWS EC2 (t2.micro, Ubuntu 22.04) — primary app host
• GCP VM (e2-micro) — secondary cloud node
• Flask — Python web app with chaos engineering built in
• Kestra — open-source workflow orchestrator (self-hosted via Docker)
• Google Gemini 2.5 Flash — AI model for root-cause analysis
• AWS CloudWatch Agent — log streaming from EC2
• Slack Incoming Webhooks — real-time incident alerts

──────────────────────────────────
HOW THE SYSTEM WORKS
──────────────────────────────────
• Terraform provisions VMs on both AWS and GCP in a single apply
• A Flask app called CrashLab API runs on the EC2 instance — it intentionally throws HTTP 500 errors 30% of the time and simulates slow responses to generate real noise
• Kestra runs on Docker and schedules two flows every hour
• The health-check flow hits /health on the Flask app — if it returns a non-200 status, Kestra jumps straight to the errors block
• The AI Agent task (Gemini 2.5 Flash) receives the failure context and responds with a single-line diagnosis: cause + immediate fix
• The Slack alert task fires with the AI diagnosis embedded in the message
• A second flow generates live traffic and reads logs directly over HTTP — no SSH needed
• The CloudWatch Agent streams /app/logs/app.log to AWS CloudWatch for log retention

──────────────────────────────────
PROJECT STRUCTURE
──────────────────────────────────
• terraform/modules/aws/ — VPC, subnet, security group, EC2
• terraform/modules/gcp/ — VPC network, firewall rules, router, NAT, VM
• terraform/scripts/ — startup script templates (injected via templatefile)
• terraform/app/ — Flask app source and requirements
• kestra/ — Kestra flow YAML files

──────────────────────────────────
SOURCE CODE
──────────────────────────────────
GitHub → https://github.com/rahulwagh/workflow-orchestrator

──────────────────────────────────
WHO THIS IS FOR
──────────────────────────────────
• DevOps and platform engineers exploring AI-assisted monitoring
• Cloud engineers learning Terraform multi-cloud patterns
• Developers curious about Kestra as an alternative to Airflow or Prefect
• Anyone building production-grade alerting pipelines with AI in the loop
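BONUS: a minimal sketch of the chaos logic described above — an app that returns HTTP 500 roughly 30% of the time and occasionally stalls. This is not the actual CrashLab source; the function name `handle_health`, the `SLOW_RATE`, and the delay value are illustrative assumptions (only the 30% failure rate comes from the video), and the Flask wiring is omitted so the core idea stands alone:

```python
import random
import time

FAILURE_RATE = 0.30   # from the video: 500s thrown 30% of the time
SLOW_RATE = 0.10      # assumed fraction of artificially slow responses
SLOW_DELAY_S = 2.0    # assumed stall duration, seconds

def handle_health(rng: random.Random, sleep=time.sleep):
    """Simulate a chaos-injected /health check.

    Returns (status_code, body). One random roll decides the outcome:
    fail outright, respond slowly, or respond normally.
    """
    roll = rng.random()
    if roll < FAILURE_RATE:
        return 500, "simulated backend failure"        # chaos: hard error
    if roll < FAILURE_RATE + SLOW_RATE:
        sleep(SLOW_DELAY_S)                            # chaos: slow response
    return 200, "ok"
```

In a real Flask route you would return these values from the view function; the injectable `rng` and `sleep` parameters just make the chaos deterministic and fast under test.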
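And a sketch of the final step — embedding the AI's one-line "cause + immediate fix" diagnosis into a Slack Incoming Webhook message. The function names (`build_alert`, `send_alert`) and message layout are my assumptions, not the flow's actual task code; only the payload shape (`{"text": ...}` posted as JSON) follows Slack's Incoming Webhooks contract:

```python
import json
import urllib.request

def build_alert(diagnosis: str, service: str = "CrashLab API") -> dict:
    # Slack Incoming Webhooks accept a JSON body with a "text" field;
    # the diagnosis string is whatever the AI Agent task produced.
    return {
        "text": (
            f":rotating_light: {service} health check failed\n"
            f"AI diagnosis: {diagnosis}"
        )
    }

def send_alert(webhook_url: str, diagnosis: str) -> int:
    """POST the alert to a Slack webhook URL; returns the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_alert(diagnosis)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # Slack replies 200 on success
        return resp.status
```

In the Kestra flow this would live behind the errors block, so it only fires when the health check has already failed and the AI diagnosis is available.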