Izzy Miller is an AI engineer at Hex, an AI analytics platform that was one of the first companies to ship data agents to real paying users. Today, Hex runs a multi-agent system with nearly 100K tokens of tools, and Izzy is building a 90-day simulation to evaluate whether those agents actually get smarter over time. In this conversation, he walks through the harness decisions that shaped their architecture, the failure modes Hex is seeing at scale, and what it takes to build an eval that no current model can pass.

We also discuss:
• Why data agents are harder to verify than coding agents
• Under the hood of Hex's agents
• How Hex is unifying separate agents
• Why most eval sets are bad
• The 90-day simulation for long-horizon evals
• How Izzy went from marketing to AI engineer

References:
• Andon Labs: https://andonlabs.com/
• Anthropic: https://www.anthropic.com/
• Barry McCardel: linkedin.com/in/barrymccardel
• ChatGPT: http://chatgpt.com
• Claude Code: https://code.claude.com/docs/en/overview
• Claude Sonnet 4.6: https://www.anthropic.com/news/claude-sonnet-4-6
• DBT: https://www.getdbt.com/
• GPT-3.5 Turbo: https://developers.openai.com/api/docs/models/gpt-3.5-turbo
• GPT-5.3 Codex Spark: https://openai.com/index/introducing-gpt-5-3-codex-spark/
• GPT-5.4: https://openai.com/index/introducing-gpt-5-4/
• Hex: https://hex.tech/
• LangChain: https://www.langchain.com/
• LangSmith: https://www.smith.langchain.com/
• Looker: https://lookerstudio.google.com/
• OpenAI: https://openai.com/
• Opus 4.6: https://www.anthropic.com/news/claude-opus-4-6
• Satya Nadella: https://www.linkedin.com/in/satyanadella
• Snowflake: https://www.snowflake.com/en/
• Vending Machine: https://andonlabs.com/vending

Where to find Izzy:
• LinkedIn: https://www.linkedin.com/in/izzy-miller/
• Twitter/X: https://x.com/isidoremiller

Where to find Harrison:
• LinkedIn: https://www.linkedin.com/in/harrison-chase-961287118/
• Twitter/X: https://x.com/hwchase17

Where to find LangChain:
• Website: http://langchain.com
• Docs: https://docs.langchain.com/

Send feedback or questions to maxagency@langchain.dev

Timestamps:
01:35 Where Hex's notebook agent started
03:46 The moment Hex knew it was time for agents
07:36 Why data agents are harder to verify than coding agents
09:30 How Hex is unifying separate agents
13:28 Under the hood of the notebook agent
15:41 The harness features that are now holding the agent back
17:41 Why Hex built their own orchestrator
18:59 Managing nearly 100K tokens of tools
20:49 Ephemeral queries and agent behavior trade-offs
24:46 The UX problem with showing agents' thinking
27:28 Why verification is harder than transparency for data agents
31:00 Memory, context conflicts, and collapse modes
34:38 How Hex built their internal eval system
39:29 Why most eval sets are bad
44:30 The 900% quota eval that every model fails
46:55 Model upgrades and the "in distribution" debate
51:34 How Izzy went from marketer to AI engineer
59:59 The 90-day simulation for long-horizon evals