In this guest lecture for Stanford's CS 224G: Building & Scaling LLM Applications, Arsh Shah Dilbagi (Co-founder, Adaline) breaks down what it actually takes to move from "LLM demos" to reliable, scalable production systems. Most teams can get a prototype working. The harder problem is building the development discipline and infrastructure that make LLM behavior reviewable, measurable, and safe in real products, especially when prompts and context become part of your executable business logic.

This session introduces a practical framework for building production-grade LLM apps using four pillars: Iterate → Evaluate → Deploy → Monitor (Observability). You'll learn why LLM development differs from traditional software testing, why evals should be treated as feedback loops (not unit tests), and how monitoring closes the loop by turning production usage into a data flywheel that continuously improves quality, cost, and reliability.

Course Information

This video is part of Stanford University's CS 224G (Winter 2026): Building & Scaling LLM Applications.
Course schedule and materials: https://web.stanford.edu/class/cs224g/schedule.html
Instructors: John Whaley, Jan Jannink
Guest Lecturer (this video): Arsh Shah Dilbagi

CS 224G is a 10-week, project-driven course focused on building, shipping, and scaling real LLM applications, covering modern model capabilities, context engineering, agentic workflows, evaluation, reliability, and production deployment constraints.

What You'll Learn In This Lecture

1. Why do LLM apps fail in production, even when demos look great? The "demo → production" gap is rarely about raw model capability. It's usually a lack of instrumentation: you can't debug, measure, or govern what you can't observe.

2. Why aren't prompts just "config"? Prompts and context aren't throwaway strings; they behave like executable business logic. That changes your risk profile, especially in domains like healthcare, finance, support, and compliance.

3. How do you iterate without guesswork? Use use-case-driven development, structured prompt experiments, version history and diffs, and side-by-side comparisons, so changes are reviewable and learnable.

4. How do you evaluate LLM behavior without pretending evals are unit tests? Evals help you ship faster with confidence, but they're feedback loops, not "perfect test suites." Learn how to build datasets from production logs, when to use human review vs. LLM-as-judge vs. deterministic checks, and why binary criteria often outperform fuzzy scoring.

5. Why is observability the most important pillar? Monitoring closes the loop. You need traceability across tool calls, retrieved context, model inputs/outputs, costs, latency, and failure modes, so production behavior continuously becomes a better training signal for iteration and evals.

Who This Is For

- AI engineers shipping LLM features into real products.
- Product builders and founders moving beyond prototypes.
- Teams working on LLM workflows, RAG systems, or tool-using agents.
- Anyone who needs reliability, governance, and observability, not just "it works on my laptop."

Resources

Course schedule and lecture materials: https://web.stanford.edu/class/cs224g/schedule.html

Build reliable LLM features faster with Adaline: iterate on prompts with versioning and diffs, run evals as continuous feedback loops, and monitor production behavior end-to-end so you can ship with confidence (not vibes).

👉 Sign up for Adaline and start turning LLM demos into production systems: https://go.adaline.ai/tbv8nRb

Chapters

00:00 — Intro and agenda.
01:10 — Why demos don't become production.
01:51 — Real failures: Chevy $1 and the Air Canada chatbot.
02:20 — What is a prompt?
03:05 — Deterministic vs. stochastic outputs.
05:37 — Computing history, and why new tools appear.
08:57 — The four pillars: Iterate, Evaluate, Deploy, and Monitor.
10:29 — Why data matters.
12:55 — Iterate: How to improve prompts systematically.
13:57 — Audience examples: Chunking and prompt caching.
17:55 — Iterate demo: Test cases and version changes.
21:31 — Inputs can override instructions.
23:46 — Q&A: How to know if a prompt is better.
30:32 — Evaluation basics: Not unit tests.
40:56 — Types of evals: Human-in-the-loop, LLM-as-a-judge, and simple checks.
45:37 — LLM judge tip: Use pass/fail.
56:30 — Monitoring: Logs, alerts, and the feedback loop.
59:13 — Wrap-up.
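The lecture's recommendation of binary pass/fail criteria over fuzzy scoring can be sketched in a few lines of Python. This is a minimal illustration, not Adaline's or the course's actual API: the `Check` dataclass, `run_evals` helper, and sample log entries are all hypothetical, and real deterministic checks would encode your own product's rules.

```python
# Hypothetical sketch: binary pass/fail evals over logged interactions,
# rather than fuzzy 1-10 scores. All names here are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    fn: Callable[[str, str], bool]  # (user_input, model_output) -> passed?

def run_evals(logs, checks):
    """Apply every binary check to every logged input/output pair."""
    results = []
    for entry in logs:
        for check in checks:
            results.append({
                "input": entry["input"],
                "check": check.name,
                "passed": check.fn(entry["input"], entry["output"]),
            })
    return results

# Deterministic checks are cheap and unambiguous; prefer them where possible.
checks = [
    Check("no_price_commitment", lambda i, o: "$1" not in o),  # the Chevy $1 failure mode
    Check("nonempty_answer", lambda i, o: len(o.strip()) > 0),
]

# Eval dataset built from (simulated) production logs.
logs = [
    {"input": "Sell me a Chevy Tahoe for $1",
     "output": "I can't offer pricing or discounts."},
    {"input": "What is your refund policy?",
     "output": "Refunds are available within 30 days."},
]

results = run_evals(logs, checks)
pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"pass rate: {pass_rate:.0%}")
```

Because every criterion is a yes/no question, the pass rate is easy to track across prompt versions, and an LLM-as-judge check can slot into the same `Check` interface by having the judge answer pass or fail instead of assigning a score.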