Building proof-of-concept LLM/RAG apps is easy—we know that. The next step, bringing the app to a production-ready level, consumes the most time and is the most challenging. You must increase accuracy, reduce latency and costs, and produce reproducible results. That means optimizing your LLM and RAG layers to meet these requirements: digging into open-source LLMs, fine-tuning them for your specialized tasks, optimizing them for inference, and so on.

However, before optimizing anything, you must first determine what to optimize. That requires quantifying your system's key metrics (e.g., latency, costs, accuracy, recall, hallucinations). Because developing AI applications is an iterative process, the first critical step toward production is learning how to evaluate and monitor your LLM/RAG applications. The best strategy is to build something simple end-to-end, attach an evaluation layer on top of it, and then iterate quickly in the right direction, letting the evaluation layer clearly indicate what needs improvement.

Thus, this workshop focuses on evaluating LLM/RAG apps. We will take a simple, predefined agentic RAG system built in LangGraph and learn how to evaluate and monitor it. To do that, we will explore the following topics:
- Add a prompt monitoring layer.
- Visualize the quality of the embeddings.
- Evaluate the context from the retrieval step used for RAG.
- Compute application-level metrics to expose hallucinations, moderation issues, and performance (using LLM-as-judges). See the code sketch at the end of this description.
- Log the metrics to a prompt management tool to compare the experiments.

Prerequisites: You will need an OpenAI API key, as the evaluation providers work only with OpenAI. Still, the expected costs are minimal: under $1. For all setup instructions, please check here: https://github.com/decodingml/workshops/tree/main/workshops/odsc-2025-evaluation-playbook/template

Tools/Languages utilized:
- API: OpenAI
- Cloud dashboard & API: Opik

Code Notebooks:
- https://github.com/decodingml/workshops/tree/main/workshops/odsc-2025-evaluation-playbook/template
- https://github.com/decodingml/workshops/tree/main/workshops/odsc-2025-evaluation-playbook/solution

Slides: https://drive.google.com/file/d/1POd8AyE6v6CdaMHPX_lQ8lMsgFfd5er-/view?usp=sharing

👉 🔔 Subscribe, like, share, and engage in the AI revolution!

ODSC Upcoming Events: https://odsc.com/

🎯📚 Master the skills that power AI using the latest tools, languages, and frameworks with renowned instructors and leading experts. Join us on our online learning platform! https://aiplus.training/

🚀🧠 Want to stay ahead on AI? Then subscribe to ODSC today - https://www.youtube.com/c/OpenDataScienceCon?sub_confirmation=1

You can also follow ODSC on:
LinkedIn - https://www.linkedin.com/company/open-data-science/
Twitter - https://twitter.com/_odsc
Facebook - https://www.facebook.com/OPENDATASCI/
Medium - https://odsc.medium.com/
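Code sketch (referenced from the topics list above): a minimal illustration of the two Opik pieces the workshop touches—prompt monitoring via the @track decorator and an LLM-as-judge hallucination metric. This is a sketch under stated assumptions, not the workshop's actual code: answer_question is a hypothetical stand-in for the LangGraph agentic RAG pipeline in the notebooks, the example strings are made up, and exact parameter names may vary across Opik versions. It assumes opik is installed and configured (opik.configure()) and that an OpenAI API key is set, matching the prerequisite above.

```python
# Minimal sketch: Opik prompt monitoring + an LLM-as-judge metric.
# Assumes `pip install opik`, a prior `opik.configure()`, and OPENAI_API_KEY
# in the environment. `answer_question` is a hypothetical placeholder for
# the workshop's real LangGraph agentic RAG pipeline.
from opik import track
from opik.evaluation.metrics import Hallucination

@track  # logs each call's inputs/outputs as a trace in the Opik dashboard
def answer_question(question: str, context: list[str]) -> str:
    # Placeholder: the real app would run retrieval + an LLM call here.
    return "Paris is the capital of France."

question = "What is the capital of France?"
context = ["France's capital city is Paris."]  # stand-in retrieved chunks
output = answer_question(question, context)

# Score the answer against the retrieved context with an OpenAI judge model.
result = Hallucination().score(input=question, output=output, context=context)
print(result.value, result.reason)  # hallucination score plus the judge's rationale
```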