Description

Testing LLM applications requires structured evaluation methods rather than manual prompt checks. This episode explores a four-dimension RAG evaluation framework, a six-phase LLM testing cycle, and how golden datasets enable measurable A/B testing. We also discuss why prompt injection is emerging as the most critical security risk for AI applications.

Research & References

https://krishcnaik.substack.com/p/a-complete-guide-to-llm-chatbot-evaluation
https://rhesis.ai/post/how-to-test-llm-applications
https://dev.to/ritwikareddykancharla/ab-testing-llm-systems-2mb6
https://owasp.org/www-project-top-10-for-large-language-model-applications/
https://github.com/explodinggradients/ragas
https://gandalf.lakera.ai/pinj

Companion newsletter and episode notes: https://daily.testingeducation.org/
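
For listeners who want to try this style of evaluation, below is a minimal sketch using the Ragas library linked above. It is not the exact setup discussed in the episode: it assumes ragas 0.1.x-style column names, a tiny hand-built golden example, and an OpenAI API key configured for the judge model; adapt the details to your own stack.

```python
# Minimal sketch: scoring a tiny golden dataset on the four standard ragas
# dimensions (faithfulness, answer relevancy, context precision, context recall).
# Assumes `pip install ragas datasets` and an OPENAI_API_KEY in the environment
# for the LLM judge; column names follow the ragas 0.1.x convention and may
# differ in other versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# A hand-built golden example: question, retrieved contexts, model answer,
# and the reference answer a human has agreed is correct.
golden = {
    "question": ["What does the retry middleware do on a 503 response?"],
    "contexts": [[
        "The retry middleware retries idempotent requests up to 3 times "
        "with exponential backoff when the upstream returns 503."
    ]],
    "answer": ["It retries idempotent requests up to three times with exponential backoff."],
    "ground_truth": ["Idempotent requests are retried up to 3 times with exponential backoff on 503."],
}

dataset = Dataset.from_dict(golden)

# One call scores all four dimensions; re-running the same golden dataset
# against a new prompt or model gives the numbers needed for an A/B comparison.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```

Re-scoring the same golden dataset against each prompt or model variant is what turns the comparison into a measurable A/B test rather than a manual prompt check.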