
RAG Pipeline: 7 Iterations Explained!
Large Language Models (LLMs) have shown significant improvements across cognitive tasks, and an emerging application is enhancing retrieval-augmented generation (RAG). These systems require LLMs to understand queries, retrieve relevant information, and synthesize accurate responses. Given their increasing real-world deployment, comprehensive evaluation is crucial. FRAMES (Factuality, Retrieval, And reasoning MEasurement Set) is a high-quality dataset designed to test LLMs' factual responses, retrieval capabilities, and reasoning in generating final answers. Unlike previous work that evaluates these abilities in isolation, FRAMES offers a unified framework for assessing LLM performance in end-to-end RAG scenarios. It comprises challenging multi-hop questions that require integrating information from multiple sources. Baseline results show that even state-of-the-art LLMs struggle, achieving only 0.408 accuracy without retrieval. The proposed multi-step retrieval pipeline, however, significantly improves accuracy to 0.66, a relative improvement of more than 50% (see the sketch at the end of this description).

In this video, I talk about the following:
- What does the FRAMES benchmark dataset contain?
- How do single-step LLMs perform on FRAMES?
- How does multi-step retrieval perform on FRAMES?

For more details, please see the paper: https://aclanthology.org/2025.naacl-long.243.pdf

Krishna, Satyapriya, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, and Manaal Faruqui. "Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation." arXiv preprint arXiv:2409.12941 (2024).

Thanks for watching!
LinkedIn: http://aka.ms/manishgupta
HomePage: https://sites.google.com/view/manishg/
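
As a rough illustration of what a multi-step retrieval pipeline looks like, here is a minimal Python sketch of an iterative retrieve-then-generate loop. All names here (retrieve, generate, multi_step_rag) and the prompt format are hypothetical stand-ins, not the paper's actual implementation; the step budget of 7 merely echoes the video title.

# Minimal sketch of a multi-step (iterative) retrieval loop for RAG.
# NOTE: retrieve() and generate() are hypothetical stubs, not the paper's
# actual pipeline; swap in a real retriever and LLM client.

def retrieve(query: str, k: int = 5) -> list[str]:
    """Stub retriever: replace with BM25, a dense index, or a search API."""
    return [f"(placeholder document for: {query})"][:k]

def generate(prompt: str) -> str:
    """Stub LLM call: replace with your model of choice."""
    return "FINAL: (placeholder answer)"

def multi_step_rag(question: str, max_steps: int = 7) -> str:
    """Alternate between retrieval and generation: at each step the model
    either emits a final answer or proposes the next search query."""
    context: list[str] = []
    query = question
    answer = "(no answer within step budget)"
    for _ in range(max_steps):
        context.extend(retrieve(query))
        prompt = (
            "Context:\n" + "\n".join(context)
            + f"\n\nQuestion: {question}\n"
            + "Reply 'FINAL: <answer>' if the context suffices, "
            + "otherwise 'SEARCH: <next query>'."
        )
        reply = generate(prompt)
        if reply.startswith("FINAL:"):
            answer = reply.removeprefix("FINAL:").strip()
            break
        # Use the model's proposed follow-up query for the next hop.
        query = reply.removeprefix("SEARCH:").strip()
    return answer

if __name__ == "__main__":
    print(multi_step_rag("Who directed the film that won Best Picture in 1995?"))

The key design choice is letting the model itself decide whether the gathered context suffices or whether another retrieval round is needed; this is what allows multi-hop questions to be answered one hop at a time.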
