RAG Pipelines for AI-Enhanced Discovery in Web Archives

Description

Corey Davis, Digital Preservation Librarian, University of Victoria Libraries
NDSA Digital Preservation Conference, Session 1A2

Web archives are full of valuable cultural and historical content, but they’re notoriously hard to work with: messy layouts, repetitive boilerplate, and clunky keyword search make discovery a real challenge. Retrieval-Augmented Generation (RAG) offers a way to cut through the noise by letting people ask natural language questions and get grounded, source-based answers.

In this talk, I’ll share a custom RAG pipeline we built at UVic Libraries to improve access to WARC-based web archives. We were inspired by WARC-GPT, an open-source tool from the Harvard Library Innovation Lab, and wanted to take the next step by building our own version from scratch. That gave us a chance to dig into the components, experiment, and adapt everything to our local infrastructure and needs. Our setup includes cleaner text extraction, smarter chunking, GPU-accelerated embedding, and prompt strategies to cut down on hallucinations and improve results.

To test it, we used a web archive of the Bob’s Burgers Wiki (yes, really), which gave us a great sandbox for measuring retrieval accuracy, citation quality, and system performance. The custom pipeline ended up being faster, smaller, and more precise, reducing index size by over 95% and giving clearer, more accurate answers.

I’ll walk through what we built, what we learned, and why this kind of system could help libraries and archives make web collections more useful, without giving up on trust, provenance, or human oversight.
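For readers new to RAG, the stages named in the abstract (chunking, embedding, retrieval, and a grounding prompt) can be sketched in a few lines of Python. This is only an illustrative sketch, not the UVic pipeline or WARC-GPT: it assumes page text has already been extracted from the WARC files, and it swaps in a simple bag-of-words cosine similarity where a real system would use a neural embedding model. All function names, parameters, and the example.org URLs are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=40, overlap=10):
    """Split text into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def vectorize(text):
    """Stand-in for an embedding model: lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(pages, size=40, overlap=10):
    """Chunk every page and keep the source URL with each chunk."""
    index = []
    for url, text in pages.items():
        for chunk in chunk_text(text, size, overlap):
            index.append({"url": url, "text": chunk, "vector": vectorize(chunk)})
    return index


def retrieve(question, index, k=2):
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    ranked = sorted(index, key=lambda rec: cosine(q, rec["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question, hits):
    """Assemble a prompt that restricts the model to the retrieved sources."""
    sources = "\n".join(
        f"[{i}] ({hit['url']}) {hit['text']}" for i, hit in enumerate(hits, 1)
    )
    return (
        "Answer using only the numbered sources below and cite them by number. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical pages standing in for text extracted from a web archive.
pages = {
    "https://example.org/a": "The library harvests websites into WARC files every month",
    "https://example.org/b": "Burger recipes include cheese and onions",
}
index = build_index(pages)
hits = retrieve("How often does the library harvest websites?", index, k=1)
prompt = build_prompt("How often does the library harvest websites?", hits)
```

Keeping the source URL alongside every chunk is what lets the final answer carry citations back to the archived page, and the instruction to answer only from the numbered sources is one common prompt-level tactic for reducing hallucinations.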

