
MASTER SERIES - RAG - 15 - SQL DATABASES PARSING AND PROCESSING
DATASKILLED
Corey Davis, Digital Preservation Librarian, University of Victoria Libraries
NDSA Digital Preservation Conference Session 1A2

Web archives are full of valuable cultural and historical content, but they’re notoriously hard to work with: messy layouts, repetitive boilerplate, and clunky keyword search make discovery a real challenge. Retrieval-Augmented Generation (RAG) offers a way to cut through the noise by letting people ask natural language questions and get grounded, source-based answers.

In this talk, I’ll share a custom RAG pipeline we built at UVic Libraries to improve access to WARC-based web archives. We were inspired by WARC-GPT, an open-source tool from the Harvard Library Innovation Lab, and wanted to take the next step by building our own version from scratch. That gave us a chance to dig into the components, experiment, and adapt everything to our local infrastructure and needs.

Our setup includes cleaner text extraction, smarter chunking, GPU-accelerated embedding, and prompt strategies to cut down on hallucinations and improve results. To test it, we used a web archive of the Bob’s Burgers Wiki (yes, really), which gave us a great sandbox for measuring retrieval accuracy, citation quality, and system performance. The custom pipeline ended up being faster, smaller, and more precise, reducing index size by over 95% and giving clearer, more accurate answers.

I’ll walk through what we built, what we learned, and why this kind of system could help libraries and archives make web collections more useful, without giving up on trust, provenance, or human oversight.
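
For readers who want a concrete picture of the stages named in the abstract (text cleanup, chunking, retrieval, and a grounding prompt), here is a minimal Python sketch. It is not the UVic pipeline and makes several assumptions: the page texts and example.org URLs are toy data, every function name is hypothetical, boilerplate is detected simply as lines repeated across most pages, and a bag-of-words cosine similarity stands in for the GPU-accelerated embedding model, which the abstract does not specify.

```python
import math
import re
from collections import Counter

def strip_boilerplate(pages, max_share=0.5):
    """Drop lines that appear on more than max_share of pages (menus, footers)."""
    line_pages = Counter()
    for text in pages.values():
        line_pages.update({ln.strip() for ln in text.splitlines() if ln.strip()})
    limit = max_share * len(pages)
    cleaned = {}
    for url, text in pages.items():
        kept = [ln.strip() for ln in text.splitlines()
                if ln.strip() and line_pages[ln.strip()] <= limit]
        cleaned[url] = " ".join(kept)
    return cleaned

def chunk_words(text, size=40, overlap=10):
    """Split text into windows of `size` words that overlap by `overlap` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

def bag_of_words(text):
    """Stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question, index, k=2):
    """Return the k (url, chunk) pairs most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(index, key=lambda item: cosine(q, bag_of_words(item[1])),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, hits):
    """Ask the model to answer only from numbered, URL-labelled sources."""
    sources = "\n".join(f"[{i}] {url}: {chunk}"
                        for i, (url, chunk) in enumerate(hits, 1))
    return ("Answer using only the sources below and cite them by number. "
            "If the sources do not contain the answer, say so.\n\n"
            f"{sources}\n\nQuestion: {question}")

# Toy pages (hypothetical data) sharing a navigation line and a footer.
pages = {
    "https://example.org/page-a": "Home | About | Search\nThe harbour lighthouse was built in 1890.\nAll rights reserved",
    "https://example.org/page-b": "Home | About | Search\nThe town library opened in 1925.\nAll rights reserved",
    "https://example.org/page-c": "Home | About | Search\nThe annual fair takes place every August.\nAll rights reserved",
}

cleaned = strip_boilerplate(pages)
index = [(url, chunk) for url, text in cleaned.items() for chunk in chunk_words(text)]
question = "When was the lighthouse built?"
hits = retrieve(question, index, k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

Removing repeated lines before indexing is one plausible way an index can shrink, since navigation and footer text is otherwise embedded once per page. Keeping the source URL next to each chunk is what lets the final answer cite where it came from.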