
MASTER SERIES - RAG - 15 - SQL DATABASES PARSING AND PROCESSING
DATASKILLED
This video presents our CS441 project: an incremental Retrieval-Augmented Generation (RAG) indexing pipeline built with Apache Spark, Delta Lake, and Ollama. The system processes a corpus of academic PDFs, extracts and normalizes their text, generates embeddings locally with Ollama, and stores the results in Delta Lake tables. Unlike a traditional batch indexer that reprocesses everything on each run, this pipeline detects which documents are new, modified, or deleted, then re-chunks and re-embeds only the affected data, so the ingestion workflow is incremental and idempotent. The video walks through the full architecture, the configuration, local execution in a Linux environment, and the log verification that demonstrates the incremental behavior. We also highlight how the design supports real-world deployment, including compatibility with AWS EMR and S3. By the end of the demonstration, viewers will understand how to construct a scalable, production-style embedding index that stays up to date as source documents evolve.
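To make the change-detection idea concrete, here is a minimal sketch in plain Python. It is not the project's code: the video's pipeline runs on Spark and Delta Lake, while this example assumes a simple in-memory mapping from document path to a SHA-256 hash of the extracted text. The names `content_hash` and `detect_changes`, and the sample file names, are hypothetical and chosen only for illustration.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Return a stable fingerprint of a document's normalized text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_changes(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Compare two {path: hash} snapshots.

    Returns (new, modified, deleted) document paths, each sorted.
    """
    new = sorted(p for p in current if p not in previous)
    deleted = sorted(p for p in previous if p not in current)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return new, modified, deleted


# Hypothetical snapshot stored by the previous indexing run.
previous_index = {
    "a.pdf": content_hash("alpha"),
    "b.pdf": content_hash("beta"),
    "c.pdf": content_hash("gamma"),
}

# Hypothetical snapshot computed from the corpus on the current run.
current_docs = {
    "a.pdf": content_hash("alpha"),    # unchanged, so it is skipped
    "b.pdf": content_hash("beta v2"),  # text changed, so it is re-embedded
    "d.pdf": content_hash("delta"),    # newly added
}

new, modified, deleted = detect_changes(previous_index, current_docs)
print("new:", new)
print("modified:", modified)
print("deleted:", deleted)

# Once the current snapshot has been stored, a second run finds no work.
print("rerun:", detect_changes(current_docs, current_docs))
```

Only the paths in `new` and `modified` would need re-chunking and re-embedding, and the paths in `deleted` would have their chunks removed from the index. Comparing a snapshot against itself yields three empty lists, which is the idempotent behavior the description refers to.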