Here's the standard RAG pipeline everybody teaches. You take your documents, create embeddings, store them in a vector database, and when somebody asks a question, you retrieve the relevant chunks and feed them to an LLM. The pitch is always the same: just embed your documents and ask questions. It sounds magical, and for simple use cases it works.
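To make that pipeline concrete, here is a minimal sketch of the index-then-retrieve flow. The embed stub, chunk size, file name, and top-k are illustrative placeholders, not code from the video; any real embedding model would slot in where the stub is.

```python
# Minimal sketch of the standard RAG pipeline, for illustration only.
# embed() is a stand-in for a real embedding model; "contract.txt" is a
# hypothetical document; the chunk size and top-k are arbitrary.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for an embedding model call (returns a fixed-size vector)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)

def chunk(document: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking: exactly the step that destroys document structure."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Index time: chunks and vectors are fixed before any question is ever asked.
chunks = chunk(open("contract.txt").read())
vectors = np.stack([embed(c) for c in chunks])

def retrieve(question: str, k: int = 3) -> list[str]:
    """Query time: nearest chunks by cosine similarity, nothing more."""
    q = embed(question)
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n---\n".join(retrieve("What is the purchase price?"))
# context is pasted into the LLM prompt; the model never sees anything outside it.
```

Everything the LLM ever sees is whatever retrieve happens to surface, which is exactly where the trouble starts.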
But here is where things get interesting. Let me show you what actually happens when you throw real documents at a RAG system. Chunks lose context: when you split documents into 500-token pieces, you destroy the relationships between sections. The chunk about the purchase price doesn't know it's connected to Exhibit B three pages later. Cross-references become invisible; embeddings can't follow these links. Similarity is not the same as relevance: just because two pieces of text are semantically similar does not mean they are both useful for answering your question. And there is no reasoning happening. RAG is pattern matching at the surface level. It does not understand your document; it's just finding things that look similar.

Let's talk about the main culprit. Here's what a RAG system actually sees when you ask a question about the purchase price. Chunk 47 talks about the purchase price and then it cuts off. Chunk 48 mentions Section 2.3(b), which outlines the methodology. These chunks are neighbors in the document, but the RAG system does not know that.
What you actually need is the full context, because it's really a chain. Documents are structured; they have hierarchy, and chunking destroys all of that structure. Now take a typical setup: ten documents in a folder. When you ask a question, you might need information from three or four of them, but they reference each other. Look at this line: "The terms in Exhibit B are subject to Schedule 4.2." That's one sentence spanning two documents. A RAG system retrieves one chunk; it might find the sentence, but it won't follow the chain. And this is not a niche problem. Legal documents, technical specifications, financial filings: they all work the same way. Cross-references are the norm, not the exception.

So what if, instead of precomputing chunks and hoping for the best, we let the AI navigate documents the way a human would? Think about how you read documents. Let me give you a concrete example: you're looking for the return policy. What do you do first? You don't read page one. You look at the table of contents. Chapter 3 is return policies; that looks relevant. So you go to page 28 and scan for electronics. This is how intelligent document search should work: not "find similar text" but "understand and navigate." Here is what a RAG system would have returned for the same question: some chunks that mention the return policy and tell you to see Section 3.2. It found related text, but not the answer.
But the agentic approach finds the actual answer, follows the reference to the extended warranty info, and gives you the complete context. That's the difference.

Okay, so what does this agentic file search look like? It's different from the usual agentic RAG approaches you've probably seen. It starts with a user query, same as RAG, but instead of retrieving precomputed chunks, the agent scans all the documents in parallel. It gets a quick preview of everything, all at once. Then it decides which documents are actually relevant and does a deep dive, reading the full content of those files. And critically, if it finds a reference to another document, it goes back and reads that too. I call this a three-phase strategy. Phase one: get a preview of all the documents in parallel. This is fast; you're not reading everything, just enough to categorize. Phase two: read the full content of the documents that look relevant. Phase three: backtrack. If document A says "see document B" and you skipped document B earlier, go back and read it. The agent decides what to read based on the question. That's the key difference from RAG.

Okay, so let's summarize the fundamental differences. RAG uses precomputed embeddings; they are fixed at index time, and you get the same chunks no matter how you phrase your question. Agentic file search is dynamic: it adapts to each query, different questions lead to different exploration paths, and it follows logical connections, not just semantic similarity.
Now, the practicalities matter: you're going to be using a lot more tokens. And I'm not saying RAG is useless; it has its place. Use RAG when you have simple question answering over a large corpus, when documents are independent and don't reference each other, and when speed is critical, because RAG is faster. Use agentic file search when you need complex multi-document reasoning, when documents reference each other, when structure matters (as in legal or technical documents), and when accuracy is more important than speed. They are different tools for different jobs. So, in summary: RAG is a retriever, it finds similar text; agentic file search is reasoning, it understands documents. Embeddings find similar text; agents understand documents. And I think the future is not just retrieval, it's exploration.
Okay, so now let me show you a practical example before the technical explanation of how this works. Here's a quick demo of what this agentic file search explorer looks like in practice. I can select a folder; there are 26 different files in it. This is demo data for a large acquisition that I created, and the files range from a few pages to tens of pages. Since this is about an acquisition, we're going to ask a question like: what happens to employees after the acquisition, covering retention, benefits, and non-competes? A traditional RAG system may not be able to answer complex questions like this.

This is going to be relatively slow compared to a traditional retrieval step, so you can't really use it for real-time chat. It's extremely helpful when you want more accurate answers and you're not constrained by latency. Second, it uses an approach similar to what coding agents like Claude Code do: when you ask for a code change, Claude Code uses grep along with regular-expression-based search to find the code snippets most relevant to the user's question. This system does something very similar. It intelligently identifies which documents to process; it doesn't read all of the documents at once.

So here's the answer we got. You can see it cites the files inline, showing where each piece of information comes from. It covers employee retention, how many total employees there were, then the benefits and equity treatment for each of them. It also covers the non-competes, and at the end it lists all the sources that were consulted.
Now let's walk through, step by step, how the agent was thinking about it. First, it used the scan folder tool (we'll get into the technical details later in the video). This runs a parallel scan to categorize all documents in the folder by their relevance to employee matters. Based on that initial assessment, it decided to parse the one document it considered most relevant. That document explicitly cross-references several others, such as the key employee retention agreement, which are separate files in the folder. You can see the agent reading through the document and checking whether it cross-references any other document in the folder.

This is the key innovation, because in a lot of technical documentation, whether legal or financial, cross-references to other documents are everywhere. The agent is smart enough to identify those cross-references and then go and look for those specific documents. So after reading the first file, it goes and reads another file that was cross-referenced, and it does the same for a few other files. At one point it says: I have identified the key employees and their retention bonuses, but I need to find the specific terms for the treatment of equity awards. So it's backtracking here, and it identifies another file it should be looking at. It reads a couple more files and, based on the information it has collected, comes up with the final comprehensive answer.

Now, if you used the same prompt in a traditional embedding-based semantic-similarity retrieval system, or even a keyword-based retrieval system, it would not be able to go through this multi-step process, even if you were building an agentic RAG system on top of semantic similarity. Here, you get a much more comprehensive response.
Now, a couple of very important points before I show you the technical details. Agentic file search is going to be a lot slower than a traditional RAG system. It's going to use a lot more tokens as well, so it's relatively more expensive compared to a traditional RAG system, where you reduce the context that gets fed into the large language model. The good news is that if you're running this completely locally, you can use a long-context open-weight model for this purpose. The bottom line: use this system when latency is not an issue and you can wait for more accurate responses that take more time.

Okay, so now let me walk you through how this system was built. I'm not going to walk you through the installation process; the link to the GitHub repo is in the video description, and it's pretty straightforward to install and run locally.
This is File Search Explorer, an open-source agentic document search system built on top of a LlamaIndex example. Let me show you how exactly it's built. Here's the high-level architecture: a user query comes in and goes to the workflow engine. The workflow coordinates an agent, which talks to Gemini 3 Flash for decision making. The agent does not just return text; it returns structured actions. There are six tools the agent can use; think of them as the agent's hands. It can scan folders, preview files, parse documents, read text, search with grep, and find files with glob. The agent decides which tool to use based on what it has learned so far; it's not a fixed pipeline. Also, notice there is no indexing step here.
Now, let's talk about the core components. First, the workflow engine: I'm using LlamaIndex workflows for this. It's event-driven; the agent emits events and the workflow reacts. This gives you async execution, timeout handling, and a clean separation of concerns. Second, the agent. This is a thin wrapper around Gemini 3 Flash; you can replace it with another model if you want. The important part is structured JSON output: the model returns a schema, not free text. Third, the document parser. In this case I'm using Docling. It's open source and runs locally, with no API calls for parsing. It handles PDF, Word documents, PowerPoint, and pretty much anything else you can imagine, and everything gets converted to clean markdown text. I like having a standard format the agent can work with.
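If you're curious what that parsing step looks like, here's a rough sketch using Docling's converter API; the file path is hypothetical.

```python
# Sketch of the parsing step: convert a supported document to markdown with
# Docling, locally, with no API calls. The file path is hypothetical.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("deal_docs/share_purchase_agreement.pdf")

# One clean markdown string the agent can read, grep, or preview.
markdown_text = result.document.export_to_markdown()
print(markdown_text[:1500])  # roughly the kind of preview the scan tool returns
```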
So here are the six tools; let me walk you through each one, because this is the core of the whole pipeline. The first is scan documents, the most important tool: it scans all documents in a folder in parallel and returns a quick preview of each one. The second is preview file, a quick preview of a single document, about 3,000 characters, roughly the first page, useful for spot checks. The third is parse file, full document extraction; this is the deep dive, you get everything. Then there is read, for plain text files, faster than parse file when you don't need document processing. Then there is grep, regular-expression search within a file, useful when you're looking for specific patterns. And the last one is glob, which finds files by pattern, like "find all PDFs in a folder." The agent chooses which tool to use based on the current context; the decision making is where the intelligence lives.

Let me zoom in on the scan folder tool, because this is the key component. Here's the implementation: it finds all documents in a folder, then uses a ThreadPoolExecutor to process them all in parallel, with four workers by default. Each document gets a quick preview of roughly the first 1,500 characters.
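Here's a simplified sketch of that scan step. The function names, folder layout, and supported extensions are illustrative; the worker count and preview length mirror the defaults described above.

```python
# Sketch of the parallel folder scan: preview every document at once so the
# agent can triage in a single step. Not the repo's exact code.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()  # single instance reused here; a production version
                                 # might prefer one per worker

def preview_file(path: str, limit: int = 1500) -> str:
    """Parse a document and return roughly its first `limit` characters."""
    result = converter.convert(path)
    return result.document.export_to_markdown()[:limit]

def scan_documents(folder: str, limit: int = 1500, workers: int = 4) -> dict[str, str]:
    """Preview every supported document in a folder in parallel (four workers by default)."""
    paths = [p for p in Path(folder).iterdir()
             if p.suffix.lower() in {".pdf", ".docx", ".pptx", ".md"}]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        previews = pool.map(lambda p: preview_file(str(p), limit), paths)
    return {p.name: text for p, text in zip(paths, previews)}

# The agent then sees something like:
# {"share_purchase_agreement.pdf": "ARTICLE 1. PURCHASE PRICE ...",
#  "exhibit_b_earnout.pdf": "EXHIBIT B. EARNOUT METHODOLOGY ...", ...}
```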
Now, why does this matter? There are three main reasons. First, one round trip instead of ten: rather than the agent calling preview file ten times, it calls scan folder once and sees everything. Second, parallel I/O: document parsing is I/O bound, so parallelism gives you a real speedup, roughly four times faster than sequential. Third, informed decisions: the agent can categorize all the documents at once, deciding that these three are relevant and these seven can be skipped. This is the intelligent filtering step.
Okay, let me show you how these tools get used in practice. The system follows the three-phase strategy. Phase one is the parallel scan: when the agent encounters a folder, it uses scan folder to preview everything at once; quick, broad coverage. Phase two is the deep dive: based on those previews, the agent identifies which documents are relevant and calls parse file on each, reading the full content. The last phase is the backtrack: if a document says "see Exhibit B" and the agent skipped Exhibit B in phase one, it goes back and reads it. The system prompt explicitly tells the agent to watch for references and backtrack when needed.
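To show the shape of the strategy, here is a hypothetical, hard-coded linearization of the three phases. In the real system none of this is hard-coded: the agent chooses these steps itself through the decision loop described next, and the relevance and reference judgments are made by the LLM rather than the crude keyword and regex stand-ins used here.

```python
# Hypothetical walk through the three phases, runnable on plain .txt files.
# The "relevance" and "reference" checks below are deliberately crude stand-ins
# for judgments the LLM makes in the real system.
import re
from pathlib import Path

REF_PATTERN = re.compile(r"(?:see|refer to|as defined in)\s+([\w .&-]+?\.txt)", re.IGNORECASE)

def preview(path: Path, limit: int = 1500) -> str:
    return path.read_text(errors="ignore")[:limit]

def gather_context(question: str, folder: str) -> dict[str, str]:
    docs = list(Path(folder).glob("*.txt"))

    # Phase 1: scan - quick preview of everything (stand-in for the parallel scan tool).
    previews = {p: preview(p) for p in docs}

    # Phase 2: deep dive - read documents whose preview shares words with the question.
    keywords = set(question.lower().split())
    read = {p: p.read_text(errors="ignore")
            for p, text in previews.items()
            if keywords & set(text.lower().split())}

    # Phase 3: backtrack - follow "see foo.txt"-style references to files skipped in phase 2.
    referenced = {Path(folder) / name
                  for text in read.values()
                  for name in REF_PATTERN.findall(text)}
    for p in referenced - set(read):
        if p.exists():
            read[p] = p.read_text(errors="ignore")

    return {p.name: text for p, text in read.items()}  # handed to the LLM to write the answer
```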
Okay, so let's talk about the agent's decision loop: what happens on each iteration. The agent receives the full conversation history, meaning the original question, all the tool calls so far, and all their results. It sends those to Gemini 3 Flash, which returns structured JSON, not free text, just a schema. The JSON contains an action type with three possibilities: a tool call, which means "use this tool with these parameters"; go deeper, which means "navigate into this directory"; and stop, which means "I have the answer." After executing the action, the result gets added to the history and we loop. This continues until the agent says stop or it hits a timeout.
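As a sketch, the action schema and the loop around it might look roughly like this; the field names and the injected call_llm / execute_tool helpers are assumptions for illustration, not the repo's exact definitions.

```python
# Sketch of the structured-action loop. Field names, call_llm() and
# execute_tool() are assumptions, not the repo's code.
from enum import Enum
from pydantic import BaseModel

class ActionType(str, Enum):
    TOOL_CALL = "tool_call"   # use this tool with these parameters
    GO_DEEPER = "go_deeper"   # navigate into this directory
    STOP = "stop"             # I have the answer

class AgentAction(BaseModel):
    action: ActionType
    tool: str | None = None        # e.g. "scan_documents", "parse_file", "grep"
    parameters: dict = {}          # tool arguments chosen by the model
    answer: str | None = None      # final answer, filled in when action == STOP

def run_agent(question: str, call_llm, execute_tool, max_iters: int = 20) -> str:
    """call_llm(history) -> JSON string from the model; execute_tool(action) -> tool result.
    Both are injected because they depend on your LLM client and tool registry."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_iters):                                  # hard cap in place of a wall-clock timeout
        action = AgentAction.model_validate_json(call_llm(history))
        if action.action is ActionType.STOP:
            return action.answer or ""
        result = execute_tool(action)
        history.append({"role": "tool", "content": str(result)})
    return "Stopped: iteration limit reached."

raw = '{"action": "tool_call", "tool": "scan_documents", "parameters": {"folder": "deal_docs"}}'
print(AgentAction.model_validate_json(raw))   # typed and validated, nothing to regex-parse
```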
Okay, one optimization that matters is document caching. When parse file gets called, we first check a cache. If the document was already parsed, maybe from a preview or a previous question, we return the cached version instantly. The implementation is simple: a dictionary mapping file paths to parsed content. Parse once, reuse everywhere. This matters for backtracking: if the agent previewed a document and later needs the full content, the preview work isn't wasted.
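A minimal sketch of that cache, with parse_with_docling standing in for the real parser shown earlier:

```python
# Sketch of the parse cache: a plain dict keyed by resolved file path.
# parse_with_docling() is a stand-in name for the Docling-based parser.
from pathlib import Path

_PARSE_CACHE: dict[str, str] = {}

def parse_file_cached(path: str) -> str:
    key = str(Path(path).resolve())
    if key not in _PARSE_CACHE:            # parse once...
        _PARSE_CACHE[key] = parse_with_docling(key)
    return _PARSE_CACHE[key]               # ...reuse for previews, backtracking, later questions
```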
Next, let's look at how the agent knows when to backtrack: cross-reference detection. We tell the agent to watch for patterns like "see Exhibit A," "as stated in the agreement," "refer to Section 4.2," and so on. These are in the system prompt. When the agent encounters a reference, it checks: was that document in my skip list? If yes, it backtracks and reads it. If the referenced document wasn't even in the folder scan, it might use glob to find it. This is where the agentic part shines: the agent is reasoning about what it needs, not just pattern matching like a simple RAG system.
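The repo's exact wording will differ, but the system-prompt instruction is along these lines:

```python
# Illustrative excerpt of the kind of cross-reference instruction that lives in
# the system prompt (paraphrased, not the repo's exact wording).
CROSS_REFERENCE_INSTRUCTIONS = """
While reading a document, watch for cross-references such as:
  - "see Exhibit A" / "see Schedule 4.2"
  - "as stated in the Key Employee Retention Agreement"
  - "refer to Section 4.2 of the Disclosure Letter"
If a referenced document is in your skip list, backtrack: parse it before answering.
If it was not in the folder scan at all, use the glob tool to locate it.
"""
```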
The system is designed to be extensible. Want to add a new tool? There are four simple steps: define the function in fs.py, add it to the tool registry in agent.py, update the tool definitions in models.py, and document it in the system prompt. Done. Want to swap the LLM? Replace the Google generative AI import with Anthropic or OpenAI and update the generate call; the rest of the system does not change at all. The architecture keeps models, tools, LLMs, and workflows as separate concerns.
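As an illustration, adding a hypothetical count_words tool might look like this; fs.py, agent.py, and models.py are the files named above, but the exact registry and schema shapes are assumptions.

```python
# Sketch of adding a hypothetical new tool, following the four steps above.
# The registry shape is an assumption for illustration, not the repo's code.

# 1. fs.py - define the function
def count_words(path: str) -> int:
    """Return a rough word count for a plain-text file."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return len(f.read().split())

# 2. agent.py - register it so the tool dispatcher can find it
TOOL_REGISTRY = {"count_words": count_words}   # merged into the existing registry

# 3. models.py - add "count_words" to the tool name enum / schema
# 4. system prompt - one line telling the agent when count_words is useful
```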
Okay, some key insights from building this. One, parallel scanning is crucial: the scan folder call gives the agent full context to make smart decisions; without it, you're making dozens of round trips. Two, structured outputs prevent errors: when the LLM returns a schema instead of free text, you eliminate an entire class of bugs; no regular expressions, no parsing failures. And three, backtracking enables thoroughness: the ability to revisit skipped documents based on cross-references is what makes this work on real documents. The code is open source; the link is in the description. If you want to see more deep dives into agentic systems, let me know in the comments.
In this video we will look at file search exploration as a potential replacement for RAG. This is built on top of the fs-explorer from llamaIndex.
LINK to the Repo: https://github.com/PromtEngineer/agentic-file-search
My voice to text App: whryte.com
Website: https://engineerprompt.ai/
RAG Beyond Basics Course: https://prompt-s-site.thinkific.com/courses/rag
Signup for Newsletter, localgpt: https://tally.so/r/3y9bb0
Let's Connect:
🦾 Discord: https://discord.com/invite/t4eYQRUcXB
☕ Buy me a Coffee: https://ko-fi.com/promptengineering
🔴 Patreon: https://www.patreon.com/PromptEngineering
💼 Consulting: https://calendly.com/engineerprompt/consulting-call
📧 Business Contact: engineerprompt@gmail.com
Become Member: http://tinyurl.com/y5h28s6h
💻 Pre-configured localGPT VM: https://bit.ly/localGPT (use Code: PromptEngineering for 50% off)