Today we're going to be talking about RAG pipelines and the importance of keeping your database up to date. At this point, I'm assuming you've already built some sort of vector database RAG agent before. If you haven't, I built a
full course on that. You can go watch
that video up here. And then when you're
done with that video, come back over
here and we're going to build out a data
pipeline. In today's example, we're
going to make sure whenever we drop a
PDF into a Google Drive, it gets put
into our vector database. Whenever we
update that file in Google Drive, it
will also get put into our database and
the old one will be deleted. And then of
course if we delete the file out of
Google Drive, it will also be deleted
out of our vector database. So I don't
want to waste any time. Let's get into
the video. All right, real quick before
we hop into n8n, I wanted to do a few
slides about why data pipelines matter
for the success of your AI agents. The whole point of setting up a knowledge base that all of your agents can pull from is that the knowledge in there is accurate and up-to-date. So if your data is messy,
outdated, or scattered everywhere, your AI agents are going to struggle to deliver real, accurate answers. So what we need to do is design automated RAG pipelines that constantly keep the vector database, or wherever the data is being stored, accurate. So
when I think of a data pipeline, I think
of three steps. I think of the raw
material that we take in. I think of the
processing line of what actually happens
to that raw material and then I think of
where it ends up sitting. So a quick practical example of this: in this workflow I've got my transcripts pipeline. The raw material that I'm giving it is the URL of a YouTube video, right here. And then we
move into the processing flow where I
get the transcript from it. I'm
extracting the actual transcript and I'm
extracting the timestamps, merging that
back together. And this was basically me
cleaning up and getting the data ready
to be ingested into the final product, which is our Supabase vector database right over here. So we've got four
essential components to be thinking
about. The first one is the trigger: what actually starts the process of
getting data into a vector database or
deleting data from a vector database.
This could be a new email coming in that you want vectorized. This could
be a new row in a Google sheet. It could
be a file upload, or it could even be some sort of criteria being met. And let me
show you again what I mean by that with
a real example of this YouTube
transcript pipeline. After we get a
YouTube video into our vector database,
we then put it in a Google sheet. And so
the Google sheet would look like this
where we'd get the title, the URL, and
the transcript. And then we would also have a status: it would be "processed", or if I changed it to "remove", that would trigger off this second flow down here. This flow goes off whenever a row's status equals "remove"; it then filters out all the other rows and gets rid of the vectors that came from that video. So hopefully that
makes sense. If it didn't, you can go
ahead and watch this YouTube transcript
video which I'll tag right up here. But
that's just a way for me to make sure that my vector database only contains YouTube videos that I want to chat with.
And then we have inputs. And these are
the data sources that we need to
process. You really want to know exactly
what your data sources look like and how
they're going to be coming in because
predictability is your best friend. Are
they going to be PDFs? Are they going to
be CSVs? Are they going to be both? Are
there going to be images? Or is it just
going to be text? You need to understand
this stuff in order to make that middle portion of your RAG pipeline actually good. And then of course, we take those
inputs and we process them. We clean
them up. We remove duplicates. We make
sure that they're ready to go. We give
them metadata, stuff like that. And then
we actually shove them into our vector
database or a relational database,
wherever we actually want to keep them.
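Just to make that processing step concrete, here's a minimal sketch in TypeScript of what "clean it up, dedupe it, and give it metadata" could look like. The names and the naive chunking are made up for illustration; this is not the exact logic the n8n data loader runs.

```typescript
// A minimal sketch of the processing stage: split raw text into chunks,
// drop empties and duplicates, and attach metadata we can filter on later.
// The interface and chunk size are illustrative, not n8n internals.
interface Chunk {
  content: string;
  metadata: {
    fileName: string;   // a unique-ish identifier to filter on later
    date: string;       // ISO timestamp of when it was ingested
  };
}

function processDocument(rawText: string, fileName: string, chunkSize = 1000): Chunk[] {
  const pieces: string[] = [];
  for (let i = 0; i < rawText.length; i += chunkSize) {
    pieces.push(rawText.slice(i, i + chunkSize).trim());
  }
  // remove empty and duplicate chunks before anything gets embedded
  const unique = [...new Set(pieces.filter((p) => p.length > 0))];
  return unique.map((content) => ({
    content,
    metadata: { fileName, date: new Date().toISOString() },
  }));
}

// Example: processDocument(pdfText, "Policy and FAQ Document")
```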
So I really just wanted to preface this
stuff because it's really, really
important to think about what data you're currently processing and then, later, how you can scale this up. So, a great example
today is we're just going to be building
a flow to handle PDFs. But later on, if
we knew, okay, we might also need Word
Docs and Excel files and stuff like
that, then you could come in here and
build a system like this where you're
watching a Google Drive folder, but then
you also have a switch to handle PDFs, text files, and Excel files. They all get processed differently because they're different types of files, but ultimately they all go into the same vector database. So, that's just an
example I wanted to show you guys real
quick of what I meant by understanding
these core components and why
predictability is your best friend.
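To make that routing idea a bit more concrete, here's a tiny sketch of picking a processing branch by MIME type. It's just an illustration of the switch logic, not what the n8n Switch node literally runs; the branch names are made up.

```typescript
// Sketch: route an incoming file to a processing branch based on its MIME
// type, so each format gets handled differently before everything ends up
// in the same vector database. Branch names are illustrative.
type Branch = "pdf" | "text" | "spreadsheet";

function pickBranch(mimeType: string): Branch {
  switch (mimeType) {
    case "application/pdf":
      return "pdf";
    case "text/plain":
      return "text";
    case "text/csv":
    case "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet":
      return "spreadsheet";
    default:
      throw new Error(`Unsupported file type: ${mimeType}`);
  }
}

// A PDF dropped into the watched folder goes down the "pdf" branch:
console.log(pickBranch("application/pdf")); // -> "pdf"
```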
So, now that we got all of that boring stuff
out of the way, let's get started with
this build. So, the first thing we're
going to build is the pipeline that
takes a new doc that we drop into a Google Drive folder and puts it into a vector database. So, super simple, we're going to start off here by grabbing a Google Drive node, and we're going to grab a trigger that is on changes involving a specific folder. The first thing here, after you connect your Google Drive account, is to choose the folder that you're going to be watching. We
are going to grab one that I just made
called rag. There we go. And then what
are we watching for? We're watching for
a new file being created in this folder.
So, what I'm going to do real quick is
go over to my Google Drive and we're
going to take this Policy and FAQ Document. And I am just going to move this into our folder called rag. As you can see right here, it's moving into rag. And then when we go back to n8n and I hit fetch test event, we should now see that that folder has arrived. Or sorry, not the folder, the file. You can see if I scroll over somewhere, there it is. It is called Policy and FAQ Document. So,
we've got that data here. What I'm going
to do now is just pin this to keep it
here for now. The next thing we're going
to do is actually download this file
because all that came back here was like
metadata about the file, its ID, its
title, all that kind of stuff. So, I'm
going to grab another Google Drive node.
We're going to do download file, and I'm
going to change the file we're looking
for to be by ID. And then all I have to
do here is find the ID of the file that
triggered this workflow. Okay, so I had
to scroll down a little bit, but I found
it. It is right here. I'm going to drag
that into the box. And now we have this
variable which represents the ID of the
incoming file. And I'm just going to
click execute step. And now we should
see the binary over here. Actually, I forgot that this is a Google Doc. So what I'm going to do is add an option down here where I can actually download any Google Doc as a PDF. So I can click on add conversion, and rather than turning a Doc into HTML, I can turn a Doc into a PDF. And if I run this again, we should now see right here that this is coming through as a PDF. So perfect.
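For anyone curious, that conversion option roughly amounts to a Google Drive export of the Doc as a PDF. Outside of n8n, a sketch of the same thing with the googleapis Node client might look like this (the credentials file and file ID are placeholders):

```typescript
// Sketch: export a Google Doc as a PDF via the Drive API. Google-native
// files can't be downloaded directly; they have to be exported to a
// concrete format like PDF. Credentials path and file ID are placeholders.
import { google } from "googleapis";
import { writeFileSync } from "node:fs";

async function downloadDocAsPdf(fileId: string): Promise<void> {
  const auth = new google.auth.GoogleAuth({
    keyFile: "service-account.json", // placeholder credentials
    scopes: ["https://www.googleapis.com/auth/drive.readonly"],
  });
  const drive = google.drive({ version: "v3", auth });

  const res = await drive.files.export(
    { fileId, mimeType: "application/pdf" },
    { responseType: "arraybuffer" }
  );
  writeFileSync("document.pdf", Buffer.from(res.data as ArrayBuffer));
}

downloadDocAsPdf("YOUR_FILE_ID").catch(console.error);
```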
We've got what we want. And now it's as simple as that. I'm just going to add a Supabase step: we're going to add a Supabase vector store node and we're going to add documents to it. So I'm choosing the table to put it in in Supabase, which is called documents. As you can see, here's my environment and this is the table we're going to put it in. We don't need to add any other options. We just need to add our default data loader. And this is important, because right now it's looking for JSON, but what we actually want to give it is binary. As you can see, we have our PDF right here as binary. So, I'm going to change that to binary. We're going to leave everything else up here as default for the sake of the example, but we are going to add some metadata. This is going to be very important for us later when we need to update and delete files. So, I'm going to add metadata. The first thing that I'm going to add is the file name. I'm just going to use some camel case there and put in fileName. And then we just need to
go back to the schema of this file and
find its name. So if I scroll down here,
we can see the name is policy and FAQ
document. I'm going to throw that right
in there. And then we're going to add
one more metadata property, which is
going to be date. And then I am just
going to type in two open curly braces
and do dollar sign. Now, so whenever we
get a new piece of information put into
our vector database, we can see the
exact date and time that it was
uploaded. That way, we can just later on
validate that if we update a file in our
Google Drive that it updates in Subbase
as well. Okay, cool. So, we have file
name and date as our metadata. That's
all we're going to do for now. And then
I'm going to add an embedding. So, I'm
going to choose OpenAI. I've already got
this all set up. We've got text-embedding-3-small, which has to be the same as the embedding model for your database. So we're good to go here. And now I'm just going to run this, and this is going to put that Policy and FAQ Document into our Supabase. Cool.
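To make it clear what actually lands in the database, each chunk roughly becomes a row like this, assuming the standard content / metadata / embedding columns from the Supabase vector store quickstart (the values here are just examples):

```typescript
// Sketch of one row in the "documents" table. Column names assume the
// standard Supabase vector store setup; values are examples.
interface DocumentRow {
  id: number;
  content: string;           // the chunk of text that was embedded
  metadata: {
    fileName: string;        // what we'll filter on when updating/deleting
    date: string;            // when this chunk was ingested ({{ $now }})
    [key: string]: unknown;  // extras the loader adds (title, producer, ...)
  };
  embedding: number[];       // 1536-dimensional vector from text-embedding-3-small
}

const exampleRow: DocumentRow = {
  id: 1,
  content: "Orders are processed within 1-2 business days...",
  metadata: {
    fileName: "Policy and FAQ Document",
    date: "2025-01-01T12:00:00.000Z",
  },
  embedding: [0.0123, -0.0456 /* ...1536 numbers in total */],
};
```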
So it says five items should be there.
So we should refresh this and see five
items. Oh, they popped up right there.
And if we go to the metadata and open
this up, we can see that we have we have
title and producer because it I guess it
got that from the binary data itself.
But we also have the metadata down here
that we added which was date and file
name right there. And instead of file
name, you could have also done file ID
as long as you have some sort of unique
variable that you can reference later.
And you guys will see exactly what I
mean by that when we do this next
pipeline. Real quick before we build
that next pipeline, I'm just going to
build a really, really quick AI agent so
we can validate that it is able to read
this document. Okay, so I set that up real quick. I'm just going to ask it: what is
our shipping policy?
Shoot that off and we should get an
answer from the vector database. I
didn't even give the agent a prompt or
anything. We just hooked it up to a tool
and look how smart this guy is. So,
we've got our shipping policy: orders are processed within 1 to 2 business days, standard shipping takes 3 to 7 business days. And you can see right here that it is correct. All right, cool.
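If you're wondering what that tool call is doing behind the scenes, retrieval against a Supabase vector store boils down to embedding the question with the same model and running a similarity search. Here's a rough sketch with supabase-js and the OpenAI client; it assumes a match_documents function like the one from the Supabase vector store quickstart (the exact parameters depend on how that function was created in your database), and the URL and keys are placeholders.

```typescript
// Sketch: answer-time retrieval from the "documents" table. Assumes a
// match_documents function like the Supabase vector search quickstart's;
// its exact parameters depend on how it was defined. Keys are placeholders.
import { createClient } from "@supabase/supabase-js";
import OpenAI from "openai";

const supabase = createClient("https://YOUR_PROJECT.supabase.co", "YOUR_SERVICE_KEY");
const openai = new OpenAI({ apiKey: "YOUR_OPENAI_KEY" });

async function search(question: string) {
  // 1. Embed the question with the same model used at ingest time.
  const embeddingResponse = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });
  const queryEmbedding = embeddingResponse.data[0].embedding;

  // 2. Ask Postgres for the closest chunks to that embedding.
  const { data: matches, error } = await supabase.rpc("match_documents", {
    query_embedding: queryEmbedding,
    match_count: 4,
  });
  if (error) throw error;
  return matches; // the chunks the agent reads before answering
}

search("What is our shipping policy?").then(console.log);
```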
So, the next step is that we now need to
create a flow that when we update this
file, it will also update in our Supabase vector database. So what we're going to do is add another trigger, which is going to be another Google Drive trigger. So you might think
to just do on changes to a specific
file, which is fine if your vector
database only has one file, but what
we're going to do is on changes
involving a specific folder instead,
just in case you drop many files into this folder. So, we're going to choose
that same one again, which was called
rag. And we're going to be watching for
a file updated rather than a file
created. All right. So, I just changed the store name in the document. It used to be Tech Haven. As you can see in the vector database, the Policy and FAQ doc says the store name is Tech Haven, but I just came in here and changed it to Green Grass. So, now when
we test this trigger, it should pull in
that file because the file had a change
made to it. So, we got this information back. But now, before we download the file, what we want to do is get rid of all of the vectors in Supabase where the file name equals Policy and FAQ Document, because these are now outdated vectors. So to do this, we're going to add another node, and this is going to be a Supabase node, not a vector store node, just a regular Supabase node, and we're going to choose delete a row. So once again, we need to
choose the table, which is documents, and keep in mind this is a table that has embeddings. So it is a vector store, but we're able to use the regular Supabase node here. So what we want to do is delete: we're going to delete rows in this documents table, but instead of "build manually" we're going to choose "string", and then I'm going to change this to an expression and paste in this expression right here, which is metadata->>fileName (the metadata field) followed by =like.*. So kind of a mouthful and not a super intuitive string, but this is how it's going to work. And what we need to
do now is just go down to grab the file
name of this file. And like I said, you
could use the ID. You could use anything
that's unique to this file. I just
decided to go with name because it looks
a little less intimidating for the sake
of the demo. So now any vectors where the fileName metadata equals this are going to get deleted. So if I hit execute step, we should see five items were output, because we had five vectors right here. And these should disappear any second now. There you go, they're gone. So now we know our vector database is clean of old vectors.
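For reference, that whole delete step boils down to something like this if you were doing it with the supabase-js client instead of the n8n node. It's just a sketch: the URL and key are placeholders, and I'm matching the fileName metadata key exactly rather than with the like pattern the node uses.

```typescript
// Sketch: delete every vector row whose metadata fileName matches the
// incoming file, so stale chunks don't linger after an update.
// Project URL and service key are placeholders.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient("https://YOUR_PROJECT.supabase.co", "YOUR_SERVICE_KEY");

async function deleteVectorsForFile(fileName: string): Promise<void> {
  const { error } = await supabase
    .from("documents")
    .delete()
    // ->> reads fileName out of the metadata jsonb column as text
    .eq("metadata->>fileName", fileName);
  if (error) throw error;
}

deleteVectorsForFile("Policy and FAQ Document").catch(console.error);
```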
And now all we have to do is the same thing as up here: download the file and then put it into Supabase. So I'm actually going to
copy this Supabase node right here and just put it right here. And then I'm going to
grab another Google Drive node in order
to download the file. And we just need
to download by ID once again. And we're
going to choose the ID from the Google
Drive file that triggered this workflow
which is at the bottom right here. Same
thing actually though. I'm going to do
the file conversion and make sure the
doc is getting turned into a PDF and
then download it. Okay, one thing did happen though, so let me explain it. We pulled back five files, but they're all the same one. And the reason that happened is because when we deleted five rows from Supabase, that node output five items, which makes Google Drive think it needs to output five items as well. So, we're going to click on this node, go to settings, and just turn on "Execute Once". And now when we run this again, it only has one item, as you can see. And now we're able to just hook that puppy into Supabase. And
when we run this, I believe everything should be set up. We should still have the metadata in our data loader, but we need to fix this because the name is not mapped correctly. So, we're going to go back to the Google Drive trigger and scroll down to get to the name, which is all the way near the bottom. They make it so hard to find. There it is: Policy and FAQ doc. So now this flow has the right metadata, and you can see now it's good. We're passing over a new date and time. And then we're going to go ahead and run this one as well. And because it's basically the same file, it should still be five items. We should go to Supabase and wait for these to pop in. And you can see now that the vectors are updated, because it says the store name is Green Grass rather than Tech Haven. And once again, we could come up here and chat with the agent and say, what is our store name? Send that off, and it should hit Supabase right here and come back with Green Grass. There you go: our store name is Green Grass. Okay, so we now have our flow
that puts a file into Supabase when it is put into a Google Drive folder. We've got one that, when we update that file, is going to delete the old records and then put the new ones into Supabase. And now the last thing we need to handle is: what happens if we actually want to delete that file? How do we make sure Supabase deletes its vectors? So, we have a bit of a band-aid fix, but it does work. And also keep in mind, what I'm trying to show you guys here is the idea of building these pipelines. I'm not saying that this is the optimized way to chunk and split and embed data into a Supabase vector store. We're just keeping it simple here with the main foundational high-level concepts. So
anyways, here's why we need the fix: as you can see, when we go to Google Drive, go to triggers, and do on changes involving a specific folder (sorry, let me just grab the folder), there's no watch for file deleted. Not exactly sure why. Obviously, they have something on their end that sends over the data from their webhooks and triggers, whatever. But there's not that option
there. So, what we do is we're going to go with file created, but we're going to choose a different folder. We're going
to choose a new one that we made called
the recycling bin. So, now it's watching
a separate folder for anything that gets
put in the recycling bin. So, what I'm
going to do is go over to this Policy and FAQ doc. We're going to go ahead and move it once again. So, I'm going to move this to the recycling bin folder. As you can see, now it's gone from there, and it should have gone into our recycling bin right here. Same file though. And now, if we go into n8n and we fetch the test event, we should see that we got this file, which is once again hopefully the Policy and FAQ Document right here. And then, it's actually really simple, because we don't have to ingest anything. We just have to delete. All we have to do is throw this right in here and make sure that everything's mapped up correctly, which is that same metadata->>fileName=like. expression pointing at the file's name. Execute that. We should get five items over here, and then we should see them be deleted from Supabase. Just like that. So, it was
really that simple. And now we have two different Google Drive folders, so that as soon as we either drop something in, update its contents, or move it from there to the recycling bin folder, our Supabase vector database will be taken care of. Once again, this wasn't for
showing you how to optimally process
data and put it in. This was more about
the idea of it and how you can think
about creating these different triggers
and using metadata to actually filter
and delete things. So, I hope you guys
were able to watch this one, understand
what's going on, and follow along. As
always though, you'll be able to
download this exact workflow. I'll also
have like sticky notes and a setup guide
and stuff. All you have to do to get
that for free is join my free Skool community. The link for that will be
down in the description. Once you get in
there, it will look like this. You'll
just need to navigate to the YouTube
resources and every single one of my
videos here has some sort of resource.
So, right here is my developer agent and
you have the developer agent JSON to
download. And if you're looking to dive
a little deeper with more hands-on
learning experience, then definitely
check out my plus community. The link
for that is also down in the
description. Got a great community of over 200 members who are building with n8n every day and building businesses with n8n. It's a super active group. We've been having some really fun calls and discussions lately. And we also have a full classroom section where we dive into the foundations with Agent Zero, then 10 Hours to 10 Seconds, and then a new course for our annual members called One-Person AI Automation Agency. So I'd
love to see you guys in these calls in
the community. But that's going to do it
for the video. If you enjoyed this one
or you learned something new, please
give it a like. Definitely helps me out
a ton. And as always, I appreciate you
guys making it to the end of the video.
I'll see you on the next one. Thanks
everyone.
Full courses + unlimited support: https://www.skool.com/ai-automation-society-plus/about
All my FREE resources: https://www.skool.com/ai-automation-society/about
Have us build agents for you: https://truehorizon.ai/
14 day FREE n8n trial: https://n8n.partnerlinks.io/22crlu8afq5r

If you’re building RAG agents in n8n, this is one of the most important tutorials you’ll ever watch. In this step-by-step video, I’ll show you how to build a RAG (Retrieval-Augmented Generation) pipeline completely with no code. This setup automatically keeps your database synced with your source files, so when you update or delete a file, your database updates too. That means your AI agents always search through accurate, trustworthy data instead of outdated information. Without this system in place, you can’t rely on your AI’s answers at all. By the end of this video, you’ll understand exactly how to connect everything inside n8n, Google Drive, and Supabase, even if you’re a complete beginner.

Sponsorship Inquiries: 📧 sponsorships@nateherk.com

TIMESTAMPS
00:00 Why Data Pipelines Matter
03:55 Initial Doc Upload
08:41 Update Doc Pipeline
12:54 Delete Doc Pipeline
15:24 Want to Master AI Automations?