When you upload documents to an AI
service, you're placing a lot of trust
in that company. Trust that they'll keep
those documents secure, that they won't
use them to train their models, and that
they won't end up being exposed in a
data breach down the road. And for a lot
of documents, that's fine. But for
sensitive ones like legal, medical,
financial, or client docs, that's a much
bigger ask. For these, you need full
control. So today, we're going fully
local and airgapped. No external APIs.
We're going to build an AI agent in N8N
that can interrogate your private
documents using a technique called RAG,
and all running fully privately on your
machine and available to others in your
local network. And in many ways, this is
the future of AI in business with local
models getting more and more advanced
and companies looking to reduce risk by
deploying on-prem. The stack we'll be
using today includes n8n, Ollama,
Docling, and Docker. And while all of
that might sound complicated, there's no
need to worry because we're going to
build everything out step by step. So by
all means, follow along and soon you'll
have your very own local multimodal RAG
agent up and running. All right, let's
get into it. So what do I mean by
multimodal RAG? Well, here I'm talking
about retrieval across a knowledge base
that has multiple data types. So we
could have text documents or PDFs with
embedded images or tables. We could have
audio files like meeting transcripts or
even videos. And the benefit of
multimodal RAG is that when you process
a PDF that has an embedded image, for
example, then that embedded image can be
retrieved and returned as part of the
chat conversation with the agent. So
this is incredibly powerful because a
lot of AI agents will only ever return
text from your knowledge base. So what's
the best way to process all of your
files locally and make them accessible
to your agent? Well, this is where we
use Docling, which is an open-source
document processing library created by
IBM. With Docling, you feed it PDFs,
Word docs, PowerPoint presentations,
images, audio files, and it spits out
clean structured markdown or JSON that
your agent can then search over. And
this isn't just basic text extraction
here. As you can see, it's able to
recognize headers. It's able to
recognize tables. It can extract these
diagrams as images. And the text in the
diagrams is actually searchable as well.
So, you are maintaining the semantic
structure of the document. Here we have
bullet points, for example. And under the
hood, there are two distinct ways you can
actually process documents. The first is
using their standard pipeline, which is a
pipeline of specialized models and
algorithms that analyze layout, extract
table structure, carry out OCR, and then
assemble the output to be exported into
a different format. And the beauty of
this approach is that even though there
are AI models involved here, they're
non-generative models. So you don't end
up with hallucinations; it is copying
the text out verbatim. And there are
specialized pipelines for different file
formats. So for DOCX or PowerPoint, it
knows how to parse those markup formats
to actually create this Docling
document, which you can then export to
Markdown, JSON, or XML, for example.
There is also a different approach you
can take with Docling, which is to use a
VLM, a vision language model,
similar to a large language model. With
the VLM pipeline, it takes a document,
which could be a 100-page PDF, for
example, breaks it into pages, and then
batch-processes those pages, sending
each one into a VLM. And here, you're
asking the VLM to extract all of the
text as accurately as possible into a
specific format like Markdown. And from
there, the Docling document, which is
the core of the Docling library, is
created. And then it can be exported to
lots of different formats. And VLMs can
be quite powerful, but because you are
dealing with generative AI, you can end
up with hallucinations in the extracted
text, and in a way that needs to be
balanced against inaccuracies in OCR from
the standard pipeline. So there is no
100% best approach, but I do like the
standard pipeline for a lot of use cases.
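If you want to get a quick feel for the two pipelines before wiring anything into n8n, Docling also ships a command-line tool. A minimal sketch, assuming Docling is installed locally; the exact flag names can vary between versions, so check docling --help:

```bash
# Standard pipeline: layout analysis, table structure, and OCR models,
# nothing generative, so the text is copied out verbatim
docling spec-sheet.pdf --to md --output ./out

# VLM pipeline: pages are rendered and passed through a local
# vision-language model instead (which model depends on your setup)
docling spec-sheet.pdf --pipeline vlm --to md --output ./out
```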
When it comes to VLMs, there are various
options. To run a fully air-gapped,
local system, you would need to use the
likes of IBM's Granite-Docling,
SmolDocling, or Qwen-VL. There are lots of
cloud-based proprietary VLMs, like Gemini,
OpenAI, and Claude, but it's not possible
to run any of those fully locally. And if
you are looking at locally hosted VLMs,
just go to ollama.com, click on models and
then vision, and you'll see that there's a
long list that you can actually use:
Mistral's vision models, DeepSeek-OCR, and
so on. So you have plenty of options. But then, all
of that leads to the hardware
requirements that are needed to actually
run local AI, because these LLMs, VLMs,
embedding models, all are based off a
neural network which requires billions
or even trillions of parameters to be
loaded into memory to actually output
responses. And these computations are
far beyond the capabilities of
traditional CPUs and RAM. You
essentially need a graphics card to
actually run these. And within the
system that we'll be going through
today, we will be using a local LLM like
GPT-OSS 20B. We may want to use
a VLM to ingest documents. There are the
non-generative AI models within that standard
pipeline for Docling. And then we have
embedding models to create the vectors
that we can search over. So graphics
cards are essential here. And there are
various options that you can use. Nvidia
GeForce RTX cards are pretty common for
local AI, but there is a limitation on
the complexity of models that you can
actually run on these. And the same with
AMD Radeon and Apple Silicon. And
probably the max size LLM that you can
run on these cards comfortably would be
in the region of 25 to 35 billion
parameters. It is possible to load in
larger models like a 70 billion
parameter model, but you would need to
heavily quantize it at which point
you're losing a lot of the quality of
the model. This really is a key
requirement if you are deploying local
AI in a business. There is an upfront
investment needed to build out the
server to actually host this system. And
the more concurrent users you have, the
more hardware you'll need to actually
run it. And tokens per second is
critical here because people are used to
the speed of response from the likes of
ChatGPT or Claude. So there will be an
expectation that a local system should
be able to do the same thing, whether
that's a reasonable expectation or not.
An NVIDIA RTX 4090 is coming in at around
$1,600.
The 5090 is at the $2,000 mark. And from
here, you'd need to build out a server
further. But you can see that this is
the fixed cost up front. And the benefit
then is you have your fully local
system, and there are no cloud fees
required to actually run it. An
important thing to note is you don't
need this hardware in place right now to
actually build out your local AI
application. This is what you need when
you actually use this in production to
air gap the actual system. But to
actually set up and design and test your
system with dummy data, you could use
cloud-based open-source models, using the
likes of Ollama Cloud or OpenRouter,
which have lots of different open-source
models available to use. So at least
with this approach, you can get started
straight away building out your solution
and then in parallel, you can actually
start getting the infrastructure ready
to go for when your system is going to
be running in production. If you'd like
to get access to our state-of-the-art
local rag system, then check out the
link in the description to our
community, the AI automators, where you
can join hundreds of fellow builders all
looking to create production RAG agents.
Docklin is an open-source MIT licensed
application that's available on GitHub.
And there are two particular projects to
note. So there's the core project which
you can see on screen and then there's
also docling-serve. This is an API
wrapper on the core Docling library.
And this is crucial because we want to
use n8n as an orchestrator for our RAG
pipeline to push in documents to be
processed. So where do we go from here?
We obviously want to set up Docling and
n8n locally. So n8n has produced a
self-hosted AI starter kit which bundles
n8n, Ollama, Qdrant, and Postgres together
in a docker compose file. So this makes
it quite straightforward to spin all of
this up on your machine. The only thing
that's missing though is Docling. So
what I've done is I've forked this
starter kit repo and I've added the
Docling Docker Compose config into the starter
kit. I'll leave a link for this in the
description below so that you can follow
along. But before we set this up, let's
just take a helicopter view of how all
of this actually operates. So all of
these services are going to be running
in Docker containers. And if you haven't
heard of Docker before, Docker lets you
run applications in isolated
environments and isolated containers.
And if you think about the applications
we need to run locally for the system (n8n,
Docling, Qdrant), they all have
different system requirements, different
libraries. They're written in different
programming languages. So normally to
get all of these applications running
natively on your machine can be a bit of
a nightmare. And thankfully Docker
sidesteps all of that. So each
application runs in its own isolated
environment. And that way they can't
conflict with each other because they
can't see each other's internals. They
just communicate over a shared network.
And quickly some terminology for you to
understand. So we have Docker images and
these are essentially static. They pull
in the application code. They define the
environment for the application to run.
But as I said, they're static. So to
actually access those applications, you
need to run them within containers. And
that is a running instance of the static
image. And the thing about these
containers is that they're stateless. So
when you create a container, let's say
of n8n, it spins it up from this static
image. And when you remove a container,
it's essentially destroyed, and any
information that was created in it is
lost. And this is why you need Docker
volumes or bind mounts. This is a way
of persisting or saving the data
long-term. So from an n8n perspective,
if you were creating workflows in a
running instance of n8n, you would want
to save those workflows to a volume or
to a bind mount. That way when the
Docker container is deleted, you haven't
lost the workflow and you can simply
spin up the container again from the
static image and it'll load in
everything that's available in the bind
mount or in the volume. So these are the
three crucial concepts you need to
understand about Docker.
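To make those three concepts concrete, here's a rough sketch of the lifecycle using the plain Docker CLI; the starter kit's Compose file does the equivalent for you, and the image, volume, and port below are just illustrative:

```bash
# An image is the static template
docker pull n8nio/n8n

# A container is a running instance of that image; the named volume keeps
# workflow data outside the container
docker volume create n8n_data
docker run -d --name n8n -p 5678:5678 -v n8n_data:/home/node/.n8n n8nio/n8n

# Removing the container destroys everything inside it...
docker rm -f n8n

# ...but the volume, and the workflows saved to it, survive, so a new
# container started with the same mount picks them straight back up
docker volume ls
```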
And when it comes to the Docker network, as I
mentioned, they're isolated containers.
So they can't see the internals of each
other's containers. So they need to
communicate over a network. And this
trips up a lot of people that aren't
used to Docker. So if you have n8n as a
container and it's trying to speak to
Qdrant or to Docling, it needs to
communicate over the Docker service
name. So it would be qdrant and then
the port, or docling and then the port.
Whereas if you're trying to access n8n
yourself, you would just use localhost and
then the port.
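A rough illustration of the difference, assuming the Compose service names docling and qdrant and the ports used in this setup; the exact n8n container name (and whether wget or curl is available inside it) depends on your Compose project:

```bash
# From the host machine, published ports are reached via localhost
curl http://localhost:5001/docs          # docling-serve
curl http://localhost:6333/collections   # Qdrant

# From inside another container (e.g. n8n), localhost means that container
# itself, so you address services by their Compose service name instead
docker exec -it n8n wget -qO- http://docling:5001/docs
docker exec -it n8n wget -qO- http://qdrant:6333/collections
```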
This will make more sense when we actually
start building out our workflow. But what's important to
understand is this idea of the Docker
compose file because here we're
orchestrating the creation of multiple
services and we're defining these
volumes, the persistent layer. We're
defining the ports as well as other
things like environmental variables. If
you haven't used Docker before, I highly
recommend you install Docker Desktop
which is a visual interface into the
volumes, the images, and the containers.
And finally, if you're new to building
and deploying local AI systems, then you
should definitely use an AI code editor.
These things give you superpowers, and
they're brilliant for troubleshooting
issues with Docker Compose files or
networks. They can provide the prompts
that you need to use to actually spin up
containers, to help you version control
your system. The list is endless. So
for this project, I'll be using Cursor
and that's where I'm going to start. If
you're enjoying the video, make sure to
give it a like below and subscribe to
our channel for more deep AI and n8n
content. It really helps us out. So
open up Cursor. Again, you can also use
VS Code or Antigravity. And I'm just
going to click clone repo. And I'll grab
the URL of our forked AI starter kit
repo. And we'll just select a folder as the repo
destination. And then it starts cloning
in the repository. And here we go. We
can see all the files of the starter kit
on the left. We can see the docker
compose file that I talked about and
that includes the definitions of all of
the services that need to be spun up.
So if we go back to the repo, there's
full instructions on what commands you
need to trigger. So we've already cloned
the repository and here it's asking to
change directory into the starter kit.
We're already in it here. Now we just
need to copy the environment
variables file. So we can copy that out. Now,
you could just copy and paste it here,
Ctrl+V, like that, and rename it. So that
works, or, based off the terminal commands,
we can open up the terminal here with
Ctrl+J, and then you can just paste
the command in there and hit enter,
and that also copies it. So either
works. So we need to set some encryption
keys and passwords in this environment
variables file. There are of course
lots of ways to generate passwords. I
have OpenSSL installed in my Git Bash
here. So I'm just going to generate a 32
character key. So that looks good. So
that could be my Postgres password. I
might just get rid of the equals at the
end. And yeah, I'll just generate a
couple more. That could be my n8n
encryption key. Again, I'll remove
special characters just in case.
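For reference, commands along these lines generate suitable random secrets if you have OpenSSL available (base64 output can end in = padding, which is why I strip it):

```bash
# 32 bytes of randomness, base64-encoded (may end in '=' characters)
openssl rand -base64 32

# Hex output avoids special characters entirely
openssl rand -hex 32
```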
And then, back to the instructions. I am
on an NVIDIA GPU here, so I can now run
this docker compose up command, passing
the profile gpu-nvidia. But obviously, if
you're on AMD or Apple Silicon, you have
other profiles that you can use.
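Pulling the terminal side together, the whole setup so far looks roughly like this; treat the env file name and the non-NVIDIA profile names as examples to verify against the starter kit README:

```bash
git clone https://github.com/theaiautomators/self-hosted-ai-starter-kit.git
cd self-hosted-ai-starter-kit

# Copy the example environment file and fill in the secrets generated above
cp .env.example .env

# Spin everything up with the profile that matches your hardware
docker compose --profile gpu-nvidia up
# e.g. docker compose --profile gpu-amd up   or   docker compose --profile cpu up
```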
So I just copy that out and then back into
here. We'll paste it in. And what that
does is it downloads the different
images that are needed to actually run
the system. So we're bringing in n8n.
It's downloading Postgres. Qdrant is
already imported. And this can take
quite a while. Docling in particular has
some pretty heavyweight models. So
you're talking about a number of
gigabytes. If you're on a slow internet
connection, it'll take even longer
again. But eventually all of your images
will be downloaded and then it can start
spinning up the containers. As you see
here, we can see that we have our
self-hosted AI starter kit. And if we
open it out, you can see Docling, n8n,
Ollama, Qdrant, and Postgres. Now
there's also one other container called
static files. I'll talk about that in a
second. And within ports here, then you
can see different ports. So, if you
click on the first one, which is
Docling, which is port 5001,
that opens up localhost port 5001. Now, it
says details not found, but if you just
add /ui, you now have your
docling-serve application, and the same
goes for the rest. So for n8n, if you
click that link, you have 5678, and you're
brought to the setup page. For Qdrant,
at 6333,
if you add /dashboard, it'll
bring you to the dashboard. So this is
your vector store. I don't believe we
have a UI on Postgres but that's fine.
We could hit that with a database
client. And then we are serving static
files on port 8080. And this is how the
multimodal RAG aspect is going to kick
in, because the images we extract from
PDFs and Word documents will be hosted
here and available within our chat. So
we can see this is now up and running.
So if you click on the actual group,
you're able to see a stream of all of
the logs from the different services.
And if you want to see log files from
any particular service, just click into
it. So Docling, for example: you can see
that there's a lot of health checks
going on. This is how it started, and
it's given links to the likes of the
docs. So if you click on that, it's
bringing us to a site that can't be found,
but that's fine. We just need to put in
localhost instead of 0.0.0.0. Okay, so
there's our Docling API docs. So
that's how you can track the logs for
the different applications. And that is
important if you're trying to
troubleshoot or debug a problem. So
let's start with n8n on port 5678. So
here we need to set up an owner account.
Now this is all local. This is not n8n
Cloud. You just need to create an
account to be able to log in. And that
brings us straight into the list of
workflows. And there is a demo workflow
that's autoloaded by the n8n starter
package. And it has Ollama chat configured.
So we'll get back to that in a second.
What I might do quickly though is let's
just go into settings. We just need to
enter an activation key because there
are certain features that are gated such
as the idea of pinning previous
executions which is really important
when you're building out workflows. So
if you click on unlock on the top left
here then you can just enter in your
email address and they will send you the
activation key. This is totally free and
everything is still local. Okay. So that
has been activated. So now let's create
a workflow and as a first step let's add
a local file trigger. So we'll come in
here. Let me just move that out of the
way. And under other ways at the bottom
we can see local file. So we want to
trigger changes that involve a specific
folder. So here now we're going to start
building out our RAG ingestion pipeline.
So we'll click on that. And at this
point now we want to watch a folder to
find files as they're dropped in. That
way we can drop in a file, 10 files,
100, a thousand files, and have them all
processed. So we need to add in a folder
to watch. And this gets back to the
volumes and bind mounts because this
needs to be a persistent folder. We
don't want this to be destroyed when we
delete the container. And within the
readme file for the repo, you can see
that they provide the path /data/shared
as the path to use. So if we drop that
in there and then we're going to execute
that step and let's see can we trigger
the files. Now actually there's one
change I need to make. So we'll just
stop listening. Um we need to use
polling. So for whatever reason on my
local system this doesn't work if I
don't use polling. So we'll just execute
that step. And if we come back into
cursor and let's go to the docker
compose just to explain what's actually
happening here under the n8n service.
You can see that we have volumes
specified, and we have a bind mount. So
you can see that the shared directory,
which equates to this directory here, is
mapped to /data/shared, which is what we
just entered into n8n, within the
container. So now, if I create a file
here, so let's just add in a file. Let's
create a new one: test.txt.
And as you can see, that has just
appeared: /data/shared/test.txt. Now
there's nothing in it but just to prove
that it works. So okay let's delete
that. And within our version of the
docker compose, I've created a folder
called rag files. So that way we can
drop all of the files we want to process
into here. So under rag files, let's
create a new folder called pending. And
actually, let's create another folder
called processed as well. That way we
can ingest a file and then move it to
the processed folder. So now let's just
update our trigger in n8n. So we're now
looking for files that are added to the
shared rag files pending path. So,
/data/shared, rag files, pending, and let's
execute that again. And now let's get a
PDF that we can actually start
processing. And let's use the one that I
demonstrated in the intro, which is this
Whirlpool refrigerator spec sheet. It's
only one page, so it's a good test bed
to build out the pipeline. So, I have my
pending folder here. So, let's just drag
in this PDF into that folder. And as you
can see, because I had this local file
trigger executed, it was waiting for a
file to appear, and it has just done so.
There we go. So then a good trick at
this point is just to pin that data. So
just click P on the node. And that way
now if we click execute workflow, we
don't need to keep dragging that file
into that folder. That data is always
there. So next up, let's actually load
up this file. So if we click on the plus
and just type in read, we're going to
read this file from the disk, which is
that one here. And now we need to
provide the path for this file. So
that's the path there. We just drag it
in. And now if we click execute step
there's the binary file. And you can see
by opening it up that's it. So we now
have the file to actually play with. So
next up, we need to send this to Docling
to actually extract structured
information, be it Markdown or JSON. So
if we go back to the Docker Compose, we can
see Docling is on port 5001. So if we
click that, and again, if you go to /ui,
you can see docling-serve's own
interface. But we want to access the API
documentation, so that's done via
/docs. So then we just need to
figure out what API we need to hit. So
we're looking to convert this file. We
want to process the file. Now there's
two options. You can either
asynchronously process a file or
synchronously process it. So I'll show
both. So let's just do synchronous
processing. In other words, we're going
to wait for the response. And you can
see on the top right, this is the path
that we need to hit. So let's just copy
that, bring it in here, and let's use an
HTTP request node. And we're going to
post to this endpoint. Now you'll see
this is mentioning local host which is
incorrect and I'll show you why in a
second but essentially we want to pass
this file to this endpoint and in terms
of the body to send we're going to send
the binary file. So that's done using
either n binary file or you can also use
form data which is what I'm going to
use. And if we go back to the
documentation you'll see that this
requires a parameter called files and
that's an array of binary files. So
we'll just copy that and let's drop it
in here. And the value is data. And
let's leave it like that for the moment.
Um so let's save that. And now if we
execute the workflow, we're going to hit
an error which is to be expected. And
it's saying the service refused the
connection. And the problem is this
local host. And if we come back to our
Docker network diagram here, what's
happening is this N8N container is
trying to communicate with Dockling, but
it's using local host. And local host is
limited to the machine or to the
container. So when it's trying to hit
localhost 501, it's actually searching
within this container. So we need to hit
docklane port 501. And that way it's
actually looking at the broader Docker
network to pass the file. So we'll just
change this out for dockling and then
click execute step. And now we get a
different error which is great. We're
making progress. And it's saying the
request is invalid. And the issue here
is I'm passing a string as opposed to a
file. Um, so it's just this parameter
type. We just need to change this to NAM
binary file. And then yeah, put that
back in as data. And let's execute it
again. And there we go. It's thinking
about it. Okay, we do have a response.
And the fact that it was thinking about
it meant that it actually processed the
file. So that's why there's two
different endpoints. There's the
synchronous endpoint where you wait for
the response, or, if I put in async here,
it'll just give me back a task ID, and
then I can poll for the result. But
let's leave that off for a second.
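Outside of n8n, the equivalent synchronous request to docling-serve looks roughly like this. The files form field is the one called out in the docs; the endpoint path below is a placeholder from memory, so confirm the exact path on the /docs page:

```bash
# POST the PDF as multipart form data and wait for the converted result.
# Use http://docling:5001 from another container, http://localhost:5001 from the host.
curl -X POST "http://localhost:5001/v1/convert/file" \
  -F "files=@whirlpool-spec-sheet.pdf"
```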
And if you have a look at the data here,
there's a lot of image data: it's
base64 image data. And then at
the end, we do have the actual text from
the document. So if you go through the
API documentation, you can see that
there are different parameters that you
can pass. And Docling is quite
comprehensive and flexible with the API
endpoint. So image export mode is the
next one. So let's actually just drop
that in here. We'll add a parameter.
This one is now text, not the binary
file. And then if you choose placeholder
for example, and then if you execute it
again, now you'll see the document has
been processed. And where there are
images, it just says image, but it is
just a placeholder. You've actually lost
the image. It hasn't extracted it. So
instead, let's use referenced because
what that's going to do is it's going to
save that image as you can see here to
the disk. So we have the actual image
name now. And if we go to Cursor, on the
left hand side here, we have a folder
called docling scratch. And if we open
that up, you can see all of the images
that were just extracted from that PDF.
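From the command line, that option rides along as another form field; again, treat the endpoint path as a placeholder to verify against /docs, while image_export_mode and the placeholder/referenced values are the ones discussed here:

```bash
# Ask docling-serve to write extracted images to disk and reference them by
# filename in the returned Markdown, rather than inlining base64 or dropping them
curl -X POST "http://localhost:5001/v1/convert/file" \
  -F "files=@whirlpool-spec-sheet.pdf" \
  -F "image_export_mode=referenced"
```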
And that's what that referenced flag
does. Instead of providing the image
as a base64 string, it saves it to the
container. And this is all made possible
by the way I set up the Docker Compose
file under the Docling configuration.
I've set the working directory to this
shared folder, and I've also set it in
various environment variables. And
because this shared folder is accessible
to both Docling and n8n, it's now
possible for n8n to actually pick up
those files and move them somewhere else
so that we can serve them as part of a
this static files container that I set
up here. So in Docker Compose, this is
essentially just a really simple engine
X server that makes a particular folder
available. And as you can see, that's on
port 8080. So if we click into the
actual port here, you can see we have
dockling scratch are the rag files. So
let's create an extracted images folder
and then we can dump all of the images
in there. So we'll come back in here and
under shared we'll create a new folder
extracted images. And now if we come
back here and refresh we can see
extracted images. So we probably should
lock this nginx server down to this
folder. So under the volume we can see
that shared is actually accessible. So
let's just lock this down further. So
yeah, it's now shared extracted images.
And now all I'm going to do is delete
this static files container and recreate
it and it'll build it back up again off
the back of the server configuration. So
this is the beauty of Docker. So we'll
come in here, static files, and delete.
And now we just need to rebuild this
image. And down here, you can see
actually that we're still getting logs
of the various services that are
running. So we need to run this in
detached mode. So if you just press
Ctrl+C, that'll stop all of the containers.
And now we'll just make one change. So
if you press up, you're going to get the
previous command that you ran, and now
we're just going to add -d to it,
and it'll run in detached mode in the
background. So just press enter. And
that's going to re-up all the containers.
And it's also going to rebuild that file
server container with the new
configuration. Okay. So now if we go
back to the index and if we refresh.
Cool. So there's nothing now in that
folder. We are also locked down to that
folder as well. Okay. So let's go back
into n8n. Let's just refresh it. And as
you can see, our workflow is still here,
even though we just removed all the
containers and added them again, because
we have a dedicated n8n volume. Okay, so
let's just run this again now. And the
document has been processed again. We
can see the markdown content, we can see
the image names. And if we go to cursor
under dockling scratch, we can see the
images themselves. So next up, let's
move these images into this extracted
images folder. And that way they'll be
able to be served in our AI agent chat.
So we essentially need to extract out
all of these images. So as usual with
n8n, there's lots of different ways
can go about this. What I'm going to do
is I'm just going to copy this entire
output. So copy selection. And I have
cursor here with Opus 4.5 set. So I can
literally just ask cursor to do this job
for me to create a code node to extract
out an array of image names. So I'm
saying, can you create JavaScript code
to extract out an array of image names?
And I'm saying here's my JSON input
structure. So just copy that in. And
then I'm also saying here's the skeleton
of the code node for you to start with.
So this is important. So here, let's
just add our code node. We're going to
use JavaScript and just copy this out.
Now, you don't necessarily need this
actual addition of a new field. So you
can delete that. But yeah, copy that
out. And let's paste that into here. And
I'll just say: just output it in chat, no
need to create a file. And off
it goes. And this is key, because the AI
needs to understand the incoming data
structure as well as the skeleton of the
code node, because it might not
understand that this is n8n, or the structure
of the input items in n8n. So copying
in the code node is a really good hack.
Okay, so it's produced the code. So
let's copy that. Let's paste it in here.
And actually before we run it, let's
just pin this as well so we don't need
to keep triggering Docling. And then
let's execute the step. Yeah, there we
go. image names. That's exactly what we
want. So now we can split out this
array. So let's do that quickly. Split
out. And let's pass in our image names
array. And we can execute that. And we
now have our individual images. And then
we need to move this file from this
docling scratch folder into our static
files nginx server. Now,
unfortunately there isn't any move node
that you can use for local files. And so
what we're going to do is just use an
execute command. So we just type in
command and this allows you to run a
shell command. And we're not going to do
this once. We want this to run for every
file. Now if you don't know CLI commands
again, you can just ask Cursor.
Essentially, it's mv for move. So we want
to move this file. And now we need to
get the path of this file as well. So
again, back to Cursor: you can see it's
under shared, docling scratch. And also,
this is under data, because it's the
same as the trigger. This is all set in
the Docker Compose. So here, when we have
a local file trigger, we're looking under
/data/shared, and so it's the same here,
and that's because the bind mount is
against /data/shared, not shared. Okay. So
we're going to move this file from
/data/shared, docling scratch, and that's the file
name. And now let's move it into our
extracted images folder: /data/shared, extracted
images. Okay, that should do it.
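The command the Execute Command node ends up running is just a plain mv, something along these lines; the folder names stand in for whatever you called yours, and the file name would come from the current n8n item via an expression:

```bash
# Move one extracted image from Docling's scratch folder into the folder
# served by the static-files container (paths are as seen from inside the
# n8n container, i.e. under the /data/shared bind mount)
mv "/data/shared/docling-scratch/image_000001.png" "/data/shared/extracted-images/"
```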
So now, let's run it. So execute workflow, and it
has succeeded and let's have a look at
cursor. Let's refresh the file
directory. Yeah, all of the images are
now available under the extracted images
folder. So now if we go to the browser
or if we go to our static files
directory, we can see the images and if
we click into them, there's our
Whirlpool image, and we have the
diagrams. Excellent. So this is probably
the hardest part of this entire project
is actually to extract out the images
and make them available to the AI agent.
Excellent. So next up, we want to import
this document into Qdrant so that we
can actually carry out a vector search
over it. So we can see the markdown
content here. Um so actually let's split
off at this point because this idea of
moving the images can be done in
parallel essentially. So let's add a new
node. Let's look for Qdrant. So
there's our Qdrant vector store, and
we're going to add documents to the
vector store. Now, there is a local
Qdrant database already preconfigured
in the n8n starter kit. So actually,
let's just edit that. It's not working
as well. So I don't think you need an
API key. Let's delete it. Um and then
back to the Qdrant URL. This is the same
thing that we talked about here. So we
need to reference this via the service
name. And actually, this is the exact
host that we need to hit. So let's use
that. So there's the Qdrant URL. And
actually just click save. Yep, it has
succeeded. So yeah, for local AI
implementations, there is usually no API
keys required. Now, of course, you could
set API keys if you wanted to lock it
down within your network. Now, we don't
have a collection yet. So, let's go into
Qdrant. So, back to Docker. There's
Qdrant, which is this one. And then
for Qdrant, it's /dashboard.
Okay, so this is the Qdrant vector
store. So, if we click on collections,
we can add a new collection, and we'll
call this one multimodal rag and
continue. So then it's asking what's the
use case and we're just using global
search here really. There's no per user
documentation or anything like that at
this point. And for this we'll just use
single dense vector embeddings. You
could use hybrid search if you wanted.
Okay. So we need to choose dimensions.
We have a few options here. We want to
use a local embedding model, and the one
I typically use is nomic-embed-text. So
this is available on Ollama, and if you go
to the Nomic website, you can see the
number of dimensions in this
embedding model. So, for
version 1.5, we'll go for the highest
number of dimensions to get the best
quality embeddings. So we'll just drop
768 in there. And then we're going to
use cosine similarity as our algorithm
to figure out what are the closest
vectors. So we'll click continue on
that. So we'll just click finish. Okay.
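The dashboard is just calling Qdrant's REST API, so if you'd rather script this step, an equivalent request looks like the following (the collection name is only an example):

```bash
# Create a collection with 768-dimensional dense vectors and cosine distance,
# matching nomic-embed-text v1.5 at its largest dimension
curl -X PUT "http://localhost:6333/collections/multimodal-rag" \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 768, "distance": "Cosine"}}'
```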
So we now have our Qdrant vector store,
or collection essentially, set up. So
then, if we come back into n8n, we should
be able to choose it now. So let me just
save that and go back in. There we go.
Multimodal rag, and there shouldn't be
anything else to set there. So then, for
the embedding model, we now need to
choose nomic-embed-text. So we need to
use Ollama, and again, we need to specify a
credential for Ollama. So just click on
edit, and it's looking for localhost,
which again doesn't make sense here,
because we need to use the service name
of the service within the Docker
network. So that should be ollama, and
again, no API key; we can delete it. And
if we click retry, yep, we have a green
message. So let's just save that, and it
should load the models again. It didn't
immediately, so let's just get out and
go back in. And there we go. So
this only has Llama 3.2. So we need our
nomic-embed-text model within our Ollama
system. So if we go back to Ollama, you
can see that there is a command we can
use: ollama pull nomic-embed-text. So let's
copy that out. And if you go to exec,
what you can do here is execute
commands within this container. So if I
right-click and paste it in, this is
going to pull the nomic-embed-text model
into this container. And there is a
volume mounted for Ollama, so that when we
destroy this container and recreate it,
we won't need to import nomic-embed-text
again.
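You can do the same from a terminal on the host instead of the Docker Desktop exec tab; the container name depends on your Compose project, so adjust it to match:

```bash
# Pull the embedding model inside the running Ollama container
docker exec -it ollama ollama pull nomic-embed-text

# Confirm it shows up alongside the default models
docker exec -it ollama ollama list
```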
So we'll trigger that, and now, as you
can see, it is downloading this model.
Okay, so that is successful. So now, if
we come back to n8n, back into
embeddings: there you go, nomic-embed-text
latest. So let's click on that, and
we'll save. And then the document we
need to attach a document parser or a
document loader. So there we go. There's
that one. I generally don't use the
simple one. I prefer to use custom. So
we'll hit custom. We'll add a
recursive character text splitter. And I
usually specify markdown as the split
code. So that way it's going to retain
some of the structure at least of the
document in terms of the the chunks it
creates. And we might just reduce the
chunk size a little bit. So maybe to
700. Okay, we are in good shape here. So
now let's connect this up. Let's just
remove this for a second because we just
want to see how the vector store side of
it works. And let's execute the workflow
again. Okay, so that has been ingested into
the Qdrant vector store. And if you
come into Qdrant and click
collections, yeah, you can see there's
now 19 points within the vector store.
Yeah, you can see all of the various
embeddings. So there's the image
URL, that's the table for kind of model
sizes, product dimensions, etc. There are
some nice visualizations within
Qdrant. So if you click on visualize
and just hit run over the limit, it'll
actually show you where the points are
and how they are clustered. So doesn't
really mean much just with one document
but as you load more in, you can see how
they are clustered. There also is a
graph as well, which is kind of neat.
And if you double click it, then it uh
loads up other points close by. Cool. So
we have our vector store. We have the
data in the vector store. So now, come
back into n8n. Let's just hook that
back up again. And now let's create an
AI agent that we can actually converse
with. So let's click on plus. We'll add
a chat trigger. And then let's add an AI
agent, which is that one. And now we need
to add a model. So again, we're going to
use Ollama; it has to run fully locally.
It has specified the local Ollama
service, and it has selected Llama 3.2,
which is imported by default with this
n8n self-hosted AI system. So we'll
just save that. Now that's a very small
model. You're not going to get huge
amounts of intelligence from it, but it
might just be enough to be able to
demonstrate this. So then, in terms of a
tool, let's choose Qdrant, which is
our vector store. It has already
specified the credential. Description
wise, we'll just say use this to fetch
information from the knowledge base. And
then we'll just choose our collection.
and let's limit it to five. Okay, so
we'll save that. We need our embedding
model. Let's grab that from here. So
obviously, it has to be the same,
otherwise you're not comparing like with
like. And I think we're in business
here. We might just set a very simple
system prompt. I'd just say: you must use
the Qdrant vector store to retrieve
information. Actually, a good tip as a
starting point is to add the prompt
from this question and answer chain. So,
if you open it up and then just look
at the system prompt template,
that's not a bad starting point. It
basically says don't make things up.
Yeah, let's use that instead. Okay, so
now let's ask it a question. Maybe show
me the cabinet opening diagram. Let's
try. And actually, I need to add one
more thing to the system prompt, which
is: you must output images in
markdown format using the URL provided
in the retrieved results. Let's try that
for example. Okay, so show me the
cabinet opening diagram. Let's see how
it goes. Okay, it has triggered the
Qdrant vector store, and we do have a
response. Uh there is no image that I
can see anyway. Now that might be down
to the size of the model and it's not
exactly ideal for instruction following
if it's too small. Let me retry that
again. Okay, there are images this time.
The images are broken links, though.
So let's have a quick look at that. So
let's just right-click and inspect them.
Of course, yeah, we haven't added the
full path into the vector store. So
let's close that out for a second and
let's go back to our ingestion flow. And
we essentially need to inject the full
URL here. Now, these image paths are
actually way longer than what Llama
3.2 produced. So, I'd say we need to
upgrade the model anyway, but we
definitely need to add in the full path.
So, let's add another code node here.
And then same again. So, I'm just going
to copy this out. Let's bring it into
Claude Opus. I'll just create a new
chat. It's always good to keep opening
new chats in Cursor: as chats
continue on and on and on, the
actual quality of the responses deteriorates,
due to context rot and a few
other things. Please create JS code that
injects the full URL of images into the
output MD content. Okay, so this is my
input. This is an example of the full
URL. So this is now my nginx file
server. So that's it there: localhost
8080, essentially. And then, again,
let's drop in our JS skeleton here.
Okay, so let's let it run. It's probably
just a regex to find the image and inject
in the URL. A bit of pattern matching.
And here we go. So, let's copy that out.
Drop it in here. Execute step.
And there we go. Yeah. HTTP localhost.
And that's the full image. And let's
just copy that into the browser to see
did it work. Excellent. Okay. Let's
delete everything now from Qdrant, and
let's re-import the file. So, within
Qdrant, you can go through and delete
everything, but this gets pretty
tedious. So let's hit this endpoint to
delete the Qdrant collection and
recreate it. That way, we can quickly
kind of prune all of the vectors without
having to manually create a collection
every single time. So this is the
endpoint. So let's add another HTTP
request node. So what is this? This is a
DELETE method. You pass that in. You
pass in the collection name, which is
multimodal rag. That refused the
connection: of course, localhost again. So it
needs to be qdrant. Okay. So it
deleted it. And if we refresh, no
collection present. Obviously very
destructive, so only to be used when
actually building out your system. Okay.
So let's copy that. So now, this one is
create collection. So we're going to
POST, I assume, to create a collection.
That looks right. And then this is the
body that we need to pass. So copy that
out, drop it in there. And then, what was
it again? 768. Execute step. Didn't
work. Oh, it's a PUT, not a POST. So
there we go. Brilliant.
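For reference, the two requests against Qdrant's REST API look like this from the command line; same caveat about how destructive the first one is, and the collection name is whatever you chose earlier:

```bash
# Drop the whole collection and every vector in it
curl -X DELETE "http://localhost:6333/collections/multimodal-rag"

# Recreate it empty, with the same vector size and distance metric as before
curl -X PUT "http://localhost:6333/collections/multimodal-rag" \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 768, "distance": "Cosine"}}'
```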
Okay, so we have it up and running again. Okay,
let's actually run this fully now. So
let's come back to Cursor. Let's just
delete out the images that are there.
Okay, so that's now gone to Docling.
It's going to extract the images again.
So if I refresh, yep, they've just
appeared and refresh. 19 points.
Perfect. And if we look through the
vectors, we can see that this one has
the full URL now. So now let's ask the
same question to this agent and let's
see, can we get a better answer? Um, I
feel like we probably can't. I don't
think the model is big enough to be able
to output the full URL reliably anyway.
Oh, it did actually. There you go. Cool.
All right. Multimodal RAG. There you go.
And using a very small model, actually.
So that's Llama 3.2, which is a three
billion parameter model that's installed
here. I'm quite impressed actually that
it was able to spit out that image URL
and actually called the vector store cuz
my experience of very small language
models is that they can't even reliably
call tools. But again, I'm sure if I ran
this exact query 10 times, it might
struggle to produce the image accurately
10 times. And that's really where you
need probably a bigger model to be a
little bit more reliable. I have no
memory assigned here as well. So every
time I refresh this, it's a fresh call.
But yeah, it worked again there.
Excellent. Yeah, that's that's great. As
I mentioned earlier, if you don't have
the graphics card to hand right now, you
could hook up an open source model in
the cloud to actually get all of this
stuff up and running. So let's try that.
So I'll come back into the chat model. Let's
create a new credential. And now, let's
just choose ollama.com.
And if we go to ollama.com, if you create
an account and go to API keys, you can
create a new API key. So this is
tutorial. Generate the key. And we can
copy it. And then if you paste it in
here and click save, you'll get your
green success message. And now the model
list is a lot bigger because you're
going to be using a cloud model. So
let's use the GPT-OSS 20 billion
parameter model, which is of the right
size that you could run on an RTX 4090.
So we'll save that and again let's ask
the same question. Now we've got
unauthorized. Of course I didn't pass
the API key. Oh I did. Now maybe I need
to hit HTTPS. Let's try that and
refresh. Oh, okay. That's actually
worked for me now. All I did was created
a new API key and it worked. I'm not
exactly sure why it worked, but it
seemed to work. So now we can choose
GPT-OSS 20B, and yeah, we can ask a
question and we get an answer back. So
okay let's hook up these tools again.
Okay, we'll clear down the chat. So
let's ask show me the cabinet opening
diagram. And yeah, GPT-OSS 20B is
lightning fast on Ollama Cloud. And
there we go. We do get the cabinet
diagram. So let's test it out with
another PDF. So let's unpin this local
file trigger and we'll execute the
workflow. So now it's listening for new
files in this folder. So let's ingest
this document. This is 112 pages of a
user manual. So that's still waiting. So
let's drop this into our pending folder.
And actually we still need to move the
completed file to the process folder. So
that's something we need to do. Let's
drop this in first. Okay. And that's hit
the Docling API. This is where async
would actually make a lot more sense,
because this is a rather large PDF. So
that's something that we could implement
again. And I can hear the fan spinning up on
the machine here, because even with the
standard pipeline, there still are AI
models involved. Okay, that has finished. It took 46
seconds for 112 pages. That's pretty
decent, I think. And now it's working
through the embedding process. And it's
moving 269 images into this nginx file
server. And yeah, there are all the
images. So now let's ask a question.
Show me how to use the ice and water
dispenser. Okay, so we are getting
instructions. Yeah, we're getting
images. That's great. Very nice. GPT-OSS
20B has such a tendency to
output tables like that, which doesn't
really work in a kind of chat
interface. It might just be some system
prompting to try to force it not to do
that. Just whatever way the model was
trained. But yeah, you could see that we
are getting images through now, which is
great. While Ollama Cloud only has a
certain number of open source models,
let's try OpenRouter, just to be able to
get a flavor of the different models and
the output formats. And again, this is
all fine as we're testing and trying to
figure out what's the best model for the
job. Once we figure this out, we can
download that specific model and then
build the hardware to meet the
requirements of running that model.
Okay, so there's OpenRouter, and yeah,
let's try Qwen 32
billion. Let's try that. Okay, we are
getting images as well. No tables, which
is great. Yeah, that looks pretty decent
actually. I'm happy with that. That's a
great way to test and play around with
lots of different open source models to
figure out what's the best for your use
case. So, off camera, I was just testing
different configurations of Docling and
adding additional features. I created an
async polling loop. So now, here, with the
Docling local VLM pipeline, I'm hitting
the async endpoint, and then I'm going
through a polling loop where I wait for
a set number of seconds (3 seconds there),
I check the task status using the task
ID, and then branch based on the status of the
task. When it's successful, we fetch the
result and then we process it.
Otherwise, if it's still processing, we
go back and check again. Or if there's
an error, we can stop and error out here.
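As a rough command-line sketch of that loop: the async, status, and result paths below are placeholders from memory, so check docling-serve's /docs page for the exact ones, and jq is assumed for pulling fields out of the JSON responses:

```bash
# Submit the file asynchronously and capture the task ID
TASK_ID=$(curl -s -X POST "http://localhost:5001/v1/convert/file/async" \
  -F "files=@user-manual.pdf" | jq -r '.task_id')

# Poll every 3 seconds until the task succeeds or fails
while true; do
  STATUS=$(curl -s "http://localhost:5001/v1/status/poll/$TASK_ID" | jq -r '.task_status')
  echo "status: $STATUS"
  [ "$STATUS" = "success" ] && break
  [ "$STATUS" = "failure" ] && exit 1
  sleep 3
done

# Fetch the converted result once the task has finished
curl -s "http://localhost:5001/v1/result/$TASK_ID" -o result.json
```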
I'm also moving the file now into the
processed folder. So, we have the full
file path and then dropping it into rag
files processed. So, that way then we
can keep the pending folder clear. And
once we activate the workflow, anything
that's dropped into that folder will be
consumed into the pipeline. And then
some of the other configuration options in
Docling. So I had the Docling standard
pipeline, and I was also playing around
with the picture description API. So
it's possible to annotate the images
that are in the actual document itself.
And you can specify how big the images
are to actually be sent because
obviously you don't want to be sending
really small images to the VLM. But it's
cool that you can run the VLM alongside
the standard pipeline where you're only
actually sending in, let's say, diagrams
and images. Now, a lot of the smaller
VLMs aren't really suited to describe
what's in an image. They're more
specialized on actually figuring out
what's in a document. Um, so again, some
playing around with the different
models. I was trying Granite 3.2 vision
there, and I wasn't getting amazing
results for kind of general purpose
images, but I don't think that's what
it's designed for. So yeah, you can see
how you can kind of further build out
this pipeline to accommodate the files
that you're trying to ingest and with
enough effort you can build out quite a
complex and sophisticated RAG ingestion
pipeline. So this is our local RAG
system that we have available in our
community and within this system we're
looping through files that are dropped
into a local file folder similar to what
we've gone through today. And then
there's lots of different tracks for
different file types because obviously
Docling can handle lots of different
file types. So that is a little bit of a
catchall. But for structured data like
Excel sheets or CSV sheets, we want to
represent those differently. And then
once it gets into the main part of
processing the files, it works its way
through a record manager. We have
knowledge graphs. We handle the tabular
data as I mentioned. And then we have
lots of functionality around context
expansion, extracting out document
hierarchies, and then using contextual
vector embeddings. If you'd like to get
access to our state-of-the-art local RAG
system, then check out the link in the
description to our community, the AI
automators. So, now that we have a
version one of our RAG system up and
running, let's create a web page where
people can actually chat to the agent on
the local network. Now, we could vibe
code a chat interface, but just to keep
things simple, I'm just going to embed
the standard n8n chat widget. So, I'm
just going to grab the URL for this and
let's come back into cursor. Let's
create a new chat. And let's just
explain what we're looking for. And
actually, let's just use this extracted
images folder, because the root of our
nginx file server is essentially this
folder. So, here we're going to ask:
within this document root, can you
create a web page that embeds the
following chat widget? And then I want
this to hit our n8n Docker container.
And let's see what it does. Okay, so
it's creating the web page: "I'll create
a beautiful web page with the n8n chat
widget embedded." Okay, let's see how
beautiful this is going to be. Now I
have gotten very good results from Claude
Opus 4.5 over the last couple of weeks,
so I do have high expectations. So let's
have a look. Okay: n8n chat, your system
powered by n8n. Click the chat bubble in
the bottom right corner to start a
conversation. Start chatting. Let's ask
hello. Now nothing's happening. I
haven't even activated the workflow, so
that's not a surprise. So let's come
into the workflow. Make chat publicly
available. Yes. I'll give it this chat
URL. And I'll set it as embedded chat.
Don't need any authentication because
it's local. So let's copy that. And
let's save it. And let's activate the
workflow. And now let's just come back
in here. The workflow is active. Here's
the URL; drop in n8n instead of
localhost. It should probably figure that out
itself. Okay, so let's try it again and
refresh that. Okay, we are hitting the
workflow which is great even though it's
an error. Now let's come back in here.
Executions debug and editor. Oh yeah, I
just need to make a change. Just need to
update my credential. That looks good.
Now let's ask it to show me how to use
the ice and water dispenser just to see
can we get images back. And as you can
see, we've got our response through. The
formatting is all over the place; I
would not define that as beautiful, Claude
Opus or Cursor. So definitely some
iterations needed on the styling of this
pop-up. And this should probably be
embedded full screen. So obviously lots
of work is needed on the UI. But you get
the idea that you can use this n8n chat
embed and literally embed it in an HTML
page connected to n8n, which is in a
different Docker container, and away you
go. So the last thing that's needed then
is if you hit let's say localhost 8080
you're just getting this index of files
and then chat.html is just one of them.
So let's come back into Cursor, create a
new chat, and I'll just ask: can you
adjust the nginx configuration so
that if someone hits the root of the
static file service, it brings
them to chat.html?
And this is where Cursor is brilliant
for making changes to the likes of
Docker Compose files, because if you're
not an expert at this stuff, it
really can shorten the amount of time
it takes to make these changes. And it
was a pretty easy change: just index
chat.html. Okay, so we just need to take
down our static files container and
then re-up it, and then that should be
it. Okay, so let's test it out.
Brilliant. So now, localhost 8080 is
pointing to chat.html, and we should
still be able to access the images that
are stored on that file server. So the
last thing that's needed then is for
users to be able to access this chat
interface if they're on the local
network. And there's not a huge amount
of changes that are needed here if it's
a small local network. So here for
example, this is my computer. This is my
server. And let's say this is my local
IP address. And let's say that we have
other computers or other laptops that
are on this network that want to access
this chat widget. So when we hit this
chat widget, we're hitting it at
localhost880
or 127.0.0.1.
And you can only do that if you're on
your own machine. So if someone from a
different machine was trying to access
this, they would need to hit the actual
local IP address of my machine on the
network. And to hit this static files
container, which now actually contains
the chat interface, they would hit that
IP address at port 8080. And if they
wanted to access n8n, you could make
that available at port 5678 as well. So
to set this up, there are a few changes
you would need to make. So by default
inbound connections to arbitrary ports
like that are disabled by the Windows
firewall in my case. So I would need to
make changes to my firewall to allow
connections in. This IP address that I
have on this local network is
essentially dynamic. So, if I turn off
my laptop and turn it back on again
I'll most likely be assigned a different
IP address. So, you may need to set up a
static IP address so that the actual
address doesn't change. There might be
other network configuration changes you
need to make depending on how complex
your actual network is. And the bigger
your organization, the more likely that
there's going to be a lot more layers
and a lot more systems in place. So
you'll probably need to work with your
comms team on that. And then the other
obvious thing is your server needs to
always be on, or at least on during
office hours if you want people to
access this chatbot during office hours.
So that's how you can publish and
effectively deploy your AI agent in your
organization. I'm using the agent mode
in Cursor here to make some front-end
changes to the local rag agent. So I've
embedded the chatbot fully on screen. Um
and it's currently now iterating through
the actual styles. This agent mode with
the actual browser built in and then the
fact that it uses Puppeteer to actually
simulate browser actions is really
powerful for iterating on front ends
like this. And of course, this is a very
basic multimodal RAG agent. I've just
imported a couple of files here, but on
our channel, I've shown lots of
different advanced techniques that you
can use to really build out the
capabilities of a RAG agent like this.
Now that you know how to create a fully
local multimodal RAG agent, I highly
recommend you check out this video here
where I show you how to deploy a clone
of Notebook LM locally. Thanks for
watching, and I'll see you in the next one.
👉 Get access to our Fully Local SOTA RAG system and learn how to customize it, in our community: https://www.theaiautomators.com/?utm_source=youtube&utm_medium=video&utm_campaign=tutorial&utm_content=sota-local-rag

Every document you send to ChatGPT or Claude is a potential security liability. Legal contracts. Medical records. Financial statements. Client data. HR documents. The cloud-based AI tools we rely on every day are brilliant, but they're also third-party services with their own data policies, training practices, and potential breach vulnerabilities. For enterprises handling sensitive information, "trust me" isn't good enough. In this tutorial, I'll show you how to build a production-grade multimodal RAG system that never leaves your infrastructure. Zero external API calls. Complete data sovereignty.

🔗 Get Started:
Our Forked Self Hosted AI Starter Kit: https://github.com/theaiautomators/self-hosted-ai-starter-kit

🎯 What You'll Learn:
✅ Complete air-gapped RAG architecture
✅ Processing PDFs, images, tables, and audio locally
✅ IBM Docling for zero-hallucination document extraction
✅ Ollama + n8n + Docker local stack
✅ GPU requirements for production deployment
✅ Network deployment for team access
✅ Standard vs VLM document processing pipelines
✅ Maintaining semantic structure without cloud APIs
✅ Real-world hardware cost breakdowns

🔗 Links:
Docling Documentation: https://www.docling.ai/
Ollama Vision Models: https://ollama.com/
Qdrant: https://qdrant.tech/

⏱️ Timestamps:
00:00 Local AI + Docling
07:58 n8n + Docker
11:58 Setup Local AI Starter Kit
16:30 Building the RAG Ingestion
34:39 Building the AI Agent
46:56 Creating the Agent Frontend
50:17 Deploying to Local Network

💬 Questions or Comments?
What's preventing you from deploying local AI in your organization? Hardware costs? Complexity? Performance concerns? Let me know below!