When you upload documents to an AI
service, you're placing a lot of trust
in that company. Trust that they'll keep
those documents secure, that they won't
use them to train their models, and that
they won't end up being exposed in a
data breach down the road. And for a lot
of documents, that's fine. But for
sensitive ones like legal, medical,
financial, or client docs, that's a much
bigger ask. For these, you need full
control. So today, we're going fully
local and airgapped. No external APIs.
We're going to build an AI agent in N8N
that can interrogate your private
documents using a technique called RAG,
and all running fully privately on your
machine and available to others in your
local network. And in many ways, this is
the future of AI in business with local
models getting more and more advanced
and companies looking to reduce risk by
deploying on-prem. The stack we'll be
using today includes n8n, Ollama,
Docling, and Docker. And while all of
that might sound complicated, there's no
need to worry because we're going to
build everything out step by step. So by
all means, follow along and soon you'll
have your very own local multimodal RAG
agent up and running. All right, let's
get into it. So what do I mean by
multimodal RAG? Well, here I'm talking
about retrieval across a knowledge base
that has multiple data types. So we
could have text documents or PDFs with
embedded images or tables. We could have
audio files like meeting transcripts or
even videos. And the benefit of
multimodal RAG is that when you process
a PDF that has an embedded image, for
example, then that embedded image can be
retrieved and returned as part of the
chat conversation with the agent. So
this is incredibly powerful because a
lot of AI agents will only ever return
text from your knowledge base. So what's
the best way to process all of your
files locally and make them accessible
to your agent? Well, this is where we
use Docling, which is an open-source
document processing library created by
IBM. With Docling, you feed it PDFs,
Word docs, PowerPoint presentations,
images, audio files, and it spits out
clean structured markdown or JSON that
your agent can then search over. And
this isn't just basic text extraction
here. As you can see, it's able to
recognize headers. It's able to
recognize tables. It can extract these
diagrams as images. And the text in the
diagrams is actually searchable as well.
So, you are maintaining the semantic
structure of the document. Here we have
bullet points, for example. And under the
hood, there are two distinct ways you can
actually process documents. The first is
using their standard pipeline, which is a
pipeline of specialized models and
algorithms that analyze layout, extract
table structure, carry out OCR, and then
assemble the output to be exported into
a different format. And the beauty of
this approach is that even though there
are AI models involved here, they're
non-generative models. So you don't end
up with hallucinations; it is copying
the text out verbatim. And there are
specialized pipelines for different file
formats. So for DOCX or PowerPoint, it
knows how to parse those markup formats
to actually create this Docling
document, which you can then export to
Markdown, JSON, or XML, for example.
There is also a different approach you
can take with Docling, which is to use a
VLM, a vision language model,
similar to a large language model. With
the VLM pipeline, it takes a document,
which could be a 100-page PDF, for
example, breaks it into pages, and then
batch-processes those pages, sending
each one into a VLM. And here, you're
asking the VLM to extract all of the
text as accurately as possible into a
specific format like Markdown. And from
there, the Docling document, which is
the core of the Docling library, is
created. And then it can be exported to
lots of different formats. And VLMs can
be quite powerful, but because you are
dealing with generative AI, you can end
up with hallucinations in the extracted
text, and in a way that needs to be
balanced against inaccuracies in OCR from
the standard pipeline. So there is no
100% best approach, but I do like the
standard pipeline for a lot of use cases.
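If you want to get a quick feel for the two pipelines before wiring anything into n8n, Docling also ships a command-line tool. A minimal sketch, assuming Docling is installed locally; the exact flag names can vary between versions, so check docling --help:

```bash
# Standard pipeline: layout analysis, table structure, and OCR models,
# nothing generative, so the text is copied out verbatim
docling spec-sheet.pdf --to md --output ./out

# VLM pipeline: pages are rendered and passed through a local
# vision-language model instead (which model depends on your setup)
docling spec-sheet.pdf --pipeline vlm --to md --output ./out
```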
When it comes to VLMs, there are various
options. To run a fully air-gapped,
local system, you would need to use the
likes of IBM's Granite-Docling,
SmolDocling, or Qwen-VL. There are lots of
cloud-based proprietary VLMs, like Gemini,
OpenAI, and Claude, but it's not possible
to run any of those fully locally. And if
you are looking at locally hosted VLMs,
just go to ollama.com, click on models and
then vision, and you'll see that there's a
long list that you can actually use:
Mistral's vision models, DeepSeek-OCR, and
so on. So you have plenty of options. But then, all
of that leads to the hardware
requirements that are needed to actually
run local AI, because these LLMs, VLMs,
embedding models, all are based off a
neural network which requires billions
or even trillions of parameters to be
loaded into memory to actually output
responses. And these computations are
far beyond the capabilities of
traditional CPUs and RAM. You
essentially need a graphics card to
actually run these. And within the
system that we'll be going through
today, we will be using a local LLM like
GPT-OSS 20B. We may want to use
a VLM to ingest documents. There are the
non-generative AI models within that standard
pipeline for Docling. And then we have
embedding models to create the vectors
that we can search over. So graphics
cards are essential here. And there are
various options that you can use. Nvidia
GeForce RTX cards are pretty common for
local AI, but there is a limitation on
the complexity of models that you can
actually run on these. And the same with
AMD Radeon and Apple Silicon. And
probably the max size LLM that you can
run on these cards comfortably would be
in the region of 25 to 35 billion
parameters. It is possible to load in
larger models like a 70 billion
parameter model, but you would need to
heavily quantize it at which point
you're losing a lot of the quality of
the model. This really is a key
requirement if you are deploying local
AI in a business. There is an upfront
investment needed to build out the
server to actually host this system. And
the more concurrent users you have, the
more hardware you'll need to actually
run it. And tokens per second is
critical here because people are used to
the speed of response from the likes of
ChatGPT or Claude. So there will be an
expectation that a local system should
be able to do the same thing, whether
that's a reasonable expectation or not.
An NVIDIA RTX 4090 is coming in at around
$1,600.
The 5090 is at the $2,000 mark. And from
here, you'd need to build out a server
further. But you can see that this is
the fixed cost up front. And the benefit
then is you have your fully local
system, and there are no cloud fees
required to actually run it. An
important thing to note is you don't
need this hardware in place right now to
actually build out your local AI
application. This is what you need when
you actually use this in production to
air gap the actual system. But to
actually set up and design and test your
system with dummy data, you could use
cloud-based open-source models, using the
likes of Ollama Cloud or OpenRouter,
which have lots of different open-source
models available to use. So at least
with this approach, you can get started
straight away building out your solution
and then in parallel, you can actually
start getting the infrastructure ready
to go for when your system is going to
be running in production. If you'd like
to get access to our state-of-the-art
local rag system, then check out the
link in the description to our
community, the AI automators, where you
can join hundreds of fellow builders all
looking to create production RAG agents.
Docklin is an open-source MIT licensed
application that's available on GitHub.
And there are two particular projects to
note. So there's the core project which
you can see on screen and then there's
also docling-serve. This is an API
wrapper on the core Docling library.
And this is crucial because we want to
use n8n as an orchestrator for our RAG
pipeline to push in documents to be
processed. So where do we go from here?
We obviously want to set up Docling and
n8n locally. So n8n has produced a
self-hosted AI starter kit which bundles
n8n, Ollama, Qdrant, and Postgres together
in a docker compose file. So this makes
it quite straightforward to spin all of
this up on your machine. The only thing
that's missing though is Docling. So
what I've done is I've forked this
starter kit repo and I've added the
Docling Docker Compose config into the starter
kit. I'll leave a link for this in the
description below so that you can follow
along. But before we set this up, let's
just take a helicopter view of how all
of this actually operates. So all of
these services are going to be running
in Docker containers. And if you haven't
heard of Docker before, Docker lets you
run applications in isolated
environments and isolated containers.
And if you think about the applications
we need to run locally for the system (n8n,
Docling, Qdrant), they all have
different system requirements, different
libraries. They're written in different
programming languages. So normally to
get all of these applications running
natively on your machine can be a bit of
a nightmare. And thankfully Docker
sidesteps all of that. So each
application runs in its own isolated
environment. And that way they can't
conflict with each other because they
can't see each other's internals. They
just communicate over a shared network.
And quickly some terminology for you to
understand. So we have Docker images and
these are essentially static. They pull
in the application code. They define the
environment for the application to run.
But as I said, they're static. So to
actually access those applications, you
need to run them within containers. And
that is a running instance of the static
image. And the thing about these
containers is that they're stateless. So
when you create a container, let's say
of n8n, it spins it up from this static
image. And when you remove a container,
it's essentially destroyed, and any
information that was created in it is
lost. And this is why you need Docker
volumes or bind mounts. This is a way
of persisting or saving the data
long-term. So from an n8n perspective,
if you were creating workflows in a
running instance of n8n, you would want
to save those workflows to a volume or
to a bind mount. That way when the
Docker container is deleted, you haven't
lost the workflow and you can simply
spin up the container again from the
static image and it'll load in
everything that's available in the bind
mount or in the volume. So these are the
three crucial concepts you need to
understand about Docker.
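To make those three concepts concrete, here's a rough sketch of the lifecycle using the plain Docker CLI; the starter kit's Compose file does the equivalent for you, and the image, volume, and port below are just illustrative:

```bash
# An image is the static template
docker pull n8nio/n8n

# A container is a running instance of that image; the named volume keeps
# workflow data outside the container
docker volume create n8n_data
docker run -d --name n8n -p 5678:5678 -v n8n_data:/home/node/.n8n n8nio/n8n

# Removing the container destroys everything inside it...
docker rm -f n8n

# ...but the volume, and the workflows saved to it, survive, so a new
# container started with the same mount picks them straight back up
docker volume ls
```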
And when it comes to the Docker network, as I
mentioned, they're isolated containers.
So they can't see the internals of each
other's containers. So they need to
communicate over a network. And this
trips up a lot of people that aren't
used to Docker. So if you have n8n as a
container and it's trying to speak to
Qdrant or to Docling, it needs to
communicate over the Docker service
name. So it would be qdrant and then
the port, or docling and then the port.
Whereas if you're trying to access n8n
yourself, you would just use localhost and
then the port.
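A rough illustration of the difference, assuming the Compose service names docling and qdrant and the ports used in this setup; the exact n8n container name (and whether wget or curl is available inside it) depends on your Compose project:

```bash
# From the host machine, published ports are reached via localhost
curl http://localhost:5001/docs          # docling-serve
curl http://localhost:6333/collections   # Qdrant

# From inside another container (e.g. n8n), localhost means that container
# itself, so you address services by their Compose service name instead
docker exec -it n8n wget -qO- http://docling:5001/docs
docker exec -it n8n wget -qO- http://qdrant:6333/collections
```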
This will make more sense when we actually
start building out our workflow. But what's important to
understand is this idea of the Docker
compose file because here we're
orchestrating the creation of multiple
services and we're defining these
volumes, the persistent layer. We're
defining the ports as well as other
things like environmental variables. If
you haven't used Docker before, I highly
recommend you install Docker Desktop
which is a visual interface into the
volumes, the images, and the containers.
And finally, if you're new to building
and deploying local AI systems, then you
should definitely use an AI code editor.
These things give you superpowers, and
they're brilliant for troubleshooting
issues with Docker Compose files or
networks. They can provide the prompts
that you need to use to actually spin up
containers, to help you version control
your system. The list is endless. So
for this project, I'll be using Cursor
and that's where I'm going to start. If
you're enjoying the video, make sure to
give it a like below and subscribe to
our channel for more deep AI and n8n
content. It really helps us out. So
open up Cursor. Again, you can also use
VS Code or Antigravity. And I'm just
going to click clone repo. And I'll grab
the URL of our forked AI starter kit
repo. And we'll just select a folder as the repo
destination. And then it starts cloning
in the repository. And here we go. We
can see all the files of the starter kit
on the left. We can see the docker
compose file that I talked about and
that includes the definitions of all of
the services that need to be spun up.
So if we go back to the repo, there's
full instructions on what commands you
need to trigger. So we've already cloned
the repository and here it's asking to
change directory into the starter kit.
We're already in it here. Now we just
need to copy the environment
variables file. So we can copy that out. Now,
you could just copy and paste it here,
Ctrl+V, like that, and rename it. So that
works, or, based off the terminal commands,
we can open up the terminal here with
Ctrl+J, and then you can just paste
the command in there and hit enter,
and that also copies it. So either
works. So we need to set some encryption
keys and passwords in this environment
variables file. There are of course
lots of ways to generate passwords. I
have OpenSSL installed in my Git Bash
here. So I'm just going to generate a 32
character key. So that looks good. So
that could be my Postgres password. I
might just get rid of the equals at the
end. And yeah, I'll just generate a
couple more. That could be my n8n
encryption key. Again, I'll remove
special characters just in case.
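For reference, commands along these lines generate suitable random secrets if you have OpenSSL available (base64 output can end in = padding, which is why I strip it):

```bash
# 32 bytes of randomness, base64-encoded (may end in '=' characters)
openssl rand -base64 32

# Hex output avoids special characters entirely
openssl rand -hex 32
```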
And then, back to the instructions. I am
on an NVIDIA GPU here, so I can now run
this docker compose up command, passing
the profile gpu-nvidia. But obviously, if
you're on AMD or Apple Silicon, you have
other profiles that you can use.
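Pulling the terminal side together, the whole setup so far looks roughly like this; treat the env file name and the non-NVIDIA profile names as examples to verify against the starter kit README:

```bash
git clone https://github.com/theaiautomators/self-hosted-ai-starter-kit.git
cd self-hosted-ai-starter-kit

# Copy the example environment file and fill in the secrets generated above
cp .env.example .env

# Spin everything up with the profile that matches your hardware
docker compose --profile gpu-nvidia up
# e.g. docker compose --profile gpu-amd up   or   docker compose --profile cpu up
```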
So I just copy that out and then back into
here. We'll paste it in. And what that
does is it downloads the different
images that are needed to actually run
the system. So we're bringing in n8n.
It's downloading Postgres. Qdrant is
already imported. And this can take
quite a while. Docling in particular has
some pretty heavyweight models. So
you're talking about a number of
gigabytes. If you're on a slow internet
connection, it'll take even longer
again. But eventually all of your images
will be downloaded and then it can start
spinning up the containers. As you see
here, we can see that we have our
self-hosted AI starter kit. And if we
open it out, you can see Docling, n8n,
Ollama, Qdrant, and Postgres. Now
there's also one other container called
static files. I'll talk about that in a
second. And within ports here, then you
can see different ports. So, if you
click on the first one, which is
Docling, which is port 5001,
that opens up localhost port 5001. Now, it
says details not found, but if you just
add /ui, you now have your
docling-serve application, and the same
goes for the rest. So for n8n, if you
click that link, you have 5678, and you're
brought to the setup page. For Qdrant,
at 6333,
if you add /dashboard, it'll
bring you to the dashboard. So this is
your vector store. I don't believe we
have a UI on Postgres but that's fine.
We could hit that with a database
client. And then we are serving static
files on port 8080. And this is how the
multimodal RAG aspect is going to kick
in, because the images we extract from
PDFs and Word documents will be hosted
here and available within our chat. So
we can see this is now up and running.
So if you click on the actual group,
you're able to see a stream of all of
the logs from the different services.
And if you want to see log files from
any particular service, just click into
it. So Docling, for example: you can see
that there's a lot of health checks
going on. This is how it started, and
it's given links to the likes of the
docs. So if you click on that, it's
bringing us to a site that can't be found,
but that's fine. We just need to put in
localhost instead of 0.0.0.0. Okay, so
there's our Docling API docs. So
that's how you can track the logs for
the different applications. And that is
important if you're trying to
troubleshoot or debug a problem. So
let's start with n8n on port 5678. So
here we need to set up an owner account.
Now this is all local. This is not n8n
Cloud. You just need to create an
account to be able to log in. And that
brings us straight into the list of
workflows. And there is a demo workflow
that's autoloaded by the n8n starter
package. And it has Ollama chat configured.
So we'll get back to that in a second.
What I might do quickly though is let's
just go into settings. We just need to
enter an activation key because there
are certain features that are gated such
as the idea of pinning previous
executions which is really important
when you're building out workflows. So
if you click on unlock on the top left
here then you can just enter in your
email address and they will send you the
activation key. This is totally free and
everything is still local. Okay. So that
has been activated. So now let's create
a workflow and as a first step let's add
a local file trigger. So we'll come in
here. Let me just move that out of the
way. And under other ways at the bottom
we can see local file. So we want to
trigger changes that involve a specific
folder. So here now we're going to start
building out our RAG ingestion pipeline.
So we'll click on that. And at this
point now we want to watch a folder to
find files as they're dropped in. That
way we can drop in a file, 10 files,
100, a thousand files, and have them all
processed. So we need to add in a folder
to watch. And this gets back to the
volumes and bind mounts because this
needs to be a persistent folder. We
don't want this to be destroyed when we
delete the container. And within the
readme file for the repo, you can see
that they provide the path /data/shared
as the path to use. So if we drop that
in there and then we're going to execute
that step and let's see can we trigger
the files. Now actually there's one
change I need to make. So we'll just
stop listening. Um we need to use
polling. So for whatever reason on my
local system this doesn't work if I
don't use polling. So we'll just execute
that step. And if we come back into
cursor and let's go to the docker
compose just to explain what's actually
happening here under the n8n service.
You can see that we have volumes
specified, and we have a bind mount. So
you can see that the shared directory,
which equates to this directory here, is
mapped to /data/shared, which is what we
just entered into n8n, within the
container. So now, if I create a file
here, so let's just add in a file. Let's
create a new one: test.txt.
And as you can see, that has just
appeared: /data/shared/test.txt. Now
there's nothing in it but just to prove
that it works. So okay let's delete
that. And within our version of the
docker compose, I've created a folder
called rag files. So that way we can
drop all of the files we want to process
into here. So under rag files, let's
create a new folder called pending. And
actually, let's create another folder
called processed as well. That way we
can ingest a file and then move it to
the processed folder. So now let's just
update our trigger in n8n. So we're now
looking for files that are added to the
shared rag files pending path. So,
/data/shared, rag files, pending, and let's
execute that again. And now let's get a
PDF that we can actually start
processing. And let's use the one that I
demonstrated in the intro, which is this
Whirlpool refrigerator spec sheet. It's
only one page, so it's a good test bed
to build out the pipeline. So, I have my
pending folder here. So, let's just drag
in this PDF into that folder. And as you
can see, because I had this local file
trigger executed, it was waiting for a
file to appear, and it has just done so.
There we go. So then a good trick at
this point is just to pin that data. So
just click P on the node. And that way
now if we click execute workflow, we
don't need to keep dragging that file
into that folder. That data is always
there. So next up, let's actually load
up this file. So if we click on the plus
and just type in read, we're going to
read this file from the disk, which is
that one here. And now we need to
provide the path for this file. So
that's the path there. We just drag it
in. And now if we click execute step
there's the binary file. And you can see
by opening it up that's it. So we now
have the file to actually play with. So
next up, we need to send this to Docling
to actually extract structured
information, be it Markdown or JSON. So
if we go back to the Docker Compose, we can
see Docling is on port 5001. So if we
click that, and again, if you go to /ui,
you can see docling-serve's own
interface. But we want to access the API
documentation, so that's done via
/docs. So then we just need to
figure out what API we need to hit. So
we're looking to convert this file. We
want to process the file. Now there's
two options. You can either
asynchronously process a file or
synchronously process it. So I'll show
both. So let's just do synchronous
processing. In other words, we're going
to wait for the response. And you can
see on the top right, this is the path
that we need to hit. So let's just copy
that, bring it in here, and let's use an
HTTP request node. And we're going to
post to this endpoint. Now you'll see
this is mentioning local host which is
incorrect and I'll show you why in a
second but essentially we want to pass
this file to this endpoint and in terms
of the body to send we're going to send
the binary file. So that's done using
either n binary file or you can also use
form data which is what I'm going to
use. And if we go back to the
documentation you'll see that this
requires a parameter called files and
that's an array of binary files. So
we'll just copy that and let's drop it
in here. And the value is data. And
let's leave it like that for the moment.
Um so let's save that. And now if we
execute the workflow, we're going to hit
an error which is to be expected. And
it's saying the service refused the
connection. And the problem is this
local host. And if we come back to our
Docker network diagram here, what's
happening is this N8N container is
trying to communicate with Dockling, but
it's using local host. And local host is
limited to the machine or to the
container. So when it's trying to hit
localhost 501, it's actually searching
within this container. So we need to hit
docklane port 501. And that way it's
actually looking at the broader Docker
network to pass the file. So we'll just
change this out for dockling and then
click execute step. And now we get a
different error which is great. We're
making progress. And it's saying the
request is invalid. And the issue here
is I'm passing a string as opposed to a
file. Um, so it's just this parameter
type. We just need to change this to NAM
binary file. And then yeah, put that
back in as data. And let's execute it
again. And there we go. It's thinking
about it. Okay, we do have a response.
And the fact that it was thinking about
it meant that it actually processed the
file. So that's why there's two
different endpoints. There's the
synchronous endpoint where you wait for
the response, or, if I put in async here,
it'll just give me back a task ID, and
then I can poll for the result. But
let's leave that off for a second.
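Outside of n8n, the equivalent synchronous request to docling-serve looks roughly like this. The files form field is the one called out in the docs; the endpoint path below is a placeholder from memory, so confirm the exact path on the /docs page:

```bash
# POST the PDF as multipart form data and wait for the converted result.
# Use http://docling:5001 from another container, http://localhost:5001 from the host.
curl -X POST "http://localhost:5001/v1/convert/file" \
  -F "files=@whirlpool-spec-sheet.pdf"
```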
And if you have a look at the data here,
there's a lot of image data: it's
base64 image data. And then at
the end, we do have the actual text from
the document. So if you go through the
API documentation, you can see that
there are different parameters that you
can pass. And Docling is quite
comprehensive and flexible with the API
endpoint. So image export mode is the
next one. So let's actually just drop
that in here. We'll add a parameter.
This one is now text, not the binary
file. And then if you choose placeholder
for example, and then if you execute it
again, now you'll see the document has
been processed. And where there are
images, it just says image, but it is
just a placeholder. You've actually lost
the image. It hasn't extracted it. So
instead, let's use referenced because
what that's going to do is it's going to
save that image as you can see here to
the disk. So we have the actual image
name now. And if we go to Cursor, on the
left hand side here, we have a folder
called docling scratch. And if we open
that up, you can see all of the images
that were just extracted from that PDF.
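From the command line, that option rides along as another form field; again, treat the endpoint path as a placeholder to verify against /docs, while image_export_mode and the placeholder/referenced values are the ones discussed here:

```bash
# Ask docling-serve to write extracted images to disk and reference them by
# filename in the returned Markdown, rather than inlining base64 or dropping them
curl -X POST "http://localhost:5001/v1/convert/file" \
  -F "files=@whirlpool-spec-sheet.pdf" \
  -F "image_export_mode=referenced"
```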
And that's what that referenced flag
does. Instead of providing the image
as a base64 string, it saves it to the
container. And this is all made possible
by the way I set up the Docker Compose
file under the Docling configuration.
I've set the working directory to this
shared folder, and I've also set it in
various environment variables. And
because this shared folder is accessible
to both Docling and n8n, it's now
possible for n8n to actually pick up
those files and move them somewhere else
so that we can serve them as part of a
this static files container that I set
up here. So in Docker Compose, this is
essentially just a really simple engine
X server that makes a particular folder
available. And as you can see, that's on
port 8080. So if we click into the
actual port here, you can see we have
dockling scratch are the rag files. So
let's create an extracted images folder
and then we can dump all of the images
in there. So we'll come back in here and
under shared we'll create a new folder
extracted images. And now if we come
back here and refresh we can see
extracted images. So we probably should
lock this nginx server down to this
folder. So under the volume we can see
that shared is actually accessible. So
let's just lock this down further. So
yeah, it's now shared extracted images.
And now all I'm going to do is delete
this static files container and recreate
it and it'll build it back up again off
the back of the server configuration. So
this is the beauty of Docker. So we'll
come in here, static files, and delete.
And now we just need to rebuild this
image. And down here, you can see
actually that we're still getting logs
of the various services that are
running. So we need to run this in
detached mode. So if you just press
Ctrl+C, that'll stop all of the containers.
And now we'll just make one change. So
if you press up, you're going to get the
previous command that you ran, and now
we're just going to add -d to it,
and it'll run in detached mode in the
background. So just press enter. And
that's going to re-up all the containers.
And it's also going to rebuild that file
server container with the new
configuration. Okay. So now if we go
back to the index and if we refresh.
Cool. So there's nothing now in that
folder. We are also locked down to that
folder as well. Okay. So let's go back
into n8n. Let's just refresh it. And as
you can see, our workflow is still here,
even though we just removed all the
containers and added them again, because
we have a dedicated n8n volume. Okay, so
let's just run this again now. And the
document has been processed again. We
can see the markdown content, we can see
the image names. And if we go to cursor
under dockling scratch, we can see the
images themselves. So next up, let's
move these images into this extracted
images folder. And that way they'll be
able to be served in our AI agent chat.
So we essentially need to extract out
all of these images. So as usual with
n8n, there's lots of different ways
can go about this. What I'm going to do
is I'm just going to copy this entire
output. So copy selection. And I have
cursor here with Opus 4.5 set. So I can
literally just ask cursor to do this job
for me to create a code node to extract
out an array of image names. So I'm
saying, can you create JavaScript code
to extract out an array of image names?
And I'm saying here's my JSON input
structure. So just copy that in. And
then I'm also saying here's the skeleton
of the code node for you to start with.
So this is important. So here, let's
just add our code node. We're going to
use JavaScript and just copy this out.
Now, you don't necessarily need this
actual addition of a new field. So you
can delete that. But yeah, copy that
out. And let's paste that into here. And
I'll just say: just output it in chat, no
need to create a file. And off
it goes. And this is key, because the AI
needs to understand the incoming data
structure as well as the skeleton of the
code node, because it might not
understand that this is n8n, or the structure
of the input items in n8n. So copying
in the code node is a really good hack.
Okay, so it's produced the code. So
let's copy that. Let's paste it in here.
And actually before we run it, let's
just pin this as well so we don't need
to keep triggering Docling. And then
let's execute the step. Yeah, there we
go. image names. That's exactly what we
want. So now we can split out this
array. So let's do that quickly. Split
out. And let's pass in our image names
array. And we can execute that. And we
now have our individual images. And then
we need to move this file from this
docling scratch folder into our static
files nginx server. Now,
unfortunately there isn't any move node
that you can use for local files. And so
what we're going to do is just use an
execute command. So we just type in
command and this allows you to run a
shell command. And we're not going to do
this once. We want this to run for every
file. Now if you don't know CLI commands
again, you can just ask Cursor.
Essentially, it's mv for move. So we want
to move this file. And now we need to
get the path of this file as well. So
again, back to Cursor: you can see it's
under shared, docling scratch. And also,
this is under data, because it's the
same as the trigger. This is all set in
the Docker Compose. So here, when we have
a local file trigger, we're looking under
/data/shared, and so it's the same here,
and that's because the bind mount is
against /data/shared, not shared. Okay. So
we're going to move this file from
/data/shared, docling scratch, and that's the file
name. And now let's move it into our
extracted images folder: /data/shared, extracted
images. Okay, that should do it.
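The command the Execute Command node ends up running is just a plain mv, something along these lines; the folder names stand in for whatever you called yours, and the file name would come from the current n8n item via an expression:

```bash
# Move one extracted image from Docling's scratch folder into the folder
# served by the static-files container (paths are as seen from inside the
# n8n container, i.e. under the /data/shared bind mount)
mv "/data/shared/docling-scratch/image_000001.png" "/data/shared/extracted-images/"
```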
So now, let's run it. So execute workflow, and it
has succeeded and let's have a look at
cursor. Let's refresh the file
directory. Yeah, all of the images are
now available under the extracted images
folder. So now if we go to the browser
or if we go to our static files
directory, we can see the images and if
we click into them, there's our
Whirlpool image, and we have the
diagrams. Excellent. So this is probably
the hardest part of this entire project
is actually to extract out the images
and make them available to the AI agent.
Excellent. So next up, we want to import
this document into Qdrant so that we
can actually carry out a vector search
over it. So we can see the markdown
content here. Um so actually let's split
off at this point because this idea of
moving the images can be done in
parallel essentially. So let's add a new
node. Let's look for Qdrant. So
there's our Qdrant vector store, and
we're going to add documents to the
vector store. Now, there is a local
Qdrant database already preconfigured
in the n8n starter kit. So actually,
let's just edit that. It's not working
as well. So I don't think you need an
API key. Let's delete it. Um and then
back to the Qdrant URL. This is the same
thing that we talked about here. So we
need to reference this via the service
name. And actually, this is the exact
host that we need to hit. So let's use
that. So there's the Qdrant URL. And
actually just click save. Yep, it has
succeeded. So yeah, for local AI
implementations, there is usually no API
keys required. Now, of course, you could
set API keys if you wanted to lock it
down within your network. Now, we don't
have a collection yet. So, let's go into
Qdrant. So, back to Docker. There's
Qdrant, which is this one. And then
for Qdrant, it's /dashboard.
Okay, so this is the Qdrant vector
store. So, if we click on collections,
we can add a new collection, and we'll
call this one multimodal rag and
continue. So then it's asking what's the
use case and we're just using global
search here really. There's no per user
documentation or anything like that at
this point. And for this we'll just use
single dense vector embeddings. You
could use hybrid search if you wanted.
Okay. So we need to choose dimensions.
We have a few options here. We want to
use a local embedding model, and the one
I typically use is nomic-embed-text. So
this is available on Ollama, and if you go
to the Nomic website, you can see the
number of dimensions in this
embedding model. So, for
version 1.5, we'll go for the highest
number of dimensions to get the best
quality embeddings. So we'll just drop
768 in there. And then we're going to
use cosine similarity as our algorithm
to figure out what are the closest
vectors. So we'll click continue on
that. So we'll just click finish. Okay.
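The dashboard is just calling Qdrant's REST API, so if you'd rather script this step, an equivalent request looks like the following (the collection name is only an example):

```bash
# Create a collection with 768-dimensional dense vectors and cosine distance,
# matching nomic-embed-text v1.5 at its largest dimension
curl -X PUT "http://localhost:6333/collections/multimodal-rag" \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 768, "distance": "Cosine"}}'
```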
So we now have our Qdrant vector store,
or collection essentially, set up. So
then, if we come back into n8n, we should
be able to choose it now. So let me just
save that and go back in. There we go.
Multimodal rag, and there shouldn't be
anything else to set there. So then, for
the embedding model, we now need to
choose nomic-embed-text. So we need to
use Ollama, and again, we need to specify a
credential for Ollama. So just click on
edit, and it's looking for localhost,
which again doesn't make sense here,
because we need to use the service name
of the service within the Docker
network. So that should be ollama, and
again, no API key; we can delete it. And
if we click retry, yep, we have a green
message. So let's just save that, and it
should load the models again. It didn't
immediately, so let's just get out and
go back in. And there we go. So
this only has Llama 3.2. So we need our
nomic-embed-text model within our Ollama
system. So if we go back to Ollama, you
can see that there is a command we can
use: ollama pull nomic-embed-text. So let's
copy that out. And if you go to exec,
what you can do here is execute
commands within this container. So if I
right-click and paste it in, this is
going to pull the nomic-embed-text model
into this container. And there is a
volume mounted for Ollama, so that when we
destroy this container and recreate it,
we won't need to import nomic-embed-text
again.
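You can do the same from a terminal on the host instead of the Docker Desktop exec tab; the container name depends on your Compose project, so adjust it to match:

```bash
# Pull the embedding model inside the running Ollama container
docker exec -it ollama ollama pull nomic-embed-text

# Confirm it shows up alongside the default models
docker exec -it ollama ollama list
```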
So we'll trigger that, and now, as you
can see, it is downloading this model.
Okay, so that is successful. So now, if
we come back to n8n, back into
embeddings: there you go, nomic-embed-text
latest. So let's click on that, and
we'll save. And then the document we
need to attach a document parser or a
document loader. So there we go. There's
that one. I generally don't use the
simple one. I prefer to use custom. So
we'll hit custom. We'll add a
recursive character text splitter. And I
usually specify markdown as the split
code. So that way it's going to retain
some of the structure at least of the
document in terms of the the chunks it
creates. And we might just reduce the
chunk size a little bit. So maybe to
700. Okay, we are in good shape here. So
now let's connect this up. Let's just
remove this for a second because we just
want to see how the vector store side of
it works. And let's execute the workflow
again. Okay, so that has been ingested into
the Qdrant vector store. And if you
come into Qdrant and click
collections, yeah, you can see there's
now 19 points within the vector store.
Yeah, you can see all of the various
embeddings. So there's the image
URL, that's the table for kind of model
sizes, product dimensions, etc. There are
some nice visualizations within
Qdrant. So if you click on visualize
and just hit run over the limit, it'll
actually show you where the points are
and how they are clustered. So doesn't
really mean much just with one document
but as you load more in, you can see how
they are clustered. There also is a
graph as well, which is kind of neat.
And if you double click it, then it uh
loads up other points close by. Cool. So
we have our vector store. We have the
data in the vector store. So now, come
back into n8n. Let's just hook that
back up again. And now let's create an
AI agent that we can actually converse
with. So let's click on plus. We'll add
a chat trigger. And then let's add an AI
agent, which is that one. And now we need
to add a model. So again, we're going to
use Ollama; it has to run fully locally.
It has specified the local Ollama
service, and it has selected Llama 3.2,
which is imported by default with this
n8n self-hosted AI system. So we'll
just save that. Now that's a very small
model. You're not going to get huge
amounts of intelligence from it, but it
might just be enough to be able to
demonstrate this. So then, in terms of a
tool, let's choose Qdrant, which is
our vector store. It has already
specified the credential. Description
wise, we'll just say use this to fetch
information from the knowledge base. And
then we'll just choose our collection.
and let's limit it to five. Okay, so
we'll save that. We need our embedding
model. Let's grab that from here. So
obviously, it has to be the same,
otherwise you're not comparing like with
like. And I think we're in business
here. We might just set a very simple
system prompt. I'd just say: you must use
the Qdrant vector store to retrieve
information. Actually, a good tip as a
starting point is to add the prompt
from this question and answer chain. So,
if you open it up and then just look
at the system prompt template,
that's not a bad starting point. It
basically says don't make things up.
Yeah, let's use that instead. Okay, so
now let's ask it a question. Maybe show
me the cabinet opening diagram. Let's
try. And actually, I need to add one
more thing to the system prompt, which
is: you must output images in
markdown format using the URL provided
in the retrieved results. Let's try that
for example. Okay, so show me the
cabinet opening diagram. Let's see how
it goes. Okay, it has triggered the
Qdrant vector store, and we do have a
response. Uh there is no image that I
can see anyway. Now that might be down
to the size of the model and it's not
exactly ideal for instruction following
if it's too small. Let me retry that
again. Okay, there are images this time.
The images are broken links, though.
So let's have a quick look at that. So
let's just right-click and inspect them.
Of course, yeah, we haven't added the
full path into the vector store. So
let's close that out for a second and
let's go back to our ingestion flow. And
we essentially need to inject the full
URL here. Now, these image paths are
actually way longer than what Llama
3.2 produced. So, I'd say we need to
upgrade the model anyway, but we
definitely need to add in the full path.
So, let's add another code node here.
And then same again. So, I'm just going
to copy this out. Let's bring it into
Claude Opus. I'll just create a new
chat. It's always good to keep opening
new chats in Cursor: as chats
continue on and on and on, the
actual quality of the responses deteriorates,
due to context rot and a few
other things. Please create JS code that
injects the full URL of images into the
output MD content. Okay, so this is my
input. This is an example of the full
URL. So this is now my nginx file
server. So that's it there: localhost
8080, essentially. And then, again,
let's drop in our JS skeleton here.
Okay, so let's let it run. It's probably
just a regex to find the image and inject
in the URL. A bit of pattern matching.
And here we go. So, let's copy that out.
Drop it in here. Execute step.
And there we go. Yeah. HTTP localhost.
And that's the full image. And let's
just copy that into the browser to see
did it work. Excellent. Okay. Let's
delete everything now from Qdrant, and
let's re-import the file. So, within
Qdrant, you can go through and delete
everything, but this gets pretty
tedious. So let's hit this endpoint to
delete the Qdrant collection and
recreate it. That way, we can quickly
kind of prune all of the vectors without
having to manually create a collection
every single time. So this is the
endpoint. So let's add another HTTP
request node. So what is this? This is a
DELETE method. You pass that in. You
pass in the collection name, which is
multimodal rag. That refused the
connection: of course, localhost again. So it
needs to be qdrant. Okay. So it
deleted it. And if we refresh, no
collection present. Obviously very
destructive, so only to be used when
actually building out your system. Okay.
So let's copy that. So now, this one is
create collection. So we're going to
POST, I assume, to create a collection.
That looks right. And then this is the
body that we need to pass. So copy that
out, drop it in there. And then, what was
it again? 768. Execute step. Didn't
work. Oh, it's a PUT, not a POST. So
there we go. Brilliant.
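For reference, the two requests against Qdrant's REST API look like this from the command line; same caveat about how destructive the first one is, and the collection name is whatever you chose earlier:

```bash
# Drop the whole collection and every vector in it
curl -X DELETE "http://localhost:6333/collections/multimodal-rag"

# Recreate it empty, with the same vector size and distance metric as before
curl -X PUT "http://localhost:6333/collections/multimodal-rag" \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 768, "distance": "Cosine"}}'
```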
Okay, so we have it up and running again. Okay,
let's actually run this fully now. So
let's come back to Cursor. Let's just
delete out the images that are there.
Okay, so that's now gone to Docling.
It's going to extract the images again.
So if I refresh, yep, they've just
appeared and refresh. 19 points.
Perfect. And if we look through the
vectors, we can see that this one has
the full URL now. So now let's ask the
same question to this agent and let's
see, can we get a better answer? Um, I
feel like we probably can't. I don't
think the model is big enough to be able
to output the full URL reliably anyway.
Oh, it did actually. There you go. Cool.
All right. Multimodal RAG. There you go.
And using a very small model, actually.
So that's Llama 3.2, which is a three
billion parameter model that's installed
here. I'm quite impressed actually that
it was able to spit out that image URL
and actually called the vector store cuz
my experience of very small language
models is that they can't even reliably
call tools. But again, I'm sure if I ran
this exact query 10 times, it might
struggle to produce the image accurately
10 times. And that's really where you
need probably a bigger model to be a
little bit more reliable. I have no
memory assigned here as well. So every
time I refresh this, it's a fresh call.
But yeah, it worked again there.
Excellent. Yeah, that's that's great. As
I mentioned earlier, if you don't have
the graphics card to hand right now, you
could hook up an open source model in
the cloud to actually get all of this
stuff up and running. So let's try that.
So I'll come back into the chat model. Let's
create a new credential. And now, let's
just choose ollama.com.
And if we go to ollama.com, if you create
an account and go to API keys, you can
create a new API key. So this is
tutorial. Generate the key. And we can
copy it. And then if you paste it in
here and click save, you'll get your
green success message. And now the model
list is a lot bigger because you're
going to be using a cloud model. So
let's use the GPT-OSS 20 billion
parameter model, which is of the right
size that you could run on an RTX 4090.
So we'll save that and again let's ask
the same question. Now we've got
unauthorized. Of course I didn't pass
the API key. Oh I did. Now maybe I need
to hit HTTPS. Let's try that and
refresh. Oh, okay. That's actually
worked for me now. All I did was created
a new API key and it worked. I'm not
exactly sure why it worked, but it
seemed to work. So now we can choose
GPT-OSS 20B, and yeah, we can ask a
question and we get an answer back. So
okay let's hook up these tools again.
Okay, we'll clear down the chat. So
let's ask show me the cabinet opening
diagram. And yeah, GPT-OSS 20B is
lightning fast on Ollama Cloud. And
there we go. We do get the cabinet
diagram. So let's test it out with
another PDF. So let's unpin this local
file trigger and we'll execute the
workflow. So now it's listening for new
files in this folder. So let's ingest
this document. This is 112 pages of a
user manual. So that's still waiting. So
let's drop this into our pending folder.
And actually we still need to move the
completed file to the process folder. So
that's something we need to do. Let's
drop this in first. Okay. And that's hit
the Docling API. This is where async
would actually make a lot more sense,
because this is a rather large PDF. So
that's something that we could implement
again. And I can hear the fan spinning up on
the machine here, because even with the
standard pipeline, there still are AI
models involved. Okay, that has finished. It took 46
seconds for 112 pages. That's pretty
decent, I think. And now it's working
through the embedding process. And it's
moving 269 images into this nginx file
server. And yeah, there are all the
images. So now let's ask a question.
Show me how to use the ice and water
dispenser. Okay, so we are getting
instructions. Yeah, we're getting
images. That's great. Very nice. GPT-OSS
20B has such a tendency to
output tables like that, which doesn't
really work in a kind of chat
interface. It might just be some system
prompting to try to force it not to do
that. Just whatever way the model was
trained. But yeah, you could see that we
are getting images through now, which is
great. While Ollama Cloud only has a
certain number of open source models,
let's try OpenRouter, just to be able to
get a flavor of the different models and
the output formats. And again, this is
all fine as we're testing and trying to
figure out what's the best model for the
job. Once we figure this out, we can
download that specific model and then
build the hardware to meet the
requirements of running that model.
Okay, so there's OpenRouter, and yeah,
let's try Qwen 32
billion. Let's try that. Okay, we are
getting images as well. No tables, which
is great. Yeah, that looks pretty decent
actually. I'm happy with that. That's a
great way to test and play around with
lots of different open source models to
figure out what's the best for your use
case. So, off camera, I was just testing
different configurations of Docling and
adding additional features. I created an
async polling loop. So now, here, with the
Docling local VLM pipeline, I'm hitting
the async endpoint, and then I'm going
through a polling loop where I wait for
a set number of seconds (3 seconds there),
I check the task status using the task
ID, and then branch based on the status of the
task. When it's successful, we fetch the
result and then we process it.
Otherwise, if it's still processing, we
go back and check again. Or if there's
an error, we can stop and error out here.
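As a rough command-line sketch of that loop: the async, status, and result paths below are placeholders from memory, so check docling-serve's /docs page for the exact ones, and jq is assumed for pulling fields out of the JSON responses:

```bash
# Submit the file asynchronously and capture the task ID
TASK_ID=$(curl -s -X POST "http://localhost:5001/v1/convert/file/async" \
  -F "files=@user-manual.pdf" | jq -r '.task_id')

# Poll every 3 seconds until the task succeeds or fails
while true; do
  STATUS=$(curl -s "http://localhost:5001/v1/status/poll/$TASK_ID" | jq -r '.task_status')
  echo "status: $STATUS"
  [ "$STATUS" = "success" ] && break
  [ "$STATUS" = "failure" ] && exit 1
  sleep 3
done

# Fetch the converted result once the task has finished
curl -s "http://localhost:5001/v1/result/$TASK_ID" -o result.json
```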
I'm also moving the file now into the
processed folder. So, we have the full
file path and then dropping it into rag
files processed. So, that way then we
can keep the pending folder clear. And
once we activate the workflow, anything
that's dropped into that folder will be
consumed into the pipeline. And then
some of the other configuration options in
Docling. So I had the Docling standard
pipeline, and I was also playing around
with the picture description API. So
it's possible to annotate the images
that are in the actual document itself.
And you can specify how big the images
are to actually be sent because
obviously you don't want to be sending
really small images to the VLM. But it's
cool that you can run the VLM alongside
the standard pipeline where you're only
actually sending in, let's say, diagrams
and images. Now, a lot of the smaller
VLMs aren't really suited to describe
what's in an image. They're more
specialized on actually figuring out
what's in a document. Um, so again, some
playing around with the different
models. I was trying Granite 3.2 vision
there, and I wasn't getting amazing
results for kind of general purpose
images, but I don't think that's what
it's designed for. So yeah, you can see
how you can kind of further build out
this pipeline to accommodate the files
that you're trying to ingest and with
enough effort you can build out quite a
complex and sophisticated RAG ingestion
pipeline. So this is our local RAG
system that we have available in our
community and within this system we're
looping through files that are dropped
into a local file folder similar to what
we've gone through today. And then
there's lots of different tracks for
different file types because obviously
Docling can handle lots of different
file types. So that is a little bit of a
catchall. But for structured data like
Excel sheets or CSV sheets, we want to
represent those differently. And then
once it gets into the main part of
processing the files, it works its way
through a record manager. We have
knowledge graphs. We handle the tabular
data as I mentioned. And then we have
lots of functionality around context
expansion, extracting out document
hierarchies, and then using contextual
vector embeddings. If you'd like to get
access to our state-of-the-art local RAG
system, then check out the link in the
description to our community, the AI
automators. So, now that we have a
version one of our RAG system up and
running, let's create a web page where
people can actually chat to the agent on
the local network. Now, we could vibe
code a chat interface, but just to keep
things simple, I'm just going to embed
the standard n8n chat widget. So, I'm
just going to grab the URL for this and
let's come back into cursor. Let's
create a new chat. And let's just
explain what we're looking for. And
actually, let's just use this extracted
images folder, because the root of our
nginx file server is essentially this
folder. So, here we're going to ask:
within this document root, can you
create a web page that embeds the
following chat widget? And then I want
this to hit our n8n Docker container.
And let's see what it does. Okay, so
it's creating the web page: "I'll create
a beautiful web page with the n8n chat
widget embedded." Okay, let's see how
beautiful this is going to be. Now I
have gotten very good results from Claude
Opus 4.5 over the last couple of weeks,
so I do have high expectations. So let's
have a look. Okay: n8n chat, your system
powered by n8n. Click the chat bubble in
the bottom right corner to start a
conversation. Start chatting. Let's ask
hello. Now nothing's happening. I
haven't even activated the workflow, so
that's not a surprise. So let's come
into the workflow. Make chat publicly
available. Yes. I'll give it this chat
URL. And I'll set it as embedded chat.
Don't need any authentication because
it's local. So let's copy that. And
let's save it. And let's activate the
workflow. And now let's just come back
in here. The workflow is active. Here's
the URL; drop in n8n instead of
localhost. It should probably figure that out
itself. Okay, so let's try it again and
refresh that. Okay, we are hitting the
workflow which is great even though it's
an error. Now let's come back in here.
Executions debug and editor. Oh yeah, I
just need to make a change. Just need to
update my credential. That looks good.
Now let's ask it to show me how to use
the ice and water dispenser just to see
can we get images back. And as you can
see, we've got our response through. The
formatting is all over the place; I
would not define that as beautiful, Claude
Opus or Cursor. So definitely some
iterations needed on the styling of this
pop-up. And this should probably be
embedded full screen. So obviously lots
of work is needed on the UI. But you get
the idea that you can use this n8n chat
embed and literally embed it in an HTML
page connected to n8n, which is in a
different Docker container, and away you
go. So the last thing that's needed then
is if you hit let's say localhost 8080
you're just getting this index of files
and then chat.html is just one of them.
So let's come back into Cursor, create a
new chat, and I'll just ask: can you
adjust the nginx configuration so
that if someone hits the root of the
static file service, it brings
them to chat.html?
And this is where Cursor is brilliant
for making changes to the likes of
Docker Compose files, because if you're
not an expert at this stuff, it
really can shorten the amount of time
it takes to make these changes. And it
was a pretty easy change: just index
chat.html. Okay, so we just need to take
down our static files container and
then re-up it, and then that should be
it. Okay, so let's test it out.
Brilliant. So now, localhost 8080 is
pointing to chat.html, and we should
still be able to access the images that
are stored on that file server. So the
last thing that's needed then is for
users to be able to access this chat
interface if they're on the local
network. And there's not a huge amount
of changes that are needed here if it's
a small local network. So here for
example, this is my computer. This is my
server. And let's say this is my local
IP address. And let's say that we have
other computers or other laptops that
are on this network that want to access
this chat widget. So when we hit this
chat widget, we're hitting it at
localhost880
or 127.0.0.1.
And you can only do that if you're on
your own machine. So if someone from a
different machine was trying to access
this, they would need to hit the actual
local IP address of my machine on the
network. And to hit this static files
container, which now actually contains
the chat interface, they would hit that
IP address at port 8080. And if they
wanted to access n8n, you could make
that available at port 5678 as well. So
to set this up, there are a few changes
you would need to make. So by default
inbound connections to arbitrary ports
like that are disabled by the Windows
firewall in my case. So I would need to
make changes to my firewall to allow
connections in. This IP address that I
have on this local network is
essentially dynamic. So, if I turn off
my laptop and turn it back on again
I'll most likely be assigned a different
IP address. So, you may need to set up a
static IP address so that the actual
address doesn't change. There might be
other network configuration changes you
need to make depending on how complex
your actual network is. And the bigger
your organization, the more likely that
there's going to be a lot more layers
and a lot more systems in place. So
you'll probably need to work with your
comms team on that. And then the other
obvious thing is your server needs to
always be on, or at least on during
office hours if you want people to
access this chatbot during office hours.
So that's how you can publish and
effectively deploy your AI agent in your
organization. I'm using the agent mode
in Cursor here to make some front-end
changes to the local rag agent. So I've
embedded the chatbot fully on screen. Um
and it's currently now iterating through
the actual styles. This agent mode with
the actual browser built in and then the
fact that it uses Puppeteer to actually
simulate browser actions is really
powerful for iterating on front ends
like this. And of course, this is a very
basic multimodal RAG agent. I've just
imported a couple of files here, but on
our channel, I've shown lots of
different advanced techniques that you
can use to really build out the
capabilities of a RAG agent like this.
Now that you know how to create a fully
local multimodal RAG agent, I highly
recommend you check out this video here
where I show you how to deploy a clone
of Notebook LM locally. Thanks for
watching, and I'll see you in the next one.
👉 Get access to our Fully Local SOTA RAG system and learn how to customize it, in our community: https://www.theaiautomators.com/?utm_source=youtube&utm_medium=video&utm_campaign=tutorial&utm_content=sota-local-rag

Every document you send to ChatGPT or Claude is a potential security liability. Legal contracts. Medical records. Financial statements. Client data. HR documents. The cloud-based AI tools we rely on every day are brilliant, but they're also third-party services with their own data policies, training practices, and potential breach vulnerabilities. For enterprises handling sensitive information, "trust me" isn't good enough. In this tutorial, I'll show you how to build a production-grade multimodal RAG system that never leaves your infrastructure. Zero external API calls. Complete data sovereignty.

🔗 Get Started:
Our Forked Self Hosted AI Starter Kit: https://github.com/theaiautomators/self-hosted-ai-starter-kit

🎯 What You'll Learn:
✅ Complete air-gapped RAG architecture
✅ Processing PDFs, images, tables, and audio locally
✅ IBM Docling for zero-hallucination document extraction
✅ Ollama + n8n + Docker local stack
✅ GPU requirements for production deployment
✅ Network deployment for team access
✅ Standard vs VLM document processing pipelines
✅ Maintaining semantic structure without cloud APIs
✅ Real-world hardware cost breakdowns

🔗 Links:
Docling Documentation: https://www.docling.ai/
Ollama Vision Models: https://ollama.com/
Qdrant: https://qdrant.tech/

⏱️ Timestamps:
00:00 Local AI + Docling
07:58 n8n + Docker
11:58 Setup Local AI Starter Kit
16:30 Building the RAG Ingestion
34:39 Building the AI Agent
46:56 Creating the Agent Frontend
50:17 Deploying to Local Network

💬 Questions or Comments?
What's preventing you from deploying local AI in your organization? Hardware costs? Complexity? Performance concerns? Let me know below!