Loading video player...
In this video, I'll show you how to
build an AI rag application in Python
and how to get it ready to deploy to
production. Now, I say that because I
myself have made many AI projects on
this channel. You've probably seen a few
of them. And while those projects are
super fun and cool and you can learn a
lot, they're not ready to be deployed
into the wild and used in a production
environment. That's because they're
missing observability, logging, retries,
throttling, rate limiting, all of the
things that you need for a real
production-grade AI app. And in this
video, I'm going to show you exactly how
to get all of those things. And the good
news is it's free. It's very easy to do,
and I'm going to walk you through it
step by step using something called
Inest. Now, I want to give you a quick
demo of what the finished application
will look like. Then we're going to dive
in, and we're going to start coding
everything out. And to be clear, the
stack that we're going to use in this
video is going to be Python. That's
where we're going to be writing all of
our code. We're going to use Streamlit
for the front end. We're going to use
Quadrant, which is a specific vector
database, which you can run locally,
which I'm going to talk about when we
get to that point. We're going to use
Inest for all of the orchestration and
observability. And then we're going to
use things like Llama Index for
ingesting PDFs. And we'll also use
OpenAI for the AI components. I know
that sounds like a lot, but don't worry,
I will break it down and explain it step
by step. Now let's have a quick look at
a demo. Okay, so I'm on the computer and
on the lefth hand side you can see a
simple user interface that's built in
streamlet. Now what you can do here is
essentially just chat with various PDFs.
This represents a very simple rag
application which is retrieval augmented
generation. If you're not familiar with
that essentially what it means is you
can upload any file that you want. It
will get vectorized which means turned
into a format that essentially can be
really quickly searched and kind of
pulled in by an LLM. And then we can ask
a question. So I can say something like,
"What does a road map engineer do?" This
is actually in one of the contracts that
I uploaded here for my program Dev
Launch where we're hiring someone to
help us build out road maps for students
and their position is roadmap engineer.
Anyways, we'll wait a second here. It
should actually generate the answer for
us and then it will give us that answer
based on the document that was uploaded.
So you can see it says a road map
engineer is this and then it tells us it
pulled it from this source. Now, in this
case, four of them were the same sources
because there's multiple information
from the same source. And then the last
one was this invoice 4, which obviously
didn't give us any relevant data. Okay,
so that's kind of the application we're
going to build. But what you may notice
is on the right hand side of my screen,
I have this really interesting
application open. Now, this is what
ingest looks like. We're running the
local development server and essentially
this gives us insights into everything
that's happening in our application. So,
we can view our app. We can see that we
have two inest functions here. So we
have one for ingesting a PDF and then
for querying the PDF we can look at
them. We can actually manually invoke
them directly from here if we want to
test them out without a UI and then we
can view all of the runs of these
different functions. This is the most
useful part in my opinion because for
example I was testing this earlier and I
had a run that failed. Now the way that
I have this set up is that it's going to
retry five times automatically and you
can see that every retry it's telling us
exactly how long it took to run. We can
click into it. We can view the logs and
see the error that occurred. I can copy
it and kind of dive into that. And then
I can actually rerun this particular
step and test it again. Then we also
have completed runs. So for example, we
can see that we had multiple steps in
this run right here where we had to do
an embed and search come up with an LLM
answer and we can see the results at
every single step here. So we know
exactly what's going on and we can
really have deep observability into our
application which is super important.
Same thing with actually loading in a
PDF. For example, you can see this whole
thing took 3.4 seconds to load and
chunk. It only took 74 milliseconds. To
actually embed it took a little bit
longer, 1.9 seconds. And then
finalization, 1.2 seconds, where we
returned the result. So that's a quick
demo of the application. Don't worry,
I'll make it all clear as we go through
the tutorial. Now, let's hop over to the
code editor and start building this
project. So before we dive into the code
here, I want to explain to you the
architecture of this project at a high
level and the different components that
we're going to use and why we're going
to use those. Now to do that, let's
start by understanding what rag is. Now
rag stands for retrieval augmented
generation. Essentially, all that means
is that rather than relying on the base
model or the base training from
something like, you know, OpenAI or GPT4
to generate a response for us, we're
going to augment that by actually
passing in additional data to the
prompt. This data is typically going to
be information from like a PDF document
or from our own knowledge store or
something along those lines so that the
LLM can reason based on data that's
relevant to what we're asking it. So in
this case, we're making a PDF rag
application. So that means we'll be able
to upload any PDF that we want and then
we can ask questions about any of the
information that's inside of those PDFs.
What will happen is when we ask that
question, we'll search in the vector
database for relevant information. We'll
then take that relevant information,
give it to our model by just actually
putting it inside of the prompt, and
then we'll tell the model, hey, reason
based on this information and give us an
answer. So, for example, if we had some
invoices or something and we ask, you
know, how much did I pay Tim? We would
go and we would look through all the
invoices, we would take out the relevant
information. We would pass it to the
LLM. We would say, hey, LLM, use this
information and answer the question, and
then it would look at that and say,
"Hey, you paid Tim $10 or whatever the
amount is." Now, in order to do that,
that involves having a vector database
and some other components. Now, as I
said, kind of the core of what I want to
show you in this video is the fact that
we can do this at a production level by
using an orchestration tool called
ingest. Now, this is very important
because when we actually want to go to
production, we need to be able to see
what's happening. We need to have
retries. We need to have throttling. We
need to have rate limiting. We just have
observability and logging into what's
going on with our AI app. Now, I'm
thrilled to say that Ingest has
sponsored this video. Don't worry,
they're completely open source. You do
not need to pay for anything. You don't
even need to make an account to be able
to use this application. And that's
exactly what we're going to do in this
video. We're just going to use what's
called their local dev server. Quickly
going through it, you have all different
kinds of ways that you can effectively
run different steps. And this allows you
to orchestrate, which just means kind of
manage and organize all of the steps
that are happening inside of your AI
application. They have a Python SDK,
which we're going to be using. They also
have a TypeScript and JavaScript SDK,
which is a little bit more popular. And
you can see that this allows us to
actually just have all of the
orchestration immediately ready. So we
can push this into production, see
exactly what's happening, have it
distributed across multiple servers.
It's fault tolerant. We can see
everything that's going on. And you'll
see how much easier this actually makes
development. Okay, that's ingest. That's
the orchestration and the kind of you
know production ready component. The
next thing is we need a vector database.
Now, a vector database is just something
that's going to store all of our data in
a vector format. Now, the way that this
effectively works is we're going to
convert textual data into this numeric
vector and we'll be able to search these
vectors extremely fast for similarity.
So what that means if I type something
like color, right, or you know, whatever
some word that we would be able to turn
that into a vector and then compare that
vector similarity to all of the other
vectors that we have in our database,
which effectively just allows us to
really quickly search through all of the
documents that we're going to be
uploading and find all of the relevant
data so we can give it to the LLM. I'm
not going to go into this in a ton of
depth, but a vector database is just a
really fast database specifically useful
for use with LLMs that lets us search
for similarity. That's typically what
we're looking for uh in kind of
documents or, you know, pieces of
information so we can pull it out and
give it to our LLM. Now, in order to use
the vector database, we're also going to
use something called llama index. This
is going to allow us to actually load in
a PDF and then to parse the PDF and
essentially turn it into something that
we can pass to Quadrant, which is the
local vector database that we'll be
using. And then the last piece of the
puzzle is we're going to be using OpenAI
for our LLM, so using something like
GPT. We'll also use Streamlit for the
front end, but we'll get to that later.
Okay, so let's close all of that and
let's go into PyCharm. And this is where
I'm going to write all of the code. Now,
you can use any editor that you want,
but I typically do recommend PyCharm
because it is just the best for
Pythonheavy projects. I also do have a
long-term partnership with PyCharm, so
you guys can check it out. You can get a
free trial, see if you like it, of the
pro subscription by clicking the link
down below, and you'll see through this
video some of the features that it has
that really does make it nice for these
large projects. Okay, so first things
first, I'm just going to go into full
screen mode. So, let's enter full
screen, and I'm going to start setting
up my dependencies for my Python
project. Now, what I'm going to do is
I'm going to type UV innit and then dot
and I'm going to just do this inside of
a folder that I've opened. So, I've
opened rag production app uvit dot hit
enter. It's going to initialize a new UV
project for me. All right. Now, we need
a few dependencies here. So, what we're
going to do is type UV add and I'm going
to start listing them out. Now, first we
need fast API. This is because we're
going to write effectively an API that's
going to be callable to do this kind of
uh PDF ingestion and the querying of the
PDF. essentially the rag application.
We're going to bring in ingest. Like I
said, this is what we're using for the
orchestration. We're going to bring in
llama-index
and then dash core. We're also going to
bring in llama-index
dash and this is going to be
readers-file.
This is going to allow us to read in a
PDF. We're going to bring in python-env.
This is going to allow us to load
environment variables. We're going to
bring in quadrant. Okay, so let's spell
quadrant correctly. D-client. This is
going to allow us to connect to a
quadrant vector database, which again we
can run locally. Then we're going to
bring in uicorn. This is going to let us
run the fast API server. We're going to
bring in Streamlit and we're going to
bring in OpenAI. Okay, so these are all
of the packages that we need. We're
going to go ahead and hit on enter and
then it should install all of those for
us. Okay, so that's going to take a
second. Now that is done. Perfect. Now,
what I'm going to do from my application
is I'm going to make a new file and I'm
going to call this enenv. Now, this is
where I'm going to put an environment
variable for storing my OpenAI API key.
That's because we're going to use OpenAI
for this project. So, let's just get it
set up right now. So, we're going to
type OpenAI_API_key
is equal to and then I'm going to go
back to my browser. I'm going to go to
platform.opai.com/api.
This is where I can get an API key. In
order to use this, you will need to have
a a credit card on file with OpenAI. It
should only cost you a few cents to do
it with this project, but you also are
welcome to use any other LLM that you
want. Uh assuming you know how to do
that, and I have many videos on my
channel that showcase that. So, I'm
going to go make new secret key. I'm
just going to call this rag app and go
create secret key. And then I'm going to
copy this. And obviously, don't share it
with anyone. That's not a key that you
want to leak. So I'm going to paste that
inside of my env file. I'm now going to
close that. And now we are kind of good
to continue. So first things first, I'm
going to initialize my main.py file. Now
inside of main.py, this is where I'm
going to write the logic to create an
API. Now what I'm going to do is use
fast API to make a simple API
essentially and then I'm going to serve
some of those API endpoints with ingest.
The way that ingest works for our
orchestration is that any endpoint that
we want to essentially have more control
over have observability into that's
typically dealing with some AI
component. So not something basic like
you know adding a event to a database
but typically more for the AI operations
that could take a long time or require
the retries. We can wrap that endpoint
in what's called a decorator. It's
essentially just a line of Python code
and then inest will automatically track
everything that happens inside of that
endpoint and give us kind of those logs
that you saw earlier in the demo that I
was showing. So right now it's not going
to make a ton of sense because I need to
set up a lot of stuff before we can see
the benefit of it. But in the meantime,
we're going to kind of write out or stub
what the API might look like, some of
the functions that we're going to have
inside of here. And then I'll show you
how we connect this to ingest. So while
we're building this application, we can
debug it a lot easier. Again, I know
it's going to be a little bit confusing
when we start. Uh it just requires a
little bit of code before it can start
to become useful. So bear with me here.
So we're going to start writing our
imports here. We're going to import the
logging module. We're going to say from
fast API, import fast API like that.
We're then going to say import and this
is going to be
ingest. We're then going to import
ingest.fast_appi.
That's because it directly um kind of
connects with fast API. We're going to
say from enenv import
load.env.
Uh and that is almost everything that we
need. We're also going to import uyuid.
This is to create a unique ID. We're
going to import os. We're going to
import date time. And we're going to say
from ingest.ex.
experimental
import AI. Okay. And we need to spell
experimental correct. And we need to
spell experimental correctly. So let's
do that. And let's just put all the
injust imports together so they're a
little bit more organized. Okay. Now
we're going to call the load.env
function. What this is going to do is
load the environment variables inside of
this.env file. Now we're also going to
start creating some of our clients. So
we're going to say app is equal to fast
API. And above that, we're going to say
the ingest_client
is equal to the ingest dot.est
like that. Now, inside of here, we're
going to give this an app ID. I'm going
to explain all of this in a second, so
bear with me. We're going to call this
our rag application.
We're going to say logging is equal to,
and this is going to be logging.get
uh and this is going to be logger. And
for the logger, we're going to get the
uvicorn logger because we're going to
run this with uicorn. We're going to say
is production is equal to false. This is
really important when we're doing this
in development mode. We need to make
sure we disable the production because
in production uh this is going to
require us to have a little bit more
security essentially to call these
ingest functions which I will explain
later. We're going to have serializer is
equal to ingest.py
dantic serializer. And that is because
injest supports pyantic typing. And
we're going to use some of those type
hints and typing system here in this
video. If you don't know what that is,
essentially this is a really good typing
system in Python that allows us to
essentially define the types of
different variables in this dynamically
typed programming language. Okay. Now
what we're going to do is down here
we're going to serve the ingest
endpoint. So we're going to say
ingest.fast_appi
and then this is going to be serve. And
then what we're going to do is we're
going to serve app and we're going to
serve the ingest
oops if we can type this correctly
client. And then we're going to put a
list and this list is going to include
the functions from ingest that we want
to serve. So like I was saying really
what we're doing right now is we're just
setting up kind of like a normal API
using fast API. We could go here and we
could say something like you know
app.get get and we could just define the
endpoint here, you know, get notes or
something and just write like a normal
standard fast API endpoint. We could run
the application with unicorn like we
normally would if you've ever used fast
API before. If not, don't worry about it
and it would just work. That's fine.
However, when we have kind of AI heavy
logic, we want that orchestration on top
of it that I was showing you. So, if
that's the case, then what we're going
to do is we're going to create something
called an ingest function. Now when we
make an ingest function because we have
this line right here, ingest will
automatically kind of serve that
function for us and it will connect to
the ingest development server which I'm
going to run for us in a minute and show
you what that means. Effectively what
will happen is we'll have this server
that now is sitting between our API and
our client. So let's say we have some
front end, right? Someone wants to use
the application. What they'll do is
they'll say, "Okay, I want to upload a
new PDF." Rather than directly sending a
request to our API here, they're going
to send a request to the ingest server.
The ingest server is going to take that
request and it's going to forward it in
the correct format to our API. It's then
going to call the ingest function that
we're going to write in just 1 second
and it's going to go through that
process of logging it, making sure that
it's retrying if possible, tracing all
of the errors, and giving us all of
those benefits. So let's look at a quick
example of how we set up one of those
functions. So we're going to say at and
then this is going to be
ingest_client.create
function. Okay. Now when we make the
function what we're going to do is we're
going to say the function ID is equal to
and then we just give this some human
readable name. In this case I'm going to
say rag and then ingest.
Okay PDF like that. Now on the next
line, we're going to specify the
trigger. Now the way that a function is
triggered or called is by some event
being issued to the ingest server. Now
that event can be triggered from a
client, so something like a front end,
or it could be triggered from another
function. So like one of our functions
could call another function and we could
have kind of this large chain of events
that are occurring. So we have an event.
An event typically triggers one or more
functions to run. And there's all kinds
of advanced stuff that you can do with
events and kind of with this flow that
I'm going to show you. So we're going to
say ingest.trigger
event like that. For the event, we need
to give this a name that we could call
from code. So we're going to say rag
slashingest
pdf. And what I've just said is okay,
whenever this event is sent or kind of
yeah is triggered essentially, we're
going to run this inest function. Okay,
now that's all that we need for right
now inside of this decorator, which is
what this thing is called. Beneath here,
we need to define the function. It's
going to be async. So, we're going to
say async define rag_ingest,
and then this is going to be PDF. And
then inside of here, we're going to take
in some context. Now, the type of the
context is going to be ingest.context.
And what we're going to return for right
now is nothing. Uh, but we will set the
type later. And sorry, this needs to be
inest with two ends. Okay, so now that
we've done this, we've created this
function that's going to be effectively
controlled by ingest and the development
server, which again I'm going to run in
one second. When someone triggers this
event, this function will run. We'll get
the context of the event, which can be
like parameters or data or values that
we want to pass here. And then we can
start doing whatever it is that we want
to do using some of kind of the flow
control inside of inest. So if I want to
just make a really simple example, I can
literally just do something like return
hello world, right? I can just return
pretty much anything that I want so long
as it would be valid kind of JSON or a
Python object that is uh what do you
call it? Serializable. So I made this
function now and when you call it, it
just returns hello world. But because
it's an inest function, we get the
benefit of having the observability
which I'm going to show you now. Okay,
so I promise it's going to all start
making sense in just one second. What
we're going to do to start is we're
going to run our Python API. So to do
that, we're going to type uv run
uicorn and then this is going to be main
colon app. Now the reason why I'm doing
that is because the name of our file is
main.py and because the name of our fast
API application is app. Okay, so I'm
going to do that. Going to go ahead and
press on run. And it says I got an
unexpected keyword here of logging.
Okay, so this should not say login. It
should say logger for the ingest client.
So let's just quickly fix that and rerun
and hopefully we should be good to go.
And there you go. We can see that our
API is now running. Okay, so the API is
running which means this function is
technically available to be called.
However, as I mentioned, what's going to
happen is we're going to have something
called the ingest server which is going
to essentially control the invocation of
this ingest function. So in order to
have that server running, well, we need
to run it. So what I'm going to do is
run a command on my computer. You'll be
able to run this as well that will run
this local development server. And this
is a huge advantage where this is open
source. You can run it locally on your
own computer. Of course, you can use
their manage solution as well. And if
you go into production, you're going to
have to deploy some instance of the
server. Uh but in this case, locally,
it's very easy. So what we're going to
do is make sure that we have Node.js
installed on our computer. And we're
going to type npx
and then ingest
client at latest. And then we're going
to type dev- u and we're going to put
essentially a link to our API. So if we
go here, we can see that our API is
running on port 8000. So what we're
going to do is we're going to write
that. We're going to go http
slash and then we're going to go
127.0.0.1
which is effectively localhost and then
this is going to be port
8000/api/ingest.
And then we're going to go d-n no dash
discovery.
Okay. Now what this is doing is it's
going to run this development server for
us. And it's going to tell the
development server, hey, I want to
connect to an application running on
port 8000. And then I put / API/ingest,
which is essentially this. Okay, so
ingest will serve an endpoint at /
API/ingest. And that endpoint will kind
of control the ingest functions. So I'm
going to go ahead and run that. And as
soon as I do, it's going to just install
that for me. And of course, I spelled it
incorrectly. It should not be ingest
client. It should be ingest CLI. I don't
know why I put client there. Let's fix
the command. npxingest-client.
Excuse my dictation tool there. And
sorry, not client- CLI. I keep saying
client. Let's just look at it again cuz
I want to make sure you guys know
exactly what the command is. npxingest-
CLI at latest dev-u. Nothing else
changes other than the fact this is the
CLI tool. And you can see that it's now
running this and it tells us that this
is running on port 8288.
So what we're going to do is we're going
to open up our browser here and we're
going to go to localhost port and then
we can copy this 8288.
When we do that, you should see
something that looks like this being
served on your computer. This is the
user interface for the ingest
development server. And what we're able
to do now is if we go to apps, we should
be able to sync with this. However, it's
telling us there's some error here. This
is not synced properly. Uh so let's make
sure that our API is running. Okay, we
can see that we're getting a internal
server error because we didn't actually
put this function that we need to serve
inside of this functions list. So,
excuse me. Let's fix that. We're going
to go rag ingest PDF. We're just going
to put that inside of here and then
save. And then what we can do is just
shut this down and restart our API.
Okay, so now we're actually serving this
function. And now if we go back here,
let's refresh this. And you can see,
let's close this. We can delete this
application that we have this rag app
that is showing up as connected. Okay,
under our apps and it has one function
called rag ingest PDF. We can view the
function. So if we go here, we can take
a look at it. I can open it up and if I
want to, I can invoke the function just
by essentially calling it, right? Or
passing this kind of event trigger. From
invoke, I can pass any data that I want.
In this case, we can just leave it
empty. And then I can press sorry invoke
function. When I do that, we see that a
run now appears under runs. And if I
click on this, we can see exactly what
happens. So it says it took, you know,
35 milliseconds with a 2 millisecond
delay. And if we look at this, we can
see this was the input, right? So this
is the ID. This is the name of the
function. This is the data. So this is
the function ID, you know, all of this
kind of stuff. This is the timestamp.
And if we look at the output, we can see
that we got hello and then colon world.
Okay. Okay. And then we could rerun this
again if we want. Now we have another
run that we can look at. We can see all
of the information related to it. And
this is extremely useful not just for
debugging the application, but for if we
actually put this into production. And
then if we go to events here, you can
see that this event occurred, right?
Ingest function finished. And then we
also had these functions that were
triggered when we invoked a particular
function. Okay. All right. So that is
kind of a quick example of essentially
how we connect inest. And now we just
leave this development server running.
So it's just going to keep running the
whole time. And we can keep kind of
turning on and off our API as we make
changes to it. So what we're going to
need to do now, we have this function.
We kind of made the connection. We need
to start actually implementing the rag
features. All right. So let's move on to
our vector database. Now as I said,
we're going to be using quadrant for
this database. and we need to write
essentially kind of like a database
client that's going to allow us to
create the database, load data, search
data, etc. So, in order to do that, what
we need to do is we need to run Quadrant
locally on our own computer. Now, in
order for this step to work, you will
need Docker installed on your machine. I
would highly recommend just downloading
Docker Desktop. Okay. Now what we're
going to do is we're just going to make
a new folder inside of our application
here. And we're going to call this
quadrant
storage. Okay. And this is going to be
the volume effectively where we store
the vector database that we're going to
use in our application. Now what I'm
going to do is copy in a command and you
can pause the video and you can type it
out if you want. That's going to allow
us to essentially run Quadrant locally
on our own computer. Now what I'm doing
is I'm saying docker run-d for damon.
The name of the container is going to be
quadrant. You can name this anything
that you want. So if we want to go back,
we can name it, you know, quadrant.
Let's go here rag db or something.
Doesn't matter. We're going to say the
port that we're running on is 6333. This
is the standard port. And then for the
volume that we want to attach here,
we're going to print the working
directory. This is going to work only if
you are on Windows. If you're on Mac or
Linux, you're going to have to change
this command slightly, which I'll talk
about in a second. We're going to go /
quadrant storage. This is the uh what do
you call it folder that we're currently
inside of. So that's why I'm doing that.
And then / quadrant / storage. And then
I'm saying quadrant quadrant, which is
the uh what do you call it? Image that I
want to run this container from. Okay, I
know it's a little bit confusing. The
reason why I have to do this print
working directory thing here with the
dollar sign is because for some reason
in PowerShell this messes up a little
bit and doesn't work locally for you.
You should be able to just replace this
with a dot slash which references the
local directory. Worst case you can put
the full path to where you want to have
this storage and it doesn't need to be
in the same directory as the
application. So it's completely up to
you. If you're on Windows again go with
this. Okay. So, dollar sign and then
pwd. And sorry, I think this needs to be
inside of parenthesis, not braces. If
you're a Mac or Linux, go with dot
slash. And then again, make sure you
have this quadrant storage uh created in
this directory that you're referencing.
Okay, so we're going to go ahead and run
this. You shouldn't really see any
output. You should just get some random
string that kind of looks like a hash
here. And then it should just be done.
Now in the background if you open up
docker desktop you should see that there
is now a new container that is running.
So let's wait for this to load for a
second. Okay. And you can see sorry it
just loaded. We have quadrant rag db
it's running. We have the container ID.
We have the image. We have the ports.
And we can control that from here. Okay.
And I'm just going to shut down this
other container that I don't know why
that's running because I don't need this
active. So let's just delete that. Okay.
Anyways, there we go. Now we have
Quadrant running and we're able to
connect to it from our uh code. So what
we're going to do is we're going to make
a new file here and we're just going to
call this vector_db.
py. Okay, we can add that to get. That's
fine. And inside of here we're going to
write the code that will allow us to
connect to our quadrant database and to
search something in the database. So,
what we're going to do is we're going to
say from quadrant,
okay, if we can spell this correctly,
underscoreclient. We're going to import
the quadrant client. We're then going to
say from the quadrant client dot models,
we're going to import the vector params,
the distance, and the point structure.
Okay. And we're going to make a class.
Now for the class, we're going to call
this just quadrant
storage. We're going to do an
initialization. So we're going to say
define a netit. We're going to take in
self. We're going to take in some URL.
And by default, the URL is going to be
http/
localhost col 6333.
Then we're going to take in a
collection. So for the collection, I'm
going to call this documents. This is
going to be the collection where we're
storing the information essentially. and
we're going to take in the dimensions or
dim and this is going to be equal to
372.
Okay, now there's a lot of stuff to
explain when it comes to vector
databases. Essentially with this
quadrant database that we're using, it's
very high performance and we can run it
locally. However, realistically in
production, you would actually deploy
this database out and then you would
probably end up changing this URL. So
you're connecting to the deployed
instance or you're using Quadrant's kind
of managed service, right? Um, I don't
really know exactly how that works, but
they obviously have their own, you know,
thing that they'll try to sell you. So,
here we can go self.client
is equal to and then we're going to
bring in the quadrant client. We're
going to pass the URL equal to URL and
we're going to say the timeout is equal
to 30 seconds. So that if we don't
connect in 30 seconds, then we
essentially crash this program. We're
going to say self.colction
is equal to collection. So, we store
that as a variable. And then what we're
going to do is we're going to create a
new um collection in our vector database
inside of this quadrant storage folder.
You can see now we have this collections
folder, right? We have these aliases, a
few other pieces of data. So what we're
going to do is say all right, do we
already have a collection called docs?
If we don't create one or sorry, yeah,
if we don't create one, if we do, then
we don't need to create one. So we're
going to say if not, and this is going
to be self.client client dot and this is
going to be collection_exists
and you can see the autocomplete coming
here from PyCharm. We're going to put in
the name self.colction. Then what we're
going to do is say self.client.create
collection and for the collection we're
going to say the collection name is
equal to self.colction
and we're going to say the vector and
this is going to be underscore config is
equal to the vector params. The size of
our vector is going to be the dimensions
and we're going to say the distance is
equal to distance doc cosine. This is
the uh algorithm or formula or whatever
you want to call it for calculating the
distance between different points in our
vector database. Again, without getting
into any advanced linear algebra here,
which is kind of the foundation of how
vector databases work, what we're going
to have is a certain number of
dimensions. This is effectively kind of
the number of values that we have inside
of our vector. And we're going to kind
of turn these text documents into
vectors. We're then going to compare
these vectors against each other using
this distance formula. And vectors that
are closer to each other in this vector
space have kind of similarity. At least
that's what we're hoping for. So we'll
be able to really quickly find those
vectors that are close to us, pull them
out of the database, get their original
text data, and pass that to our LLM. So
that's what we're setting up here uh
with this initialization. Now we're
going to make a function here called
upsert. This just means essentially
insert and update. And we're going to
say self ids vectors and payloads. And
what we're going to do is say points is
equal. And then we're going to have a
new um what do you call it? List
comprehension here. And we're going to
say this is point strruct. And then this
is going to be id equals id's i. vector
equals vector and then i or vectors i
sorry. And then we're going to have
payload is equal to payloads i for i in
range. And then this is going to be the
len of ids. Now, what this is going to
do is it's going to get all of the
associate ids, vectors, and the payloads
from these three uh lists effectively
and create a point structure, which is
what we need to create in order to
insert this into our vector database.
And then we're going to insert it. So,
we're going to say self.client.upsert.
Okay, we're going to upsert the what is
it? Self.colction. So, let's type this
correctly. Self.colction. And then we're
going to say points is equal to points.
Okay, so the idea here is that we're
going to pass a series of ids, which is
a list, a bunch of vectors, which is
kind of the vectorzed version that's
going to be in a dimension of 372, and
then payloads. And payloads is going to
be real data, data that's actually human
readable. Um, that kind of represents
the information that we've vectorized.
We're going to convert all those three
things into this point structure, which
is just what's required for quadrant and
the way that we're doing this. and we're
going to insert that into the vector
database. Okay, so that allows us to now
add vectors effectively. And the more
important thing is searching for
vectors. So we're going to say define
search. We're going to say self. We're
going to say query
vector
like that. And we're going to say top k
this is going to be an int. And by
default it will be equal to five. Now
we're going to say results is equal to
self do.client client dot search and
we're going to search the collection_ame
equal to self.colction.
We're going to say the query vector is
the query vector. We're going to say
with payload is true and we're going to
say the limit is equal to top k. Now top
k just means that we're looking for this
many results from the vector database.
So we're going to look for five results
in this case. We could look for two. We
could look for 10. could look for 20.
Obviously, the more you look for, the
longer this could potentially take, but
it's still very, very fast. Okay, then
we're going to say context is equal to
an empty list. And sources is equal to
an empty list. So, the reason we have
these variables is because I need to get
all of the context or information. I
want to store that in one list. And then
I want to get all the sources. So, the
documents essentially that we pulled
this information from. So I'm going to
say for R in results I'm going to say
payload is equal to get attribute or get
adder this is going to be R payload or
none and then or we're going to have
this empty dictionary. We're going to
say text is equal to payload.get
and we're going to get text or we're
going to get an empty string and we're
going to say source is equal to
payload.get get and we're going to say
source or an empty string.
Okay. Now, what we're going to do here
is we're going to say if text. Okay. So,
if there is text, we're going to say
context.append text and sources.append
source. And actually, this just reminds
me I'm going to just convert the sources
into a set because I don't want to keep
adding the same source over and over
again. So, we're going to say
contextes.add and then rather than
append, we're going to say add source.
And then what we're going to do down
here is we're going to return and we're
going to say
context and we're going to return the
context and then the sources we're going
to convert to a list and we're going to
return the sources. So what this is
going to do is it's going to search our
vector database. It's going to get the
relevant results based on this query
vector which again we'll look at in a
little bit and then we're going to pull
out all of the sources and the context
and return that. Okay, so that's that
for the vector database. Now one thing
to note the way that I've done this is
that we are going to lose which context
is associate with associated sorry with
which source. So if we wanted to we
could change it back and then we would
have kind of the related data based on
the indices. In this case I think it's
fine just to have it as a set. So that's
our vector database. The next thing that
we need to do is have a way to
essentially read in a PDF. So I'm going
to make a new file and this new file is
going to be called data_loadader.
py. Now, this is where I'm going to use
llama index to load in PDF documents and
to embed them because what we just did
is we made the vector database which
will allow us to upload essentially
vectors and search for vectors. But the
thing is I still need to create the
vectors. So let's do that now. So I'm
going to say from OpenAI
import openai. I'm going to say from
llama index dot readers
okay dotfile import the PDF and then
reader I'm going to say from llama index
docore dot node parser okay import the
sentence splitter and then I'm going to
say from
enenv import load envoenv
function again in here and I'm going to
initialize an openai client So I'm going
to say client is equal to open AAI. Now
because we've defined this variable
OpenAI API key inside of this file, we
don't need to actually pass anything
else to what is it here? This client, it
will just automatically find and look
for that variable. And because it
exists, it will essentially allow us to
use OpenAI. Now let me briefly discuss
what we're about to do here. So we're
going to effectively load in a PDF. Now
when we load in a PDF, that could be
very large. It might be, you know, a
thousand pages long. We can't just embed
the entire PDF. And by the way, embed
means just effectively convert it to a
vector so we can store it in the
database. Instead, what we need to do is
chunk it. Chunk it means we need to
break it down into smaller pieces and
then embed those smaller pieces. And the
size of the pieces is relevant here. We
don't want anything that's too big, but
we don't want anything that's too small.
So that we still have a lot of really
relevant data, but we don't have, you
know, massive amounts of data or really
tiny pieces of data that are going to be
hard to search for. So, what I'm doing
here is I'm going to use Llama index to
read in our PDF and then to split all of
the sentences in that PDF into chunks.
We're then going to take those chunks.
We're going to embed them and then we're
going to store that in the vector
database. So, what we're going to say is
our embedders model is equal to the text
embedding and not this one. It's going
to be -3-l large. There's all kinds of
models that can embed text for you. In
this case, we're using OpenAI. This is a
pretty popular one. And we're going to
say the embedders dimension is equal to
372. And we need to make sure that
matches what we have inside of our
vector database, which is 372.
Okay, this is effectively how large the
vector is for the text that we're
embedding. Then we're going to say our
splitter is equal to the sentence
splitter. We're going to have a chunk
size. I'm going to say chunk size is
1,000. And the chunk overlap. Now the
chunk overlap is how much of the end of
one chunk is included in the beginning
of another chunk. The reason why you
would have an overlap is because if you
have a sentence like hello world, my
name is Tim. Let's say we want to split
this right into two chunks. Well, we
might have one chunk that is hello
world. We might have one where his name
is Tim. Now what we would do if we had
an overlap of let's say one or something
and one represented a word is in the
first chunk we would have hello my and
then in the second one we would have my
name is Tim where we would kind of
duplicate the word at the end in the
beginning of this chunk so we don't
potentially lose a lot of relevant
context. So in my case I'm going to go
with a chunk overlap of 200. This
represents characters by the way not
words so that we're able to kind of
split it properly. Okay. Okay. Now,
anyways, that hopefully that makes sense
in terms of how that's going to split,
but it's just going to do essentially
sentence splitting for us. Um, and
create all of these chunks. We're then
going to say define load and_chunk
PDF. We're just going to take in a path
to the PDF, which will be of type
string. Now, we're going to say our
documents are equal to the PDF reader.
Data, and we're going to say the file is
equal to the path. This is just going to
look for the PDF and it's going to load
it. We're then going to say text is
equal to and we're just going to pull
out the text. We're not looking at
images or anything. And we're going to
say d.ext
for d which stands for uh what is it
kind of data or document for d in docs
if get adder dext
none.
So, what this is going to essentially
say is, okay, we're going to get all of
the text content for every single
document inside our documents. All
right, if this document uh has some text
attribute, because we might have a PDF
that only has images, for example, not
text. We're then going to say chunks is
equal to this. We're going to say 4T in
text. And we're going to say
chunks.extend.
And then we're going to say splitter,
okay? Text. and we're going to split
this with t and then we're going to
return
our chunks. Okay, so this is the
chunking process. Again, I don't want to
break it down too much further than
that. Effectively, we're taking the PDF,
turning it into smaller pieces of
textual data, and then we're going to
embed each of those pieces of data. So,
the next function we're going to have is
embedders text.
This is going to take in our text, which
is going to be a list of type string.
This is then going to return, I'll just
type it manually here, a list of list of
type float, which is effectively what
our vectors are going to look like.
We're going to say response is equal to
client.bedings.create.
We're going to pass the model, which is
the embed model, and we're going to say
the input is equal to text. We're then
going to return iteming
for item in response.
Okay, so what this is going to do is
it's going to send a request to OpenAI.
It's going to pass all of the text uh
which is all of the text that we've kind
of chunked here already. It's going to
embed them, which means convert it into
a vector, which we can store in the
vector database. We're then going to go
through the result here. So,
response.data, and we're just going to
pull out the embedding itself. We don't
care about any of the other metadata
that's going to be included. So, that's
our data loader. That's our vector
database. And now, we're going to move
over to main.py. Okay. Now, I want to be
able to test the ingestion of a PDF
first. So, I'm going to write this
function by using some of the functions
that we just wrote here, and then we'll
kind of continue. And there's a bit more
advanced stuff that we want to get into.
Um, all right. So, let's start by
importing some of the stuff that we just
wrote. So, we're going to say from data
loader, import the load and chunk PDF
and import the embed text. We're then
going to say from the vector database
import.
Okay, the quadrant storage like that.
Now, quickly, I'm also just going to
make a new file here. I forgot to do
this where I'm going to create some
custom Python types. So, I'm going to
call this customtypes. py. The reason
for this is that I want to have these
types so that I can make my application
a little bit more readable and I can
import and use paidantic which is
supported inest. So I'm going to say
import pi dantic and I'm just going to
write some really simple Python classes
that represent some types that I'm going
to use in my app. So I'm going to say
class and I'm going to say rag chunk and
src and this is going to be paidantic.
Model and then what I'm going to do is
say chunks and this is going to be a
list of type string and I'm going to say
the source ID string is equal to none.
Now what this type is going to represent
is essentially the result after we chunk
and get the source for a particular PDF
document. I'm then going to say class
and this is going to be rag if we can
spell this correctly upsert result. This
is going to be the result after we
upsert a document. So we're going to say
pyantic dot base model and then we're
just going to say ingested and this is
going to be an int just representing how
many um what is it things that we
ingested. We're then going to say class
and this is going to be rag search
result. So you can guess what this one
is when we're searching for some text.
That's what we're going to have here.
We're going to have pideantic.base base
model and then we're going to have the
context which is going to be a list of
type string and we're going to have the
sources
which is going to be if we type this
correctly a list of type string. Then
we're going to have one more. This is
going to be class rag
query
result. So this is different than the
search result. This is the query that
the user is actually sending uh what is
it to the endpoint. So we're going to
say pantic.base base model. We're going
to say the answer is of type string.
We're going to say the sources is list
of string and we're going to say the
number of context is an int. Okay, so I
think that's all we need for our custom
types. Let's go back to main. Let's now
import these custom types and then we
can use them in this first function.
We'll test it and then we'll move on to
the next one. So we're going to say
actually from custom types import and
then we need to import all of these. So
what did we have? We had the rag query
result. We had the rag search result. We
had the rag upsert result and we had the
rag chunk and source. Okay. So now let's
go into this function that we wrote and
let's start setting it up to actually
utilize inest properly and to kind of
perform the steps that we need. So first
things first, what I'm going to do is
kind of explain to you a little bit
about how this works. when we actually
run an ingest function. So as you saw,
we kind of have a diagram that looks
like this, right? We have ingest which
is kind of the execution engine or the
local server. We send a request to there
that then sends request to our API
endpoint. The API endpoint then goes to
our ingest function and then inside of
the function we can have these things
called steps. Now each step is kind of
an individual operation that we're going
to track that we could retry if needed
and that we're going to kind of observe,
right? and get all of the logging and
the information for. So if I just go
quickly over to overview here, you can
see right that there's kind of three
main things in inest. We have the
triggers which we've already kind of
talked about where it's essentially
events that can trigger something to
run, right? It could be a web hook,
could be a crown schedule, could be a
manual trigger like in our case. We have
flow control, which we're not really
getting into right now, but we will talk
about the concurrency, the throttling
and all of that later on. And then we
have steps. And steps are kind of how we
convert a function into a workflow with
multiple retryable checkpoints. So if
you look at an example right here, we
have this step, right? We're saying,
okay, we're going to run step one, which
is getting data. We're going to wait for
that step to finish, and then we're
going to save the data. Now by wrapping
these different operations in these
steps from ingest, this allows us again
to have all those advantages of the step
where we can retry it. We can wait for a
step to finish and we can see kind of
what the application's actually doing at
each step. So we have kind of deep
observability into our functions. So I'm
going to show you how you make a step.
But if we go here, you can see running
retryable blocks, pausing execution,
pause for an amount of time or wait for
a specified amount of time. There's a
crazy amount of stuff that you can do
with the steps here, but what we're
going to do is have two steps in our
function. The first step is going to be
for loading the PDF, and then the second
step is going to be for embedding it and
kind of chunking it or not chunking it,
but adding it to the vector database.
So, what I'm going to do is I'm going to
write two internal functions. The first
function is going to be load. And this
is going to take in a ingest dot context
and it's going to return a rag chunk and
src result. Okay. And for now we're just
going to go with pass. We're going to
have another function and I'm going to
call this upsert. Now this is going to
take in chunks and what do you call it
underscore src. This is going to be of
type rag chunk and src. And it is going
to return a rag upsert result. For now
again we are going to pass. So the idea
is I have these two individual steps
that I want to run inside of this
function. We need to load and then we
need to add to the vector database. And
we can make as many steps as we want. In
this case it's kind of the logical thing
to do. We could even make more steps if
we want. And if we did that then we
would have obviously an even more
detailed function where we go through
everything. We can see the timing all of
that kind of stuff. So what I'm going to
do down here is I'm going to say chunks
and src is equal to and rather than just
calling the load function directly which
is what you would do if you're just
working kind of standardly in Python is
we're going to wrap this in a step. Now
the way we do that is we say await and
then this is going to be ctx.step.run
and then we put the name of the step. So
in this case we can just call this
anything human readable that we want. So
I'm going to say load and chunk and then
we put the function that we want to
call. Now, because we want to call these
functions with arguments, what we're
going to do is put a lambda, and we're
going to say lambda, and then this is
going to be underscore load. And then
we're just going to pass ctx, which is
this value essentially here, right into
this load function. Now, we don't really
need to pass it like this because we
could just use this as a global variable
inside of the function. But for now,
that's how I'm going to do it. Now, we
also have the ability here to specify
the output type. So I can say the output
type is rag chunk and src because like I
said this now supports paidantic. All
right. Now we're going to do the exact
same thing for the next step. So what
we're going to do is we're going to say
inested
is equal to await ctx.step.r
run. This step is going to be called
embed
and upsert. And keep in mind like you
can name these steps anything that you
want. It's more for the logging to see
it. We're going to say lambda
and then this is going to be underscore
upsert and then we're going to pass the
result from the previous step chunks and
src and we're going to put the output
type equal to the rag and what is this
upsert result okay so this is how we
call the steps right and then what we
could do is we could say return ingested
dot and I'm going to go model dump what
this does is it just takes our pyanti
model and converts it into JSON or a
Python object. So, kind of Python
dictionary, sorry. Um, and allows us to
return that because these functions need
to return something that is what's known
as serializable. So, we just model dump
the uh what is it? Pantic object and
we're good to go from there. So, that's
kind of how we set up the steps, right?
We're saying, okay, we're going to run
this step. We're going to wait for it to
finish. Then, we're going to run this
step. Now, you could run these steps in
parallel, right? You could run them at
the exact same time. You don't need to
wait for one to finish because we're
doing these asynchronously. Um, so we
could just remove the await, right? And
we could just run them kind of one by
one. So we could just remove the await.
We could run them at the exact same
time. We could run it in parallel. Like
we can control the flow however we want.
But in this case, I do want to wait for
them to finish running because they will
take, you know, a second and I need the
result from this one before I can do the
result here or before I can execute
this. So let's write the context of the
functions now. So I'm going to say
inside of load, what we need to do is
get our PD PDF path. So I'm going to say
ctx.event
event
dot data and then I'm going to get the
PDF path. I'm then going to say the
source id is equal to ctx
doevent dot data.get
and I'm going to say source id and then
I'm going to put pdf path here. Now this
is because if I pass a source ID myself
then I will use that. If not then we'll
just use the pdf path as the source ID.
We're going to say chunks is equal to
load and chunk PDF. We're going to take
the PDF path and what we're going to do
is return rag chunk and src and we're
going to say the chunks is equal to the
chunks and the source ID is equal to the
source ID. Okay, so that is loading.
What we're doing right is we're just
going to load and chunk the PDF and
we're effectively going to return that
result. Now upserting we're going to say
chunks is equal to chunks and
src.chunks.
Right? The typing is nice here. We know
what we're going to be getting from this
object. We then are going to say the
source ID is equal to chunks and src
dots source ID. We're going to say the
vectors is equal to and this is going to
be the embed text. From the embed text,
we're going to take in the chunks. And
by the way, if we wanted to like we
could convert this into a step. It's not
necessary because it's already kind of a
part of this and this is the long
running operation anyways. But we can
have you know other steps from one of
these steps. We can also trigger another
function from this ingest function and
we can make it you know as complex as we
want. So I'm going to embed that. I'm
then going to say ids because I need to
generate a unique ID for all of these
vectors and I'm going to say this is
string.
And then we're going to say uyu ID
doyuyu ID 4. This is going to create a
unique identifier for us. And sorry,
actually, UUID 5, not four. We're going
to take UUID.namespace,
and we're going to use namespace URL.
Don't worry too much about exactly what
this is. Just bear with me because this
will identify or make a unique ID. We're
going to say source ID. And this is
going to be associated with I. And then
we're going to go for i
in range len of yes len of vex like
that. Or actually rather than going vex
let's just go len of chunks although it
shouldn't really make a difference. Now
that's the ids. The next thing that we
need is the payloads because we've
generated the vector but I need the ID
for every vector. I need the payload for
every vector. So I'm going to say
payload is equal to and then this is
going to be source. The source is going
to be the source ID so we know where it
came from. We're going to say the text
is equal to chunk at i and then sorry
this is chunks not chunk. So it's chunks
i for i in range the len of chunks. So
again what we're doing is we're kind of
looping through all the chunks and we're
getting uh what do you call it? All of
the text and the source ID for that. So
we can have that as a payload. So we
have an ID, we have a payload, we have
the vectors. Now we're able to actually
pass this to the quadrant store so that
we can store it there. So we're going to
say quadrant storage. Okay, we're just
going to initialize it. We're going to
say upsert. And then we're going to pass
our ids, our vectors, and our payloads.
And we're going to return the rag upsert
result. And for this, we're going to say
ingested is equal to the length of and
then this is going to be what did we
have here? Chunks. Okay, so the number
of chunks that we actually ended up
ingesting. All right, so that's actually
it for this function. You know, in
theory, assuming I didn't make any
mistakes, which is unlikely, this will
run and we will be able to actually see
the results running inside of Injest and
we'll be able to upsert this into the
Quadrant database. So, right now, let's
look uh at what we have. We have the uh
Injust server running. Our server is
currently running, but we need to stop
it and restart it because we made a
change here and it's not in kind of
debug mode right now or reload mode.
It's going to take a second to load up
here because it's connecting to the
database. Okay, we can see that it's all
good. What we're going to do now is
we're going to go back to our
development server. Now, we're going to
go apps. We're just going to make sure
it's connected. Looks like it is. Let's
go to our our functions. Let's view
that. And I actually want to test
invoking this. So, in order to invoke
this, I'm going to press on invoke. And
for my data, I'm going to pass my PDF
path.
And I need to pass a valid PDF path. So,
what I'm going to do is I'm going to go
and get the absolute path to a PDF from
my documents. And I'm just going to
paste it inside of here. Let me do that
now. Okay. So, I just have a path here.
I had to escape the slashes because it's
not going to allow me to do this in the
string due to how like the slashes are
how they escape the string essentially.
Now, this is just a link to one of the
resources that we actually have for dev
launch where we have kind of like a DSA
road map. And anyways, what I'll do is
just go and press invoke function. When
I do that, we see a new run is
triggered. Now, it looks like it's just
waiting for this to run. So, it says
running. And then I guess there was an
error. State finished. Okay. I don't
know what that means. Uh, file not found
error. Okay. So, you can see we had a
file not found error. So, clearly I kind
of made a mistake there. It's going to
keep attempting this a few times based
on how the default settings are set up.
So, what I could do is just cancel that
for now. And that's the whole point of
having this, right? We could debug this.
I'm going to go to functions, go back
here, go to invoke, and I'm now just
going to put a new file. And it looks
like actually it wasn't two graphs, it
was 12 graphs. I'm just looking in my
file explorer. So, let's try it. Now, we
have this new run. We can see a loadin
chunk look to work pretty fast. 48
milliseconds. We get two chunks here
because of how much content was in this
document. And then we have embed and
upsert. And we can see that we ingested
two documents. And that took a little
bit longer to go in the vector database
whereas loading and chunking was very
fast, right? Where the embedding, you
know, required us to call out to OpenAI.
So it takes a second to get the result.
And then we have finalization down here.
Cool. So that's also just a really nice
way to test this, right, from this UI
rather than having to make the front
end. So what I'm going to do now is I'm
going to move on to the next function
which is going to allow us to actually
query our PDFs. Right? So let's make
another ingest function. So here we're
going to say at ingest and then this is
going to be underscoreclient
dotcreate
function. For the function we're going
to give this an ID. So we're going to
say function ID is equal to and this is
going to be rag. And then we'll go here.
Query PDF.
Okay, that can be the name. Then we're
going to say the trigger is equal to
ingest.t trigger event. For the trigger,
we'll go rag query PDF AI because we're
using AI to do this. We could do it
without AI as well. We're then going to
say async define rag_query
pdf_ai.
We're going to take in our context,
which is the
ingest
dot context like that. And what we're
going to do now is start setting up
essentially everything that we need to
do to query the PDF. Now, the first
thing that we're going to do is have a
function. Okay, I'm going to call this
search. Now for search we're going to
take in a question
which is a string and we're going to
take in the top k which by default will
be equal to five and which will be an
int.
Okay. Now inside of here we're going to
say our query vector is equal to embed
text and we're going to pass a list with
our question and then we're going to
pull out index zero. The reason for this
is that if I want to query my database,
I need to do it with a vector. So
whatever the question is that the user
asked, I need to embed that so it's in
the same format as everything in the
vector database. Because this normally
takes in a long list of text, we're just
going to pass it as a list and then take
out the first result. So we'll have this
as our query vector. Then what we're
going to do is we're going to say our
store is equal to quadrant storage. And
we're going to say found is equal to
store.arch. We're going to pass our
query vector and we're going to pass our
top K. we don't need to pass it as the
keyword parameter uh or name parameter
whatever you want to call it and we're
going to say found is equal to store
search and we're going to search based
on that query vector. We're then going
to return our rag and this is going to
be search result and we're going to say
context is found and then context and
sources is found and then sources and we
can also just add the return value here
which is the rag search result. Okay.
Now, if we go down here, we've kind of
built out the first step, but there is
more that we need to do. So, first
thing, we're going to get the question
from our event data. So, we're going to
say ctx.event.data,
and we're going to pull out what the
question is. We're also going to pull
out the top k because this is something
that will allow the user to pass. So,
we're going to say ctx.event.data.get
top k and then five. And we're also just
going to convert this to an int in case
they pass it as a string. Okay. Now
we're going to say found is equal to and
we're going to await ctx.step.tr
run. We're going to run the step which
we're going to call embed and search.
And same thing as before, we're going to
call this function. So we're going to
say lambda
search and we're going to pass the
question and the top k and we're going
to say that the output type. So let's
specify that is equal to the rag search
result. Okay. And I don't know why it's
giving me that. I didn't want that
autocomplete. All right. So we have the
output type rag search result. Let's
make this left side a little bit
smaller. And we're running this now as a
step. So again, it will be retriable and
we'll have all of those benefits. Now
I'm just going to copy in the prompt
that I'm going to use for the LLM
because now once we find the
information, we need to pass this in a
prompt. So what I'm going to do is say
content block and I'm going to join all
of the context. So all of the sentences
essentially that I found uh like this
where I have a dash and then I have the
uh what do you call sentence 4C in
contextes and I'm going to combine them
with kind of two new line characters. I
know this looks a little bit weird but
I'm just taking all of the context in a
list and converting it into a string.
Then what I'm going to say is the user
content. So this is essentially what I
want to ask or kind of like the um yeah
the the prompt is going to be use the
following context to answer the
question. Here's the context. Then
here's the question. Answer concisely
using the context above. Okay, that's
the prompt that I want. Now what I'm
going to do is I'm going to use an
adapter here from uh what do you call
ingest to actually call the AI model. So
I'm going to say adapter is equal to AI.
OpenAI dot adapter.
Now for the adapter I need to pass my
authorization key. This is going to be
OS.get envai API key. I need to manually
pass it here because we're not using the
open A client now. We're using the
adapter from uh what do you call it? Uh
ingest. I'm going to say the model is
equal to and then I'm going to pass
GPT-40-
MIDI. I'm just going to use a small one
because I don't want it to be super
expensive. And then I'm going to
generate a response. Now, by using this
AI adapter and generating a response
here with the method you're going to see
in a second, again, we get the same
benefits like we do with the step
function where it will automatically be
retrieded. It will handle the
throttling. It will handle the rate
limiting. As you probably know, when you
call these LM providers, there can be a
lot of errors that pop up. So, we're
going to say response or res is equal to
await ctx.step.ai.infer.
This is an AI inference. It has a
special kind of syntax here inside of
ingest. We're going to call this the LLM
answer. So again, it will be observable
to us. We're going to pass the adapter,
which is equal to the adapter that we
just wrote. We're going to pass the
body, and for the body, this needs to be
equal to. We're going to pass max tokens
equal to 1,024.
We're going to pass the temperature
equal to 0.2. This is essentially how
random the model is going to be. So, we
want to be pretty low on temperature.
Now, I'm spelling temperature completely
incorrectly. So, let's fix that
spelling. And then we're going to have a
message. So, we're going to say messages
actually is a list. And for the
messages, the first thing that we're
going to do is we're going to say roll.
And this is system.
And then the content, let's fix this
typing here, is going to be equal to
kind of a system message. And I'm just
going to paste in a simple one here. You
can make this as detailed as you want.
I'm going to say you answer questions
using only the provided context to make
sure that it really is not kind of going
crazy and it's just using what we
provided. And then I'm going to say roll
user and then content and then I'm going
to pass to this again if we fix the
typing the what did we call this up
here? The user content.
Okay. So that is going to now generate
the response for us. Now after the
response we need to get the answer. So
we're going to say answer is equal to
res. This is the default format from
OpenAI. So we're going to say choices
zero
message and then content. Okay? Because
you can pass multiple things. So this is
kind of the way that we need to do it.
And then we're going to say dot strip.
So we're going to strip off any leading
or trailing white spaces. Then what
we're going to do is we are going to
return and we're going to say answer is
answer. We're going to say sources. And
this needs to be as a string.
So sources is going to be equal to found
do.ources like that. And then we're
going to say the num
context. And then what we'll do is we'll
just pass the number. So we're going to
say len of found.context.
Okay. So that should pretty much be it
in terms of this function. Again, what
we're doing is we have the search step
where we're embedding the query, the
thing the user asked essentially, and we
are then searching for the results in
the vector database. We are running that
as a step. We're creating a simple
prompt here to say, okay, here's the
information. Here's the question. We're
creating an AI adapter. We're passing
that to the inference step with our
messages and then we're just kind of
parsing the response and returning it.
It's really not too complex, but just
because we're doing with ingest, I want
to explain it a bit more in depth. Okay,
now that we have that, we can restart
the server again and we should be able
to actually run this inference. So, if
we go back, we can refresh. Okay. And we
now know that we have one document that
we added. I added it about graphs,
right? And actually, let me open the PDF
so you can kind of see what it looks
like. Something like this, right? Where
we have like some problems, leak code
problems, um, you know, just like some
basic information because we recommend
people do, and there's, you know, a
bunch more here. uh certain questions on
leak code during certain days based on
our DSA road mapap in dev launch. So
what I'm going to ask it is something
like you know what is the importance of
graph I don't know we we'll come up with
something right but if I go to functions
now oh we only have one function okay so
let's go back here my my apologies guys
let's shut this down okay close that
shut this down and let's add this
function to the list so what do we call
this uh rag query PDF AI because we need
to serve the function so we got to add
it into that list there okay let's rerun
the application
and apologies. Let's go back. Let's go
here. Refresh. And now we see another
function is appearing. So let's go to
query PDF. Let's go to invoke.
I think this time we just need to pass
top K and questions.
Okay. So we're going to say question and
then the question can be like this. Oh,
I'm going to ask it something like what
is the importance of graph problems? And
let's see. I don't know if that's going
to give us an answer, but let's try it.
Okay, so it's running and we can see
that it says, you know, embed in search.
It ran that step and then I did LLM
answer and finalization and then it gave
us the answer here. It said graph
problems are important because they help
build comfort with fundamental concepts
such as modeling problems as graphs,
performing DFS and BFS traversals and
managing state and recursion. They also
enhance blah blah blah. This is a direct
quote out of that PDF that I showed you
and it says that this is the kind of
context that it looked at. You can see
there's another document that I uploaded
kind of as a test. Um although, you
know, it didn't actually use anything
from this. It just was pulling that
context and then it pulled it out of
graphs. And if we look through the other
steps, you know, we can see all of the
input output, everything that was going
on and kind of the importance of that.
Cool. Okay. So, that worked. We now have
the two functions. And to be honest, at
this point, like the application is
pretty much done. We built a rag query
application. But I do want to show you a
few other kind of nice features that
Injust has that we can take advantage
of, as well as how we can add a custom
front end to this. So for the front end,
what I'm actually going to do is I'm not
going to code it from scratch with you
because it's going to take a little bit
of time to do that. I'm going to leave a
link in the description to all of the
code in this video. In that uh link,
you'll see a Streamlit file. It's going
to be called Streamlit app. I'm going to
write it in one second for you. You'll
see it. And you can just copy that and
paste it into this project and it will
be a functioning frontend for you to use
within this application. So let me show
you what I mean. I'm going to make a new
file. I'm going to call it
streamllet_app.py.
Now I'm just going to directly copy this
from my other monitor because I wrote it
before the video for the demo. Okay. So
I'm just going to paste it directly
inside of here. And you guys can find it
from the link in the description in the
GitHub link. Same thing. Literally just
copy it and paste it inside of here. And
I'll quickly step through some of the
code in terms of calling our function so
you see how that works. Uh and then this
will just be a functioning front end.
I'll show you how to run it. So
effectively this is a stream
application. Really nice UI. You can
just build stuff in pure Python. You can
see that we what do you call it? Import
ingest here as well. We create an ingest
client. It's really important that we
specify the correct app ID and that we
don't specify that this is in production
because if it's in production, we need
to pass something called an event key uh
which you can generate on ingest which
is a little bit more complicated if
you're going to deploy this for security
reasons. We have the ability to like
save and upload a PDF. So we store it
and we can upload it. And then we have
things like send the rag ingest event.
So, like I was saying, if you want to
actually trigger one of the functions to
run on ingest, all you have to do is get
the client and then you just send an
event. So, we send the event, which is,
you know, rag ingest PDF just like we
were kind of doing from the UI. We pass
the PDF path and the source ID and
that's it. And then it just triggers it
and it runs and it ingests it. Right
now, same thing for running the query
event. When we run the query event,
what's going to happen is it's going to
return to us the event ID. Now, this is
the thing that's a little bit tricky. If
you want to get the result from an
event, because this is not synchronous,
you need to send a different request
later. So what I mean by that is the
result here is not actually going to be
the result of the LLM call. What it's
going to be instead is just kind of like
some metadata about this event and how
it ran. So what I do is I actually just
return what's going to end up being here
kind of the event ID. And then we have
this code right here which fetches all
of the event runs and then is able to
search through this for the most recent
event that matches the event ID that we
had and then actually get what the
result of that is because again an event
might take you know a day a minute 10
seconds. We don't know how long it's
going to take to run. So what we do is
we kind of run this loop where we're
fetching the events. We're looking for
this particular event ID right? So we're
looking for that event ID and then any
of the runs. We're getting the most
recent run. We're checking its status.
If the status is completed, succeeded,
success, you know, finished, one of
these things, then what we're going to
do is we are going to return what the
output of that is. Otherwise, if it's
failed or canceled, we're going to raise
an error. And if we go past the timeout,
then we're going to say that there was a
timeout. Then we're going to sleep in
between. We're essentially pulling the
endpoint to get the result. This is all
documented extensively in the
documentation for ingest. But
essentially, we just send a request, do
this endpoint. So we get the API base
URL, we send a SLevents SLevent ID
SLruns and then we're good to go. And
this is what the uh base URL will be,
right? Either in the environment
variable or 8288/v1
slash and then this. Okay, hopefully
that makes sense. Again, I'm not going
to explain the entire front end, but I
just want to show you that, you know, we
wrote it. We can use it. So if we go
into another um terminal, we can type
uvun streamlit run and then the name of
this, which is streamllet app.p py when
we do that it should open it up for us
in the browser from here we can just use
this UI right so I can ask a question
like why or let's go this you know why
are graphs
important we can say okay we just want
to retrieve maybe three chunks we can
ask that it's then going to send the
event and generate the answer if we go
to ingest we can see we have a running
event right now right because we sent
that from the front end and then it's
going to take a second because of the
polling how we have it set up on a
little bit of delay and we should get
the result. Okay, so you can see we get
the answer and it tells us kind of the
uh what do you call it? The sources that
it used to get that. Now if we upload a
file, let me do this. Okay, so I just
uploaded a resume here. This is actually
one of our Dev Launch students rƩsumƩs
and uh we did that. You can see it says
triggered ingestion. And if we go back
here, this event just ran. It's very
fast. Ingest PDF. And we can go through
and we can kind of see what happened
here. I don't want to expose this
because it has some personal data in it.
But the point is, you know, you get the
idea. that we embedded, inserted, etc.
In this case, it was just one chunk.
Okay, so that's that. That's the front
end and that's almost everything. The
last thing that I will quickly go
through is a few things that you can add
to the functions for rate limiting and
throttling and some more control. So,
you can do a lot with this, right? I,
you know, kind of barely scratched the
surface of what's possible with this
orchestration tool, but I want to
quickly show you that a nice benefit of
this is how easy it is to add like rate
limiting. So for example if we go to
what is it like overview here you can
see that if we want to add throttling we
can just go throttle is ingest throttle
we can say you know count is two you
know period daytime delta we could
literally just like copy this right go
back to pycharm
let's add this in and now we have
throttling automatically applied to this
function right boom it's there it's
going to work same thing if I go back
here so I could go to flow control here
I could implement concurrency Right. So
you can see we can actually have just
concurrency directly inside of here. We
could add rate limiting. So for example,
let me just copy one of these. Okay. So
we'll just copy that and paste it inside
of here. And you can see now that like
I've just added rate limiting. And for
the rate limiting, you're also able to
add a key. So for something like
ingesting the PDF, uh we may want to
have a key that's actually relative to
the source ID or something. So we could
say key source ID. So now we're only
going to apply that based on this key.
And there's just all kinds of other
things that you can add here to each of
the functions. Then we would just shut
the application down, rerun it, and now
all of a sudden we have a rate limit.
This would mean we can only run this one
time every 4 hours for a given PDF
document. That's it. That's kind of how
this works to make sure that you're not
abusing the functions. There's a lot of
other stuff. Again, I'm not going to go
through everything. We have priority to
bouncing, you know, singleton pattern,
concurrency, all of that kind of stuff,
which is really interesting and makes
this a very powerful tool. Okay, so with
that said guys, that is going to wrap up
this video. I think this is a really
cool application. It goes above and
beyond the kind of basic simple rag apps
where we just do something locally. This
is actually something that we now could
deploy to production. In order to do
that, I would suggest following along
with the ingest documentation. They have
some steps on deployment and essentially
how you kind of configure security and
different applications and how you set
up the environments. It's not something
that I have enough time to cover in this
video, but if you guys do want a video
on it, then definitely let me know and I
can likely team up with INDS again to
get that done. Anyways, that is it. I
hope you guys enjoyed the video. If you
did, make sure to leave a like,
subscribe to the channel, and I will see
you in the next one.
[Music]
Get started with Inngest: https://innge.st/yt-twt-1 š Check out PyCharm, the only Python IDE you need to build data models and AI agents. Download now. Free forever, plus one month of Pro included: https://jb.gg/PyCharm-for-Tim I'll show you how to build an AI RAG application in Python and how to get it ready to deploy to production. I myself have made many AI projects on this channel, you've probably seen a few of them. And while those projects are super fun and cool and you can learn a lot, they're not ready to be deployed into the wild and used in a production environment. That's because they're missing observability, logging, retries, throttling rate, limiting all of the things that you need for a real production grade AI app. DevLaunch is my mentorship program where I personally help developers go beyond tutorials, build real-world projects, and actually land jobs. No fluff. Just real accountability, proven strategies, and hands-on guidance. Learn more here - https://training.devlaunch.us/tim?video=AUQJ9eeP-Ls š Video Resources š Inngest Python Docs: https://www.inngest.com/docs/apps Qdrant: https://qdrant.tech/ LlamaIndex: https://www.llamaindex.ai/ Code in this video: https://github.com/techwithtim/ProductionGradeRAGPythonApp ā³ Timestamps ā³ 00:00:00 | Overview 00:01:21 | Project Demo 00:04:07 | Architecture & Tools Breakdown 00:08:23 | Project Setup & Dependencies 00:11:22 | API Setup 00:12:10 | Inngest Dev Server Setup 00:25:06 | Vector Database Setup 00:36:48 | Loading & Chunking PDFs 00:58:09 | Querying Our VectorDB 01:08:54 | Adding the Frontend 01:13:56 | Rate Limiting, Throttling & Concurrency Hashtags #RAGCoding #AIAgent #Python