In this video, I'll show you how to
build an AI RAG application in Python
and how to get it ready to deploy to
production. Now, I say that because I
myself have made many AI projects on
this channel. You've probably seen a few
of them. And while those projects are
super fun and cool and you can learn a
lot, they're not ready to be deployed
into the wild and used in a production
environment. That's because they're
missing observability, logging, retries,
throttling, rate limiting, all of the
things that you need for a real
production-grade AI app. And in this
video, I'm going to show you exactly how
to get all of those things. And the good
news is it's free. It's very easy to do,
and I'm going to walk you through it
step by step using something called
Inngest. Now, I want to give you a quick
demo of what the finished application
will look like. Then we're going to dive
in, and we're going to start coding
everything out. And to be clear, the
stack that we're going to use in this
video is going to be Python. That's
where we're going to be writing all of
our code. We're going to use Streamlit
for the front end. We're going to use
Qdrant, which is a specific vector
database, which you can run locally,
which I'm going to talk about when we
get to that point. We're going to use
Inngest for all of the orchestration and
observability. And then we're going to
use things like LlamaIndex for
ingesting PDFs. And we'll also use
OpenAI for the AI components. I know
that sounds like a lot, but don't worry,
I will break it down and explain it step
by step. Now let's have a quick look at
a demo. Okay, so I'm on the computer and
on the left-hand side you can see a
simple user interface that's built in
Streamlit. Now what you can do here is
essentially just chat with various PDFs.
This represents a very simple RAG application, which is retrieval-augmented generation. If you're not familiar with
that essentially what it means is you
can upload any file that you want. It
will get vectorized which means turned
into a format that essentially can be
really quickly searched and kind of
pulled in by an LLM. And then we can ask
a question. So I can say something like,
"What does a road map engineer do?" This
is actually in one of the contracts that
I uploaded here for my program Dev
Launch where we're hiring someone to
help us build out road maps for students
and their position is roadmap engineer.
Anyways, we'll wait a second here. It
should actually generate the answer for
us and then it will give us that answer
based on the document that was uploaded.
So you can see it says a roadmap engineer is this, and then it tells us it
pulled it from this source. Now, in this
case, four of them were the same sources
because there's multiple information
from the same source. And then the last
one was this invoice 4, which obviously
didn't give us any relevant data. Okay,
so that's kind of the application we're
going to build. But what you may notice
is on the right hand side of my screen,
I have this really interesting
application open. Now, this is what
ingest looks like. We're running the
local development server and essentially
this gives us insights into everything
that's happening in our application. So,
we can view our app. We can see that we
have two inest functions here. So we
have one for ingesting a PDF and then
for querying the PDF we can look at
them. We can actually manually invoke
them directly from here if we want to
test them out without a UI and then we
can view all of the runs of these
different functions. This is the most
useful part in my opinion because for
example I was testing this earlier and I
had a run that failed. Now the way that
I have this set up is that it's going to
retry five times automatically and you
can see that every retry it's telling us
exactly how long it took to run. We can
click into it. We can view the logs and
see the error that occurred. I can copy
it and kind of dive into that. And then
I can actually rerun this particular
step and test it again. Then we also
have completed runs. So for example, we
can see that we had multiple steps in
this run right here, where we had to do an embed-and-search step and come up with an LLM answer, and we can see the results at
every single step here. So we know
exactly what's going on and we can
really have deep observability into our
application which is super important.
Same thing with actually loading in a
PDF. For example, you can see this whole run took 3.4 seconds. The load-and-chunk step only took 74 milliseconds, actually embedding took a little bit longer at 1.9 seconds, and finalization, where we returned the result, took 1.2 seconds. So that's a quick
demo of the application. Don't worry,
I'll make it all clear as we go through
the tutorial. Now, let's hop over to the
code editor and start building this
project. So before we dive into the code
here, I want to explain to you the
architecture of this project at a high
level and the different components that
we're going to use and why we're going
to use those. Now to do that, let's
start by understanding what RAG is. RAG stands for retrieval-augmented generation. Essentially, all that means
is that rather than relying on the base
model or the base training from
something like, you know, OpenAI's GPT-4
to generate a response for us, we're
going to augment that by actually
passing in additional data to the
prompt. This data is typically going to
be information from like a PDF document
or from our own knowledge store or
something along those lines so that the
LLM can reason based on data that's
relevant to what we're asking it. So in
this case, we're making a PDF RAG
application. So that means we'll be able
to upload any PDF that we want and then
we can ask questions about any of the
information that's inside of those PDFs.
What will happen is when we ask that
question, we'll search in the vector
database for relevant information. We'll
then take that relevant information,
give it to our model by just actually
putting it inside of the prompt, and
then we'll tell the model, hey, reason
based on this information and give us an
answer. So, for example, if we had some invoices and we asked, "How much did I pay Tim?", we would look through all the invoices, take out the relevant information, pass it to the LLM, and say, "Hey, LLM, use this information to answer the question." It would then look at that and say, "You paid Tim $10," or whatever the amount is. Doing that involves having a vector database and some other components.
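To make that flow concrete, here's a minimal sketch of the pipeline we just described. The helpers embed(), vector_search(), and llm_answer() are hypothetical placeholders for the real components we build later in the video.

```python
# A minimal sketch of the RAG flow described above. embed(),
# vector_search(), and llm_answer() are hypothetical placeholders
# for the real components built later in the video.

def answer_question(question: str) -> str:
    # 1. Turn the question into a vector.
    query_vector = embed(question)
    # 2. Find the most similar chunks in the vector database.
    relevant_chunks = vector_search(query_vector, top_k=5)
    # 3. Put the retrieved text directly into the prompt.
    context = "\n".join(relevant_chunks)
    prompt = (
        "Use only the following context to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # 4. Let the LLM reason over the retrieved context.
    return llm_answer(prompt)
```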
Now, as I said, the core of what I want to show you in this video is that we can do this at a production level by using an orchestration tool called Inngest. This is very important because when we actually want to go to production, we need to be able to see what's happening. We need retries, throttling, and rate limiting, and we need observability and logging into what's going on with our AI app. Now, I'm thrilled to say that Inngest has
sponsored this video. Don't worry,
they're completely open source. You do
not need to pay for anything. You don't
even need to make an account to be able
to use this application. And that's
exactly what we're going to do in this
video. We're just going to use what's
called their local dev server. Quickly
going through it, you have all different
kinds of ways that you can effectively
run different steps. And this allows you
to orchestrate, which just means kind of
manage and organize all of the steps
that are happening inside of your AI
application. They have a Python SDK,
which we're going to be using. They also
have a TypeScript and JavaScript SDK,
which is a little bit more popular. And
you can see that this allows us to
actually just have all of the
orchestration immediately ready. So we
can push this into production, see
exactly what's happening, have it
distributed across multiple servers.
It's fault tolerant. We can see
everything that's going on. And you'll
see how much easier this actually makes
development. Okay, that's Inngest. That's the orchestration and the production-ready component. The
next thing is we need a vector database.
Now, a vector database is just something
that's going to store all of our data in
a vector format. Now, the way that this
effectively works is we're going to
convert textual data into this numeric
vector and we'll be able to search these
vectors extremely fast for similarity.
So what that means is that if I type some word, like "color", we'd be able to turn that into a vector and then compare that vector's similarity to all of the other vectors that we have in our database,
which effectively just allows us to
really quickly search through all of the
documents that we're going to be
uploading and find all of the relevant
data so we can give it to the LLM. I'm
not going to go into this in a ton of
depth, but a vector database is just a
really fast database specifically useful
for use with LLMs that lets us search
for similarity. That's typically what we're looking for in documents or pieces of information, so we can pull it out and give it to our LLM. Now, in order to use
the vector database, we're also going to
use something called LlamaIndex. This
is going to allow us to actually load in
a PDF and then to parse the PDF and
essentially turn it into something that
we can pass to Qdrant, which is the
local vector database that we'll be
using. And then the last piece of the
puzzle is we're going to be using OpenAI
for our LLM, so using something like
GPT. We'll also use Streamlit for the
front end, but we'll get to that later.
Okay, so let's close all of that and
let's go into PyCharm. And this is where
I'm going to write all of the code. Now,
you can use any editor that you want,
but I typically do recommend PyCharm
because it is just the best for
Python-heavy projects. I also do have a
long-term partnership with PyCharm, so
you guys can check it out. You can get a
free trial, see if you like it, of the
pro subscription by clicking the link
down below, and you'll see through this
video some of the features that it has
that really does make it nice for these
large projects. Okay, so first things
first, I'm just going to go into full
screen mode. So, let's enter full
screen, and I'm going to start setting
up my dependencies for my Python
project. Now, what I'm going to do is
I'm going to type uv init and then a dot, and I'm going to do this inside of a folder that I've opened. So, I've opened rag-production-app, typed uv init ., and hit enter. It's going to initialize a new uv project for me. All right. Now, we need
a few dependencies here. So, what we're
going to do is type uv add, and I'm going to start listing them out. First, we need fastapi. This is because we're going to write an API that's callable to do the PDF ingestion and the querying of the PDF, essentially the RAG application.
We're going to bring in inngest. Like I said, this is what we're using for the orchestration. We're going to bring in llama-index-core. We're also going to bring in llama-index-readers-file, which is going to allow us to read in a PDF. We're going to bring in python-dotenv, which is going to allow us to load environment variables. We're going to bring in qdrant-client, which is going to allow us to connect to a Qdrant vector database, which again we can run locally. Then we're going to bring in uvicorn, which is going to let us run the FastAPI server. We're going to bring in streamlit, and we're going to bring in openai. Okay, so these are all of the packages that we need. We're going to go ahead and hit enter, and it should install all of those for us. Okay, so that's going to take a second. Now that is done. Perfect.
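For reference, here are the setup commands from this section in one place (assuming python-dotenv is the package behind the dictated "python-env"):

```
uv init .
uv add fastapi inngest llama-index-core llama-index-readers-file \
  python-dotenv qdrant-client uvicorn streamlit openai
```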
Now, what I'm going to do in my application is make a new file, and I'm going to call it .env. This is where I'm going to put an environment variable storing my OpenAI API key, because we're going to use OpenAI for this project. So, let's get it set up right now. We're going to type OPENAI_API_KEY is equal to, and then I'm going to go back to my browser and go to platform.openai.com/api-keys.
This is where I can get an API key. In
order to use this, you will need to have a credit card on file with OpenAI. It
should only cost you a few cents to do
it with this project, but you also are
welcome to use any other LLM that you
want. Uh assuming you know how to do
that, and I have many videos on my
channel that showcase that. So, I'm
going to go make a new secret key. I'm just going to call it rag app and hit create secret key. Then I'm going to copy it, and obviously, don't share it with anyone; that's not a key you want to leak. I'm going to paste it inside of my .env file and then close that file.
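The finished .env file looks like this, with a placeholder where your real key goes (never commit this file to version control):

```
OPENAI_API_KEY=sk-...
```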
Now we're good to continue. First things first, I'm going to create my main.py file. Inside of main.py is where I'm going to write the logic to create an API. What I'm going to do is use FastAPI to make a simple API, and then I'm going to serve some of those API endpoints with Inngest.
The way that Inngest works for our orchestration is that any endpoint we want more control over and observability into, typically one dealing with some AI component, not something basic like adding an event to a database, but the AI operations that could take a long time or require retries, we can wrap in what's called a decorator. It's essentially just a line of Python code, and then Inngest will automatically track everything that happens inside of that endpoint and give us the logs that you saw earlier in the demo. So right now it's not going
to make a ton of sense because I need to
set up a lot of stuff before we can see
the benefit of it. But in the meantime,
we're going to kind of write out or stub
what the API might look like, some of
the functions that we're going to have
inside of here. And then I'll show you
how we connect this to ingest. So while
we're building this application, we can
debug it a lot easier. Again, I know
it's going to be a little bit confusing
when we start. Uh it just requires a
little bit of code before it can start
to become useful. So bear with me here.
So we're going to start writing our imports here. We're going to import the logging module. We're going to say from fastapi import FastAPI, like that. We're then going to say import inngest, and then import inngest.fast_api, because it connects directly with FastAPI. We're going to say from dotenv import load_dotenv. And that is almost everything that we need. We're also going to import uuid, which is for creating unique IDs. We're going to import os, and we're going to import datetime. And we're going to say from inngest.experimental import ai. Let's just put all the Inngest imports together so they're a little bit more organized. Okay. Now we're going to call the load_dotenv function, which is going to load the environment variables inside of the .env file. Now we're also going to
start creating some of our clients. So
we're going to say app is equal to FastAPI(). And above that, we're going to say inngest_client is equal to inngest.Inngest, like that. Inside of here, we're going to give this an app ID; I'm going to explain all of this in a second, so bear with me. We're going to call it our rag app. We're going to say logger is equal to logging.getLogger, and for the logger, we're going to get the uvicorn logger, because we're going to run this with uvicorn. We're going to say is_production is equal to False. This is really important when we're doing this in development mode; we need to make sure we disable production because in production this is going to require a little bit more security to call these Inngest functions, which I will explain later. And we're going to have serializer equal to inngest.PydanticSerializer(). That is because Inngest supports Pydantic typing, and we're going to use some of those type hints and that typing system here in this
video. If you don't know what that is,
essentially this is a really good typing
system in Python that allows us to
essentially define the types of
different variables in this dynamically
typed programming language. Okay. Now
what we're going to do down here is serve the Inngest endpoint. So we're going to say inngest.fast_api.serve, and we're going to pass the app, the Inngest client, and then a list that will include the Inngest functions that we want to serve.
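Putting the pieces from this section together, the main.py scaffolding looks roughly like this. It's a sketch; the exact Inngest SDK surface can vary slightly between versions:

```python
# main.py -- scaffolding for the API, as described above.
import datetime
import logging
import os
import uuid

from dotenv import load_dotenv
from fastapi import FastAPI

import inngest
import inngest.fast_api
from inngest.experimental import ai  # imported now, used later

load_dotenv()  # pull OPENAI_API_KEY in from the .env file

inngest_client = inngest.Inngest(
    app_id="rag_app",
    logger=logging.getLogger("uvicorn"),
    is_production=False,  # dev mode: no signing keys required
    serializer=inngest.PydanticSerializer(),
)

app = FastAPI()

# Serve the Inngest endpoint at /api/inngest; the list will hold
# the Inngest functions we define next.
inngest.fast_api.serve(app, inngest_client, [])
```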
So like I was saying, what we're doing right now is just setting up a normal API using FastAPI. We could go here and say something like app.get, define an endpoint, you know, get notes or something, and just write a normal, standard FastAPI endpoint. We could run the application with uvicorn like we normally would if you've ever used FastAPI before (if not, don't worry about it), and it would just work. That's fine.
However, when we have kind of AI heavy
logic, we want that orchestration on top
of it that I was showing you. So, if
that's the case, then what we're going
to do is we're going to create something
called an Inngest function. Now, when we make an Inngest function, because we have this line right here, Inngest will automatically serve that function for us, and it will connect to the Inngest development server, which I'm going to run in a minute to show you what that means. Effectively, what
will happen is we'll have this server
that now is sitting between our API and
our client. So let's say we have some
front end, right? Someone wants to use
the application. What they'll do is
they'll say, "Okay, I want to upload a
new PDF." Rather than directly sending a
request to our API here, they're going to send a request to the Inngest server. The Inngest server is going to take that request and forward it in the correct format to our API. It's then going to call the Inngest function that we're going to write in just one second, and it's going to go through that process of logging it, retrying it if needed, tracing all of the errors, and giving us all of those benefits. So let's look at a quick
example of how we set up one of those
functions. So we're going to say at, and then this is going to be inngest_client.create_function. Now, when we make the function, we're going to say the function ID (fn_id) is equal to some human-readable name. In this case, I'm going to say rag-ingest-pdf, like that. Now, on the next
line, we're going to specify the
trigger. Now the way that a function is
triggered or called is by some event
being issued to the ingest server. Now
that event can be triggered from a
client, so something like a front end,
or it could be triggered from another
function. So like one of our functions
could call another function and we could
have kind of this large chain of events
that are occurring. So we have an event.
An event typically triggers one or more
functions to run. And there's all kinds
of advanced stuff that you can do with
events and kind of with this flow that
I'm going to show you. So we're going to say inngest.TriggerEvent, like that. For the event, we need to give it a name that we can call from code. So we're going to say rag/ingest_pdf. What I've just said is, okay, whenever this event is triggered, we're going to run this Inngest function. Okay,
now that's all that we need for right
now inside of this decorator, which is
what this thing is called. Beneath here,
we need to define the function. It's
going to be async. So, we're going to say async def rag_ingest_pdf. And inside of here, we're going to take in some context, ctx, and the type of the context is going to be inngest.Context. What we're going to return for right now is nothing, but we will set the return type later. And sorry, this needs to be inngest with two n's. Okay, so now that
we've done this, we've created this
function that's going to be effectively controlled by Inngest and the development
server, which again I'm going to run in
one second. When someone triggers this
event, this function will run. We'll get
the context of the event, which can be
like parameters or data or values that
we want to pass here. And then we can
start doing whatever it is that we want
to do using some of kind of the flow
control inside of Inngest. So if I want to
just make a really simple example, I can literally just return hello world, right? I can return pretty much anything that I want, so long as it's valid JSON or a serializable Python object. So I've made this function now, and when you call it, it just returns hello world. But because it's an Inngest function, we get the benefit of the observability, which I'm going to show you now.
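Here's the shape of that first Inngest function as described, a sketch assuming the SDK names used above:

```python
# A minimal Inngest function: the decorator registers it, and the
# event name is what triggers it.
@inngest_client.create_function(
    fn_id="rag-ingest-pdf",
    trigger=inngest.TriggerEvent(event="rag/ingest_pdf"),
)
async def rag_ingest_pdf(ctx: inngest.Context):
    # Anything JSON-serializable can be returned; it shows up in
    # the dev server's run view.
    return {"hello": "world"}
```

It also has to be registered in the serve() call, i.e. inngest.fast_api.serve(app, inngest_client, [rag_ingest_pdf]); we'll run into that a little later when the dev server complains.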
Okay, so I promise it's all going to start making sense in just one second. To start, we're going to run our Python API. To do that, we're going to type uv run uvicorn main:app. The reason for that is that the name of our file is main.py and the name of our FastAPI application is app.
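That command, for reference:

```
uv run uvicorn main:app
```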
Okay, so I'm going to go ahead and press run. And it says I got an unexpected keyword argument, logging. Okay, so this should not say logging; it should say logger for the Inngest client. So let's quickly fix that, rerun, and hopefully we should be good to go.
And there you go. We can see that our
API is now running. Okay, so the API is
running which means this function is
technically available to be called.
However, as I mentioned, what's going to
happen is we're going to have something called the Inngest server, which is going to control the invocation of this Inngest function. So in order to
have that server running, well, we need
to run it. So what I'm going to do is
run a command on my computer. You'll be
able to run this as well that will run
this local development server. And this
is a huge advantage where this is open
source. You can run it locally on your
own computer. Of course, you can use
their managed solution as well. And if you go into production, you're going to have to deploy some instance of the server. But in this case, locally,
it's very easy. So what we're going to
do is make sure that we have Node.js
installed on our computer. And we're going to type npx and then inngest-client@latest. Then we're going to type dev -u, and we're going to put essentially a link to our API. So if we go here, we can see that our API is running on port 8000. So what we're going to do is write that: we're going to go http://, then 127.0.0.1, which is effectively localhost, and then port 8000/api/inngest. And then we're going to add --no-discovery.
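The full command (with the CLI package name corrected, as I fix a moment later) looks like this:

```
npx inngest-cli@latest dev -u http://127.0.0.1:8000/api/inngest --no-discovery
```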
Okay. What this is doing is running the development server for us, and it's telling the development server, hey, I want to connect to an application running on port 8000. And then I put /api/inngest, which is essentially this: Inngest will serve an endpoint at /api/inngest, and that endpoint will control the Inngest functions. So I'm going to go ahead and run that. And of course, I spelled it incorrectly. It should not be inngest-client; it should be inngest-cli. Let's fix the command: npx inngest-cli@latest dev -u, with nothing else changing other than the fact that this is the CLI tool. And you can see that it's now running, and it tells us that it's running on port 8288.
So what we're going to do is we're going
to open up our browser here and we're
going to go to localhost port and then
we can copy this 8288.
When we do that, you should see
something that looks like this being
served on your computer. This is the
user interface for the Inngest
development server. And what we're able
to do now is if we go to apps, we should
be able to sync with this. However, it's
telling us there's some error here. This
is not synced properly. Uh so let's make
sure that our API is running. Okay, we
can see that we're getting an internal server error, because we didn't actually
put this function that we need to serve
inside of this functions list. So,
excuse me. Let's fix that. We're going to take rag_ingest_pdf and put it inside of there, then save. And then what we can do is just
shut this down and restart our API.
Okay, so now we're actually serving this
function. And now if we go back here,
let's refresh this. Let's close this and delete the old application. And you can see we have this rag app showing up as connected under our apps, and it has one function, called rag-ingest-pdf. We can view the
function. So if we go here, we can take
a look at it. I can open it up and if I
want to, I can invoke the function just
by essentially calling it, right? Or
passing this kind of event trigger. From
invoke, I can pass any data that I want.
In this case, we can just leave it
empty. And then I can press sorry invoke
function. When I do that, we see that a
run now appears under runs. And if I
click on this, we can see exactly what
happens. So it says it took, you know,
35 milliseconds with a 2 millisecond
delay. And if we look at this, we can
see this was the input, right? So this
is the ID. This is the name of the
function. This is the data. So this is
the function ID, you know, all of this
kind of stuff. This is the timestamp.
And if we look at the output, we can see that we got hello: world. Okay. And then we could rerun this
again if we want. Now we have another
run that we can look at. We can see all
of the information related to it. And
this is extremely useful not just for
debugging the application, but for if we
actually put this into production. And
then if we go to events here, you can
see that this event occurred, right? Inngest function finished. And then we also had these functions that were triggered when we invoked a particular function. Okay. All right. So that is a quick example of how we connect Inngest. And now we just
leave this development server running.
So it's just going to keep running the
whole time. And we can keep kind of
turning on and off our API as we make
changes to it. So what we're going to
need to do now, we have this function.
We kind of made the connection. We need
to start actually implementing the rag
features. All right. So let's move on to
our vector database. Now as I said,
we're going to be using Qdrant for this database. And we need to write essentially a database
client that's going to allow us to
create the database, load data, search
data, etc. So, in order to do that, what
we need to do is run Qdrant locally on our own computer. Now, in
order for this step to work, you will
need Docker installed on your machine. I
would highly recommend just downloading
Docker Desktop. Okay. Now what we're
going to do is we're just going to make
a new folder inside of our application
here. And we're going to call this qdrant_storage. Okay. And this is going to be
the volume effectively where we store
the vector database that we're going to
use in our application. Now what I'm
going to do is copy in a command and you
can pause the video and you can type it
out if you want. That's going to allow
us to run Qdrant locally on our own computer. What I'm doing is saying docker run -d for daemon. The name of the container is going to be qdrant; you can name this anything you want, so let's go with qdrant-rag-db or something. It doesn't matter. We're going to say the port that we're running on is 6333, which is the standard port. And then for the volume that we want to attach, we're going to print the working directory; this is going to work only if you are on Windows, and if you're on Mac or Linux, you're going to have to change this command slightly, which I'll talk about in a second. We're going to go /qdrant_storage, which is the folder that we just created, and map it to /qdrant/storage inside the container. And then I'm saying qdrant/qdrant, which is the image that I want to run this container from. Okay, I
know it's a little bit confusing. The
reason why I have to do this print-working-directory thing with the dollar sign is that in PowerShell, relative paths mess up a little bit. Locally, you should be able to just replace this with a dot slash, which references the local directory. Worst case, you can put the full path to where you want this storage to live; it doesn't need to be in the same directory as the application, so it's completely up to you. If you're on Windows, again, go with this: dollar sign and then pwd, and sorry, I think this needs to be inside of parentheses, not braces. If you're on Mac or Linux, go with dot slash. And again, make sure you have this qdrant_storage folder created in the directory that you're referencing.
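So the full command looks like this, with the PowerShell and Mac/Linux volume variants as just described:

```
# Windows (PowerShell)
docker run -d --name qdrant-rag-db -p 6333:6333 -v "$(pwd)/qdrant_storage:/qdrant/storage" qdrant/qdrant

# Mac/Linux (or use a full path for the volume)
docker run -d --name qdrant-rag-db -p 6333:6333 -v "./qdrant_storage:/qdrant/storage" qdrant/qdrant
```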
Okay, so we're going to go ahead and run
this. You shouldn't really see any
output. You should just get some random
string that kind of looks like a hash
here. And then it should just be done.
Now in the background if you open up
docker desktop you should see that there
is now a new container that is running.
So let's wait for this to load for a
second. Okay, and you can see it just loaded. We have qdrant-rag-db, and it's running; we have the container ID, the image, and the ports, and we can control it from here. I'm just going to shut down this other container that's running, because I don't need it active, so let's delete that. Anyways, there we go. Now we have Qdrant running, and we're able to connect to it from our code. So what
we're going to do is we're going to make
a new file here and we're just going to
call this vector_db.py. Okay, we can add that to Git. That's
fine. And inside of here we're going to
write the code that will allow us to
connect to our quadrant database and to
search something in the database. So,
what we're going to do is say from qdrant_client, if we can spell this correctly, import QdrantClient. We're then going to say from qdrant_client.models import VectorParams, Distance, and PointStruct. Okay. And we're going to make a class. We're going to call the class QdrantStorage. We're going to do an initialization, so we're going to say def __init__. We're going to take in self, and we're going to take in some URL; by default, the URL is going to be http://localhost:6333. Then we're going to take in a collection; I'm going to call this documents, and it's going to be the collection where we store the information. And we're going to take in the dimensions, dim, which is going to be equal to 3072.
Okay, now there's a lot of stuff to
explain when it comes to vector
databases. Essentially with this
Qdrant database that we're using, it's very high performance, and we can run it locally. However, realistically in production, you would actually deploy this database out, and then you would probably end up changing this URL so you're connecting to the deployed instance, or you're using Qdrant's managed service. I don't know exactly how that works, but they obviously have their own offering that they'll try to sell you. So, here we can go self.client
is equal to, and then we're going to bring in the QdrantClient. We're going to pass url equal to url, and we're going to say the timeout is equal to 30 seconds, so that if we don't connect within 30 seconds, we essentially crash the program. We're going to say self.collection is equal to collection. So, we store
that as a variable. And then what we're
going to do is create a new collection in our vector database, inside of this qdrant_storage folder. You can see now we have this collections folder, right? We have these aliases and a few other pieces of data. So what we're going to do is say, all right, do we already have a collection with this name? If we don't, create one; if we do, then we don't need to create one. So we're going to say if not self.client.collection_exists, and you can see the autocomplete coming here from PyCharm, and we're going to pass in the name, self.collection. Then we're going to say self.client.create_collection, with collection_name equal to self.collection, and vectors_config equal to VectorParams, where the size of our vector is the dimensions, and the distance is equal to Distance.COSINE. This is
the algorithm, or formula, or whatever
you want to call it for calculating the
distance between different points in our
vector database. Again, without getting
into any advanced linear algebra here,
which is kind of the foundation of how
vector databases work, what we're going
to have is a certain number of
dimensions. This is effectively kind of
the number of values that we have inside
of our vector. And we're going to kind
of turn these text documents into
vectors. We're then going to compare
these vectors against each other using
this distance formula. And vectors that
are closer to each other in this vector
space have kind of similarity. At least
that's what we're hoping for. So we'll
be able to really quickly find those
vectors that are close to us, pull them
out of the database, get their original
text data, and pass that to our LLM. So
that's what we're setting up here uh
with this initialization. Now we're
going to make a function here called upsert, which just means insert or update. It's going to take self, ids, vectors, and payloads. What we're going to do is say points is equal to a list comprehension: PointStruct with id equals ids[i], vector equals vectors[i], and payload equals payloads[i], for i in range of the len of ids. What this is going to do is take all of the associated ids, vectors, and payloads from these three lists and create the point structures we need in order to insert into our vector database. And then we're going to insert: we're going to say self.client.upsert, with collection_name equal to self.collection, and points equal to points.
Okay, so the idea here is that we're going to pass a series of ids, which is a list; a bunch of vectors, which are the vectorized versions with a dimension of 3072; and then the payloads, which are real, human-readable data representing the information that we've vectorized. We're going to convert those three things into point structures, which is just what's required for Qdrant, and insert them into the vector database. Okay, so that allows us to now
add vectors effectively. And the more
important thing is searching for
vectors. So we're going to say def search, taking self, a query_vector, and top_k, which is going to be an int equal to five by default. Now we're going to say results is equal to self.client.search, with collection_name equal to self.collection, query_vector equal to the query vector, with_payload equal to True, and limit equal to top_k. Now, top_k just means we're looking for this many results from the vector database. We're going to look for five results in this case; we could look for two, or 10, or 20. Obviously, the more you look for, the longer this could potentially take, but it's still very, very fast. Okay, then
we're going to say context is equal to
an empty list. And sources is equal to
an empty list. So, the reason we have
these variables is because I need to get
all of the context or information. I
want to store that in one list. And then
I want to get all the sources. So, the
documents essentially that we pulled
this information from. So I'm going to say for r in results, payload is equal to getattr(r, "payload", None) or an empty dictionary. We're going to say text is equal to payload.get("text") or an empty string, and source is equal to payload.get("source") or an empty string.
Okay. Now, what we're going to do here is say if text; so, if there is text, we're going to say context.append(text) and sources.append(source). And actually, this just reminds me: I'm going to convert sources into a set, because I don't want to keep adding the same source over and over again. So rather than sources.append, we're going to say sources.add(source). And then down here, we're going to return the context, and the sources converted to a list. So what this is
going to do is it's going to search our
vector database. It's going to get the
relevant results based on this query
vector which again we'll look at in a
little bit and then we're going to pull
out all of the sources and the context
and return that. Okay, so that's that
for the vector database. Now one thing
to note the way that I've done this is
that we are going to lose which context
is associate with associated sorry with
which source. So if we wanted to we
could change it back and then we would
have kind of the related data based on
the indices. In this case I think it's
fine just to have it as a set. So that's
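Assembled, vector_db.py looks roughly like this, a sketch of what was just described (qdrant-client's API as of recent versions):

```python
# vector_db.py -- a sketch of the Qdrant wrapper described above.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams


class QdrantStorage:
    def __init__(self, url: str = "http://localhost:6333",
                 collection: str = "documents", dim: int = 3072):
        # Fail fast if Qdrant isn't reachable within 30 seconds.
        self.client = QdrantClient(url=url, timeout=30)
        self.collection = collection
        # Create the collection once, with cosine distance.
        if not self.client.collection_exists(self.collection):
            self.client.create_collection(
                collection_name=self.collection,
                vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
            )

    def upsert(self, ids, vectors, payloads):
        # Zip the parallel lists into Qdrant point structures.
        points = [
            PointStruct(id=ids[i], vector=vectors[i], payload=payloads[i])
            for i in range(len(ids))
        ]
        self.client.upsert(collection_name=self.collection, points=points)

    def search(self, query_vector, top_k: int = 5):
        results = self.client.search(
            collection_name=self.collection,
            query_vector=query_vector,
            with_payload=True,
            limit=top_k,
        )
        context, sources = [], set()
        for r in results:
            payload = getattr(r, "payload", None) or {}
            text = payload.get("text", "")
            source = payload.get("source", "")
            if text:
                context.append(text)
                sources.add(source)  # a set, so sources aren't repeated
        return context, list(sources)
```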
So that's our vector database. The next thing that we need is a way to read in a PDF. So I'm going to make a new file, and this new file is going to be called data_loader.py. Now, this is where I'm going to use LlamaIndex to load in PDF documents and
to embed them because what we just did
is we made the vector database which
will allow us to upload essentially
vectors and search for vectors. But the
thing is I still need to create the
vectors. So let's do that now. So I'm
going to say from openai import OpenAI. I'm going to say from llama_index.readers.file import PDFReader. I'm going to say from llama_index.core.node_parser import SentenceSplitter. And then I'm going to say from dotenv import load_dotenv, and call the load_dotenv function again in here. And I'm going to initialize an OpenAI client, so I'm going to say client is equal to OpenAI(). Now
because we've defined this variable
OpenAI API key inside of this file, we
don't need to actually pass anything
else to this client; it
will just automatically find and look
for that variable. And because it
exists, it will essentially allow us to
use OpenAI. Now let me briefly discuss
what we're about to do here. So we're
going to effectively load in a PDF. Now
when we load in a PDF, that could be
very large. It might be, you know, a
thousand pages long. We can't just embed
the entire PDF. And by the way, embed
means just effectively convert it to a
vector so we can store it in the
database. Instead, what we need to do is
chunk it. Chunk it means we need to
break it down into smaller pieces and
then embed those smaller pieces. And the
size of the pieces is relevant here. We
don't want anything that's too big, but
we don't want anything that's too small.
So that we still have a lot of really
relevant data, but we don't have, you
know, massive amounts of data or really
tiny pieces of data that are going to be
hard to search for. So, what I'm doing
here is I'm going to use Llama index to
read in our PDF and then to split all of
the sentences in that PDF into chunks.
We're then going to take those chunks.
We're going to embed them and then we're
going to store that in the vector
database. So, what we're going to say is our embed model is equal to text-embedding-3-large. There are all kinds of models that can embed text for you; in this case, we're using OpenAI, and this is a pretty popular one. And we're going to say the embed dimension is equal to 3072, and we need to make sure that matches what we have inside of our vector database, which is 3072.
Okay, this is effectively how large the
vector is for the text that we're
embedding. Then we're going to say our
splitter is equal to the sentence
splitter. We're going to have a chunk
size. I'm going to say chunk size is
1,000. And the chunk overlap. Now the
chunk overlap is how much of the end of
one chunk is included in the beginning
of another chunk. The reason why you
would have an overlap is because if you have a sentence like "hello world, my name is Tim", let's say we want to split it into two chunks. We might have one chunk that is "hello world" and another that is "my name is Tim". Now, if we had an overlap of, say, one, where one represented a word, then the first chunk would be "hello world my" and the second would be "my name is Tim", where we duplicate the word at the end of one chunk at the beginning of the next, so we don't potentially lose relevant context. In my case, I'm going to go with a chunk overlap of 200. This represents characters, by the way, not words, so that we're able to split it properly.
Anyways, hopefully that makes sense in terms of how it's going to split; it's just going to do sentence splitting for us and create all of these chunks. We're then
going to say def load_and_chunk_pdf, and we're just going to take in a path to the PDF, which will be of type string. Now, we're going to say our documents are equal to PDFReader().load_data, and we're going to say the file is equal to the path. This is just going to look for the PDF and load it. We're then going to pull out just the text; we're not looking at images or anything. So we're going to say d.text for d, where d stands for document, for d in docs if getattr(d, "text", None).
So, what this is going to essentially
say is, okay, we're going to get all of
the text content for every single
document inside our documents. All
right, if this document uh has some text
attribute, because we might have a PDF
that only has images, for example, not
text. We're then going to say chunks is equal to an empty list. We're going to say for t in texts, chunks.extend, and then splitter.split_text(t). And then we're going to return our chunks. Okay, so this is the
chunking process. Again, I don't want to
break it down too much further than
that. Effectively, we're taking the PDF,
turning it into smaller pieces of
textual data, and then we're going to
embed each of those pieces of data. So,
the next function we're going to have is embed_text. This is going to take in our texts, which is going to be a list of type string. It's then going to return, I'll just type it manually here, a list of lists of type float, which is effectively what our vectors are going to look like. We're going to say response is equal to client.embeddings.create; we're going to pass the model, which is the embed model, and we're going to say the input is equal to the texts. We're then going to return item.embedding for item in response.data.
Okay, so what this is going to do is send a request to OpenAI, passing all of the text that we've already chunked. It's going to embed the chunks, which means convert them into vectors, which we can store in the vector database. We're then going to go through the result, response.data, and just pull out the embedding itself; we don't care about any of the other metadata that's included.
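Putting this file together, data_loader.py looks roughly like this:

```python
# data_loader.py -- a sketch of the loading, chunking, and embedding
# helpers described above.
from dotenv import load_dotenv
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import PDFReader
from openai import OpenAI

load_dotenv()
client = OpenAI()  # picks up OPENAI_API_KEY from the environment

EMBED_MODEL = "text-embedding-3-large"
EMBED_DIM = 3072  # must match the dimension used in vector_db.py

# ~1000-unit chunks with 200 of overlap, so chunk boundaries don't
# lose context.
splitter = SentenceSplitter(chunk_size=1000, chunk_overlap=200)


def load_and_chunk_pdf(path: str) -> list[str]:
    docs = PDFReader().load_data(file=path)
    # Keep only the text content; skip image-only pages.
    texts = [d.text for d in docs if getattr(d, "text", None)]
    chunks: list[str] = []
    for t in texts:
        chunks.extend(splitter.split_text(t))
    return chunks


def embed_text(texts: list[str]) -> list[list[float]]:
    # One API call embeds every chunk at once.
    response = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return [item.embedding for item in response.data]
```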
So, that's our data loader, and that's our vector database. Now, we're going to move over to main.py. Okay. Now, I want to be
able to test the ingestion of a PDF
first. So, I'm going to write this
function by using some of the functions
that we just wrote here, and then we'll
kind of continue. And there's a bit more
advanced stuff that we want to get into.
All right. So, let's start by importing some of the stuff that we just wrote. We're going to say from data_loader import load_and_chunk_pdf and embed_text. We're then going to say from vector_db import QdrantStorage, like that.
Now, quickly, I'm also just going to
make a new file here. I forgot to do
this where I'm going to create some
custom Python types. So, I'm going to
call this custom_types.py. The reason for this is that I want these types so I can make my application a little bit more readable, and I can import and use Pydantic, which is supported by Inngest. So I'm going to say import pydantic, and I'm just going to
write some really simple Python classes
that represent some types that I'm going
to use in my app. So I'm going to say class RagChunkAndSrc, and this is going to be a pydantic.BaseModel. Then I'm going to say chunks, which is going to be a list of type string, and source_id, a string equal to None. What this type represents is the result after we chunk and get the source for a particular PDF document. I'm then going to say class RagUpsertResult, if we can spell this correctly, which is going to be the result after we upsert a document. It's a pydantic.BaseModel, and we're just going to say ingested, which is an int representing how many things we ingested. We're then going to say class RagSearchResult; you can guess what this one is, it's for when we're searching for some text. It's a pydantic.BaseModel, and we're going to have the contexts, which is going to be a list of type string, and the sources, which, if we type this correctly, is going to be a list of type string. Then
correctly a list of type string. Then
we're going to have one more. This is
going to be class rag
query
result. So this is different than the
search result. This is the query that
the user is actually sending uh what is
it to the endpoint. So we're going to
say pantic.base base model. We're going
to say the answer is of type string.
We're going to say the sources is list
of string and we're going to say the
number of context is an int. Okay, so I
think that's all we need for our custom
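Assembled, custom_types.py looks roughly like this; the exact field names are my reading of the narration:

```python
# custom_types.py -- a sketch of the Pydantic models described above.
import pydantic


class RagChunkAndSrc(pydantic.BaseModel):
    # The chunks produced from one PDF, plus where they came from.
    chunks: list[str]
    source_id: str | None = None


class RagUpsertResult(pydantic.BaseModel):
    # How many chunks ended up in the vector database.
    ingested: int


class RagSearchResult(pydantic.BaseModel):
    # What came back from the vector search.
    contexts: list[str]
    sources: list[str]


class RagQueryResult(pydantic.BaseModel):
    # The final answer returned to the user.
    answer: str
    sources: list[str]
    num_contexts: int
```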
Let's go back to main. Let's now import these custom types, and then we can use them in this first function. We'll test it, and then we'll move on to the next one. So we're going to say from custom_types import, and then we need to import all of these. What did we have? We had RagQueryResult, RagSearchResult, RagUpsertResult, and RagChunkAndSrc. Okay. So now let's
go into this function that we wrote and
let's start setting it up to actually
utilize Inngest properly and to kind of
perform the steps that we need. So first
things first, what I'm going to do is
kind of explain to you a little bit
about how this works. when we actually
run an Inngest function. As you saw, we have a diagram that looks like this, right? We have Inngest, which is the execution engine, or the local server. We send a request to there, which then sends a request to our API endpoint. The API endpoint then goes to our Inngest function, and inside of the function we can have these things called steps. Each step is an individual operation that we're going to track, that we can retry if needed, and that we're going to observe and get all of the logging and information for. So if I just go quickly over to the overview here, you can see that there are three main things in Inngest. We have the
triggers which we've already kind of
talked about where it's essentially
events that can trigger something to
run, right? It could be a webhook, a cron schedule, or a manual trigger like in our case. We have
flow control, which we're not really
getting into right now, but we will talk
about the concurrency, the throttling
and all of that later on. And then we
have steps. And steps are kind of how we
convert a function into a workflow with
multiple retryable checkpoints. So if
you look at an example right here, we
have this step, right? We're saying,
okay, we're going to run step one, which
is getting data. We're going to wait for
that step to finish, and then we're
going to save the data. Now by wrapping
these different operations in these
steps from Inngest, this allows us again
to have all those advantages of the step
where we can retry it. We can wait for a
step to finish and we can see kind of
what the application's actually doing at
each step. So we have kind of deep
observability into our functions. So I'm
going to show you how you make a step.
But if we go here, you can see running
retryable blocks, pausing execution,
pause for an amount of time or wait for
a specified amount of time. There's a
crazy amount of stuff that you can do
with the steps here, but what we're
going to do is have two steps in our
function. The first step is going to be
for loading the PDF, and then the second
step is going to be for embedding it and
kind of chunking it or not chunking it,
but adding it to the vector database.
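Before we write ours, here's the general shape, a sketch mirroring the docs example described above (get_data and save_data are hypothetical placeholders):

```python
# Each ctx.step.run call is a retryable, individually observed
# checkpoint; the string is the human-readable step name.
@inngest_client.create_function(
    fn_id="example",
    trigger=inngest.TriggerEvent(event="app/example"),
)
async def example(ctx: inngest.Context):
    data = await ctx.step.run("get-data", lambda: get_data())
    await ctx.step.run("save-data", lambda: save_data(data))
```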
So, what I'm going to do is I'm going to
write two internal functions. The first function is going to be _load, and it's going to take in an inngest.Context and return a RagChunkAndSrc. For now, we're just going to go with pass. We're going to have another function, and I'm going to call this _upsert. It's going to take in chunks_and_src, which is of type RagChunkAndSrc, and it's going to return a RagUpsertResult. For now, again, we're going to pass. So the idea
is I have these two individual steps
that I want to run inside of this
function. We need to load and then we
need to add to the vector database. And
we can make as many steps as we want. In
this case it's kind of the logical thing
to do. We could even make more steps if
we want. And if we did that then we
would have obviously an even more
detailed function where we go through
everything. We can see the timing all of
that kind of stuff. So what I'm going to
do down here is say chunks_and_src is equal to, and rather than just calling the load function directly, which is what you would do if you were working in standard Python, we're going to wrap it in a step. The way we do that is we say await, and then ctx.step.run, and then we put the name of the step; we can call this anything human-readable that we want, so I'm going to say load-and-chunk. Then we put the function that we want to call. Because we want to call these functions with arguments, we're going to put a lambda: we're going to say lambda, and then _load, and we're just going to pass ctx, which is this value here, right into the load function. We don't really need to pass it like this, because we could just use it as a global variable inside of the function, but for now, that's how I'm going to do it. We also have the ability here to specify the output type, so I can say the output_type is RagChunkAndSrc, because like I said, this now supports Pydantic. All
right. Now we're going to do the exact
same thing for the next step. We're going to say ingested is equal to await ctx.step.run. This step is going to be called embed-and-upsert, and keep in mind, you can name these steps anything you want; it's mostly for the logging. We're going to say lambda, then _upsert, and we're going to pass the result from the previous step, chunks_and_src, and we're going to put output_type equal to RagUpsertResult. Okay, so this is how we call the steps. And then we can say return ingested.model_dump(). What this does is take our Pydantic model and convert it into a Python dictionary, and that allows us to return it, because these functions need to return something that's serializable. So we just model-dump the Pydantic object, and we're good to go from there. So, that's
kind of how we set up the steps, right?
We're saying, okay, we're going to run
this step. We're going to wait for it to
finish. Then, we're going to run this
step. Now, you could run these steps in parallel, right? You could run them at the exact same time; you don't need to wait for one to finish, because we're doing this asynchronously, so we could just remove the await. We can control the flow however we want. But in this case, I do want to wait for them to finish running, because they will take a second, and I need the result from the first step before I can execute the second. So let's write the contents of the
functions now. So I'm going to say
inside of _load, what we need to do is get our PDF path. So I'm going to say ctx.event.data, and then I'm going to get the pdf_path. I'm then going to say the source_id is equal to ctx.event.data.get("source_id", pdf_path). This is because if I pass a source ID myself, we'll use that; if not, we'll just use the PDF path as the source ID. We're going to say chunks is equal to load_and_chunk_pdf with the PDF path, and then we're going to return RagChunkAndSrc, with chunks equal to the chunks and source_id equal to the source ID. Okay, so that is loading: we're just going to load and chunk the PDF and effectively return that
result. Now, for upserting, we're going to say chunks is equal to chunks_and_src.chunks. The typing is nice here; we know what we're going to be getting from this object. We're then going to say the source_id is equal to chunks_and_src.source_id. We're going to say the vectors are equal to embed_text, taking in the chunks. And
by the way, if we wanted to like we
could convert this into a step. It's not
necessary because it's already kind of a
part of this and this is the long
running operation anyways. But we can
have you know other steps from one of
these steps. We can also trigger another
function from this ingest function and
we can make it you know as complex as we
want. So I'm going to embed that. I'm
I'm then going to generate the IDs, because I need a unique ID for every one of these vectors. So I'm going to say this is str(uuid.uuid5(...)), which creates a unique identifier for us. And sorry, actually uuid5, not uuid4. We're going to use uuid.NAMESPACE_URL as the namespace; don't worry too much about exactly what this is, just bear with me, because it will produce a unique ID. We're going to combine the source_id with i, and we do this for i in range(len(vectors)). Or actually, rather than len(vectors), let's just go len(chunks), although it shouldn't really make a difference.
Now, those are the IDs. The next thing we need is the payloads, because we've generated the vectors, but I need an ID and a payload for every vector. So I'm going to say the payload is {"source": source_id, "text": chunks[i]}, and sorry, this is chunks, not chunk, so it's chunks[i] for i in range(len(chunks)). The source is the source ID, so we know where each chunk came from. Again, what we're doing is looping through all the chunks and collecting the text and the source ID for each one as its payload. So we have an ID, a payload, and a vector for every chunk, and now we're able to pass all of this to the Qdrant store so we can store it there. We're going to say QdrantStorage(), just initializing it, then call upsert and pass our IDs, our vectors, and our payloads.
And we're going to return the RagUpsertResult; for that, we say ingested is the length of, what did we have here, chunks. Okay, so the number of chunks that we actually ended up ingesting. All right, so that's actually it for this function. In theory, assuming I didn't make any mistakes, which is unlikely, this will run, we'll be able to see the results inside Inngest, and we'll be able to upsert everything into the Qdrant database.
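Putting the pieces together, _upsert comes out roughly like this; embed_texts and QdrantStorage are the helpers built earlier in the video, and the exact string used to derive each UUID is an assumption, so treat this as a sketch:

```python
import uuid

def _upsert(chunks_and_src: RagChunkAndSrc) -> RagUpsertResult:
    chunks = chunks_and_src.chunks
    source_id = chunks_and_src.source_id
    # Embed every chunk in one batch call.
    vectors = embed_texts(chunks)
    # Deterministic per-chunk IDs derived from the source and position
    # (the exact "{source_id}-{i}" format is an assumption).
    ids = [
        str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source_id}-{i}"))
        for i in range(len(chunks))
    ]
    # One payload per vector: the originating source plus the raw text.
    payloads = [
        {"source": source_id, "text": chunks[i]}
        for i in range(len(chunks))
    ]
    QdrantStorage().upsert(ids, vectors, payloads)
    return RagUpsertResult(ingested=len(chunks))
```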
So, right now, let's look at what we have. We have the Inngest dev server running, and our API server is currently running too, but we need to stop it and restart it, because we made a change here and it's not running in reload mode. It's going to take a second to load up because it's connecting to the database. Okay, we can see that it's all good. What we're going to do now is go back to our development server. We'll go to apps and just make sure it's connected; looks like it is. Let's go to our functions and view them. I actually want to test invoking this, so I'm going to press Invoke, and for my data, I'm going to pass my PDF path.
And I need to pass a valid PDF path, so what I'm going to do is go and get the absolute path to a PDF from my documents and paste it inside of here. Let me do that now. Okay, so I just have a path here. I had to escape the backslashes, because the string won't allow them otherwise, due to how backslashes act as escape characters. This is just one of the resources that we actually have for Dev Launch, where we have kind of a DSA roadmap. Anyway, I'll just go and press Invoke Function. When I do that, we see a new run is triggered. It looks like it's just waiting for this to run, so it says running. And then, I guess there was an error: state finished. Okay, I don't know what that means. Ah, FileNotFoundError. So you can see we had a FileNotFoundError, so clearly I made a mistake there. It's going to keep attempting this a few times based on how the default retry settings are set up, so what I can do is just cancel that for now. And that's the whole point of having this, right? We can debug it.
I'm going to go to Functions, go back here, go to Invoke, and I'm now just going to put in a new file. And it looks like it actually wasn't graphs 2, it was graphs 12; I'm just looking in my file explorer. So, let's try it. Now we have this new run. We can see load_and_chunk looks to have worked pretty fast, 48 milliseconds; we get two chunks here because of how much content was in this document. And then we have embed_and_upsert, and we can see that we ingested two chunks. That took a little bit longer, since it had to go into the vector database, whereas loading and chunking was very fast, right? The embedding required us to call out to OpenAI, so it takes a second to get the result. And then we have finalization down here. Cool. So that's also just a really nice way to test this, from this UI, rather than having to build the front end first.
What I'm going to do now is move on to the next function, which is going to allow us to actually query our PDFs. So let's make another Inngest function. Here we're going to say @inngest_client.create_function. For the function, we're going to give it an ID, so we say the function ID is "rag-query-pdf"; okay, that can be the name. Then we're going to say the trigger is inngest.TriggerEvent, and for the trigger event we'll go with "rag/query_pdf_ai", because we're using AI to do this (we could do it without AI as well). We're then going to say async def rag_query_pdf_ai, taking in our context, which is the inngest.Context, like before. And what we're going to do now is start setting up essentially everything we need to query the PDF. The first thing we're going to have is a helper function; I'm going to call this _search. For _search, we take in a question, which is a string, and the top_k, which will be an int and by default equal to five.
Okay. Now, inside of here, we're going to say our query vector is embed_texts with a list containing our question, and then we pull out index zero. The reason for this is that if I want to query my database, I need to do it with a vector; whatever question the user asked, I need to embed it so it's in the same format as everything in the vector database. Because embed_texts normally takes in a long list of texts, we just pass a one-element list and take out the first result. So that's our query vector. Then we say our store is QdrantStorage(), and found = store.search, passing our query vector and our top_k; we don't need to pass top_k as a keyword argument. We then return our RagSearchResult, where context is found's context and sources is found's sources, and we can also add the return annotation here, which is RagSearchResult.
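Here's a sketch of the full _search helper, assuming store.search returns a dict with "context" and "sources" keys (it may equally be an object with attributes, depending on how the storage helper was written earlier):

```python
def _search(question: str, top_k: int = 5) -> RagSearchResult:
    # Embed the question so it lives in the same vector space as the chunks;
    # embed_texts expects a list, so wrap the question and take the first vector.
    query_vector = embed_texts([question])[0]
    store = QdrantStorage()
    found = store.search(query_vector, top_k)
    return RagSearchResult(context=found["context"], sources=found["sources"])
```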
Now, if we go down here, we've built out the first step, but there is more that we need to do. First, we're going to get the question from our event data, so we say ctx.event.data and pull out the question. We're also going to pull out the top_k, because this is something we allow the user to pass; we say ctx.event.data.get("top_k", 5), and we also convert it to an int in case it comes through as a string. Okay, now we say found = await ctx.step.run. We're going to run the step, which we'll call "embed_and_search", and same as before, we call our function through a lambda: _search with the question and the top_k. And let's specify that the output type is RagSearchResult. (I don't know why it's giving me that autocomplete; I didn't want that.) All right, so we have the output type RagSearchResult; let me make this left side a little smaller. We're running this as a step now, so again, it will be retriable and we'll get all of those benefits.
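In code, that part of the function body is just a few lines; the event field names follow the ones used in this video:

```python
question = ctx.event.data["question"]
# top_k is user-supplied, so coerce it in case it arrives as a string.
top_k = int(ctx.event.data.get("top_k", 5))

found = await ctx.step.run(
    "embed_and_search",
    lambda: _search(question, top_k),
    output_type=RagSearchResult,
)
```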
Now, I'm just going to copy in the prompt that I'm going to use for the LLM, because once we find the information, we need to pass it in a prompt. So I'm going to build a content block and join all of the context, all of the snippets we found, like this, where I have a dash and then the snippet, for c in contexts, and I combine them with two newline characters. I know this looks a little weird, but I'm just taking all of the context in a list and converting it into a single string. Then the user content, which is essentially what I want to ask, the prompt, is going to be: use the following context to answer the question; here's the context; here's the question; answer concisely using the context above. Okay, that's the prompt that I want.
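A sketch of that prompt assembly, with the wording paraphrased from what's described above:

```python
# One dash-prefixed line per retrieved snippet, separated by blank lines.
context_block = "\n\n".join(f"- {c}" for c in found.context)

user_content = (
    "Use the following context to answer the question.\n\n"
    f"Context:\n{context_block}\n\n"
    f"Question: {question}\n\n"
    "Answer concisely using the context above."
)
```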
Now, what I'm going to do is use an adapter from Inngest to actually call the AI model. So I'm going to say the adapter is ai.openai.Adapter. For the adapter, I need to pass my authorization key, which is os.getenv("OPENAI_API_KEY"); I need to pass it manually here because we're not using the OpenAI client anymore, we're using the adapter from Inngest. I'm going to say the model is "gpt-4o-mini"; I'm just going to use a small one because I don't want this to be super expensive. And then I'm going to generate a response.
Now, by using this AI adapter and generating the response with the method you're about to see, we get the same benefits as with the step function: it will automatically be retried, and it will handle the throttling and the rate limiting. As you probably know, when you call these LLM providers, a lot of errors can pop up. So we're going to say the response, or res, is await ctx.step.ai.infer. This is an AI inference; it has a special kind of syntax inside Inngest. We're going to call this step "llm-answer", so again, it will be observable to us.
We're going to pass the adapter, which is the adapter we just wrote, and we're going to pass the body. For the body, we pass max_tokens equal to 1024 and the temperature equal to 0.2; the temperature is essentially how random the model is going to be, so we want it pretty low. (I'm spelling "temperature" completely incorrectly, so let's fix that spelling.) And then we're going to have the messages, which is a list. For the messages, the first one has the role "system", and the content, once we fix the typing here, is going to be a system message. I'm just going to paste in a simple one; you can make this as detailed as you want. I'm going to say: you answer questions using only the provided context, to make sure it really isn't going off the rails and is only using what we provided. Then we have role "user", and for its content we pass, again fixing the typing, what did we call it up here? The user content.
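Here's roughly what that inference call looks like assembled; the adapter import path and the step.ai.infer signature follow Inngest's Python docs, but double-check them against your SDK version:

```python
import os
from inngest.experimental import ai  # adapter location may vary by version

adapter = ai.openai.Adapter(
    auth_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-mini",
)

res = await ctx.step.ai.infer(
    "llm-answer",
    adapter=adapter,
    body={
        "max_tokens": 1024,
        "temperature": 0.2,  # keep the model fairly deterministic
        "messages": [
            {
                "role": "system",
                "content": "You answer questions using only the provided context.",
            },
            {"role": "user", "content": user_content},
        ],
    },
)
```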
Okay, so that is going to generate the response for us. After the response, we need to get the answer, so we're going to say the answer is res["choices"][0]["message"]["content"]; this is the default response format from OpenAI, and because it can return multiple choices, this is the way we need to index into it. Then we call .strip() to strip off any leading or trailing whitespace. Then what we do is return: the answer is the answer; the sources, found.sources; and then num_context, where we'll just pass the number of context snippets, len(found.context).
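And the parse-and-return step as a sketch:

```python
# res follows OpenAI's chat-completions shape.
answer = res["choices"][0]["message"]["content"].strip()

return {
    "answer": answer,
    "sources": found.sources,
    "num_context": len(found.context),
}
```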
Okay, so that should pretty much be it for this function. Again, what we're doing: we have the search step, where we're embedding the query, the thing the user asked, and then searching for results in the vector database; we run that as a step. We create a simple prompt saying, okay, here's the information, here's the question. We create an AI adapter, pass that to the inference step with our messages, and then we just parse the response and return it. It's really not too complex, but because we're doing it with Inngest, I wanted to explain it a bit more in depth.
Okay, now that we have that, we can restart the server again, and we should be able to actually run this inference. If we go back, we can refresh, and we now know that we have one document that we added, the one about graphs. Actually, let me open the PDF so you can see what it looks like. Something like this, right? We have some LeetCode problems and some basic information, because in Dev Launch we recommend people do certain LeetCode questions on certain days based on our DSA roadmap, and there's a bunch more here. So I'm going to ask it something like, what is the importance of graphs? I don't know, we'll come up with something. But if I go to Functions now... oh, we only have one function. Okay, so let's go back here; my apologies, guys. Let's shut this down, close that, and add this function, rag_query_pdf_ai, to the list, because we need to serve the function, so we've got to add it into that serve list. Okay, let's rerun the application.
And, apologies, let's go back. Let's go here, refresh, and now we see another function appearing. So let's go to query PDF, then Invoke. I think this time we just need to pass top_k and question. So we'll say the question can be something like: what is the importance of graph problems? And let's see; I don't know if that's going to give us an answer, but let's try it. Okay, so it's running, and we can see the embed_and_search step ran, then the LLM answer and finalization, and then it gave us the answer here. It said graph problems are important because they help build comfort with fundamental concepts such as modeling problems as graphs, performing DFS and BFS traversals, and managing state and recursion; they also enhance, blah blah blah. This is a direct quote out of that PDF that I showed you, and it shows that this is the context it looked at. You can see there's another document that I uploaded as a test, although it didn't actually use anything from it; it just pulled that context in, and the answer came out of the graphs document. And if we look through the other steps, we can see all of the input and output, everything that was going on, which shows the value of that observability.
Cool. Okay, so that worked. We now have the two functions, and to be honest, at this point the application is pretty much done; we've built a RAG query application. But I do want to show you a few other nice features that Inngest has that we can take advantage of, as well as how we can add a custom front end to this. For the front end, I'm actually not going to code it from scratch with you, because that would take a little bit of time. I'm going to leave a link in the description to all of the code in this video. In that link, you'll see a Streamlit file; it's going to be called streamlit_app.py. You can just copy it and paste it into this project, and it will be a functioning front end for you to use with this application. So let me show you what I mean. I'm going to make a new file and call it streamlit_app.py. Now, I'm just going to copy this directly from my other monitor, because I wrote it before the video for the demo, and paste it inside of here. You guys can find it from the GitHub link in the description; literally just copy it and paste it in. I'll quickly step through some of the code in terms of calling our functions so you see how that works, and then this will just be a functioning front end. I'll show you how to run it.
Effectively, this is a Streamlit application, a really nice UI where you can just build things in pure Python. You can see that we import inngest here as well, and we create an Inngest client. It's really important that we specify the correct app ID and that we don't mark this as production, because if it's in production, we need to pass something called an event key, which you can generate on Inngest and which is a little more complicated to set up, for security reasons, if you're going to deploy this. We have the ability to save an uploaded PDF, so we store it, and then we have things like sending the RAG ingest event. Like I was saying, if you want to trigger one of the functions to run on Inngest, all you have to do is get the client and send an event. So we send the event, rag ingest PDF, just like we were doing from the dev UI; we pass the PDF path and the source ID, and that's it. It just triggers the function, it runs, and it ingests the file.
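For reference, triggering a function from the front end is just an event send. The app ID and event name here mirror the assumed names used above, and saved_path is a hypothetical variable for wherever you stored the upload:

```python
import inngest

client = inngest.Inngest(app_id="rag_app", is_production=False)

# Fire-and-forget: this only enqueues the run and returns event IDs.
client.send_sync(
    inngest.Event(
        name="rag/ingest_pdf",
        data={"pdf_path": saved_path, "source_id": saved_path},
    )
)
```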
Right now, it's the same idea for running the query event. When we run the query event, what comes back to us is the event ID. Now, this is the part that's a little bit tricky: if you want to get the result from an event, because this is not synchronous, you need to send a different request later. What I mean by that is that the return value here is not actually going to be the result of the LLM call; instead, it's just some metadata about this event and how it was sent. So what I do is return the event ID, and then we have this code right here, which fetches all of the event runs, searches through them for the most recent run matching the event ID we had, and gets its actual result, because an event might take a day, a minute, ten seconds; we don't know how long it's going to run. So what we do is run a loop where we fetch the runs for this particular event ID and grab the most recent one, checking its status. If the status is completed, succeeded, success, finished, one of those, then we return its output. Otherwise, if it's failed or cancelled, we raise an error, and if we go past the timeout, we say there was a timeout; in between attempts, we sleep. We're essentially polling the endpoint to get the result. This is all documented extensively in the Inngest documentation, but essentially we just send a request to this endpoint: we take the API base URL and hit /events/{event_id}/runs, and we're good to go. The base URL comes either from an environment variable or defaults to the local dev server on port 8288 with the /v1 prefix.
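A condensed sketch of that polling loop, using the dev server's REST API as described; the endpoint shape follows the Inngest docs, but the status names and timing here are assumptions to tweak to taste:

```python
import os
import time
import requests

API_BASE = os.getenv("INNGEST_API_BASE_URL", "http://127.0.0.1:8288/v1")

def wait_for_output(event_id: str, timeout: float = 60.0) -> dict:
    # Poll the runs endpoint until the run triggered by this event finishes.
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(f"{API_BASE}/events/{event_id}/runs")
        resp.raise_for_status()
        runs = resp.json().get("data", [])
        if runs:
            run = runs[0]  # take the first run returned for this event
            status = str(run.get("status", "")).lower()
            if status in {"completed", "succeeded", "success", "finished"}:
                return run.get("output", {})
            if status in {"failed", "cancelled", "canceled"}:
                raise RuntimeError(f"run ended with status {status!r}")
        time.sleep(1)  # don't hammer the endpoint
    raise TimeoutError("timed out waiting for the run result")
```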
Okay, hopefully that makes sense. Again, I'm not going to explain the entire front end; I just want to show you that we wrote it and we can use it. So if we go into another terminal, we can type uv run streamlit run and then the name of the file, which is streamlit_app.py. When we do that, it should open up in the browser, and from here we can just use the UI. So I can ask a question like: why are graphs important? We can say we just want to retrieve, maybe, three chunks, and ask. It's then going to send the event and generate the answer. If we go to Inngest, we can see we have a running event right now, because we sent that from the front end, and then it's going to take a second, because of how we have the polling set up on a little bit of delay, and we should get the result. Okay, so you can see we get the answer, and it tells us the sources it used to get that. Now, if we upload a file... let me do this. Okay, so I just uploaded a rƩsumƩ here; this is actually one of our Dev Launch students' rƩsumƩs. You can see it says the ingestion was triggered, and if we go back here, this event just ran; it's very fast. Ingest PDF. We can go through and see what happened here. I don't want to expose this, because it has some personal data in it, but the point is, you get the idea: we embedded, upserted, and so on. In this case, it was just one chunk.
Okay, so that's that. That's the front end, and that's almost everything. The last thing I'll quickly go through is a few things you can add to the functions for rate limiting, throttling, and some more control. You can do a lot with this; I've barely scratched the surface of what's possible with this orchestration tool, but I want to quickly show you that a nice benefit is how easy it is to add something like rate limiting. For example, if we go to the flow-control overview in the docs, you can see that if we want to add throttling, we can just set throttle to an inngest.Throttle, saying, you know, count is two, with a period as a datetime timedelta. We could literally just copy this, go back to PyCharm, and add it in, and now we have throttling automatically applied to this function. Boom, it's there; it's going to work. Same thing if I go back to the docs: under flow control,
we could implement concurrency, and you can see we can have concurrency directly inside of here. We could also add rate limiting. For example, let me just copy one of these; we'll copy that and paste it inside of here, and you can see that now I've just added rate limiting. For the rate limiting, you're also able to add a key. For something like ingesting the PDF, we may want a key that's actually relative to the source ID, so we could say the key is the source ID; now the limit is only applied per key. There are all kinds of other options you can add to each of the functions. Then we would just shut the application down, rerun it, and all of a sudden we have a rate limit. This would mean we can only run this function one time every 4 hours for a given PDF document. That's it; that's how this works to make sure the functions aren't being abused.
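As a sketch, these flow-control options attach straight onto the function decorator; Throttle and RateLimit exist in the Python SDK, though the exact key expression here is an assumption:

```python
import datetime

@inngest_client.create_function(
    fn_id="rag-ingest-pdf",
    trigger=inngest.TriggerEvent(event="rag/ingest_pdf"),
    # At most 2 runs per minute across all ingestions.
    throttle=inngest.Throttle(count=2, period=datetime.timedelta(minutes=1)),
    # Only one ingestion per source PDF every 4 hours.
    rate_limit=inngest.RateLimit(
        limit=1,
        period=datetime.timedelta(hours=4),
        key="event.data.source_id",  # assumed key expression
    ),
)
async def rag_ingest_pdf(ctx: inngest.Context):
    ...
```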
There's a lot of other stuff; again, I'm not going to go through everything. There's priority, debouncing, the singleton pattern, concurrency, all of that kind of stuff, which is really interesting and makes this a very powerful tool. Okay, so with that said, guys, that is going to wrap up this video. I think this is a really cool application. It goes above and beyond the kind of basic, simple RAG apps where we just do something locally; this is actually something that we could now deploy to production. In order to do that, I would suggest following along with the Inngest documentation: they have some steps on deployment and essentially how you configure security and different applications and how you set up the environments. It's not something that I have enough time to cover in this video, but if you guys do want a video on it, then definitely let me know, and I can likely team up with Inngest again to get that done. Anyway, that is it. I hope you guys enjoyed the video. If you did, make sure to leave a like, subscribe to the channel, and I will see you in the next one.
Get started with Inngest: https://innge.st/yt-twt-1
Inngest Python Docs: https://www.inngest.com/docs/apps
Qdrant: https://qdrant.tech/
LlamaIndex: https://www.llamaindex.ai/
Code in this video: https://github.com/techwithtim/ProductionGradeRAGPythonApp