Hey everybody, my name is Jay Gordon. Welcome back to the Azure Cosmos DB global user group. It is September. Wow, the summer went by way too fast for me. I am still recovering from a very fun summer: I got to take some amazing trips and did some really fun stuff. And I can't begin to tell you how much I'm looking forward to the Halloween season coming up very soon. More than anything, I'm also excited for Microsoft Ignite, coming in November, so be prepared. We're going to get started in just a minute, but I want to remind you all of a few things, and I have some help from my friend Anna from the Reactor, who's going to do a little housekeeping. So let's take a listen.
>> Hey everyone, thanks for joining us for our next live session. My name is Anna. I'm a producer for Reactor, joining you from Redmond, Washington. Before we start, I do have some quick housekeeping. Please take a moment to read our code of conduct. We seek to provide a respectful environment for both our audience and presenters. While we absolutely encourage engagement in the chat, we ask that you please be mindful of your commentary: remain professional and on topic. Keep an eye on that chat; we'll be dropping helpful links and checking for questions for our presenters to answer live. Our session is being recorded and will be available to view on demand, right here on the Reactor YouTube channel. With that, I'll turn it back to Jay. Thanks again.
>> Thank you so much, Anna. You are the best. We love the Microsoft Reactor. Make sure you go spend some time and find out about all the other things they do. You can go to aka.ms/Reactor; I'll put it in the chat in a little bit. Before we bring in our guest, here's a cute little video that came out recently. Let's take a look at it.
>> If you're diving into Azure Cosmos DB, you absolutely need to check out the Azure Cosmos DB samples gallery. It's a treasure trove of resources that can supercharge your development journey. You'll find everything from blogs and presentations to detailed documentation and engaging videos. Whether you're coding in .NET, Go, Java, JavaScript, Python, or any popular language, there are samples tailored just for you. And if you're curious about generative AI, the gallery has you covered with tons of material on agents, interactive chat, MCP, and the RAG pattern. This is your go-to source for patterns and content that can help you harness the full power of Azure Cosmos DB. So dive in and explore all the amazing resources waiting for you.
[Music]
I like that. It's just a little thing to get you started. All right, we've got so much more to do today. I've got an amazing guest, and we're going to be talking about more AI. We are all talking about AI: we're talking about it when it comes to building our applications, and we're talking about it around things like Microsoft Fabric. There's so much happening around AI right now, and I needed someone special, someone who's been doing amazing things in this part of our technology world and who's also really, really great. That person is Farah Abdou. Farah, make sure you unmute yourself. Hi, welcome. How are you today?
>> Hi Jay. Hi everyone.
I'm good actually.
>> So let everybody know: I'm personally in Brooklyn, New York. I love being in New York City. It's my heart, it's my love. Tell everybody where you're from, Farah.
>> I'm based in Egypt right now. It's 9:00 PM here, so good night to everyone.
>> Oh, thank you so much for giving us some of your evening. I appreciate it, and I know our audience does. And I want to remind my audience that you can find out so much great stuff about Farah by just going to her LinkedIn; it's right there below, and I'll also put it in the chat. You can find more information about how cool Farah is and get information about today's session, and you can ask some questions. You may not be sure about something, or you may not get a question answered live, but there is the amazing GitHub repo; there's a link for it, and you'll see why that's important today. You can always take a look at the GitHub repo if you've got questions; I know Farah will be able to help you out. But Farah, before we jump in: first of all, you are an AI engineer, is that right? Or an ML engineer with Sparkable?
>> Yeah, I'm working with Sparkable as an ML engineer, but basically I'm an AI engineer.
>> Fantastic. It takes a lot to be that person right now. The world is filled with so many AI engineers, so many people making use of this technology. How did you get to this point? You've told me your story personally, but someone watching this may be meeting you for the first time. So tell me a little bit about how you became an AI engineer and ended up at this part of your career.
>> Yeah, it started with my interest in how to make things work smarter: making chatbots, making intelligent systems, building smarter tools. That's what interested me, and that's how I jumped into the AI field many years ago. Since then I have specialized in the NLP field: how chatbots are built, how we can improve them, how we can build AI agents, how we can connect them to databases, and how to bring some benefit to society with our knowledge. So this is how I got here.
>> Fabulous. Fabulous. Well, before we jump into your session, let's say some hellos to the people who are watching. I know they're super excited to see everything you have to show them. There's Robbie from San Diego, California. We've got Aquarist 123 from East Anglia in England. We've got Kuma; hi, thank you for watching all the way from Canada. We've got Muhammad from Qatar; hi, welcome, Muhammad, thank you for joining us. A little more local for me, at least, Jonathan is dialing in today from Chicago. So we've got a global audience. We've even got our friend Gabor from Budapest. And I'll give you one more: Desh from Germany. So, Farah, we've got a global session today, with people from all over who want to hear what you have to say. With that, Farah, why don't you get your presentation ready? I'll bring that up, and then I'm going to give you the floor. We're going to be here for the next hour or so; it's 2:07 here, and we'll be here till 3, so about an hour. Please, while you're watching, ask your questions and put them in the chat. We'll also drop some links and all that stuff, and we'll get to your questions at the end of the session. So please put your questions in the chat, and I'll make sure this wonderful presenter gets them. I'm going to bring her up now and hand her the floor. Everybody, let's be respectful in the chat, ask questions, and more than anything, let's learn something from Farah today. Thanks, Farah. Why don't you get started?
>> Thanks, Jay. All right, let's start, everyone. So today we are basically talking about building scalable LLM evaluation pipelines, with a focus on Azure Cosmos DB. I will be focusing on RAG evaluation, on semantic caching, and on the scalable architecture. All right, so let's start with the agenda for today. This is a hands-on session. We will start with some introduction and fundamental knowledge, then a system architecture deep dive. I will go through a walkthrough implementation, where you will see the code for yourself, and we will do some hands-on exercises. I will talk about some advanced and scaling topics, and then we will jump into the Q&A at the end of the session.
Our objective today is that you will be able to understand how LLM evaluation metrics work, how we can use them in our production environments, and how we can use Azure Cosmos DB to build scalable evaluation pipelines. We will also explore multimetric evaluation approaches like RAGAS, ROUGE, and semantic similarity. And we will get some hands-on experience with a real-world pipeline and some examples; you will see it yourself.
So, here is what we will need for our session. You will need some credits on your Azure subscription. We will need the Azure OpenAI Service, where we'll use the API, and we will need Azure Cosmos DB, where we will mainly use the NoSQL API. Your environment should be equipped with Python 3.8 or anything newer than that. You will need the packages azure-cosmos, openai, pandas, numpy, nltk, and rouge-score; all of these will be needed in our code. The code editor you can choose yourself; it's not a big problem. You can use VS Code, PyCharm, Jupyter Notebook, Cursor, anything. And of course you will need to be familiar with Python, with RAG concepts, and with LLM and evaluation fundamentals, because we will not talk about those too much. We'll jump straight into our topic.
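As a reference point, here is a minimal environment setup matching the packages listed above. The .env variable names are illustrative assumptions, not necessarily the ones used in the session's repo.

```python
# Install the session's packages first (shell):
#   pip install azure-cosmos openai pandas numpy nltk rouge-score python-dotenv

import os

from dotenv import load_dotenv  # python-dotenv, so keys stay out of the code

load_dotenv()  # reads key=value pairs from a local .env file

# Hypothetical variable names; rename to match your own .env file.
AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]
COSMOS_ENDPOINT = os.environ["COSMOS_ENDPOINT"]
COSMOS_KEY = os.environ["COSMOS_KEY"]
```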
To give an intro into why evaluation matters: our businesses need to make decisions, and the LLM's outputs directly impact us; our customers' experiences will be impacted as well. We also have critical operations we need to protect, organizational risk we need to manage, and content we need to deliver to our end users successfully. That's why we need evaluation.

Our key challenges for evaluation: you will notice that if you give an LLM a prompt, it gives you a different output every time. We don't want that unpredictability, so that's one of the challenges. We also have hallucinations: how do we detect factual inaccuracies? We also need to measure accuracy, relevance, and coherence; all of these need to be measured. And we need our production evaluation to scale.

From this we can see that traditional metrics like BLEU alone are insufficient for our RAG system. We need some specialized evaluation approaches. That's why we are here.
If any of you don't know about RAG: it's retrieval-augmented generation. The user has a query, a question or a request, and gives it to the LLM; the LLM retrieves what it needs from the relevant documents or context, and then it generates the answer from that context. So we have documents stored somewhere, in a database or something, and RAG is how we deal with this. RAG reduces hallucination by grounding in facts. It provides up-to-date information beyond the training data, and it enables domain-specific knowledge without any form of retraining. What the evaluation needs to cover is the retrieval accuracy and relevance, the answer faithfulness to the retrieved context, and the response quality.
So, as I have mentioned, we will not stop at metrics like BLEU or ROUGE; we need to go beyond them. Traditional metrics fall short for RAG systems, which require more context-aware evaluation. Let's talk more about ROUGE and BLEU, for example: they focus on text similarity, but they miss factual accuracy and context elements. We also have perplexity, which measures the statistical likelihood of the text, but not its correctness or usefulness. So the key limitations, as you can see, are that there is no factual accuracy assessment, context utilization is not measured, and retrieval quality is ignored.
On the RAG-specific metrics: we will talk a lot about faithfulness today, which measures whether the responses are grounded in the retrieved context, so we can actually prevent hallucinations. You can see that a lot of companies now do research on hallucination and how to decrease its impact, so it's important to know about these metrics. The relevance metrics evaluate both the answer's relevancy to the query and the context's relevancy to the information I need. This is very important, and this is how RAG can be a valuable tool for us. There is also context quality, where we measure precision and recall; this is a very fundamental thing that we need in our metrics. So our pipeline will mainly use the RAGAS framework, which measures these specialized metrics automatically and at scale.
So what is the RAG evaluation framework? We have a query, a context, and an answer. Let's focus on context relevancy first: this is how relevant the retrieved context is to the query. Then there is context precision, the proportion of retrieved passages that are relevant. We also have context recall, which is about coverage of the information needed for the answer. Faithfulness is very important here, as it's the answer's grounding in the retrieved context. And last we have answer relevancy, which is how well the answer addresses the query. So this is a comprehensive evaluation framework, specifically designed for RAG systems.
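To make those five metrics concrete, here is a simplified sketch of how they can be approximated with embedding cosine similarity. This is not the actual RAGAS computation (RAGAS decomposes text into statements and uses an LLM judge); it's an illustration of the inputs and outputs, and the 0.5 relevance threshold is an assumption.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def rag_metrics(query_emb, context_embs, answer_emb, expected_emb):
    """Rough embedding-based approximations of the five RAG metrics.

    query_emb / answer_emb / expected_emb: vectors for the query, the
    generated answer, and the ground-truth answer; context_embs holds one
    vector per retrieved passage.
    """
    ctx_query_sims = [cosine(query_emb, c) for c in context_embs]
    return {
        # How relevant the retrieved context is to the query.
        "context_relevancy": float(np.mean(ctx_query_sims)),
        # Proportion of retrieved passages that look relevant (threshold assumed).
        "context_precision": float(np.mean([s > 0.5 for s in ctx_query_sims])),
        # Does the context cover what the expected answer needs?
        "context_recall": max(cosine(expected_emb, c) for c in context_embs),
        # Is the answer grounded in the retrieved context?
        "faithfulness": max(cosine(answer_emb, c) for c in context_embs),
        # How well the answer addresses the query.
        "answer_relevancy": cosine(answer_emb, query_emb),
    }
```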
And of course, a quick intro to Azure Cosmos DB. Azure Cosmos DB is a fully managed, globally distributed NoSQL database service designed for scalable applications; that's why we are using it today. We need scalable applications, and it gives us global distribution, a multi-model approach across database APIs, elastic scalability, and SLA-backed guarantees. The question you most need answered is why Cosmos DB for ML and evaluation pipelines. I have tried a lot of databases, actually. Cosmos DB has seamless handling of JSON documents for the ML metrics, as we will see for ourselves today. It has fast queries across large evaluation data sets, transactional batch operations for efficient data processing, and built-in TTL support for semantic caching.
Coming back to the point: why Azure Cosmos DB for LLM evaluation? Because it provides some critical capabilities for building scalable LLM evaluation pipelines. For example, unlimited scalability, so we can handle millions of evaluation records. Also semantic caching: the TTL support reduces embedding API costs and improves response times. Global distribution gives low-latency evaluation across regions with multi-master replication. Transactional batch operations give efficient processing for evaluation loads at scale. And time-series analysis lets us track the evaluation metrics over time; if we have a lot of model versions, for example, we can track the evaluation metrics across them as well. In our implementation today we will mainly use dedicated containers for the evaluations and the semantic cache, with optimized partition keys, as we will see.
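As a minimal sketch of that container setup with the azure-cosmos SDK, reusing the connection variables from the setup sketch above (the partition key paths follow the query-ID and cache-ID choices described later; the TTL value is an illustrative assumption):

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(COSMOS_ENDPOINT, credential=COSMOS_KEY)
db = client.create_database_if_not_exists(id="llm_evaluation")

# Evaluation results, partitioned by query ID for fast per-query lookups.
evaluations = db.create_container_if_not_exists(
    id="evaluations",
    partition_key=PartitionKey(path="/query_id"),
)

# Semantic cache, partitioned by cache ID, with a TTL so entries expire
# automatically (3600 seconds here is an example choice).
cache = db.create_container_if_not_exists(
    id="semantic_cache",
    partition_key=PartitionKey(path="/cache_id"),
    default_ttl=3600,
)
```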
Let's talk quickly about semantic caching with Cosmos DB. Why does caching matter for LLM evaluation? Why talk about this? Because it reduces redundant API calls, it lowers cost, and it dramatically improves response times for similar evaluation queries. What's the difference between vector search and a traditional key-value cache? A traditional cache requires an exact match, right? But a semantic cache uses vector similarity, so it can find close-enough results; you will see how that happens. It needs to understand what I'm saying, not just match the exact text. This is very important. We also have a fast hash lookup, which, combined with a similarity threshold, offers both speed and semantic flexibility.
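A minimal sketch of that hash-plus-similarity lookup. The function and field names are illustrative, `embed()` is an assumed embedding call, and the 0.9 threshold is an example value to tune per workload.

```python
import hashlib
from typing import Callable, Optional

import numpy as np

SIMILARITY_THRESHOLD = 0.9  # "close enough" cutoff


def cache_lookup(query: str, cache_items: list,
                 embed: Callable[[str], list]) -> Optional[dict]:
    """Return a cached entry for `query`, or None on a miss.

    cache_items: documents from the semantic cache container, each holding
    'text_hash', 'embedding', and the cached payload.
    """
    # 1) Fast path: exact match on a hash of the query text.
    text_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()
    for item in cache_items:
        if item["text_hash"] == text_hash:
            return item

    # 2) Semantic path: nearest cached embedding above the threshold.
    q = np.asarray(embed(query), dtype=float)
    best, best_sim = None, 0.0
    for item in cache_items:
        c = np.asarray(item["embedding"], dtype=float)
        sim = float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))
        if sim > best_sim:
            best, best_sim = item, sim
    return best if best_sim >= SIMILARITY_THRESHOLD else None
```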
This is our system design for today. I will go through it quickly, and you will see it implemented. Starting out, we have a query and context, and I give it an example. Then we have the embedding step, which is Azure OpenAI with a semantic cache. Then we have the evaluation, which is the multimetric assessment, and then the storage, which is Cosmos DB. Azure OpenAI is used here for text embeddings for semantic analysis and for the GPT models for answer generation, and it also caches embeddings to reduce API calls. The evaluation engine is RAGAS, covering faithfulness and relevancy; we have the ROUGE scores for text comparison; and we will see semantic similarity for context and answer. Azure Cosmos DB is used here for the evaluation results container and the semantic cache container, so we'll have these two containers. Each of them has a partition key: for the evaluation results I have made the partition key the query ID, and for the semantic cache container I have used the cache ID. Then we have the reporting, which is a very important part of our pipeline. Here we will have comprehensive summary reports and metric aggregation analysis, which you can use afterwards; you can also export the data you have collected as CSV, JSON, or whatever you like. Finally, batch processing efficiently processes multiple evaluation requests in parallel, with error handling and backoff strategies. So this is our end-to-end architecture for scalable evaluation with Azure Cosmos DB.
This will be part of our code, or implementation, today: the core classes. We will have mainly three pieces. The evaluation config is a centralized configuration data class for all the settings; in it you will find the Azure OpenAI credentials, the Cosmos DB endpoint, the model parameters, the cache settings, and the evaluation thresholds. Then we have the modern RAG evaluator, which is the main orchestrator class that manages the metrics calculation, the semantic caching, the database interaction, the batch processing, and the comprehensive reporting. For Cosmos DB we have two container types: evaluations, for storing the assessment results, and the semantic cache, for efficient embedding storage and retrieval. All right, so we can start by having a look at what we have so far. We have a comprehensive metric suite for what we are going to measure: faithfulness, answer relevancy, context precision and recall, and the ROUGE scores. You will find that I have used ROUGE-1 for unigrams, ROUGE-2 for bigrams, and ROUGE-L, which is for the longest common subsequence. Right? So let's start with this.
First of all, this is the evaluation config. This is the main piece for our Azure Cosmos DB configuration, the Azure OpenAI configuration, the LLM parameters, and the pipeline configuration itself. We have some key parameter groups, as we can see, and this data class, the evaluation config, centralizes all the settings and makes it easy to modify the behavior without changing the code. If you notice, right here in the evaluation config we have the database name, which I have named llm_evaluation; the container, named evaluations; and the cache container, which is the semantic cache.
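A sketch of what such a centralized config data class can look like, using the database and container names from the demo; the other fields and defaults here are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class EvaluationConfig:
    """Centralized settings for the evaluation pipeline."""

    # Azure OpenAI
    azure_openai_endpoint: str = ""
    azure_openai_api_key: str = ""
    embedding_deployment: str = "text-embedding-3-large"  # assumed name
    chat_deployment: str = "gpt-4"                        # assumed name
    # Azure Cosmos DB (names match the demo)
    cosmos_endpoint: str = ""
    database_name: str = "llm_evaluation"
    container_name: str = "evaluations"
    cache_container_name: str = "semantic_cache"
    # LLM parameters
    temperature: float = 0.0
    max_tokens: int = 512
    # Pipeline / cache settings
    batch_size: int = 5
    similarity_threshold: float = 0.9
```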
Second, we have the modern RAG evaluator class, which is the main orchestrator for evaluation with multiple metrics and async batch processing. As you can see, we have some key evaluation methods. For example, we have the retrieval evaluation, for evaluating how effectively documents are retrieved, using semantic similarity, precision, and recall metrics. We have the generation evaluation, which assesses the LLM-generated answers for relevance to the queries and for coherence. We have the RAGAS metric calculation, which is specialized: it covers metrics like faithfulness, answer relevancy, and context precision and recall. And we have async batch processing, which enables parallel evaluation of multiple queries with error isolation. Right?
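A minimal sketch of that async batch pattern with error isolation; `evaluate_one` is a placeholder assumption for the per-query scoring coroutine.

```python
import asyncio


async def evaluate_batch(queries: list, evaluate_one, batch_size: int = 5):
    """Evaluate queries in parallel batches; one failure doesn't sink the rest.

    evaluate_one: an async function that scores a single query dict.
    """
    results = []
    for i in range(0, len(queries), batch_size):
        batch = queries[i : i + batch_size]
        # return_exceptions=True isolates errors to the failing query.
        outcomes = await asyncio.gather(
            *(evaluate_one(q) for q in batch), return_exceptions=True
        )
        for query, outcome in zip(batch, outcomes):
            if isinstance(outcome, Exception):
                results.append({"query_id": query.get("query_id"),
                                "error": str(outcome)})
            else:
                results.append(outcome)
    return results

# Usage: results = asyncio.run(evaluate_batch(sample_queries, evaluate_one))
```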
Before jumping into the hands-on, let's have a look at some best practices and troubleshooting. The best practice for your API keys is to store them in your .env file, for example; there is also Azure Key Vault, and you should use managed identities. You also need to configure the appropriate RUs based on your evaluation volume, and choose your partition key based on what you have written in your code. Also track the cache hit ratios and adjust the similarity threshold accordingly. And you need to track the model versions and evaluation metrics over time for A/B testing.
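For the managed-identity recommendation, a short sketch using the azure-identity package; this assumes your Cosmos DB account has the appropriate RBAC role assigned to the identity.

```python
from azure.cosmos import CosmosClient
from azure.identity import DefaultAzureCredential

# DefaultAzureCredential picks up a managed identity when running in Azure,
# or your developer login locally; no keys stored in code or .env files.
credential = DefaultAzureCredential()
client = CosmosClient(COSMOS_ENDPOINT, credential=credential)
```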
Some troubleshooting: if you have any problems with Cosmos DB, you may need to increase your RU allocations, for example, if you receive errors; and you can adjust the batch size and add more async workers for the parallel processing. You can process larger datasets in smaller chunks and implement proper garbage collection. Implement token-bucket rate limiting, and monitor the usage quotas for your Azure OpenAI resource; make sure to check them regularly. So I will just switch over to my screen now.
All right,
I'll jump into
my Azure account.
All right. So this is my Azure AI Foundry, right? You will find that there is a big model catalog here. If you jump to the deployments right here, you'll find the model names that we have just seen, and you can choose what you need from them. For example, you just click deploy model and deploy, say, a base model or a fine-tuned model, as you like. I have chosen the GPT-4 and large text embedding models for our work today. Check that the state is Succeeded; you will also need the model version. Make sure the deployment type is what you need, such as Global Standard, and make sure you have the deployment name right, because we will need to use it in our code. So let's first also have a look at our Cosmos DB. This is the Data Explorer in my Azure Cosmos DB account. I have quickly created my database here, and I have my containers right here. They're all empty: I have the semantic cache, I have evaluations, and they don't have anything in them yet. You can delete a container, you can change the names; it's up to you, of course, but I will stick with these names for our demo.
If we jump into our code now, we have already passed quickly through our classes. But the thing I need you to note is that I have created some UI designs with the rich CLI library's imports, so we will have a proper CLI, not just a bare evaluation pipeline. It will look cool; we just need to use rich for that. I have downloaded the required NLTK data, and I have my variables right here; they're in my .env file. This class I have explained in the presentation already: we have the database name, the container name, and the cache container name. Then I have a class for the modern CLI logging; it marks each step as success, warning, error, and so on. We also have the hashed text in the cache. You will find the full code in the repo, and you can have a look, but let's jump into the most important parts. For example, we have here the modern RAG evaluator class that we talked about in the presentation.
It initializes Cosmos DB here, and then we set up the semantic cache, set up the evaluation database, and calculate the retrieval metrics. We have the metrics themselves calculated here: precision at K, recall at K, and the F1 score at K. If this is the first time you're hearing about these, just search for "precision at K", "recall at K", and "F1 score at K".
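For reference, a compact sketch of precision@K, recall@K, and F1@K over retrieved document IDs; the function name is illustrative, not the one from the repo.

```python
def precision_recall_f1_at_k(retrieved: list, relevant: set, k: int):
    """Standard top-K retrieval metrics over document IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 2 of the top-3 retrieved docs are relevant, out of 4 relevant total:
# precision@3 ~ 0.67, recall@3 = 0.5, F1@3 ~ 0.57
print(precision_recall_f1_at_k(["d1", "d7", "d3"], {"d1", "d3", "d4", "d9"}, k=3))
```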
Okay, so these are our metrics today. Here I have written all the calculations for recall, precision, and the F1 score, and I have the similarities calculated as well: the average, maximum, and minimum of the semantic similarity. Then I calculate the generation metrics, and the faithfulness, of course, which is right here, along with the relevancy and the sentence counts, and I calculate the RAGAS metrics. We have here the context relevancy, the context precision, more faithfulness right here, and the context recall. All of these are calculated right here in this class, the modern RAG evaluator.
These scores here are for ROUGE: we have ROUGE-1 and ROUGE-2, for example, and ROUGE-L; it depends on what you are measuring.
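A quick sketch with the rouge-score package from the prerequisites; the strings are made-up examples, so verify the numbers against your own version.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "Azure Cosmos DB is a globally distributed NoSQL database.",  # reference
    "Cosmos DB is a global NoSQL database service.",              # generated
)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```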
And here we are evaluating the RAG responses. We take what we have calculated so far and evaluate the RAG responses with it, as you can see. We have the evaluation result: the query ID, the query itself, the retrieved context, the generated answer, and the expected answer, all right here; this is what we will be evaluating. We have the timestamp, the retrieval metrics, the generation metrics, faithfulness, answer relevancy, context precision, context recall, and the ROUGE scores; all of these you can see.
Then I store the evaluation result right here, and we have the batch evaluator. All of this will be reflected in our database.
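Storing a result is essentially one upsert into the evaluations container; a sketch with illustrative field names and values matching what the demo shows on screen (the `evaluations` container is the one created in the earlier sketch).

```python
import uuid
from datetime import datetime, timezone

result_doc = {
    "id": str(uuid.uuid4()),
    "query_id": "Q1",  # partition key value
    "query": "What is Azure Cosmos DB?",
    "retrieved_contexts": ["..."],
    "generated_answer": "...",
    "expected_answer": "...",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "faithfulness": 0.9,
    "answer_relevancy": 0.8,
    "context_precision": 0.7,
    "context_recall": 0.75,
    "rouge_scores": {"rouge1": 0.6, "rouge2": 0.4, "rougeL": 0.55},
}
evaluations.upsert_item(result_doc)
```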
After that, we have some displays for the results table. I have created a CLI, so you will find a lot of things related to the CLI itself.
These are the display functions for the dashboard and the metrics. And the most important thing: let's come down here. All of this is for the CLI; you can skip it if you don't need it. I have created a sample evaluation data tool, so it won't take a lot of time; you can of course attach a PDF and process that instead if you want to try it. I have put in a query ID, a query, a retrieved context, an expected answer, and the relevant documents. I have put in three queries, Q1, Q2, and Q3, and they're all different: one about machine learning, one about Cosmos DB, and one about cloud computing. Right, so that's done right here. Then there is just some display code for the welcome component and some CLI work. But here we have our evaluation configuration; you can check this. You need to make sure you use the same names correctly. And now let's run it and see what we get.

All right, so here we go. Our CLI starts working with my default configuration. I have all my keys, and you can see it prints my environment: the embedding model, the chat model, the database name, the container, the cache container, my temperature, my max tokens, and the batch size. Let me make this bigger. Then I have my RAG evaluator; I have made sure that my Azure OpenAI client is initialized correctly and Cosmos DB is initialized correctly, and everything is working. So let's see. I have implemented several ways of doing the data-source selection: it can come from my sample data, the three queries we just looked at; you can load from a JSON file; you can load from a CSV file; or you can enter data manually. So let's say one, and it will just work on the sample data, those three queries that we have. Let's give it some time. It's starting to evaluate, and we will see what happens. Until then, we can jump back to our evaluations container; let's refresh and wait until it's finished.
Then you'll see how it is different. Let's check again. I think, yeah, most of it is finished. After that it will be reflected in my database. We are on query three now. Right. Wait, let's first go and see our results. All right, so we have our evaluations here with query IDs. Let's check the first one. Right?
If we check this, you will find that we have our query one, and the query is specified: the retrieved context, the generated answer, the expected one, and my timestamp. And you can find all the metrics; we have all the metrics right here. The faithfulness, for example, is 0.9; wow, that's great. We have the ROUGE scores, the context recall, the precision, all of these, so you can find all of these metrics, and you can add more, of course. Let's check the semantic cache; let's refresh. These are our IDs, so let's check a random one. Here we have the text: it provides automatic scaling, multi-region replication, and so on. And these are all the vectors that help us in our vector search. Right? So we have the embeddings, the semantic cache is ready, and the queries are ready for evaluation. Let's have a look at our CLI again. This is how our CLI will look: it gives us a visual dashboard, and it gives us detailed results for what happened in my evaluation run. You can see we have the queries, the faithfulness, the answer relevancy, the precision, the recall, and the ROUGE score. The numbers aren't great, of course; these are just small samples, and you can try it with a lot more data. But the overall score is good, actually; 0.7 as the score is not so bad.
The next step asks how I want to export this: as CSV, as JSON, both of them, or none. Let's say I want a JSON file. It says, okay, it's just been exported as this one, and you can see what you have as a JSON file. So let's check this one. Yeah, so this is our JSON file, right? You can see it.
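The export step itself can be as simple as this sketch; the file names are illustrative, and `results` stands in for the list of evaluation documents collected above.

```python
import json

import pandas as pd

# `results` is the list of evaluation result dicts gathered by the pipeline.
with open("evaluation_results.json", "w") as f:
    json.dump(results, f, indent=2)

pd.DataFrame(results).to_csv("evaluation_results.csv", index=False)
```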
We have just had our LLM evaluation pipeline completed using Azure Cosmos DB and Azure OpenAI. We have done this with a CLI, with semantic caching, and very fast; as you can see, it didn't take a lot of time.
Let me go back.
Yes,
thank you.
All right, I just wanted to remind everybody that we are definitely looking forward to everyone sharing questions. It looks like we lost your presentation.
>> Yeah, actually I have finished, but I can share it again.
>> Okay, so you're done with your presentation for today?
>> Because I don't see your slide deck anymore, or your share. Just making sure.
>> Yeah. So let me share it. There we go.
>> Yeah, we have it. Yes.
>> Okay. All right. So, I think we've reached the end of it. I want to take a moment and see if anyone in our audience has questions. If you have a question, please feel free to ask. If not, I think we'll start wrapping up, so we'll give everybody just a minute or two. I wanted to share that we do have a poll today; we'd love it if you could take it. It just gives us a little more information on what you are doing right now: we want to know if you are using Azure Cosmos DB in your AI apps. Today we know that you are, Farah, and that's huge; we always appreciate it. And I think we're just about ready to wrap up. So, Farah, first of all, everyone should know that you are on LinkedIn; they can find you right there. I gave you a nice, easy-to-use aka.ms short link to Farah's LinkedIn. Have you got any other great sessions coming up, talks, anything you'd want everyone to know about before we close up?
>> For this month, this is my only session, so we'll meet again soon, I think.
>> Wow. Well, I feel very lucky that I got to host. So, Maxim says, "I think validation makes sense as LLM models get updated every month." They really do. It's amazing how rapidly new models appear and existing models get updated. There's so much data constantly being collected and used for AI applications; it's quite amazing. So thanks, Maxim. Farah, it's been really wonderful. I'm going to start doing what I like to do towards the end of a session: we're going to play a little lovely music in the background, and we're going to tell everybody that hopefully we'll be back very soon with another great session for you. We appreciate everybody being part of this, and we want to remind you that we would love you to share your opinion on today's session. Thank you to our friends from the Microsoft Reactor for partnering with the Azure Cosmos DB team. Please take a moment to share your opinion; we'd love to hear what you thought about today's session. We've got a couple more questions; wow, they just poured in. Should we go with those? What do you think?
>> Yeah, we can go through them.
>> Sure. So, we have a question: just to check, was the query and document data ingested into Cosmos DB from the JSON files?
>> No, just from a sample function; we passed it some examples.
>> Got it. Got it. All right. Well, we've got a couple more questions. So, Pablo: "Hi Farah. Is there another method to evaluate LLMs within the Microsoft platform? If yes, what are the differences?"
>> You can use a lot of APIs, actually, like the ones I have used today: the Azure OpenAI API and the Azure Cosmos DB API. I haven't tried other ways, but these are the most sufficient ways I have used so far.
>> Cool. Cool. All right. So, Tech with Kirk asks, "Why would you choose Cosmos DB over AI Search for a vector DB? So many options, it's overwhelming."
>> I love the vector search, and I chose Cosmos DB for it. It helps me when I'm building a big tool, for example, with a big database.
>> Great. We've got another question; you are popular today. Very popular. So Kristen asks, "What are the standard scores for each of the elements that you review? What is a good score?"
>> Actually, for the faithfulness, for example, it would be one. So the closer I get to one, the better. And each of these metrics has its own scale, of course, so we can check that.
>> Great. And then I've got one more question for you, unless another one rolls in: did you use the RAGAS knowledge-graph build for generating QA pairs?
>> No, actually, I'm just using RAGAS for metric evaluation. Just that.
>> Oh, we've got another question; they just keep rolling in. Pablo wants to know: could we use this method if we have a combination of vector search and a knowledge graph for retrieval?
>> I haven't tried it with a knowledge graph for retrieval, but as long as you are using the vector search, I think it will be good.
>> Wow. So this was a great session. I can't get over it; I know I keep saying it, but it really was. So, Farah, I hope we can do another one soon. You are such a great presenter, and you always have such amazing information to share. Let's go ahead and say goodbye to all of our friends who joined us today. Thank you so much, everybody, for being part of this session. Stay tuned; we'll have our next one announced very soon. Thank you all for watching, and we will see you again very soon. Bye, everybody.
>> Bye bye everyone.
[Music]
This hands-on workshop teaches participants to build cost-effective evaluation systems for RAG applications using Azure Cosmos DB's vector search capabilities. Attendees will learn to implement semantic caching techniques that significantly reduce LLM evaluation costs while maintaining fast query performance. Participants will create a complete evaluation pipeline that measures retrieval quality, answer accuracy, and system performance using industry-standard metrics. By the end of this session, attendees will have production-ready code and benchmarking tools that can scale across different deployment environments.

#AzureCosmosDB #LLM #AI

Useful links:
• (GitHub) Azure Cosmos DB LLM Evaluation - https://github.com/FarahAbdo/azure-cosmos-llm-evaluation
• Subscribe to this channel - https://aka.ms/AzureCosmosDBYouTube
• Check out past meetups on YouTube to catch anything you might have missed - https://www.youtube.com/playlist?list=PLmamF3YkHLoJSJ1qdHDXXSlmkj2HKz-nb
• Want to present at a future meetup? Fill out our intake form - https://aka.ms/AzureCosmosDB/UserGroupSubmission
• Try Azure Cosmos DB Free - https://aka.ms/trycosmosdb
• Microsoft Reactor - https://aka.ms/Reactor
• Follow Azure Cosmos DB on X - https://twitter.com/AzureCosmosDB
• Follow Azure Cosmos DB on LinkedIn - https://www.linkedin.com/company/azure-cosmos-db

Speaker: Farah Abdou
Farah Abdou is a machine-learning engineer, STEM advocate, and international tech speaker whose work bridges artificial intelligence research with large-scale industrial deployment. Best known for her contributions to natural-language processing (NLP), quantum reinforcement learning (QRL), and cloud-native AI systems, she has become a prominent voice for open-source innovation and women's representation in technology across the Middle East and North Africa (MENA) region.

Chapters:
02:16 - Jay Gordon kicks off the September session
03:01 - Housekeeping with Anna from Microsoft Reactor
04:14 - Azure Cosmos DB Samples Gallery overview
05:13 - Introducing guest Farah Abdou, AI Engineer
07:01 - Farah's journey into AI and NLP
10:30 - Session agenda and learning objectives
13:05 - Why LLM evaluation matters
14:50 - Challenges with traditional metrics
15:50 - Understanding RAG and its evaluation needs
17:59 - Deep dive into RAG-specific metrics
19:55 - Using RAGAS for scalable evaluation
20:59 - Why Azure Cosmos DB is ideal for ML pipelines
22:45 - Semantic caching explained
24:33 - End-to-end architecture overview
25:58 - Core classes in the evaluation pipeline
27:29 - Hands-on demo setup
31:20 - Azure OpenAI and Cosmos DB configuration
33:05 - Code walkthrough: metrics and evaluation logic
36:34 - CLI interface and sample data
39:51 - Viewing evaluation results in Cosmos DB
41:18 - Exporting results to JSON
42:35 - Final thoughts and wrap-up
44:01 - Audience Q&A
47:13 - Cosmos DB vs. other vector DBs
48:08 - What makes a good score?
49:18 - Using vector search with knowledge graphs
49:50 - Closing remarks and poll reminder

#microsoftreactor #learnconnectbuild
[eventID:26291]