Hey everybody, my name is Jay Gordon. Welcome back to the Azure Cosmos DB global user group. It is September. Wow, the summer went by way too fast for me. I am still recovering from a very fun summer: I got to take some amazing trips and did some really fun stuff. And I can't begin to tell you how much I'm looking forward to the Halloween season coming up very soon. More than anything, I'm also excited for Microsoft Ignite, coming in November, so be prepared. We're going to get started in just a minute, but I want to remind you all of a few things, and I have some help from my friend Anna from the Reactor, who's going to do a little housekeeping. So let's take a listen.
>> Hey everyone, thanks for joining us for our next live session. My name is Anna. I'm a producer for Reactor, joining you from Redmond, Washington. Before we start, I do have some quick housekeeping. Please take a moment to read our code of conduct. We seek to provide a respectful environment for both our audience and presenters. While we absolutely encourage engagement in the chat, we ask that you please be mindful of your commentary: remain professional and on topic. Keep an eye on that chat; we'll be dropping helpful links and checking for questions for our presenters to answer live. Our session is being recorded and will be available to view on demand, right here on the Reactor YouTube channel. With that, I'll turn it back to Jay. Thanks again.
>> Thank you so much, Anna. You are the best. We love the Microsoft Reactor. Make sure you go spend some time and find out about all the other things they do. You can go to aka.ms/Reactor; I'll put it in the chat in a little bit. Before we bring in our guest, here's a cute little video that came out recently. Let's take a look at it.
>> If you're diving into Azure Cosmos DB, you absolutely need to check out the Azure Cosmos DB samples gallery. It's a treasure trove of resources that can supercharge your development journey. You'll find everything from blogs and presentations to detailed documentation and engaging videos. Whether you're coding in .NET, Go, Java, JavaScript, Python, or any popular language, there are samples tailored just for you. And if you're curious about generative AI, the gallery has you covered with tons of material on agents, interactive chat, MCP, and the RAG pattern. This is your go-to source for patterns and content that can help you harness the full power of Azure Cosmos DB. So dive in and explore all the amazing resources waiting for you.
[Music]
I like that. It's just a little thing to get you started. All right, we've got so much more to do today. I've got an amazing guest, and we're going to be talking about more AI. We are all talking about AI: we're talking about it when it comes to building our applications, and we're talking about it around things like Microsoft Fabric. There's so much happening around AI right now, and I needed someone special, someone who's been doing amazing things in this part of our technology world and who's also really, really great. That person is Farah Abdou. Farah, make sure you unmute yourself. Hi, welcome. How are you today?
>> Hi Jay. Hi everyone.
I'm good actually.
>> So let everybody know: I'm personally in Brooklyn, New York. I love being in New York City. It's my heart, it's my love. Tell everybody where you're from, Farah.
>> I'm based in Egypt right now. It's 9:00 PM here, so good night to everyone.
>> Oh, thank you so much for giving us some of your evening. I appreciate it, and I know our audience does. And I want to remind my audience that you can find out so much great stuff about Farah by just going to her LinkedIn; it's right there below, and I'll also put it in the chat. You can find more information about how cool Farah is and get information about today's session, and you can ask some questions. You may not be sure about something, or you may not get a question answered live, but there is the amazing GitHub repo; there's a link for it, and you'll see why that's important today. You can always take a look at the GitHub repo if you've got questions; I know Farah will be able to help you out. But Farah, before we jump in: first of all, you are an AI engineer, is that right? Or an ML engineer with Sparkable?
>> Yeah, I'm working with Sparkable as an ML engineer, but basically I'm an AI engineer.
>> Fantastic. It takes a lot to be that person right now. The world is filled with so many AI engineers, so many people making use of this technology. How did you get to this point? You've told me your story personally, but someone watching this may be meeting you for the first time. So tell me a little bit about how you became an AI engineer and ended up at this part of your career.
>> Yeah, it started with my interest in how to make things work smarter: making chatbots, making intelligent systems, building smarter tools. That's what interested me, and that's how I jumped into the AI field many years ago. Since then I have specialized in the NLP field: how chatbots are built, how we can improve them, how we can build AI agents, how we can connect them to databases, and how to bring some benefit to society with our knowledge. So this is how I got here.
>> Fabulous. Fabulous. Well, before we jump into your session, let's say some hellos to the people who are watching. I know they're super excited to see everything you have to show them. There's Robbie from San Diego, California. We've got Aquarist 123 from East Anglia in England. We've got Kuma; hi, thank you for watching all the way from Canada. We've got Muhammad from Qatar; hi, welcome, Muhammad, thank you for joining us. A little more local for me, at least, Jonathan is dialing in today from Chicago. So we've got a global audience. We've even got our friend Gabor from Budapest. And I'll give you one more: Desh from Germany. So, Farah, we've got a global session today, with people from all over who want to hear what you have to say. With that, Farah, why don't you get your presentation ready? I'll bring that up, and then I'm going to give you the floor. We're going to be here for the next hour or so; it's 2:07 here, and we'll be here till 3, so about an hour. Please, while you're watching, ask your questions and put them in the chat. We'll also drop some links and all that stuff, and we'll get to your questions at the end of the session. So please put your questions in the chat, and I'll make sure this wonderful presenter gets them. I'm going to bring her up now and hand her the floor. Everybody, let's be respectful in the chat, ask questions, and more than anything, let's learn something from Farah today. Thanks, Farah. Why don't you get started?
>> Thanks, Jay. All right, let's start, everyone. So today we are basically talking about building scalable LLM evaluation pipelines, with a focus on Azure Cosmos DB. I will be focusing on RAG evaluation, on semantic caching, and on the scalable architecture. All right, so let's start with the agenda for today. This is a hands-on session. We will start with some introduction and fundamental knowledge, then a system architecture deep dive. I will go through a walkthrough implementation, where you will see the code for yourself, and we will do some hands-on exercises. I will talk about some advanced and scaling topics, and then we will jump into the Q&A at the end of the session.
Our objective today is that you will be able to understand how LLM evaluation metrics work, how we can use them in our production environments, and how we can use Azure Cosmos DB to build scalable evaluation pipelines. We will also explore multimetric evaluation approaches like RAGAS, ROUGE, and semantic similarity. And we will get some hands-on experience with a real-world pipeline and some examples; you will see it yourself.
So, here is what we will need for our session. You will need some credits on your Azure subscription. We will need the Azure OpenAI Service, where we'll use the API, and we will need Azure Cosmos DB, where we will mainly use the NoSQL API. Your environment should be equipped with Python 3.8 or anything newer than that. You will need the packages azure-cosmos, openai, pandas, numpy, nltk, and rouge-score; all of these will be needed in our code. The code editor you can choose yourself; it's not a big problem. You can use VS Code, PyCharm, Jupyter Notebook, Cursor, anything. And of course you will need to be familiar with Python, with RAG concepts, and with LLM and evaluation fundamentals, because we will not talk about those too much. We'll jump straight into our topic.
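As a reference point, here is a minimal environment setup matching the packages listed above. The .env variable names are illustrative assumptions, not necessarily the ones used in the session's repo.

```python
# Install the session's packages first (shell):
#   pip install azure-cosmos openai pandas numpy nltk rouge-score python-dotenv

import os

from dotenv import load_dotenv  # python-dotenv, so keys stay out of the code

load_dotenv()  # reads key=value pairs from a local .env file

# Hypothetical variable names; rename to match your own .env file.
AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]
COSMOS_ENDPOINT = os.environ["COSMOS_ENDPOINT"]
COSMOS_KEY = os.environ["COSMOS_KEY"]
```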
To give an intro into why evaluation matters: our businesses need to make decisions, and the LLM's outputs directly impact us; our customers' experiences will be impacted as well. We also have critical operations we need to protect, organizational risk we need to manage, and content we need to deliver to our end users successfully. That's why we need evaluation.

Our key challenges for evaluation: you will notice that if you give an LLM a prompt, it gives you a different output every time. We don't want that unpredictability, so that's one of the challenges. We also have hallucinations: how do we detect factual inaccuracies? We also need to measure accuracy, relevance, and coherence; all of these need to be measured. And we need our production evaluation to scale.

From this we can see that traditional metrics like BLEU alone are insufficient for our RAG system. We need some specialized evaluation approaches. That's why we are here.
If any of you don't know about RAG: it's retrieval-augmented generation. The user has a query, a question or a request, and gives it to the LLM; the LLM retrieves what it needs from the relevant documents or context, and then it generates the answer from that context. So we have documents stored somewhere, in a database or something, and RAG is how we deal with this. RAG reduces hallucination by grounding in facts. It provides up-to-date information beyond the training data, and it enables domain-specific knowledge without any form of retraining. What the evaluation needs to cover is the retrieval accuracy and relevance, the answer faithfulness to the retrieved context, and the response quality.
So, as I have mentioned, we will not stop at metrics like BLEU or ROUGE; we need to go beyond them. Traditional metrics fall short for RAG systems, which require more context-aware evaluation. Let's talk more about ROUGE and BLEU, for example: they focus on text similarity, but they miss factual accuracy and context elements. We also have perplexity, which measures the statistical likelihood of the text, but not its correctness or usefulness. So the key limitations, as you can see, are that there is no factual accuracy assessment, context utilization is not measured, and retrieval quality is ignored.
On the RAG-specific metrics: we will talk a lot about faithfulness today, which measures whether the responses are grounded in the retrieved context, so we can actually prevent hallucinations. You can see that a lot of companies now do research on hallucination and how to decrease its impact, so it's important to know about these metrics. The relevance metrics evaluate both the answer's relevancy to the query and the context's relevancy to the information I need. This is very important, and this is how RAG can be a valuable tool for us. There is also context quality, where we measure precision and recall; this is a very fundamental thing that we need in our metrics. So our pipeline will mainly use the RAGAS framework, which measures these specialized metrics automatically and at scale.
So what is the RAG evaluation framework? We have a query, a context, and an answer. Let's focus on context relevancy first: this is how relevant the retrieved context is to the query. Then there is context precision, the proportion of retrieved passages that are relevant. We also have context recall, which is about coverage of the information needed for the answer. Faithfulness is very important here, as it's the answer's grounding in the retrieved context. And last we have answer relevancy, which is how well the answer addresses the query. So this is a comprehensive evaluation framework, specifically designed for RAG systems.
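To make those five metrics concrete, here is a simplified sketch of how they can be approximated with embedding cosine similarity. This is not the actual RAGAS computation (RAGAS decomposes text into statements and uses an LLM judge); it's an illustration of the inputs and outputs, and the 0.5 relevance threshold is an assumption.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def rag_metrics(query_emb, context_embs, answer_emb, expected_emb):
    """Rough embedding-based approximations of the five RAG metrics.

    query_emb / answer_emb / expected_emb: vectors for the query, the
    generated answer, and the ground-truth answer; context_embs holds one
    vector per retrieved passage.
    """
    ctx_query_sims = [cosine(query_emb, c) for c in context_embs]
    return {
        # How relevant the retrieved context is to the query.
        "context_relevancy": float(np.mean(ctx_query_sims)),
        # Proportion of retrieved passages that look relevant (threshold assumed).
        "context_precision": float(np.mean([s > 0.5 for s in ctx_query_sims])),
        # Does the context cover what the expected answer needs?
        "context_recall": max(cosine(expected_emb, c) for c in context_embs),
        # Is the answer grounded in the retrieved context?
        "faithfulness": max(cosine(answer_emb, c) for c in context_embs),
        # How well the answer addresses the query.
        "answer_relevancy": cosine(answer_emb, query_emb),
    }
```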
And of course, a quick intro to Azure Cosmos DB. Azure Cosmos DB is a fully managed, globally distributed NoSQL database service designed for scalable applications; that's why we are using it today. We need scalable applications, and it gives us global distribution, a multi-model approach across database APIs, elastic scalability, and SLA-backed guarantees. The question you most need answered is why Cosmos DB for ML and evaluation pipelines. I have tried a lot of databases, actually. Cosmos DB has seamless handling of JSON documents for the ML metrics, as we will see for ourselves today. It has fast queries across large evaluation data sets, transactional batch operations for efficient data processing, and built-in TTL support for semantic caching.
Coming back to the point: why Azure Cosmos DB for LLM evaluation? Because it provides some critical capabilities for building scalable LLM evaluation pipelines. For example, unlimited scalability, so we can handle millions of evaluation records. Also semantic caching: the TTL support reduces embedding API costs and improves response times. Global distribution gives low-latency evaluation across regions with multi-master replication. Transactional batch operations give efficient processing for evaluation loads at scale. And time-series analysis lets us track the evaluation metrics over time; if we have a lot of model versions, for example, we can track the evaluation metrics across them as well. In our implementation today we will mainly use dedicated containers for the evaluations and the semantic cache, with optimized partition keys, as we will see.
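As a minimal sketch of that container setup with the azure-cosmos SDK, reusing the connection variables from the setup sketch above (the partition key paths follow the query-ID and cache-ID choices described later; the TTL value is an illustrative assumption):

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(COSMOS_ENDPOINT, credential=COSMOS_KEY)
db = client.create_database_if_not_exists(id="llm_evaluation")

# Evaluation results, partitioned by query ID for fast per-query lookups.
evaluations = db.create_container_if_not_exists(
    id="evaluations",
    partition_key=PartitionKey(path="/query_id"),
)

# Semantic cache, partitioned by cache ID, with a TTL so entries expire
# automatically (3600 seconds here is an example choice).
cache = db.create_container_if_not_exists(
    id="semantic_cache",
    partition_key=PartitionKey(path="/cache_id"),
    default_ttl=3600,
)
```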
Let's talk quickly about semantic caching with Cosmos DB. Why does caching matter for LLM evaluation? Why talk about this? Because it reduces redundant API calls, it lowers cost, and it dramatically improves response times for similar evaluation queries. What's the difference between vector search and a traditional key-value cache? A traditional cache requires an exact match, right? But a semantic cache uses vector similarity, so it can find close-enough results; you will see how that happens. It needs to understand what I'm saying, not just match the exact text. This is very important. We also have a fast hash lookup, which, combined with a similarity threshold, offers both speed and semantic flexibility.
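A minimal sketch of that hash-plus-similarity lookup. The function and field names are illustrative, `embed()` is an assumed embedding call, and the 0.9 threshold is an example value to tune per workload.

```python
import hashlib
from typing import Callable, Optional

import numpy as np

SIMILARITY_THRESHOLD = 0.9  # "close enough" cutoff


def cache_lookup(query: str, cache_items: list,
                 embed: Callable[[str], list]) -> Optional[dict]:
    """Return a cached entry for `query`, or None on a miss.

    cache_items: documents from the semantic cache container, each holding
    'text_hash', 'embedding', and the cached payload.
    """
    # 1) Fast path: exact match on a hash of the query text.
    text_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()
    for item in cache_items:
        if item["text_hash"] == text_hash:
            return item

    # 2) Semantic path: nearest cached embedding above the threshold.
    q = np.asarray(embed(query), dtype=float)
    best, best_sim = None, 0.0
    for item in cache_items:
        c = np.asarray(item["embedding"], dtype=float)
        sim = float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))
        if sim > best_sim:
            best, best_sim = item, sim
    return best if best_sim >= SIMILARITY_THRESHOLD else None
```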
This is our system design for today. I will go through it quickly, and you will see it implemented. Starting out, we have a query and context, and I give it an example. Then we have the embedding step, which is Azure OpenAI with a semantic cache. Then we have the evaluation, which is the multimetric assessment, and then the storage, which is Cosmos DB. Azure OpenAI is used here for text embeddings for semantic analysis and for the GPT models for answer generation, and it also caches embeddings to reduce API calls. The evaluation engine is RAGAS, covering faithfulness and relevancy; we have the ROUGE scores for text comparison; and we will see semantic similarity for context and answer. Azure Cosmos DB is used here for the evaluation results container and the semantic cache container, so we'll have these two containers. Each of them has a partition key: for the evaluation results I have made the partition key the query ID, and for the semantic cache container I have used the cache ID. Then we have the reporting, which is a very important part of our pipeline. Here we will have comprehensive summary reports and metric aggregation analysis, which you can use afterwards; you can also export the data you have collected as CSV, JSON, or whatever you like. Finally, batch processing efficiently processes multiple evaluation requests in parallel, with error handling and backoff strategies. So this is our end-to-end architecture for scalable evaluation with Azure Cosmos DB.
This will be part of our code, or implementation, today: the core classes. We will have mainly three pieces. The evaluation config is a centralized configuration data class for all the settings; in it you will find the Azure OpenAI credentials, the Cosmos DB endpoint, the model parameters, the cache settings, and the evaluation thresholds. Then we have the modern RAG evaluator, which is the main orchestrator class that manages the metrics calculation, the semantic caching, the database interaction, the batch processing, and the comprehensive reporting. For Cosmos DB we have two container types: evaluations, for storing the assessment results, and the semantic cache, for efficient embedding storage and retrieval. All right, so we can start by having a look at what we have so far. We have a comprehensive metric suite for what we are going to measure: faithfulness, answer relevancy, context precision and recall, and the ROUGE scores. You will find that I have used ROUGE-1 for unigrams, ROUGE-2 for bigrams, and ROUGE-L, which is for the longest common subsequence. Right? So let's start with this.
First of all, this is the evaluation config. This is the main piece for our Azure Cosmos DB configuration, the Azure OpenAI configuration, the LLM parameters, and the pipeline configuration itself. We have some key parameter groups, as we can see, and this data class, the evaluation config, centralizes all the settings and makes it easy to modify the behavior without changing the code. If you notice, right here in the evaluation config we have the database name, which I have named llm_evaluation; the container, named evaluations; and the cache container, which is the semantic cache.
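A sketch of what such a centralized config data class can look like, using the database and container names from the demo; the other fields and defaults here are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class EvaluationConfig:
    """Centralized settings for the evaluation pipeline."""

    # Azure OpenAI
    azure_openai_endpoint: str = ""
    azure_openai_api_key: str = ""
    embedding_deployment: str = "text-embedding-3-large"  # assumed name
    chat_deployment: str = "gpt-4"                        # assumed name
    # Azure Cosmos DB (names match the demo)
    cosmos_endpoint: str = ""
    database_name: str = "llm_evaluation"
    container_name: str = "evaluations"
    cache_container_name: str = "semantic_cache"
    # LLM parameters
    temperature: float = 0.0
    max_tokens: int = 512
    # Pipeline / cache settings
    batch_size: int = 5
    similarity_threshold: float = 0.9
```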
Second, we have the modern RAG evaluator class, which is the main orchestrator for evaluation with multiple metrics and async batch processing. As you can see, we have some key evaluation methods. For example, we have the retrieval evaluation, for evaluating how effectively documents are retrieved, using semantic similarity, precision, and recall metrics. We have the generation evaluation, which assesses the LLM-generated answers for relevance to the queries and for coherence. We have the RAGAS metric calculation, which is specialized: it covers metrics like faithfulness, answer relevancy, and context precision and recall. And we have async batch processing, which enables parallel evaluation of multiple queries with error isolation. Right?
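A minimal sketch of that async batch pattern with error isolation; `evaluate_one` is a placeholder assumption for the per-query scoring coroutine.

```python
import asyncio


async def evaluate_batch(queries: list, evaluate_one, batch_size: int = 5):
    """Evaluate queries in parallel batches; one failure doesn't sink the rest.

    evaluate_one: an async function that scores a single query dict.
    """
    results = []
    for i in range(0, len(queries), batch_size):
        batch = queries[i : i + batch_size]
        # return_exceptions=True isolates errors to the failing query.
        outcomes = await asyncio.gather(
            *(evaluate_one(q) for q in batch), return_exceptions=True
        )
        for query, outcome in zip(batch, outcomes):
            if isinstance(outcome, Exception):
                results.append({"query_id": query.get("query_id"),
                                "error": str(outcome)})
            else:
                results.append(outcome)
    return results

# Usage: results = asyncio.run(evaluate_batch(sample_queries, evaluate_one))
```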
Before jumping into the hands-on, let's have a look at some best practices and troubleshooting. The best practice for your API keys is to store them in your .env file, for example; there is also Azure Key Vault, and you should use managed identities. You also need to configure the appropriate RUs based on your evaluation volume, and choose your partition key based on what you have written in your code. Also track the cache hit ratios and adjust the similarity threshold accordingly. And you need to track the model versions and evaluation metrics over time for A/B testing.
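For the managed-identity recommendation, a short sketch using the azure-identity package; this assumes your Cosmos DB account has the appropriate RBAC role assigned to the identity.

```python
from azure.cosmos import CosmosClient
from azure.identity import DefaultAzureCredential

# DefaultAzureCredential picks up a managed identity when running in Azure,
# or your developer login locally; no keys stored in code or .env files.
credential = DefaultAzureCredential()
client = CosmosClient(COSMOS_ENDPOINT, credential=credential)
```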
Some troubleshooting: if you have any problems with Cosmos DB, you may need to increase your RU allocations, for example, if you receive errors; and you can adjust the batch size and add more async workers for the parallel processing. You can process larger datasets in smaller chunks and implement proper garbage collection. Implement token-bucket rate limiting, and monitor the usage quotas for your Azure OpenAI resource; make sure to check them regularly. So I will just switch over to my screen now.
All right,
I'll jump into
my Azure account.
All right. So this is my Azure AI Foundry, right? You will find that there is a big model catalog here. If you jump to the deployments right here, you'll find the model names that we have just seen, and you can choose what you need from them. For example, you just click deploy model and deploy, say, a base model or a fine-tuned model, as you like. I have chosen the GPT-4 and large text embedding models for our work today. Check that the state is Succeeded; you will also need the model version. Make sure the deployment type is what you need, such as Global Standard, and make sure you have the deployment name right, because we will need to use it in our code. So let's first also have a look at our Cosmos DB. This is the Data Explorer in my Azure Cosmos DB account. I have quickly created my database here, and I have my containers right here. They're all empty: I have the semantic cache, I have evaluations, and they don't have anything in them yet. You can delete a container, you can change the names; it's up to you, of course, but I will stick with these names for our demo.
If we jump into our code now, we have already passed quickly through our classes. But the thing I need you to note is that I have created some UI designs with the rich CLI library's imports, so we will have a proper CLI, not just a bare evaluation pipeline. It will look cool; we just need to use rich for that. I have downloaded the required NLTK data, and I have my variables right here; they're in my .env file. This class I have explained in the presentation already: we have the database name, the container name, and the cache container name. Then I have a class for the modern CLI logging; it marks each step as success, warning, error, and so on. We also have the hashed text in the cache. You will find the full code in the repo, and you can have a look, but let's jump into the most important parts. For example, we have here the modern RAG evaluator class that we talked about in the presentation.
It initializes Cosmos DB here, and then we set up the semantic cache, set up the evaluation database, and calculate the retrieval metrics. We have the metrics themselves calculated here: precision at K, recall at K, and the F1 score at K. If this is the first time you're hearing about these, just search for "precision at K", "recall at K", and "F1 score at K".
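For reference, a compact sketch of precision@K, recall@K, and F1@K over retrieved document IDs; the function name is illustrative, not the one from the repo.

```python
def precision_recall_f1_at_k(retrieved: list, relevant: set, k: int):
    """Standard top-K retrieval metrics over document IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 2 of the top-3 retrieved docs are relevant, out of 4 relevant total:
# precision@3 ~ 0.67, recall@3 = 0.5, F1@3 ~ 0.57
print(precision_recall_f1_at_k(["d1", "d7", "d3"], {"d1", "d3", "d4", "d9"}, k=3))
```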
Okay, so these are our metrics today. Here I have written all the calculations for recall, precision, and the F1 score, and I have the similarities calculated as well: the average, maximum, and minimum of the semantic similarity. Then I calculate the generation metrics, and the faithfulness, of course, which is right here, along with the relevancy and the sentence counts, and I calculate the RAGAS metrics. We have here the context relevancy, the context precision, more faithfulness right here, and the context recall. All of these are calculated right here in this class, the modern RAG evaluator.
These scores here are for ROUGE: we have ROUGE-1 and ROUGE-2, for example, and ROUGE-L; it depends on what you are measuring.
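A quick sketch with the rouge-score package from the prerequisites; the strings are made-up examples, so verify the numbers against your own version.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "Azure Cosmos DB is a globally distributed NoSQL database.",  # reference
    "Cosmos DB is a global NoSQL database service.",              # generated
)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```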
And here we are evaluating the RAG responses. We take what we have calculated so far and evaluate the RAG responses with it, as you can see. We have the evaluation result: the query ID, the query itself, the retrieved context, the generated answer, and the expected answer, all right here; this is what we will be evaluating. We have the timestamp, the retrieval metrics, the generation metrics, faithfulness, answer relevancy, context precision, context recall, and the ROUGE scores; all of these you can see.
Then I store the evaluation result right here, and we have the batch evaluator. All of this will be reflected in our database.
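Storing a result is essentially one upsert into the evaluations container; a sketch with illustrative field names and values matching what the demo shows on screen (the `evaluations` container is the one created in the earlier sketch).

```python
import uuid
from datetime import datetime, timezone

result_doc = {
    "id": str(uuid.uuid4()),
    "query_id": "Q1",  # partition key value
    "query": "What is Azure Cosmos DB?",
    "retrieved_contexts": ["..."],
    "generated_answer": "...",
    "expected_answer": "...",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "faithfulness": 0.9,
    "answer_relevancy": 0.8,
    "context_precision": 0.7,
    "context_recall": 0.75,
    "rouge_scores": {"rouge1": 0.6, "rouge2": 0.4, "rougeL": 0.55},
}
evaluations.upsert_item(result_doc)
```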
After that, we have some displays for the results table. I have created a CLI, so you will find a lot of things related to the CLI itself.
These are the display functions for the dashboard and the metrics. And the most important thing: let's come down here. All of this is for the CLI; you can skip it if you don't need it. I have created a sample evaluation data tool, so it won't take a lot of time; you can of course attach a PDF and process that instead if you want to try it. I have put in a query ID, a query, a retrieved context, an expected answer, and the relevant documents. I have put in three queries, Q1, Q2, and Q3, and they're all different: one about machine learning, one about Cosmos DB, and one about cloud computing. Right, so that's done right here. Then there is just some display code for the welcome component and some CLI work. But here we have our evaluation configuration; you can check this. You need to make sure you use the same names correctly. And now let's run it and see what we get.

All right, so here we go. Our CLI starts working with my default configuration. I have all my keys, and you can see it prints my environment: the embedding model, the chat model, the database name, the container, the cache container, my temperature, my max tokens, and the batch size. Let me make this bigger. Then I have my RAG evaluator; I have made sure that my Azure OpenAI client is initialized correctly and Cosmos DB is initialized correctly, and everything is working. So let's see. I have implemented several ways of doing the data-source selection: it can come from my sample data, the three queries we just looked at; you can load from a JSON file; you can load from a CSV file; or you can enter data manually. So let's say one, and it will just work on the sample data, those three queries that we have. Let's give it some time. It's starting to evaluate, and we will see what happens. Until then, we can jump back to our evaluations container; let's refresh and wait until it's finished.
Then you'll see how it is different. Let's check again. I think, yeah, most of it is finished. After that it will be reflected in my database. We are on query three now. Right. Wait, let's first go and see our results. All right, so we have our evaluations here with query IDs. Let's check the first one. Right?
If we check this, you will find that we have our query one, and the query is specified: the retrieved context, the generated answer, the expected one, and my timestamp. And you can find all the metrics; we have all the metrics right here. The faithfulness, for example, is 0.9; wow, that's great. We have the ROUGE scores, the context recall, the precision, all of these, so you can find all of these metrics, and you can add more, of course. Let's check the semantic cache; let's refresh. These are our IDs, so let's check a random one. Here we have the text: it provides automatic scaling, multi-region replication, and so on. And these are all the vectors that help us in our vector search. Right? So we have the embeddings, the semantic cache is ready, and the queries are ready for evaluation. Let's have a look at our CLI again. This is how our CLI will look: it gives us a visual dashboard, and it gives us detailed results for what happened in my evaluation run. You can see we have the queries, the faithfulness, the answer relevancy, the precision, the recall, and the ROUGE score. The numbers aren't great, of course; these are just small samples, and you can try it with a lot more data. But the overall score is good, actually; 0.7 as the score is not so bad.
The next step asks how I want to export this: as CSV, as JSON, both of them, or none. Let's say I want a JSON file. It says, okay, it's just been exported as this one, and you can see what you have as a JSON file. So let's check this one. Yeah, so this is our JSON file, right? You can see it.
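The export step itself can be as simple as this sketch; the file names are illustrative, and `results` stands in for the list of evaluation documents collected above.

```python
import json

import pandas as pd

# `results` is the list of evaluation result dicts gathered by the pipeline.
with open("evaluation_results.json", "w") as f:
    json.dump(results, f, indent=2)

pd.DataFrame(results).to_csv("evaluation_results.csv", index=False)
```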
We have just had our LLM evaluation pipeline completed using Azure Cosmos DB and Azure OpenAI. We have done this with a CLI, with semantic caching, and very fast; as you can see, it didn't take a lot of time.
Let me go back.
Yes,
thank you.
All right, I just wanted to remind everybody that we are definitely looking forward to everyone sharing questions. It looks like we lost your presentation.
>> Yeah, actually I have finished, but I can share it again.
>> Okay, so you're done with your presentation for today?
>> Because I don't see your slide deck anymore, or your share. Just making sure.
>> Yeah. So let me share it. There we go.
>> Yeah, we have it. Yes.
>> Okay. All right. So, I think we've reached the end of it. I want to take a moment and see if anyone in our audience has questions. If you have a question, please feel free to ask. If not, I think we'll start wrapping up, so we'll give everybody just a minute or two. I wanted to share that we do have a poll today; we'd love it if you could take it. It just gives us a little more information on what you are doing right now: we want to know if you are using Azure Cosmos DB in your AI apps. Today we know that you are, Farah, and that's huge; we always appreciate it. And I think we're just about ready to wrap up. So, Farah, first of all, everyone should know that you are on LinkedIn; they can find you right there. I gave you a nice, easy-to-use aka.ms short link to Farah's LinkedIn. Have you got any other great sessions coming up, talks, anything you'd want everyone to know about before we close up?
>> For this month, this is my only session, so we'll meet again soon, I think.
>> Wow. Well, I feel very lucky that I got to host. So, Maxim says, "I think validation makes sense as LLM models get updated every month." They really do. It's amazing how rapidly new models appear and existing models get updated. There's so much data constantly being collected and used for AI applications; it's quite amazing. So thanks, Maxim. Farah, it's been really wonderful. I'm going to start doing what I like to do towards the end of a session: we're going to play a little lovely music in the background, and we're going to tell everybody that hopefully we'll be back very soon with another great session for you. We appreciate everybody being part of this, and we want to remind you that we would love you to share your opinion on today's session. Thank you to our friends from the Microsoft Reactor for partnering with the Azure Cosmos DB team. Please take a moment to share your opinion; we'd love to hear what you thought about today's session. We've got a couple more questions; wow, they just poured in. Should we go with those? What do you think?
>> Yeah, we can go through them.
>> Sure. So, we have a question: just to check, was the query and document data ingested into Cosmos DB from the JSON files?
>> No, just from a sample function; we passed it some examples.
>> Got it. Got it. All right. Well, we've got a couple more questions. So, Pablo: "Hi Farah. Is there another method to evaluate LLMs within the Microsoft platform? If yes, what are the differences?"
>> You can use a lot of APIs, actually, like the ones I have used today: the Azure OpenAI API and the Azure Cosmos DB API. I haven't tried other ways, but these are the most sufficient ways I have used so far.
>> Cool. Cool. All right. So, Tech with Kirk asks, "Why would you choose Cosmos DB over AI Search for a vector DB? So many options, it's overwhelming."
>> I love the vector search, and I chose Cosmos DB for it. It helps me when I'm building a big tool, for example, with a big database.
>> Great. We've got another question; you are popular today. Very popular. So Kristen asks, "What are the standard scores for each of the elements that you review? What is a good score?"
>> Actually, for the faithfulness, for example, it would be one. So the closer I get to one, the better. And each of these metrics has its own scale, of course, so we can check that.
>> Great. And then I've got one more question for you, unless another one rolls in: did you use the RAGAS knowledge-graph build for generating QA pairs?
>> No, actually, I'm just using RAGAS for metric evaluation. Just that.
>> Oh, we've got another question; they just keep rolling in. Pablo wants to know: could we use this method if we have a combination of vector search and a knowledge graph for retrieval?
>> I haven't tried it with a knowledge graph for retrieval, but as long as you are using the vector search, I think it will be good.
>> Wow. So this was a great session. I can't get over it; I know I keep saying it, but it really was. So, Farah, I hope we can do another one soon. You are such a great presenter, and you always have such amazing information to share. Let's go ahead and say goodbye to all of our friends who joined us today. Thank you so much, everybody, for being part of this session. Stay tuned; we'll have our next one announced very soon. Thank you all for watching, and we will see you again very soon. Bye, everybody.
>> Bye bye everyone.
[Music]
This hands-on workshop teaches participants to build cost-effective evaluation systems for RAG applications using Azure Cosmos DB's vector search capabilities. Attendees will learn to implement semantic caching techniques that significantly reduce LLM evaluation costs while maintaining fast query performance. Participants will create a complete evaluation pipeline that measures retrieval quality, answer accuracy, and system performance using industry-standard metrics. By the end of this session, attendees will have production-ready code and benchmarking tools that can scale across different deployment environments.

#AzureCosmosDB #LLM #AI

Useful links:
• (GitHub) Azure Cosmos DB LLM Evaluation - https://github.com/FarahAbdo/azure-cosmos-llm-evaluation
• Subscribe to this channel - https://aka.ms/AzureCosmosDBYouTube
• Check out past meetups on YouTube to catch anything you might have missed - https://www.youtube.com/playlist?list=PLmamF3YkHLoJSJ1qdHDXXSlmkj2HKz-nb
• Want to present at a future meetup? Fill out our intake form - https://aka.ms/AzureCosmosDB/UserGroupSubmission
• Try Azure Cosmos DB Free - https://aka.ms/trycosmosdb
• Microsoft Reactor - https://aka.ms/Reactor
• Follow Azure Cosmos DB on X - https://twitter.com/AzureCosmosDB
• Follow Azure Cosmos DB on LinkedIn - https://www.linkedin.com/company/azure-cosmos-db

Speaker: Farah Abdou
Farah Abdou is a machine-learning engineer, STEM advocate, and international tech speaker whose work bridges artificial intelligence research with large-scale industrial deployment. Best known for her contributions to natural-language processing (NLP), quantum reinforcement learning (QRL), and cloud-native AI systems, she has become a prominent voice for open-source innovation and women's representation in technology across the Middle East and North Africa (MENA) region.

Chapters:
02:16 - Jay Gordon kicks off the September session
03:01 - Housekeeping with Anna from Microsoft Reactor
04:14 - Azure Cosmos DB Samples Gallery overview
05:13 - Introducing guest Farah Abdou, AI Engineer
07:01 - Farah's journey into AI and NLP
10:30 - Session agenda and learning objectives
13:05 - Why LLM evaluation matters
14:50 - Challenges with traditional metrics
15:50 - Understanding RAG and its evaluation needs
17:59 - Deep dive into RAG-specific metrics
19:55 - Using RAGAS for scalable evaluation
20:59 - Why Azure Cosmos DB is ideal for ML pipelines
22:45 - Semantic caching explained
24:33 - End-to-end architecture overview
25:58 - Core classes in the evaluation pipeline
27:29 - Hands-on demo setup
31:20 - Azure OpenAI and Cosmos DB configuration
33:05 - Code walkthrough: metrics and evaluation logic
36:34 - CLI interface and sample data
39:51 - Viewing evaluation results in Cosmos DB
41:18 - Exporting results to JSON
42:35 - Final thoughts and wrap-up
44:01 - Audience Q&A
47:13 - Cosmos DB vs. other vector DBs
48:08 - What makes a good score?
49:18 - Using vector search with knowledge graphs
49:50 - Closing remarks and poll reminder

#microsoftreactor #learnconnectbuild
[eventID:26291]