Loading video player...
Hey everyone, thank you for joining us
for the next session of our Azure
integration services series. My name is
Anna. I'll be your producer for this
session. I'm an event planner for
Reactor joining you from Renmond,
Washington.
Before we start, I do have some quick
housekeeping.
Please take a moment to read our code of
conduct.
We seek to provide a respectful
environment for both our audience and
presenters.
While we absolutely encourage engagement
in the chat, we ask that you please be
mindful of your commentary, remain
professional and on topic.
Keep an eye on that chat. We'll be
dropping helpful links and checking for
questions for our presenters to answer
live.
Our session is being recorded. It will
be available to view on demand right
here on the Reactor channel.
With that, I'd love to turn it over to
our speakers for today. Thank you all so
much for joining.
>> All right. Hi everyone. Um, so today
we're going to talk about how you can
secure and scale your AI workloads with
the gateway uh in Azure P Management. So
my name is Andre. I'm a senior PM in
Microsoft uh Azure API management team.
And today with me I also have Alex who
works as a global leg build um in our um
in our team.
So let's get started. Um so first of all
why we talk about AI gateway um today.
So there are still some challenges and
we talk to customers every week about
those challenges about uh on building AI
applications and agents
and even though we started kind of a
year ago with all of this kind of large
language models agents and applications
intelligent applications in particular
we still see that there are some
challenges when it comes to security
right there is still there are still
questions on how to do proper key
management uh with with your models
because you have a model you have an API
key, how you distribute, how you rotate
keys and so on. Um, we still see some uh
struggles with keeping the token
consumption and token management
practices in place. uh we we see a lot
of customers who are trying to build uh
a way to track token consumption within
the organization and also limit token
consumption within the organization to
make sure that they can distribute
whatever quotas and whatever uh model
capacity they have between different
teams. Um and also in terms of security
um guard rails or content safety becomes
important because we see more and more
different attacks such as jailbreaks or
prompt attacks um on on models [snorts]
and we see a lot of enterprises they
actually um struggle with understanding
how to better protect their models
especially if those models are deployed
to their own infrastructure
and why pay management. So essentially
if you think about like AI or artificial
intelligent it is exposed as an API
because if you think about the model uh
when you when you talk to the model you
typically use like chat completions API
or recently you can use like responses
API uh if you think about different
tools such as MCP servers you can still
treat them as APIs and now we also have
agents which can be exposed through kind
of a new protocols that are developed
out there such as A2A or agent to agent
But then the majority of them are still
rest APIs.
And ideally you want these APIs to be
managed in the same way as traditional
APIs just like uh rest and soap and
graphical APIs. You want to design,
develop, secure them, publish, you want
to monitor their behavior and then you
want to analyze how to improve the
performance, how to improve um the um uh
those APIs that you that you developed.
um and API management as uh is in a good
position to do that because we sit we
kind of a uh sit between your API
consumers or AI consumers in that case
and your APIs which can be again model
API tools APIs or agents APIs
and what we've seen over the last year
is this emerging pattern called AI
gateway where you basically put an AI
gateway like essentially like in our
case an API management gateway between
your API applica AI applications or
agents and your models dat uh data and
tools. Essentially what you can do is
you can enforce security rules, you can
ensure resiliency, scalability,
observability and you can also mediate
traffic uh to those models and and
tools.
Um so over the last years we added
simple capabilities such as for example
token limiting load balancing and um
recently we introduced more kind of
advanced capabilities. So I just want to
walk you through some of the features
that we added and then we will focus on
some of them and also we'll demonstrate
how they work in action. Um so first on
security and safety side um being an
Azure service and especially mediating
access and kind of proxying access
proxying requests to Azure Azure models
we can actually provide a keyless
authentication which means that you can
configure managed identity
authentication from Azure API management
to uh your Azure models and in that case
you don't need to uh share the single
key with anyone instead you can generate
new keys on API management side and
distribute them between different
developers, development teams,
departments or or even organizations if
you have like partner organizations that
you're working with. And then you can
also assign token limits to this keys to
make sure that you actually have the um
LLM consumption under control. And then
on safety side, we also introduced
content safety feature uh which is
currently uh GA and later we will show
and talk more about it. On resilience
aside, we introduced um a year ago a
couple of features to um to load balance
between different uh LLM endpoints such
as weighted load balancing, priority
load balancing and so on. But with the
introduction of new APIs which are more
more kind of on the stateful side which
are response API for example or
assistance API, we understood that there
is a need for session awareness in those
load balancing uh mechanisms. So that's
why we introduce session aware low
balancing
on scalability sides. So I already
mentioned token rate limit and token
quarters. Uh we introduced semantic
caching which allows you to um to save
on latency and token cost. So
essentially you can think about it as a
regular uh caching uh that we have in
API management world. But in that case
you actually uh return uh completions
for similar prompts which not
necessarily might be um the same uh word
to word but they're semantically
similar. Um next section is traffic
mediation and control. Uh so initially
again as I mentioned being an Azure
service we started with Azure OpenAI
models and Azure AI foundry models but
now we also introduce support for um any
open compatible model which means that
whatever you see in the next um uh 50
minutes of demos and features that you
see on the slide. We actually support
not only Azure models but we also
support any open air compatible model
which means that if you have model
deployed through hugging phase or like
any other inference provider or even if
you deploy model uh directly to your uh
to your own infrastructure to your own
GPUs as long as it supports chat
completions API we'll also support all
of these features for this model. Um we
also introduced support for uh other
vendors such as Amazon Bedrock which is
now GA we also support policies for that
and then with Gemini being a popular
model we also supported through this
open compatibility.
Um now on developer velocity side you
can still use all of the Azure pay
management features such as developer
portal. uh you can easily onboard um
APIs to API management through new kind
of wizard uh one-click gestures uh that
we have in API management on
observability side uh we introduced
token counting and prompts and
completions logging because we
understand that it's important uh to
improve your models to understand what
is the token consumption between
different uh different departments and
we also made it very simple to get
started because we also now have
built-in reporting dashboard that you
can use to have this kind of view on
different uh different uh uh different
aspects of uh LLM consumption
organization. And then on governance
side, we still have other policies. We
have more than 60 uh different policies
in API management that you can still
apply to your AI APIs and AI resources
such as jot validation, monitoring
policies, uh regular rate limiting and
so on.
Um so with that let me talk a little bit
about one feature in particular which is
semantic caching. Um so as I mentioned
it works pretty similar to regular
caching but here instead of caching the
exact match for request we are
calculating how two prompts uh we're b
calculating the similarity of two
prompts. So what we do is essentially on
API management side when there is a new
prompt we calcul we send it to the
embeddings model to get the vector for a
specific prompt and then we we uh we
check if there is any completion cached
for a similar vector. Uh you can
configure similarity between different
prompts you can you can make them very
similar to uh so that completion is
returned from cache or they can be
pretty uh pretty basically ambiguous so
to speak. uh when it comes to similarity
and then if there is no match we only
then we send uh this prompt to the chat
completions model and then return this
completion and again this completion is
going to be cached if there is a new
prompt which is similar to to the first
one. So in that case you you're saving a
lot on latency but you're also saving on
token cost because obviously like
generating vectors is much much cheaper
compared to uh chat completions model.
Um so with that let me hand it over to
Alex for our first demo.
>> Let's do it. Uh hello everyone. It's a
pleasure to be here and uh in this demo
we will start to walk through um
deploying and a new inference API and to
use that inference API with the
different capabilities that Andre just
described.
So within within Azure um it all starts
in uh in AI foundry right in AI foundry
is where we will deploy the the models
different kinds of models as we will see
here and uh and then after we deploy the
models we want to uh serve them through
API management so that we have all the
governance capabilities that Andrea was
just describing.
Microsoft has been uh uh doing um a lot
of work to to host different models from
different providers. We started with
OpenAI but we extended with a lot of uh
other providers from um uh Grock,
Mistro, Lama etc. Okay. And I have
already deployed here a bunch of
different models in the in this
environment. So to deploy a new model is
very easy. We just go here and we say
deploy deploy based model and then we
can search for all the provider that I
was describing and deploy the model
directly here and then the model will be
start to be served. We have here
different options in terms of u reserve
capacity etc. But after we deploy the
model the model is just available for us
to consume. And um as you can see here
in this example, I have included models
from um GPT models from OpenAI, Grock,
Lama, Mistrol. This model router is
interesting as well because it's an
abstraction of different models. So that
uh automatically selects the best model
for the prompt that was sent and API
management is compatible with the model
router as well. And then we have here
some reasoning models by four for
Microsoft etc. Okay. So a lot of
different options options. Let's let's
see now um how to serve this uh these
models through API management. So I I
have already here an API management
instance already deployed. But to deploy
a new one is very easy. Just like create
a resource like we do for any other
resource in uh in Azure and then search
for API management and then after some
minutes I will have my instance ready.
Okay.
So then I go to the API blade. That's
where I add the APIs. That's where most
of the you know the the magic with API
management happens. And here uh we added
new capabilities for different API types
and we add experiences specifically for
AI founder. So that when we select this
option we will see the resources that
have deployed and you can see here that
I have bunch already of resources
deployed and I can see here that my uh
resource that I was just showing is the
same models that I was seeing on AI
foundry. Okay. So all these models they
can be served through this uh resource
that I'm configured now. So I select the
this foundry resource. I do next. I'll
give it a name inference API. I think
it's a good name. Uh when we are
exposing the the LLMs
we can configure a base path and then
optionally we can configure products for
the for this inference API. we have
different options in terms of
compatibility with the with the client
SDKs. Okay.
And then automatically we can configure
uh different policies like the ones that
Andre just described. For this first
one, we can set limits on how many
tokens per minute it's allowed to be
consumed. Okay? And the limit is usually
applied to the subscription. Not to be
confused with the Azure subscription.
This is an API management subscription
that we can configure easily in
displayed here in the left. It's
commonly referred to API keys. It's like
giving different API keys to the model
consumer so that for the same model
deployment we can have multiple uh
consumers of the model handled by API
management through the API keys and then
we can we we may configure different
rate limits for each one of these API
consumers. So I I will uh leave this
default value.
uh there's additional options that I can
estimate the prompt tokens within API
management automatically but I will
leave it the default values and then I
can do also track a token usage okay and
here
uh the information will be sent to an
application insights and I can say send
information about the subscription ID
that is being used as additional
information like the user ID or custom
information that I can send in the other
I will hit text. I can also configure
the semantic caching uh basically with
um pointing to a to a vector store and
to embeddings model. I I will do this
later and the same for content safety
can configure it here the content safety
but I will we will do this demo later.
So at this point um I will review the
configuration and an important aspect
here is that this will use Microsoft
enter ID authentication between Azure
API management and the Azure AI foundry
resources so that we don't need to use
any keys between those services the
underlying mechanism of manage
identities through Azure will be used in
this case so Azure API management will
be allowed
to call the APIs from Azure AI foundry
and this is handled automatically with
this wizard. As the next step, we we
want to enable an important aspect here
that is um we can see the other name now
is expecting the API key. That is how
usually the consumer will send uh their
keys and then we can also enable the the
the Azure uh diagnostic log to enable
LLM logging that I will show them in a
second. Okay, so this looks good. I can
go here and check the policies that were
configured. I can open copilot
and copilot can help me to explain what
are what is this XML that is being
configured for the policies. I can add
new policies through uh this. So it's
it's pretty straightforward
and uh I guess that the next question
now is it working? So I can do a simple
test. I can search for the completion uh
operation
and then it uh provides me this testing
console ask me about the deployment ID.
So let me check. I have here different
models, right? This is the deployment
name. Let's use this first one.
I'll put it here the deployment ID
and then automatically puts here some uh
uh sample content like for example how
are you might be a some kind of hello
world or pink uh request. So in this
case deployment not found because maybe
it's something wrong
an additional space in the beginning.
Okay.
So uh you can see that they have a a
successful response and I can see that
all the details including information
about the content safety that was
applied the of course the response right
from from the model uh token counting
etc. Okay. And uh also very useful
within the testing console, I can do a
trace. And the trace is useful because
then I can see a step-by-step invocation
of the policies that I that uh I just
configured. Okay. For each of these
stages for the inbound
when the back end is called etc. So that
I have here the full details and
understand exactly what's the processing
that happens with API management. But
let's let's think about u now in this
flow. So I create the the inference API
and I enable token limits token
counting. I will enable semantic caching
later and I test API in the portal. But
does it um
does it work for example the the rate
limit the so for to test this it's
easier for us to get it closer to the
real usage of this inference API and
most of the AI engineers nowadays they
are using Python or other uh programming
language like net or Java but Python is
very popular in this case for AI
engineers. So at a certain point might
be important for us to try the same SDKs
that developers are using to test API
and this is what I'm doing here. So I'm
using the OpenAI SDK in this example.
I'm passing the the API management
gateway URL as the endpoint. The API key
is one of the keys that I have uh
created within API management and then I
can run and ch check if the API is
working. Okay, so it will it will
consume the the LLM through API
management and uh I see that it works. I
got a response but let's now test the
token rate limit. So instead of sending
just one request, I will send a couple
of them. 15 actually in this case it
will be a loop with 15 requests. And I
want to stress the the the gateway to
understand if I will start to see the
the rate limits um and the the 4 to9
response from the gateway. Okay. So it's
sending the request.
It's taking a little bit longer than I
was expecting. Okay, here we go. [cough]
So, I start to see that I have
successful responses, right? So, I get
the the response as it's supposed. I
configured a low uh rate limit just
1,000 tokens. So, that after a while, I
should now start to see the 429s. Okay,
so token limit is exceeded. Try again in
4 seconds. In this case, I'm not using
the open AI SDK because the open AI SDK
has a retry logic automatically
and with the retry logic, I will not see
this error. Okay. So, it's important to
have this kind of experimentation, this
kind of environment for me to be uh
absolutely sure that the
API is working as expected. And now I
can quickly analyze
the behavior and I can see that when it
was close to uh reach the threshold that
I have configured I start to receive the
4 to9 to avoid consuming more tokens
than it's allowed. Okay.
In uh in the another instance I have
configured the semantic caching and the
semantic caching works like Andre was
describing. We will um use the
embeddings models to to calculate the
embeddings for the prompt and then we
will check in the vector store in this
case manage radies to check if the if
that uh uh prompt is already in the
cache with a simulator uh similarity uh
threshold. In this example, uh
I'm just providing different prompts
that they have different text so that if
we apply the built-in caching mechanism
from uh from Azure API management that
just that is a key value it will not
work because the the keys are different
right the text is different but if we
apply the semantic caching uh it's
different because the the text is
different but the meaning is very
similar. Basically, we are I'm asking
very similar questions. Okay. So, and if
I run this,
I will start to to get the response.
Okay. And uh it will took more than 3
seconds the first one and then you can
see that the the next ones were pretty
fast. And if analyze the performance I
have this right more than three seconds
the first request and then all the
subsequent requests they didn't consume
tokens right uh because they took the
value that was already stored in the
cache we can all connect to the radius
cache and see uh the cache hits and
cache misses that I got so that I can
debug and understand exactly the way
that it works for this kind behavior. If
I do the same tracing with API
management, I can also see if the policy
uh gave me a cash hit or a cash miss so
that I know exactly the way that this
policy is working and that does it's not
just magic. I have full control on the
way it works. Okay.
So now let's uh move to this next part
with the prompts and completion locking.
Do you want to add something here Andre?
>> No, I think that's that's a perfect demo
that demonstrates the one of the main
values, right? Because you have similar
problems but at the same time like you
save what what was that pretty much like
10x improvement on latency for
responses, right? So
>> yes,
>> so that's that's pretty cool if you're
especially if you're building um like
chat applications that respond to like
for example if you have like a website
and you have like an FAQ type of thing
where customers ask a lot of similar
questions that is especially where it's
it can be useful because yes you
generate responses based on some
knowledge
>> but at the same time there's going to be
like a lot of repeated questions
formatted in a different way. So that
that is a perfect way to to optimize
your your intelligent app in this case.
Um so yeah and another way to to
optimize your intelligent application is
actually to log prompts and completions
because that is the only way for you to
understand how your application or in
that in that particular example your
chat application performs. What kind of
the best sort of like prompts and
completions pairs are out there. So that
then you can use that to select perhaps
a cheaper model that can perform in the
same way but obviously like tokens will
cost less for you. So that is
essentially like a way for you to
evaluate or perform evaluation um of
different models and see how they
perform in your application. Um so this
is one of the features that we have um
in API management and with that I think
we have a demo for you right Alex?
>> Yeah exactly exactly. So uh remember
that uh we started with the model
selection we deploy different models
right then we imagine that this API this
inference API that I've just configured
that is now running in in uh in
production so I will have logging about
this API but then I should apply this
continuous model evaluation process to
ensure that I'm using the best model for
the job because as we know new models
are coming like every week. Uh and uh
new models they came with better capab
capabilities sometimes more effective in
terms of pricing and in terms of uh
token usage. So we we should apply this
continuous model evaluation and then
having like simple processes to adapt
and to change the model. The API
management uh um by sitting in the
middle between the model consumers and
the model providers allows us to also uh
to tweak imagine that the the the an old
model is still being used and we might
translate to a new model automatically
so that we don't need to change the the
client application and do this kind of
uh translation. Okay. So in this case in
this demo the model is already uh live
is being served. I have an application
that is consuming the model and I'm I
want to observe the logs and the metrics
and uh then I want to export those logs
and um maybe configure some alerts to
guarantee that everything is working and
then import those logs into AI foundry
to do evaluations automatically. Okay,
so with that let me switch again to the
API management portal.
and in this case to this instance where
I have already disconfigured.
So
uh in the previous demo I didn't show
this part here
with um that is very important with the
with the Azure monitor. I just need to
come here and say enabled for the log
LLM messages and then optionally
I might log the prompts and the
completions. Okay, even if this exceedes
the the size that is here uh it will
split in different rows in the loggings
and then we have ways to concatenate
everything.
Important to mention as well
[clears throat] that uh this mechanism
works also with the streaming enabled.
Streaming is like a a nice mechanism
that allows to to stream chunks um of
the response directly to the to the
client. And um in some chat applications
it's important to have that so that the
user understand that the model is
already answering and they can see um
you know the the answer from the model
coming and is more interactive. But in
terms of handling this uh the server
side events that is the technology
underneath sometimes it's hard in order
to combine features like logging for
example. Okay. uh but the team has made
a great work to simplify this and now
with just this checkbox I can log the
prompts and completions in certain
scenarios I don't want to log the
prompts and completions because they
might have sensitive information and I
might not want to log that so it's it's
an option that is here available so I've
already disconfigured and I've run a
couple of uh requests already and now if
I go to the monitoring part and I click
on analytic ICS I have here a bunch of u
information that I can see. Okay, I can
see the timeline of requests. I can see
the the different APIs. In this case, I
just have one but I could have multiple
ones for different providers. And of
course, the Azure API management
instance is being used as an AI gateway
but can be used to serve other APIs that
I have in the enterprise and aggregate
all those requirements in the same API.
uh the concept of subscriptions that I
uh that I just described. So each
subscription is like a consumer of the
of the model and I can see here the the
requests the which ones were successful
etc. And then very important
[clears throat] go through this language
models
um option where we have uh this specific
dashboard where I can see information
for a certain time period right and then
uh have like a aggregated view on how
much prompt completion tokens were
consumed total requests and then having
a drill down of the details. In this
case, I want to see per subscription
what was the model that was used, how
many tokens were consumed, how many
requests, etc. Here I have a
distribution on the average duration for
different models, API versions, etc.
Okay, so all this information is
available uh for me to to use with um
with the with the locks. Okay, this is
exploring the the logging um the logs
from the API management that I can
uh go here and and uh uncheck directly.
Okay, let me replace this just with um
with a with a table. Okay, this is an
Azure monitor table that is uh
[clears throat] something very standard
within Azure that then I have a lot of
options uh to handle this information.
It uses this uh custoquy language so
that I can build advanced queries uh
with this data. And here is the row
data.
uh and I can see that I have important
information like the the region, what
are the number of the prompt tokens,
completion tokens, what was the model
name, if it was streamed or not, and
then information uh details about the
request and um and the response. Okay.
And then I can correlate this with other
things. So a lot of options here
available. And uh when I have this um
this kind of information then I can
extend and customize this for additional
scenarios. one that I like to describe
is this one where we can have uh we can
provide like an analytics dashboard with
a phenops view within the enterprise
because at a certain point it might be
important to really understand uh how
much tokens are being used but then to
uh express or to show the amount of
dollars or euros or any other currency
uh and translate that to that to to that
currency. See and and maybe per each
subscription each subscription might be
an individual developer or might be a
team or might be an application and I
might want to set budgets for these
different model consumers. So in this
case for example you can see here that
uh I can see the total cost and I can
see the cost quarter the budget that was
uh assigned. Okay. And this is using a
nice feature from API management that is
the the the products the product
concept. And here I'm applying different
quarters for different products and then
I'm just assigning the subscriptions for
the different products. Okay. And um and
also a nice thing about this
architecture is since it's using Azure
monitor, Azure monitor allows me to send
alerts. Alert might be a simple email,
right? But that's nothing new. Of
course, we can send an email, but we can
also do remediate actions. Like for
example in this case I'm calling a logic
app automatically and the logic app as
an action will disable the subscription
to prevent uh you know continue to
consume the model. In some use cases it
might make sense to cancel the
subscription. In others you'll just
inform send an email. So it's it's
easily customizable to these kind of
things. So for each one of these uh kind
of uh charts I can go here
and enter in this uh custo language and
adapt the query so that I extract the
information that I need uh so that I'm
top uh on what's happening with the
model consumption and I can uh easily
adjust if needed. Okay. So uh with this
let me change to yeah to this to this uh
to this query
that in this case is used
um in order to to extract the input and
output because I'm using the LLM logging
and I said that I want to to save the
prompts and the completions and um and I
put this in a format that then I can
quickly go here
and uh export to a CSV. Okay, I can also
export to PowerBI, Excel. Uh I can also
save this to a dashboard. So plenty of
options here. But in this case, I will
just export to a CSV. Very easy. The the
file was stored and now I will switch to
the to the same AI foundry resource. And
here I have a new option that is
evaluation. Okay, that is still in
preview. We should uh release this to to
G soon. And from here I will create a
new evaluation
for everyone using models. You know that
doing evolves is very important. It's
like a a a very good best practice. So I
will select this from an existing uh
data set. So now I will upload my data
set.
Okay, that is basically the file, the
CSV file that I was just extracted. I
will see a preview from this data set
here in a moment that is being loaded.
Okay.
Okay, looks good. This data set is like
for a couple of prompts uh to ask
information about um uh uh telco uh
telco service provider like what is my
plan uh how can I pay my bill these kind
of things nothing uh spectacular and
then I have an output text that was the
one that API management um track and uh
it was saved. So from here I will
provide a text criteria. I've been
providing uh some of them as templates
that I can then quickly customize here.
Okay. And then I will uh do the testing
with a selected model I will put here.
And here in the bottom you'll see that
to the user input it will be the item
out output and the out and the output
will be the item output text. This is
the way I have configured.
I can put additional messages if needed.
For now I will keep it simple. So I will
hit next. Okay. And then I will submit
my evaluation so that I start the job
and [clears throat] uh and the
evaluation will run. So this will cue
the the evaluation. It will take some um
some seconds. In this case, I don't have
a lot of rows. Okay. So depending on the
size of the data set, this can take
longer. And um the the model is already
evaluating this and will provide me a
score. Okay. So in this case I can see
that it passed 100% uh and I can go to
the data tab
and see details about this. Okay can see
that it passed. Why it passed? Let me
check. So I can see that this was the
the system prompt that was provided with
the model evaluation.
This was the input that was provided
from the data set and then I have here
a response why it was evaluated this way
and the result. This is easily
customizable with this instruction uh
from the from the beginning and I can
also use the SDK not to do this in the
UI but to do this continuously in an
automated way. Okay. So with this then I
can decide to change the model that I'm
using for a particular application
because I might get to the conclusion
that I have a model that is more
optimized uh for my use case. Okay. So
with this let let's talk also about the
other um other uh models and other APIs.
>> Yeah. So in and a nice thing about this
evals and logs um as we also see on the
slide and as I mentioned at the
beginning that we not only support Azure
models and not only model catalog from
Azure foundry we also support other
models which are compatible and the
bedrock which has kind of a unique um
API format to their API. So in that case
you don't really need to chase all of
these different logs and inputs and
outputs on different platforms. what you
can do is you can collect them on the
gateway level and essentially that
simplifies a lot of things for you. So
that was what I wanted to share here is
that we all of the policies that we
demonstrated so far all of them are
supported for other models not only
Azure models. Um and now the the next
thing that we wanted to share as well is
um the content safety.
So one of the key advantages of models
in Azure foundry and in Azure
specifically is that they have content
safety filters in place right but you
might have other models as I mentioned
you can have you can use hugging face
you can you use other providers you can
deploy them um locally and ideally what
you want to have is you want to have
like a single content safety
configuration for all of your models
across across your your environment
regardless of their on if they're on
Azure or they're on your own
infrastructure for example or any
thirdparty provider. Uh so we we're
actually integrating with Azure content
safety service uh and we have a policy
to to do that in a very simple way. So
essentially for every prompt coming to
API management to the gateway we can
send it to Azure content safety service
to check for three things. First we can
check if there is like any harmful
content such as like um uh hateful
content, sexual content, violent content
and so on. We can also you can also
create a block list uh in the content
safety service. For example, uh going
back to our uh example with the chat
application on your website, you
probably don't want your model to
respond to any question about your
competitor. So what you can do is you
can create a block list with your
competitor names. So in that case, this
chat application will not respond uh to
any questions regarding your
competitors. And the last thing which
becomes also important is that you can
configure it to shield you against any
prompt attack, prompt attacks, jailbreak
attacks and so on so that your model is
not misbehaving uh in in production
environment.
Um so yeah with that let's let's see how
it works.
>> So it it works in a very seamless way. I
could go here and let me add a new API.
In this case, I will not select this uh
AI foundry option because the model is
not hosted on AI foundry. Okay. Instead
is hosted on a face and I'm using uh an
uh an inference provider directly from
phase and that's what I will use to to
configure. So I will start here to to to
copy the URL that is being provided here
and I will select this language model
API this option. Okay, that is very
convenient for me because it simplifies
the process. I will uh give it a name.
In this case I will call it deepseek
because it will be the model that will
be provided. I will provide this URL. I
will use the deepseek path to expose
this API and then I have here these two
options an open a AI API compatible or
just a pass through that it doesn't care
what is the format it will accept all
the formats and uh we can use this
option as well so then I will pass uh
the authorization bearer token that in
this case is this uh token that face is
expecting
So that allows me to consume the model.
And here I can just configure this in
the other. I can use uh Azure key vault
to store these keys in a secure way also
if I need it. So now I will skip this uh
this policies because I want to
configure the content safety. Okay. So I
will start to select an endpoint an
existing endpoint for the Azure AI
content safety service that Andre just
described. And then I have different
options here that I can configure.
The the first one is to enable text
moderation where I have these two
options of uh four and eight severity
levels and then for each one of the
levels I can configure the policy
threshold. Okay. like for example
between zero and uh and six and just
configure the values. Okay. And uh then
I can also detect unblock prompt uh
prompts with uh with specific keywords
like the PI data like credit card
information these kind of things so that
I can lock this or competitor names like
the the example that Andre was
describing and then I can also prevent
jailbreak attacks to the model with just
a single checkbox. Okay. So I've
configured this. I will review what we
what will be configured in the policy
and I will create. Okay. Similar
experience that I did before in this
case not connecting to foundry but
connecting to a face and uh to the to
this inference provider that is being uh
shown here. Okay. So the the API will be
created
and um now as the next step I will copy
paste
an example that I put here with a
message content. You can see also that I
have here the model information as part
of the of the payload and I will use
this information. So my my deepseek API
was created. I can also go here and do
um the complete
operation. It's not a complete sorry
completion.
Okay. going to do a completion
and then uh I will replace here with the
payload uh that is being provided as as
an example. Okay, I will eat send
and let me check the answer that I have
and I see that it was successful and um
and then I can see the the the response
that was provided. Okay, simple ask
question and I can see the model
response and also I can see again usage
information etc. And uh and then after
this of course I can test with different
uh uh content in order to test the
content safety. I can put here a jail
break attack. I can use uh you know
nasty words about uh hate and those kind
of things so that I can check the the
content safety being applied uh as well.
I can do the tracing
and I can see then the response from the
content safety policy with details
exactly if it was blocked or if it was
allowed. Okay.
So basically in this then I can I can
ensure that my API is safe.
>> All right. Um so so far we talked a lot
about how to secure manage and govern
models. Uh but now we know we all know
that we are kind of an enentic error,
right? And we now have these agents that
not only just send prompts and receive
completions from models, but they also
consume tools, right? And and the one of
the main protocols that we see nowadays
being used for tool invocation is model
context protocol or MCP. And we've built
a lot of different capabilities for MCP
servers. You can create MCP servers on
top of APIs. You can proxy servers, MCP
servers and apply policies. You can uh
create on behalf of authorization
authentication flows for MCPs. Uh but
this is something that we will not be
showing today. We'll just consider it as
kind of a sneak peek to MCP capabilities
that we have in energy pay management
today. Uh I believe we have a full
onehour session next week um about MCP.
So please join that to learn more about
all of the great features that we have
for MCP. But for now, let's just focus
on like one one scenario. Um so
essentially what we're going to do in
this demo is essentially we're going to
build build a simple very simple agent
that consumes MCP on top of existing API
and API management. Uh but first we need
to figure out what's the best way to
build an agent. So at Microsoft we have
essentially can think about like three
different ways to build agents. One is
the simplest one uh but it provides like
not a lot of different like
configuration options which is copilot
studio. You can think about as kind of a
SAS where you have an agent you can
connect the tools you can uh infor
improve it with knowledge and so on.
Then we have kind of a middle ground
which is pass approach or Azure founder
where you have agents which are hosted
on behalf of you but at the same time
you have flexibility to to kind of
configure and um and adjust the behavior
of this agent pretty flexibly with a lot
of different options. And then the the
last option is kind of I option where
essentially you have all of the
different frameworks out there to build
agents but you also need to host them
you need to code them and so on such as
like uh curi since we mentioned hugging
phase small agent framework and then
also we had for a while two different
options at Microsoft to build agents. we
had semantic kernel and we had um
autogen um and with that uh recently I
believe at the beginning of the month we
actually announced that we are
introducing a Microsoft agent framework
uh which is essentially going to be like
a convergence of these two great
frameworks that we've had so far. So
semantic kernel was more about
supporting a lot of different languages
more enterprise ready with a lot of
different capabilities surrounding
production while autogen was more on the
kind of a research side of a house of
Microsoft where we were like building
new capabilities and kind of everything
from research were kind of pouring into
autogen. Um so right now these two
frameworks are merging into Microsoft
agent framework which is currently in
public preview. Um and in the next demo
uh Alex I believe will be showing how to
use Microsoft agent framework with with
MCP.
>> Yes exactly. So uh we have enabled this
MCP experience within API management and
we will start with that and then uh
there's this tool the MCP inspector is
very popular that allows us to to
understand if the MCP is is working to
do like a simple test then using this
agent framework uh and the and the AI
foundry agent service to use the the to
put it all together right to use the
inference API I to use the MCP tools to
do meaningful work in the demo will not
be very meaningful because we will just
play with the weather uh information but
then you can change to your own MCP
tools and do you know uh more
interesting work and then I want to
trace the agent behavior very important
right I want to understand what the
agent did which cools tools it called
and uh within AI foundry I can uh see
all that Okay. So again I will start on
the my API management instance but in
this case I was in the APIs blade and
now I will switch to this new a uh blade
that is the MCP servers currently in
preview. So here I configure different
MCP servers. It's very easy to configure
new ones. I can configure from an
existing API or for from existing MCP
servers. And uh I have here a bunch of
them already. Okay. So, for example, in
this case, I've configured a weather MCP
that if I go to the MCP inspector,
the tool that um the MCP community
provided to test the MCP.
I can put here the URL.
I can connect. I'm connected. I can see
the tools. Okay. I can click in one of
the tools. I can put here city name run
the tool
and I will see the response. Okay,
important for me to debug to understand
what's uh how it's working and then of
course I can play with advanced things
like authentication etc. But we will see
all that in the in the in the other
session. So uh from here I'm confident
that my uh API is uh my MCP is working.
I will switch to uh Visual Studio Code
again because I want to use these tools
with uh with an agent. Okay. And here
you can already see this agent framework
libraries being used. Okay. And uh this
is a simple request to the using the
libraries. In this case I'm not using
the NEMCP tool. I'm just routing the
request to Azure API management. And
here I'm showing that uh Azure um that
the model the the new Microsoft agent
framework works with uh with the API
management. So I have a response. But
now things will get more interesting.
Okay. Because here I will combine
my uh Microsoft learn MCP one of the
tools that I have configured that goes
through API management and then I will
ask what is Microsoft agent framework
and uh and this agent will automatically
call the the MCP tool to provide me this
answer. Okay, the MCPs they are very
powerful but of course with great powers
comes great responsibility. So sometimes
we need to have like um
we need to have an approval mode so that
we can allow or disallow to to call the
the tool. So I have here yeah uh it's a
very popular tool the MSL learn [snorts]
MCP. I think I have too many requests on
that uh on that MCP tool. Let's see
again if it works.
Okay, too many requests. No, no worries.
Okay, we we will test with um with the
weather. Okay, so uh let me test with
the weather. That is basically do the
same thing but in this case I will ask
what's the weather in Lisbon, Lisbon,
Cairo and London. Okay. So in this case
I will use the weather uh MCP tool
with a simple snippet. All these
snippets of code they are in the GitHub
repo that you can then reuse uh to do
your own experiments. So I can see that
uh the agent is ceued. Okay. And uh
and then it switched to in progress and
then in a few seconds I should have a
response. Okay. And as I expect in this
case, I put it a lot a lot of debugging
information because I want to see the
the output from each tool and in the end
the agent with the model right provide
me a summary of what's the weather in
these three different cities. What's
important here to understand if I switch
back to
if I switch to the to AI foundry is that
I have here an agents uh blade where I
can see information about the my agents
and my threads that just run. Okay. So I
will refresh. I will see that I just add
one execution
and I can try it in the background to
see details.
And here I can see the
the conversation history, right? And uh
I can see the response that was provided
and I can see the thread locks.
And as I expect three different MCP
invocations were performed and I can see
each one of them to understand the MCP
uh invocation for each so that I can
trace and debug what's happened and I
can see all the metadata information
etc. Okay, this is very powerful because
uh agents will have autonomy and I want
to control exactly what each agent uh
did. Okay. So with this let's provide a
summary.
>> Yeah. Thank you. Thank you for the demo
and and again for kind of a more
in-depth uh MCP content uh as I see that
yeah our session is going to be on
Tuesday next week. So make please make
sure to to join that to learn more about
MCP and API management. Right. So just
to summarize what you've seen today is
that we provide a lot of capabilities in
Azure management through AI gateway to
control AI agents. Uh so first of all on
security side we've seen uh keyless
authentication we're using managed
identities. Um there is a way to enhance
even that to have authorization
authentication flows on top of API
management with jaw token validation
also. Again next week you will learn
about MCP authorization flows. uh we
have content safety to ensure that u no
harmful content is provided to the model
or sent to the model or generated by the
model. We didn't really show a lot about
like load balancing but uh this is
something uh that is we have in API
management through backends feature. So
you can enable like due redundancy with
load balancing. You've seen the logging
capability to help you not only track
token consumption but also uh build
evaluations for your models to optimize
your intelligent workloads. Um for cost
efficiency we have again login
capabilities to track cost but at the
same time you also have control over
cost because you can assign token limits
per different teams. You can assign
token quotas and as we've seen with kind
of extension of the login capability
which is a phenops framework that that
Alex just showed you can create alerts
based on budgets you can actually
disable subscriptions based on budgets
which is really cool for you to get your
cost under control. Um also caching is
important right not only improving
latency but also saving cost on sending
the u the uh the prompts to the model.
Um and then uh developer velocity um u
we shown how easy it is to bring new uh
AI APIs to API management through
different options. So if it's Azure
Foundry, you just use import from Azure
Foundry. If that's the model deployed
elsewhere such as hug interface, you can
just use um LLM model import. Uh that
works with any model out there. Um and
essentially we as I mentioned we have
like developer portal in API management
where you can still publish those APIs
even though there there's there's APIs
for the models there are still APIs so
you can treat them as as such and then
on the governance side of course all of
the capabilities to monitor to audit
logs across all your agentic workloads
to see uh what kind of a tool
invocations were happening during the
agent agent call and so on. uh also
allows you to have full control across
all of your different components of of
your agentic application whether it's
like the agent itself tool or or the
model. Um so yeah with that thank you
very much for joining. Uh that's all we
wanted to share and again uh make sure
to check our out our MCP session next
week. Thank you.
>> Thank you.
[snorts]
Thank you all for joining and thanks
again to our speakers.
This session is part of a series to
register for future shows and watch past
episodes on demand. You can follow the
link on the screen or in the chat.
We're always looking to improve our
sessions and your experience. If you
have any feedback for us, we would love
to hear what you have to say. You can
find that link on the screen or in the
chat. And we'll see you at the next one.
[music]
>> [music]
[music]
[music]
[music]
[music]
Building AI-powered workflows is only the first step making them secure, reliable, and scalable is where most developers run into challenges. The AI Gateway in Azure API Management provides enterprise-grade tools to expose AI workloads safely and efficiently. In this session, you’ll learn how to manage token quotas, semantic caching, safety policies, and authentication, ensuring that your AI services perform reliably under load while staying secure. We’ll demo how to wrap AI services in API Management, apply policies for rate limiting, monitoring, and cost control, and optimize AI workload performance in production. By the end, you’ll have practical patterns and examples for turning AI capabilities into secure, production-ready APIs that your teams can confidently consume. 📌 This session is a part of a series! Learn more here - https://aka.ms/AIS/series Chapters: 00:06 – Welcome & Housekeeping 01:00 – Introducing the Speakers & Session Overview 01:54 – Challenges in Securing AI Workloads 03:05 – Why AI Needs an API Gateway 04:20 – The AI Gateway Pattern Explained 05:02 – Key Features: Security, Token Limits & Content Safety 06:07 – Load Balancing & Session Awareness 06:34 – Semantic Caching for Cost & Latency Optimization 07:03 – Supporting OpenAI-Compatible & Third-Party Models 07:58 – Developer Velocity & Observability Features 08:42 – Demo: Deploying Models in Azure AI Foundry 10:21 – Demo: Serving Models via Azure API Management 13:00 – Setting Token Limits & Logging 15:02 – Testing Rate Limits with Python SDK 17:01 – Semantic Caching in Action 20:06 – Logging Prompts & Completions 23:07 – Cost Monitoring & FinOps Dashboards 26:04 – Continuous Model Evaluation with Azure AI Foundry 30:04 – Content Safety Integration with Azure API Management 33:52 – Demo: Blocking Unsafe Prompts & Jailbreak Attacks 36:00 – Agentic Workloads & MCP (Model Context Protocol) 38:00 – Demo: Building Agents with Microsoft Agent Framework 41:00 – Tracing Agent Behavior in Azure AI Foundry 43:00 – Summary & Key Takeaways 45:00 – Upcoming Session on MCP Deep Dive #MicrosoftReactor #learnconnectbuild [eventID:26309]