Securing and Scaling AI Workloads with AI Gateway in Azure API Management | DailyDevLists

Loading video player...

Full Transcript

9,456 words • EN

Hey everyone, thank you for joining us

for the next session of our Azure

integration services series. My name is

Anna. I'll be your producer for this

session. I'm an event planner for

Reactor joining you from Renmond,

Washington.

Before we start, I do have some quick

housekeeping.

Please take a moment to read our code of

conduct.

We seek to provide a respectful

environment for both our audience and

presenters.

While we absolutely encourage engagement

in the chat, we ask that you please be

mindful of your commentary, remain

professional and on topic.

Keep an eye on that chat. We'll be

dropping helpful links and checking for

questions for our presenters to answer

live.

Our session is being recorded. It will

be available to view on demand right

here on the Reactor channel.

With that, I'd love to turn it over to

our speakers for today. Thank you all so

much for joining.

>> All right. Hi everyone. Um, so today

we're going to talk about how you can

secure and scale your AI workloads with

the gateway uh in Azure P Management. So

my name is Andre. I'm a senior PM in

Microsoft uh Azure API management team.

And today with me I also have Alex who

works as a global leg build um in our um

in our team.

So let's get started. Um so first of all

why we talk about AI gateway um today.

So there are still some challenges and

we talk to customers every week about

those challenges about uh on building AI

applications and agents

and even though we started kind of a

year ago with all of this kind of large

language models agents and applications

intelligent applications in particular

we still see that there are some

challenges when it comes to security

right there is still there are still

questions on how to do proper key

management uh with with your models

because you have a model you have an API

key, how you distribute, how you rotate

keys and so on. Um, we still see some uh

struggles with keeping the token

consumption and token management

practices in place. uh we we see a lot

of customers who are trying to build uh

a way to track token consumption within

the organization and also limit token

consumption within the organization to

make sure that they can distribute

whatever quotas and whatever uh model

capacity they have between different

teams. Um and also in terms of security

um guard rails or content safety becomes

important because we see more and more

different attacks such as jailbreaks or

prompt attacks um on on models [snorts]

and we see a lot of enterprises they

actually um struggle with understanding

how to better protect their models

especially if those models are deployed

to their own infrastructure

and why pay management. So essentially

if you think about like AI or artificial

intelligent it is exposed as an API

because if you think about the model uh

when you when you talk to the model you

typically use like chat completions API

or recently you can use like responses

API uh if you think about different

tools such as MCP servers you can still

treat them as APIs and now we also have

agents which can be exposed through kind

of a new protocols that are developed

out there such as A2A or agent to agent

But then the majority of them are still

rest APIs.

And ideally you want these APIs to be

managed in the same way as traditional

APIs just like uh rest and soap and

graphical APIs. You want to design,

develop, secure them, publish, you want

to monitor their behavior and then you

want to analyze how to improve the

performance, how to improve um the um uh

those APIs that you that you developed.

um and API management as uh is in a good

position to do that because we sit we

kind of a uh sit between your API

consumers or AI consumers in that case

and your APIs which can be again model

API tools APIs or agents APIs

and what we've seen over the last year

is this emerging pattern called AI

gateway where you basically put an AI

gateway like essentially like in our

case an API management gateway between

your API applica AI applications or

agents and your models dat uh data and

tools. Essentially what you can do is

you can enforce security rules, you can

ensure resiliency, scalability,

observability and you can also mediate

traffic uh to those models and and

tools.

Um so over the last years we added

simple capabilities such as for example

token limiting load balancing and um

recently we introduced more kind of

advanced capabilities. So I just want to

walk you through some of the features

that we added and then we will focus on

some of them and also we'll demonstrate

how they work in action. Um so first on

security and safety side um being an

Azure service and especially mediating

access and kind of proxying access

proxying requests to Azure Azure models

we can actually provide a keyless

authentication which means that you can

configure managed identity

authentication from Azure API management

to uh your Azure models and in that case

you don't need to uh share the single

key with anyone instead you can generate

new keys on API management side and

distribute them between different

developers, development teams,

departments or or even organizations if

you have like partner organizations that

you're working with. And then you can

also assign token limits to this keys to

make sure that you actually have the um

LLM consumption under control. And then

on safety side, we also introduced

content safety feature uh which is

currently uh GA and later we will show

and talk more about it. On resilience

aside, we introduced um a year ago a

couple of features to um to load balance

between different uh LLM endpoints such

as weighted load balancing, priority

load balancing and so on. But with the

introduction of new APIs which are more

more kind of on the stateful side which

are response API for example or

assistance API, we understood that there

is a need for session awareness in those

load balancing uh mechanisms. So that's

why we introduce session aware low

balancing

on scalability sides. So I already

mentioned token rate limit and token

quarters. Uh we introduced semantic

caching which allows you to um to save

on latency and token cost. So

essentially you can think about it as a

regular uh caching uh that we have in

API management world. But in that case

you actually uh return uh completions

for similar prompts which not

necessarily might be um the same uh word

to word but they're semantically

similar. Um next section is traffic

mediation and control. Uh so initially

again as I mentioned being an Azure

service we started with Azure OpenAI

models and Azure AI foundry models but

now we also introduce support for um any

open compatible model which means that

whatever you see in the next um uh 50

minutes of demos and features that you

see on the slide. We actually support

not only Azure models but we also

support any open air compatible model

which means that if you have model

deployed through hugging phase or like

any other inference provider or even if

you deploy model uh directly to your uh

to your own infrastructure to your own

GPUs as long as it supports chat

completions API we'll also support all

of these features for this model. Um we

also introduced support for uh other

vendors such as Amazon Bedrock which is

now GA we also support policies for that

and then with Gemini being a popular

model we also supported through this

open compatibility.

Um now on developer velocity side you

can still use all of the Azure pay

management features such as developer

portal. uh you can easily onboard um

APIs to API management through new kind

of wizard uh one-click gestures uh that

we have in API management on

observability side uh we introduced

token counting and prompts and

completions logging because we

understand that it's important uh to

improve your models to understand what

is the token consumption between

different uh different departments and

we also made it very simple to get

started because we also now have

built-in reporting dashboard that you

can use to have this kind of view on

different uh different uh uh different

aspects of uh LLM consumption

organization. And then on governance

side, we still have other policies. We

have more than 60 uh different policies

in API management that you can still

apply to your AI APIs and AI resources

such as jot validation, monitoring

policies, uh regular rate limiting and

so on.

Um so with that let me talk a little bit

about one feature in particular which is

semantic caching. Um so as I mentioned

it works pretty similar to regular

caching but here instead of caching the

exact match for request we are

calculating how two prompts uh we're b

calculating the similarity of two

prompts. So what we do is essentially on

API management side when there is a new

prompt we calcul we send it to the

embeddings model to get the vector for a

specific prompt and then we we uh we

check if there is any completion cached

for a similar vector. Uh you can

configure similarity between different

prompts you can you can make them very

similar to uh so that completion is

returned from cache or they can be

pretty uh pretty basically ambiguous so

to speak. uh when it comes to similarity

and then if there is no match we only

then we send uh this prompt to the chat

completions model and then return this

completion and again this completion is

going to be cached if there is a new

prompt which is similar to to the first

one. So in that case you you're saving a

lot on latency but you're also saving on

token cost because obviously like

generating vectors is much much cheaper

compared to uh chat completions model.

Um so with that let me hand it over to

Alex for our first demo.

>> Let's do it. Uh hello everyone. It's a

pleasure to be here and uh in this demo

we will start to walk through um

deploying and a new inference API and to

use that inference API with the

different capabilities that Andre just

described.

So within within Azure um it all starts

in uh in AI foundry right in AI foundry

is where we will deploy the the models

different kinds of models as we will see

here and uh and then after we deploy the

models we want to uh serve them through

API management so that we have all the

governance capabilities that Andrea was

just describing.

Microsoft has been uh uh doing um a lot

of work to to host different models from

different providers. We started with

OpenAI but we extended with a lot of uh

other providers from um uh Grock,

Mistro, Lama etc. Okay. And I have

already deployed here a bunch of

different models in the in this

environment. So to deploy a new model is

very easy. We just go here and we say

deploy deploy based model and then we

can search for all the provider that I

was describing and deploy the model

directly here and then the model will be

start to be served. We have here

different options in terms of u reserve

capacity etc. But after we deploy the

model the model is just available for us

to consume. And um as you can see here

in this example, I have included models

from um GPT models from OpenAI, Grock,

Lama, Mistrol. This model router is

interesting as well because it's an

abstraction of different models. So that

uh automatically selects the best model

for the prompt that was sent and API

management is compatible with the model

router as well. And then we have here

some reasoning models by four for

Microsoft etc. Okay. So a lot of

different options options. Let's let's

see now um how to serve this uh these

models through API management. So I I

have already here an API management

instance already deployed. But to deploy

a new one is very easy. Just like create

a resource like we do for any other

resource in uh in Azure and then search

for API management and then after some

minutes I will have my instance ready.

Okay.

So then I go to the API blade. That's

where I add the APIs. That's where most

of the you know the the magic with API

management happens. And here uh we added

new capabilities for different API types

and we add experiences specifically for

AI founder. So that when we select this

option we will see the resources that

have deployed and you can see here that

I have bunch already of resources

deployed and I can see here that my uh

resource that I was just showing is the

same models that I was seeing on AI

foundry. Okay. So all these models they

can be served through this uh resource

that I'm configured now. So I select the

this foundry resource. I do next. I'll

give it a name inference API. I think

it's a good name. Uh when we are

exposing the the LLMs

we can configure a base path and then

optionally we can configure products for

the for this inference API. we have

different options in terms of

compatibility with the with the client

SDKs. Okay.

And then automatically we can configure

uh different policies like the ones that

Andre just described. For this first

one, we can set limits on how many

tokens per minute it's allowed to be

consumed. Okay? And the limit is usually

applied to the subscription. Not to be

confused with the Azure subscription.

This is an API management subscription

that we can configure easily in

displayed here in the left. It's

commonly referred to API keys. It's like

giving different API keys to the model

consumer so that for the same model

deployment we can have multiple uh

consumers of the model handled by API

management through the API keys and then

we can we we may configure different

rate limits for each one of these API

consumers. So I I will uh leave this

default value.

uh there's additional options that I can

estimate the prompt tokens within API

management automatically but I will

leave it the default values and then I

can do also track a token usage okay and

here

uh the information will be sent to an

application insights and I can say send

information about the subscription ID

that is being used as additional

information like the user ID or custom

information that I can send in the other

I will hit text. I can also configure

the semantic caching uh basically with

um pointing to a to a vector store and

to embeddings model. I I will do this

later and the same for content safety

can configure it here the content safety

but I will we will do this demo later.

So at this point um I will review the

configuration and an important aspect

here is that this will use Microsoft

enter ID authentication between Azure

API management and the Azure AI foundry

resources so that we don't need to use

any keys between those services the

underlying mechanism of manage

identities through Azure will be used in

this case so Azure API management will

be allowed

to call the APIs from Azure AI foundry

and this is handled automatically with

this wizard. As the next step, we we

want to enable an important aspect here

that is um we can see the other name now

is expecting the API key. That is how

usually the consumer will send uh their

keys and then we can also enable the the

the Azure uh diagnostic log to enable

LLM logging that I will show them in a

second. Okay, so this looks good. I can

go here and check the policies that were

configured. I can open copilot

and copilot can help me to explain what

are what is this XML that is being

configured for the policies. I can add

new policies through uh this. So it's

it's pretty straightforward

and uh I guess that the next question

now is it working? So I can do a simple

test. I can search for the completion uh

operation

and then it uh provides me this testing

console ask me about the deployment ID.

So let me check. I have here different

models, right? This is the deployment

name. Let's use this first one.

I'll put it here the deployment ID

and then automatically puts here some uh

uh sample content like for example how

are you might be a some kind of hello

world or pink uh request. So in this

case deployment not found because maybe

it's something wrong

an additional space in the beginning.

Okay.

So uh you can see that they have a a

successful response and I can see that

all the details including information

about the content safety that was

applied the of course the response right

from from the model uh token counting

etc. Okay. And uh also very useful

within the testing console, I can do a

trace. And the trace is useful because

then I can see a step-by-step invocation

of the policies that I that uh I just

configured. Okay. For each of these

stages for the inbound

when the back end is called etc. So that

I have here the full details and

understand exactly what's the processing

that happens with API management. But

let's let's think about u now in this

flow. So I create the the inference API

and I enable token limits token

counting. I will enable semantic caching

later and I test API in the portal. But

does it um

does it work for example the the rate

limit the so for to test this it's

easier for us to get it closer to the

real usage of this inference API and

most of the AI engineers nowadays they

are using Python or other uh programming

language like net or Java but Python is

very popular in this case for AI

engineers. So at a certain point might

be important for us to try the same SDKs

that developers are using to test API

and this is what I'm doing here. So I'm

using the OpenAI SDK in this example.

I'm passing the the API management

gateway URL as the endpoint. The API key

is one of the keys that I have uh

created within API management and then I

can run and ch check if the API is

working. Okay, so it will it will

consume the the LLM through API

management and uh I see that it works. I

got a response but let's now test the

token rate limit. So instead of sending

just one request, I will send a couple

of them. 15 actually in this case it

will be a loop with 15 requests. And I

want to stress the the the gateway to

understand if I will start to see the

the rate limits um and the the 4 to9

response from the gateway. Okay. So it's

sending the request.

It's taking a little bit longer than I

was expecting. Okay, here we go. [cough]

So, I start to see that I have

successful responses, right? So, I get

the the response as it's supposed. I

configured a low uh rate limit just

1,000 tokens. So, that after a while, I

should now start to see the 429s. Okay,

so token limit is exceeded. Try again in

4 seconds. In this case, I'm not using

the open AI SDK because the open AI SDK

has a retry logic automatically

and with the retry logic, I will not see

this error. Okay. So, it's important to

have this kind of experimentation, this

kind of environment for me to be uh

absolutely sure that the

API is working as expected. And now I

can quickly analyze

the behavior and I can see that when it

was close to uh reach the threshold that

I have configured I start to receive the

4 to9 to avoid consuming more tokens

than it's allowed. Okay.

In uh in the another instance I have

configured the semantic caching and the

semantic caching works like Andre was

describing. We will um use the

embeddings models to to calculate the

embeddings for the prompt and then we

will check in the vector store in this

case manage radies to check if the if

that uh uh prompt is already in the

cache with a simulator uh similarity uh

threshold. In this example, uh

I'm just providing different prompts

that they have different text so that if

we apply the built-in caching mechanism

from uh from Azure API management that

just that is a key value it will not

work because the the keys are different

right the text is different but if we

apply the semantic caching uh it's

different because the the text is

different but the meaning is very

similar. Basically, we are I'm asking

very similar questions. Okay. So, and if

I run this,

I will start to to get the response.

Okay. And uh it will took more than 3

seconds the first one and then you can

see that the the next ones were pretty

fast. And if analyze the performance I

have this right more than three seconds

the first request and then all the

subsequent requests they didn't consume

tokens right uh because they took the

value that was already stored in the

cache we can all connect to the radius

cache and see uh the cache hits and

cache misses that I got so that I can

debug and understand exactly the way

that it works for this kind behavior. If

I do the same tracing with API

management, I can also see if the policy

uh gave me a cash hit or a cash miss so

that I know exactly the way that this

policy is working and that does it's not

just magic. I have full control on the

way it works. Okay.

So now let's uh move to this next part

with the prompts and completion locking.

Do you want to add something here Andre?

>> No, I think that's that's a perfect demo

that demonstrates the one of the main

values, right? Because you have similar

problems but at the same time like you

save what what was that pretty much like

10x improvement on latency for

responses, right? So

>> yes,

>> so that's that's pretty cool if you're

especially if you're building um like

chat applications that respond to like

for example if you have like a website

and you have like an FAQ type of thing

where customers ask a lot of similar

questions that is especially where it's

it can be useful because yes you

generate responses based on some

knowledge

>> but at the same time there's going to be

like a lot of repeated questions

formatted in a different way. So that

that is a perfect way to to optimize

your your intelligent app in this case.

Um so yeah and another way to to

optimize your intelligent application is

actually to log prompts and completions

because that is the only way for you to

understand how your application or in

that in that particular example your

chat application performs. What kind of

the best sort of like prompts and

completions pairs are out there. So that

then you can use that to select perhaps

a cheaper model that can perform in the

same way but obviously like tokens will

cost less for you. So that is

essentially like a way for you to

evaluate or perform evaluation um of

different models and see how they

perform in your application. Um so this

is one of the features that we have um

in API management and with that I think

we have a demo for you right Alex?

>> Yeah exactly exactly. So uh remember

that uh we started with the model

selection we deploy different models

right then we imagine that this API this

inference API that I've just configured

that is now running in in uh in

production so I will have logging about

this API but then I should apply this

continuous model evaluation process to

ensure that I'm using the best model for

the job because as we know new models

are coming like every week. Uh and uh

new models they came with better capab

capabilities sometimes more effective in

terms of pricing and in terms of uh

token usage. So we we should apply this

continuous model evaluation and then

having like simple processes to adapt

and to change the model. The API

management uh um by sitting in the

middle between the model consumers and

the model providers allows us to also uh

to tweak imagine that the the the an old

model is still being used and we might

translate to a new model automatically

so that we don't need to change the the

client application and do this kind of

uh translation. Okay. So in this case in

this demo the model is already uh live

is being served. I have an application

that is consuming the model and I'm I

want to observe the logs and the metrics

and uh then I want to export those logs

and um maybe configure some alerts to

guarantee that everything is working and

then import those logs into AI foundry

to do evaluations automatically. Okay,

so with that let me switch again to the

API management portal.

and in this case to this instance where

I have already disconfigured.

So

uh in the previous demo I didn't show

this part here

with um that is very important with the

with the Azure monitor. I just need to

come here and say enabled for the log

LLM messages and then optionally

I might log the prompts and the

completions. Okay, even if this exceedes

the the size that is here uh it will

split in different rows in the loggings

and then we have ways to concatenate

everything.

Important to mention as well

[clears throat] that uh this mechanism

works also with the streaming enabled.

Streaming is like a a nice mechanism

that allows to to stream chunks um of

the response directly to the to the

client. And um in some chat applications

it's important to have that so that the

user understand that the model is

already answering and they can see um

you know the the answer from the model

coming and is more interactive. But in

terms of handling this uh the server

side events that is the technology

underneath sometimes it's hard in order

to combine features like logging for

example. Okay. uh but the team has made

a great work to simplify this and now

with just this checkbox I can log the

prompts and completions in certain

scenarios I don't want to log the

prompts and completions because they

might have sensitive information and I

might not want to log that so it's it's

an option that is here available so I've

already disconfigured and I've run a

couple of uh requests already and now if

I go to the monitoring part and I click

on analytic ICS I have here a bunch of u

information that I can see. Okay, I can

see the timeline of requests. I can see

the the different APIs. In this case, I

just have one but I could have multiple

ones for different providers. And of

course, the Azure API management

instance is being used as an AI gateway

but can be used to serve other APIs that

I have in the enterprise and aggregate

all those requirements in the same API.

uh the concept of subscriptions that I

uh that I just described. So each

subscription is like a consumer of the

of the model and I can see here the the

requests the which ones were successful

etc. And then very important

[clears throat] go through this language

models

um option where we have uh this specific

dashboard where I can see information

for a certain time period right and then

uh have like a aggregated view on how

much prompt completion tokens were

consumed total requests and then having

a drill down of the details. In this

case, I want to see per subscription

what was the model that was used, how

many tokens were consumed, how many

requests, etc. Here I have a

distribution on the average duration for

different models, API versions, etc.

Okay, so all this information is

available uh for me to to use with um

with the with the locks. Okay, this is

exploring the the logging um the logs

from the API management that I can

uh go here and and uh uncheck directly.

Okay, let me replace this just with um

with a with a table. Okay, this is an

Azure monitor table that is uh

[clears throat] something very standard

within Azure that then I have a lot of

options uh to handle this information.

It uses this uh custoquy language so

that I can build advanced queries uh

with this data. And here is the row

data.

uh and I can see that I have important

information like the the region, what

are the number of the prompt tokens,

completion tokens, what was the model

name, if it was streamed or not, and

then information uh details about the

request and um and the response. Okay.

And then I can correlate this with other

things. So a lot of options here

available. And uh when I have this um

this kind of information then I can

extend and customize this for additional

scenarios. one that I like to describe

is this one where we can have uh we can

provide like an analytics dashboard with

a phenops view within the enterprise

because at a certain point it might be

important to really understand uh how

much tokens are being used but then to

uh express or to show the amount of

dollars or euros or any other currency

uh and translate that to that to to that

currency. See and and maybe per each

subscription each subscription might be

an individual developer or might be a

team or might be an application and I

might want to set budgets for these

different model consumers. So in this

case for example you can see here that

uh I can see the total cost and I can

see the cost quarter the budget that was

uh assigned. Okay. And this is using a

nice feature from API management that is

the the the products the product

concept. And here I'm applying different

quarters for different products and then

I'm just assigning the subscriptions for

the different products. Okay. And um and

also a nice thing about this

architecture is since it's using Azure

monitor, Azure monitor allows me to send

alerts. Alert might be a simple email,

right? But that's nothing new. Of

course, we can send an email, but we can

also do remediate actions. Like for

example in this case I'm calling a logic

app automatically and the logic app as

an action will disable the subscription

to prevent uh you know continue to

consume the model. In some use cases it

might make sense to cancel the

subscription. In others you'll just

inform send an email. So it's it's

easily customizable to these kind of

things. So for each one of these uh kind

of uh charts I can go here

and enter in this uh custo language and

adapt the query so that I extract the

information that I need uh so that I'm

top uh on what's happening with the

model consumption and I can uh easily

adjust if needed. Okay. So uh with this

let me change to yeah to this to this uh

to this query

that in this case is used

um in order to to extract the input and

output because I'm using the LLM logging

and I said that I want to to save the

prompts and the completions and um and I

put this in a format that then I can

quickly go here

and uh export to a CSV. Okay, I can also

export to PowerBI, Excel. Uh I can also

save this to a dashboard. So plenty of

options here. But in this case, I will

just export to a CSV. Very easy. The the

file was stored and now I will switch to

the to the same AI foundry resource. And

here I have a new option that is

evaluation. Okay, that is still in

preview. We should uh release this to to

G soon. And from here I will create a

new evaluation

for everyone using models. You know that

doing evolves is very important. It's

like a a a very good best practice. So I

will select this from an existing uh

data set. So now I will upload my data

set.

Okay, that is basically the file, the

CSV file that I was just extracted. I

will see a preview from this data set

here in a moment that is being loaded.

Okay.

Okay, looks good. This data set is like

for a couple of prompts uh to ask

information about um uh uh telco uh

telco service provider like what is my

plan uh how can I pay my bill these kind

of things nothing uh spectacular and

then I have an output text that was the

one that API management um track and uh

it was saved. So from here I will

provide a text criteria. I've been

providing uh some of them as templates

that I can then quickly customize here.

Okay. And then I will uh do the testing

with a selected model I will put here.

And here in the bottom you'll see that

to the user input it will be the item

out output and the out and the output

will be the item output text. This is

the way I have configured.

I can put additional messages if needed.

For now I will keep it simple. So I will

hit next. Okay. And then I will submit

my evaluation so that I start the job

and [clears throat] uh and the

evaluation will run. So this will cue

the the evaluation. It will take some um

some seconds. In this case, I don't have

a lot of rows. Okay. So depending on the

size of the data set, this can take

longer. And um the the model is already

evaluating this and will provide me a

score. Okay. So in this case I can see

that it passed 100% uh and I can go to

the data tab

and see details about this. Okay can see

that it passed. Why it passed? Let me

check. So I can see that this was the

the system prompt that was provided with

the model evaluation.

This was the input that was provided

from the data set and then I have here

a response why it was evaluated this way

and the result. This is easily

customizable with this instruction uh

from the from the beginning and I can

also use the SDK not to do this in the

UI but to do this continuously in an

automated way. Okay. So with this then I

can decide to change the model that I'm

using for a particular application

because I might get to the conclusion

that I have a model that is more

optimized uh for my use case. Okay. So

with this let let's talk also about the

other um other uh models and other APIs.

>> Yeah. So in and a nice thing about this

evals and logs um as we also see on the

slide and as I mentioned at the

beginning that we not only support Azure

models and not only model catalog from

Azure foundry we also support other

models which are compatible and the

bedrock which has kind of a unique um

API format to their API. So in that case

you don't really need to chase all of

these different logs and inputs and

outputs on different platforms. what you

can do is you can collect them on the

gateway level and essentially that

simplifies a lot of things for you. So

that was what I wanted to share here is

that we all of the policies that we

demonstrated so far all of them are

supported for other models not only

Azure models. Um and now the the next

thing that we wanted to share as well is

um the content safety.

So one of the key advantages of models

in Azure foundry and in Azure

specifically is that they have content

safety filters in place right but you

might have other models as I mentioned

you can have you can use hugging face

you can you use other providers you can

deploy them um locally and ideally what

you want to have is you want to have

like a single content safety

configuration for all of your models

across across your your environment

regardless of their on if they're on

Azure or they're on your own

infrastructure for example or any

thirdparty provider. Uh so we we're

actually integrating with Azure content

safety service uh and we have a policy

to to do that in a very simple way. So

essentially for every prompt coming to

API management to the gateway we can

send it to Azure content safety service

to check for three things. First we can

check if there is like any harmful

content such as like um uh hateful

content, sexual content, violent content

and so on. We can also you can also

create a block list uh in the content

safety service. For example, uh going

back to our uh example with the chat

application on your website, you

probably don't want your model to

respond to any question about your

competitor. So what you can do is you

can create a block list with your

competitor names. So in that case, this

chat application will not respond uh to

any questions regarding your

competitors. And the last thing which

becomes also important is that you can

configure it to shield you against any

prompt attack, prompt attacks, jailbreak

attacks and so on so that your model is

not misbehaving uh in in production

environment.

Um so yeah with that let's let's see how

it works.

>> So it it works in a very seamless way. I

could go here and let me add a new API.

In this case, I will not select this uh

AI foundry option because the model is

not hosted on AI foundry. Okay. Instead

is hosted on a face and I'm using uh an

uh an inference provider directly from

phase and that's what I will use to to

configure. So I will start here to to to

copy the URL that is being provided here

and I will select this language model

API this option. Okay, that is very

convenient for me because it simplifies

the process. I will uh give it a name.

In this case I will call it deepseek

because it will be the model that will

be provided. I will provide this URL. I

will use the deepseek path to expose

this API and then I have here these two

options an open a AI API compatible or

just a pass through that it doesn't care

what is the format it will accept all

the formats and uh we can use this

option as well so then I will pass uh

the authorization bearer token that in

this case is this uh token that face is

expecting

So that allows me to consume the model.

And here I can just configure this in

the other. I can use uh Azure key vault

to store these keys in a secure way also

if I need it. So now I will skip this uh

this policies because I want to

configure the content safety. Okay. So I

will start to select an endpoint an

existing endpoint for the Azure AI

content safety service that Andre just

described. And then I have different

options here that I can configure.

The the first one is to enable text

moderation where I have these two

options of uh four and eight severity

levels and then for each one of the

levels I can configure the policy

threshold. Okay. like for example

between zero and uh and six and just

configure the values. Okay. And uh then

I can also detect unblock prompt uh

prompts with uh with specific keywords

like the PI data like credit card

information these kind of things so that

I can lock this or competitor names like

the the example that Andre was

describing and then I can also prevent

jailbreak attacks to the model with just

a single checkbox. Okay. So I've

configured this. I will review what we

what will be configured in the policy

and I will create. Okay. Similar

experience that I did before in this

case not connecting to foundry but

connecting to a face and uh to the to

this inference provider that is being uh

shown here. Okay. So the the API will be

created

and um now as the next step I will copy

paste

an example that I put here with a

message content. You can see also that I

have here the model information as part

of the of the payload and I will use

this information. So my my deepseek API

was created. I can also go here and do

um the complete

operation. It's not a complete sorry

completion.

Okay. going to do a completion

and then uh I will replace here with the

payload uh that is being provided as as

an example. Okay, I will eat send

and let me check the answer that I have

and I see that it was successful and um

and then I can see the the the response

that was provided. Okay, simple ask

question and I can see the model

response and also I can see again usage

information etc. And uh and then after

this of course I can test with different

uh uh content in order to test the

content safety. I can put here a jail

break attack. I can use uh you know

nasty words about uh hate and those kind

of things so that I can check the the

content safety being applied uh as well.

I can do the tracing

and I can see then the response from the

content safety policy with details

exactly if it was blocked or if it was

allowed. Okay.

So basically in this then I can I can

ensure that my API is safe.

>> All right. Um so so far we talked a lot

about how to secure manage and govern

models. Uh but now we know we all know

that we are kind of an enentic error,

right? And we now have these agents that

not only just send prompts and receive

completions from models, but they also

consume tools, right? And and the one of

the main protocols that we see nowadays

being used for tool invocation is model

context protocol or MCP. And we've built

a lot of different capabilities for MCP

servers. You can create MCP servers on

top of APIs. You can proxy servers, MCP

servers and apply policies. You can uh

create on behalf of authorization

authentication flows for MCPs. Uh but

this is something that we will not be

showing today. We'll just consider it as

kind of a sneak peek to MCP capabilities

that we have in energy pay management

today. Uh I believe we have a full

onehour session next week um about MCP.

So please join that to learn more about

all of the great features that we have

for MCP. But for now, let's just focus

on like one one scenario. Um so

essentially what we're going to do in

this demo is essentially we're going to

build build a simple very simple agent

that consumes MCP on top of existing API

and API management. Uh but first we need

to figure out what's the best way to

build an agent. So at Microsoft we have

essentially can think about like three

different ways to build agents. One is

the simplest one uh but it provides like

not a lot of different like

configuration options which is copilot

studio. You can think about as kind of a

SAS where you have an agent you can

connect the tools you can uh infor

improve it with knowledge and so on.

Then we have kind of a middle ground

which is pass approach or Azure founder

where you have agents which are hosted

on behalf of you but at the same time

you have flexibility to to kind of

configure and um and adjust the behavior

of this agent pretty flexibly with a lot

of different options. And then the the

last option is kind of I option where

essentially you have all of the

different frameworks out there to build

agents but you also need to host them

you need to code them and so on such as

like uh curi since we mentioned hugging

phase small agent framework and then

also we had for a while two different

options at Microsoft to build agents. we

had semantic kernel and we had um

autogen um and with that uh recently I

believe at the beginning of the month we

actually announced that we are

introducing a Microsoft agent framework

uh which is essentially going to be like

a convergence of these two great

frameworks that we've had so far. So

semantic kernel was more about

supporting a lot of different languages

more enterprise ready with a lot of

different capabilities surrounding

production while autogen was more on the

kind of a research side of a house of

Microsoft where we were like building

new capabilities and kind of everything

from research were kind of pouring into

autogen. Um so right now these two

frameworks are merging into Microsoft

agent framework which is currently in

public preview. Um and in the next demo

uh Alex I believe will be showing how to

use Microsoft agent framework with with

MCP.

>> Yes exactly. So uh we have enabled this

MCP experience within API management and

we will start with that and then uh

there's this tool the MCP inspector is

very popular that allows us to to

understand if the MCP is is working to

do like a simple test then using this

agent framework uh and the and the AI

foundry agent service to use the the to

put it all together right to use the

inference API I to use the MCP tools to

do meaningful work in the demo will not

be very meaningful because we will just

play with the weather uh information but

then you can change to your own MCP

tools and do you know uh more

interesting work and then I want to

trace the agent behavior very important

right I want to understand what the

agent did which cools tools it called

and uh within AI foundry I can uh see

all that Okay. So again I will start on

the my API management instance but in

this case I was in the APIs blade and

now I will switch to this new a uh blade

that is the MCP servers currently in

preview. So here I configure different

MCP servers. It's very easy to configure

new ones. I can configure from an

existing API or for from existing MCP

servers. And uh I have here a bunch of

them already. Okay. So, for example, in

this case, I've configured a weather MCP

that if I go to the MCP inspector,

the tool that um the MCP community

provided to test the MCP.

I can put here the URL.

I can connect. I'm connected. I can see

the tools. Okay. I can click in one of

the tools. I can put here city name run

the tool

and I will see the response. Okay,

important for me to debug to understand

what's uh how it's working and then of

course I can play with advanced things

like authentication etc. But we will see

all that in the in the in the other

session. So uh from here I'm confident

that my uh API is uh my MCP is working.

I will switch to uh Visual Studio Code

again because I want to use these tools

with uh with an agent. Okay. And here

you can already see this agent framework

libraries being used. Okay. And uh this

is a simple request to the using the

libraries. In this case I'm not using

the NEMCP tool. I'm just routing the

request to Azure API management. And

here I'm showing that uh Azure um that

the model the the new Microsoft agent

framework works with uh with the API

management. So I have a response. But

now things will get more interesting.

Okay. Because here I will combine

my uh Microsoft learn MCP one of the

tools that I have configured that goes

through API management and then I will

ask what is Microsoft agent framework

and uh and this agent will automatically

call the the MCP tool to provide me this

answer. Okay, the MCPs they are very

powerful but of course with great powers

comes great responsibility. So sometimes

we need to have like um

we need to have an approval mode so that

we can allow or disallow to to call the

the tool. So I have here yeah uh it's a

very popular tool the MSL learn [snorts]

MCP. I think I have too many requests on

that uh on that MCP tool. Let's see

again if it works.

Okay, too many requests. No, no worries.

Okay, we we will test with um with the

weather. Okay, so uh let me test with

the weather. That is basically do the

same thing but in this case I will ask

what's the weather in Lisbon, Lisbon,

Cairo and London. Okay. So in this case

I will use the weather uh MCP tool

with a simple snippet. All these

snippets of code they are in the GitHub

repo that you can then reuse uh to do

your own experiments. So I can see that

uh the agent is ceued. Okay. And uh

and then it switched to in progress and

then in a few seconds I should have a

response. Okay. And as I expect in this

case, I put it a lot a lot of debugging

information because I want to see the

the output from each tool and in the end

the agent with the model right provide

me a summary of what's the weather in

these three different cities. What's

important here to understand if I switch

back to

if I switch to the to AI foundry is that

I have here an agents uh blade where I

can see information about the my agents

and my threads that just run. Okay. So I

will refresh. I will see that I just add

one execution

and I can try it in the background to

see details.

And here I can see the

the conversation history, right? And uh

I can see the response that was provided

and I can see the thread locks.

And as I expect three different MCP

invocations were performed and I can see

each one of them to understand the MCP

uh invocation for each so that I can

trace and debug what's happened and I

can see all the metadata information

etc. Okay, this is very powerful because

uh agents will have autonomy and I want

to control exactly what each agent uh

did. Okay. So with this let's provide a

summary.

>> Yeah. Thank you. Thank you for the demo

and and again for kind of a more

in-depth uh MCP content uh as I see that

yeah our session is going to be on

Tuesday next week. So make please make

sure to to join that to learn more about

MCP and API management. Right. So just

to summarize what you've seen today is

that we provide a lot of capabilities in

Azure management through AI gateway to

control AI agents. Uh so first of all on

security side we've seen uh keyless

authentication we're using managed

identities. Um there is a way to enhance

even that to have authorization

authentication flows on top of API

management with jaw token validation

also. Again next week you will learn

about MCP authorization flows. uh we

have content safety to ensure that u no

harmful content is provided to the model

or sent to the model or generated by the

model. We didn't really show a lot about

like load balancing but uh this is

something uh that is we have in API

management through backends feature. So

you can enable like due redundancy with

load balancing. You've seen the logging

capability to help you not only track

token consumption but also uh build

evaluations for your models to optimize

your intelligent workloads. Um for cost

efficiency we have again login

capabilities to track cost but at the

same time you also have control over

cost because you can assign token limits

per different teams. You can assign

token quotas and as we've seen with kind

of extension of the login capability

which is a phenops framework that that

Alex just showed you can create alerts

based on budgets you can actually

disable subscriptions based on budgets

which is really cool for you to get your

cost under control. Um also caching is

important right not only improving

latency but also saving cost on sending

the u the uh the prompts to the model.

Um and then uh developer velocity um u

we shown how easy it is to bring new uh

AI APIs to API management through

different options. So if it's Azure

Foundry, you just use import from Azure

Foundry. If that's the model deployed

elsewhere such as hug interface, you can

just use um LLM model import. Uh that

works with any model out there. Um and

essentially we as I mentioned we have

like developer portal in API management

where you can still publish those APIs

even though there there's there's APIs

for the models there are still APIs so

you can treat them as as such and then

on the governance side of course all of

the capabilities to monitor to audit

logs across all your agentic workloads

to see uh what kind of a tool

invocations were happening during the

agent agent call and so on. uh also

allows you to have full control across

all of your different components of of

your agentic application whether it's

like the agent itself tool or or the

model. Um so yeah with that thank you

very much for joining. Uh that's all we

wanted to share and again uh make sure

to check our out our MCP session next

week. Thank you.

>> Thank you.

[snorts]

Thank you all for joining and thanks

again to our speakers.

This session is part of a series to

register for future shows and watch past

episodes on demand. You can follow the

link on the screen or in the chat.

We're always looking to improve our

sessions and your experience. If you

have any feedback for us, we would love

to hear what you have to say. You can

find that link on the screen or in the

chat. And we'll see you at the next one.

[music]

>> [music]

[music]

[music]

[music]

[music]

Securing and Scaling AI Workloads with AI Gateway in Azure API Management

Microsoft Reactor

77 days ago

1:02:48

OpenAI SDK & Frameworks

Rank #2

Description

Building AI-powered workflows is only the first step making them secure, reliable, and scalable is where most developers run into challenges. The AI Gateway in Azure API Management provides enterprise-grade tools to expose AI workloads safely and efficiently. In this session, you’ll learn how to manage token quotas, semantic caching, safety policies, and authentication, ensuring that your AI services perform reliably under load while staying secure. We’ll demo how to wrap AI services in API Management, apply policies for rate limiting, monitoring, and cost control, and optimize AI workload performance in production. By the end, you’ll have practical patterns and examples for turning AI capabilities into secure, production-ready APIs that your teams can confidently consume. 📌 This session is a part of a series! Learn more here - https://aka.ms/AIS/series Chapters: 00:06 – Welcome & Housekeeping 01:00 – Introducing the Speakers & Session Overview 01:54 – Challenges in Securing AI Workloads 03:05 – Why AI Needs an API Gateway 04:20 – The AI Gateway Pattern Explained 05:02 – Key Features: Security, Token Limits & Content Safety 06:07 – Load Balancing & Session Awareness 06:34 – Semantic Caching for Cost & Latency Optimization 07:03 – Supporting OpenAI-Compatible & Third-Party Models 07:58 – Developer Velocity & Observability Features 08:42 – Demo: Deploying Models in Azure AI Foundry 10:21 – Demo: Serving Models via Azure API Management 13:00 – Setting Token Limits & Logging 15:02 – Testing Rate Limits with Python SDK 17:01 – Semantic Caching in Action 20:06 – Logging Prompts & Completions 23:07 – Cost Monitoring & FinOps Dashboards 26:04 – Continuous Model Evaluation with Azure AI Foundry 30:04 – Content Safety Integration with Azure API Management 33:52 – Demo: Blocking Unsafe Prompts & Jailbreak Attacks 36:00 – Agentic Workloads & MCP (Model Context Protocol) 38:00 – Demo: Building Agents with Microsoft Agent Framework 41:00 – Tracing Agent Behavior in Azure AI Foundry 43:00 – Summary & Key Takeaways 45:00 – Upcoming Session on MCP Deep Dive #MicrosoftReactor #learnconnectbuild [eventID:26309]

Video Details

Category

OpenAI SDK & Frameworks

Featured Date

January 12, 2026

Quality Rank

#2

AI Recommended