All right, how is everybody doing today?
I'm Sebastian and I lead product for Foundry Observability.
I'm looking forward to talking to you about all things
Foundry Observability today.
We all know that agents are non-deterministic, creating reliability and consistency challenges for developers and operators. This is why reliable AI agent development needs observability to elevate performance, quality, and safety; monitor, debug, and remediate issues; and optimize agent performance.
From that perspective, we're excited to announce in public preview observability in Foundry Control Plane, which provides visibility, monitoring, and optimization across the full AI agent lifecycle. It starts with building reliable agents early with out-of-the-box evaluations and tracing for debugging in the agent playground. As you transition to code, you then incorporate these capabilities into your CI/CD workflows. And finally, as you get to production, you get fleet-wide visibility and control.
Now let's get into a high-level overview of our announcements so you can see what's coming before we start with demos.
First, we're excited to pre-announce that our evaluation platform will be generally available shortly after Ignite for models and data sets. We are also introducing several new observability capabilities for agents, including tracing, new evaluation capabilities, production monitoring with alerts, new features that power optimization, and finally new agentic safety risks in our AI red teaming agents.
Now let's dive a little deeper into some of our
key announcements starting with tracing.
Tracing for multi-agent systems is now in public preview with enhanced OTEL semantics that power observability for any agent hosted anywhere, for many of the most popular agent frameworks. As part of our commitment to open standards and seamless interoperability with Microsoft Foundry, we actively partner with the OTEL community to continue to evolve the OTEL standard to enable continuous monitoring, tracing, and debugging that keeps up with industry advancements.
Next, let's highlight some of our key announcements for our
evaluation platform.
As mentioned earlier, evaluations for models and data sets will
be generally available shortly after Ignite.
This includes all evaluators currently in public preview.
We're also announcing public preview of evals for agents, including new risk and safety evaluators and agent-specific evaluators for tools and multi-agent systems. And underpinning all of these capabilities are flexible LLM-as-a-judge and code-based custom evaluators, which you can use to create and run context-specific evaluations for your agents.
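For readers following along in code, here is a minimal sketch of what a code-based custom evaluator can look like: with the azure-ai-evaluation SDK, a custom evaluator is essentially a callable that takes named data fields and returns metric values. The class name and scoring rule below are hypothetical illustrations, not a built-in evaluator.

```python
# Minimal sketch of a code-based custom evaluator (hypothetical example).
# With the azure-ai-evaluation SDK, a custom evaluator can be any callable
# that accepts named data columns and returns a dict of metric values.

class ResponseLengthEvaluator:
    """Scores whether the agent's response stays within a target length."""

    def __init__(self, max_words: int = 200):
        self.max_words = max_words

    def __call__(self, *, response: str, **kwargs) -> dict:
        word_count = len(response.split())
        return {
            "word_count": word_count,
            "within_limit": float(word_count <= self.max_words),
        }


# Usage: call it directly on a single row, or pass it to evaluate()
# alongside built-in evaluators (see the dataset example later).
evaluator = ResponseLengthEvaluator(max_words=150)
print(evaluator(response="Our Trailblazer rain jacket is waterproof and packable."))
```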
Now let's walk through what this looks like in practice
using a weather agent.
I know the weather here in San Francisco can be
unpredictable in the fall.
And I want to make sure that I'm fully prepared and have all of the right gear with me so that I don't get wet and I don't get a sunburn, which might not happen at this time of the year.
To ensure that the agent handles this correctly, we provide an intent resolution evaluator. So if I interact with my weather agent and I ask it about the weather, the intent resolution evaluator helps me understand whether or not the agent has resolved the intent correctly.
Once the agent has figured out the intent, it then
needs to call the right set of tools to provide
a response.
Here we provide several tool call evaluators and operational metrics to measure quality and success.
Finally, as we get to Step 3, we can use
task adherence and other evaluators to measure whether or not
the agent provided a high quality response.
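As a rough sketch of how these checks can be run from code, the snippet below invokes the preview intent resolution and task adherence evaluators from the azure-ai-evaluation package on a single query/response pair; the exact class names and arguments are preview-era and may differ in your installed version, and the judge model configuration values are placeholders.

```python
# Sketch: running agent evaluators on one interaction (preview SDK; names
# and parameters may differ in your installed version of azure-ai-evaluation).
import os
from azure.ai.evaluation import IntentResolutionEvaluator, TaskAdherenceEvaluator

# LLM-as-a-judge evaluators need a judge model deployment (placeholders below).
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # hypothetical judge deployment name
}

intent_eval = IntentResolutionEvaluator(model_config)
adherence_eval = TaskAdherenceEvaluator(model_config)

query = "What should I wear in San Francisco this fall?"
response = "Expect 55-65°F with fog; bring a light waterproof jacket and layers."

# Each evaluator returns a score plus an explanation of why it was assigned.
print(intent_eval(query=query, response=response))
print(adherence_eval(query=query, response=response))
```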
Now let's get into specific launches and demos.
We're going to start off with the first phase of agent development: building reliable agents early. And we're going to kick things off with a core set of capabilities that enable you to get comprehensive visibility into agent behavior.
So when I'm creating a new agent, you know, I
need to know that it's correct and safe.
In Foundry, we provide several new capabilities that enable you to measure the quality and safety of your agent before you take it to production.
This includes custom evaluators that you can tailor to your
use case, synthetic data set generation so you can get
started with evals within minutes, human evaluations so that you
or your team can evaluate your agent, and automated red
teaming support so you can probe for safety and security
risks.
Now we're going to show these new capabilities in action.
So to do this, I'm going to demo the Zava Outdoors catalog agent. Zava Outdoors is launching a new outdoors line, and this is a support agent that answers questions about the product catalog. As you can see, I've attached an index that contains all of the products in our catalog.
And in order to test this agent, I'm going to
use the evaluation metrics that are provided in the agent
playground to see whether or not it's doing what it's
supposed to do.
In this case, I've selected task adherence and intent resolution
to make sure that the agent is responding as intended
and adhering to the task at hand.
So I'm going to ask the agent, can you help
recommend a jacket for San Francisco in the fall?
And let's see what happens.
So ideally, the agent should respond and give me a
set of jackets that are appropriate for the current time
of the year.
And yes, it did that successfully.
As you can see, there's an evaluation that just ran
and both intent resolution and task adherence passed.
I can click into a debug window and I can
see the full agent execution.
I can see the input and output.
I can see the evaluation score along with a set
of explanations for why the specific scores were picked.
Again, these both passed.
So five out of five for intent resolution, and task adherence passed as well.
So now I have one data point that tells me
how my agent is doing.
But how do I know that my agent is performing
well at scale?
This is where our evaluation platform comes in.
If I want to run an evaluation at scale, I can go into the evaluations page and create a new evaluation. I select the agent that I want to evaluate as a target; in this case, it's the Zava Outdoors catalog agent.
I can either generate synthetic data based on the agent context, which is a great way to get started with an evaluation within minutes, or I can pick an existing data set. In this case, I'm going to pick an existing data set, just in the interest of time, so that you can see what that looks like. I get a preview that shows me all of the queries in my data set and the ground truth response and context. I can select a judge model for evaluation, I can select from pre-suggested evaluators, and finally I can run the evaluation. In the interest of time, I'm not going to do that; I'm just going to show you a completed run so you can see what that looks like.
Here's an example.
And I can see that in this case everything passed, which is a good sign, and I can move forward and take my agent to production.
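For reference, a similar at-scale run can also be kicked off from code rather than the portal. The sketch below uses the azure-ai-evaluation evaluate() function over a JSONL dataset, assuming columns named query, response, and ground_truth; file names, deployment names, and column names are placeholders.

```python
# Sketch: running an evaluation at scale over a JSONL dataset
# (column names, file paths, and deployment names are placeholders).
import os
from azure.ai.evaluation import evaluate, IntentResolutionEvaluator, TaskAdherenceEvaluator

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # judge model deployment (placeholder)
}

results = evaluate(
    data="zava_catalog_eval.jsonl",  # one JSON object per line: query, response, ground_truth
    evaluators={
        "intent_resolution": IntentResolutionEvaluator(model_config),
        "task_adherence": TaskAdherenceEvaluator(model_config),
    },
    output_path="zava_catalog_eval_results.json",
)

# Aggregate metrics across the whole dataset, e.g. pass rates per evaluator.
print(results["metrics"])
```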
Now we're going to hop back and move on to the next stage, which is monitoring and optimizing in production. And I'm excited to introduce Sam, my engineering counterpart, to take it away.
Thank you, Sebastian. Hello everyone.
Now that we've seen how to develop reliable agents early
in the life cycle, let's shift our focus to what
happens once those agents are in production.
This is where observability becomes absolutely critical, not just for
uptime, but for continuous monitoring, improvement, and trust in your
AI systems.
Before I start, quick show of hands, how many folks
have actually built an agent and your agent didn't exactly
do what you expected?
OK, so this is going to be very relevant and actually interesting to talk about.
So in production, monitoring and optimization aren't one-time tasks; they're ongoing processes. Because agents are non-deterministic by nature, we need to ensure their behavior can be continuously tamed, as it can change over time, sometimes in unexpected ways.
As you observed, to ensure quality, safety and efficiency, we
need robust tools to observe, evaluate and optimize agents as
they operate.
Our new agent monitoring dashboard in Foundry provides comprehensive insights across multiple dimensions, including continuous evals of production traffic, where we continuously take a sub-sample of queries and run your choice of evaluations on them to give insights into the performance of ongoing requests to your agents. Scheduled evals allow for custom scheduled runs for drift detection and monitoring, and red teaming to probe for vulnerabilities and adherence to policies and to offer insights into the level of protection against attacks. And finally, Azure Monitor-powered eval alerts flag operational issues, with evaluation results tied to traces to simplify debugging.
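To illustrate the continuous-evaluation idea in the abstract, here is a hedged, framework-free sketch of the underlying loop: sample a fraction of recent production traffic, score it with an evaluator of your choice, and raise an alert when the pass rate drops. The helper functions are hypothetical stand-ins, not the managed Foundry feature itself.

```python
# Framework-free sketch of continuous evaluation: sample recent production
# traffic, score it, and flag when quality drops below a threshold.
# fetch_recent_interactions() and score_task_adherence() are hypothetical
# stand-ins for your telemetry query and your chosen evaluator.
import random

SAMPLES_PER_HOUR = 20
ALERT_THRESHOLD = 0.8  # minimum acceptable pass rate


def run_continuous_eval(fetch_recent_interactions, score_task_adherence):
    interactions = fetch_recent_interactions(hours=1)  # e.g. pulled from Log Analytics
    sample = random.sample(interactions, min(SAMPLES_PER_HOUR, len(interactions)))

    passed = sum(
        1
        for item in sample
        if score_task_adherence(query=item["query"], response=item["response"]) >= 3.0
    )
    pass_rate = passed / len(sample) if sample else 1.0

    if pass_rate < ALERT_THRESHOLD:
        # In the managed experience, this is where an Azure Monitor alert fires,
        # with links back to the low-scoring traces for debugging.
        print(f"ALERT: task adherence pass rate {pass_rate:.0%} is below {ALERT_THRESHOLD:.0%}")
    return pass_rate
```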
It's worth noting that as you develop more complex agents, you may leverage multi-agent schemes, and this can have a detrimental impact on the overall execution of the agent, since errors can compound across multiple calls to sub-agents. Thus it becomes critical to be able to debug and trace these agents, pinpoint low-scoring traces, and obtain full visibility into the execution flow of each agent and their respective evaluation scores.
Our observability stack relies on data and telemetry from apps
and agents, as well as the AI platform itself, where
we host models to provide intelligent observation, insights and control.
With tight integration with Azure Monitor, and by being a core component of Foundry, you now have a comprehensive monitoring solution spanning not just your agents and AI platform, but your data services and Azure infrastructure as well.
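As a minimal sketch of how an app or agent feeds telemetry into this stack, the snippet below wires OpenTelemetry tracing to Azure Monitor using the azure-monitor-opentelemetry distro; the connection string, tracer name, and span names are placeholders, and Foundry's own tracing integration may configure much of this for you.

```python
# Sketch: sending app/agent telemetry into Azure Monitor via OpenTelemetry.
# Requires the azure-monitor-opentelemetry package; values are placeholders.
import os
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# One call configures trace, metric, and log exporters for Application Insights.
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)

tracer = trace.get_tracer("zava.catalog.agent")  # hypothetical tracer name

with tracer.start_as_current_span("handle_customer_question") as span:
    span.set_attribute("customer.channel", "web")  # example custom attribute
    # ... call the agent / model here; its spans nest under this one ...
```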
So let's talk about a demo here.
We're going to build on what Sebastian, my partner, introduced on Zava Outdoors. And as you can see, what I'm showing you is our monitor tab, where you see all your operational metrics in one place.
So as you can see, we can see evaluations, scheduled evaluation runs, and any other metrics that we want to dive into.
Specifically, I have the ability to set up continuous evaluations as I discussed, setting the sub-sample, that is, how many runs per hour I want to run. I can do scheduled evaluations based on scheduled runs that run at a specific time, as well as scheduled red teaming runs, and evaluation alerts as well (this seems to be loading). We can set up thresholds for when I want to be notified for my alerts, and this can be quite helpful and handy, letting me proactively see where the issue areas are and jump into them.
So as you can see here, we seem to have
an issue on task adherence.
Let's dive deeper here.
So I can click on view details and I can
see that there's something wrong with the task adherence for
our evaluations.
I can quickly click on the view traces that would
take me to the Traces tab.
Doesn't seem to be loading today.
And through the traces I'm able to eventually see what
are the problem areas with regards to these alerts.
So let's wait for that to load.
And so here I can quickly see my trace runs over time. I can sort them by the evaluation metrics that we discussed and deep dive into those components. We seem to be having a little bit of slowness; this is part of the demo process. But inherently, looking at conversations, I can click on various conversations and deep dive into why something is having issues. And in the interest of time, I will come back to this.
Since this demo is not working, let's go back.
So, to optimize your agents, we have built several useful
utilities to dive deeper into your agents and obtain valuable
insights.
These include the ability to compare evaluations, perform cluster analysis
to quickly pinpoint and group issues by their type and
nature, as well as an agentic chat feature that allows
you to ask an assortment of questions to Microsoft Foundry,
get insights, and even control aspects of your projects, such
as curated model deployments and upgrades.
Let's take a peek at some of these features together.
So here we can see that I have my evaluations, and I can quickly load my evaluations and compare them. In this case, I'm selecting two and I'm actually comparing these runs. I seem to be having connectivity issues again. So when I compare these runs, I'm able to do a run-to-run comparison to see the key metrics I'm falling short on. Specifically, it was task adherence. I can see what the delta was between those two runs and make progress through continuous iteration to improve my agents. Going back to the issue that you all observed: how to tame the beast.
So here we have various tools to do that.
One of them is our cluster analysis tool, for example,
that allows me to hone in on the various key
areas.
And this is a very useful tool, because you have a lot of agent runs, you have a lot of different evals, and you want to be able to hone in on key problem areas.
So in this case, we can cluster those areas and
see that, hey, some of the issues were due to
hallucinated responses.
Hallucinations happen all the time.
So here I can deep dive into unsupported, fabricated issues and look at the details of those, what happened, and what the AI suggestions are. And the AI suggestions help you continuously improve your agent development process; they give you insights into what you need to modify to make your agent adhere to your end goal, whether it be accuracy, adherence to task, relevance, or anything else in specific.
I can also have a conversation with the system.
This is something we're really proud of and it's quite
exciting.
I can click on Ask AI and say, hey, let's go ahead and upgrade our main model from GPT-4.1 to something better. So this not only allows you to ask questions and get analysis on the fly, but also to control facets of Foundry through just conversation. In the interest of time, I will show what that looks like. Essentially, you can see that it's taking me through the process to upgrade the current model.
It's identified the current model; it is indeed GPT-4.1. And in parallel, I can see that it's provided several options as to what my alternatives are. I can even go to the detailed model page and read that model card. But also, I can finally approve this and with one click upgrade my model through a curated process.
Awesome.
Let's go back.
So as a next step, I want to introduce one of our key partners, Abhi. Please welcome Abhi. He's the Vice President of Data and AI Engineering at CarMax. Abhi is a proven leader with a track record of delivering transformative data and AI solutions and driving innovation, and is now focused on shaping the future of intelligent experiences and unlocking new possibilities through emerging AI.
Thank you, Sam.
Hey, good morning, everyone.
I'm excited to be here and share CarMax's AI journey around innovation, our partnership with Microsoft, and responsible use of AI. Let's get into it.
So when we think about AI at CarMax, AI isn't
new to us.
For over a decade, we've been using various AI techniques such as supervised learning, natural language processing, computer vision, and process automation to improve our customer experience and also make our core processes more efficient.
By leveraging our data and AI, we have created a
powerful advantage, one that truly sets us apart in our
industry.
But here's the thing, AI is changing so rapidly, and
today we are in this era of generative and agentic
AI that helps us redefine what is possible.
Through Microsoft Foundry, we've experimented, we've learned, and we've taken three noteworthy use cases from prototype to production. They are: Search, a conversational search that you can use to find cars on CarMax; second, Knowledge Management, which provides prompt, fast, and accurate responses to queries for information; and the third one is Sky, which is what we're going to talk about today.
So what is Sky?
Sky is a virtual assistant.
Its purpose is to personalize and elevate the customer experience by empowering and engaging customers at the right moment, no matter where they are in their shopping journey with us. With that goal, in 2020 we launched the first version, which was powered by natural language processing, and NLP did great at the time. It detected the customer's intent and used pre-programmed, scripted flows to direct the customers.
This approach had its clear advantages.
It was predictable, controllable and reliable.
However, as we looked at customer expectations and how they were evolving, we noticed that our NLP-powered Sky was reaching its limitations. It felt rigid and less intelligent in responding to customer questions. So with generative AI, we saw the opportunity to completely reimagine Sky, making it smarter and more responsive to our customer needs and setting us up for the future.
But here's the thing: while we had worked in and built expertise with generative AI, such a complex generative AI solution or experience had never been deployed to our customer base until now. We partnered with Microsoft to redefine what used automotive retail can do for customer experience, and together we are co-creating the most intelligent, scalable, and personalized AI-powered experience in our industry.
We're truly reimagining how we can seamlessly guide both our
customers and associates through every stage of their journey.
So here's what the updated, 2.0 Sky looks like. You will see in this a customer shopping for a car, who asks Sky, hey, can you help me find similar vehicles, but maybe at a different price point? And Sky does a great job of showcasing multiple different options to the end shopper, the customer.
So let's talk a little bit more about the results, as you will see that they truly speak for themselves. We looked at the results in two parts.
On the experience side, we saw a 10 point improvement
in the net positive feedback score, which is basically telling
us that our customers are more satisfied or are getting
a better experience through the new Sky.
Second is on the efficiency side, we saw an increase
of 25% in containment.
Now, containment alone can be misleading, but when we look at it alongside experience, or customer satisfaction, we feel more confident that customers are getting their questions answered through Sky without the need for escalation. And when there is a need for an escalation, Sky can seamlessly connect customers to our customer experience center for more hands-on support.
So as you look at this and think about how we approached our deployment and scaling, the answer is: we approached it very carefully. And the reason for that is trust.
For us, every Sky interaction matters.
Every response reflects our brand.
We are a trusted brand.
And it's no surprise that our approach to AI is guided by those core values of honesty, integrity, and transparency, which is why we saw a need for a comprehensive framework for responsible AI. And that's what led us to create this evaluation framework for Sky 2.0, which we built using Microsoft Foundry and Log Analytics.
And think about it, this has two parts to it. The first one is what we get out-of-the-box from Microsoft Foundry. And second, we had to build our own custom evaluators to supplement our use cases and our need to measure and evaluate our AI.
So let's talk a little bit more about what we are using and getting out-of-the-box from Azure. We think about it in three parts. The first is runtime guardrails, the second is evaluations such as safety and jailbreak, and the third is AI-specific tracing. When you're using AI, you want to know: what was the input prompt, what was the output, and what tools were called by my LLM. For that AI-specific tracing, we are using a combination of OpenTelemetry as well as Azure Monitor.
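As a hedged sketch of that combination, the snippet below enables Semantic Kernel's experimental GenAI diagnostics and exports the resulting gen_ai spans to Azure Monitor. The environment-variable switches reflect Semantic Kernel's documented experimental settings at the time of writing and may change; deployment and endpoint values are placeholders, and this is not CarMax's actual configuration.

```python
# Sketch: capturing Semantic Kernel gen_ai.* spans in Azure Monitor.
# Environment variables and parameter values below are illustrative and
# may differ between Semantic Kernel releases.
import os

# Enable gen_ai span emission; the *_SENSITIVE flag additionally records
# prompts and completions in telemetry (a privacy trade-off).
os.environ["SEMANTICKERNEL_EXPERIMENTAL_GENAI_ENABLE_OTEL_DIAGNOSTICS"] = "true"
os.environ["SEMANTICKERNEL_EXPERIMENTAL_GENAI_ENABLE_OTEL_DIAGNOSTICS_SENSITIVE"] = "true"

from azure.monitor.opentelemetry import configure_azure_monitor
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)

kernel = Kernel()
kernel.add_service(
    AzureChatCompletion(
        deployment_name="gpt-4o",  # placeholder deployment name
        endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
    )
)
# From here, chat completions and tool calls made through the kernel emit
# OpenTelemetry spans (inputs, outputs, tool invocations) to Azure Monitor.
```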
So with that, let's see, let's look at a demo
of how we are using these tools.
To give an example of how we use Azure Monitor
and Log Analytics, we're going to go back to the
example we showed earlier.
Hey Sky, can you help me find cars similar to
this one, but a little cheaper?
And as you can see, Sky replies with this helpful car carousel and some text.
Let's take a look at what that looks like in
telemetry.
So if we take a look here, all of these gen_ai telemetry traces are actually built in with OpenTelemetry into Semantic Kernel. And this is what we use for monitoring and also occasionally for evaluations.
So you can see here, this is the user message.
Hey, Sky, can you help me find cars similar to
this one?
If we go to the next message here, you can
see the generative agent uses a tool call here.
This is where it's searching for the vehicle.
You can see it's actually looking for that specific vehicle
that the user was talking about.
So it can like figure out the details on that
and then get more information from there.
This is the results of that tool call.
If I click show more here, you can see a massive JSON object of the entire result.
And then once Sky has made a couple more tool calls, as you can see here, it gets back to the response that we were just looking at.
Here are some options.
Now moving on in the conversation, I say: so it gave me three options, what are the differences between two and three?
And you can see here Sky gave me a very helpful bulleted list of the differences between each: the mileage is different, the color is different, things like that.
And that looks very, very similar.
So if we go back here again, this is the
user message and then we get the assistant response.
Now, in this case, it's a one to one message
to response because in this case, Sky already had all
the context it needed with the prior messages, the prior
tool calls to make those comparisons.
Before we had, you know, multiple tool calls.
Here we just had the one.
Great.
So let's talk a little bit more about what we are doing for customization. We got a lot of good out-of-the-box capabilities, but for our use cases, we knew that we had to build some more custom LLM-as-judge evaluators to help us with a few other things that we wanted to measure and monitor, like legal adherence.
So every conversation that Sky would have with our customers,
we want to make sure it's in line with our
legal guardrails.
Same thing with intent detection.
We want to make sure that Sky is truly understanding
the intent of the conversation with the customer.
And there are a few others that are listed over
here.
All of these evaluators we built using the ecosystem in Azure, such as Machine Learning Pipelines, Log Analytics, Azure Monitor, Azure Monitor Workbooks, and Azure DevOps.
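To give a flavor of what a custom LLM-as-judge evaluator like legal adherence might look like, here is a hypothetical sketch that asks a judge model whether a response follows a guideline and returns a status plus a reason. The guideline, prompt wording, and deployment name are illustrative assumptions, not CarMax's actual evaluator.

```python
# Sketch of a custom LLM-as-judge "legal adherence" evaluator.
# The guideline, prompt wording, and deployment name are hypothetical.
import json
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)


def legal_adherence(guideline: str, user_message: str, assistant_response: str) -> dict:
    """Ask a judge model whether the response follows the guideline."""
    judge_prompt = (
        "You are reviewing a virtual assistant's response against a guideline.\n"
        f"Guideline: {guideline}\n"
        f"User: {user_message}\n"
        f"Assistant: {assistant_response}\n"
        'Reply with JSON: {"status": "successful" or "failed", "reason": "..."}'
    )
    completion = client.chat.completions.create(
        model="gpt-4o",  # judge deployment (placeholder)
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)


print(legal_adherence(
    guideline="Avoid implying that a specific car is safe.",
    user_message="Is this SUV a safe choice for my family?",
    assistant_response="It has these listed features; please review its safety ratings yourself.",
))
```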
So let's look at a demo of how all of
these come to life for us.
We utilize Azure AI Foundry and CI/CD in a couple of key ways. One of the main ways is running daily evaluations. Every single Sky response that a customer sees is emitted as Log Analytics telemetry. And once a day, we have a Python script that runs, collects these, generates a data set, and then runs an evaluation in AI Foundry.
So that's what this pipeline looks like here in Azure
DevOps.
As you can see, it runs every single day and
has been running for a while now.
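A hedged sketch of the general shape of such a daily job is below: query the previous day's responses out of Log Analytics, write them to a JSONL dataset, and submit an evaluation run. The custom table and column names, the chosen evaluator, and the judge configuration are assumptions for illustration, not CarMax's actual script.

```python
# Sketch of a daily evaluation job: pull yesterday's responses from
# Log Analytics, build a JSONL dataset, and run an evaluation.
# Table/column names and evaluator wiring are illustrative assumptions.
import json
import os
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient
from azure.ai.evaluation import evaluate, TaskAdherenceEvaluator

logs = LogsQueryClient(DefaultAzureCredential())

# Hypothetical custom table/columns holding each user query and Sky response.
kql = """
SkyResponses_CL
| project user_query_s, sky_response_s
| take 500
"""
result = logs.query_workspace(
    workspace_id=os.environ["LOG_ANALYTICS_WORKSPACE_ID"],
    query=kql,
    timespan=timedelta(days=1),
)

with open("sky_daily_eval.jsonl", "w") as f:
    for row in result.tables[0].rows:
        f.write(json.dumps({"query": row[0], "response": row[1]}) + "\n")

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # judge deployment (placeholder)
}
evaluate(
    data="sky_daily_eval.jsonl",
    evaluators={"task_adherence": TaskAdherenceEvaluator(model_config)},
    output_path="sky_daily_eval_results.json",
)
```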
We move over to AI Foundry.
This is what it looks like there.
These are all of the legal adherence runs within our
evaluations blade here.
And if we click into one of these, you can
see all of this very useful metadata that comes involved
with this.
So we have the actual Sky version.
We have the Git hash of the eval code that's
being run against it, and we also have the data
set name and the evaluator type.
Now for evaluators, we use AI Foundry to store those as well.
There's a lot of built in Microsoft ones.
We've also had to build a few of our own.
And just to dive into the adherence evaluator for a
second, we actually had specific guidelines, legal guidelines we wanted
Sky to adhere to before putting it in front of
a customer.
So an example of that here.
These are three example guidelines, three example Sky responses, and
then what the LLM judge thought.
So, for example, did the assistant avoid implying that a car is safe?
This is what the user sent to Sky.
This is what Sky responded.
And then the LLM judge had to decide if this
response passes this guideline.
And it did.
And the reason the LLM judge gave was that the response does not imply that the car is safe. So it gives it the status of successful.
The other way we utilize Azure AI Foundry and CI/CD is in the Sky deploy pipeline itself.
So after we deploy to QA, we see here we
have this blocking evaluation gate step.
And so every time we deploy Sky, we run our
full suite of evaluations up against it.
And that step looks like this nice little Python table
here with a visual indicator as to whether or not
our scores are passing.
Now, some of these scores fluctuate a little bit run to run. So that's why this is not a fully automated step; a human reviewer has to look at these scores to aid in that process. We also have all of these scores and the expected thresholds saved in a markdown file in GitHub, just for easy access.
And then at this point, a human can decide whether
to approve or reject the Sky release.
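For illustration, a gate step like that can boil down to something as simple as the sketch below: load the latest scores, compare them with the recorded thresholds, and print a pass/review table for the human approver. Metric names, file names, and thresholds are hypothetical.

```python
# Sketch of an evaluation gate step: compare the latest evaluation scores
# with expected thresholds and print a table for the human reviewer.
# Metric names, file names, and threshold values are hypothetical.
import json

EXPECTED_THRESHOLDS = {  # in practice, kept alongside the repo (e.g. a markdown file)
    "legal_adherence": 0.95,
    "intent_detection": 0.90,
    "task_adherence": 0.85,
}

with open("eval_scores.json") as f:  # produced by the evaluation run
    scores = json.load(f)

print(f"{'Metric':<20}{'Score':>8}{'Threshold':>12}  Status")
all_passed = True
for metric, threshold in EXPECTED_THRESHOLDS.items():
    score = scores.get(metric, 0.0)
    ok = score >= threshold
    all_passed = all_passed and ok
    print(f"{metric:<20}{score:>8.2f}{threshold:>12.2f}  {'PASS' if ok else 'REVIEW'}")

# Scores fluctuate run to run, so a low score flags the release for human
# review rather than automatically failing the pipeline.
print("All metrics within thresholds." if all_passed else "Needs human review before release.")
```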
Great.
So let's talk a little bit more about what's next
for us.
So as we think about the future of AI at CarMax, we are going to build more agents, and as you have more agents, you need to improve orchestration of those agents. And this is all in line with how we enable more agentic AI for us.
We talked a lot about the new features and functions from Microsoft Foundry that are coming for observability.
We are excited to explore that and see what we get out-of-the-box and what we need to build as custom evaluators.
And lastly, I do anticipate that as our use cases
expand, we will need to build more evaluators.
This is in line with ensuring that we are still using our AI in a responsible way.
So our journey from traditional AI to generative AI has
fundamentally changed how we serve our customers and the results
speak for themselves.
We are seeing better experiences and improved efficiency, and we now have a foundation for innovation.
But truly, what excites me is that we're just getting
started.
There's so much more for us to do.
And with our robust sets of evaluation, I feel really
confident that we are positioned to deliver even greater value
in the years to come.
And with that, I hope that our CarMax journey has
given you the insights and the confidence that you need
to be successful in your AI journey.
And with that, I would like to now call Sam
and Sebastian back on the stage to wrap us up.
Awesome.
Thank you, Abhi.
That was fantastic.
It's great to see CarMax incorporate observability into every step
of their development life cycle.
It's really fantastic how they've used custom evaluators, customized them to their use cases, and scaled out their Sky experience.
Now we're going to showcase how you can scale agent
fleet management with observability and governance.
Let's get into the final phase of the AI agent
life cycle.
As more and more agents are deployed across your organization,
it's absolutely critical for management and oversight to be centralized.
Our newly announced Foundry Control Plane is your destination to
manage, observe, and govern your agent fleet.
So if you go from one or two agents you
know, to hundreds of agents, that's the place where you
can go and view all of your agents, see what
they're doing, and govern them from that perspective.
We're excited to announce a bunch of new capabilities in
the Foundry Control Plane that are powered by observability, including
our new Agent fleet dashboard that shows key performance metrics
and alerts and assets inventory for all of your agents,
models and tools, and a registration flow that enables you
to register and observe any agent built using many of
the most popular agent frameworks.
So now I'm going to hand it off to Sam
and he's going to demo and show you what it
looks like.
Awesome.
Thank you so much, Sebastian.
OK folks.
So let's talk about operations. As we discussed, our operate tab experience is quite holistic.
It allows you to see your active alerts in one
place.
We can see that for this Zava Outdoors project, for example, we have 83 agents, all operational, in one shot. I can quickly see my estimated cost and trends; specifically, how I am trending week over week or month over month. And in specific, one of the awesome things that we're proud to showcase is that Microsoft Foundry is not just one place where you can see your agents; you can also see non-Foundry agents here as well.
So this is truly exciting.
So I can actually go in and register an agent that was built, say, on Google Cloud with Vertex, or on AWS AgentCore, or anywhere else, and actually register and bring it in here, so that I have one holistic view in my operate tab and can provide protection to all our agents.
So here, in the interest of time, I won't walk through this, but you can basically bring in your agent URL, your endpoint URL, and your OTEL specifications, define it here, and set up your agents.
And once it's set up, it'll basically look like this.
You'll see in your assets view a set of agents.
So I can actually go through various agents that have
been brought in from outside.
So in specific non Foundry agents and be able to
see their traces, be able to deep dive and eventually
have some level of protection and support for them.
So here we can see that there are several agents.
If I even change the source to custom, we can see our agents that have been developed and built on GCP. So mind you, connectivity is a little slow, and then we'll come back, by the way, to the previous example that I was not able to show, and show that.
So these are agents that were built in GCP Vertex or AWS, and I can quickly go and click on one of them here, for example, and see the traces within it. This is the beauty of what we've built. These traces are actually coming from Google Vertex, and I can see that, for example, for this one, there are all of these complexities being handled. I can see what the input and output were and the evaluation scores across them. So inherently we've gone ahead, somebody's built an agent outside, and we're managing it within our operate tab here, which is awesome.
So that uses the OTEL, correct? It uses the OpenTelemetry semantics to bring in those traces and logs and has one place to view them all?
That's right.
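To make that concrete, here is a hedged sketch of what an externally hosted agent might emit: manually created OpenTelemetry spans carrying GenAI semantic-convention attributes, exported to the same Application Insights resource that Foundry reads from. The gen_ai.* attribute set is still evolving in the OpenTelemetry spec, and the connection string, agent, and model names below are placeholders.

```python
# Sketch: an agent hosted outside Foundry emitting spans with GenAI
# semantic-convention attributes so they show up in the same trace views.
# Connection string, model, and agent names are placeholders; some gen_ai.*
# attributes are still stabilizing in the OpenTelemetry spec.
import os
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)
tracer = trace.get_tracer("external.vertex.agent")

with tracer.start_as_current_span("invoke_agent support-agent") as span:
    span.set_attribute("gen_ai.operation.name", "invoke_agent")
    span.set_attribute("gen_ai.agent.name", "support-agent")
    span.set_attribute("gen_ai.system", "vertex_ai")        # well-known value may differ
    span.set_attribute("gen_ai.request.model", "gemini-1.5-pro")
    # ... run the agent, adding child spans for model and tool calls ...
    span.set_attribute("gen_ai.usage.input_tokens", 512)
    span.set_attribute("gen_ai.usage.output_tokens", 128)
```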
So I actually want to take a minute and go
back and showcase our traces which I was not able
to show prior.
So in this case, as you recall, here in the settings I have the ability to show continuous evaluations and scheduled evaluations. I can set up scheduled runs, whether they run hourly or daily.
I can do scheduled red teaming runs, same thing with
a schedule as well as set up my evaluation alerts.
And so here we can go into a specific set of traces that, due to connectivity, I was not able to show earlier.
And here I can quickly look at, say, a particular
conversation or something that is not doing so well.
So here we can see that for this one, the
task adherence was inherently 0.
Let's try to deep dive and see what's going on
here.
So when I look at these evaluations, we can see here, oh, it says the assistant claimed to have Explorer tents available and provided detailed product information. But this is a clear example of hallucination. How would it be able to answer what to suggest when it doesn't have the right tool call? It's hallucinating, clearly.
And so here we can learn from this and be
able to inspect what was the input and output.
The user says, do you have an Explorer tent? It replies, yes, we have an Explorer tent; it's listed, Explorer, blah, blah, blah. But it didn't actually make a tool call; it's hallucinating.
So you can actually leverage our utilities here to hone
in on key issue areas within the operation of your
agent to make better, more decisive decisions in that element.
OK great.
So we talked about operations, we talked about management of third-party agents, and now we'll go back.
Yeah, that was awesome.
That was a really, really great demo and exciting to
see it all come together in the Foundry control plane.
We have a bunch of related sessions that we encourage you to attend, specifically BRK202, which starts right after this session. That session is going to go even deeper into the Foundry Control Plane, covering compliance, security, and a bunch of the governance features.
We also have several related sessions and labs that incorporate
many of the features that were shown today.
We're excited for you to explore all of these. You can learn more from our documentation; check out what's new in Microsoft Foundry, and we're looking forward to your feedback as you try out the new capabilities.
And with that, we'd like to thank you for attending
the session and we're looking forward to seeing you all
at the Foundry control plane observability booth.
Thank you very much.
Thank you so much and have a great Ignite.
Ready to manage every agent, everywhere? Don't fly blind, get total visibility into your agent fleet. Foundry Control Plane and Observability gives you the dashboards, diagnostics, and optimization tools to run AI with confidence. If you care about uptime and impact, this Foundry session is for you.

To learn more, please check out these resources:
* https://aka.ms/ignite25-plans-ManageGenAILifecycles

Speakers:
* Abhi Bhatt
* Sebastian Kohlmeier
* Sam Naghshineh

Session Information:
This is one of many sessions from the Microsoft Ignite 2025 event. View even more sessions on-demand and learn about Microsoft Ignite at https://ignite.microsoft.com

BRK190 | English (US) | Innovate with Azure AI apps and agents, Microsoft Foundry
Breakout | Advanced (300)
#MSIgnite, #InnovatewithAzureAIappsandagents

Chapters:
0:00 - Details on evaluation platform and new risk and safety evaluators
00:03:04 - Weather agent demo showing intent resolution and evaluation process
00:08:00 - Sam introduces production observability and continuous evaluation tools
00:10:05 - Comprehensive Observability Stack and Integration with Azure
00:15:35 - Introduction of Partner CarMax and VP of Data & AI Engineering, Abhi
00:26:14 - Deployment gate process ensuring each Sky release passes AI evaluations before rollout
00:26:51 - Scores and thresholds stored in GitHub for easy access
00:26:59 - Human decision on approving or rejecting Sky release
00:34:19 - Session wrap-up, related sessions announced, and closing remarks