Back in 1995, when some of you were probably not even born, I was starting college and I had two choices: study electrical engineering at IIT Kanpur, which was close to my hometown, or study computer science at a faraway college. I really wanted to study computer science, but even more than that I wanted to stay closer to my parents. So the choice was made: I took electrical engineering.
Now, in the first year, in the intro programming course, an instructor told us that you can multiply two 2x2 matrices using seven multiplications instead of eight.
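(For readers curious about the trick the instructor was alluding to: it is Strassen's construction, which trades one multiplication for extra additions. Here is a small Python sketch of it; the algorithm is standard, the code itself is my illustration.)

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications (Strassen)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    # the seven products
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # recombined using only additions and subtractions
    return ((m1 + m4 - m5 + m7, m3 + m5),
            (m2 + m4, m1 - m2 + m3 + m6))
```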
That was very intriguing. I wondered: how can you do that? At that time there was no Google, much less ChatGPT. So I went back to the instructor and asked how you do this, and his response was that this is premium material, only taught to the computer science students. That upset me enough that I rewrote the joint entrance examination and secured a better rank, to be able to study computer science at the very same institute.
I followed that up with a PhD in computer science at Berkeley, and then, once I got trained as a computer scientist, one of the first problems that I decided to study at Microsoft Research was how you can automatically synthesize such complicated algorithms. It took me five years to get my revenge: by then we had a paper that could do just this. And as the title of the paper suggests, we first had to develop program verification techniques — techniques that can verify the correctness of such algorithms — and then we translated that into techniques for automatically synthesizing such programs. In fact, this paper received a most influential paper award a few years ago, and life came full circle for me when the very same institute honored me with its Distinguished Alumnus Award.
Now, I'm so glad that ChatGPT didn't exist at that time, because if it did, and if you asked it this question, it would actually give you the correct algorithm. And not just this — it can give you algorithms for many things. How about multiplying two 2x2 matrices using six multiplications, which we know is mathematically impossible? Maybe it's time now to go back to program verification, to be able to check the correctness of programs. Only when we do this will we be able to increase the reliability and robustness of the AI methods that generate these programs and algorithms, and this is also what will facilitate AI-driven discovery of new algorithms to begin with.
Now, in this talk you'll see many different techniques to reduce hallucinations. But let me ask you this: what is the simplest thing you can do to check that this program is incorrect? Exactly right — test it out on a few input-output examples. Now, these input-output examples are a very interesting form of specification, because they can be validated — in fact, the only form of specification that you can validate easily. Next, I'm going to show you some domains where input-output examples are the easiest specifications to provide, and even more natural than natural language.
So, back to 2009. I was returning from a conference, and there was a lady sitting next to me in the airport. She was very impressed to know that I have a PhD in computer science and that I work for Microsoft. So I offered to help her with the task that she was struggling with. She opens up her laptop, pulls up Excel, shows me a column of names, and asks me how she can transform these names, by giving me an example of what she wanted. Now, at that time I had no idea about the programming model underneath Excel, so I somehow escaped myself out of the situation. But after I got back home, I searched for a solution to this problem on Excel help forums. And that's when I realized that many, many people struggle with simple repetitive tasks like that of the airplane lady, and they would communicate their intent to an expert on a help forum through input-output examples. This is what inspired me to develop a feature called Flash Fill that can automate tasks like these. You give just one example of your intent, and it will generalize this example into a program and run that program to automate the task for you. The key breakthrough here was the ability to synthesize these programs in 0.1 seconds, often from just one example.
So let me show you how the overall underlying architecture works. The first idea was to restrict the search to an underlying domain-specific language. In the case of string transformations, such a language would consist of operators like regular expressions, a substring operator, concatenation, and some limited form of conditionals. So instead of searching for a program over a full programming language like Python, we search for a program over this domain-specific language.
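To make this concrete, here is a tiny, hypothetical rendering of such a string-transformation DSL in Python. It is an illustration only, not the actual Flash Fill grammar: programs concatenate constant strings and substrings whose boundaries are located by regular-expression matches.

```python
from dataclasses import dataclass
import re

@dataclass
class ConstStr:                      # a literal string
    s: str
    def run(self, inp): return self.s

@dataclass
class SubStr:                        # substring between two regex positions
    start_regex: str                 # position = end of first match
    end_regex: str                   # position = start of first match
    def run(self, inp):
        start = re.search(self.start_regex, inp).end()
        end = re.search(self.end_regex, inp).start()
        return inp[start:end]

@dataclass
class Concat:                        # concatenation of sub-programs
    parts: list
    def run(self, inp): return "".join(p.run(inp) for p in self.parts)

# Example program in this DSL: "Sumit Gulwani" -> "Gulwani, Sumit"
prog = Concat([SubStr(r"\s", r"$"), ConstStr(", "), SubStr(r"^", r"\s")])
```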
Now, one idea to search for programs over a domain-specific language would be to enumerate all programs in order of increasing size and then check which of these is consistent with the examples. But this is not going to scale, because even though this DSL is relatively small compared to Python, it is still infinite and has lots of constructs.
So the second key idea was to do a goal-directed search instead of an enumerative search, where we prune many parts of the search space based on logical reasoning. Essentially, what you do is take the input-output examples and back-propagate them through the structure of the grammar, using inverses for the various operators that you have in the grammar. This symbolic back-propagation yields multiple goals at each substep, and these goals can then be explored in the order of their likelihood to succeed. We use machine learning techniques to predict which goal to explore first, and sometimes this provides a 10x improvement in speed.
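As a rough illustration of what backpropagating examples through the grammar means (my own simplified sketch over the toy DSL above): to satisfy an example for a two-part Concat, every way of splitting the desired output into a prefix and a suffix becomes a pair of subgoals, one per part, and a learned ranker decides which pairs to explore first.

```python
def concat_subgoals(output: str):
    """Inverse semantics of a two-part Concat: each split of the desired
    output yields one candidate pair of subgoals (prefix goal, suffix goal)."""
    return [(output[:i], output[i:]) for i in range(len(output) + 1)]

# The example "Sumit Gulwani" -> "Gulwani, Sumit" produces subgoal pairs such as
# ("Gulwani", ", Sumit"); each side is then solved recursively, and a learned
# ranker orders the pairs by their likelihood of success.
print(concat_subgoals("Gulwani, Sumit")[:3])
```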
Now, at the end of this process you have many programs which are consistent with the small number of input-output examples that the user has provided, and we need to rank these programs in order to pick the one that the user likely intended. One of the important ways to do this ranking is to look at program features: we prefer programs that are small and that use few constants. Another interesting idea that we discovered was to actually run these programs on the remaining inputs in the spreadsheet and see which program generates outputs that are more uniform — that program is more likely to be the intended one.
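Here is a hedged sketch of such a ranking function over the toy DSL from earlier. The features and weights are made up for illustration; the point is the combination of structural preferences with an output-uniformity signal computed on the unlabeled rows.

```python
def program_size(p):
    """Number of operator nodes in a program built from the toy DSL above."""
    return 1 + sum(program_size(c) for c in getattr(p, "parts", []))

def num_constants(p):
    return (1 if type(p).__name__ == "ConstStr" else 0) + \
           sum(num_constants(c) for c in getattr(p, "parts", []))

def rank_score(program, remaining_inputs):
    """Lower is better: prefer small programs with few constants whose outputs
    on the rest of the spreadsheet look uniform (here: few distinct lengths)."""
    outputs = [program.run(x) for x in remaining_inputs]
    uniformity_penalty = len({len(o) for o in outputs})
    return program_size(program) + 2 * num_constants(program) + uniformity_penalty
```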
So this is almost 15-year-old work, but it still stands the test of time. The idea of using domain-specific languages to restrict the output is very relevant in the days of LLMs as well: last year OpenAI released an API where you can enforce that the output actually belongs to some structure that you specify. And this overall template of generating lots of candidates and then ranking them using tools that are not available during generation time is also a very popular idea with the use of LLMs. Flash Fill became very popular and made it into middle school computing textbooks. But the best recognition that I ever received for this work was this quote from my late father.
Okay. So the user experience of Flash Fill is so inviting that it doesn't prevent people from trying it on tasks that it was never meant for — for example, date and number transformations. There was this tweet that went viral when someone gave an example mapping Dec to December and Flash Fill autocompleted the rest. Some people came to my rescue, claiming that this is how we should have named months in the first place. I also came across a case where someone had incorporated Flash Fill as part of a process automation pipeline. But you know, the best thing about this kind of customer feedback is that it tells you what the next set of problems to work on are. And so we extended Flash Fill to support a much more sophisticated class of transformations, including date and number transformations.
For instance, look at this task where we want to map the date in each row to the corresponding quarter. From just a couple of examples, it figures it out. But not only that — it can actually generate readable formulas in many different languages, including the Excel formula language, which was one of the most asked-for features, so that people can get transparency into what is happening underneath. Someone noted that this is almost 10x better than Flash Fill.
But the most interesting YouTube video that I saw was one where people realized that this is not based on ChatGPT, because it works so fast. So if you want to read a paper that is not about LLMs or machine learning but is still relevant today, I would recommend this one. Now, one question that you can ask is: how about using LLMs for these simple tasks? In fact, I gave an entire keynote on this topic a couple of years ago where I pitted Flash Fill against GPT-4, and this is what we identified.
Consider the task of extracting the first three characters — a task as simple as that — and if you give it to GPT-3, it does it easily. But what if you have to extract four characters? It can't do this even with four examples, because its bias toward the standard abbreviations is that strong. But what about GPT-4? It's a beast, to say the least, compared to GPT-3: it can do this task with two examples. But what about extracting the first five characters? It could not learn how to do this even with four examples. And then I made a prediction about GPT-5, which I haven't gotten to test yet.
But the interesting thing here is to ponder how examples are different from natural language in terms of their capability to describe intent, and why LLMs find them hard. What we observe with these LLMs is that the code they generate is often approximately correct. But when you have examples as a specification, you want to be able to execute those examples to check whether the code is exactly correct or not, and unless the LLM understands the execution semantics in great detail, it won't be able to do this. So I will actually show you a small trick for how you can make the LLM understand the execution semantics. Another thing to observe here is that programming by examples is really a search problem: you're trying to figure out the common logic that is consistent across the various examples. It's not a translation problem, as in natural language translation. But again, the trick is how you can make the LLM do this search — and the key is that examples can be validated. That's the beauty of this form of specification.
So let's take this task. My goal is to extract the last name from each entry, and if there is a middle name, I want to extract that as well. If you give four examples to the LLM, it still does not get it right: it produces a program, but it does not learn how to handle the case when there's a middle name.
So what can you do? You can run this program on the examples, figure out where it fails, repeat that loop with the LLM, and then just pray that it will converge. But you can do something in a much more controlled fashion. Here's the idea. You let the LLM generate the first statement, but then take the control back. You execute this statement on the various inputs that you have and show the outputs — the effect of this statement — to the LLM. Now you let the LLM decode and generate the next statement. Again, you take the control back, execute this statement on the inputs, write down the outputs, and you build up this trace — and then, lo and behold, the LLM actually gets it right. It's one of the interesting techniques you can use.
And instead of asking the LLM to give you one choice for the program statement at each step, you can ask it to give multiple choices, like two or three, do some variable renaming, and put all these choices down in the program. Now what we have really done is to trick the LLM into doing a breadth-first search. Relatively simple, but quite a powerful idea.
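Here is a minimal sketch of that loop. The `propose_next` callable stands in for the LLM; everything else — executing each partial program on the real example inputs and feeding the intermediate values back — is ordinary Python. It is greedy with a few candidates per step; a fuller version would keep several partial programs alive, which is the breadth-first search just described.

```python
def run_partial(statements, inp):
    """Execute straight-line Python statements that read `inp` and set `out`."""
    env = {"inp": inp, "out": None}
    exec("\n".join(statements), {}, env)
    return env["out"]

def synthesize_stepwise(examples, propose_next, max_steps=6, beam=3):
    """examples: list of (input, expected_output) pairs.
    propose_next(partial, traces, n): stand-in for the LLM call that returns
    up to n candidate next statements, given the program so far and the
    concrete intermediate values observed on each example input."""
    partial = []
    for _ in range(max_steps):
        traces = [run_partial(partial, x) for x, _ in examples] if partial else []
        for stmt in propose_next(partial, traces, n=beam):
            trial = partial + [stmt]
            try:
                outs = [run_partial(trial, x) for x, _ in examples]
            except Exception:
                continue                      # discard statements that crash
            if outs == [y for _, y in examples]:
                return trial                  # validated on all examples
            partial = trial                   # accept and keep growing
            break
    return None
```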
Now let's look at another application domain where examples are a natural way to express intent. This is an assignment taken from a data science class, where we have a custom text file and the instructor requires the students to extract a structured table out of it; the instructor provides the students a script to build on top of. But now let's see how the programming-by-example experience makes this much easier. So I loaded the same custom text file into my playground, and all that I have to do is give examples of the fields that I want to extract. Let's say I want to extract the championship: I give one example, and then another one, and you can see it immediately learns the parsing logic to extract more such instances from this custom text file. Suppose I want to extract another field. Now I give one example, and the tool understands my intent from that one example.
The more I work with this document, the more structure I'm imposing on it. Now let's say I want to extract a new field, the score. When I give one example, the tool ends up learning a program which does something wrong on the third record, because it's in a slightly different format. Now, what if this record was not in my view? I have tens of thousands of records, and this record was somewhere in the middle. It would not be easy to figure out what went wrong. In fact, if you are programming this task yourself, or even using an LLM to program it, all bets are off: you will not be able to easily figure out which of the tens of thousands of records went wrong.
But in the case of programming by examples, you can do something very interesting. Recall that we don't just have one program that we generate from the few examples — in fact, we generate a family of programs, each of which is consistent with the examples that the user has provided. Now what we can do is take multiple top-ranked programs, run them in parallel on this input, and watch out for the places where they differ in their execution. If you do this in this scenario, you will find that the top two ranked programs give different outputs on this third record. You can surface that to the user and ask which output they want, or they can specify their own output, and when the user specifies the right output, the system converges to the right program. This idea is called distinguishing inputs: essentially, helping users provide the right examples for the many corner cases that might be there in their data.
In fact, this paper also received a most influential paper award. Again, a simple but very powerful idea, and you can discover such corner cases using other techniques as well.
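A minimal sketch of the distinguishing-inputs idea, assuming a ranked list of candidate programs like the ones produced earlier (each exposing a `run` method):

```python
def find_distinguishing_input(ranked_programs, unlabeled_inputs, top_k=2):
    """Run the top-k candidate programs on every remaining input and return the
    first input on which they disagree, so the user can be asked for just that
    one extra example."""
    top = ranked_programs[:top_k]
    for inp in unlabeled_inputs:
        outputs = []
        for p in top:
            try:
                outputs.append(p.run(inp))
            except Exception:
                outputs.append(None)          # a crash also counts as disagreement
        if len(set(outputs)) > 1:
            return inp, outputs               # surface this record to the user
    return None, None                         # the candidates agree everywhere
```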
So this was one form of intent that the user can specify which is different from natural language. Let me show you another form of intent. We call this temporal context. Suppose I am writing some code and I have written this much. Now I can let Copilot take over and it will do the completion for me — this used to be quite impressive earlier, but now it's routine. But now let's say I have this part of the code. Can you predict what edit I want to do in this part of the code? Probably not.
But what if you saw what I was doing a few minutes ago? I had this other part of the code base where I went and changed one of the expressions that converts Fahrenheit to centigrade, replacing it with a call to a method, and I did a related change at another place. Now you should be able to guess what I want to do in the snippet at the bottom, which is to make a similar change here as well. You can actually liken this to Flash Fill for code, where the user gives examples of the code fragments that they want to transform, and then we learn a program over edits that can do these kinds of code transformations.
So here is the user experience that we initially designed for this. The user copy-pastes all the code fragments that they want to transform into the top-left box, then they give an example of the transformation in the bottom two boxes, and then they press the magic button and get all the transformed code. But there's a big problem with this user experience — I guarantee you that this is not what we shipped. It requires a lot of switching between windows and a lot of copy-paste, and it leads to what is called the late awareness problem: when the user starts doing such a task, it might not even occur to them to use this tool, and by then they're already in the middle of the task.
So this is the user experience that our team came up with after rethinking it. Let the developer edit the code as usual, but now we're going to watch every single action that the developer is taking, and within this noisy sequence of keystrokes we are going to figure out some repetitive examples of the edits that the user likely wants to do. Once we can infer that, we learn a program that can do related edits in other parts of the code base, and we proactively provide these suggestions to the user.
Now, this user experience looks a little bit like Clippy, but we don't want it to become like that. If the user goes and changes two integers to strings, we don't want to recommend that the user change every occurrence of integer. So we leverage many additional signals from the user as well, including the cursor location. And then we did many more iterations: we added a richer class of patterns — not just repeated edits but also associated edits — and we innovated on the user experience as well. The result was this feature called IntelliCode suggestions in Visual Studio, which became very popular; developers love it. It allows them to edit code very fast with far fewer keystrokes, and this slide attempts to capture the impact of this feature.
You can see the usage numbers and the latest tweets. But the most heartening part of this picture is not the statistics. It's the story of people whose lives were truly changed because of this. Let me show you a video that I came across. >> One in four computer users will develop a chronic overuse injury. Having one hand just means I ran into it twice as fast. Suddenly, every click sent pins and needles shooting up my arm. From the minute I woke up to the moment I fell asleep — if I could fall asleep — the pain would not stop. Every doctor I saw agreed that the root of the problem was the thing I loved the most: my computer. Computer use requires a lot of repetitive gestures, far more than any one part of the human body is designed to do long term. A paraplegic who uses voice control will develop vocal strain. We've all probably met someone in our field whose hands or lower back hurt at the end of the workday — or we are that someone, or we will be in 50 years. I needed a way to reduce those painful gestures. I'm going to talk about two tools that I honestly use literally every day that help me do just that. The first you might already be a little familiar with, and that's IntelliCode for Visual Studio. >> That's an irresistible story, and this is what inspires me to do my job.
Okay, now we move on to the second part of the talk: a different dimension that can improve AI reasoning, that of interaction. What we'll notice here is that the AI can interact with the user in many different forms. But what is also very important is to evaluate how good a given interaction is, so that we can improve upon it.
Last year, we wanted to add Copilot support for debugging. It's one of the most painful things that developers do — the place where they spend the most amount of their time. So now, when you get a runtime exception, you'll see this Ask Copilot button, and if you click it, Copilot tries to explain why this exception arose and tries to suggest a fix for it. So how did we develop this feature? Well, we tried prompt engineering first: different kinds of prompt instructions, different kinds of debugging information that you can stuff into the prompt.
So how well did it work? We got only 25% success — not very well. So what really went wrong? What we observed was that Copilot was not following the right debugging process. When it was not confident about what the error was, it should do some investigation and try to figure out the root cause; but instead of figuring that out, it was leaping back to the user with a proposed fix: "this might be happening." And the second challenge was that it was too eager to close the conversation with the user, saying, you know, you can handle this case yourself now. So this leaves the burden of delegation and of reopening the conversation on the user, and that's a challenge.
We solved these issues with a very interesting agentic pattern. But before I go into that, let's step back a bit. What do these LLMs do today? You ask them a question Q, and they will come back with an answer A even when they are not confident — and they hallucinate. Why do they come back with an answer A? Because that's what we have trained them for: it's called instruction following. I give you an instruction, you get back to me with an answer.
It turns out that we can try to fix this behavior by using a very simple trick: you prompt the LLM that if it does not know the answer to a given question, then it should do some investigation instead of trying to hallucinate the answer. So in this particular feature, the prompt for our agent said: when the user gives you a bug to investigate, first decide whether more investigation is needed to come up with an answer. If the answer is no, then you respond back to the user and tell them what was wrong. But if more investigation is needed, then ask whether the agent can do that investigation on its own, using the context and the tools that it has at hand — and if so, go ahead and do it; but if user input is needed, then get back to the user with a clarifying question.
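Here is a rough sketch of what such an investigate-and-respond loop can look like. The prompt text, the `llm` callable, and the tool names are all illustrative stand-ins, not the actual Copilot implementation.

```python
INVESTIGATE_AND_RESPOND = """
You are a debugging assistant. When the user reports a bug or an exception:
1. Decide whether you already have enough evidence to explain the root cause.
2. If not, investigate first: read the stack trace, open the relevant code,
   run the failing test, check variable values. Do NOT guess a fix.
3. If the investigation needs information only the user has (repro steps,
   expected behavior, environment), ask ONE concrete clarifying question.
4. Only once the root cause is established, explain it and propose a fix.
"""

def debug_agent(issue, llm, tools, max_turns=10):
    """llm(history) returns a dict like {"action": ..., "content": ..., "args": ...};
    tools maps action names to callables. Both are hypothetical stand-ins."""
    history = [("system", INVESTIGATE_AND_RESPOND), ("user", issue)]
    for _ in range(max_turns):
        step = llm(history)
        if step["action"] == "answer":                 # confident: root cause + fix
            return step["content"]
        if step["action"] == "ask_user":               # needs the user's knowledge
            history.append(("assistant", step["content"]))
            history.append(("user", input(step["content"] + " ")))
        else:                                          # investigate with a tool
            result = tools[step["action"]](**step.get("args", {}))
            history.append(("tool", str(result)))
    return "Could not establish a root cause within the turn budget."
```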
So how did this simple agentic pattern work? Let's look at this Gantt chart, which shows where the developers spend their time using Copilot for debugging. On the left you see the experience with the old Copilot chat, and on the right is the experience when Copilot chat is extended with this investigate-and-respond agent. On the right side you see more solid bars, which means the user was working more with Copilot, and you also notice that there's a lot of blue before green. Blue means that the agent was investigating, doing a root cause analysis, and green means it was in the phase of suggesting a fix to the user — which is the right debugging workflow. And our success rate went up to 88%. Of all the products that I have ever worked on, we are usually happy with a 10 to 25% improvement; this is the only one where I have seen the success rate jump so significantly.
But you know what was most heartening for me this time around? The second of these quotes from the users in the study. The user said that by using this, I was not only able to do my task, but I learned about the right ways of debugging code, and the next time I spot this kind of error, I feel empowered to do these things on my own. That's what good tools should do: empower users.
Now, this study was done more than a year ago, and since then models have evolved and reasoning models have emerged. So we set out to study what changed, and we did another recent study. This time we got 20 developers and gave them one hour to try the agent of their choice within Cursor. We gave them real open GitHub issues — something that they were familiar with and were really motivated to fix. We observed that they were around 50% satisfied, or 50% successful, at what they were trying to do.
But the story gets really interesting if you look under the hood. We observed that users were using two different kinds of patterns when they were working with the Cursor agent. One pattern, which we call the slingshot pattern: they give the entire description to the agent in one shot at the beginning, hoping it will do it all. In a small number of cases it does, and then it's magical; but when it does not, it's very difficult to understand what went wrong, and here we're underutilizing the user's potential. And then, when these users hit a wall, they switch to a completely different strategy.
In fact, even with the slingshot strategy, the interesting challenge is that the model is very confident in whatever it suggests, even in all the cases where it goes wrong. So users hit this wall and move to a different strategy, which we call the staircase strategy, where they themselves divide the problem into small subtasks and then work with the model to accomplish each of those subtasks, and this leads to a much smoother convergence. But here we are underutilizing the potential of the agent, because the agent can also help you with planning and can also do more sophisticated tasks. So we need an approach that is somewhere in the middle.
On one hand, the agent is actually quite powerful — much more powerful than just being able to do a small subtask. On the other hand, the user has a lot of tacit knowledge which only comes from lived experience: their knowledge of the local conventions or the undocumented assumptions. And the way to bring these together is for the agent to know what is the right question to ask of the user at the right time. This is exactly the pattern that I showed you before, the investigate-and-respond pattern. It turns out that this behavior still has not emerged inside these models; it needs to be explicitly programmed as part of the agent.
We learned another interesting insight in this user study. Cursor allows people to backtrack, but many developers did not use that feature, because they fear that they might lose their progress. It is not clear whether to prefer backtracking or to continue to iterate with the agent when it is not converging onto a solution, and most developers prefer the latter even when it is inefficient. So what would be a safe backtracking option for the user — something that will allow them to safely explore different alternatives, for example by making it easy to fork conversations? That would unlock the power of human-AI collaboration in a much more powerful way.
Now let's look at these two conversations. On the left is a conversation with the old Copilot chat, which does not follow good conversation principles: you see that it is making many assumptions, not doing any clarification with the user, and it fails to provide the right answer. On the right, in contrast, you see an agent that is very curious to investigate more, asks the right questions, and helps converge to the right answer. So the question to ask here is: how can we ensure that models are more likely to have good conversations like the one that you see on the right? How can we program these models, these agents, so that they carry out good conversations like that one?
But before you can influence the behavior, you need to be able to measure it first. And before you can measure, you need to define what constitutes a good conversation. So what is a good conversation? To answer this question, we looked into the social sciences literature, and here we found wisdom from the philosopher Paul Grice, who suggested these four maxims for effective communication between human beings. Quantity: you should say neither too little nor too much. Quality: whatever you say should be truthful; provide evidence for your answers. Relevance: stay on track, relevant to the user's goals. And manner: be clear, without using jargon.
But the catch is that these maxims need to be interpreted in the context of the underlying domain. So what we need are rubrics that reflect these maxims but for the kind of task that the user cares about — are they doing bug fixing, or are they doing data analysis? This is what the rubrics need to be able to look at. We generated these rubrics using a semi-structured, semi-supervised approach. We start from labeled conversations with a thumbs up or a thumbs down, and then we extract detailed reasonings that explain why a conversation is successful or not. This reasoning is extracted using a prompt, and as you read through this prompt, you will notice two interesting aspects. One is the information about the underlying domain — in this case, coding conversations. The second is assigning the blame responsibility in a fair manner: the AI can be at fault, but the user can also be at fault, so we need to take that into account. Now, once we have these detailed reasonings, we use the Gricean maxims to convert these reasonings into the right rubrics — the right form of the Gricean maxims, specialized to the underlying domain.
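To make the pipeline concrete, here is a hedged sketch. The `llm` callable is a stand-in for a model call, and the example rubric texts are my own invented illustrations of what a domain-specialized Gricean rubric might look like for coding conversations.

```python
GRICEAN_MAXIMS = ["quantity", "quality", "relevance", "manner"]

# Invented examples of domain-specialized rubric questions.
EXAMPLE_RUBRICS = {
    "quantity":  "Did the assistant give just enough detail about the root cause, "
                 "without dumping whole files back at the user?",
    "quality":   "Were claims about the code backed by evidence such as stack "
                 "traces, variable values, or executed snippets?",
    "relevance": "Did every turn move toward the user's stated goal?",
    "manner":    "Were explanations clear and free of unexplained jargon?",
}

def extract_reasoning(conversation, label, domain, llm):
    """Step 1: turn a thumbs-up/down label into a detailed explanation,
    attributing blame fairly to the assistant or the user."""
    prompt = (f"Domain: {domain}\nLabel: {label}\n\n{conversation}\n\n"
              "Explain in detail why this conversation deserved this label. "
              "Attribute blame fairly: the assistant OR the user may be at fault.")
    return llm(prompt)

def reasonings_to_rubrics(reasonings, domain, llm):
    """Step 2: distill the explanations into one yes/no rubric question per
    Gricean maxim, specialized to the domain."""
    prompt = (f"Domain: {domain}\nMaxims: {GRICEAN_MAXIMS}\n\n"
              + "\n---\n".join(reasonings)
              + "\n\nWrite one rubric question per maxim grounded in these explanations.")
    return llm(prompt)
```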
So how well do these rubrics work? What we observed was that, compared to a rubric set generated by another state-of-the-art technique — which is not domain-sensitive and does not use these maxims — our rubric set is much more sharply able to separate the positive conversations from the negative conversations: a separation score of 27 versus 3, much sharper, and the precision also goes up from 58%.
So now we understand how to evaluate the quality of a given conversation. But let's think about what it would take to evaluate the quality of an agent with some configuration. For that, we will need to be able to generate lots of these kinds of conversations automatically. In our case, the domain that we were interested in was open-ended tasks for data analysis — tasks that require multiple joins. So first we needed to build a data set of such tasks. What we did was go to sources such as Tidy Tuesday, public notebooks, even research papers, and from there we tried to extract tasks like this: we figured out what the right data set was, some meaningful queries over those data sets, and also the expected answers.
Now, once we have a collection of such tasks, let's see how an agent would work on any one of them. The orange agent might just give the answer in this case, saying that there's no correlation between the budget size and the ratings of low-budget horror films, under some assumption. That's a possibility. But the green agent can come back with a question: where is the rating located? And the user might clarify: the rating is located in the review rating column. Or the yellow agent might say: what should I do with the missing entries in the budget column? And the user might say: filter out those entries. All of these are reasonable interactions, but this is what we want to automate so that we can evaluate the quality of an agent — we want to be able to automate this user input. So we need to build a user proxy. And what should this proxy do? It should be able to answer all the clarifying questions that the agent is asking.
Now, in the case of data analysis, you cannot really answer these clarifying questions by just looking at the output — 42, right? What does it tell you about how to derive 42 from the input? So our idea was to generate supporting code that evidences the output from the input data set. This code makes it very transparent what data cleaning choices were made and what analysis choices were made, so that you can reliably derive the right output from the input data set. And using this code, the proxy is now able to answer the questions that the agent asks.
Another interesting challenge here is who the proxy should simulate. Excel has a huge, diverse user base with different expertise and needs. A college student might prefer a bar chart or a pie chart and may not know much about formulas; on the other hand, a business analyst might want beautiful dashboards with full automation. So the idea was to use persona-specific prompting recipes so that the user proxy can act like a specific persona.
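A simplified sketch of such a user proxy (my own illustration): it holds the task, the ground-truth supporting code, and a persona description, and answers the agent's clarifying questions only with information those contain.

```python
from dataclasses import dataclass

@dataclass
class UserProxy:
    """Simulates the user when evaluating a data-analysis agent automatically."""
    task: str                # the open-ended question given to the agent
    supporting_code: str     # code that derives the expected answer from the data
    persona: str             # e.g. "student who prefers charts, avoids formulas"

    def answer(self, clarifying_question: str, llm) -> str:
        # `llm` is a stand-in for a model call. Grounding the proxy in the
        # supporting code keeps it from inventing data-cleaning choices that
        # were never made; the persona shapes the style of the answer.
        prompt = (f"You are this user: {self.persona}\n"
                  f"Your task was: {self.task}\n"
                  f"Reference analysis (do not reveal verbatim):\n{self.supporting_code}\n"
                  f"The assistant asks: {clarifying_question}\n"
                  f"Answer briefly, using only facts implied by the reference analysis.")
        return llm(prompt)
```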
And the ultimate test of a good user proxy should be that it is able to amplify the performance gap between different agents and be predictive of their performance in the real world.
Now let me talk about a very different way to handle ambiguity: not by bouncing clarifying questions back to the user, and not by generating an incorrect answer, as you see in this case. Here, the model does not know what the selected score is, but it still goes and writes score1, thinking that it is the selected score that the user wants. What we can do instead — instead of getting an incorrect formula like this, or asking questions of the user — is to return a partial response like the one that you see on the right, where we use placeholders, in this case a question mark, denoting where the ambiguity is in the user's input.
So now, how can you train a model to be able to generate such placeholders? Here is the key realization. The data that you typically use to fine-tune models actually has issues of the kind that you see at the top: it's noisy, it's not entirely correct, it's not very high quality. That's the challenge in real life, and this is what often leads the models to hallucinate. But in our case, this is a blessing, because you can use it — you can use it to actually train the model to identify such ambiguities and abstain on them instead of hallucinating. So how do we do this? Don't change the data — keep the same data. All that you do is change the loss function.
This is the loss function. The green part is the standard cross-entropy loss: it punishes wrong tokens. The blue part reduces the punishment if the model decides to output a question mark. But you don't want the model to cheat and put question marks everywhere, so you need the red regularizer. And the gray part simply controls how often the model should use a placeholder: initially we start out with strict training, with fewer placeholders, and then we relax it.
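Here is a hedged PyTorch sketch of what such a loss can look like. It is my reconstruction of the colored terms as described in the talk, not the exact formulation from the paper; `alpha`, `beta`, and the annealing schedule are assumptions.

```python
import torch
import torch.nn.functional as F

def placeholder_aware_loss(logits, targets, qmark_id, alpha=0.5, beta=0.1):
    """logits: (batch, seq, vocab); targets: (batch, seq) noisy reference tokens."""
    vocab = logits.size(-1)
    # "green": standard cross entropy, punishes tokens that differ from the
    # (possibly noisy) reference
    ce = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1),
                         reduction="none")
    # probability mass placed on the placeholder token at each position
    p_qmark = F.softmax(logits, dim=-1)[..., qmark_id].reshape(-1)
    # "blue": soften the punishment in proportion to how strongly the model
    # chose to abstain with a placeholder rather than guess a value
    softened = ce * (1.0 - alpha * p_qmark)
    # "red": regularizer so the model cannot cheat by abstaining everywhere
    overuse = beta * p_qmark
    return (softened + overuse).mean()

# "gray" knob: start strict (small alpha, larger beta) so placeholders are rare,
# then relax the schedule over training to allow more abstention.
```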
So how effectively does this work? Let's look at the confusion matrix. For the cases where we were predicting the correct formula earlier, we still predict them correctly 97.6% of the time — quite good. But look at the bottom row, the cases where the model was making the wrong prediction earlier. Now, with this new loss function, we are able to output a placeholder instead 90.8% of the time — quite significant, and it really achieves what we set out to do.
Now let's take this idea of placeholders one step further, and I will do this in the context of a very different application: that of synthesizing creative content. This is the version of the image that I used on my title slide. How did I generate it? I provided my title and abstract to a text-to-image model, and the image that you see is the output of that. But the story starts here. What the tool does next is to give us meaningful adaptations of the underlying prompt: it puts placeholders in the prompt, gives meaningful names to those parameters, and also gives me alternative choices for those parameters. In this case it included three placeholders in the prompt, and these are the parameter names it provided.
Let's look at the central-theme parameter. What you see in the picture is a branching decision tree with glowing neural links, but the model provides me other options for this central theme as well: a flowchart with arrows and icons, an abstract network of gears, a chip with glowing pathways — and let's see what they look like. Quite cool. I probably would not have been able to come up with these choices myself.
Let's look at another parameter: color scheme. Actually, I tried to change the color of this title slide when I first got this output, but I didn't find a good option. But look at the options that the model gives me. The most surprising was the parameter at the top. It's just a boolean parameter; all I can do is flip it from no to yes. But see what happens when I do: now you can see the shadow of spreadsheets and code in the picture, which is quite meaningful — in fact, my abstract does talk about applications to spreadsheets and code, and this is exactly what the model picked up on.
Okay, now I'm moving on to the last small part of my talk, which is the third pillar: that of inspection. The idea that I really want you to take away from this part is that if you do this inspection — the process of checking the output or the reasoning of a model — in structured ways, using code, in a systematic way, then you can get much more robust results. I'll give you a couple of examples.
Let's consider the problem of extracting tables from PDF documents and different kinds of reports. When you do copy-paste, or when you use OCR, it flattens the structure and you get a lot of mistakes as a result, and people spend a huge amount of time trying to fix these mistakes. If you use an LLM today, many of the mistakes still persist — in fact, it can even hallucinate values, which is quite bad. So here's the neuro-symbolic solution we designed, which works more effectively.
We start with table generation from an LLM and we get an initial table. Then we run some sanity checks on it — symbolic checks that search for different kinds of syntactic and semantic issues — and we pass those over to a critic, which generates nice, actionable feedback, and we repeat the process until we get something more satisfying. Let's see how that works with an example. We copy-pasted data from a PDF and put it into the prompt of the LLM, and what we get is the table that you see on the top left, the yellow one, and it has issues: rows that have been merged, misaligned cells, and so on. Now we run the symbolic checks, and they identify the issues that you see in the blue box. These issues are fed into the critic, which generates real actionable feedback for the original LLM to retry, and the retry leads to the nice green table that you see.
So what do these checks do? There are two different aspects that they look at. They check whether all the data in a given column matches some entity type or some underlying regular pattern — are there any special characters that are used inconsistently? They also check whether every piece of content in the table can be traced back to the source text; if not, it means some hallucination. And the other way around: how much of the input text is actually preserved in the table?
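A minimal sketch of such symbolic checks (illustrative only; the real checks are richer), producing human-readable issues that can be handed to the critic as feedback:

```python
import re

def check_table(table, source_text):
    """table: list of rows (lists of cell strings). Returns a list of issues."""
    issues = []
    # 1. Column consistency: cells in a column should share a common shape.
    for j, col in enumerate(zip(*table)):
        numeric = sum(bool(re.fullmatch(r"-?[\d,.]+%?", c.strip())) for c in col)
        if 0 < numeric < len(col):
            issues.append(f"Column {j} mixes numeric and non-numeric values.")
    # 2. Faithfulness: every cell should be traceable back to the source text.
    for i, row in enumerate(table):
        for cell in row:
            if cell.strip() and cell.strip() not in source_text:
                issues.append(f"Row {i}: value '{cell}' not found in the source "
                              f"(possible hallucination).")
    # 3. Coverage: how much of the source made it into the table at all.
    covered = sum(len(c) for row in table for c in row if c.strip() in source_text)
    issues.append(f"Coverage: roughly {100 * covered // max(1, len(source_text))}% "
                  f"of source characters appear in the table.")
    return issues
```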
In one of our benchmarks, we observed almost a 15% lift when we give this kind of feedback in structured ways using code, as opposed to just asking another LLM to look at the output. And that's the point that I'm trying to make: whenever you can do things in a systematic way using code, you should prefer that — it will lead to more robust results.
Now, the final example in my talk is about a different kind of feedback that you can do on a model's output — in this case, a model's reasoning trace — and for a different purpose. What we did was take lots of different reasoning traces and parse these traces into their constituent elements: a plan, an action that the model is taking, an observation, an effect. Then we try to do some analysis over them: what did the model try to do, was it really successful in doing that or not, and if not, we do some root cause analysis to figure out what might have been the reason for the failure or the success.
And here's the interesting part. We can actually draw insights from these traces along many dimensions. These insights can be abstract or more concrete; they can focus on smaller segments of the trajectory or on the entire end-to-end trajectory; and they can be about positive observations or negative observations of the trajectory. Then you can experiment with incorporating different combinations of these insights into your prompts so that the behavior of the model becomes more robust.
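A rough sketch of the data involved (my own illustration of the parsed trajectory elements and the insight dimensions just described):

```python
from dataclasses import dataclass

@dataclass
class Step:
    plan: str          # what the model said it would do
    action: str        # the tool call or edit it actually made
    observation: str   # what came back
    effect: str        # whether it moved the task forward

@dataclass
class Insight:
    text: str
    scope: str         # "segment" or "end_to_end"
    polarity: str      # "positive" or "negative"
    abstract: bool     # general lesson vs. concrete, task-specific note

def select_insights(insights, scope="end_to_end", polarity="negative"):
    """Pick one combination of insights (e.g. only negative, end-to-end ones)
    to fold back into the prompt."""
    return [i.text for i in insights
            if i.scope == scope and i.polarity == polarity]
```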
On one data set, we observed that if you encode only the negative end-to-end insights, you get better precision than if you encode all end-to-end insights including the positive ones — the positive behavior apparently did not need to be reinforced, but it was important to point out to the model what went wrong — and the precision went up from 75%. This is very much ongoing, very recent work, but I still wanted to share it.
So let me conclude now. The first pillar that I talked about for improving AI reasoning was to leverage intent, which can be much richer than just natural language, especially when you have input-output examples. It's not just that you can use the examples to validate the output of the model — you can actually use them to force correct generation in the first place, as I showed you earlier.
Another interesting place where a lot of user intent is hidden is temporal context: the actions that the user has been doing in the recent past. In fact, this idea that I showed you for IntelliCode suggestions is being picked up by many different products, including Cursor and Excel Copilot, which all look at what has been happening in the recent past to be able to make better suggestions for the user.
The second dimension that I talked about was interaction. I showed different ways of interacting with the user: asking questions, responding through a sketch, and suggesting different choices for a parameter in the sketch. And the interesting part here was to leverage communication principles from social science, and also how we evaluate this kind of interaction.
But one subtle thing that I did not go into in much detail is what the interaction is in service of. We often think of productivity use cases — which is probably what you have mostly seen here — where the user knows how to do the task but AI just makes them much more effective. But sometimes the use case is learning: the goal is for the user to learn about new things and to deepen their understanding, and we saw a glimpse of that in the debugging feature for Visual Studio, where the developers felt that it was also a learning experience for them, not just productivity. And then another fundamental use case is creativity, where neither the user nor the model knows the final destination — you're exploring together.
And the third pillar was that of inspection: you can systematically inspect model outputs and the reasoning traces they produce, not only to do iterative refinement for that particular task instance but also to do prompt updates. The idea that I really wanted to leave you with was to do this in a systematic manner, using code where possible. And there is a fourth dimension, which you probably saw on the last slide, which can make all of this much better.
So let me conclude with another thought. In the movie 50 First Dates, the character played by Drew Barrymore doesn't have short-term memory: every morning she wakes up and forgets what happened, again and again, and the hero, played by Adam Sandler, has to win her heart all over again. Now, this is sweet and funny in a romantic comedy, but not so much in human-AI interaction. Without memory, every interaction is like a first date: you have to teach it the same preferences, and correct the same kind of mistakes, over and over again. But memory changes the game, with the possibility to turn every interaction into an evolving relationship. There is a very active area of research going on here, and huge opportunities: how to represent memory, how to update it, and how to do that in a safe manner, with the user having full control of it. So these are the four pillars that I wanted to inspire you with: intent, interaction, inspection, and memory. Together, they will convert AI from a power tool that reacts into a collaborator that evolves. Thank you.
>> So before we close — thank you very much, that was very inspiring. I have a quick question in terms of how the agent and human interaction can evolve, based on the work that you described today. From what I observed today, guiding the agent down the right path with human guidance leads to better results. However, the way I look at it is that the search space exploration can be deep. How far are we from agents that can do that search space exploration themselves, and is an LLM the right seed for that, or is there another kind of model that we should look at?
I'm not able to hear you clearly, but let me try to repeat your question in part — and you can probably turn your mic up a little, it might help. >> Is that better? >> Yeah, this is much better. So, about this interaction between human and AI: how can we use it to ensure that the interaction gets better in the future — the guidance that was provided by the human, saying "now let's go do this," or, for example, if you build a better prompt or figure out a better approach and then integrate that into the human-agent interaction. I'm trying to figure out how far the agents are from doing that — figuring out that guidance by themselves — and is an LLM the right seed for it?
So, how far are LLMs from being able to learn from these past interactions, and are they the right tool? I think it's a very interesting question. Today, when we think about, let's say, fine-tuning an LLM for a particular domain — that's one of the tools that we often use: if the LLM is not working very well for the task that you care about, for the domain that you care about, then we talk about fine-tuning. We focus our entire energy on collecting lots of data sets and doing evaluation, so that we can teach the LLM to do the sophisticated task in that domain well enough in one shot. And that's the opportunity.
However powerful these LLMs become, even if they're able to do tasks in one shot, the aspirations of humans will keep rising. So we always want to think of collaborative environments where, jointly, they can accomplish much more, and this needs to be reflected as part of fine-tuning. This is what we are not doing today. What we are doing today is: if the model is not working very well for your task at hand, for your domain at hand, you try to collect a lot of data and fine-tune it. But we need to make this interaction part of the fine-tuning, which is something that I have not yet seen happening as much. This will probably start to happen once we actually get that kind of data. Today, the LLMs are trained on all the data that exists in the world, and as people try to use them, interaction data is starting to accumulate. Once we collect more of this kind of data, we will have the right data to fine-tune the LLMs on those interactions, and then they will have this inherent capability. Until then, we will have to use external mechanisms like memory to be able to do that. And the second thing is that personalization is an aspect which cannot be baked into the model itself; it has to be handled outside, at the level of each specific user or of a specific team, because different people might have different preferences for exactly the same kind of task.
Hello. Thank you so much for the talk, that was really interesting. I had a question. We have a similar problem to some of the examples you showed, where we are extracting text from tables — the other way around — and we are fine-tuning the model with a lot of information, but we still sometimes get values extracted incorrectly, or hallucinated values. I was wondering — a lot of the talks in this conference talked about critic agents as well as using reinforcement learning — if you've come across any strategies that could help with this. >> So let me repeat your question.
In the task of converting text into tables, or tables into text — transforming data from one format to another — there are many techniques being developed to improve the accuracy of these kinds of tasks. What is some value that I can add on top of that? I can probably use the pillar of interaction to talk about how we can make more advances here. As I said, these models will continue to improve in their power, but for them to be able to do more and more sophisticated tasks that exceed their capability, you have to fall back onto interaction. So what can be some good interaction mechanisms in this task of, let's say, data conversion from one format to another? One thing that we recently started exploring
was to look at multiple models, multiple systems that might be of comparable power, run them both on the same input task, and then see on which parts of the output they provide a different answer — and use that to light up a heat map for the user, so that they can pay attention to those parts where there is less confidence or more disagreement. I think building systems like these would actually be very helpful. Another thing that you can do is, let's say, if you want to generate a table and that table looks completely wrong because everything got shifted by one, then maybe you can start to build some edit tools — maybe these edit tools can be driven by the LLM itself — that can fix the kinds of errors that actually happen. So also focus on building good interaction experiences: (a) for the user to help spot the discrepancy, again working with the AI, and (b) for how you can quickly fix those. Those are the two pillars that I would offer in the vein of interaction to make these systems better. >> Thank you.
https://ai-reasoning.github.io/

AI models are increasingly capable of solving sophisticated tasks that require reasoning. But how do we improve the quality of that reasoning, especially when the models operate as black boxes? In this talk, Sumit Gulwani shares practical strategies for improving AI reasoning in the domain of code and structured tasks. A first idea is to capture richer forms of user intent. Input-output examples not only enable post-hoc validation, but also guide the model toward correct generations up front. Temporal context (such as recent user actions) can help infer evolving intent and keep users in flow. Secondly, we can give the model an escape mechanism, allowing it to abstain or initiate collaborative interaction when it lacks sufficient information. This raises new challenges in evaluating interactive workflows, which we address through rubric-based assessments of conversation quality (grounded in principles like the Gricean maxims) and automation using simulated user proxies. Finally, we can strengthen reasoning via automated inspection. Symbolic checkers or programmatic validators can uncover hallucinations and inconsistencies in both online and offline settings. These signals can then guide the model through iterative refinement or prompt updates. Sumit illustrates these ideas through real-world applications spanning spreadsheet tasks and software development, highlighting how AI reasoning can be improved using structured intent, collaborative interaction, and systematic inspection.