Back in 1995, when some of you were probably not even born, I was starting college and I had two choices: study electrical engineering at IIT Kanpur, which was close to my hometown, or study computer science at a faraway college. I really wanted to study computer science, but even more than that I wanted to stay closer to my parents. So the choice was made: I took electrical engineering.
Now, in the first year, in the intro programming course, an instructor told us that you can multiply two 2x2 matrices using seven multiplications instead of eight.
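(For readers curious about the trick the instructor was alluding to: it is Strassen's construction, which trades one multiplication for extra additions. Here is a small Python sketch of it; the algorithm is standard, the code itself is my illustration.)

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications (Strassen)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    # the seven products
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # recombined using only additions and subtractions
    return ((m1 + m4 - m5 + m7, m3 + m5),
            (m2 + m4, m1 - m2 + m3 + m6))
```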
That was very intriguing. I wondered: how can you do that? At that time there was no Google, much less ChatGPT. So I went back to the instructor and asked how you do this, and his response was that this is premium material, only taught to the computer science students. That upset me enough that I rewrote the joint entrance examination and secured a better rank, to be able to study computer science at the very same institute.
I followed that up with a PhD in computer science at Berkeley, and then, once I got trained as a computer scientist, one of the first problems that I decided to study at Microsoft Research was how you can automatically synthesize such complicated algorithms. It took me five years to get my revenge: by then we had a paper that could do just this. And as the title of the paper suggests, we first had to develop program verification techniques — techniques that can verify the correctness of such algorithms — and then we translated that into techniques for automatically synthesizing such programs. In fact, this paper received a most influential paper award a few years ago, and life came full circle for me when the very same institute honored me with its Distinguished Alumnus Award.
Now, I'm so glad that ChatGPT didn't exist at that time, because if it did, and if you asked it this question, it would actually give you the correct algorithm. And not just this — it can give you algorithms for many things. How about multiplying two 2x2 matrices using six multiplications, which we know is mathematically impossible? Maybe it's time now to go back to program verification, to be able to check the correctness of programs. Only when we do this will we be able to increase the reliability and robustness of the AI methods that generate these programs and algorithms, and this is also what will facilitate AI-driven discovery of new algorithms to begin with.
Now, in this talk you'll see many different techniques to reduce hallucinations. But let me ask you this: what is the simplest thing you can do to check that this program is incorrect? Exactly right — test it out on a few input-output examples. Now, these input-output examples are a very interesting form of specification, because they can be validated — in fact, the only form of specification that you can validate easily. Next, I'm going to show you some domains where input-output examples are the easiest specifications to provide, and even more natural than natural language.
So, back to 2009. I was returning from a conference, and there was a lady sitting next to me in the airport. She was very impressed to know that I have a PhD in computer science and that I work for Microsoft. So I offered to help her with the task that she was struggling with. She opens up her laptop, pulls up Excel, shows me a column of names, and asks me how she can transform these names, by giving me an example of what she wanted. Now, at that time I had no idea about the programming model underneath Excel, so I somehow escaped myself out of the situation. But after I got back home, I searched for a solution to this problem on Excel help forums. And that's when I realized that many, many people struggle with simple repetitive tasks like that of the airplane lady, and they would communicate their intent to an expert on a help forum through input-output examples. This is what inspired me to develop a feature called Flash Fill that can automate tasks like these. You give just one example of your intent, and it will generalize this example into a program and run that program to automate the task for you. The key breakthrough here was the ability to synthesize these programs in 0.1 seconds, often from just one example.
So let me show you how the overall underlying architecture works. The first idea was to restrict the search to an underlying domain-specific language. In the case of string transformations, such a language would consist of operators like regular expressions, a substring operator, concatenation, and some limited form of conditionals. So instead of searching for a program over a full programming language like Python, we search for a program over this domain-specific language.
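To make this concrete, here is a tiny, hypothetical rendering of such a string-transformation DSL in Python. It is an illustration only, not the actual Flash Fill grammar: programs concatenate constant strings and substrings whose boundaries are located by regular-expression matches.

```python
from dataclasses import dataclass
import re

@dataclass
class ConstStr:                      # a literal string
    s: str
    def run(self, inp): return self.s

@dataclass
class SubStr:                        # substring between two regex positions
    start_regex: str                 # position = end of first match
    end_regex: str                   # position = start of first match
    def run(self, inp):
        start = re.search(self.start_regex, inp).end()
        end = re.search(self.end_regex, inp).start()
        return inp[start:end]

@dataclass
class Concat:                        # concatenation of sub-programs
    parts: list
    def run(self, inp): return "".join(p.run(inp) for p in self.parts)

# Example program in this DSL: "Sumit Gulwani" -> "Gulwani, Sumit"
prog = Concat([SubStr(r"\s", r"$"), ConstStr(", "), SubStr(r"^", r"\s")])
```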
Now, one idea to search for programs over a domain-specific language would be to enumerate all programs in order of increasing size and then check which of these is consistent with the examples. But this is not going to scale, because even though this DSL is relatively small compared to Python, it is still infinite and has lots of constructs.
So the second key idea was to do a goal-directed search instead of an enumerative search, where we prune many parts of the search space based on logical reasoning. Essentially, what you do is take the input-output examples and back-propagate them through the structure of the grammar, using inverses for the various operators that you have in the grammar. This symbolic back-propagation yields multiple goals at each substep, and these goals can then be explored in the order of their likelihood to succeed. We use machine learning techniques to predict which goal to explore first, and sometimes this provides a 10x improvement in speed.
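As a rough illustration of what backpropagating examples through the grammar means (my own simplified sketch over the toy DSL above): to satisfy an example for a two-part Concat, every way of splitting the desired output into a prefix and a suffix becomes a pair of subgoals, one per part, and a learned ranker decides which pairs to explore first.

```python
def concat_subgoals(output: str):
    """Inverse semantics of a two-part Concat: each split of the desired
    output yields one candidate pair of subgoals (prefix goal, suffix goal)."""
    return [(output[:i], output[i:]) for i in range(len(output) + 1)]

# The example "Sumit Gulwani" -> "Gulwani, Sumit" produces subgoal pairs such as
# ("Gulwani", ", Sumit"); each side is then solved recursively, and a learned
# ranker orders the pairs by their likelihood of success.
print(concat_subgoals("Gulwani, Sumit")[:3])
```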
Now, at the end of this process you have many programs which are consistent with the small number of input-output examples that the user has provided, and we need to rank these programs in order to pick the one that the user likely intended. One of the important ways to do this ranking is to look at program features: we prefer programs that are small and that use few constants. Another interesting idea that we discovered was to actually run these programs on the remaining inputs in the spreadsheet and see which program generates outputs that are more uniform — that program is more likely to be the intended one.
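Here is a hedged sketch of such a ranking function over the toy DSL from earlier. The features and weights are made up for illustration; the point is the combination of structural preferences with an output-uniformity signal computed on the unlabeled rows.

```python
def program_size(p):
    """Number of operator nodes in a program built from the toy DSL above."""
    return 1 + sum(program_size(c) for c in getattr(p, "parts", []))

def num_constants(p):
    return (1 if type(p).__name__ == "ConstStr" else 0) + \
           sum(num_constants(c) for c in getattr(p, "parts", []))

def rank_score(program, remaining_inputs):
    """Lower is better: prefer small programs with few constants whose outputs
    on the rest of the spreadsheet look uniform (here: few distinct lengths)."""
    outputs = [program.run(x) for x in remaining_inputs]
    uniformity_penalty = len({len(o) for o in outputs})
    return program_size(program) + 2 * num_constants(program) + uniformity_penalty
```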
So this is almost 15-year-old work, but it still stands the test of time. The idea of using domain-specific languages to restrict the output is very relevant in the days of LLMs as well: last year OpenAI released an API where you can enforce that the output actually belongs to some structure that you specify. And this overall template of generating lots of candidates and then ranking them using tools that are not available during generation time is also a very popular idea with the use of LLMs. Flash Fill became very popular and made it into middle school computing textbooks. But the best recognition that I ever received for this work was this quote from my late father.
Okay. So the user experience of Flash Fill is so inviting that it doesn't prevent people from trying it on tasks that it was never meant for — for example, date and number transformations. There was this tweet that went viral when someone gave an example mapping Dec to December and Flash Fill autocompleted the rest. Some people came to my rescue, claiming that this is how we should have named months in the first place. I also came across a case where someone had incorporated Flash Fill as part of a process automation pipeline. But you know, the best thing about this kind of customer feedback is that it tells you what the next set of problems to work on are. And so we extended Flash Fill to support a much more sophisticated class of transformations, including date and number transformations.
For instance, look at this task where we want to map the date in each row to the corresponding quarter. From just a couple of examples, it figures it out. But not only that — it can actually generate readable formulas in many different languages, including the Excel formula language, which was one of the most asked-for features, so that people can get transparency into what is happening underneath. Someone noted that this is almost 10x better than Flash Fill.
But the most interesting YouTube video that I saw was one where people realized that this is not based on ChatGPT, because it works so fast. So if you want to read a paper that is not about LLMs or machine learning but is still relevant today, I would recommend this one. Now, one question that you can ask is: how about using LLMs for these simple tasks? In fact, I gave an entire keynote on this topic a couple of years ago where I pitted Flash Fill against GPT-4, and this is what we identified.
Consider the task of extracting the first three characters — a task as simple as that — and if you give it to GPT-3, it does it easily. But what if you have to extract four characters? It can't do this even with four examples, because its bias toward the standard abbreviations is that strong. But what about GPT-4? It's a beast, to say the least, compared to GPT-3: it can do this task with two examples. But what about extracting the first five characters? It could not learn how to do this even with four examples. And then I made a prediction about GPT-5, which I haven't gotten to test yet.
But the interesting thing here is to ponder how examples are different from natural language in terms of their capability to describe intent, and why LLMs find them hard. What we observe with these LLMs is that the code they generate is often approximately correct. But when you have examples as a specification, you want to be able to execute those examples to check whether the code is exactly correct or not, and unless the LLM understands the execution semantics in great detail, it won't be able to do this. So I will actually show you a small trick for how you can make the LLM understand the execution semantics. Another thing to observe here is that programming by examples is really a search problem: you're trying to figure out the common logic that is consistent across the various examples. It's not a translation problem, as in natural language translation. But again, the trick is how you can make the LLM do this search — and the key is that examples can be validated. That's the beauty of this form of specification.
So let's take this task. My goal is to extract the last name from each entry, and if there is a middle name, I want to extract that as well. If you give four examples to the LLM, it still does not get it right: it produces a program, but it does not learn how to handle the case when there's a middle name.
So what can you do? You can run this program on the examples, figure out where it fails, repeat that loop with the LLM, and then just pray that it will converge. But you can do something in a much more controlled fashion. Here's the idea. You let the LLM generate the first statement, but then take the control back. You execute this statement on the various inputs that you have and show the outputs — the effect of this statement — to the LLM. Now you let the LLM decode and generate the next statement. Again, you take the control back, execute this statement on the inputs, write down the outputs, and you build up this trace — and then, lo and behold, the LLM actually gets it right. It's one of the interesting techniques you can use.
And instead of asking the LLM to give you one choice for the program statement at each step, you can ask it to give multiple choices, like two or three, do some variable renaming, and put all these choices down in the program. Now what we have really done is to trick the LLM into doing a breadth-first search. Relatively simple, but quite a powerful idea.
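Here is a minimal sketch of that loop. The `propose_next` callable stands in for the LLM; everything else — executing each partial program on the real example inputs and feeding the intermediate values back — is ordinary Python. It is greedy with a few candidates per step; a fuller version would keep several partial programs alive, which is the breadth-first search just described.

```python
def run_partial(statements, inp):
    """Execute straight-line Python statements that read `inp` and set `out`."""
    env = {"inp": inp, "out": None}
    exec("\n".join(statements), {}, env)
    return env["out"]

def synthesize_stepwise(examples, propose_next, max_steps=6, beam=3):
    """examples: list of (input, expected_output) pairs.
    propose_next(partial, traces, n): stand-in for the LLM call that returns
    up to n candidate next statements, given the program so far and the
    concrete intermediate values observed on each example input."""
    partial = []
    for _ in range(max_steps):
        traces = [run_partial(partial, x) for x, _ in examples] if partial else []
        for stmt in propose_next(partial, traces, n=beam):
            trial = partial + [stmt]
            try:
                outs = [run_partial(trial, x) for x, _ in examples]
            except Exception:
                continue                      # discard statements that crash
            if outs == [y for _, y in examples]:
                return trial                  # validated on all examples
            partial = trial                   # accept and keep growing
            break
    return None
```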
Now let's look at another application domain where examples are a natural way to express intent. This is an assignment taken from a data science class, where we have a custom text file and the instructor requires the students to extract a structured table out of it; the instructor provides the students a script to build on top of. But now let's see how the programming-by-example experience makes this much easier. So I loaded the same custom text file into my playground, and all that I have to do is give examples of the fields that I want to extract. Let's say I want to extract the championship: I give one example, and then another one, and you can see it immediately learns the parsing logic to extract more such instances from this custom text file. Suppose I want to extract another field. Now I give one example, and the tool understands my intent from that one example.
The more I work with this document, the more structure I'm imposing on it. Now let's say I want to extract a new field, the score. When I give one example, the tool ends up learning a program which does something wrong on the third record, because it's in a slightly different format. Now, what if this record was not in my view? I have tens of thousands of records, and this record was somewhere in the middle. It would not be easy to figure out what went wrong. In fact, if you are programming this task yourself, or even using an LLM to program it, all bets are off: you will not be able to easily figure out which of the tens of thousands of records went wrong.
But in the case of programming by examples, you can do something very interesting. Recall that we don't just have one program that we generate from the few examples — in fact, we generate a family of programs, each of which is consistent with the examples that the user has provided. Now what we can do is take multiple top-ranked programs, run them in parallel on this input, and watch out for the places where they differ in their execution. If you do this in this scenario, you will find that the top two ranked programs give different outputs on this third record. You can surface that to the user and ask which output they want, or they can specify their own output, and when the user specifies the right output, the system converges to the right program. This idea is called distinguishing inputs: essentially, helping users provide the right examples for the many corner cases that might be there in their data.
In fact, this paper also received a most influential paper award. Again, a simple but very powerful idea, and you can discover such corner cases using other techniques as well.
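A minimal sketch of the distinguishing-inputs idea, assuming a ranked list of candidate programs like the ones produced earlier (each exposing a `run` method):

```python
def find_distinguishing_input(ranked_programs, unlabeled_inputs, top_k=2):
    """Run the top-k candidate programs on every remaining input and return the
    first input on which they disagree, so the user can be asked for just that
    one extra example."""
    top = ranked_programs[:top_k]
    for inp in unlabeled_inputs:
        outputs = []
        for p in top:
            try:
                outputs.append(p.run(inp))
            except Exception:
                outputs.append(None)          # a crash also counts as disagreement
        if len(set(outputs)) > 1:
            return inp, outputs               # surface this record to the user
    return None, None                         # the candidates agree everywhere
```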
So this was one form of intent that the user can specify which is different from natural language. Let me show you another form of intent. We call this temporal context. Suppose I am writing some code and I have written this much. Now I can let Copilot take over and it will do the completion for me — this used to be quite impressive earlier, but now it's routine. But now let's say I have this part of the code. Can you predict what edit I want to do in this part of the code? Probably not.
But what if you saw what I was doing a few minutes ago? I had this other part of the code base where I went and changed one of the expressions that converts Fahrenheit to centigrade, replacing it with a call to a method, and I did a related change at another place. Now you should be able to guess what I want to do in the snippet at the bottom, which is to make a similar change here as well. You can actually liken this to Flash Fill for code, where the user gives examples of the code fragments that they want to transform, and then we learn a program over edits that can do these kinds of code transformations.
So here is the user experience that we initially designed for this. The user copy-pastes all the code fragments that they want to transform into the top-left box, then they give an example of the transformation in the bottom two boxes, and then they press the magic button and get all the transformed code. But there's a big problem with this user experience — I guarantee you that this is not what we shipped. It requires a lot of switching between windows and a lot of copy-paste, and it leads to what is called the late awareness problem: when the user starts doing such a task, it might not even occur to them to use this tool, and by then they're already in the middle of the task.
So this is the user experience that our team came up with after rethinking it. Let the developer edit the code as usual, but now we're going to watch every single action that the developer is taking, and within this noisy sequence of keystrokes we are going to figure out some repetitive examples of the edits that the user likely wants to do. Once we can infer that, we learn a program that can do related edits in other parts of the code base, and we proactively provide these suggestions to the user.
Now, this user experience looks a little bit like Clippy, but we don't want it to become like that. If the user goes and changes two integers to strings, we don't want to recommend that the user change every occurrence of integer. So we leverage many additional signals from the user as well, including the cursor location. And then we did many more iterations: we added a richer class of patterns — not just repeated edits but also associated edits — and we innovated on the user experience as well. The result was this feature called IntelliCode suggestions in Visual Studio, which became very popular; developers love it. It allows them to edit code very fast with far fewer keystrokes, and this slide attempts to capture the impact of this feature.
You can see the usage numbers and the latest tweets. But the most heartening part of this picture is not the statistics. It's the story of people whose lives were truly changed because of this. Let me show you a video that I came across. >> One in four computer users will develop a chronic overuse injury. Having one hand just means I ran into it twice as fast. Suddenly, every click sent pins and needles shooting up my arm. From the minute I woke up to the moment I fell asleep — if I could fall asleep — the pain would not stop. Every doctor I saw agreed that the root of the problem was the thing I loved the most: my computer. Computer use requires a lot of repetitive gestures, far more than any one part of the human body is designed to do long term. A paraplegic who uses voice control will develop vocal strain. We've all probably met someone in our field whose hands or lower back hurt at the end of the workday — or we are that someone, or we will be in 50 years. I needed a way to reduce those painful gestures. I'm going to talk about two tools that I honestly use literally every day that help me do just that. The first you might already be a little familiar with, and that's IntelliCode for Visual Studio. >> That's an irresistible story, and this is what inspires me to do my job.
Okay, now we move on to the second part of the talk: a different dimension that can improve AI reasoning, that of interaction. What we'll notice here is that the AI can interact with the user in many different forms. But what is also very important is to evaluate how good a given interaction is, so that we can improve upon it.
Last year, we wanted to add Copilot support for debugging. It's one of the most painful things that developers do — the place where they spend the most amount of their time. So now, when you get a runtime exception, you'll see this Ask Copilot button, and if you click it, Copilot tries to explain why this exception arose and tries to suggest a fix for it. So how did we develop this feature? Well, we tried prompt engineering first: different kinds of prompt instructions, different kinds of debugging information that you can stuff into the prompt.
So how well did it work? We got only 25% success — not very well. So what really went wrong? What we observed was that Copilot was not following the right debugging process. When it was not confident about what the error was, it should do some investigation and try to figure out the root cause; but instead of figuring that out, it was leaping back to the user with a proposed fix: "this might be happening." And the second challenge was that it was too eager to close the conversation with the user, saying, you know, you can handle this case yourself now. So this leaves the burden of delegation and of reopening the conversation on the user, and that's a challenge.
We solved these issues with a very interesting agentic pattern. But before I go into that, let's step back a bit. What do these LLMs do today? You ask them a question Q, and they will come back with an answer A even when they are not confident — and they hallucinate. Why do they come back with an answer A? Because that's what we have trained them for: it's called instruction following. I give you an instruction, you get back to me with an answer.
It turns out that we can try to fix this behavior by using a very simple trick: you prompt the LLM that if it does not know the answer to a given question, then it should do some investigation instead of trying to hallucinate the answer. So in this particular feature, the prompt for our agent said: when the user gives you a bug to investigate, first decide whether more investigation is needed to come up with an answer. If the answer is no, then you respond back to the user and tell them what was wrong. But if more investigation is needed, then ask whether the agent can do that investigation on its own, using the context and the tools that it has at hand — and if so, go ahead and do it; but if user input is needed, then get back to the user with a clarifying question.
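Here is a rough sketch of what such an investigate-and-respond loop can look like. The prompt text, the `llm` callable, and the tool names are all illustrative stand-ins, not the actual Copilot implementation.

```python
INVESTIGATE_AND_RESPOND = """
You are a debugging assistant. When the user reports a bug or an exception:
1. Decide whether you already have enough evidence to explain the root cause.
2. If not, investigate first: read the stack trace, open the relevant code,
   run the failing test, check variable values. Do NOT guess a fix.
3. If the investigation needs information only the user has (repro steps,
   expected behavior, environment), ask ONE concrete clarifying question.
4. Only once the root cause is established, explain it and propose a fix.
"""

def debug_agent(issue, llm, tools, max_turns=10):
    """llm(history) returns a dict like {"action": ..., "content": ..., "args": ...};
    tools maps action names to callables. Both are hypothetical stand-ins."""
    history = [("system", INVESTIGATE_AND_RESPOND), ("user", issue)]
    for _ in range(max_turns):
        step = llm(history)
        if step["action"] == "answer":                 # confident: root cause + fix
            return step["content"]
        if step["action"] == "ask_user":               # needs the user's knowledge
            history.append(("assistant", step["content"]))
            history.append(("user", input(step["content"] + " ")))
        else:                                          # investigate with a tool
            result = tools[step["action"]](**step.get("args", {}))
            history.append(("tool", str(result)))
    return "Could not establish a root cause within the turn budget."
```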
So how did this simple agentic pattern work? Let's look at this Gantt chart, which shows where the developers spend their time using Copilot for debugging. On the left you see the experience with the old Copilot chat, and on the right is the experience when Copilot chat is extended with this investigate-and-respond agent. On the right side you see more solid bars, which means the user was working more with Copilot, and you also notice that there's a lot of blue before green. Blue means that the agent was investigating, doing a root cause analysis, and green means it was in the phase of suggesting a fix to the user — which is the right debugging workflow. And our success rate went up to 88%. Of all the products that I have ever worked on, we are usually happy with a 10 to 25% improvement; this is the only one where I have seen the success rate jump so significantly.
But you know what was most heartening for me this time around? The second of these quotes from the users in the study. The user said that by using this, I was not only able to do my task, but I learned about the right ways of debugging code, and the next time I spot this kind of error, I feel empowered to do these things on my own. That's what good tools should do: empower users.
Now, this study was done more than a year ago, and since then models have evolved and reasoning models have emerged. So we set out to study what changed, and we did another recent study. This time we got 20 developers and gave them one hour to try the agent of their choice within Cursor. We gave them real open GitHub issues — something that they were familiar with and were really motivated to fix. We observed that they were around 50% satisfied, or 50% successful, at what they were trying to do.
But the story gets really interesting if you look under the hood. We observed that users were using two different kinds of patterns when they were working with the Cursor agent. One pattern, which we call the slingshot pattern: they give the entire description to the agent in one shot at the beginning, hoping it will do it all. In a small number of cases it does, and then it's magical; but when it does not, it's very difficult to understand what went wrong, and here we're underutilizing the user's potential. And then, when these users hit a wall, they switch to a completely different strategy.
In fact, even with the slingshot strategy, the interesting challenge is that the model is very confident in whatever it suggests, even in all the cases where it goes wrong. So users hit this wall and move to a different strategy, which we call the staircase strategy, where they themselves divide the problem into small subtasks and then work with the model to accomplish each of those subtasks, and this leads to a much smoother convergence. But here we are underutilizing the potential of the agent, because the agent can also help you with planning and can also do more sophisticated tasks. So we need an approach that is somewhere in the middle.
On one hand, the agent is actually quite powerful — much more powerful than just being able to do a small subtask. On the other hand, the user has a lot of tacit knowledge which only comes from lived experience: their knowledge of the local conventions or the undocumented assumptions. And the way to bring these together is for the agent to know what is the right question to ask of the user at the right time. This is exactly the pattern that I showed you before, the investigate-and-respond pattern. It turns out that this behavior still has not emerged inside these models; it needs to be explicitly programmed as part of the agent.
We learned another interesting insight in this user study. Cursor allows people to backtrack, but many developers did not use that feature, because they fear that they might lose their progress. It is not clear whether to prefer backtracking or to continue to iterate with the agent when it is not converging onto a solution, and most developers prefer the latter even when it is inefficient. So what would be a safe backtracking option for the user — something that will allow them to safely explore different alternatives, for example by making it easy to fork conversations? That would unlock the power of human-AI collaboration in a much more powerful way.
Now let's look at these two conversations. On the left is a conversation with the old Copilot chat, which does not follow good conversation principles: you see that it is making many assumptions, not doing any clarification with the user, and it fails to provide the right answer. On the right, in contrast, you see an agent that is very curious to investigate more, asks the right questions, and helps converge to the right answer. So the question to ask here is: how can we ensure that models are more likely to have good conversations like the one that you see on the right? How can we program these models, these agents, so that they carry out good conversations like that one?
But before you can influence the behavior, you need to be able to measure it first. And before you can measure, you need to define what constitutes a good conversation. So what is a good conversation? To answer this question, we looked into the social sciences literature, and here we found wisdom from the philosopher Paul Grice, who suggested these four maxims for effective communication between human beings. Quantity: you should say neither too little nor too much. Quality: whatever you say should be truthful; provide evidence for your answers. Relevance: stay on track, relevant to the user's goals. And manner: be clear, without using jargon.
But the catch is that these maxims need to be interpreted in the context of the underlying domain. So what we need are rubrics that reflect these maxims but for the kind of task that the user cares about — are they doing bug fixing, or are they doing data analysis? This is what the rubrics need to be able to look at. We generated these rubrics using a semi-structured, semi-supervised approach. We start from labeled conversations with a thumbs up or a thumbs down, and then we extract detailed reasonings that explain why a conversation is successful or not. This reasoning is extracted using a prompt, and as you read through this prompt, you will notice two interesting aspects. One is the information about the underlying domain — in this case, coding conversations. The second is assigning the blame responsibility in a fair manner: the AI can be at fault, but the user can also be at fault, so we need to take that into account. Now, once we have these detailed reasonings, we use the Gricean maxims to convert these reasonings into the right rubrics — the right form of the Gricean maxims, specialized to the underlying domain.
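To make the pipeline concrete, here is a hedged sketch. The `llm` callable is a stand-in for a model call, and the example rubric texts are my own invented illustrations of what a domain-specialized Gricean rubric might look like for coding conversations.

```python
GRICEAN_MAXIMS = ["quantity", "quality", "relevance", "manner"]

# Invented examples of domain-specialized rubric questions.
EXAMPLE_RUBRICS = {
    "quantity":  "Did the assistant give just enough detail about the root cause, "
                 "without dumping whole files back at the user?",
    "quality":   "Were claims about the code backed by evidence such as stack "
                 "traces, variable values, or executed snippets?",
    "relevance": "Did every turn move toward the user's stated goal?",
    "manner":    "Were explanations clear and free of unexplained jargon?",
}

def extract_reasoning(conversation, label, domain, llm):
    """Step 1: turn a thumbs-up/down label into a detailed explanation,
    attributing blame fairly to the assistant or the user."""
    prompt = (f"Domain: {domain}\nLabel: {label}\n\n{conversation}\n\n"
              "Explain in detail why this conversation deserved this label. "
              "Attribute blame fairly: the assistant OR the user may be at fault.")
    return llm(prompt)

def reasonings_to_rubrics(reasonings, domain, llm):
    """Step 2: distill the explanations into one yes/no rubric question per
    Gricean maxim, specialized to the domain."""
    prompt = (f"Domain: {domain}\nMaxims: {GRICEAN_MAXIMS}\n\n"
              + "\n---\n".join(reasonings)
              + "\n\nWrite one rubric question per maxim grounded in these explanations.")
    return llm(prompt)
```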
So how well do these rubrics work? What we observed was that, compared to a rubric set generated by another state-of-the-art technique — which is not domain-sensitive and does not use these maxims — our rubric set is much more sharply able to separate the positive conversations from the negative conversations: a separation score of 27 versus 3, much sharper, and the precision also goes up from 58%.
So now we understand how to evaluate the quality of a given conversation. But let's think about what it would take to evaluate the quality of an agent with some configuration. For that, we will need to be able to generate lots of these kinds of conversations automatically. In our case, the domain that we were interested in was open-ended tasks for data analysis — tasks that require multiple joins. So first we needed to build a data set of such tasks. What we did was go to sources such as Tidy Tuesday, public notebooks, even research papers, and from there we tried to extract tasks like this: we figured out what the right data set was, some meaningful queries over those data sets, and also the expected answers.
Now, once we have a collection of such tasks, let's see how an agent would work on any one of them. The orange agent might just give the answer in this case, saying that there's no correlation between the budget size and the ratings of low-budget horror films, under some assumption. That's a possibility. But the green agent can come back with a question: where is the rating located? And the user might clarify: the rating is located in the review rating column. Or the yellow agent might say: what should I do with the missing entries in the budget column? And the user might say: filter out those entries. All of these are reasonable interactions, but this is what we want to automate so that we can evaluate the quality of an agent — we want to be able to automate this user input. So we need to build a user proxy. And what should this proxy do? It should be able to answer all the clarifying questions that the agent is asking.
Now, in the case of data analysis, you cannot really answer these clarifying questions by just looking at the output — 42, right? What does it tell you about how to derive 42 from the input? So our idea was to generate supporting code that evidences the output from the input data set. This code makes it very transparent what data cleaning choices were made and what analysis choices were made, so that you can reliably derive the right output from the input data set. And using this code, the proxy is now able to answer the questions that the agent asks.
Another interesting challenge here is who the proxy should simulate. Excel has a huge, diverse user base with different expertise and needs. A college student might prefer a bar chart or a pie chart and may not know much about formulas; on the other hand, a business analyst might want beautiful dashboards with full automation. So the idea was to use persona-specific prompting recipes so that the user proxy can act like a specific persona.
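A simplified sketch of such a user proxy (my own illustration): it holds the task, the ground-truth supporting code, and a persona description, and answers the agent's clarifying questions only with information those contain.

```python
from dataclasses import dataclass

@dataclass
class UserProxy:
    """Simulates the user when evaluating a data-analysis agent automatically."""
    task: str                # the open-ended question given to the agent
    supporting_code: str     # code that derives the expected answer from the data
    persona: str             # e.g. "student who prefers charts, avoids formulas"

    def answer(self, clarifying_question: str, llm) -> str:
        # `llm` is a stand-in for a model call. Grounding the proxy in the
        # supporting code keeps it from inventing data-cleaning choices that
        # were never made; the persona shapes the style of the answer.
        prompt = (f"You are this user: {self.persona}\n"
                  f"Your task was: {self.task}\n"
                  f"Reference analysis (do not reveal verbatim):\n{self.supporting_code}\n"
                  f"The assistant asks: {clarifying_question}\n"
                  f"Answer briefly, using only facts implied by the reference analysis.")
        return llm(prompt)
```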
And the ultimate test of a good user proxy should be that it is able to amplify the performance gap between different agents and be predictive of their performance in the real world.
Now let me talk about a very different way to handle ambiguity: not by bouncing clarifying questions back to the user, and not by generating an incorrect answer, as you see in this case. Here, the model does not know what the selected score is, but it still goes and writes score1, thinking that it is the selected score that the user wants. What we can do instead — instead of getting an incorrect formula like this, or asking questions of the user — is to return a partial response like the one that you see on the right, where we use placeholders, in this case a question mark, denoting where the ambiguity is in the user's input.
So now, how can you train a model to be able to generate such placeholders? Here is the key realization. The data that you typically use to fine-tune models actually has issues of the kind that you see at the top: it's noisy, it's not entirely correct, it's not very high quality. That's the challenge in real life, and this is what often leads the models to hallucinate. But in our case, this is a blessing, because you can use it — you can use it to actually train the model to identify such ambiguities and abstain on them instead of hallucinating. So how do we do this? Don't change the data — keep the same data. All that you do is change the loss function.
This is the loss function. The green part is the standard cross-entropy loss: it punishes wrong tokens. The blue part reduces the punishment if the model decides to output a question mark. But you don't want the model to cheat and put question marks everywhere, so you need the red regularizer. And the gray part simply controls how often the model should use a placeholder: initially we start out with strict training, with fewer placeholders, and then we relax it.
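Here is a hedged PyTorch sketch of what such a loss can look like. It is my reconstruction of the colored terms as described in the talk, not the exact formulation from the paper; `alpha`, `beta`, and the annealing schedule are assumptions.

```python
import torch
import torch.nn.functional as F

def placeholder_aware_loss(logits, targets, qmark_id, alpha=0.5, beta=0.1):
    """logits: (batch, seq, vocab); targets: (batch, seq) noisy reference tokens."""
    vocab = logits.size(-1)
    # "green": standard cross entropy, punishes tokens that differ from the
    # (possibly noisy) reference
    ce = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1),
                         reduction="none")
    # probability mass placed on the placeholder token at each position
    p_qmark = F.softmax(logits, dim=-1)[..., qmark_id].reshape(-1)
    # "blue": soften the punishment in proportion to how strongly the model
    # chose to abstain with a placeholder rather than guess a value
    softened = ce * (1.0 - alpha * p_qmark)
    # "red": regularizer so the model cannot cheat by abstaining everywhere
    overuse = beta * p_qmark
    return (softened + overuse).mean()

# "gray" knob: start strict (small alpha, larger beta) so placeholders are rare,
# then relax the schedule over training to allow more abstention.
```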
So how effectively does this work? Let's look at the confusion matrix. For the cases where we were predicting the correct formula earlier, we still predict them correctly 97.6% of the time — quite good. But look at the bottom row, the cases where the model was making the wrong prediction earlier. Now, with this new loss function, we are able to output a placeholder instead 90.8% of the time — quite significant, and it really achieves what we set out to do.
Now let's take this idea of placeholders one step further, and I will do this in the context of a very different application: that of synthesizing creative content. This is the version of the image that I used on my title slide. How did I generate it? I provided my title and abstract to a text-to-image model, and the image that you see is the output of that. But the story starts here. What the tool does next is to give us meaningful adaptations of the underlying prompt: it puts placeholders in the prompt, gives meaningful names to those parameters, and also gives me alternative choices for those parameters. In this case it included three placeholders in the prompt, and these are the parameter names it provided.
Let's look at the central-theme parameter. What you see in the picture is a branching decision tree with glowing neural links, but the model provides me other options for this central theme as well: a flowchart with arrows and icons, an abstract network of gears, a chip with glowing pathways — and let's see what they look like. Quite cool. I probably would not have been able to come up with these choices myself.
Let's look at another parameter: color scheme. Actually, I tried to change the color of this title slide when I first got this output, but I didn't find a good option. But look at the options that the model gives me. The most surprising was the parameter at the top. It's just a boolean parameter; all I can do is flip it from no to yes. But see what happens when I do: now you can see the shadow of spreadsheets and code in the picture, which is quite meaningful — in fact, my abstract does talk about applications to spreadsheets and code, and this is exactly what the model picked up on.
Okay, now I'm moving on to the last small part of my talk, which is the third pillar: that of inspection. The idea that I really want you to take away from this part is that if you do this inspection — the process of checking the output or the reasoning of a model — in structured ways, using code, in a systematic way, then you can get much more robust results. I'll give you a couple of examples.
Let's consider the problem of extracting tables from PDF documents and different kinds of reports. When you do copy-paste, or when you use OCR, it flattens the structure and you get a lot of mistakes as a result, and people spend a huge amount of time trying to fix these mistakes. If you use an LLM today, many of the mistakes still persist — in fact, it can even hallucinate values, which is quite bad. So here's the neuro-symbolic solution we designed, which works more effectively.
We start with table generation from an LLM and we get an initial table. Then we run some sanity checks on it — symbolic checks that search for different kinds of syntactic and semantic issues — and we pass those over to a critic, which generates nice, actionable feedback, and we repeat the process until we get something more satisfying. Let's see how that works with an example. We copy-pasted data from a PDF and put it into the prompt of the LLM, and what we get is the table that you see on the top left, the yellow one, and it has issues: rows that have been merged, misaligned cells, and so on. Now we run the symbolic checks, and they identify the issues that you see in the blue box. These issues are fed into the critic, which generates real actionable feedback for the original LLM to retry, and the retry leads to the nice green table that you see.
So what do these checks do? There are two different aspects that they look at. They check whether all the data in a given column matches some entity type or some underlying regular pattern — are there any special characters that are used inconsistently? They also check whether every piece of content in the table can be traced back to the source text; if not, it means some hallucination. And the other way around: how much of the input text is actually preserved in the table?
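A minimal sketch of such symbolic checks (illustrative only; the real checks are richer), producing human-readable issues that can be handed to the critic as feedback:

```python
import re

def check_table(table, source_text):
    """table: list of rows (lists of cell strings). Returns a list of issues."""
    issues = []
    # 1. Column consistency: cells in a column should share a common shape.
    for j, col in enumerate(zip(*table)):
        numeric = sum(bool(re.fullmatch(r"-?[\d,.]+%?", c.strip())) for c in col)
        if 0 < numeric < len(col):
            issues.append(f"Column {j} mixes numeric and non-numeric values.")
    # 2. Faithfulness: every cell should be traceable back to the source text.
    for i, row in enumerate(table):
        for cell in row:
            if cell.strip() and cell.strip() not in source_text:
                issues.append(f"Row {i}: value '{cell}' not found in the source "
                              f"(possible hallucination).")
    # 3. Coverage: how much of the source made it into the table at all.
    covered = sum(len(c) for row in table for c in row if c.strip() in source_text)
    issues.append(f"Coverage: roughly {100 * covered // max(1, len(source_text))}% "
                  f"of source characters appear in the table.")
    return issues
```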
In one of our benchmarks, we observed almost a 15% lift when we give this kind of feedback in structured ways using code, as opposed to just asking another LLM to look at the output. And that's the point that I'm trying to make: whenever you can do things in a systematic way using code, you should prefer that — it will lead to more robust results.
Now, the final example in my talk is about a different kind of feedback that you can do on a model's output — in this case, a model's reasoning trace — and for a different purpose. What we did was take lots of different reasoning traces and parse these traces into their constituent elements: a plan, an action that the model is taking, an observation, an effect. Then we try to do some analysis over them: what did the model try to do, was it really successful in doing that or not, and if not, we do some root cause analysis to figure out what might have been the reason for the failure or the success.
And here's the interesting part. We can actually draw insights from these traces along many dimensions. These insights can be abstract or more concrete; they can focus on smaller segments of the trajectory or on the entire end-to-end trajectory; and they can be about positive observations or negative observations of the trajectory. Then you can experiment with incorporating different combinations of these insights into your prompts so that the behavior of the model becomes more robust.
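A rough sketch of the data involved (my own illustration of the parsed trajectory elements and the insight dimensions just described):

```python
from dataclasses import dataclass

@dataclass
class Step:
    plan: str          # what the model said it would do
    action: str        # the tool call or edit it actually made
    observation: str   # what came back
    effect: str        # whether it moved the task forward

@dataclass
class Insight:
    text: str
    scope: str         # "segment" or "end_to_end"
    polarity: str      # "positive" or "negative"
    abstract: bool     # general lesson vs. concrete, task-specific note

def select_insights(insights, scope="end_to_end", polarity="negative"):
    """Pick one combination of insights (e.g. only negative, end-to-end ones)
    to fold back into the prompt."""
    return [i.text for i in insights
            if i.scope == scope and i.polarity == polarity]
```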
On one data set, we observed that if you encode only the negative end-to-end insights, you get better precision than if you encode all end-to-end insights including the positive ones — the positive behavior apparently did not need to be reinforced, but it was important to point out to the model what went wrong — and the precision went up from 75%. This is very much ongoing, very recent work, but I still wanted to share it.
So let me conclude now. The first pillar that I talked about for improving AI reasoning was to leverage intent, which can be much richer than just natural language, especially when you have input-output examples. It's not just that you can use the examples to validate the output of the model — you can actually use them to force correct generation in the first place, as I showed you earlier.
Another interesting place where a lot of user intent is hidden is temporal context: the actions that the user has been doing in the recent past. In fact, this idea that I showed you for IntelliCode suggestions is being picked up by many different products, including Cursor and Excel Copilot, which all look at what has been happening in the recent past to be able to make better suggestions for the user.
The second dimension that I talked about was interaction. I showed different ways of interacting with the user: asking questions, responding through a sketch, and suggesting different choices for a parameter in the sketch. And the interesting part here was to leverage communication principles from social science, and also how we evaluate this kind of interaction.
But one subtle thing that I did not go into in much detail is what the interaction is in service of. We often think of productivity use cases — which is probably what you have mostly seen here — where the user knows how to do the task but AI just makes them much more effective. But sometimes the use case is learning: the goal is for the user to learn about new things and to deepen their understanding, and we saw a glimpse of that in the debugging feature for Visual Studio, where the developers felt that it was also a learning experience for them, not just productivity. And then another fundamental use case is creativity, where neither the user nor the model knows the final destination — you're exploring together.
And the third pillar was that of inspection: you can systematically inspect model outputs and the reasoning traces they produce, not only to do iterative refinement for that particular task instance but also to do prompt updates. The idea that I really wanted to leave you with was to do this in a systematic manner, using code where possible. And there is a fourth dimension, which you probably saw on the last slide, which can make all of this much better.
So let me conclude with another thought. In the movie 50 First Dates, the character played by Drew Barrymore doesn't have short-term memory: every morning she wakes up and forgets what happened, again and again, and the hero, played by Adam Sandler, has to win her heart all over again. Now, this is sweet and funny in a romantic comedy, but not so much in human-AI interaction. Without memory, every interaction is like a first date: you have to teach it the same preferences, and correct the same kind of mistakes, over and over again. But memory changes the game, with the possibility to turn every interaction into an evolving relationship. There is a very active area of research going on here, and huge opportunities: how to represent memory, how to update it, and how to do that in a safe manner, with the user having full control of it. So these are the four pillars that I wanted to inspire you with: intent, interaction, inspection, and memory. Together, they will convert AI from a power tool that reacts into a collaborator that evolves. Thank you.
>> So before we close — thank you very much, that was very inspiring. I have a quick question in terms of how the agent and human interaction can evolve, based on the work that you described today. From what I observed today, guiding the agent down the right path with human guidance leads to better results. However, the way I look at it is that the search space exploration can be deep. How far are we from agents that can do that search space exploration themselves, and is an LLM the right seed for that, or is there another kind of model that we should look at?
I'm not able to hear you clearly, but let me try to repeat your question in part — and you can probably turn your mic up a little, it might help. >> Is that better? >> Yeah, this is much better. So, about this interaction between human and AI: how can we use it to ensure that the interaction gets better in the future — the guidance that was provided by the human, saying "now let's go do this," or, for example, if you build a better prompt or figure out a better approach and then integrate that into the human-agent interaction. I'm trying to figure out how far the agents are from doing that — figuring out that guidance by themselves — and is an LLM the right seed for it?
So, how far are LLMs from being able to learn from these past interactions, and are they the right tool? I think it's a very interesting question. Today, when we think about, let's say, fine-tuning an LLM for a particular domain — that's one of the tools that we often use: if the LLM is not working very well for the task that you care about, for the domain that you care about, then we talk about fine-tuning. We focus our entire energy on collecting lots of data sets and doing evaluation, so that we can teach the LLM to do the sophisticated task in that domain well enough in one shot. And that's the opportunity.
However powerful these LLMs become, even if they're able to do tasks in one shot, the aspirations of humans will keep rising. So we always want to think of collaborative environments where, jointly, they can accomplish much more, and this needs to be reflected as part of fine-tuning. This is what we are not doing today. What we are doing today is: if the model is not working very well for your task at hand, for your domain at hand, you try to collect a lot of data and fine-tune it. But we need to make this interaction part of the fine-tuning, which is something that I have not yet seen happening as much. This will probably start to happen once we actually get that kind of data. Today, the LLMs are trained on all the data that exists in the world, and as people try to use them, interaction data is starting to accumulate. Once we collect more of this kind of data, we will have the right data to fine-tune the LLMs on those interactions, and then they will have this inherent capability. Until then, we will have to use external mechanisms like memory to be able to do that. And the second thing is that personalization is an aspect which cannot be baked into the model itself; it has to be handled outside, at the level of each specific user or of a specific team, because different people might have different preferences for exactly the same kind of task.
Hello. Thank you so much for the talk, that was really interesting. I had a question. We have a similar problem to some of the examples you showed, where we are extracting text from tables — the other way around — and we are fine-tuning the model with a lot of information, but we still sometimes get values extracted incorrectly, or hallucinated values. I was wondering — a lot of the talks in this conference talked about critic agents as well as using reinforcement learning — if you've come across any strategies that could help with this. >> So let me repeat your question.
In the task of converting text into tables, or tables into text — transforming data from one format to another — there are many techniques being developed to improve the accuracy of these kinds of tasks. What is some value that I can add on top of that? I can probably use the pillar of interaction to talk about how we can make more advances here. As I said, these models will continue to improve in their power, but for them to be able to do more and more sophisticated tasks that exceed their capability, you have to fall back onto interaction. So what can be some good interaction mechanisms in this task of, let's say, data conversion from one format to another? One thing that we recently started exploring
was to look at multiple models, multiple systems that might be of comparable power, run them both on the same input task, and then see on which parts of the output they provide a different answer — and use that to light up a heat map for the user, so that they can pay attention to those parts where there is less confidence or more disagreement. I think building systems like these would actually be very helpful. Another thing that you can do is, let's say, if you want to generate a table and that table looks completely wrong because everything got shifted by one, then maybe you can start to build some edit tools — maybe these edit tools can be driven by the LLM itself — that can fix the kinds of errors that actually happen. So also focus on building good interaction experiences: (a) for the user to help spot the discrepancy, again working with the AI, and (b) for how you can quickly fix those. Those are the two pillars that I would offer in the vein of interaction to make these systems better. >> Thank you.
https://ai-reasoning.github.io/

AI models are increasingly capable of solving sophisticated tasks that require reasoning. But how do we improve the quality of that reasoning, especially when the models operate as black boxes? In this talk, Sumit Gulwani shares practical strategies for improving AI reasoning in the domain of code and structured tasks. A first idea is to capture richer forms of user intent. Input-output examples not only enable post-hoc validation, but also guide the model toward correct generations up front. Temporal context (such as recent user actions) can help infer evolving intent and keep users in flow. Secondly, we can give the model an escape mechanism, allowing it to abstain or initiate collaborative interaction when it lacks sufficient information. This raises new challenges in evaluating interactive workflows, which we address through rubric-based assessments of conversation quality (grounded in principles like the Gricean maxims) and automation using simulated user proxies. Finally, we can strengthen reasoning via automated inspection. Symbolic checkers or programmatic validators can uncover hallucinations and inconsistencies in both online and offline settings. These signals can then guide the model through iterative refinement or prompt updates. Sumit illustrates these ideas through real-world applications spanning spreadsheet tasks and software development, highlighting how AI reasoning can be improved using structured intent, collaborative interaction, and systematic inspection.