How can you evaluate all of the texts that AI spits out?
Traditional metrics might not cut it for your task, and manual labeling
takes a really long time.
Enter LLM as a judge, or LLMs judging other LLMs' outputs.
If you've ever manually tried labeling hundreds of outputs,
whether it be chatbot replies or summaries, you know that it's a lot of work.
Now imagine an AI that can scale, adapt
and explain its judgments.
In this video, we're going to look at how LLMs evaluate outputs.
The video's gonna be split into three parts: LLM-as-a-judge
strategies, some benefits of using LLM as a judge, and some drawbacks.
When it comes to reference-free evaluation, there
are two main ways to leverage LLM as a judge.
First, we have direct assessment,
in which you design a rubric.
And we also have pairwise comparison,
in which you ask the model: which
option is better,
A or B?
Let's start with direct assessment.
Suppose you're evaluating a bunch of outputs, say summaries,
for coherence and clarity.
If you're using direct assessment,
this hinges on designing a rubric.
So you might design a rubric where you ask:
is this summary clear and coherent? With two different options: yes, the summary is clear, or no, the summary is not clear.
Each of your outputs will be evaluated
based on the rubric that you've designed.
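To make that concrete, here is a minimal sketch of what a direct-assessment loop could look like in Python. The rubric wording mirrors the yes/no example above, and `call_llm` is a hypothetical placeholder for whatever client you actually use to reach your judge model; this is an illustration, not a prescribed implementation.

```python
# Minimal sketch of direct assessment with a yes/no rubric.
# `call_llm` is a hypothetical stand-in for a real judge-model client.

RUBRIC_PROMPT = """You are evaluating a summary for clarity and coherence.

Summary:
{summary}

Is this summary clear and coherent?
Answer with exactly one option:
- Yes, the summary is clear.
- No, the summary is not clear.
"""

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real call to your judge model.
    raise NotImplementedError

def direct_assessment(summaries: list[str]) -> list[dict]:
    results = []
    for summary in summaries:
        reply = call_llm(RUBRIC_PROMPT.format(summary=summary))
        # Map the free-text reply onto the two rubric options.
        label = "clear" if reply.strip().lower().startswith("yes") else "not clear"
        results.append({"summary": summary, "label": label, "raw_reply": reply})
    return results
```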
Now let's talk about pairwise comparison.
In pairwise comparison, your
focus is on comparing two different outputs
instead of assigning a standalone label like in direct assessment.
So if your focus is on clarity, you're asking the model: which of these outputs is better? Option A or option B? In the case where there are multiple outputs, you
can then use a ranking algorithm
to create a ranking of the overall comparisons.
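Here is a similarly hedged sketch of pairwise comparison. Every pair of outputs is judged once and wins are counted to produce a ranking; `call_llm` is again a hypothetical placeholder, and a real framework might use a more principled ranking algorithm (for example, Elo or Bradley-Terry) instead of raw win counts.

```python
# Minimal sketch of pairwise comparison with a simple win-count ranking.
from itertools import combinations

PAIRWISE_PROMPT = """You are comparing two outputs for clarity.

Option A:
{a}

Option B:
{b}

Which option is clearer? Answer with exactly "A" or "B".
"""

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # replace with a real judge-model call

def rank_by_pairwise_wins(outputs: list[str]) -> list[tuple[str, int]]:
    wins = {i: 0 for i in range(len(outputs))}
    for i, j in combinations(range(len(outputs)), 2):
        reply = call_llm(PAIRWISE_PROMPT.format(a=outputs[i], b=outputs[j]))
        winner = i if reply.strip().upper().startswith("A") else j
        wins[winner] += 1
    # Sort outputs by number of pairwise wins, best first.
    ranked = sorted(wins, key=wins.get, reverse=True)
    return [(outputs[i], wins[i]) for i in ranked]
```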
Which of these strategies is better
for the task you're trying to accomplish? Well,
our user research on the newly open-sourced framework
EvalAssist showed that about half of the participants
preferred direct assessment for its clarity and the control it gave them over their rubric.
About a quarter preferred pairwise comparison,
especially for subjective tasks.
And the remainder of the participants preferred a combined approach
using direct assessment for compliance,
and then leveraging the ranking algorithm
that comes with the pairwise comparison to select the best output.
Ultimately, the choice was both task- and user-dependent. Now,
for some reasons why you might want to use LLM as a judge.
First, it scales.
If you're generating hundreds
or even thousands of outputs with a variety of models and prompts,
you probably don't want to evaluate them all by hand. LLM
as a judge can handle that volume
and give you feedback and evaluations
in a structured and timely way.
Second, LLM as a judge is also really flexible.
Traditional modes of evaluation are really rigid.
So let's say you build a rubric,
and you start evaluating a bunch of your outputs.
As you see more data,
it is really normal for your criteria to start shifting,
and you might want to make changes to your rubric. LLM
as a judge helps with the criteria-refinement process.
You can refine your prompts
and be really flexible in your evaluations.
And lastly, there's nuance.
Traditional metrics like BLEU and ROUGE focus on word overlap,
which is nice if you have a reference.
But what if you don't have a reference?
What if you want to ask a question like, is my output natural?
Does it sound human? LLM
as a judge lets you do these evaluations
on more subjective outputs without a reference.
But of course, there are drawbacks to using LLM as a judge.
Just like humans, LLMs have their blind spots, and these show up as different types of biases.
For example, there's positional bias.
This means that an LLM will consistently favor an output because of its position, even if the content is not necessarily better. So,
let's say, in the pairwise comparison case,
you're asking the model: which is better option A or option B?
And it continuously favors option
A regardless of what content option A contains.
This means that it is expressing positional bias.
There's also verbosity bias,
and this happens when an evaluator
continuously favors output that is longer
regardless of its quality.
Again, the longer output can be repetitive
or go off track, but the model will continuously favor it
because it sees length as quality.
This is verbosity bias.
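The video doesn't prescribe a specific check for this, but one simple sanity check could be to look at how often the judge picks the longer of two outputs across many pairwise judgments. The sketch below assumes a hypothetical list of judgment records; a rate far above what your own spot checks support may suggest the judge is treating length as quality.

```python
# Rough sketch of a verbosity-bias sanity check (illustrative, not from the video).
# Each judgment is assumed to be a dict like {"a": str, "b": str, "winner": "A" or "B"}.

def longer_preferred_rate(judgments: list[dict]) -> float:
    longer_wins = 0
    for j in judgments:
        winner_text = j["a"] if j["winner"] == "A" else j["b"]
        loser_text = j["b"] if j["winner"] == "A" else j["a"]
        # Count judgments where the winning output was also the longer one.
        if len(winner_text) > len(loser_text):
            longer_wins += 1
    return longer_wins / len(judgments) if judgments else 0.0
```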
There's also the case where a model might favor an output
because it recognizes that it created the output.
This is called self-enhancement bias.
So let's say you have a bunch of different outputs from different models.
And a model continuously favors an output that it created itself,
and the content is not necessarily better.
This is self-enhancement bias.
And so these sorts of biases can skew your results.
For example, a model can favor an output because it's longer
or because it's in a particular position.
But it's not necessarily better.
But good frameworks are built to catch these mistakes. For example, you can run positional swaps and see if the judgment changes: move an output from position A to position B and see if the model's selection of the best output changes.
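As a rough illustration, here is one way such a positional-swap check could look. `judge_pair` is a hypothetical judge call that returns "A" or "B"; the pair is judged in both orders, and a comparison is flagged when the two verdicts disagree.

```python
# Minimal sketch of a positional-swap check for positional bias.

def judge_pair(a: str, b: str) -> str:
    """Hypothetical judge call: returns 'A' or 'B' for the preferred option."""
    raise NotImplementedError

def positional_swap_check(a: str, b: str) -> dict:
    first = judge_pair(a, b)    # a shown as option A
    second = judge_pair(b, a)   # order swapped: b shown as option A
    # Map the second verdict back onto the original outputs.
    second_mapped = "B" if second == "A" else "A"
    consistent = first == second_mapped
    return {
        "verdict": first if consistent else "inconsistent",
        "position_bias_suspected": not consistent,
    }
```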
Bias in LLMs doesn't mean that the system is completely broken. It
just means that you need to stay vigilant.
So if you're tired of manually evaluating outputs, LLM as
a judge might be a good option for scalable, transparent and nuanced evaluation.