Okay. So that was a pretty easy question, I guess.
>> Yes. So the core idea behind GraphRAG is just organizing knowledge for retrieval. So, on to the next question.
Obviously, this one is still quite simple: that is basically what GraphRAG is targeting; that was the main goal.
So this is Microsoft's work, and you can check the previous slides, but basically they were trying to build community summaries, whereas some of the previous techniques...
Oh, who's the red?
>> That's me.
>> Okay, somebody online.
>> Great.
Which subset? Structurization is the one that formalizes the structure, decomposition is more about breaking down the task, extension is more about following the edges of the graph, and verbalization is more about how the text itself is expressed. So that's the correct answer.
So you mentioned verbalization. That's
not
>> Oh.
So in LightRAG, retrieval proceeds at two different levels: a low-level retrieval and a high-level retrieval. To give a more concrete example: low-level retrieval is more targeted, more entity- or relation-specific, such as "what is the relationship between these two entities." High-level queries are more abstract or thematic, for instance "how does AI impact education," which is broader and aggregates information along multiple edges. So the correct answer here would be that the query is used to generate two sets of keys, one more targeted and one more abstract.
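To make the dual-level idea concrete, here is a minimal sketch of how a LightRAG-style retriever could route one query through both levels. The function and index names (`llm_extract_keywords`, `entity_index`, `theme_index`) are hypothetical placeholders, not LightRAG's actual API.

```python
# Hedged sketch of LightRAG-style dual-level retrieval.
# `llm_extract_keywords`, `entity_index`, and `theme_index` are hypothetical
# stand-ins for an LLM keyword-extraction call and two indexes over the graph.

def dual_level_retrieve(query, llm_extract_keywords, entity_index, theme_index, k=5):
    # Ask an LLM to split the query into two key sets:
    #   low-level keys  -> specific entities/relations ("relationship between A and B")
    #   high-level keys -> abstract themes ("impact of AI on education")
    keys = llm_extract_keywords(query)          # e.g. {"low": [...], "high": [...]}

    # Low-level retrieval: look up concrete entities and their incident edges.
    low_hits = entity_index.search(keys["low"], top_k=k)

    # High-level retrieval: look up broader relation/theme summaries that
    # aggregate information along many edges of the graph.
    high_hits = theme_index.search(keys["high"], top_k=k)

    # Merge both evidence pools before handing them to the generator.
    return low_hits + high_hits
```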
The algorithm in question is, by definition, a community finding or detection algorithm. The goal of community detection is to maximize modularity, a measure of how well the network is partitioned into communities. So that's the correct answer.
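As a concrete illustration of modularity-based community detection (a generic example, not necessarily the exact algorithm from the paper discussed), here is a short sketch using NetworkX on its built-in karate-club graph.

```python
import networkx as nx
from networkx.algorithms import community

# Toy graph: Zachary's karate club, a standard community-detection example.
G = nx.karate_club_graph()

# Greedy modularity maximization: repeatedly merge communities so that
# modularity (how much denser intra-community edges are than chance) increases.
communities = community.greedy_modularity_communities(G)

# Modularity of the resulting partition; higher means better-separated communities.
Q = community.modularity(G, communities)

print(f"found {len(communities)} communities, modularity Q = {Q:.3f}")
```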
Oh the red is fair.
The last question.
Yeah, for organizer and generator: the role of the organizer is more about packaging the retrieved graph content, while the role of the generator is about producing the answer. Here, because Kahoot doesn't allow a lot of text in the answer boxes, "discriminative", "generative" and "graph-based" refer to the three ways of producing the answer. One is discrimination-based, using non-generative models like GCNs or graph transformers that act as a sort of classifier; one is generation-based; and the graph-based approaches generate the answer directly from the graph without verbalizing it first. So those three options correlate with the three categories.
So this summarizes our quiz. A round of applause to the winners, decided at the last minute, and to Chris. Okay, congratulations. Thanks, Jing, for narrating and creating the quiz today. And with that, I think our first presenter is online.
So we can
see better
Yeah.
>> Yes, I'm just curious: week 13, right? I mean, there'll be a Kahoot after that, right? So there's no week 14?
>> There's only 14. Yeah. So what we'll do is we'll run the Kahoot at the very end of that.
>> Yeah. So whoever is creating the Kahoot for that week actually needs to work with the lecturing staff to prepare the Kahoot at the same time. So, yeah, I will try to message about it in the week that we're covering that.
>> Okay, we'll stop our screen share. I think our online participants saw
>> Yeah.
>> everything all the way through. So start the screen share when you're ready.
>> Yeah. So before screen sharing, let me say hi from Barrett as well. Barrett says hi. We're here at the conference, but we are both busy with our group duties here. Yeah. So yeah. Great. Okay.
Thanks for working so hard while attending the NLP conference, which is a top-ranked conference.
So, yeah.
>> Okay, can you guys hear me and see the screen? Is everything good?
>> Yes. Um, okay. Yeah. So, I want to apologize in advance: I may have a 30-second interruption or two, because I can see the security staff trying to push us away, so I'm trying to hang on here. Okay. Right. Okay. So, I'd like to first give a very quick intro to this session, and then I'll leave the floor to my colleagues. Real quick, I'll go through the introduction, and then we have this agenda: we're going to talk about provenance and authenticity, then credibility, then pluralism and bias-aware source selection, then factuality and attribution, and finally Z will close the session. I also want to give a lot of credit to her, because she was the one helping to create this agenda today. So yeah, we have an amazing agenda today. My part is actually very easy and very intuitive.
So the idea is that when we are tracing the source of the news, we are trying to gauge the credibility: whether we can trust the source of the things we are referring to. Is it really trustworthy? It's like we put a small box into a bigger box, then the bigger box into an even larger box; can we trust them? Here I'd like to share a quote from my internship advisor. Very briefly, she gave a simple example of how errors can creep into our papers through LLM use: she had a statistic from 2020 at Google, and the 2023 AI Overview talks about the same stat as if it were from 2023, because it only provides a link to support the claim. So credibility can be "refreshed" just by repeating the claim year after year, and we still believe the thing is quite recent, when in fact it is not. As the first figure shows, we just put a small box into a larger box, but the core of the thing hasn't been updated, and the statistic from 2020 should probably also be updated. So this is the general idea of temporal awareness in source credibility.
This is very interesting, and I'd also like to share another thing, about false negatives in the sources and the things that we cite. The idea is that there could be something we should cite, but unfortunately, for some reason, the citation doesn't appear as expected, so there's a false negative. Here I'd like to quote a conversation between me and my advisor. We were at a lunch meetup with a staff member, a lady from the Elsevier publisher, and she told us that it's very hard to track citations and do a really good literature review of a field, because whenever we read the related-work section of a paper, it cites a few papers, but there are always some missing citations. So here are two robots: one says "I don't want to cite your paper," and the other robot says "I don't want to cite your paper either." The two robots don't like each other, so they don't cite each other's papers, even though those papers are worth being cited. However, when someone's paper gains prominence later, people will believe that its related-work section is pretty complete, and they will probably just ignore the citations that really need to be properly added. So this is a false-negative example. I think example one is a false positive, because we think something is recent when it isn't, and example two is a false-negative signal: something that should be there but isn't, for reasons that are about humans, not really about machines. Okay.
So, thanks to our lecture coordinator Jing, today I got to read this amazing survey, and I would recommend that everybody in this room read it, because it is basically an umbrella for today's reading group. The authors have brilliantly classified trustworthiness in RAG systems into six types: robustness and privacy in the retrieval stage, fairness and transparency in the generation stage, and factuality and accountability in the checking stage. So we have this triangle here, with different components being evaluated at different stages of a RAG system. And the outstanding part of this survey is that it is not only a survey: they also ran experiments, and they do the evaluation by prompting language models with prompts like this one. For example, when evaluating the aspect of transparency, they have a prompt containing the question and a reference, and they ask the model to carefully think and reflect some intermediate answer steps for the provided reference. I haven't gone through the survey in full detail, but I think they have some ground-truth intermediate answering steps that they can score against by the overlap between them. So here's their evaluation setup, and you're welcome to check the arXiv preprint for the full details of this paper.
And finally, here are their evaluation results. We can see a radar figure for the six aspects that they believe are important for the credibility of the sources and the RAG system. We see that older models, like Llama-2 13B, which was still very relevant in 2024 but not very relevant today, are not performing very well: we see this small green shape here, and its area is very small because the model is not performing well on any aspect. So which is the best one? The yellow one is GPT-3.5: it does very well on transparency, fairness, robustness, factuality and accountability, but the privacy aspect is dragging down its overall area. According to the definition, privacy means that we shouldn't let private data from any source be used in the RAG pipeline, so GPT-3.5 Turbo is not doing a very good job on this aspect. So which one is the best? I would say probably the blue one, the sapphire one: GPT-4 was the best when the authors compiled the survey in 2024.
Yeah. Okay. That's it for my presentation. Are there any questions about this survey or the intro?
Any questions for you, S?
Okay. Maybe we go ahead to the next segment, given the time constraints that we have.
>> Okay. Thank you.
>> So, who's up next? Yeah. Okay.
>> Yeah.
>> Okay.
Share your screen, and if you manage to use the pen, you're welcome to use it.
>> Uh, you need to share the entire screen. Maybe that's better.
Hello everyone. I'm N, and today I'll be talking about provenance and authenticity. Basically, this is quite important because we need to verify whether a source is real or not. In today's world, for almost all content we have this doubt: is it real, is it fake? And it's very hard to tell, even with AI-generated content. I believe this is a very important area for us to know about. The title says "beyond text" because traditional text-based verification methods are now outdated: they just don't work for today's multimodal content, and we need something that is fundamentally different and robust.
So, the trust crisis. The first issue, or the first challenge, is that synthetic media is now ubiquitous, and because of the very fine line between real and fake, it is increasingly convincing: we can easily come to think that something fake is real, and we need to find ways to tackle this issue. The second main challenge is that traditional verification methods have failed. The methods used so far for image or video content were looking for compression artifacts, checking metadata, and analyzing lighting inconsistencies, but modern generative models easily pass all of these tests, so we need to be careful, because right now we're essentially flying blind in these areas.
The third is that users cannot distinguish between manipulated and authentic evidence. Even when I use ChatGPT: I work in quantum computing research, and because it's such a niche field there are a lot of things I don't know on a daily basis, so when I ask ChatGPT to explain something from scratch, like "explain it to me like I'm five," it does provide citations, but later, when I click on those links, I realize the explanation doesn't fit what the linked sources actually say. Not every one of us clicks on all the links, right? But when you do click on them, you realize there's a huge inconsistency between the answer and what the links are saying. So the stakes are quite high, with misinformation spreading faster than corrections: according to one statistic, misinformation spreads six times faster than a correction.
So if there's already some information out there that's not real, it takes six times the effort for a correction to reach the same audience, and you're never really closing the gap. The second issue is the erosion of trust in digital evidence, be it in journalism or in courtroom cases. And then there are critical decisions that rely on unverifiable content: in the future, if we're using AI in, say, judicial or medical systems, we do not want false data to be the evidence for some decision that the AI system or the human is going to make.
So what's the solution? The solution is basically content credentials, which means being able to capture the capture-and-edit chain, possibly with cryptographic signatures. There might be other ways to do it, but I'm just highlighting this one in the interest of time. So what are cryptographic signatures? Think of it like a blockchain for content: every time something happens to a piece of media, from the moment it is captured to every edit made over the existence of that particular media, the action gets logged and cryptographically signed, so there's a record of every change that has been made. What exactly does this capture? The device or the tool that was used to create the content; the exact timestamp of creation; every operation, for example cropping, filtering, AI enhancements, so every edit operation is tracked; and, most importantly, whether AI was involved in the creation of the content. Why does this work? Because it's tamper-proof, or at least tamper-evident: if someone tries to modify the content without updating the credential, then the signature breaks and you immediately know that something's wrong, that there's an attacker. It also provides chain-of-custody proof, which means, like evidence in a criminal investigation, you can trace the entire history of that content. And crucially, this isn't just theoretical; it's actually being adopted, and the standard is called C2PA, the Coalition for Content Provenance and Authenticity, which is being used in practice by big names like Adobe, Microsoft, Canon, and so on. What's also interesting is that these credentials are both human-readable and machine-readable, which means they can be used for transparency in automated systems.
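To make the tamper-evidence idea concrete, here is a toy sketch of signing a content manifest and detecting later modification. It uses an Ed25519 signature from the Python `cryptography` package purely as an illustration; it is not the actual C2PA manifest format, and the field values are made up.

```python
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Toy "content credential": device, timestamp, edit chain, AI involvement.
# Values are illustrative only.
manifest = {
    "device": "ExampleCam X1",
    "captured_at": "2025-01-01T12:00:00Z",
    "edits": ["crop", "color-filter"],
    "ai_generated": False,
}

signing_key = Ed25519PrivateKey.generate()
payload = json.dumps(manifest, sort_keys=True).encode()
signature = signing_key.sign(payload)            # signed by the capture device/tool

# Verifier side: any change to the manifest breaks the signature.
tampered = dict(manifest, ai_generated=True)     # attacker flips a flag without re-signing
try:
    signing_key.public_key().verify(
        signature, json.dumps(tampered, sort_keys=True).encode()
    )
    print("credential verified")
except InvalidSignature:
    print("credential invalid: content was modified after signing")
```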
Next I'll be talking about attribution evaluation, because we need the technical infrastructure, right? Even with perfect provenance metadata, there is still a fundamental problem, which is evaluating whether claims are actually supported by evidence, and this is hard. A recent paper called AttributionBench, published at ACL 2024, tests this aspect systematically. They took state-of-the-art models, including GPT-3.5 fine-tuned specifically for this particular task, which is a binary classification of whether a certain piece of information is attributable or not attributable. The result was only about an 80% macro-F1 score, which means roughly one in five evaluations is wrong, and we cannot work with that kind of error bar. So then they did something really interesting: they analyzed about 300 error-case samples to understand why the model fails in the cases it does. About 66% of the errors came from what they call fine-grained information insensitivity.
That basically covers three areas: missing nuanced details, not being able to infer or summarize where a claim came from, and overlooking subtle contradictions. What we should also note is that these are errors that are very easily made by humans as well. The other 27% of the errors came from information access mismatch, which basically means that human annotators can see the full web page or the full document, whereas the model sees only extracted snippets. So the information given to the human versus the model is itself different, which also makes their judgments differ.
This is basically how AttributionBench works. As you can see, this example was fed into GPT-3.5: the claim was that the population of Thailand is about 63 million, while the reference discusses certain connections with religion and so on. The judgment GPT-3.5 gave was that this information is non-attributable, but the ground truth is that it is attributable. And the same thing happens in another case as well.
So how is this different? We always use the F1 score, but what did the authors of this paper do differently? They used something known as macro-F1, which is basically the unweighted average of the per-class F1 scores: they calculated the F1 score for the attributable class and the F1 score for the non-attributable class and took the unweighted average of the two, which is different from the usual F1 score that we talk about. Why does it matter? Because we want a balanced evaluation: when you're evaluating a model on a binary classification, you need to know that it isn't favouring one class, that it is not giving "attributable" more weight than "non-attributable." So having this balanced evaluation is very important.
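As a quick illustration of the metric, with toy labels rather than AttributionBench data, macro-F1 is just the unweighted mean of the two per-class F1 scores:

```python
from sklearn.metrics import f1_score

# 1 = attributable, 0 = not attributable (toy labels for illustration only)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

f1_attr     = f1_score(y_true, y_pred, pos_label=1)   # F1 of the attributable class
f1_not_attr = f1_score(y_true, y_pred, pos_label=0)   # F1 of the non-attributable class
macro_f1    = f1_score(y_true, y_pred, average="macro")

# macro-F1 equals the unweighted average of the two class-wise F1 scores
assert abs(macro_f1 - (f1_attr + f1_not_attr) / 2) < 1e-9
print(f1_attr, f1_not_attr, macro_f1)
```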
So the benchmark detects when claims are supported by evidence and when they are not: when a claim is supported by the evidence, it is labeled attributable, and otherwise non-attributable. Some of the key takeaways from this are that the benchmarking itself was very important: the way they set the whole thing up, the way they studied the 300 error samples, and the insights they got from that.
So, just as provenance needs verifiable chains, evaluating attribution needs a very careful setup as well, and design choices also matter, because if you don't have balanced classes, you're going to see a lot of anomalies. What they did was take seven datasets, four for ID and three for OOD. ID is in-distribution, and OOD is, I don't remember the exact term, but, like,
>> out of distribution.
>> Yes, out of distribution. In-distribution means the model is trained on those four datasets, and out-of-distribution means it is tested on three datasets that were not given at all during training; they tested on a different set of data altogether. So the model does not just look at semantics or linguistics but at whether there are patterns it is able to catch, and for generalization purposes they chose not to have the training set and the test set overlap.
Then there's something known as the AIS framework, which brings us to this paper from Google, published in Computational Linguistics, a renowned journal. It asks a simple question: can generated text be verified against the sources that it provides? It's a two-step process. Step one is the interpretability test: is the text understandable on its own? Are there grammatical errors, does it sound weird when you just read it normally, how ambiguous are the references? Those are the very basic things it looks at. The second step is the attribution test: basically, can I quote something, can I say "according to the source" and make that statement? If I can't, then the attribution to the source is not accurate.
The key innovation here is something known as explicatures, a formal model of meaning in context. Basically, it handles the tricky cases where context is important. A simple example of what an explicature would be: if I ask a model "when was the iPhone 17 released?" and the system says "it was released on 19th September 2025," then the "it" the system refers to is contextualized by my question itself. That's an example of what an explicature means: the statement already carries an idea of what it's talking about. And this has been validated across tasks like conversational QA, summarization, table-to-text, and so on.
Next, missing and forged credentials. In the real world, credentials are often missing or forged. In terms of missing credentials, there are four things. First, assume untrusted by default: no chain means no verification, so if nothing is attached to the content, you cannot verify it. Second, flag it explicitly in the UI, so that users know what they're looking at and how believable or true it is. Third, lower the trust score in your ranking and retrieval system; by "lower" I mean it should not be shown at a high rank when people search for something it matches. And fourth, it's important to have clear warnings, which means transparency.
Coming to the forged or manipulated credential cases: this is the adversarial case, where attackers will try to fake credentials. So how do you defend? First, cryptographic verification: check the signatures; if the simple math doesn't work out, the content is rejected, which means it's not trustworthy. Then certificate revocation: like with SSL certificates, you need a way to block compromised signing keys. Next is anomaly detection: flag suspicious editing patterns; for example, if a piece of media has been edited 47 times in three seconds, you know that's humanly not possible, so you flag that as well. Then trust anchors: verify against known authorities, legitimate device manufacturers, and so on. And the best practice is to treat credentials as evidence, not truth, and always combine them with other signals.
So, attribution is not equal to authenticity. We should try to get this as clear as we can, because it's the foundation of everything else that follows. Attribution confirms that what the model says aligns with the source; it is basically a linguistic check: does this claim match the document that it is quoting, which is basically the AIS idea we discussed previously. Authenticity confirms that the source itself, and its entire chain, is trustworthy; this is the technical check, while attribution is the linguistic check. The gap here is that missing or forged provenance breaks the attribution link: you might have perfect linguistic alignment with a source, but if the source itself is fabricated, then the attribution is meaningless. This is why we need a unified framework where we have attribution, provenance and credibility together, and the overlap is where trusted AI evidence lives; that's where you want your AI to be.
So naturally we come to the question of how we build this: bridging to RAG with credibility signals. First is metadata enrichment: you add provenance to the embeddings and track credential status. Second is retrieval re-ranking, which I think we'll also talk about in this class: boost verified content and penalize missing credentials, just simple math. Third is generation grounding: you weight by provenance and put the trust metadata in context. The fourth is citation and transparency: just be as transparent as possible; your user should not be confused.
Yeah, this is the provenance-aware RAG architecture. It starts with ingestion, then goes on to verification and scoring, and then on the indexing side it's just indexing, retrieval and generation, so I think it mostly explains itself. The key signals are: credential present, signature valid, edit chain intact, source reputation, and checks for missing data and edits.
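A minimal sketch of the "boost verified, penalize missing credentials" idea follows; the weights and metadata field names are made up for illustration, and a real system would calibrate these signals rather than hard-code them.

```python
# Hedged sketch: combine retrieval relevance with provenance-based trust.
# Field names ("credential_present", "signature_valid", ...) are illustrative.

def trust_score(doc_meta):
    score = 0.5                                   # neutral prior for unknown provenance
    if doc_meta.get("credential_present"):
        score += 0.2
    if doc_meta.get("signature_valid"):
        score += 0.2
    if doc_meta.get("edit_chain_intact"):
        score += 0.1
    if doc_meta.get("credential_forged"):
        score = 0.0                               # broken signature: do not trust at all
    return min(score, 1.0)

def rerank(candidates, alpha=0.7):
    # candidates: list of (doc_id, relevance, metadata) from the first-stage retriever
    scored = [
        (doc_id, alpha * rel + (1 - alpha) * trust_score(meta))
        for doc_id, rel, meta in candidates
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```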
The key takeaways: provenance is the substrate for trust, and content credentials provide it, as I mentioned. Attribution evaluation remains hard: models still struggle with nuanced verification and can make mistakes. And third, treat missing or forged credentials explicitly: down-weight them or give them a penalty, and make trust transparent. The next section is by Nil, who will talk about credibility-aware RAG.
>> Okay let's uh
>> hi.
So let's just spend a minute or two thinking about what N has presented. Do we have any questions from the audience? There were two discussion catalysts posted in Slack; you have Slack in front of you, so you can look at that. The presenter and I have already answered some of the questions there.
So, credibility has been a problem in search engines from day one. One of the other problems we see is that we want certain sources to be trustable, findable and accessible, right? For example, when we do RAG, you might want to trace back to an open source like Wikipedia, or some type of repository that gives authenticity, as Diane also pointed out. But these resources are also encumbered: operating a website and making it freely accessible to all is a cost. For example arXiv, where we've read a lot of papers from, services billions of requests, millions every day. It's not cheap to send all those electrons around; that's why, if you go to Wikipedia or arXiv, they're always asking for money. And with RAG we're actually compounding that problem, because every time we ask GPT or AI mode or whatever to come up with a summary, they have to go access all those websites, pull that information, summarize it, and present it to you. And then what do we ask you to do? Go back to those same websites and do it all over again. Of course that's necessary, but you can see how this compounds the problem, because we need to check the provenance of this work. Unlike a search engine, which is not accessing those websites at query time (it has indexed them once and is not pulling them at runtime), when you make a query to a RAG system, it is most often either pulling a cached, very recent version or actually going to that website live. In fact, if you talk to many of the providers of Wikidata (I went to a meeting about that a month or two ago), they are very concerned: RAG traffic is a huge cost, and the electricity involved in serving RAG users is ten times that of a normal request. You can see that if everyone starts using ChatGPT, their costs balloon. And you can see that if we can't keep these resources available, fact-checking and credibility-aware systems become even more of a problem, because if you can't get to arXiv or Wikipedia to check, how do you know that it's right? Okay. Any comments from anyone?
>> Yes, I was very interested to hear about that function that uses blockchain for this. Is that part of the C2PA credentials, or is it just software?
>> The presenter can also address whether blockchain is part of the solution set.
>> I mean, it's just a cryptographic protocol. I was just trying to give an analogy with blockchain, to help think about it like that.
>> But it's not set up as one?
>> It doesn't exist as a blockchain; it's just the signed chain.
>> That's my question.
>> Yeah, yeah.
>> Thank you.
So yeah, blockchain is definitely a very useful technology, but it requires at least 50% of the network to be honest and distributed, right? So that's also a difficult issue, because you need enough players for a blockchain to be robust, right?
>> And getting that support worldwide is not that easy either.
>> Yeah. I mean, it's about providing a common good, right? It's sort of like water and air: if you have to pay for those resources, it's not so easy to get right, and if one person doesn't contribute, or uses them poorly, everyone suffers.
>> And there's also the environmental issue.
>> Can you elaborate more on that?
>> Yeah, because using blockchain or similar technologies just increases the amount of energy you use, even for tokenizing one small piece of the thing. So when you do it billions of times, the amount of energy is enormous. So again it comes back to the fact that LLMs already use a lot of energy, and then do you want to inject something else that's going to increase that?
>> Right. This is a very good discussion, because you can see that the technology we are creating creates more energy use, and unless we have a pairing on the other side to make more energy-efficient use of our resources, we're always ramping up our computational requirements. Ten years ago, you couldn't imagine the amount of energy being consumed to do certain basic things that we now take for granted, right?
>> Okay. Nil, are you online?
>> Yeah.
>> Okay. Start your screen share and I'll let you take over.
>> Yeah, I'll share my screen. Just a
minute.
Sorry.
And you can share and when you're ready
you can start.
>> Is my screen visible now?
>> Yes.
>> Okay.
>> Presentation mode and it's all yours.
>> Yeah. Okay.
So, okay. Hello everyone. Now I'm going to present section three of the presentation. From section two we understand that we can now verify document authenticity and track content origins. However, there's a big challenge there: not all retrieved documents are equal. By "equal" I mean that not all documents are accurate. This means that RAG breaks when the context is flawed, where "flawed" means it retrieves irrelevant, outdated or misleading text. The generator may use low-credibility snippets if they look on-topic, and the simple filters that we put in place can drop useful evidence. This is how it works: mixed-quality documents go into the system, into the retriever.
This is how a RAG pipeline works: the mixed documents go in, then it searches for the relevant chunks, then it orders them by evidence quality, and then it generates the grounded answer. With low-quality documents there's a very big impact: the LLM's performance drops by 20 to 30% because of the misinformation, and as we can see, even one low-quality document among many good ones can mislead the RAG system. So the point is that we have to treat sources and spans by credibility, and not only by relevance.
So, the credibility problem in RAG. What do high credibility and medium credibility mean? A high-credibility document can be a recent news report on the event, or peer-reviewed, high-impact literature. A medium-credibility document can be something like a general reference page from an encyclopedia, which is more generic but can still be credible. However, an outdated report, for example a report from 2018-19, which is six or seven years old, can be low credibility: something that was true at that time may have changed by now. And obviously, AI-generated misinformation is always low credibility. The big problem is that standard RAG treats all four of these document types equally, leading to confused and wrong answers. So what's a good solution? The best solution is credibility-aware RAG: weight the documents by credibility, which helps get the correct answer, and explicitly tag each document with its credibility level. Okay.
So credibility-aware RAG basically means placing trust inside the pipeline. A standard RAG works on the function LLM(query, documents), whereas a credibility-aware RAG takes the query and also considers the credibility alongside each document, LLM(query, documents, credibility), with each document tagged with its credibility. What are the dimensions of credibility? Relevance: does it answer the query? Timeliness: is it current? And the source: whether it's high-impact peer-reviewed literature or just a low-impact source, and so on. The key insight: don't hope the model figures it out; we have to tell it explicitly which documents are high credibility and which are low.
So there are three paradigms, three different paths to the same goal, three different types of methods. One is RA-RAG, which acts at retrieval and aggregation, and no training is required for it. The second is CAG, which works in the generation (training) phase and needs to be trained. And the third, CrAM, happens at inference time and also doesn't require training. Just to give a very brief real-world explanation of all three before I go technical: you can think of RA-RAG as a bouncer outside a club, checking sources at the door and weighing the trustworthy ones more, making sure they meet all the eligibility criteria, that they are above 21, have legal ID, and so on. CAG is like teaching a student to recognize which sources are reliable and which are unreliable; as we are teaching it, we have to train it. And CrAM can be thought of as a pair of glasses: we can adjust the glasses to blur out the distractions and focus on what's clear.
All three integrate credibility in different ways to achieve trustworthy generation, and they all deliver double-digit gains over standard RAG in noisy conditions. Explaining each of them technically: first, reliability-aware retrieval-augmented generation (RA-RAG). Technically speaking, it estimates per-source reliability from cross-source agreement and uses it to filter, re-rank and aggregate evidence via weighted voting. Breaking down its steps: step one is to estimate source reliability offline, once. Cross-check around 200 fact-checking queries across sources and see which sources agree with the consensus; no manual fact-checking is needed. Then, at query time, retrieve per source: get the top-k documents from each source separately (it can be top 5 or top 10, depending on the data and what we're doing with it), and then select the most reliable and relevant sources, only consulting the top few sources that have relevant information. If that number is four out of, say, a thousand sources, it means a token reduction of 99% or more. And then aggregate with a weighted majority: the answer is the argmax over candidate answers of the sum of the reliabilities of the sources voting for that answer.
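Here is a minimal sketch of that weighted-majority step; the source names and reliability values below are illustrative, not taken from the paper.

```python
from collections import defaultdict

def weighted_majority(votes, reliability):
    """votes: {source: answer}; reliability: {source: weight in [0, 1]}."""
    tally = defaultdict(float)
    for source, answer in votes.items():
        tally[answer] += reliability.get(source, 0.0)   # sum of reliabilities per answer
    return max(tally, key=tally.get)                    # argmax_a sum_s r_s * 1[vote_s = a]

# Toy example mirroring the slide (values are illustrative):
votes = {"CNN": "SARS-CoV-2", "ConspiracyBlog": "5G",
         "MayoClinic": "SARS-CoV-2", "Wikipedia": "SARS-CoV-2"}
reliability = {"CNN": 0.8, "ConspiracyBlog": 0.1, "MayoClinic": 0.9, "Wikipedia": 0.7}
print(weighted_majority(votes, reliability))            # -> "SARS-CoV-2"
```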
To give a simple example, as in the 2024 paper: the fact-checking query was "what causes COVID-19?" The sources were: source A, CNN, which says the SARS-CoV-2 virus; source B, a conspiracy blog, which said 5G networks; source C, Mayo Clinic, which says the SARS-CoV-2 virus; and Wikipedia, which also said the SARS-CoV-2 virus. For the reliability estimation, sources agreeing with the consensus are weighted as high reliability; we can see that three of the sources (CNN, Mayo Clinic and Wikipedia) said SARS-CoV-2, so the weighted vote goes with that answer. There was another example in the paper, at query time, about the current unemployment rate with weighted voting, but this is the main example of why it returns SARS-CoV-2: because the majority of the reliable sources said so. So now, moving on to the second one: credibility-aware generation.
Technically speaking, it means fine-tuning large language models on data where documents are labeled with credibility, plus prompting the model to generate the reasoning. It has a two-step data creation process. First, multi-granularity credibility annotation: use a retriever to score relevance, classify documents into high, medium and low credibility, and apply this at both the document and the sentence level. Second, credibility-guided explanation generation: prompt GPT-3.5 to generate explanations that analyze credibility, as was done in the Pan et al. paper. The example goes: the high-credibility Reuters article states X, the low-credibility blog states Y; considering credibility, the answer is X. That's the form the answer should take.
To give more detail on the example, here is the CAG training data example. Input to the model: "Who won the first Nobel Prize in Physics?" The high-credibility document, as expected, says that Wilhelm Röntgen won the first Nobel Prize in Physics in 1901, and it comes from the Nobel Prize official website, which we have given as input. The low-credibility document says that Albert Einstein won the first Nobel Prize in Physics, say from some person who just put up a blog post. What would the output from the trained model be? "The high-credibility document from the Nobel Prize official website clearly states that Wilhelm Röntgen won in 1901. This is authoritative, as it is the official record. The low-credibility document claims Einstein, but this contradicts the official source. Einstein did win a Nobel Prize, but not the first one." So the answer comes out like this; the CAG model gives you an output in this format, as we can see.
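For concreteness, a single CAG-style training instance could look like the sketch below. This is an illustrative format only, not the exact schema used in the paper.

```python
# Illustrative CAG fine-tuning example: documents carry explicit credibility tags
# and the target output reasons over them before answering.
example = {
    "question": "Who won the first Nobel Prize in Physics?",
    "documents": [
        {"credibility": "high",
         "source": "Nobel Prize official website",
         "text": "Wilhelm Roentgen was awarded the first Nobel Prize in Physics in 1901."},
        {"credibility": "low",
         "source": "personal blog",
         "text": "Albert Einstein won the first Nobel Prize in Physics."},
    ],
    "target": (
        "The high-credibility document from the Nobel Prize official website states that "
        "Wilhelm Roentgen won in 1901; this is the authoritative record. The low-credibility "
        "document claims Einstein, which contradicts the official source. "
        "Answer: Wilhelm Roentgen."
    ),
}
```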
Okay. So, the third and last one: credibility-aware attention modification (CrAM). It adjusts attention at inference time by down-weighting the tokens from low-credibility spans, using calibrated scaling of selected attention heads, as we can see in the architecture figure from Deng et al. It's again a two-phase process. Phase one identifies the "gullible" heads: use about 100 calibration examples with misinformation, run extended causal tracing to see which heads are most affected, and select the top 100 to 300 influential heads. Phase two modifies attention at inference time: for low-credibility documents, scale down the attention; for high-credibility documents, keep the attention normal. Just like what I said about the glasses blurring out the distractions: don't be attentive to the low-credibility documents. Only the influential heads identified in phase one are modified, and since we are not injecting new parameters or teaching the model anything, there's no training required; it's a plug-and-play method.
So, how does CrAM's attention modification work in an example? Query: "Who won the Nobel Prize in Physics?" Document A, with credibility 0.8, says Wilhelm Röntgen won in 1901; document B, with credibility 0.1, says Albert Einstein won the Nobel Prize. Under standard RAG, the Röntgen tokens get about 30% of the attention each and the Einstein tokens about 20% each. CrAM scales the attention according to its formula and then renormalizes, and after renormalization the model focuses about 92% of its attention on the credible document. This is how it works.
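A minimal NumPy sketch of the scale-and-renormalize step on one hypothetical attention head is shown below; the real method applies this only to the selected influential heads inside the transformer.

```python
import numpy as np

def cram_scale(attention, credibility):
    """attention: original attention weights over context tokens (sums to 1);
    credibility: per-token credibility of the document each token comes from."""
    scaled = attention * credibility          # down-weight low-credibility spans
    return scaled / scaled.sum()              # renormalize so weights sum to 1 again

# Toy head: two tokens from a 0.8-credibility doc, two from a 0.1-credibility doc.
attention   = np.array([0.3, 0.3, 0.2, 0.2])
credibility = np.array([0.8, 0.8, 0.1, 0.1])
new_attention = cram_scale(attention, credibility)
# About 0.92 of the attention mass now sits on the credible document,
# matching the 92% figure in the slide's example.
print(new_attention.round(3), new_attention[:2].sum())
```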
Okay. So, which method should we choose? How do we decide? First of all, we know there's a need for credibility-aware RAG when we are not sure whether the documents are credible. The first thing to check is whether we have identifiable sources, sources like CNN, the BBC, or specific Twitter accounts. If yes, then the best thing to use is RA-RAG, because it has the best explainability and it scales efficiently. If we don't have identifiable sources, then we check whether we have training data: do we have examples, can we train the model? If yes, then use CAG, because it's one of the best overall, as we'll see later, and it generalizes pretty well. However, if we don't have training data either, then we have to go for the CrAM model; the good thing is that it gives you immediate results and, as we discussed before, requires no training. However, it won't be the most accurate, as we can see in the paper.
So here is the performance comparison of all three methods, from the papers I mentioned before. RA-RAG, with its multi-source strength: on Natural Questions it had an accuracy of 73.7% versus 63.4% for standard RAG, and on TriviaQA it had an accuracy of about 91.3% compared to only 81.2% for the standard baseline. For real-world validation, it had a correlation of about 0.99 with political fact-checking ratings, and they gave a full example of that. For scalability, it achieves a 99% token reduction with the top-k reliable-source selection. For the second method, CAG: on HotpotQA it had about 50.9% accuracy, which was an increase of about 82% over the base RAG model; for time-sensitive QA it had around 91% higher accuracy than the standard baseline RAG; for misinformation it increased by about 147% (it was still only 44.2%, but the relative increase was large); and for noise robustness it had 89% accuracy at 80% noise, versus about 77.3% for the baseline. The third one, CrAM, which needs no training but has little explainability: its accuracy on Natural Questions was not very good compared to RA-RAG, around 33.6%, while on TriviaQA it was around 59.9%, where it still beats CAG by a margin. However, in the adversarial setting it had accuracies of 91.3% and up to 72.2%, which was roughly equivalent to the Oracle performance.
So, finally, the summary comparison; we have discussed all of these points before. RA-RAG doesn't require training but needs source IDs; its performance is very high, its explainability is very high, and its setup speed is medium. CAG requires training but doesn't require source IDs; its performance is high and its explainability is high, but the setup is slow, because we have to train it and then test it. And the CrAM model doesn't require training and doesn't require source IDs; it doesn't have very good performance, only moderate to high, and it has low explainability, but it's very quick.
So, the limitations and open challenges. All methods depend on the accuracy of the credibility assessment; this is the garbage-in, garbage-out limitation: if the credibility labels are wrong, all the methods will fail. Each method also has some specific limitations. For RA-RAG, source reliability can drift over time, so periodic re-estimation is needed. CAG's performance depends on label quality and explanation coverage. For CrAM, the attention-head selection is task specific, and recalibration is needed for new domains, because we are not actually training it, so we have to recalibrate every time. The biggest fundamental problem: how do we accurately assess credibility in the first place? Misinformation evolves constantly, sources change in quality over time, and credibility is context dependent. But even with perfect credibility, what if all the credible sources share the same blind spots? We'll come to that in the next section, on pluralism and bias-aware source selection. So, thank you.
>> Okay. Any comments?
So, one thing, and I'll be very quick because I know we have a lot of sections: the time dynamic is something we haven't talked about a lot, but it definitely interacts with credibility. For things that are factual and that we know are time-persistent and somewhat static, we can do more credibility-aware verification. This goes back to the overview of the lecture: when we know things are temporally sensitive, maybe we cannot assume that credibility is that strong a requirement.
Okay. So Zahang is already online. Zahang, when you're ready, you can share part four. After part four we'll take a short break; I know it's already after seven. Whenever you're ready.
Yeah. Can you hear me?
>> Yes, we can hear you fine. Thank you.
>> Yeah. And see the screen?
>> Yes.
>> Okay. Just an apology in advance: I've got a cough today, so there may be some interruptions during the presentation.
So, in this section we will talk about pluralism and bias-aware source selection, and here is the overview of this section. We will focus on three parts: first, why credible rankings can still lack diversity or fairness; then, what kind of ranking behavior we actually want; and after that, how this can be implemented as a simple plug-in module in the pipeline, ending with some takeaways. Let's start with a very simple illustration of why credible is not the same as diverse or fair. In the picture on the left, I asked GPT-5 a factual question: in which year was Michael Jordan born? It does the right thing in terms of credibility: it cites Britannica, a wiki, and maybe some major news outlets, all high-credibility sources, which means we have filtered out the low-quality stuff. But in the picture on the right, we can see the Wikipedia disambiguation page for Michael Jordan. There isn't just one credible Michael Jordan: there is the basketball legend, but also a footballer, an actor, a researcher, and so on, all of them backed by factual sources. So even inside the credible pool, exposure can still be skewed: almost all of the rankings and all of the citations go to the basketball Michael Jordan and to a tiny set of dominant sites.
On this slide we see the differences in mathematical form. First, given the assumption that for every document d we already have a credibility or relevance score u_d, classical ranking just says: pick a permutation pi and maximize the utility, where v_k is the position bias, the user attention at rank k. If we sort documents by u_d, we maximize this utility. So this gives us a credible ranking, but it says nothing about diversity or fairness.
So far there have been many studies on search-result diversification, and you can click the link below to see a review. Here we use intra-list average distance (ILAD) as an example: it only looks at the pairwise distance between items in the list to measure diversity (you can see the formula), and the most traditional and widely adopted way of defining the pairwise distance is to use cosine similarity. But neither of these two formulas involves the credibility score u_d.
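A small sketch of intra-list average distance over document embeddings, where cosine distance is one minus cosine similarity and the embeddings here are random placeholders:

```python
import numpy as np

def ilad(embeddings):
    """Intra-list average distance: mean pairwise (1 - cosine similarity)."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T                                   # pairwise cosine similarities
    n = len(X)
    iu = np.triu_indices(n, k=1)                    # each unordered pair counted once
    return float(np.mean(1.0 - sim[iu]))

rng = np.random.default_rng(0)
ranked_list = rng.normal(size=(5, 32))              # 5 placeholder document embeddings
print(ilad(ranked_list))                            # higher value = more diverse list
```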
Then fair exposure is also a separate issue. We model the ranking as a probability matrix P, and the exposure of a document is the sum over positions k from 1 to K of P_{d,k} multiplied by v_k. So if we get the ranking by maximizing the utility alone, it may still produce very large, unfair gaps in exposure between groups.
Now let's talk about what kind of behavior we actually want from the ranking. First, think about pluralism at the list level. Here we reuse the ILAD signal: it is basically the average distance between all pairs of items in the list, so a higher ILAD means the items are less similar to each other. But pluralism is not only about pairwise distance; it is also about coverage over viewpoints or subtopics. Imagine there is a global set of stances or subtopics called S, and each document covers a subset S_d. Then we define the coverage of a list L as the number of different stances from S that are covered by the documents in L, divided by the total number of stances. A higher value simply means more stances or sources are represented.
And finally, at the bottom, the combined objective: when we pick a top-k list L, we don't only maximize the sum of u_d; we also add these two terms with weights lambda_1 and lambda_2, which trade off purely credible rankings against more pluralistic ones.
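Putting the pieces just described into formulas (a reconstruction from the verbal description; the exact weighting on the slide may differ):

```latex
% Utility of a ranking, diversity, coverage, and the combined top-k objective
U(\pi) = \sum_{k} v_k \, u_{\pi(k)}
\qquad
\mathrm{ILAD}(L) = \frac{1}{\binom{|L|}{2}} \sum_{\{i,j\} \subseteq L} \bigl(1 - \cos(e_i, e_j)\bigr)
\qquad
\mathrm{Cov}(L) = \frac{\bigl|\bigcup_{d \in L} S_d\bigr|}{|S|}

\max_{L:\,|L|=k} \;\; \sum_{d \in L} u_d \;+\; \lambda_1 \,\mathrm{ILAD}(L) \;+\; \lambda_2 \,\mathrm{Cov}(L)
```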
After addressing the issue of diversity, let's switch to fair exposure, which is about who actually gets seen in the ranking. First, we incorporate probability into the utility: instead of a fixed ranking, U can be written in matrix form as v-transpose times P times u, and we also require P to be doubly stochastic (every row and every column sums to one), so it really behaves like a soft version of a ranking. This form lets us link it to the exposure of a single document, shown here. Then, for a group G, we just average over its members to get the exposure and the utility of G.
With those quantities, we can write down fairness constraints. One example in the paper is called demographic parity: make the exposure of group one equal to the exposure of group two. Another is disparate treatment, which says that the ratio of exposures between groups should match the ratio of their utilities; in other words, exposure should be proportional to how useful or credible the group is.
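Collecting the exposure quantities and the two constraints just mentioned, with notation reconstructed from the verbal description in the style of the fairness-of-exposure literature:

```latex
% Exposure under a doubly stochastic ranking matrix P, and two fairness constraints
\mathrm{Exp}(d \mid P) = \sum_{k} P_{d,k} \, v_k,
\qquad
\mathrm{Exp}(G \mid P) = \frac{1}{|G|} \sum_{d \in G} \mathrm{Exp}(d \mid P),
\qquad
U(G) = \frac{1}{|G|} \sum_{d \in G} u_d

\text{Demographic parity: } \mathrm{Exp}(G_1 \mid P) = \mathrm{Exp}(G_2 \mid P)
\qquad
\text{Disparate treatment: } \frac{\mathrm{Exp}(G_1 \mid P)}{U(G_1)} = \frac{\mathrm{Exp}(G_2 \mid P)}{U(G_2)}
```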
So, in summary, this framework lets us control how exposure is distributed across groups while keeping the same underlying scores.
Next, let's move on to the "how" and show how this graph-based adaptive re-ranking works as a plug-in. We start from a very standard two-stage retrieval pipeline: stage one is a fast retriever that gives us a large candidate set I, and stage two is a slower ranker that scores a smaller set and produces the final ranking. In our case, we can treat credibility filtering as part of stage one, so by the time we enter this box, all documents are already in the credible pool. Inside the box, the extra structure is a corpus graph G, built offline: in the graph, each node is a document, and for each document we store its k most similar neighbors.
Online, we run the GAR loop with a frontier F. Step one is to take a batch of documents from the initial set I, score them with the re-ranker S, and append them to the final ranking R. Step two is, for every scored document, to look at its neighbors in the corpus graph and push them into the frontier with some initial pre-score. Step three is to take a batch from the frontier, score it with the same ranker S, append it to R, and again push their neighbors back into the frontier. The loop repeats until the scoring budget is hit.
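A compact sketch of that adaptive re-ranking loop is given below; the data structures and scoring calls are placeholders rather than the actual implementation.

```python
import heapq

def gar_rerank(initial, neighbours, score, budget=100, batch_size=16):
    """Hedged sketch of a graph-based adaptive re-ranking loop.
    initial:    candidate doc ids from the fast stage-one retriever
    neighbours: offline corpus graph, doc_id -> its k most similar doc_ids
    score:      the expensive stage-two re-ranker, doc_id -> float
    """
    ranking, seen, frontier = [], set(), []      # frontier: max-heap of (-prescore, doc)
    pool = list(initial)
    take_initial = True                          # alternate between pool and frontier

    while (pool or frontier) and len(seen) < budget:
        if (take_initial and pool) or not frontier:
            batch, pool = pool[:batch_size], pool[batch_size:]
        else:
            batch = [heapq.heappop(frontier)[1]
                     for _ in range(min(batch_size, len(frontier)))]
        take_initial = not take_initial

        for d in batch:
            if d in seen:
                continue
            seen.add(d)
            s = score(d)                         # spend scoring budget here
            ranking.append((s, d))
            for nb in neighbours.get(d, []):     # neighbours of good docs get a chance
                if nb not in seen:
                    heapq.heappush(frontier, (-s, nb))

    ranking.sort(reverse=True)
    return [d for _, d in ranking]
```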
These slides show how the policies actually steer the graph and the frontier. First, every document is tagged with some group label, such as source, region, or stance. The first set are pre-retrieval policies, which control how we build the neighbor set Z. Policy one is diverse group linking: keep only neighbors from other groups, so each node's neighbors already point across groups. Policy two is a balanced-neighbors quota, so each group appears equally often in Z. The second set are in-process policies, which control how we pick documents from the frontier or the initial set. Policy three is a frontier filter: when we add neighbors into the frontier, we skip same-group neighbors, so the frontier also remains cross-group. Policy four tries to keep equal group proportions whenever we build the candidate pool or a batch. Policy five is group-wise top scoring, so every group gets a fair share of the batch. And policy six is, within each group, to prefer neighbors discovered earlier in the loop and then order by score, because those are more likely to be relevant. Finally, when these policies, especially five and six, choose their per-group top documents, we can control pluralism by adding a diversity score, like a delta-diversity term. So at selection time we are respecting group balance, keeping credibility, and pushing the list towards being more diverse, all at the same time.
So, here are some quick takeaways and
that's all for me today. Thank you.
>> So, okay, we'll take a three minute
break uh and then we'll go on to our
next segment. So, uh if you need the
bathroom or anything, come back at uh
7:20.
Okay.
So, J Chang, are you online? J, I think, is on the next segment. I didn't see...
>> I wonder how we can give the PDF to the...
>> Uh, you can mail it to them. I think there are instructions in the information that presenters were given, but I'll double-check that.
>> Yeah,
>> it looks like if you were to go there and ask them, they'll have questions.
>> Yeah, they will ask you to go to, I think, support, and cc... let me double-check this. Oh, is it the assist address?
>> I think you can actually go here and then ask them, put it in the subject line, and then attach your file. But I'm going to double-check that that's how they want it. So I think, if you got the instructions from our STePS coordinators, they should have attached a PDF file giving you instructions on how to do the poster printing, but I'll just double-check.
>> Okay. Yeah, no, you don't have to be SoC staff; you just need to be registered as a poster presenter in STePS. Yeah.
So J, maybe you're ready.
>> Yeah, you can come up and present. I will go find out the answer to J's questions about posters.
But all the way on the right.
All the way on the right.
Right. Hello everyone. I'm J, and today I will talk about factuality and also attribution evaluation. I will first begin with the problem of backfilled citations. So what is a backfilled citation? It is a method of attribution where the model cites some sources, but these citations are added after the generation process is complete.
Maybe you can imagine the situation where a language model is answering a question about a famous person: it first directly generates the whole biography of that person, and only then does it begin to search the internet and randomly add some relevant or irrelevant citations to all of the answers it has already generated. This is different from attribution during generation, which is another mode of attribution where the language model first generates each sentence, then gets the attribution and verifies it. The difference between these two kinds of citation is that the first one requires very low resources and is very fast, whereas the second one is very costly.
So maybe some large models don't want to spend that much, and they just generate the answer based on their internal knowledge first and then add some relevant or irrelevant citations. The drawbacks of backfilled citations are, first, that the model did not actually use the sources, which may cause errors in factuality, and second, misattribution of the truth: there are cases where, although the model generates the right answer or the right facts, its citations are irrelevant, or the cited sources do not verify what it answered. So, with this problem of backfilled citations in mind, we will try to detect it and also try to avoid it. In this situation, factuality is very important, because if factual accuracy can't be satisfied in the model's generation, do we really need to care about the citations or the attributions? Actually, I think we don't, because if there are factual problems in the answer, then even if the citations are relevant or the citations can verify the answer, it doesn't matter, and it doesn't make much sense.
sense. So the next question is if the
models can answers something that is
that does not have many factor problems.
But this times the attribution
measurements is important here because
we need to verify whether the generating
statement is fully supported by the
cited reference because we maybe we need
to use this citations for some further
studies. So I will first talk about the
factorialities of of the long form text
generation. uh I I have to mention that
um this speciality u doesn't have
relations to the uh attribution variable
it just and verify whether the L models
generated generating generations has
factory problems or doesn't have a
doesn't have a factory problems. So
There are several well-known, classic methods for this, and the first one I will talk about is FActScore. FActScore is a new evaluation of the factuality of long-form text generation. Look at this picture: the model is asked to write a biography of a famous person, and ChatGPT generates a long-form text containing many facts about that person. Some of these facts are right and some are wrong according to Wikipedia, but earlier factuality evaluations would simply give zero points to both answers, because they judge the answer as a whole. The idea of this paper is to break the generation into a series of atomic facts and compute the percentage of atomic facts supported by a reliable knowledge source, which in this paper is Wikipedia. The good news is that this can distinguish between different answers: for example, the first answer gets a higher FActScore because it contains more correct atomic facts.
This is the definition of FActScore, which is easy to understand. Now the problem is how to get the atomic facts from a very long text, and the answer in the paper is that they use a language model to generate the atomic facts, essentially sentence by sentence. There are many sentences in the generation; they split it up and ask a model to produce the atomic facts for each one. A single sentence can usually be divided into many atomic facts, and these atomic facts are then collected and evaluated one by one.
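To make the definition concrete, here is a sketch of the score as I understand it from the FActScore setup (the exact notation in the paper may differ): the score of a generation is the fraction of its atomic facts that the knowledge source supports.

```latex
% FActScore of a generation y with atomic facts A_y, judged against a
% knowledge source C (Wikipedia in the paper):
f(y) \;=\; \frac{1}{|\mathcal{A}_y|} \sum_{a \in \mathcal{A}_y} \mathbb{I}\!\left[\, a \text{ is supported by } \mathcal{C} \,\right]
```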
They also found that using human evaluation to verify these atomic facts is very costly and time-consuming.
So they propose methods for automatic score estimation, and there are four methods used here. The first is a no-context language model, which directly judges whether an atomic fact is right or wrong. The second is retrieval plus a language model: it retrieves the top relevant passages from Wikipedia and asks the model to judge whether the fact is right or wrong. The third is nonparametric probability; the details are in the reference paper at the bottom. It can be seen as a purely retrieval-based method: it masks each token in the atomic fact, uses retrieval to score the masked tokens, and aggregates the results over every masked token to judge whether the fact is supported. The last is an ensemble method, which combines the retrieval-plus-LM method with the nonparametric probability to give a final answer. The results are shown there, but they are not the most important part, I think. The paper also finds that roughly 30% or more of both the supported and the unsupported sentences carry citations, which suggests that most citations have little impact on factuality. This is exactly what we call the backfilled citation problem: whether or not a sentence has citations, the rate of correct statements is about the same.
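As a rough illustration of the retrieve-then-judge estimator described above, here is a minimal sketch. The helpers `retrieve_passages` and `llm_judge` are hypothetical stand-ins for a retriever over the knowledge source and an LLM prompted for a True/False verdict; this is not the paper's actual implementation.

```python
# Sketch of a "retrieve + LM" atomic-fact checker: for each atomic fact,
# fetch relevant passages from the knowledge source and ask an LLM whether
# the fact is supported. The final score is the fraction of supported facts.

def estimate_factscore(atomic_facts, retrieve_passages, llm_judge, top_k=5):
    """Fraction of atomic facts judged as supported by retrieved evidence."""
    if not atomic_facts:
        return 0.0
    supported = 0
    for fact in atomic_facts:
        passages = retrieve_passages(fact, top_k=top_k)   # hypothetical retriever call
        prompt = (
            "Evidence:\n" + "\n".join(passages) +
            "\n\nIs the following statement supported by the evidence? "
            "Answer True or False.\nStatement: " + fact
        )
        verdict = llm_judge(prompt)                        # expected to return "True" or "False"
        supported += int(verdict.strip().lower().startswith("true"))
    return supported / len(atomic_facts)
```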
The next paper is also about the factuality of long-form text generation; it's called VeriScore. The key idea is that, in the previous study I mentioned, FActScore, not all extracted claims are actually verifiable. We can see this on the left side of the picture: claim two and claim six depend on context, and some of the claims are simply not verifiable. For example, claim six carries no information about which entity it refers to, so these are not really what we would call facts. So this paper points out that not all claims are verifiable, and it also finds that facts may depend on context. There are many ambiguous references, for example the pronoun in the sentence "its large and starchy, sweet-tasting tuberous roots": when the previous per-sentence method extracts facts, it keeps the pronoun but doesn't know what "its" refers to. So the atomic fact generation in FActScore is not ideal, because it ignores the context of the generation. In this method, they add prompting that asks the LLM to generate the atomic facts while considering context beyond the current sentence, so that pronouns can be resolved to their real entities, which produces better atomic facts. Also, for the score calculation, the researchers found that if a model generates only a few facts, or even no facts at all, it can still get a very high precision score. So they added a recall score that penalizes this situation, and the recall is simple: K is roughly the number of facts a model is expected to generate (based on how many facts models typically produce), and S(r) is the number of facts that are verified as correct.
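Roughly, the precision and length-aware recall described above can be written as follows; this is a sketch based on the description in the talk, and the paper's exact choice of K and edge cases may differ.

```latex
% S(r): number of supported claims in response r; C(r): all extracted claims;
% K: a reference number of claims (e.g., what models typically produce).
\mathrm{Prec}(r) = \frac{S(r)}{|C(r)|}, \qquad
\mathrm{Rec}_K(r) = \min\!\left(\frac{S(r)}{K},\, 1\right), \qquad
F_1@K = \frac{2\,\mathrm{Prec}(r)\,\mathrm{Rec}_K(r)}{\mathrm{Prec}(r) + \mathrm{Rec}_K(r)}
```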
Next I will talk about automatic attribution measurement. In the second part of our lecture today, a benchmark called AttributionBench was mentioned, which is relevant to what we're discussing here. The issue is that manually verifying whether a generated statement is fully supported by the cited reference is costly and time-consuming, so they use large language models to do the verification. Different from the previous paper, they propose a classification frame with three labels: the first is attributable, which means the reference fully supports the generated answer; the second is extrapolatory, which means the reference lacks sufficient information to validate the generated answer; and the third is contradictory, which means the generated answer contradicts the information presented in the reference. Let's look at some examples.
There are three categories among them. The first example is attributable: the model answers the question and gives a reference for the answer, and when we open the reference, which is a website, we find that the facts are there and we can attribute our answer to it. The second example has problems, because the citation doesn't actually answer the question; it just offers something loosely relevant to the model's answer. The third one is the contradiction case, where there are contradictions between the generated answer and its citations. This paper gives two methods to address the problem: the first is to prompt models with a clear evaluation instruction to judge whether the citation really supports the answer, and the second is to fine-tune some large models on a set of diverse, repurposed datasets.
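For illustration, here is a minimal sketch of what a prompt-based check with the three labels above might look like. The `llm` callable and the instruction wording are assumptions, not the paper's actual prompt.

```python
# Sketch of a prompt-based attribution check with the three labels discussed
# above (attributable / extrapolatory / contradictory). `llm` is a hypothetical
# text-completion callable.

LABELS = ("attributable", "extrapolatory", "contradictory")

def classify_attribution(question, answer, reference, llm):
    prompt = (
        "You will verify whether a reference supports a generated answer.\n"
        "Reply with exactly one label:\n"
        "- attributable: the reference fully supports the answer\n"
        "- extrapolatory: the reference lacks enough information to validate the answer\n"
        "- contradictory: the reference contradicts the answer\n\n"
        f"Question: {question}\nAnswer: {answer}\nReference: {reference}\nLabel:"
    )
    label = llm(prompt).strip().lower()
    return label if label in LABELS else "extrapolatory"  # conservative fallback
```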
They test their methods on automatic attribution measurement and find some misclassified examples that are very interesting. The most common problem is that when the LLM is asked to judge whether a citation really verifies an answer, the citation can be very long and contain a lot of nuanced or fine-grained information that the model fails to detect, so it misclassifies the example. The second problem is the language model misunderstanding the task definition and the logical relations implied by the labels. The third problem is the model failing on symbolic operators: for example, there are some math operations or calculations that the model cannot handle, so it judges them wrongly. These findings complement the AttributionBench paper I mentioned, as I've shown here.
Okay, that's all for my part. Do you have any questions?
Okay, let's check on our next speaker.
>> Okay, let's go directly to Eric then.
Yeah thanks.
>> I did want to say one small thing about what J Chong presented. You'll notice that a lot of the research we do at NUS is connected to the Centre for Trusted Internet and Community, and our coordinator for today's lecture is working very heavily on that. Some of what was presented was about attributable, extrapolatory, and, the last one,
>> contradictory. These are very similar to what we do in fact-checking research. In fact-checking, you want to be able to say that a source verifies a fact, or whether there's enough information to conclude it or not. In certain cases we have that third category, extrapolatory, which other datasets call "not enough information" or NEI. And then there's the refutes category, which says there is information that contradicts the claim. So you may see two different sets of terminology from what was presented, but if you also read fact-checking papers, those three categories are also valuable.
Okay take it away. Thank you so much.
>> Okay, so I'll be talking about answer calibration and abstention here. This paper talks about how, traditionally, white-box methods are used to determine the confidence of models: typically you look inside the model internals at the probability of the next token being generated. However, it states that there are several problems with this. One is that many models are closed source, so you can't access the model internals. Another is that a high probability for the next token does not mean actual confidence: for example, in the sentence "chocolate milk comes from brown cows", each next word might make sense after the previous one, but the sentence is not semantically accurate. So this paper looks at black-box methods. It involves several stages: prompting the model to ask for confidence scores, sampling it, and using various aggregation strategies to come up with a final confidence score.
Yeah. Some of the prompting strategies include a vanilla strategy, which just asks the model for its confidence score. Others worth noting include self-probing, which is inspired by the fact that if you ask a human to evaluate someone else's answer, they generally provide a more accurate evaluation; here it involves asking a question in one chat, then starting a new session and asking the LLM to give a confidence score for that answer. Other prompting strategies include multi-step, which breaks the problem down into K steps, gets a confidence at each step, and then aggregates them into one confidence score, and top-k, which asks the model to provide multiple guesses; this at least makes the model aware that multiple different answers are possible and induces it to produce a more honest confidence. So yeah,
before the evaluation, let's talk about the aggregation strategies. One aggregation strategy is consistency: you have an answer, you ask the question again and again, and you see how many of the sampled answers are the same as the given answer. Another is average confidence, which is similar to consistency but uses the actual confidence generated for each agreeing answer. And then there is the pairwise approach, which uses the ranking among the sampled answers to create a distribution over answers.
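To make the two simplest aggregation strategies concrete, here is a toy sketch, not the paper's exact implementation. Each sample is assumed to be an (answer, verbalized confidence) pair obtained by re-asking the model; the sampling itself happens elsewhere.

```python
# Toy sketch of two aggregation strategies: consistency counts agreement with
# the main answer, average confidence additionally weights agreeing samples by
# the confidence the model verbalized for them.

def consistency_confidence(main_answer, samples):
    """Fraction of sampled answers that agree with the main answer."""
    if not samples:
        return 0.0
    agree = sum(1 for ans, _ in samples if ans == main_answer)
    return agree / len(samples)

def average_confidence(main_answer, samples):
    """Like consistency, but agreeing samples contribute their own stated confidence."""
    if not samples:
        return 0.0
    return sum(conf for ans, conf in samples if ans == main_answer) / len(samples)

# Usage: average_confidence("Paris", [("Paris", 0.9), ("Lyon", 0.4), ("Paris", 0.7)])
```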
So, in terms of evaluation, the paper uses two main metrics. One is calibration, which evaluates how well the confidence level aligns with correctness: if a model says it's 80% confident in its answer, then it should produce the correct answer about 80% of the time. The other is failure prediction, which says that correct answers should be given higher confidence than wrong ones, so the confidence is able to differentiate between correct and wrong predictions. Some of the findings of the paper are that models tend to be overconfident, and that the different prompting strategies did reduce overconfidence compared with the vanilla prompting strategy.
Some other findings include that having more samples does improve failure prediction, and that the pairwise method used here had the best calibration, while average confidence had the best failure prediction, probably because average confidence uses the weighted confidences to determine the confidence of the answer.
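One standard way to put a number on the calibration idea above is Expected Calibration Error (ECE); the paper may use a different or additional metric, so treat this only as a minimal sketch of the general concept.

```python
import numpy as np

# Minimal ECE sketch: bin predictions by stated confidence and compare each
# bin's average confidence to its empirical accuracy, weighted by bin size.

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)          # 1.0 if the answer was correct, else 0.0
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        mask = (confidences >= lo) & (confidences <= hi) if i == 0 \
            else (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return ece
```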
Then I'll talk about another paper, which is about long-form generation. In the previous paper, many of the equations treat correctness as binary, roughly an indicator that the generated answer y matches the gold answer, so every answer is either correct or wrong. But if a confidence score says "I'm 80% confident" for long-form generation, it's ambiguous: it could mean "I'm 80% confident that the answer is 100% correct" or "I'm 100% confident that the answer is 80% correct". What you want for long-form generation is to know that the model is x% confident that the answer is y% correct. So basically you have a distribution, and you want to check it against the actual correctness distribution.
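Here is a toy illustration of that distinction, assuming we already have repeated self-evaluation scores in [0, 1]; it is only a sketch of the general idea, not the metric used in the paper.

```python
import numpy as np

# Instead of a single probability that the answer is fully correct, keep a
# distribution over "fraction correct" and compare it to the observed fraction.

def correctness_distribution(self_eval_scores, n_bins=10):
    """Empirical distribution over correctness fractions from repeated self-evaluation."""
    hist, _ = np.histogram(self_eval_scores, bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def distribution_gap(predicted_dist, actual_fraction, n_bins=10):
    """Earth-mover-style distance between the predicted distribution and the observed fraction."""
    actual_dist = np.zeros(n_bins)
    actual_dist[min(int(actual_fraction * n_bins), n_bins - 1)] = 1.0
    return np.abs(np.cumsum(predicted_dist) - np.cumsum(actual_dist)).sum() / n_bins
```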
How they elicit confidence is somewhat similar. They use a method called continuous self-evaluation, which involves evaluating the same answer over and over again and building a probability distribution from those evaluations. Another method they use is self-consistency, which involves generating answers repeatedly, comparing how similar they are, and producing a confidence score based on that. These are the various methods they use to compare similarity, which include, for example, matching based on claims or named entities.
They also use calibration, similar to the last paper; the difference is just that instead of the probability that y is fully correct, you now have the probability that y is correct to some level s. So that's all there is. That's all.
Right. Well, we discussed calibration a little bit in the earlier part. So, does everyone understand what calibration means here?
Can I invite somebody to give it a shot? What does confidence calibration, or error calibration, or things of this sort mean?
It should mean that the model is well aligned with its verbalized confidence: if the model says "I'm 80% confident", we would expect it to actually be correct for about 80% of such samples.
>> Right, okay. So a good domain that's well calibrated is something like weather forecasting, because when we do weather forecasting, we can look over, say, a hundred days where the weather was exactly like this at this hour and see how accurate the forecast was two hours later. Based on that percentage, we can say that 80% of the time it did or didn't rain. Those numbers are based on forecasting and then checking. But humans are not well calibrated, and that means large language models aren't either, because they're trained on human data; when they're asked to give percentages, it's basically a next-token prediction task. So unless you have trained and calibrated the model, those numbers would not necessarily be well calibrated, and those are big problems. What type of percentages do you think humans are bad at being calibrated about?
>> Erin, maybe you can say a bit too.
>> Sorry?
>> Well, what percentages do you think people are bad at determining? Do you think people really don't understand, say, 100% correct, 50% correct, 70% correct, 10% correct? What are some values that you think might be off?
I think 100% might be off. Yeah, like maybe you just forget the off chance that something could be wrong.
Yeah, that's okay.
Right. So actually, if you read a lot of the cognitive science literature, people are really bad at interpreting probabilities, and we're quite irrational when we do. You'll see people avoid very low-probability events just because they think it might happen to them; think about being struck by lightning or crashing in an airplane, things of that sort. So low-probability events, just like what Aaron said, are very hard, and so are mid-probability events: exactly what it means for something to happen 30% of the time is something people have a hard time calibrating. Language models, by proxy, because they're trained on large amounts of human language data, are also going to be poorly calibrated on those types of events. So this calibration step that Erin was pointing out on the last couple of slides is actually very important to do, and we see that in many domains where you have verifiable outcomes that you can roll out, for example in math reasoning, so that you can calibrate the uncertainty more properly.
Okay. Uh so I think our last segment is
from Chai. So uh Chai, over to you.
Thanks Aaron.
>> Okay, so this will just be a brief discussion on system design and reporting standards for building RAG systems with credibility. As far as I can see, there is no ground truth or full consensus on this, so I'm just speaking from my own perspective. First, let's talk about credibility with regard to the retrieval process alone, without actually touching the generation itself. In terms of scope, let's focus on the retrieval index and how the retrieval problem is framed; the goal is that we want the system to output answers that are defensible and auditable by the audience.
Here I'm trying to frame this from five perspectives: what we, as designers of the system, should publish and maintain. It starts with a system card, covering how we retrieve and rank sites; then the index data sheet, which is about the data we use in our retrieval system: what data we index and how we curate or update it. The third part is the evidence trace: when we see an answer output by the system, do we know which evidence actually supported it? The fourth part is the retrieval and ranking configuration: how do we make it transparent and credible for the users, or stakeholders, of the system. And the final part is the KPIs or diagnostics of the system.
First, let's talk about some sort of RAG card. This is more like a model card for a credibility-aware RAG system, and the purpose is to propose an audience-friendly document for the RAG layer. One important thing is to document the intended use cases and also the out-of-scope cases, such as when a user asks for health advice, where the model should be more cautious. Here I'm drawing on the paper "Model Cards for Model Reporting" from the FAT* 2019 conference, which has an "intended use" section for reporting a model, and I think the same should be done when we report such a system.
For the credibility controls, I'm thinking about how we weight the different components of credibility: what caps we put on diversity, whether the data is recent, how recency is handled, and what thresholds the model maintains. Then there are also the known failure modes of the system, such as whether the index can become stale after some time, or source imbalances. And finally, the versions and dependencies, such as the embedder and reranker versions, as part of the RAG system's card.
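As a concrete illustration of the kind of record such a RAG card could be, here is a minimal sketch loosely modeled on model cards. The field names and example values are assumptions, not a standard schema.

```python
from dataclasses import dataclass, field

# Illustrative "RAG card" record: intended use, out-of-scope cases, credibility
# controls, known failure modes, and component versions, all explicit and versioned.

@dataclass
class RagCard:
    intended_use: list[str]                      # e.g., internal research Q&A
    out_of_scope: list[str]                      # e.g., personalized health advice
    credibility_controls: dict[str, float]       # thresholds / weights applied to sources
    known_failure_modes: list[str]               # e.g., stale index, source imbalance
    component_versions: dict[str, str] = field(default_factory=dict)  # embedder, reranker, generator

card = RagCard(
    intended_use=["course project Q&A"],
    out_of_scope=["medical or legal advice"],
    credibility_controls={"min_source_credibility": 0.6},
    known_failure_modes=["index staleness after 90 days"],
    component_versions={"embedder": "v1", "reranker": "v2"},
)
```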
The second part is the index data sheet. This is about the data sources our RAG system retrieves from, and what we should record or expose about that data: what the composition of the dataset is, such as the domains and institutions, and what data we include or exclude. Then, how we maintain the data through curation, such as crawl cadence and retraction handling, and what window we set for data freshness. The third item is the licensing, opt-out, and robots policies, especially in this era of generated content. The fourth is known biases; here I'm thinking not so much of social bias but of things like low-resource languages, where the corpus might be over-reliant on English text rather than text in other languages and cultures. And finally the last-updated timestamp: what is the knowledge cutoff of our data?
The third part is the evidence trace: can we make every sentence of the answer traceable to exact evidence, and make it detectable when a sentence is not actually linked to any? Here I'm thinking that for every answer, the system should be able to provide the retrieved document IDs, their source, and the starting and ending offsets, for instance the character or byte positions in the evidence, so that we basically get a span extracted from the longer retrieved evidence text. The second idea is some sort of remove-the-evidence check. This one is purely motivated by the attribution benchmark, which found that in many cases a citation exists but does not actually support the answer, meaning the citations are more decorative than meaningful. So I've been thinking of a remove-the-evidence check: what if we give the model the evidence in a first pass and measure the answer quality, then remove each piece of evidence in turn and check the answer quality again, to show which parts of the evidence are actually strongly used by the model and which parts are more decorative. This is purely my own opinion; I'm just citing the attribution benchmark here as the motivation.
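Here is a minimal sketch of that leave-one-out evidence check. The callables `generate` and `score_quality` are hypothetical, and evidence items are assumed to be dicts with an "id" field; this is an illustration of the idea, not an established procedure.

```python
# Sketch of the "remove the evidence" check: generate with the full evidence
# set, then drop each piece and see how much the answer quality degrades.

def evidence_influence(question, evidence, generate, score_quality):
    """Map each evidence id to the quality drop when that evidence is removed."""
    full_answer = generate(question, evidence)
    full_score = score_quality(full_answer, question)
    influence = {}
    for i, doc in enumerate(evidence):
        reduced = evidence[:i] + evidence[i + 1:]
        ablated_answer = generate(question, reduced)
        drop = full_score - score_quality(ablated_answer, question)
        influence[doc["id"]] = drop      # near-zero drop suggests a decorative citation
    return influence
```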
There should also be timestamps for cited sources, such as publication and retrieval dates, and there should be visible citations in the response, linked to the relevant excerpts.
This part is more about the retrieval and ranking configuration: how do we document this tunable mix of relevance, credibility, diversity, and recency of the evidence so that it is versioned and auditable? For the components, I think it's quite standard to report all the model cards and versions for reproducibility and transparency. Then, assume we are designing a system that is well-rounded and credibility is just one part of it: say we have four weights corresponding to four aspects, relevance, credibility, diversity, and recency, and we should also document how we tune those weights in our released system. The third item, which I think is quite important, is the profile. What is the preset of our system? Is it mostly conditioned on safety, so we should set a high weight for credibility? Is it conditioned on recency, so we must prioritize breaking news? Or is it more about pluralism and diversity, so we must prioritize diversity and set a higher weight for that? And then there are also fusion and top-K techniques: when we get a list of evidence or signals, how do we combine them and select the top ones.
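To show what a documented, tunable mix with named profiles could look like, here is a minimal sketch; the weights and profile names are illustrative assumptions, and the point is only that they are explicit and reportable rather than hidden inside the system.

```python
# Weighted ranking mix over per-document signals, each assumed normalized to [0, 1].

PROFILES = {
    "safety_first":  {"relevance": 0.35, "credibility": 0.45, "diversity": 0.10, "recency": 0.10},
    "breaking_news": {"relevance": 0.35, "credibility": 0.20, "diversity": 0.10, "recency": 0.35},
    "pluralism":     {"relevance": 0.35, "credibility": 0.20, "diversity": 0.35, "recency": 0.10},
}

def ranking_score(doc_signals, profile="safety_first"):
    """Weighted mix of a document's signals under the chosen profile."""
    weights = PROFILES[profile]
    return sum(weights[name] * doc_signals.get(name, 0.0) for name in weights)

# Usage: sorted(candidates, key=lambda d: ranking_score(d["signals"], "breaking_news"), reverse=True)
```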
Finally, there is the credibility signal source: a very brief note on where the reliability priors come from. For instance, one example is a media bias rating website that assigns a bias score to each news site, indicating what specific biases their reporting reflects. If we are drawing on such sources, I think that should also be reported as part of what our credibility signal is conditioned on.
>> Go back. So here, when you're talking about profiles, are you talking about profiles that a system would expose, or profiles that a user would set when using the system? What are the profiles attributed to?
>> Yeah, I'm thinking more from the design aspect, because, like,
>> a designer of the system,
>> right. Assume I am already packaging this system and releasing it. It would be good to inform the user how we are setting the different weights for the various aspects. But I think it can also be done from the user's perspective, conditioned on whether the user has access to tune them.
>> Right. Yeah, you're saying it could depend on the user's intention, whether it should be
>> Yeah, I mean all these dimensions are possible. As a provider, let's say OpenAI for example, they could say: I have three personas, you know, safety first, breaking news, or breadth first, and you can select them. Or you could condition on the query: you could say, oh, I think this is a particular type of query, just like what we've heard before about things that might be temporally sensitive. If it's temporally sensitive, maybe you would choose the breaking-news-first profile, since that's more likely to be the right profile for this type of query. It could also be a combination: you can ask the user what their preference is, like, "I've given you breaking news first based on your type of query, but if you want me to rerank or regenerate my response to make sure that verified claims come first, I can do that for you." Many language models now prompt you: okay, what's the follow-up action after a satisfying response?
>> Yeah, I think it could be both: the system sets a default, and whether to reset it, or which option to choose, should be up to the user.
Then I'm thinking of some KPIs, or how we monitor our credibility-aware system, and here I'm providing some examples. For instance, using FActScore: what percentage of atomic claims in the answers is factually supported by the retrieved evidence. Source diversity: the degree to which the top-k passages cover distinct source groups and sectors. Third, some uncertainty measures: when we want the model to express uncertainty, perhaps we want to capture abstention quality, such as how well the system's "not enough information" or "not enough credible evidence" responses actually match reality, which could be summarized by a numeric calibration error.
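As one small example of these KPIs, here is a sketch of the source-diversity measure mentioned above; the grouping function is an assumption (for instance mapping each document to an outlet, sector, or region).

```python
# Fraction of distinct source groups among the top-k retrieved documents.

def source_diversity(top_k_docs, group_of):
    """1.0 means every retrieved document comes from a different source group."""
    if not top_k_docs:
        return 0.0
    groups = {group_of(doc) for doc in top_k_docs}
    return len(groups) / len(top_k_docs)

# Usage: source_diversity(results[:10], group_of=lambda d: d["domain"])
```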
And then some sort of risk-coverage trade-off for abstention. I also have some closing discussion questions; I designed those in case we had more time, but since we are about three minutes over, let's just keep those questions in mind as a closing discussion.
So this summarizes our lecture.
>> Okay, let's thank Chai for the presentation.
So maybe what we can do is take this closing discussion and put it on the general channel, and I invite all of you next week, if you are coming to our social, to talk about some of these over a snack or a drink, because they are very important. One of the things that I think about a lot is the loss of locus of control for people using large language models, because every time we delegate to a large language model, that is something we're not doing ourselves. So it's likely that people who are overly reliant will become commoditized; they're not useful anymore. What we want to do as a society is bring people along for the ride rather than just make large language models more capable. We really do want large language models to keep getting better, but we want our population to be getting better, too. A lot of it may come down to some of the discussion questions: is it possible to instill more critical evaluation of information in our own users of LLMs, for example by prompting an LLM to call attention to its sources? Twitter used to require that in order to like or reshare a post you must have at least clicked on the source you're citing; you couldn't share it without clicking on it. That turned out to be a terrible intervention for Twitter, because they lost a lot of traffic, so they removed it, but you can understand that this type of intervention would actually provide more factuality and more checking, and would inspire more skepticism before spreading potentially fake news. So I'd like you to think about that, as everyone in this room is pretty much a designer of AI for the next 15 to 20 years. These are problems that are really worth solving, and we are only touching upon tiny bits of them in this class. Okay, so let's thank Jiaying and all of our presenters again.
Okay. And we'll see you next week. I just want to quickly note an answer to Jane's question earlier. If you go to our threads and you go to projects, you will see information about how to print a poster; this is in the projects channel. It tells you when you can print the poster. I printed this out: it's actually from the FAQ site within the STePS system. You go to the 27th STePS page, you'll see there's an FAQ link, you scroll down, and here is the information, which I have screenshotted for you in Slack. So right there is what you need to do: go to RT.cip.nus.edu, select your category, select "others", say it is event-related A1 poster printing for the 27th STePS, and then attach your PDF file to get it submitted. Once you submit the file, it's going to be printed, all your typos and all. Then, when it has finished printing, you will get an email. Earlier I think JF asked whether you need to pick it up right away. No, you can just leave it there. During STePS they have something like a hundred posters, so it may be a little bit hard to manage, but they will print it and put up your poster on the day of the event, and then
>> so that means you may not even have to physically go down. You can just do this online, in the comfort of your air-conditioned home, in your PJs, and send the file.
>> So we just need to submit this by next Tuesday, and it can be collected on Wednesday?
>> Yeah. It says here the 7th, 10th, and 11th, so that's Monday and Tuesday, I think, because this one must be Friday.
>> Okay.
So that's also here; you can just go through the slides for the exhibitors, which go out to all of you, and there's another page that tells you about costs. Okay, so yeah, do remember high resolution, because they are A1 size; they're bigger than your monitors, unless you have a really big monitor. It should be fine. Okay, so thanks everyone. That's all for today. Thank you.
(Lecture Starts at xx:xx)
Slides for this session: http://soc-n.us/cs6101-t2510-w12
Video Playlist: https://soc-n.us/cs6101-t2510-playlist
CS6101/DYC1401 Retrieval-Augmented Generation, Week 12: 6 Nov 2025, AY 25/26 Sem 1 (T2510)

Summary by Zoom AI

AI Credibility Survey Overview
The meeting focused on discussing the credibility and trustworthiness of sources in AI-generated content. Miao presented an overview of a survey that evaluates the trustworthiness of AI retrieval and generation systems across six aspects: robustness, privacy, fairness, transparency, factuality, and accountability. The survey, which includes experimental evaluations using language models, found that GPT-4 performs well in most areas but lags in privacy.

Content Verification in the Digital Age
Niall presented on the challenges of verifying content authenticity in the digital age, highlighting how traditional verification methods are inadequate against modern AI-generated content. He introduced the concept of content credentials, which use cryptographic signatures to create a tamper-proof chain of custody for media content, and discussed a recent paper that evaluated attribution accuracy in content verification, finding that current models achieve only an 80% accuracy rate.

AI Credibility and Energy Challenges
Kan presented a framework for verifying and attributing generated text to its sources, introducing the concept of explicators for context handling and discussing methods for dealing with missing or forged credentials. The discussion highlighted the challenges of maintaining credibility in search engines and the high costs associated with RAG (Retrieval-Augmented Generation) traffic, particularly for open-source platforms like Wikipedia.

Credibility-Aware RAG Solutions
Nikhil presented Section 3 of the presentation, focusing on the challenges and solutions related to document credibility in RAG (Retrieval-Augmented Generation). He explained how low-quality documents can mislead the RAG pipeline, leading to decreased performance of LLMs. To address this issue, Nikhil introduced the concept of Credibility-Aware RAG, which weights documents based on their credibility rather than just relevance. He discussed three methods to integrate credibility: Reliability-Aware RAG (RAD), Credibility-Aware Generation (CAG), and Credibility-Attention Modification (CRAM). Nikhil compared the performance of these methods, highlighting their strengths and limitations.

Credibility and Diversity in Rankings
The meeting focused on discussing the relationship between credibility and diversity in ranking systems, with Fu presenting a framework for pluralism and source selection. The discussion covered how credible rankings can lack diversity, and introduced a new ranking model that combines credibility with diversity and fairness metrics. Fu presented a graph-based adaptive re-ranking method that uses a two-stage retrieval pipeline, incorporating policies to ensure balanced group representation and fair exposure across different sources.

Factuality in AI-Generated Text
Jason presented on factuality and attribution measurements in AI-generated text. He explained the difference between backfilled citations and real-time attribution, noting that backfilled citations are faster but may lead to errors and misrepresentations of truth. Jason discussed the importance of factuality in AI-generated content, particularly for technical topics, and introduced the concept of the fact score as a new evaluation method for measuring the accuracy of long-form text generation.

Advancing LLM Evaluation Standards
The meeting covered several topics related to the development and evaluation of large language models (LLMs). Kan presented on methods for automatic scoring of factual statements and the challenges of context-dependent claims. Jicheng discussed a benchmark for attributing statements to their sources, highlighting the need for transparent and auditable systems. Jared explored confidence calibration in LLMs, presenting various methods to improve model accuracy. Jai Ying proposed design reporting standards for credibility in retrieval systems, emphasizing the importance of transparency and traceability. The discussion concluded with thoughts on critical evaluation of information by LLM users and the need to balance AI development with human capability improvement.

00:00:00 Kahoot! from W11
00:07:31
00:17:20 Section 1: Introduction (Yisong)
00:42:32 Section 2: Provenance and Authenticity: Beyond Text (Nayanthara)
00:57:15 Section 3: Credibility-Aware RAG Methods: Placing Trust Inside the System (Nikhil)
01:19:20 Break
01:24:05 Section 4: Pluralism & Bias-Aware Source Selection (Zihang)
01:45:25 Section 5: Factuality & Attribution Measurement (Jiecheng)
01:45:25 Section 6: Uncertainty, Calibration & Abstention in Credibility-Aware RAG ()
01:45:25 Section 7: Discussion: System Design and Reporting Standards (Jiaying)