Okay. So that was a pretty easy question, I guess.
>> Yes. So the core idea behind GraphRAG is just organizing knowledge for retrieval. So, on to the next question.
Obviously, this one is still quite simple: that is basically what GraphRAG is targeting; that was the main goal.
So this is Microsoft's work, and you can check the previous slides, but basically they were trying to build community summaries, whereas some of the previous techniques...
Oh, who's the red?
>> That's me.
>> Okay, somebody online.
>> Great.
Which subset? Structurization is the one that formalizes the structure, decomposition is more about breaking down the task, extension is more about following the edges of the graph, and verbalization is more about how the text itself is expressed. So that's the correct answer.
So you mentioned verbalization. That's
not
>> Oh.
So in LightRAG, retrieval proceeds at two different levels: a low-level retrieval and a high-level retrieval. To give a more concrete example: low-level retrieval is more targeted, more entity- or relation-specific, such as "what is the relationship between these two entities." High-level queries are more abstract or thematic, for instance "how does AI impact education," which is broader and aggregates information along multiple edges. So the correct answer here would be that the query is used to generate two sets of keys, one more targeted and one more abstract.
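To make the dual-level idea concrete, here is a minimal sketch of how a LightRAG-style retriever could route one query through both levels. The function and index names (`llm_extract_keywords`, `entity_index`, `theme_index`) are hypothetical placeholders, not LightRAG's actual API.

```python
# Hedged sketch of LightRAG-style dual-level retrieval.
# `llm_extract_keywords`, `entity_index`, and `theme_index` are hypothetical
# stand-ins for an LLM keyword-extraction call and two indexes over the graph.

def dual_level_retrieve(query, llm_extract_keywords, entity_index, theme_index, k=5):
    # Ask an LLM to split the query into two key sets:
    #   low-level keys  -> specific entities/relations ("relationship between A and B")
    #   high-level keys -> abstract themes ("impact of AI on education")
    keys = llm_extract_keywords(query)          # e.g. {"low": [...], "high": [...]}

    # Low-level retrieval: look up concrete entities and their incident edges.
    low_hits = entity_index.search(keys["low"], top_k=k)

    # High-level retrieval: look up broader relation/theme summaries that
    # aggregate information along many edges of the graph.
    high_hits = theme_index.search(keys["high"], top_k=k)

    # Merge both evidence pools before handing them to the generator.
    return low_hits + high_hits
```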
The algorithm in question is, by definition, a community finding or detection algorithm. The goal of community detection is to maximize modularity, a measure of how well the network is partitioned into communities. So that's the correct answer.
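As a concrete illustration of modularity-based community detection (a generic example, not necessarily the exact algorithm from the paper discussed), here is a short sketch using NetworkX on its built-in karate-club graph.

```python
import networkx as nx
from networkx.algorithms import community

# Toy graph: Zachary's karate club, a standard community-detection example.
G = nx.karate_club_graph()

# Greedy modularity maximization: repeatedly merge communities so that
# modularity (how much denser intra-community edges are than chance) increases.
communities = community.greedy_modularity_communities(G)

# Modularity of the resulting partition; higher means better-separated communities.
Q = community.modularity(G, communities)

print(f"found {len(communities)} communities, modularity Q = {Q:.3f}")
```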
Oh the red is fair.
The last question.
Yeah, for organizer and generator: the role of the organizer is more about packaging the retrieved graph content, while the role of the generator is about producing the answer. Here, because Kahoot doesn't allow a lot of text in the answer boxes, "discriminative", "generative" and "graph-based" refer to the three ways of producing the answer. One is discrimination-based, using non-generative models like GCNs or graph transformers that act as a sort of classifier; one is generation-based; and the graph-based approaches generate the answer directly from the graph without verbalizing it first. So those three options correlate with the three categories.
So this summarizes our quiz. A round of applause to the winners, decided at the last minute, and to Chris. Okay, congratulations. Thanks, Jing, for narrating and creating the quiz today. And with that, I think our first presenter is online.
So we can
see better
Yeah.
>> Yes, I'm just curious: week 13, right? I mean, there'll be a Kahoot after that, right? So there's no week 14?
>> There's only 14. Yeah. So what we'll do is we'll run the Kahoot at the very end of that.
>> Yeah. So whoever is creating the Kahoot for that week actually needs to work with the lecturing staff to prepare the Kahoot at the same time. So, yeah, I will try to message about it in the week that we're covering that.
>> Okay, we'll stop our screen share. I think our online participants saw
>> Yeah.
>> everything all the way through. So start the screen share when you're ready.
>> Yeah. So before screen sharing, let me say hi from Barrett as well. Barrett says hi. We're here at the conference, but we are both busy with our group duties here. Yeah. So yeah. Great. Okay.
Thanks for working so hard while attending the NLP conference, which is a top-ranked conference.
So, yeah.
>> Okay, can you guys hear me and see the screen? Is everything good?
>> Yes. Um, okay. Yeah. So, I want to apologize in advance: I may have a 30-second interruption or two, because I can see the security staff trying to push us away, so I'm trying to hang on here. Okay. Right. Okay. So, I'd like to first give a very quick intro to this session, and then I'll leave the floor to my colleagues. Real quick, I'll go through the introduction, and then we have this agenda: we're going to talk about provenance and authenticity, then credibility, then pluralism and bias-aware source selection, then factuality and attribution, and finally Z will close the session. I also want to give a lot of credit to her, because she was the one helping to create this agenda today. So yeah, we have an amazing agenda today. My part is actually very easy and very intuitive.
So the idea is that when we are tracing the source of the news, we are trying to gauge the credibility: whether we can trust the source of the things we are referring to. Is it really trustworthy? It's like we put a small box into a bigger box, then the bigger box into an even larger box; can we trust them? Here I'd like to share a quote from my internship advisor. Very briefly, she gave a simple example of how errors can creep into our papers through LLM use: she had a statistic from 2020 at Google, and the 2023 AI Overview talks about the same stat as if it were from 2023, because it only provides a link to support the claim. So credibility can be "refreshed" just by repeating the claim year after year, and we still believe the thing is quite recent, when in fact it is not. As the first figure shows, we just put a small box into a larger box, but the core of the thing hasn't been updated, and the statistic from 2020 should probably also be updated. So this is the general idea of temporal awareness in source credibility.
This is very interesting, and I'd also like to share another thing, about false negatives in the sources and the things that we cite. The idea is that there could be something we should cite, but unfortunately, for some reason, the citation doesn't appear as expected, so there's a false negative. Here I'd like to quote a conversation between me and my advisor. We were at a lunch meetup with a staff member, a lady from the Elsevier publisher, and she told us that it's very hard to track citations and do a really good literature review of a field, because whenever we read the related-work section of a paper, it cites a few papers, but there are always some missing citations. So here are two robots: one says "I don't want to cite your paper," and the other robot says "I don't want to cite your paper either." The two robots don't like each other, so they don't cite each other's papers, even though those papers are worth being cited. However, when someone's paper gains prominence later, people will believe that its related-work section is pretty complete, and they will probably just ignore the citations that really need to be properly added. So this is a false-negative example. I think example one is a false positive, because we think something is recent when it isn't, and example two is a false-negative signal: something that should be there but isn't, for reasons that are about humans, not really about machines. Okay.
So, thanks to our lecture coordinator Jing, today I got to read this amazing survey, and I would recommend that everybody in this room read it, because it is basically an umbrella for today's reading group. The authors have brilliantly classified trustworthiness in RAG systems into six types: robustness and privacy in the retrieval stage, fairness and transparency in the generation stage, and factuality and accountability in the checking stage. So we have this triangle here, with different components being evaluated at different stages of a RAG system. And the outstanding part of this survey is that it is not only a survey: they also ran experiments, and they do the evaluation by prompting language models with prompts like this one. For example, when evaluating the aspect of transparency, they have a prompt containing the question and a reference, and they ask the model to carefully think and reflect some intermediate answer steps for the provided reference. I haven't gone through the survey in full detail, but I think they have some ground-truth intermediate answering steps that they can score against by the overlap between them. So here's their evaluation setup, and you're welcome to check the arXiv preprint for the full details of this paper.
And finally, here are their evaluation results. We can see a radar figure for the six aspects that they believe are important for the credibility of the sources and the RAG system. We see that older models, like Llama-2 13B, which was still very relevant in 2024 but not very relevant today, are not performing very well: we see this small green shape here, and its area is very small because the model is not performing well on any aspect. So which is the best one? The yellow one is GPT-3.5: it does very well on transparency, fairness, robustness, factuality and accountability, but the privacy aspect is dragging down its overall area. According to the definition, privacy means that we shouldn't let private data from any source be used in the RAG pipeline, so GPT-3.5 Turbo is not doing a very good job on this aspect. So which one is the best? I would say probably the blue one, the sapphire one: GPT-4 was the best when the authors compiled the survey in 2024.
Yeah. Okay. That's it for my presentation. Are there any questions about this survey or the intro?
Any questions for you, S?
Okay. Maybe we go ahead to the next segment, given the time constraints that we have.
>> Okay. Thank you.
>> So, who's up next? Yeah. Okay.
>> Yeah.
>> Okay.
Share your screen, and if you manage to use the pen, you're welcome to use it.
>> Uh, you need to share the entire screen. Maybe that's better.
Hello everyone. I'm N, and today I'll be talking about provenance and authenticity. Basically, this is quite important because we need to verify whether a source is real or not. In today's world, for almost all content we have this doubt: is it real, is it fake? And it's very hard to tell, even with AI-generated content. I believe this is a very important area for us to know about. The title says "beyond text" because traditional text-based verification methods are now outdated: they just don't work for today's multimodal content, and we need something that is fundamentally different and robust.
So, the trust crisis. The first issue, or the first challenge, is that synthetic media is now ubiquitous, and because of the very fine line between real and fake, it is increasingly convincing: we can easily come to think that something fake is real, and we need to find ways to tackle this issue. The second main challenge is that traditional verification methods have failed. The methods used so far for image or video content were looking for compression artifacts, checking metadata, and analyzing lighting inconsistencies, but modern generative models easily pass all of these tests, so we need to be careful, because right now we're essentially flying blind in these areas.
The third is that users cannot distinguish between manipulated and authentic evidence. Even when I use ChatGPT: I work in quantum computing research, and because it's such a niche field there are a lot of things I don't know on a daily basis, so when I ask ChatGPT to explain something from scratch, like "explain it to me like I'm five," it does provide citations, but later, when I click on those links, I realize the explanation doesn't fit what the linked sources actually say. Not every one of us clicks on all the links, right? But when you do click on them, you realize there's a huge inconsistency between the answer and what the links are saying. So the stakes are quite high, with misinformation spreading faster than corrections: according to one statistic, misinformation spreads six times faster than a correction.
So if there's already some information out there that's not real, it takes six times the effort for a correction to reach the same audience, and you're never really closing the gap. The second issue is the erosion of trust in digital evidence, be it in journalism or in courtroom cases. And then there are critical decisions that rely on unverifiable content: in the future, if we're using AI in, say, judicial or medical systems, we do not want false data to be the evidence for some decision that the AI system or the human is going to make.
So what's the solution? The solution is basically content credentials, which means being able to capture the capture-and-edit chain, possibly with cryptographic signatures. There might be other ways to do it, but I'm just highlighting this one in the interest of time. So what are cryptographic signatures? Think of it like a blockchain for content: every time something happens to a piece of media, from the moment it is captured to every edit made over the existence of that particular media, the action gets logged and cryptographically signed, so there's a record of every change that has been made. What exactly does this capture? The device or the tool that was used to create the content; the exact timestamp of creation; every operation, for example cropping, filtering, AI enhancements, so every edit operation is tracked; and, most importantly, whether AI was involved in the creation of the content. Why does this work? Because it's tamper-proof, or at least tamper-evident: if someone tries to modify the content without updating the credential, then the signature breaks and you immediately know that something's wrong, that there's an attacker. It also provides chain-of-custody proof, which means, like evidence in a criminal investigation, you can trace the entire history of that content. And crucially, this isn't just theoretical; it's actually being adopted, and the standard is called C2PA, the Coalition for Content Provenance and Authenticity, which is being used in practice by big names like Adobe, Microsoft, Canon, and so on. What's also interesting is that these credentials are both human-readable and machine-readable, which means they can be used for transparency in automated systems.
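To make the tamper-evidence idea concrete, here is a toy sketch of signing a content manifest and detecting later modification. It uses an Ed25519 signature from the Python `cryptography` package purely as an illustration; it is not the actual C2PA manifest format, and the field values are made up.

```python
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Toy "content credential": device, timestamp, edit chain, AI involvement.
# Values are illustrative only.
manifest = {
    "device": "ExampleCam X1",
    "captured_at": "2025-01-01T12:00:00Z",
    "edits": ["crop", "color-filter"],
    "ai_generated": False,
}

signing_key = Ed25519PrivateKey.generate()
payload = json.dumps(manifest, sort_keys=True).encode()
signature = signing_key.sign(payload)            # signed by the capture device/tool

# Verifier side: any change to the manifest breaks the signature.
tampered = dict(manifest, ai_generated=True)     # attacker flips a flag without re-signing
try:
    signing_key.public_key().verify(
        signature, json.dumps(tampered, sort_keys=True).encode()
    )
    print("credential verified")
except InvalidSignature:
    print("credential invalid: content was modified after signing")
```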
Next I'll be talking about attribution evaluation, because we need the technical infrastructure, right? Even with perfect provenance metadata, there is still a fundamental problem, which is evaluating whether claims are actually supported by evidence, and this is hard. A recent paper called AttributionBench, published at ACL 2024, tests this aspect systematically. They took state-of-the-art models, including GPT-3.5 fine-tuned specifically for this particular task, which is a binary classification of whether a certain piece of information is attributable or not attributable. The result was only about an 80% macro-F1 score, which means roughly one in five evaluations is wrong, and we cannot work with that kind of error bar. So then they did something really interesting: they analyzed about 300 error-case samples to understand why the model fails in the cases it does. About 66% of the errors came from what they call fine-grained information insensitivity.
That basically covers three areas: missing nuanced details, not being able to infer or summarize where a claim came from, and overlooking subtle contradictions. What we should also note is that these are errors that are very easily made by humans as well. The other 27% of the errors came from information access mismatch, which basically means that human annotators can see the full web page or the full document, whereas the model sees only extracted snippets. So the information given to the human versus the model is itself different, which also makes their judgments differ.
This is basically how AttributionBench works. As you can see, this example was fed into GPT-3.5: the claim was that the population of Thailand is about 63 million, while the reference discusses certain connections with religion and so on. The judgment GPT-3.5 gave was that this information is non-attributable, but the ground truth is that it is attributable. And the same thing happens in another case as well.
So how is this different? We always use the F1 score, but what did the authors of this paper do differently? They used something known as macro-F1, which is basically the unweighted average of the per-class F1 scores: they calculated the F1 score for the attributable class and the F1 score for the non-attributable class and took the unweighted average of the two, which is different from the usual F1 score that we talk about. Why does it matter? Because we want a balanced evaluation: when you're evaluating a model on a binary classification, you need to know that it isn't favouring one class, that it is not giving "attributable" more weight than "non-attributable." So having this balanced evaluation is very important.
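As a quick illustration of the metric, with toy labels rather than AttributionBench data, macro-F1 is just the unweighted mean of the two per-class F1 scores:

```python
from sklearn.metrics import f1_score

# 1 = attributable, 0 = not attributable (toy labels for illustration only)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

f1_attr     = f1_score(y_true, y_pred, pos_label=1)   # F1 of the attributable class
f1_not_attr = f1_score(y_true, y_pred, pos_label=0)   # F1 of the non-attributable class
macro_f1    = f1_score(y_true, y_pred, average="macro")

# macro-F1 equals the unweighted average of the two class-wise F1 scores
assert abs(macro_f1 - (f1_attr + f1_not_attr) / 2) < 1e-9
print(f1_attr, f1_not_attr, macro_f1)
```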
So the benchmark detects when claims are supported by evidence and when they are not: when a claim is supported by the evidence, it is labeled attributable, and otherwise non-attributable. Some of the key takeaways from this are that the benchmarking itself was very important: the way they set the whole thing up, the way they studied the 300 error samples, and the insights they got from that.
So, just as provenance needs verifiable chains, evaluating attribution needs a very careful setup as well, and design choices also matter, because if you don't have balanced classes, you're going to see a lot of anomalies. What they did was take seven datasets, four for ID and three for OOD. ID is in-distribution, and OOD is, I don't remember the exact term, but, like,
>> out of distribution.
>> Yes, out of distribution. In-distribution means the model is trained on those four datasets, and out-of-distribution means it is tested on three datasets that were not given at all during training; they tested on a different set of data altogether. So the model does not just look at semantics or linguistics but at whether there are patterns it is able to catch, and for generalization purposes they chose not to have the training set and the test set overlap.
Then there's something known as the AIS framework, which brings us to this paper from Google, published in Computational Linguistics, a renowned journal. It asks a simple question: can generated text be verified against the sources that it provides? It's a two-step process. Step one is the interpretability test: is the text understandable on its own? Are there grammatical errors, does it sound weird when you just read it normally, how ambiguous are the references? Those are the very basic things it looks at. The second step is the attribution test: basically, can I quote something, can I say "according to the source" and make that statement? If I can't, then the attribution to the source is not accurate.
The key innovation here is something known as explicatures, a formal model of meaning in context. Basically, it handles the tricky cases where context is important. A simple example of what an explicature would be: if I ask a model "when was the iPhone 17 released?" and the system says "it was released on 19th September 2025," then the "it" the system refers to is contextualized by my question itself. That's an example of what an explicature means: the statement already carries an idea of what it's talking about. And this has been validated across tasks like conversational QA, summarization, table-to-text, and so on.
Next, missing and forged credentials. In the real world, credentials are often missing or forged. In terms of missing credentials, there are four things. First, assume untrusted by default: no chain means no verification, so if nothing is attached to the content, you cannot verify it. Second, flag it explicitly in the UI, so that users know what they're looking at and how believable or true it is. Third, lower the trust score in your ranking and retrieval system; by "lower" I mean it should not be shown at a high rank when people search for something it matches. And fourth, it's important to have clear warnings, which means transparency.
Coming to the forged or manipulated credential cases: this is the adversarial case, where attackers will try to fake credentials. So how do you defend? First, cryptographic verification: check the signatures; if the simple math doesn't work out, the content is rejected, which means it's not trustworthy. Then certificate revocation: like with SSL certificates, you need a way to block compromised signing keys. Next is anomaly detection: flag suspicious editing patterns; for example, if a piece of media has been edited 47 times in three seconds, you know that's humanly not possible, so you flag that as well. Then trust anchors: verify against known authorities, legitimate device manufacturers, and so on. And the best practice is to treat credentials as evidence, not truth, and always combine them with other signals.
So, attribution is not equal to authenticity. We should try to get this as clear as we can, because it's the foundation of everything else that follows. Attribution confirms that what the model says aligns with the source; it is basically a linguistic check: does this claim match the document that it is quoting, which is basically the AIS idea we discussed previously. Authenticity confirms that the source itself, and its entire chain, is trustworthy; this is the technical check, while attribution is the linguistic check. The gap here is that missing or forged provenance breaks the attribution link: you might have perfect linguistic alignment with a source, but if the source itself is fabricated, then the attribution is meaningless. This is why we need a unified framework where we have attribution, provenance and credibility together, and the overlap is where trusted AI evidence lives; that's where you want your AI to be.
So naturally we come to the question of how we build this: bridging to RAG with credibility signals. First is metadata enrichment: you add provenance to the embeddings and track credential status. Second is retrieval re-ranking, which I think we'll also talk about in this class: boost verified content and penalize missing credentials, just simple math. Third is generation grounding: you weight by provenance and put the trust metadata in context. The fourth is citation and transparency: just be as transparent as possible; your user should not be confused.
Yeah, this is the provenance-aware RAG architecture. It starts with ingestion, then goes on to verification and scoring, and then on the indexing side it's just indexing, retrieval and generation, so I think it mostly explains itself. The key signals are: credential present, signature valid, edit chain intact, source reputation, and checks for missing data and edits.
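A minimal sketch of the "boost verified, penalize missing credentials" idea follows; the weights and metadata field names are made up for illustration, and a real system would calibrate these signals rather than hard-code them.

```python
# Hedged sketch: combine retrieval relevance with provenance-based trust.
# Field names ("credential_present", "signature_valid", ...) are illustrative.

def trust_score(doc_meta):
    score = 0.5                                   # neutral prior for unknown provenance
    if doc_meta.get("credential_present"):
        score += 0.2
    if doc_meta.get("signature_valid"):
        score += 0.2
    if doc_meta.get("edit_chain_intact"):
        score += 0.1
    if doc_meta.get("credential_forged"):
        score = 0.0                               # broken signature: do not trust at all
    return min(score, 1.0)

def rerank(candidates, alpha=0.7):
    # candidates: list of (doc_id, relevance, metadata) from the first-stage retriever
    scored = [
        (doc_id, alpha * rel + (1 - alpha) * trust_score(meta))
        for doc_id, rel, meta in candidates
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```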
The key takeaways: provenance is the substrate for trust, and content credentials provide it, as I mentioned. Attribution evaluation remains hard: models still struggle with nuanced verification and can make mistakes. And third, treat missing or forged credentials explicitly: down-weight them or give them a penalty, and make trust transparent. The next section is by Nil, who will talk about credibility-aware RAG.
>> Okay let's uh
>> hi.
So let's just spend a minute or two thinking about what N has presented. Do we have any questions from the audience? There were two discussion catalysts posted in Slack; you have Slack in front of you, so you can look at that. The presenter and I have already answered some of the questions there.
So, credibility has been a problem in search engines from day one. One of the other problems we see is that we want certain sources to be trustable, findable and accessible, right? For example, when we do RAG, you might want to trace back to an open source like Wikipedia, or some type of repository that gives authenticity, as Diane also pointed out. But these resources are also encumbered: operating a website and making it freely accessible to all is a cost. For example arXiv, where we've read a lot of papers from, services billions of requests, millions every day. It's not cheap to send all those electrons around; that's why, if you go to Wikipedia or arXiv, they're always asking for money. And with RAG we're actually compounding that problem, because every time we ask GPT or AI mode or whatever to come up with a summary, they have to go access all those websites, pull that information, summarize it, and present it to you. And then what do we ask you to do? Go back to those same websites and do it all over again. Of course that's necessary, but you can see how this compounds the problem, because we need to check the provenance of this work. Unlike a search engine, which is not accessing those websites at query time (it has indexed them once and is not pulling them at runtime), when you make a query to a RAG system, it is most often either pulling a cached, very recent version or actually going to that website live. In fact, if you talk to many of the providers of Wikidata (I went to a meeting about that a month or two ago), they are very concerned: RAG traffic is a huge cost, and the electricity involved in serving RAG users is ten times that of a normal request. You can see that if everyone starts using ChatGPT, their costs balloon. And you can see that if we can't keep these resources available, fact-checking and credibility-aware systems become even more of a problem, because if you can't get to arXiv or Wikipedia to check, how do you know that it's right? Okay. Any comments from anyone?
>> Yes, I was very interested to hear about that function that uses blockchain for this. Is that part of the C2PA credentials, or is it just software?
>> The presenter can also address whether blockchain is part of the solution set.
>> I mean, it's just a cryptographic protocol. I was just trying to give an analogy with blockchain, to help think about it like that.
>> But it's not set up as one?
>> It doesn't exist as a blockchain; it's just the signed chain.
>> That's my question.
>> Yeah, yeah.
>> Thank you.
So yeah, blockchain is definitely a very useful technology, but it requires at least 50% of the network to be honest and distributed, right? So that's also a difficult issue, because you need enough players for a blockchain to be robust, right?
>> And getting that support worldwide is not that easy either.
>> Yeah. I mean, it's about providing a common good, right? It's sort of like water and air: if you have to pay for those resources, it's not so easy to get right, and if one person doesn't contribute, or uses them poorly, everyone suffers.
>> And there's also the environmental issue.
>> Can you elaborate more on that?
>> Yeah, because using blockchain or similar technologies just increases the amount of energy you use, even for tokenizing one small piece of the thing. So when you do it billions of times, the amount of energy is enormous. So again it comes back to the fact that LLMs already use a lot of energy, and then do you want to inject something else that's going to increase that?
>> Right. This is a very good discussion, because you can see that the technology we are creating creates more energy use, and unless we have a pairing on the other side to make more energy-efficient use of our resources, we're always ramping up our computational requirements. Ten years ago, you couldn't imagine the amount of energy being consumed to do certain basic things that we now take for granted, right?
>> Okay. Nil, are you online?
>> Yeah.
>> Okay. Start your screen share and I'll let you take over.
>> Yeah, I'll share my screen. Just a
minute.
Sorry.
And you can share and when you're ready
you can start.
>> Is my screen visible now?
>> Yes.
>> Okay.
>> Presentation mode and it's all yours.
>> Yeah. Okay.
So, okay. Hello everyone. Now I'm going to present section three of the presentation. From section two we understand that we can now verify document authenticity and track content origins. However, there's a big challenge there: not all retrieved documents are equal. By "equal" I mean that not all documents are accurate. This means that RAG breaks when the context is flawed, where "flawed" means it retrieves irrelevant, outdated or misleading text. The generator may use low-credibility snippets if they look on-topic, and the simple filters that we put in place can drop useful evidence. This is how it works: mixed-quality documents go into the system, into the retriever.
This is how a RAG pipeline works: the mixed documents go in, then it searches for the relevant chunks, then it orders them by evidence quality, and then it generates the grounded answer. With low-quality documents there's a very big impact: the LLM's performance drops by 20 to 30% because of the misinformation, and as we can see, even one low-quality document among many good ones can mislead the RAG system. So the point is that we have to treat sources and spans by credibility, and not only by relevance.
So, the credibility problem in RAG. What do high credibility and medium credibility mean? A high-credibility document can be a recent news report on the event, or peer-reviewed, high-impact literature. A medium-credibility document can be something like a general reference page from an encyclopedia, which is more generic but can still be credible. However, an outdated report, for example a report from 2018-19, which is six or seven years old, can be low credibility: something that was true at that time may have changed by now. And obviously, AI-generated misinformation is always low credibility. The big problem is that standard RAG treats all four of these document types equally, leading to confused and wrong answers. So what's a good solution? The best solution is credibility-aware RAG: weight the documents by credibility, which helps get the correct answer, and explicitly tag each document with its credibility level. Okay.
So credibility-aware RAG basically means placing trust inside the pipeline. A standard RAG works on the function LLM(query, documents), whereas a credibility-aware RAG takes the query and also considers the credibility alongside each document, LLM(query, documents, credibility), with each document tagged with its credibility. What are the dimensions of credibility? Relevance: does it answer the query? Timeliness: is it current? And the source: whether it's high-impact peer-reviewed literature or just a low-impact source, and so on. The key insight: don't hope the model figures it out; we have to tell it explicitly which documents are high credibility and which are low.
So there are three paradigms, three different paths to the same goal, three different types of methods. One is RA-RAG, which acts at retrieval and aggregation, and no training is required for it. The second is CAG, which works in the generation (training) phase and needs to be trained. And the third, CrAM, happens at inference time and also doesn't require training. Just to give a very brief real-world explanation of all three before I go technical: you can think of RA-RAG as a bouncer outside a club, checking sources at the door and weighing the trustworthy ones more, making sure they meet all the eligibility criteria, that they are above 21, have legal ID, and so on. CAG is like teaching a student to recognize which sources are reliable and which are unreliable; as we are teaching it, we have to train it. And CrAM can be thought of as a pair of glasses: we can adjust the glasses to blur out the distractions and focus on what's clear.
All three integrate credibility in different ways to achieve trustworthy generation, and they all deliver double-digit gains over standard RAG in noisy conditions. Explaining each of them technically: first, reliability-aware retrieval-augmented generation (RA-RAG). Technically speaking, it estimates per-source reliability from cross-source agreement and uses it to filter, re-rank and aggregate evidence via weighted voting. Breaking down its steps: step one is to estimate source reliability offline, once. Cross-check around 200 fact-checking queries across sources and see which sources agree with the consensus; no manual fact-checking is needed. Then, at query time, retrieve per source: get the top-k documents from each source separately (it can be top 5 or top 10, depending on the data and what we're doing with it), and then select the most reliable and relevant sources, only consulting the top few sources that have relevant information. If that number is four out of, say, a thousand sources, it means a token reduction of 99% or more. And then aggregate with a weighted majority: the answer is the argmax over candidate answers of the sum of the reliabilities of the sources voting for that answer.
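Here is a minimal sketch of that weighted-majority step; the source names and reliability values below are illustrative, not taken from the paper.

```python
from collections import defaultdict

def weighted_majority(votes, reliability):
    """votes: {source: answer}; reliability: {source: weight in [0, 1]}."""
    tally = defaultdict(float)
    for source, answer in votes.items():
        tally[answer] += reliability.get(source, 0.0)   # sum of reliabilities per answer
    return max(tally, key=tally.get)                    # argmax_a sum_s r_s * 1[vote_s = a]

# Toy example mirroring the slide (values are illustrative):
votes = {"CNN": "SARS-CoV-2", "ConspiracyBlog": "5G",
         "MayoClinic": "SARS-CoV-2", "Wikipedia": "SARS-CoV-2"}
reliability = {"CNN": 0.8, "ConspiracyBlog": 0.1, "MayoClinic": 0.9, "Wikipedia": 0.7}
print(weighted_majority(votes, reliability))            # -> "SARS-CoV-2"
```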
To give a simple example, as in the 2024 paper: the fact-checking query was "what causes COVID-19?" The sources were: source A, CNN, which says the SARS-CoV-2 virus; source B, a conspiracy blog, which said 5G networks; source C, Mayo Clinic, which says the SARS-CoV-2 virus; and Wikipedia, which also said the SARS-CoV-2 virus. For the reliability estimation, sources agreeing with the consensus are weighted as high reliability; we can see that three of the sources (CNN, Mayo Clinic and Wikipedia) said SARS-CoV-2, so the weighted vote goes with that answer. There was another example in the paper, at query time, about the current unemployment rate with weighted voting, but this is the main example of why it returns SARS-CoV-2: because the majority of the reliable sources said so. So now, moving on to the second one: credibility-aware generation.
Technically speaking, it means fine-tuning large language models on data where documents are labeled with credibility, plus prompting the model to generate the reasoning. It has a two-step data creation process. First, multi-granularity credibility annotation: use a retriever to score relevance, classify documents into high, medium and low credibility, and apply this at both the document and the sentence level. Second, credibility-guided explanation generation: prompt GPT-3.5 to generate explanations that analyze credibility, as was done in the Pan et al. paper. The example goes: the high-credibility Reuters article states X, the low-credibility blog states Y; considering credibility, the answer is X. That's the form the answer should take.
To give more detail on the example, here is the CAG training data example. Input to the model: "Who won the first Nobel Prize in Physics?" The high-credibility document, as expected, says that Wilhelm Röntgen won the first Nobel Prize in Physics in 1901, and it comes from the Nobel Prize official website, which we have given as input. The low-credibility document says that Albert Einstein won the first Nobel Prize in Physics, say from some person who just put up a blog post. What would the output from the trained model be? "The high-credibility document from the Nobel Prize official website clearly states that Wilhelm Röntgen won in 1901. This is authoritative, as it is the official record. The low-credibility document claims Einstein, but this contradicts the official source. Einstein did win a Nobel Prize, but not the first one." So the answer comes out like this; the CAG model gives you an output in this format, as we can see.
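For concreteness, a single CAG-style training instance could look like the sketch below. This is an illustrative format only, not the exact schema used in the paper.

```python
# Illustrative CAG fine-tuning example: documents carry explicit credibility tags
# and the target output reasons over them before answering.
example = {
    "question": "Who won the first Nobel Prize in Physics?",
    "documents": [
        {"credibility": "high",
         "source": "Nobel Prize official website",
         "text": "Wilhelm Roentgen was awarded the first Nobel Prize in Physics in 1901."},
        {"credibility": "low",
         "source": "personal blog",
         "text": "Albert Einstein won the first Nobel Prize in Physics."},
    ],
    "target": (
        "The high-credibility document from the Nobel Prize official website states that "
        "Wilhelm Roentgen won in 1901; this is the authoritative record. The low-credibility "
        "document claims Einstein, which contradicts the official source. "
        "Answer: Wilhelm Roentgen."
    ),
}
```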
Okay. So, the third and last one: credibility-aware attention modification (CrAM). It adjusts attention at inference time by down-weighting the tokens from low-credibility spans, using calibrated scaling of selected attention heads, as we can see in the architecture figure from Deng et al. It's again a two-phase process. Phase one identifies the "gullible" heads: use about 100 calibration examples with misinformation, run extended causal tracing to see which heads are most affected, and select the top 100 to 300 influential heads. Phase two modifies attention at inference time: for low-credibility documents, scale down the attention; for high-credibility documents, keep the attention normal. Just like what I said about the glasses blurring out the distractions: don't be attentive to the low-credibility documents. Only the influential heads identified in phase one are modified, and since we are not injecting new parameters or teaching the model anything, there's no training required; it's a plug-and-play method.
So, how does CrAM's attention modification work in an example? Query: "Who won the Nobel Prize in Physics?" Document A, with credibility 0.8, says Wilhelm Röntgen won in 1901; document B, with credibility 0.1, says Albert Einstein won the Nobel Prize. Under standard RAG, the Röntgen tokens get about 30% of the attention each and the Einstein tokens about 20% each. CrAM scales the attention according to its formula and then renormalizes, and after renormalization the model focuses about 92% of its attention on the credible document. This is how it works.
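A minimal NumPy sketch of the scale-and-renormalize step on one hypothetical attention head is shown below; the real method applies this only to the selected influential heads inside the transformer.

```python
import numpy as np

def cram_scale(attention, credibility):
    """attention: original attention weights over context tokens (sums to 1);
    credibility: per-token credibility of the document each token comes from."""
    scaled = attention * credibility          # down-weight low-credibility spans
    return scaled / scaled.sum()              # renormalize so weights sum to 1 again

# Toy head: two tokens from a 0.8-credibility doc, two from a 0.1-credibility doc.
attention   = np.array([0.3, 0.3, 0.2, 0.2])
credibility = np.array([0.8, 0.8, 0.1, 0.1])
new_attention = cram_scale(attention, credibility)
# About 0.92 of the attention mass now sits on the credible document,
# matching the 92% figure in the slide's example.
print(new_attention.round(3), new_attention[:2].sum())
```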
Okay. So, which method should we choose? How do we decide? First of all, we know there's a need for credibility-aware RAG when we are not sure whether the documents are credible. The first thing to check is whether we have identifiable sources, sources like CNN, the BBC, or specific Twitter accounts. If yes, then the best thing to use is RA-RAG, because it has the best explainability and it scales efficiently. If we don't have identifiable sources, then we check whether we have training data: do we have examples, can we train the model? If yes, then use CAG, because it's one of the best overall, as we'll see later, and it generalizes pretty well. However, if we don't have training data either, then we have to go for the CrAM model; the good thing is that it gives you immediate results and, as we discussed before, requires no training. However, it won't be the most accurate, as we can see in the paper.
So here is the performance comparison of all three methods, from the papers I mentioned before. RA-RAG, with its multi-source strength: on Natural Questions it had an accuracy of 73.7% versus 63.4% for standard RAG, and on TriviaQA it had an accuracy of about 91.3% compared to only 81.2% for the standard baseline. For real-world validation, it had a correlation of about 0.99 with political fact-checking ratings, and they gave a full example of that. For scalability, it achieves a 99% token reduction with the top-k reliable-source selection. For the second method, CAG: on HotpotQA it had about 50.9% accuracy, which was an increase of about 82% over the base RAG model; for time-sensitive QA it had around 91% higher accuracy than the standard baseline RAG; for misinformation it increased by about 147% (it was still only 44.2%, but the relative increase was large); and for noise robustness it had 89% accuracy at 80% noise, versus about 77.3% for the baseline. The third one, CrAM, which needs no training but has little explainability: its accuracy on Natural Questions was not very good compared to RA-RAG, around 33.6%, while on TriviaQA it was around 59.9%, where it still beats CAG by a margin. However, in the adversarial setting it had accuracies of 91.3% and up to 72.2%, which was roughly equivalent to the Oracle performance.
So, finally, the summary comparison; we have discussed all of these points before. RA-RAG doesn't require training but needs source IDs; its performance is very high, its explainability is very high, and its setup speed is medium. CAG requires training but doesn't require source IDs; its performance is high and its explainability is high, but the setup is slow, because we have to train it and then test it. And the CrAM model doesn't require training and doesn't require source IDs; it doesn't have very good performance, only moderate to high, and it has low explainability, but it's very quick.
So, the limitations and open challenges. All methods depend on the accuracy of the credibility assessment; this is the garbage-in, garbage-out limitation: if the credibility labels are wrong, all the methods will fail. Each method also has some specific limitations. For RA-RAG, source reliability can drift over time, so periodic re-estimation is needed. CAG's performance depends on label quality and explanation coverage. For CrAM, the attention-head selection is task specific, and recalibration is needed for new domains, because we are not actually training it, so we have to recalibrate every time. The biggest fundamental problem: how do we accurately assess credibility in the first place? Misinformation evolves constantly, sources change in quality over time, and credibility is context dependent. But even with perfect credibility, what if all the credible sources share the same blind spots? We'll come to that in the next section, on pluralism and bias-aware source selection. So, thank you.
>> Okay. Any comments?
So, one thing, and I'll be very quick because I know we have a lot of sections: the time dynamic is something we haven't talked about a lot, but it definitely interacts with credibility. For things that are factual and that we know are time-persistent and somewhat static, we can do more credibility-aware verification. This goes back to the overview of the lecture: when we know things are temporally sensitive, maybe we cannot assume that credibility is that strong a requirement.
Okay. So Zahang is already online. Zahang, when you're ready, you can share part four. After part four we'll take a short break; I know it's already after seven. Whenever you're ready.
Yeah. Can you hear me?
>> Yes, we can hear you fine. Thank you.
>> Yeah. And see the screen?
>> Yes.
>> Okay. Just an apology in advance: I've got a cough today, so there may be some interruptions during the presentation.
So, in this section we will talk about pluralism and bias-aware source selection, and here is the overview of this section. We will focus on three parts: first, why credible rankings can still lack diversity or fairness; then, what kind of ranking behavior we actually want; and after that, how this can be implemented as a simple plug-in module in the pipeline, ending with some takeaways. Let's start with a very simple illustration of why credible is not the same as diverse or fair. In the picture on the left, I asked GPT-5 a factual question: in which year was Michael Jordan born? It does the right thing in terms of credibility: it cites Britannica, a wiki, and maybe some major news outlets, all high-credibility sources, which means we have filtered out the low-quality stuff. But in the picture on the right, we can see the Wikipedia disambiguation page for Michael Jordan. There isn't just one credible Michael Jordan: there is the basketball legend, but also a footballer, an actor, a researcher, and so on, all of them backed by factual sources. So even inside the credible pool, exposure can still be skewed: almost all of the rankings and all of the citations go to the basketball Michael Jordan and to a tiny set of dominant sites.
On this slide we see the differences in mathematical form. First, given the assumption that for every document d we already have a credibility or relevance score u_d, classical ranking just says: pick a permutation pi and maximize the utility, where v_k is the position bias, the user attention at rank k. If we sort documents by u_d, we maximize this utility. So this gives us a credible ranking, but it says nothing about diversity or fairness.
So far there have been many studies on search-result diversification, and you can click the link below to see a review. Here we use intra-list average distance (ILAD) as an example: it only looks at the pairwise distance between items in the list to measure diversity (you can see the formula), and the most traditional and widely adopted way of defining the pairwise distance is to use cosine similarity. But neither of these two formulas involves the credibility score u_d.
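A small sketch of intra-list average distance over document embeddings, where cosine distance is one minus cosine similarity and the embeddings here are random placeholders:

```python
import numpy as np

def ilad(embeddings):
    """Intra-list average distance: mean pairwise (1 - cosine similarity)."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T                                   # pairwise cosine similarities
    n = len(X)
    iu = np.triu_indices(n, k=1)                    # each unordered pair counted once
    return float(np.mean(1.0 - sim[iu]))

rng = np.random.default_rng(0)
ranked_list = rng.normal(size=(5, 32))              # 5 placeholder document embeddings
print(ilad(ranked_list))                            # higher value = more diverse list
```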
Then fair exposure is also a separate issue. We model the ranking as a probability matrix P, and the exposure of a document is the sum over positions k from 1 to K of P_{d,k} multiplied by v_k. So if we get the ranking by maximizing the utility alone, it may still produce very large, unfair gaps in exposure between groups.
Now let's talk about what kind of behavior we actually want from the ranking. First, think about pluralism at the list level. Here we reuse the ILAD signal: it is basically the average distance between all pairs of items in the list, so a higher ILAD means the items are less similar to each other. But pluralism is not only about pairwise distance; it is also about coverage over viewpoints or subtopics. Imagine there is a global set of stances or subtopics called S, and each document covers a subset S_d. Then we define the coverage of a list L as the number of different stances from S that are covered by the documents in L, divided by the total number of stances. A higher value simply means more stances or sources are represented.
And finally, at the bottom, the combined objective: when we pick a top-k list L, we don't only maximize the sum of u_d; we also add these two terms with weights lambda_1 and lambda_2, which trade off purely credible rankings against more pluralistic ones.
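Putting the pieces just described into formulas (a reconstruction from the verbal description; the exact weighting on the slide may differ):

```latex
% Utility of a ranking, diversity, coverage, and the combined top-k objective
U(\pi) = \sum_{k} v_k \, u_{\pi(k)}
\qquad
\mathrm{ILAD}(L) = \frac{1}{\binom{|L|}{2}} \sum_{\{i,j\} \subseteq L} \bigl(1 - \cos(e_i, e_j)\bigr)
\qquad
\mathrm{Cov}(L) = \frac{\bigl|\bigcup_{d \in L} S_d\bigr|}{|S|}

\max_{L:\,|L|=k} \;\; \sum_{d \in L} u_d \;+\; \lambda_1 \,\mathrm{ILAD}(L) \;+\; \lambda_2 \,\mathrm{Cov}(L)
```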
After addressing the issue of diversity, let's switch to fair exposure, which is about who actually gets seen in the ranking. First, we incorporate probability into the utility: instead of a fixed ranking, U can be written in matrix form as v-transpose times P times u, and we also require P to be doubly stochastic (every row and every column sums to one), so it really behaves like a soft version of a ranking. This form lets us link it to the exposure of a single document, shown here. Then, for a group G, we just average over its members to get the exposure and the utility of G.
With those quantities, we can write down fairness constraints. One example in the paper is called demographic parity: make the exposure of group one equal to the exposure of group two. Another is disparate treatment, which says that the ratio of exposures between groups should match the ratio of their utilities; in other words, exposure should be proportional to how useful or credible the group is.
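Collecting the exposure quantities and the two constraints just mentioned, with notation reconstructed from the verbal description in the style of the fairness-of-exposure literature:

```latex
% Exposure under a doubly stochastic ranking matrix P, and two fairness constraints
\mathrm{Exp}(d \mid P) = \sum_{k} P_{d,k} \, v_k,
\qquad
\mathrm{Exp}(G \mid P) = \frac{1}{|G|} \sum_{d \in G} \mathrm{Exp}(d \mid P),
\qquad
U(G) = \frac{1}{|G|} \sum_{d \in G} u_d

\text{Demographic parity: } \mathrm{Exp}(G_1 \mid P) = \mathrm{Exp}(G_2 \mid P)
\qquad
\text{Disparate treatment: } \frac{\mathrm{Exp}(G_1 \mid P)}{U(G_1)} = \frac{\mathrm{Exp}(G_2 \mid P)}{U(G_2)}
```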
So, in summary, this framework lets us control how exposure is distributed across groups while keeping the same underlying scores.
Next, let's move on to the "how" and show how this graph-based adaptive re-ranking works as a plug-in. We start from a very standard two-stage retrieval pipeline: stage one is a fast retriever that gives us a large candidate set I, and stage two is a slower ranker that scores a smaller set and produces the final ranking. In our case, we can treat credibility filtering as part of stage one, so by the time we enter this box, all documents are already in the credible pool. Inside the box, the extra structure is a corpus graph G, built offline: in the graph, each node is a document, and for each document we store its k most similar neighbors.
Online, we run the GAR loop with a frontier F. Step one is to take a batch of documents from the initial set I, score them with the re-ranker S, and append them to the final ranking R. Step two is, for every scored document, to look at its neighbors in the corpus graph and push them into the frontier with some initial pre-score. Step three is to take a batch from the frontier, score it with the same ranker S, append it to R, and again push their neighbors back into the frontier. The loop repeats until the scoring budget is hit.
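A compact sketch of that adaptive re-ranking loop is given below; the data structures and scoring calls are placeholders rather than the actual implementation.

```python
import heapq

def gar_rerank(initial, neighbours, score, budget=100, batch_size=16):
    """Hedged sketch of a graph-based adaptive re-ranking loop.
    initial:    candidate doc ids from the fast stage-one retriever
    neighbours: offline corpus graph, doc_id -> its k most similar doc_ids
    score:      the expensive stage-two re-ranker, doc_id -> float
    """
    ranking, seen, frontier = [], set(), []      # frontier: max-heap of (-prescore, doc)
    pool = list(initial)
    take_initial = True                          # alternate between pool and frontier

    while (pool or frontier) and len(seen) < budget:
        if (take_initial and pool) or not frontier:
            batch, pool = pool[:batch_size], pool[batch_size:]
        else:
            batch = [heapq.heappop(frontier)[1]
                     for _ in range(min(batch_size, len(frontier)))]
        take_initial = not take_initial

        for d in batch:
            if d in seen:
                continue
            seen.add(d)
            s = score(d)                         # spend scoring budget here
            ranking.append((s, d))
            for nb in neighbours.get(d, []):     # neighbours of good docs get a chance
                if nb not in seen:
                    heapq.heappush(frontier, (-s, nb))

    ranking.sort(reverse=True)
    return [d for _, d in ranking]
```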
These slides show how the policies actually steer the graph and the frontier. First, every document is tagged with some group label, such as source, region, or stance. The first set are pre-retrieval policies, which control how we build the neighbor set Z. Policy one is diverse group linking: keep only neighbors from other groups, so each node's neighbors already point across groups. Policy two is a balanced-neighbors quota, so each group appears equally often in Z. The second set are in-process policies, which control how we pick documents from the frontier or the initial set. Policy three is a frontier filter: when we add neighbors into the frontier, we skip same-group neighbors, so the frontier also remains cross-group. Policy four tries to keep equal group proportions whenever we build the candidate pool or a batch. Policy five is group-wise top scoring, so every group gets a fair share of the batch. And policy six is, within each group, to prefer neighbors discovered earlier in the loop and then order by score, because those are more likely to be relevant. Finally, when these policies, especially five and six, choose their per-group top documents, we can control pluralism by adding a diversity score, like a delta-diversity term. So at selection time we are respecting group balance, keeping credibility, and pushing the list towards being more diverse, all at the same time.
So, here are some quick takeaways and
that's all for me today. Thank you.
>> So, okay, we'll take a three minute
break uh and then we'll go on to our
next segment. So, uh if you need the
bathroom or anything, come back at uh
7:20.
Okay.
So, J Chang, are you online? J, I think, is on the next segment. I didn't see...
>> I wonder how we can give the PDF to the...
>> Uh, you can mail it to them. I think there are instructions in the information that presenters were given, but I'll double-check that.
>> Yeah,
>> it looks like if you were to go there and ask them, they'll have questions.
>> Yeah, they will ask you to go to, I think, support, and cc... let me double-check this. Oh, is it the assist address?
>> I think you can actually go here and then ask them, put it in the subject line, and then attach your file. But I'm going to double-check that that's how they want it. So I think, if you got the instructions from our STePS coordinators, they should have attached a PDF file giving you instructions on how to do the poster printing, but I'll just double-check.
>> Okay. Yeah, no, you don't have to be SoC staff; you just need to be registered as a poster presenter in STePS. Yeah.
So J, maybe you're ready.
>> Yeah, you can come up and present. I will go find out the answer to J's questions about posters.
But all the way on the right.
All the way on the right.
Right. Hello everyone. I'm J, and today I will talk about factuality and also attribution evaluation. I will first begin with the problem of backfilled citations. So what is a backfilled citation? It is a method of attribution where the model cites some sources, but these citations are added after the generation process is complete.
Maybe you can imagine the situation where a language model is answering a question about a famous person: it first directly generates the whole biography of that person, and only then does it begin to search the internet and randomly add some relevant or irrelevant citations to all of the answers it has already generated. This is different from attribution during generation, which is another mode of attribution where the language model first generates each sentence, then gets the attribution and verifies it. The difference between these two kinds of citation is that the first one requires very low resources and is very fast, whereas the second one is very costly.
So maybe some large models don't want to spend that much, and they just generate the answer based on their internal knowledge first and then add some relevant or irrelevant citations. The drawbacks of backfilled citations are, first, that the model did not actually use the sources, which may cause errors in factuality, and second, misattribution of the truth: there are cases where, although the model generates the right answer or the right facts, its citations are irrelevant, or the cited sources do not verify what it answered. So, with this problem of backfilled citations in mind, we will try to detect it and also try to avoid it. In this situation, factuality is very important, because if factual accuracy can't be satisfied in the model's generation, do we really need to care about the citations or the attributions? Actually, I think we don't, because if there are factual problems in the answer, then even if the citations are relevant or the citations can verify the answer, it doesn't matter, and it doesn't make much sense.
sense. So the next question is if the
models can answers something that is
that does not have many factor problems.
But this times the attribution
measurements is important here because
we need to verify whether the generating
statement is fully supported by the
cited reference because we maybe we need
to use this citations for some further
studies. So I will first talk about the
factorialities of of the long form text
generation. uh I I have to mention that
um this speciality u doesn't have
relations to the uh attribution variable
it just and verify whether the L models
generated generating generations has
factory problems or doesn't have a
doesn't have a factory problems. So
There are several well-known, classic methods for this, and the first one I will talk about is FActScore. FActScore is a new evaluation of the factuality of long-form text generation. Look at this picture: the model is asked to write a biography of a famous person, and ChatGPT generates a long-form text containing many facts about that person. Some of these facts are right and some are wrong according to Wikipedia, but earlier factuality evaluations would simply give zero points to both answers, because they judge the answer as a whole. The idea of this paper is to break the generation into a series of atomic facts and compute the percentage of atomic facts supported by a reliable knowledge source, which in this paper is Wikipedia. The good news is that this can distinguish between different answers: for example, the first answer gets a higher FActScore because it contains more correct atomic facts.
This is the definition of FActScore, which is easy to understand. Now the problem is how to get the atomic facts from a very long text, and the answer in the paper is that they use a language model to generate the atomic facts, essentially sentence by sentence. There are many sentences in the generation; they split it up and ask a model to produce the atomic facts for each one. A single sentence can usually be divided into many atomic facts, and these atomic facts are then collected and evaluated one by one.
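To make the definition concrete, here is a sketch of the score as I understand it from the FActScore setup (the exact notation in the paper may differ): the score of a generation is the fraction of its atomic facts that the knowledge source supports.

```latex
% FActScore of a generation y with atomic facts A_y, judged against a
% knowledge source C (Wikipedia in the paper):
f(y) \;=\; \frac{1}{|\mathcal{A}_y|} \sum_{a \in \mathcal{A}_y} \mathbb{I}\!\left[\, a \text{ is supported by } \mathcal{C} \,\right]
```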
They also found that using human evaluation to verify these atomic facts is very costly and time-consuming.
So they propose methods for automatic score estimation, and there are four methods used here. The first is a no-context language model, which directly judges whether an atomic fact is right or wrong. The second is retrieval plus a language model: it retrieves the top relevant passages from Wikipedia and asks the model to judge whether the fact is right or wrong. The third is nonparametric probability; the details are in the reference paper at the bottom. It can be seen as a purely retrieval-based method: it masks each token in the atomic fact, uses retrieval to score the masked tokens, and aggregates the results over every masked token to judge whether the fact is supported. The last is an ensemble method, which combines the retrieval-plus-LM method with the nonparametric probability to give a final answer. The results are shown there, but they are not the most important part, I think. The paper also finds that roughly 30% or more of both the supported and the unsupported sentences carry citations, which suggests that most citations have little impact on factuality. This is exactly what we call the backfilled citation problem: whether or not a sentence has citations, the rate of correct statements is about the same.
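As a rough illustration of the retrieve-then-judge estimator described above, here is a minimal sketch. The helpers `retrieve_passages` and `llm_judge` are hypothetical stand-ins for a retriever over the knowledge source and an LLM prompted for a True/False verdict; this is not the paper's actual implementation.

```python
# Sketch of a "retrieve + LM" atomic-fact checker: for each atomic fact,
# fetch relevant passages from the knowledge source and ask an LLM whether
# the fact is supported. The final score is the fraction of supported facts.

def estimate_factscore(atomic_facts, retrieve_passages, llm_judge, top_k=5):
    """Fraction of atomic facts judged as supported by retrieved evidence."""
    if not atomic_facts:
        return 0.0
    supported = 0
    for fact in atomic_facts:
        passages = retrieve_passages(fact, top_k=top_k)   # hypothetical retriever call
        prompt = (
            "Evidence:\n" + "\n".join(passages) +
            "\n\nIs the following statement supported by the evidence? "
            "Answer True or False.\nStatement: " + fact
        )
        verdict = llm_judge(prompt)                        # expected to return "True" or "False"
        supported += int(verdict.strip().lower().startswith("true"))
    return supported / len(atomic_facts)
```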
The next paper is also about the factuality of long-form text generation; it's called VeriScore. The key idea is that, in the previous study I mentioned, FActScore, not all extracted claims are actually verifiable. We can see this on the left side of the picture: claim two and claim six depend on context, and some of the claims are simply not verifiable. For example, claim six carries no information about which entity it refers to, so these are not really what we would call facts. So this paper points out that not all claims are verifiable, and it also finds that facts may depend on context. There are many ambiguous references, for example the pronoun in the sentence "its large and starchy, sweet-tasting tuberous roots": when the previous per-sentence method extracts facts, it keeps the pronoun but doesn't know what "its" refers to. So the atomic fact generation in FActScore is not ideal, because it ignores the context of the generation. In this method, they add prompting that asks the LLM to generate the atomic facts while considering context beyond the current sentence, so that pronouns can be resolved to their real entities, which produces better atomic facts. Also, for the score calculation, the researchers found that if a model generates only a few facts, or even no facts at all, it can still get a very high precision score. So they added a recall score that penalizes this situation, and the recall is simple: K is roughly the number of facts a model is expected to generate (based on how many facts models typically produce), and S(r) is the number of facts that are verified as correct.
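Roughly, the precision and length-aware recall described above can be written as follows; this is a sketch based on the description in the talk, and the paper's exact choice of K and edge cases may differ.

```latex
% S(r): number of supported claims in response r; C(r): all extracted claims;
% K: a reference number of claims (e.g., what models typically produce).
\mathrm{Prec}(r) = \frac{S(r)}{|C(r)|}, \qquad
\mathrm{Rec}_K(r) = \min\!\left(\frac{S(r)}{K},\, 1\right), \qquad
F_1@K = \frac{2\,\mathrm{Prec}(r)\,\mathrm{Rec}_K(r)}{\mathrm{Prec}(r) + \mathrm{Rec}_K(r)}
```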
Next I will talk about automatic attribution measurement. In the second part of our lecture today, a benchmark called AttributionBench was mentioned, which is relevant to what we're discussing here. The issue is that manually verifying whether a generated statement is fully supported by the cited reference is costly and time-consuming, so they use large language models to do the verification. Different from the previous paper, they propose a classification frame with three labels: the first is attributable, which means the reference fully supports the generated answer; the second is extrapolatory, which means the reference lacks sufficient information to validate the generated answer; and the third is contradictory, which means the generated answer contradicts the information presented in the reference. Let's look at some examples.
There are three categories among them. The first example is attributable: the model answers the question and gives a reference for the answer, and when we open the reference, which is a website, we find that the facts are there and we can attribute our answer to it. The second example has problems, because the citation doesn't actually answer the question; it just offers something loosely relevant to the model's answer. The third one is the contradiction case, where there are contradictions between the generated answer and its citations. This paper gives two methods to address the problem: the first is to prompt models with a clear evaluation instruction to judge whether the citation really supports the answer, and the second is to fine-tune some large models on a set of diverse, repurposed datasets.
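For illustration, here is a minimal sketch of what a prompt-based check with the three labels above might look like. The `llm` callable and the instruction wording are assumptions, not the paper's actual prompt.

```python
# Sketch of a prompt-based attribution check with the three labels discussed
# above (attributable / extrapolatory / contradictory). `llm` is a hypothetical
# text-completion callable.

LABELS = ("attributable", "extrapolatory", "contradictory")

def classify_attribution(question, answer, reference, llm):
    prompt = (
        "You will verify whether a reference supports a generated answer.\n"
        "Reply with exactly one label:\n"
        "- attributable: the reference fully supports the answer\n"
        "- extrapolatory: the reference lacks enough information to validate the answer\n"
        "- contradictory: the reference contradicts the answer\n\n"
        f"Question: {question}\nAnswer: {answer}\nReference: {reference}\nLabel:"
    )
    label = llm(prompt).strip().lower()
    return label if label in LABELS else "extrapolatory"  # conservative fallback
```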
They test their methods on automatic attribution measurement and find some misclassified examples that are very interesting. The most common problem is that when the LLM is asked to judge whether a citation really verifies an answer, the citation can be very long and contain a lot of nuanced or fine-grained information that the model fails to detect, so it misclassifies the example. The second problem is the language model misunderstanding the task definition and the logical relations implied by the labels. The third problem is the model failing on symbolic operators: for example, there are some math operations or calculations that the model cannot handle, so it judges them wrongly. These findings complement the AttributionBench paper I mentioned, as I've shown here.
Okay, that's all for my part. Do you have any questions?
Okay, let's check on our next speaker.
>> Okay, let's go directly to Eric then.
Yeah thanks.
>> I did want to say one small thing about what J Chong presented. You'll notice that a lot of the research we do at NUS is connected to the Centre for Trusted Internet and Community, and our coordinator for today's lecture is working very heavily on that. Some of what was presented was about attributable, extrapolatory, and, the last one,
>> contradictory. These are very similar to what we do in fact-checking research. In fact-checking, you want to be able to say that a source verifies a fact, or whether there's enough information to conclude it or not. In certain cases we have that third category, extrapolatory, which other datasets call "not enough information" or NEI. And then there's the refutes category, which says there is information that contradicts the claim. So you may see two different sets of terminology from what was presented, but if you also read fact-checking papers, those three categories are also valuable.
Okay take it away. Thank you so much.
>> Okay, so I'll be talking about answer calibration and abstention here. This paper talks about how, traditionally, white-box methods are used to determine the confidence of models: typically you look inside the model internals at the probability of the next token being generated. However, it states that there are several problems with this. One is that many models are closed source, so you can't access the model internals. Another is that a high probability for the next token does not mean actual confidence: for example, in the sentence "chocolate milk comes from brown cows", each next word might make sense after the previous one, but the sentence is not semantically accurate. So this paper looks at black-box methods. It involves several stages: prompting the model to ask for confidence scores, sampling it, and using various aggregation strategies to come up with a final confidence score.
Yeah. Some of the prompting strategies include a vanilla strategy, which just asks the model for its confidence score. Others worth noting include self-probing, which is inspired by the fact that if you ask a human to evaluate someone else's answer, they generally provide a more accurate evaluation; here it involves asking a question in one chat, then starting a new session and asking the LLM to give a confidence score for that answer. Other prompting strategies include multi-step, which breaks the problem down into K steps, gets a confidence at each step, and then aggregates them into one confidence score, and top-k, which asks the model to provide multiple guesses; this at least makes the model aware that multiple different answers are possible and induces it to produce a more honest confidence. So yeah,
before the evaluation, let's talk about the aggregation strategies. One aggregation strategy is consistency: you have an answer, you ask the question again and again, and you see how many of the sampled answers are the same as the given answer. Another is average confidence, which is similar to consistency but uses the actual confidence generated for each agreeing answer. And then there is the pairwise approach, which uses the ranking among the sampled answers to create a distribution over answers.
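To make the two simplest aggregation strategies concrete, here is a toy sketch, not the paper's exact implementation. Each sample is assumed to be an (answer, verbalized confidence) pair obtained by re-asking the model; the sampling itself happens elsewhere.

```python
# Toy sketch of two aggregation strategies: consistency counts agreement with
# the main answer, average confidence additionally weights agreeing samples by
# the confidence the model verbalized for them.

def consistency_confidence(main_answer, samples):
    """Fraction of sampled answers that agree with the main answer."""
    if not samples:
        return 0.0
    agree = sum(1 for ans, _ in samples if ans == main_answer)
    return agree / len(samples)

def average_confidence(main_answer, samples):
    """Like consistency, but agreeing samples contribute their own stated confidence."""
    if not samples:
        return 0.0
    return sum(conf for ans, conf in samples if ans == main_answer) / len(samples)

# Usage: average_confidence("Paris", [("Paris", 0.9), ("Lyon", 0.4), ("Paris", 0.7)])
```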
So, in terms of evaluation, the paper uses two main metrics. One is calibration, which evaluates how well the confidence level aligns with correctness: if a model says it's 80% confident in its answer, then it should produce the correct answer about 80% of the time. The other is failure prediction, which says that correct answers should be given higher confidence than wrong ones, so the confidence is able to differentiate between correct and wrong predictions. Some of the findings of the paper are that models tend to be overconfident, and that the different prompting strategies did reduce overconfidence compared with the vanilla prompting strategy.
Some other findings include that having more samples does improve failure prediction, and that the pairwise method used here had the best calibration, while average confidence had the best failure prediction, probably because average confidence uses the weighted confidences to determine the confidence of the answer.
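One standard way to put a number on the calibration idea above is Expected Calibration Error (ECE); the paper may use a different or additional metric, so treat this only as a minimal sketch of the general concept.

```python
import numpy as np

# Minimal ECE sketch: bin predictions by stated confidence and compare each
# bin's average confidence to its empirical accuracy, weighted by bin size.

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)          # 1.0 if the answer was correct, else 0.0
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        mask = (confidences >= lo) & (confidences <= hi) if i == 0 \
            else (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return ece
```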
Then I'll talk about another paper, which is about long-form generation. In the previous paper, many of the equations treat correctness as binary, roughly an indicator that the generated answer y matches the gold answer, so every answer is either correct or wrong. But if a confidence score says "I'm 80% confident" for long-form generation, it's ambiguous: it could mean "I'm 80% confident that the answer is 100% correct" or "I'm 100% confident that the answer is 80% correct". What you want for long-form generation is to know that the model is x% confident that the answer is y% correct. So basically you have a distribution, and you want to check it against the actual correctness distribution.
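Here is a toy illustration of that distinction, assuming we already have repeated self-evaluation scores in [0, 1]; it is only a sketch of the general idea, not the metric used in the paper.

```python
import numpy as np

# Instead of a single probability that the answer is fully correct, keep a
# distribution over "fraction correct" and compare it to the observed fraction.

def correctness_distribution(self_eval_scores, n_bins=10):
    """Empirical distribution over correctness fractions from repeated self-evaluation."""
    hist, _ = np.histogram(self_eval_scores, bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def distribution_gap(predicted_dist, actual_fraction, n_bins=10):
    """Earth-mover-style distance between the predicted distribution and the observed fraction."""
    actual_dist = np.zeros(n_bins)
    actual_dist[min(int(actual_fraction * n_bins), n_bins - 1)] = 1.0
    return np.abs(np.cumsum(predicted_dist) - np.cumsum(actual_dist)).sum() / n_bins
```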
How they elicit confidence is somewhat similar. They use a method called continuous self-evaluation, which involves evaluating the same answer over and over again and building a probability distribution from those evaluations. Another method they use is self-consistency, which involves generating answers repeatedly, comparing how similar they are, and producing a confidence score based on that. These are the various methods they use to compare similarity, which include, for example, matching based on claims or named entities.
They also use calibration, similar to the last paper; the difference is just that instead of the probability that y is fully correct, you now have the probability that y is correct to some level s. So that's all there is. That's all.
Right. Well, we discussed calibration a little bit in the earlier part. So, does everyone understand what calibration means here?
Can I invite somebody to give it a shot? What does confidence calibration, or error calibration, or things of this sort mean?
It should mean that the model is well aligned with its verbalized confidence: if the model says "I'm 80% confident", we would expect it to actually be correct for about 80% of such samples.
>> Right, okay. So a good domain that's well calibrated is something like weather forecasting, because when we do weather forecasting, we can look over, say, a hundred days where the weather was exactly like this at this hour and see how accurate the forecast was two hours later. Based on that percentage, we can say that 80% of the time it did or didn't rain. Those numbers are based on forecasting and then checking. But humans are not well calibrated, and that means large language models aren't either, because they're trained on human data; when they're asked to give percentages, it's basically a next-token prediction task. So unless you have trained and calibrated the model, those numbers would not necessarily be well calibrated, and those are big problems. What type of percentages do you think humans are bad at being calibrated about?
>> Erin, maybe you can say a bit too.
>> Sorry?
>> Well, what percentages do you think people are bad at determining? Do you think people really don't understand, say, 100% correct, 50% correct, 70% correct, 10% correct? What are some values that you think might be off?
I think 100% might be off. Yeah, like maybe you just forget the off chance that something could be wrong.
Yeah, that's okay.
Right. So actually, if you read a lot of the cognitive science literature, people are really bad at interpreting probabilities, and we're quite irrational when we do. You'll see people avoid very low-probability events just because they think it might happen to them; think about being struck by lightning or crashing in an airplane, things of that sort. So low-probability events, just like what Aaron said, are very hard, and so are mid-probability events: exactly what it means for something to happen 30% of the time is something people have a hard time calibrating. Language models, by proxy, because they're trained on large amounts of human language data, are also going to be poorly calibrated on those types of events. So this calibration step that Erin was pointing out on the last couple of slides is actually very important to do, and we see that in many domains where you have verifiable outcomes that you can roll out, for example in math reasoning, so that you can calibrate the uncertainty more properly.
Okay. Uh so I think our last segment is
from Chai. So uh Chai, over to you.
Thanks Aaron.
>> Okay, so this will just be a brief discussion on system design and reporting standards for building RAG systems with credibility. As far as I can see, there is no ground truth or full consensus on this, so I'm just speaking from my own perspective. First, let's talk about credibility with regard to the retrieval process alone, without actually touching the generation itself. In terms of scope, let's focus on the retrieval index and how the retrieval problem is framed; the goal is that we want the system to output answers that are defensible and auditable by the audience.
Here I'm trying to frame this from five perspectives: what we, as designers of the system, should publish and maintain. It starts with a system card, covering how we retrieve and rank sites; then the index data sheet, which is about the data we use in our retrieval system: what data we index and how we curate or update it. The third part is the evidence trace: when we see an answer output by the system, do we know which evidence actually supported it? The fourth part is the retrieval and ranking configuration: how do we make it transparent and credible for the users, or stakeholders, of the system. And the final part is the KPIs or diagnostics of the system.
First, let's talk about some sort of RAG card. This is more like a model card for a credibility-aware RAG system, and the purpose is to propose an audience-friendly document for the RAG layer. One important thing is to document the intended use cases and also the out-of-scope cases, such as when a user asks for health advice, where the model should be more cautious. Here I'm drawing on the paper "Model Cards for Model Reporting" from the FAT* 2019 conference, which has an "intended use" section for reporting a model, and I think the same should be done when we report such a system.
For the credibility controls, I'm thinking about how we weight the different components of credibility: what caps we put on diversity, whether the data is recent, how recency is handled, and what thresholds the model maintains. Then there are also the known failure modes of the system, such as whether the index can become stale after some time, or source imbalances. And finally, the versions and dependencies, such as the embedder and reranker versions, as part of the RAG system's card.
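As a concrete illustration of the kind of record such a RAG card could be, here is a minimal sketch loosely modeled on model cards. The field names and example values are assumptions, not a standard schema.

```python
from dataclasses import dataclass, field

# Illustrative "RAG card" record: intended use, out-of-scope cases, credibility
# controls, known failure modes, and component versions, all explicit and versioned.

@dataclass
class RagCard:
    intended_use: list[str]                      # e.g., internal research Q&A
    out_of_scope: list[str]                      # e.g., personalized health advice
    credibility_controls: dict[str, float]       # thresholds / weights applied to sources
    known_failure_modes: list[str]               # e.g., stale index, source imbalance
    component_versions: dict[str, str] = field(default_factory=dict)  # embedder, reranker, generator

card = RagCard(
    intended_use=["course project Q&A"],
    out_of_scope=["medical or legal advice"],
    credibility_controls={"min_source_credibility": 0.6},
    known_failure_modes=["index staleness after 90 days"],
    component_versions={"embedder": "v1", "reranker": "v2"},
)
```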
The second part is the index data sheet. This is about the data sources our RAG system retrieves from, and what we should record or expose about that data: what the composition of the dataset is, such as the domains and institutions, and what data we include or exclude. Then, how we maintain the data through curation, such as crawl cadence and retraction handling, and what window we set for data freshness. The third item is the licensing, opt-out, and robots policies, especially in this era of generated content. The fourth is known biases; here I'm thinking not so much of social bias but of things like low-resource languages, where the corpus might be over-reliant on English text rather than text in other languages and cultures. And finally the last-updated timestamp: what is the knowledge cutoff of our data?
The third part is the evidence trace: can we make every sentence of the answer traceable to exact evidence, and make it detectable when a sentence is not actually linked to any? Here I'm thinking that for every answer, the system should be able to provide the retrieved document IDs, their source, and the starting and ending offsets, for instance the character or byte positions in the evidence, so that we basically get a span extracted from the longer retrieved evidence text. The second idea is some sort of remove-the-evidence check. This one is purely motivated by the attribution benchmark, which found that in many cases a citation exists but does not actually support the answer, meaning the citations are more decorative than meaningful. So I've been thinking of a remove-the-evidence check: what if we give the model the evidence in a first pass and measure the answer quality, then remove each piece of evidence in turn and check the answer quality again, to show which parts of the evidence are actually strongly used by the model and which parts are more decorative. This is purely my own opinion; I'm just citing the attribution benchmark here as the motivation.
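Here is a minimal sketch of that leave-one-out evidence check. The callables `generate` and `score_quality` are hypothetical, and evidence items are assumed to be dicts with an "id" field; this is an illustration of the idea, not an established procedure.

```python
# Sketch of the "remove the evidence" check: generate with the full evidence
# set, then drop each piece and see how much the answer quality degrades.

def evidence_influence(question, evidence, generate, score_quality):
    """Map each evidence id to the quality drop when that evidence is removed."""
    full_answer = generate(question, evidence)
    full_score = score_quality(full_answer, question)
    influence = {}
    for i, doc in enumerate(evidence):
        reduced = evidence[:i] + evidence[i + 1:]
        ablated_answer = generate(question, reduced)
        drop = full_score - score_quality(ablated_answer, question)
        influence[doc["id"]] = drop      # near-zero drop suggests a decorative citation
    return influence
```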
There should also be timestamps for cited sources, such as publication and retrieval dates, and there should be visible citations in the response, linked to the relevant excerpts.
This part is more about the retrieval and ranking configuration: how do we document this tunable mix of relevance, credibility, diversity, and recency of the evidence so that it is versioned and auditable? For the components, I think it's quite standard to report all the model cards and versions for reproducibility and transparency. Then, assume we are designing a system that is well-rounded and credibility is just one part of it: say we have four weights corresponding to four aspects, relevance, credibility, diversity, and recency, and we should also document how we tune those weights in our released system. The third item, which I think is quite important, is the profile. What is the preset of our system? Is it mostly conditioned on safety, so we should set a high weight for credibility? Is it conditioned on recency, so we must prioritize breaking news? Or is it more about pluralism and diversity, so we must prioritize diversity and set a higher weight for that? And then there are also fusion and top-K techniques: when we get a list of evidence or signals, how do we combine them and select the top ones.
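To show what a documented, tunable mix with named profiles could look like, here is a minimal sketch; the weights and profile names are illustrative assumptions, and the point is only that they are explicit and reportable rather than hidden inside the system.

```python
# Weighted ranking mix over per-document signals, each assumed normalized to [0, 1].

PROFILES = {
    "safety_first":  {"relevance": 0.35, "credibility": 0.45, "diversity": 0.10, "recency": 0.10},
    "breaking_news": {"relevance": 0.35, "credibility": 0.20, "diversity": 0.10, "recency": 0.35},
    "pluralism":     {"relevance": 0.35, "credibility": 0.20, "diversity": 0.35, "recency": 0.10},
}

def ranking_score(doc_signals, profile="safety_first"):
    """Weighted mix of a document's signals under the chosen profile."""
    weights = PROFILES[profile]
    return sum(weights[name] * doc_signals.get(name, 0.0) for name in weights)

# Usage: sorted(candidates, key=lambda d: ranking_score(d["signals"], "breaking_news"), reverse=True)
```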
Finally, there is the credibility signal source: a very brief note on where the reliability priors come from. For instance, one example is a media bias rating website that assigns a bias score to each news site, indicating what specific biases their reporting reflects. If we are drawing on such sources, I think that should also be reported as part of what our credibility signal is conditioned on.
>> Go back. So here, when you're talking about profiles, are you talking about profiles that a system would expose, or profiles that a user would set when using the system? What are the profiles attributed to?
>> Yeah, I'm thinking more from the design aspect, because, like,
>> a designer of the system,
>> right. Assume I am already packaging this system and releasing it. It would be good to inform the user how we are setting the different weights for the various aspects. But I think it can also be done from the user's perspective, conditioned on whether the user has access to tune them.
>> Right. Yeah, you're saying it could depend on the user's intention, whether it should be
>> Yeah, I mean all these dimensions are possible. As a provider, let's say OpenAI for example, they could say: I have three personas, you know, safety first, breaking news, or breadth first, and you can select them. Or you could condition on the query: you could say, oh, I think this is a particular type of query, just like what we've heard before about things that might be temporally sensitive. If it's temporally sensitive, maybe you would choose the breaking-news-first profile, since that's more likely to be the right profile for this type of query. It could also be a combination: you can ask the user what their preference is, like, "I've given you breaking news first based on your type of query, but if you want me to rerank or regenerate my response to make sure that verified claims come first, I can do that for you." Many language models now prompt you: okay, what's the follow-up action after a satisfying response?
>> Yeah, I think it could be both: the system sets a default, and whether to reset it, or which option to choose, should be up to the user.
Then I'm thinking of some KPIs, or how we monitor our credibility-aware system, and here I'm providing some examples. For instance, using FActScore: what percentage of atomic claims in the answers is factually supported by the retrieved evidence. Source diversity: the degree to which the top-k passages cover distinct source groups and sectors. Third, some uncertainty measures: when we want the model to express uncertainty, perhaps we want to capture abstention quality, such as how well the system's "not enough information" or "not enough credible evidence" responses actually match reality, which could be summarized by a numeric calibration error.
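As one small example of these KPIs, here is a sketch of the source-diversity measure mentioned above; the grouping function is an assumption (for instance mapping each document to an outlet, sector, or region).

```python
# Fraction of distinct source groups among the top-k retrieved documents.

def source_diversity(top_k_docs, group_of):
    """1.0 means every retrieved document comes from a different source group."""
    if not top_k_docs:
        return 0.0
    groups = {group_of(doc) for doc in top_k_docs}
    return len(groups) / len(top_k_docs)

# Usage: source_diversity(results[:10], group_of=lambda d: d["domain"])
```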
And then some sort of risk-coverage trade-off for abstention. I also have some closing discussion questions; I designed those in case we had more time, but since we are about three minutes over, let's just keep those questions in mind as a closing discussion.
So this summarizes our lecture.
>> Okay, let's thank Chai for the presentation.
So maybe what we can do is take this closing discussion and put it on the general channel, and I invite all of you next week, if you are coming to our social, to talk about some of these over a snack or a drink, because they are very important. One of the things that I think about a lot is the loss of locus of control for people using large language models, because every time we delegate to a large language model, that is something we're not doing ourselves. So it's likely that people who are overly reliant will become commoditized; they're not useful anymore. What we want to do as a society is bring people along for the ride rather than just make large language models more capable. We really do want large language models to keep getting better, but we want our population to be getting better, too. A lot of it may come down to some of the discussion questions: is it possible to instill more critical evaluation of information in our own users of LLMs, for example by prompting an LLM to call attention to its sources? Twitter used to require that in order to like or reshare a post you must have at least clicked on the source you're citing; you couldn't share it without clicking on it. That turned out to be a terrible intervention for Twitter, because they lost a lot of traffic, so they removed it, but you can understand that this type of intervention would actually provide more factuality and more checking, and would inspire more skepticism before spreading potentially fake news. So I'd like you to think about that, as everyone in this room is pretty much a designer of AI for the next 15 to 20 years. These are problems that are really worth solving, and we are only touching upon tiny bits of them in this class. Okay, so let's thank Jiaying and all of our presenters again.
Okay. And we'll see you next week. I just want to quickly note an answer to Jane's question earlier. If you go to our threads and you go to projects, you will see information about how to print a poster; this is in the projects channel. It tells you when you can print the poster. I printed this out: it's actually from the FAQ site within the STePS system. You go to the 27th STePS page, you'll see there's an FAQ link, you scroll down, and here is the information, which I have screenshotted for you in Slack. So right there is what you need to do: go to RT.cip.nus.edu, select your category, select "others", say it is event-related A1 poster printing for the 27th STePS, and then attach your PDF file to get it submitted. Once you submit the file, it's going to be printed, all your typos and all. Then, when it has finished printing, you will get an email. Earlier I think JF asked whether you need to pick it up right away. No, you can just leave it there. During STePS they have something like a hundred posters, so it may be a little bit hard to manage, but they will print it and put up your poster on the day of the event, and then
>> so that means you may not even have to physically go down. You can just do this online, in the comfort of your air-conditioned home, in your PJs, and send the file.
>> So we just need to submit this by next Tuesday, and it can be collected on Wednesday?
>> Yeah. It says here the 7th, 10th, and 11th, so that's Monday and Tuesday, I think, because this one must be Friday.
>> Okay.
So that's also here; you can just go through the slides for the exhibitors, which go out to all of you, and there's another page that tells you about costs. Okay, so yeah, do remember high resolution, because they are A1 size; they're bigger than your monitors, unless you have a really big monitor. It should be fine. Okay, so thanks everyone. That's all for today. Thank you.
(Lecture Starts at xx:xx)
Slides for this session: http://soc-n.us/cs6101-t2510-w12
Video Playlist: https://soc-n.us/cs6101-t2510-playlist
CS6101/DYC1401 Retrieval-Augmented Generation, Week 12: 6 Nov 2025, AY 25/26 Sem 1 (T2510)

Summary by Zoom AI

AI Credibility Survey Overview
The meeting focused on discussing the credibility and trustworthiness of sources in AI-generated content. Miao presented an overview of a survey that evaluates the trustworthiness of AI retrieval and generation systems across six aspects: robustness, privacy, fairness, transparency, factuality, and accountability. The survey, which includes experimental evaluations using language models, found that GPT-4 performs well in most areas but lags in privacy.

Content Verification in the Digital Age
Niall presented on the challenges of verifying content authenticity in the digital age, highlighting how traditional verification methods are inadequate against modern AI-generated content. He introduced the concept of content credentials, which use cryptographic signatures to create a tamper-proof chain of custody for media content, and discussed a recent paper that evaluated attribution accuracy in content verification, finding that current models achieve only an 80% accuracy rate.

AI Credibility and Energy Challenges
Kan presented a framework for verifying and attributing generated text to its sources, introducing the concept of explicators for context handling and discussing methods for dealing with missing or forged credentials. The discussion highlighted the challenges of maintaining credibility in search engines and the high costs associated with RAG (Retrieval-Augmented Generation) traffic, particularly for open-source platforms like Wikipedia.

Credibility-Aware RAG Solutions
Nikhil presented Section 3 of the presentation, focusing on the challenges and solutions related to document credibility in RAG (Retrieval-Augmented Generation). He explained how low-quality documents can mislead the RAG pipeline, leading to decreased performance of LLMs. To address this issue, Nikhil introduced the concept of Credibility-Aware RAG, which weights documents based on their credibility rather than just relevance. He discussed three methods to integrate credibility: Reliability-Aware RAG (RAD), Credibility-Aware Generation (CAG), and Credibility-Attention Modification (CRAM). Nikhil compared the performance of these methods, highlighting their strengths and limitations.

Credibility and Diversity in Rankings
The meeting focused on discussing the relationship between credibility and diversity in ranking systems, with Fu presenting a framework for pluralism and source selection. The discussion covered how credible rankings can lack diversity, and introduced a new ranking model that combines credibility with diversity and fairness metrics. Fu presented a graph-based adaptive re-ranking method that uses a two-stage retrieval pipeline, incorporating policies to ensure balanced group representation and fair exposure across different sources.

Factuality in AI-Generated Text
Jason presented on factuality and attribution measurements in AI-generated text. He explained the difference between backfilled citations and real-time attribution, noting that backfilled citations are faster but may lead to errors and misrepresentations of truth. Jason discussed the importance of factuality in AI-generated content, particularly for technical topics, and introduced the concept of the fact score as a new evaluation method for measuring the accuracy of long-form text generation.

Advancing LLM Evaluation Standards
The meeting covered several topics related to the development and evaluation of large language models (LLMs). Kan presented on methods for automatic scoring of factual statements and the challenges of context-dependent claims. Jicheng discussed a benchmark for attributing statements to their sources, highlighting the need for transparent and auditable systems. Jared explored confidence calibration in LLMs, presenting various methods to improve model accuracy. Jai Ying proposed design reporting standards for credibility in retrieval systems, emphasizing the importance of transparency and traceability. The discussion concluded with thoughts on critical evaluation of information by LLM users and the need to balance AI development with human capability improvement.

00:00:00 Kahoot! from W11
00:07:31
00:17:20 Section 1: Introduction (Yisong)
00:42:32 Section 2: Provenance and Authenticity: Beyond Text (Nayanthara)
00:57:15 Section 3: Credibility-Aware RAG Methods: Placing Trust Inside the System (Nikhil)
01:19:20 Break
01:24:05 Section 4: Pluralism & Bias-Aware Source Selection (Zihang)
01:45:25 Section 5: Factuality & Attribution Measurement (Jiecheng)
01:45:25 Section 6: Uncertainty, Calibration & Abstention in Credibility-Aware RAG ()
01:45:25 Section 7: Discussion: System Design and Reporting Standards (Jiaying)