Thank you so much for coming, and thanks to Weights & Biases for inviting me to talk about the type of work we do at Learnosity. This talk is all about evaluation and our journey through it: moving beyond the initial vibes we were relying on and towards a more experimental, evidence-based approach for selecting the best models and prompts to use for our clients.
My name is Sean McCrossan, and that photo is a very old photo of me, as you may be able to tell. I work as a data scientist at Learnosity. I don't know if anyone here has actually heard of Learnosity before; we're quite niche. We don't really have a platform of our own, but we power most of the educational platforms that want to use our assessment APIs.
For the past two years we've been really focused on AI, because there are huge opportunities for assessment in this space. I think we all remember when ChatGPT first came out and we got really excited about the potential of integrating it into our product. Our first product was Author Aide, a content generation product where the user can ask to generate a new question without having to write it themselves. Feedback Aide was really exciting for Learnosity: we do a lot of scoring, and some item types, like multiple choice questions, are easy to score, but essays, short responses and the like have always been notoriously hard to score without the aid of AI, so that one was really exciting for us. And then our most recent product is Health Check, which enhances and modernizes old content our clients may have. For example, if you're a client who wrote some questions back in the early 2000s, they may contain archaic language or be a bit biased, and we use AI to rewrite those questions and make them reusable again.
As I said, this talk is all about evaluations, so let's start with what can happen if we don't evaluate. We've all seen news stories in the past year about people who used AI for certain things, and hallucinations and potential bias really damaged their reputations; most recently, probably, Deloitte in Australia, who used AI to write reports. I'd almost guarantee that if they had spent a bit of time evaluating, they would have caught those kinds of hallucinations.
But it's not just about avoiding embarrassment. We also evaluate to make sure we get the best possible output for our customers. Think of something as simple as a teacher giving feedback on an essay: by the 25th essay, late at night, the feedback can get a bit dry. It might become a "very good" or a "you could do better here" that isn't really actionable or supportive, and that doesn't link back to evidence in the essay. So we think hard about the best possible output we want from our AI, and that's what we evaluate for. Another thing is that it elevates our value. Even when interviewing people who come from AI companies or who have been using AI, we see that they don't tend to do as much evaluation as you might think. So by evaluating, and by showing our customers that we evaluate, we really distinguish ourselves from some of our competitors. When a customer is worried about hallucinations, we can tell them we've evaluated for that: we've run over 100 prompt and model combinations, and we can show them that the prompt and model combination we've chosen gets a hallucination rate of less than 0.1%. That really does build trust with our customers.
However, it wasn't always like that. When we first started working with AI, we just wanted to jump right in and build these use cases. When I joined, just over a year ago, we had these prompts and models and we'd look at the output; a lot of the time I'd ask someone how they tested it and they'd say, "well, we tested it and it looks good." So we went off vibes a lot. People also had different ideas about what to put in their prompts and which models to use, and they'd say things like, "I read this blog and apparently three-shot is the best way." It would always come back to: how do we know that's the best way for this use case? It might be the best way for a different one. So we set up a fairly basic framework. At the beginning of every evaluation task we research, just to understand what our output is supposed to be. We then plan it out; this is a newer addition to our framework, and it really helps us keep the balance between speed and rigor, because we do need to get something out. Then, of course, data: data has been important for every data science task, and evaluation with LLMs is no different. The only real difference now is that we tend to go for synthetic data more often than collecting and cleaning real data. Then we build the pipeline: we set up our experiments and our metrics in Weave and run them. And finally we report and recommend, to different audiences.
I'm going to go through a use case: one of the LLM calls we use to generate distractor rationales in our Health Check product. You probably haven't heard of a distractor rationale; I hadn't either until about three months ago. Essentially it's a bit of feedback attached to each MCQ option. For the correct answer it should confirm that you got the right answer; for an incorrect answer it tells you it's incorrect and gives you some context as to why you might have chosen that option. This is something we use AI to generate, where traditionally content creators themselves might not have bothered because it took a bit of time to do.
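To make that concrete, here's an illustrative example of the shape of such an item; the question and rationale text are invented for this write-up, not taken from a client.

```python
# Illustrative MCQ item with a rationale per option; content is invented.
item = {
    "question": "Which planet is closest to the Sun?",
    "options": {
        "Mercury": "Correct. Mercury orbits nearest to the Sun.",
        "Venus": "Incorrect. Venus is the hottest planet, but it is the second closest to the Sun.",
        "Mars": "Incorrect. Mars is bright in the night sky, but it orbits further out than Earth.",
    },
    "answer": "Mercury",
}
```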
Our first step is research: defining what makes a good distractor rationale, what attributes it should have. We learn this by reading academic papers or blogs, and particularly by talking to our in-house experts and our customers, who tell us things like: a good distractor rationale should never leak the answer when you select the wrong option; it should be supportive of the student; and it should use language suited to the audience. If the audience is a 10-year-old we should use very simple, basic language, and if the audience is a master's student we can be a bit more sophisticated.
Then we look at how to measure this. Has anyone done any measurement on this? Has anyone worked on this task before? Our tasks are typically quite niche, so not many people are generating distractor rationales, but for other use cases, like essay grading, a lot of work has been done, and we want to understand what measures they use and what the standard measure in the industry is.
At the end of all our research we come up with some basic artifacts. The first is an initial, educated base prompt; this is the prompt a professional might write. For generating distractor rationales it might say: given a question, generate four distractor rationales that don't leak the answer, don't hallucinate, and so on. We also write a vanilla prompt, the kind an amateur might write, something like: just generate a distractor rationale given this question. It's always really interesting to compare our final results against these base and vanilla prompts. And of course, if we find an existing dataset, we'd like to use it. We also like to have the measures set in stone at this point: this is what we're going to measure, and these are the acceptable thresholds. For example, a client might tell us that we shouldn't use this generation if it leaks the answer more than 1% of the time. That's what we're aiming for.
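As an illustration of those artifacts, here's roughly what they look like written down; the wording is paraphrased for this write-up, not our production prompts.

```python
# Paraphrased illustrations of the research artifacts, not production prompts.
BASE_PROMPT = (
    "Given the question, its options and the correct answer, write a "
    "distractor rationale for each incorrect option. Never reveal or hint at "
    "the correct answer, keep a supportive tone, match the language to a "
    "{audience} audience, and do not introduce facts that are not in the question."
)

VANILLA_PROMPT = "Generate a distractor rationale for each option of this question."

# Acceptable thresholds agreed before any experiments run.
THRESHOLDS = {"answer_leak_rate": 0.01}  # leak the answer at most 1% of the time
```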
Then there's the eval plan. This is something we added later: when we first started using this framework, we sometimes found ourselves spending far too long trying to perfect everything, and what the plan does is give us a bit of balance between speed and rigor. Of course we want to be rigorous, but we also want to get something finished and out into production. In the eval plan we make an educated guess at the models to test; we don't need to test everything. For something as simple as a distractor rationale, we don't need a high-performing reasoning model with a huge reasoning budget; we might just narrow it down to some open source or mini models. We also document our prompting techniques, so we might say: let's test this chain-of-thought prompting technique that has worked really well for non-reasoning models, and maybe try one-shot and two-shot to see how they do. And then, obviously, the eval dataset definition, the metrics, and the judge prompts. I don't know how many people here are familiar with the LLM-as-a-judge technique. One of the things we found was that people were sometimes not creating their own judges; it was getting a bit siloed, with people reusing judges that maybe weren't the best performing. So we now document those better in our eval plan, so we can all have a look, collaborate, and basically peer review it. This helps standardize the whole process and definitely helps with reproducibility.
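For context, here is a minimal sketch of what one of those documented judge prompts can look like in code, assuming an OpenAI-style client; the model name, criteria wording and function are illustrative, not our production judge.

```python
# Minimal LLM-as-a-judge sketch for distractor rationales.
# The judge model and criteria wording are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are reviewing a distractor rationale for a multiple choice question.
Question: {question}
Distractor: {distractor}
Rationale: {rationale}

Score each criterion 0 or 1 and reply as JSON with these keys:
- "no_answer_leak": the rationale does not reveal the correct answer
- "supportive": the tone is supportive of the student
- "audience_fit": the language suits a {audience} audience
"""

def judge_rationale(question: str, distractor: str, rationale: str, audience: str) -> dict:
    """Ask a judge model to score one rationale against the agreed criteria."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, distractor=distractor,
            rationale=rationale, audience=audience)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

Documenting the judge like this, rather than burying it in a notebook, is what lets the rest of the team peer review it.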
I'll talk a little bit about our synthetic data process, because we've been on a bit of a journey with it. Initially, when we first started hearing about synthetic data, we followed the tutorial approach: take a good example and ask another LLM to produce a similar one. We found that really lacking in coverage. For example, when we were working on a bias task, one of the questions was "why are boys better than girls at sports?", and an LLM asked to produce a similar question would basically give the same thing back, like "why are girls worse at sports?". So what we actually do is take textbooks or manuals and generate questions and answers from those. We get much better coverage and far more realistic data. It's very much a RAG-style approach; if anyone has tried to evaluate RAG before, this is typically how a lot of people do it. For example, we'll take a history textbook for 10-year-olds, label it, take the topics from the textbook, get the chunks for each of those topics, and then generate the questions and answers. We use this as our foundation dataset and reuse it. We do a lot of evaluation on this dataset itself, through judges and also human annotation. And then, for a particular job, we might inject certain failures so we have something to measure against.
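Here's a hedged sketch of that textbook-to-questions step; the chunking, model name and prompt wording are assumptions made for illustration, and the real pipeline is more involved.

```python
# Sketch of generating synthetic Q&A from textbook chunks (illustrative only).
import json
from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, size: int = 1500) -> list[str]:
    """Naively split a textbook section into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def generate_qa(chunk: str, topic: str, audience: str = "10-year-olds") -> list[dict]:
    """Generate MCQ question/answer pairs grounded in a single chunk."""
    prompt = (
        f"Topic: {topic}\nAudience: {audience}\n"
        f"Source text:\n{chunk}\n\n"
        "Write 3 multiple choice questions that can be answered from the source text alone. "
        'Reply as JSON: {"items": [{"question": ..., "options": [...], "answer": ...}]}'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed generator model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["items"]
```

The important part is that every generated item is grounded in a real chunk, which is what gives the coverage the example-rewriting approach lacked.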
For the evaluation pipeline, this is where we use Weave a lot; Weave really is the heart of everything we do here. We set up our experiments and our metrics, and it all gets stored in Weave. For each new use case we set up a new project, and we put in the models we're going to test, the metrics, the judges, the datasets and the prompts. This really helps us with transparency.
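As a rough sketch of that wiring, assuming the current Weave Python API and using placeholder data, a toy scorer and a hypothetical generate_rationale op rather than our real pipeline:

```python
# Sketch of a Weave evaluation run; dataset, scorer and model are placeholders.
import asyncio
import weave

weave.init("healthcheck-distractor-rationales")  # one project per use case

@weave.op()
def generate_rationale(question: str, distractor: str) -> str:
    """Hypothetical prompt + model combination under test."""
    return "Incorrect. This option mixes up two ideas; revisit the relevant section."

@weave.op()
def no_answer_leak(answer: str, output: str) -> dict:
    """Toy scorer: flags rationales that quote the correct answer verbatim."""
    return {"no_answer_leak": answer.lower() not in output.lower()}

dataset = [
    {"question": "What is 2 + 2?", "distractor": "5", "answer": "4"},
]

evaluation = weave.Evaluation(dataset=dataset, scorers=[no_answer_leak])
asyncio.run(evaluation.evaluate(generate_rationale))
```

Every run, along with its traces, prompts and scores, then shows up in the project for anyone on the team to inspect.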
Finally, the last stage is all about sharing the insights and reporting, and we tend to have three different audiences for this. AI has also been brilliant here for things like vibe coding new charts, which makes the whole process much quicker; before, we probably wouldn't have been able to do as much. We share the analysis with our internal data science team: detailed reports, more statistics, and in-depth discussions of the findings. For example, if we see that a chain-of-thought prompting technique works really well for non-reasoning models, we'll probably stick with it in future experiments for other use cases, and overall this drives our metric and prompt performance.
We also share with our Learnosity AI team, the software engineers who build these products, and we like to make this more collaborative. Once we've set up our pipeline, we also have an evaluation hub we built using Streamlit, and we ask our engineers to try their own prompts: after we've set everything up and given a recommendation based on our metrics, we see if they can beat our score. We invite other people to try to win as well, and we use the Weave leaderboard feature for this. Overall it encourages experimentation and makes people who maybe don't get to write prompts all the time feel more involved in the AI piece.
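A hedged sketch of how that friendly competition is wired up: the same Weave evaluation is re-run for each competing prompt so the scores land side by side (the prompts, dataset and scorer here are placeholders, and publishing to the leaderboard itself happens in the Weave UI, which isn't shown).

```python
# Sketch: evaluating competing prompt variants against the same dataset and scorer.
import asyncio
import weave

weave.init("healthcheck-distractor-rationales")

@weave.op()
def baseline_rationale(question: str, distractor: str) -> str:
    """Data science team's recommended prompt + model (stubbed for illustration)."""
    return "Incorrect. Check the steps again; this option skips the final one."

@weave.op()
def challenger_rationale(question: str, distractor: str) -> str:
    """An engineer's challenger prompt submitted via the Streamlit hub (stubbed)."""
    return "Not quite. Have another look at the question before choosing."

@weave.op()
def no_answer_leak(answer: str, output: str) -> dict:
    """Toy scorer shared by every candidate so results stay comparable."""
    return {"no_answer_leak": answer.lower() not in output.lower()}

dataset = [{"question": "What is 2 + 2?", "distractor": "5", "answer": "4"}]
evaluation = weave.Evaluation(dataset=dataset, scorers=[no_answer_leak])

for candidate in (baseline_rationale, challenger_rationale):
    asyncio.run(evaluation.evaluate(candidate))
```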
Lastly, we've started sharing with our sales team. We want our sales team to be able to show customers how much work we're doing; there's no point in just hiding all these experiments and metrics. So we like to show them things like hallucination rates, both to stand out and to build that confidence with them.
Finally, it doesn't just stop at pre-deployment; we also evaluate beyond deployment. We like to use Weave monitors, which I think you might have seen earlier today in one of the demos: these are judges run on a sample of production traffic, so we can make sure hallucination rates are staying the same. Alongside that, we have Weave feedback connected to our accept and reject rates, so we can follow how well it's actually doing, together with our traces, which we then use for back-testing over our synthetic data.
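A minimal sketch of that sampled online judging idea, reusing the hypothetical generate_rationale and judge_rationale functions from the earlier sketches; the sample rate and logging are illustrative, and in practice the monitor and feedback wiring live in Weave rather than being hand-rolled like this.

```python
# Illustrative sampled online judging; not the actual Weave monitor configuration.
import random

SAMPLE_RATE = 0.05  # judge roughly 5% of production generations

def serve_rationale(question: str, distractor: str, audience: str) -> str:
    output = generate_rationale(question, distractor)  # hypothetical production call
    if random.random() < SAMPLE_RATE:
        verdict = judge_rationale(question, distractor, output, audience)
        # In practice this verdict is attached to the trace, so drift in leak or
        # hallucination rates shows up next to the accept/reject feedback.
        print({"question": question, "verdict": verdict})
    return output
```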
At Fully Connected London '25, Sean McCrossan, Data Scientist at Learnosity, shares how his team moved beyond intuition-driven "prompt engineering" to a structured, data-driven framework for evaluating large language models. He explains how Learnosity designs, tests, and validates AI educational products for quality and reliability, using techniques like LLM-as-a-Judge, synthetic data generation, and W&B Weave-based evaluation pipelines to ensure accuracy, fairness, and trust in real-world learning applications.