Thank you so much for coming, and thanks to Weights & Biases for inviting me to talk about the type of work we do at Learnosity. This talk is all about evaluation and our journey through it: moving beyond the initial vibes we were relying on and towards a more experimental, evidence-based approach for selecting the best models and prompts to use for our clients.
My name is Sean McCrossan, and that photo is a very old photo of me, as you may be able to tell. I work as a data scientist at Learnosity. I don't know if anyone here has actually heard of Learnosity before; we're quite niche. We don't really have a platform of our own, but we power most of the educational platforms that want to use our assessment APIs.
For the past two years we've been really focused on AI, because there are huge opportunities for assessment in this space. I think we all remember when ChatGPT first came out and we got really excited about the potential of integrating it into our product. Our first product was Author Aide, a content generation product where the user can ask to generate a new question without having to write it themselves. Feedback Aide was really exciting for Learnosity: we do a lot of scoring, and some item types, like multiple choice questions, are easy to score, but essays, short responses and the like have always been notoriously hard to score without the aid of AI, so that one was really exciting for us. And then our most recent product is Health Check, which enhances and modernizes old content our clients may have. For example, if you're a client who wrote some questions back in the early 2000s, they may contain archaic language or be a bit biased, and we use AI to rewrite those questions and make them reusable again.
As I said, this talk is all about evaluations, so let's start with what can happen if we don't evaluate. We've all seen news stories in the past year about people who used AI for certain things, and hallucinations and potential bias really damaged their reputations; most recently, probably, Deloitte in Australia, who used AI to write reports. I'd almost guarantee that if they had spent a bit of time evaluating, they would have caught those kinds of hallucinations.
But it's not just about avoiding embarrassment. We also evaluate to make sure we get the best possible output for our customers. Think of something as simple as a teacher giving feedback on an essay: by the 25th essay, late at night, the feedback can get a bit dry. It might become a "very good" or a "you could do better here" that isn't really actionable or supportive, and that doesn't link back to evidence in the essay. So we think hard about the best possible output we want from our AI, and that's what we evaluate for. Another thing is that it elevates our value. Even when interviewing people who come from AI companies or who have been using AI, we see that they don't tend to do as much evaluation as you might think. So by evaluating, and by showing our customers that we evaluate, we really distinguish ourselves from some of our competitors. When a customer is worried about hallucinations, we can tell them we've evaluated for that: we've run over 100 prompt and model combinations, and we can show them that the prompt and model combination we've chosen gets a hallucination rate of less than 0.1%. That really does build trust with our customers.
However, it wasn't always like that. When we first started working with AI, we just wanted to jump right in and build these use cases. When I joined, just over a year ago, we had these prompts and models and we'd look at the output; a lot of the time I'd ask someone how they tested it and they'd say, "well, we tested it and it looks good." So we went off vibes a lot. People also had different ideas about what to put in their prompts and which models to use, and they'd say things like, "I read this blog and apparently three-shot is the best way." It would always come back to: how do we know that's the best way for this use case? It might be the best way for a different one. So we set up a fairly basic framework. At the beginning of every evaluation task we research, just to understand what our output is supposed to be. We then plan it out; this is a newer addition to our framework, and it really helps us keep the balance between speed and rigor, because we do need to get something out. Then, of course, data: data has been important for every data science task, and evaluation with LLMs is no different. The only real difference now is that we tend to go for synthetic data more often than collecting and cleaning real data. Then we build the pipeline: we set up our experiments and our metrics in Weave and run them. And finally we report and recommend, to different audiences.
I'm going to go through a use case: one of the LLM calls we use to generate distractor rationales in our Health Check product. You probably haven't heard of a distractor rationale; I hadn't either until about three months ago. Essentially it's a bit of feedback attached to each MCQ option. For the correct answer it should confirm that you got the right answer; for an incorrect answer it tells you it's incorrect and gives you some context as to why you might have chosen that option. This is something we use AI to generate, where traditionally content creators themselves might not have bothered because it took a bit of time to do.
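To make that concrete, here's an illustrative example of the shape of such an item; the question and rationale text are invented for this write-up, not taken from a client.

```python
# Illustrative MCQ item with a rationale per option; content is invented.
item = {
    "question": "Which planet is closest to the Sun?",
    "options": {
        "Mercury": "Correct. Mercury orbits nearest to the Sun.",
        "Venus": "Incorrect. Venus is the hottest planet, but it is the second closest to the Sun.",
        "Mars": "Incorrect. Mars is bright in the night sky, but it orbits further out than Earth.",
    },
    "answer": "Mercury",
}
```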
Our first step is research: defining what makes a good distractor rationale, what attributes it should have. We learn this by reading academic papers or blogs, and particularly by talking to our in-house experts and our customers, who tell us things like: a good distractor rationale should never leak the answer when you select the wrong option; it should be supportive of the student; and it should use language suited to the audience. If the audience is a 10-year-old we should use very simple, basic language, and if the audience is a master's student we can be a bit more sophisticated.
Then we look at how to measure this. Has anyone done any measurement on this? Has anyone worked on this task before? Our tasks are typically quite niche, so not many people are generating distractor rationales, but for other use cases, like essay grading, a lot of work has been done, and we want to understand what measures they use and what the standard measure in the industry is.
At the end of all our research we come up with some basic artifacts. The first is an initial, educated base prompt; this is the prompt a professional might write. For generating distractor rationales it might say: given a question, generate four distractor rationales that don't leak the answer, don't hallucinate, and so on. We also write a vanilla prompt, the kind an amateur might write, something like: just generate a distractor rationale given this question. It's always really interesting to compare our final results against these base and vanilla prompts. And of course, if we find an existing dataset, we'd like to use it. We also like to have the measures set in stone at this point: this is what we're going to measure, and these are the acceptable thresholds. For example, a client might tell us that we shouldn't use this generation if it leaks the answer more than 1% of the time. That's what we're aiming for.
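As an illustration of those artifacts, here's roughly what they look like written down; the wording is paraphrased for this write-up, not our production prompts.

```python
# Paraphrased illustrations of the research artifacts, not production prompts.
BASE_PROMPT = (
    "Given the question, its options and the correct answer, write a "
    "distractor rationale for each incorrect option. Never reveal or hint at "
    "the correct answer, keep a supportive tone, match the language to a "
    "{audience} audience, and do not introduce facts that are not in the question."
)

VANILLA_PROMPT = "Generate a distractor rationale for each option of this question."

# Acceptable thresholds agreed before any experiments run.
THRESHOLDS = {"answer_leak_rate": 0.01}  # leak the answer at most 1% of the time
```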
Then there's the eval plan. This is something we added later: when we first started using this framework, we sometimes found ourselves spending far too long trying to perfect everything, and what the plan does is give us a bit of balance between speed and rigor. Of course we want to be rigorous, but we also want to get something finished and out into production. In the eval plan we make an educated guess at the models to test; we don't need to test everything. For something as simple as a distractor rationale, we don't need a high-performing reasoning model with a huge reasoning budget; we might just narrow it down to some open source or mini models. We also document our prompting techniques, so we might say: let's test this chain-of-thought prompting technique that has worked really well for non-reasoning models, and maybe try one-shot and two-shot to see how they do. And then, obviously, the eval dataset definition, the metrics, and the judge prompts. I don't know how many people here are familiar with the LLM-as-a-judge technique. One of the things we found was that people were sometimes not creating their own judges; it was getting a bit siloed, with people reusing judges that maybe weren't the best performing. So we now document those better in our eval plan, so we can all have a look, collaborate, and basically peer review it. This helps standardize the whole process and definitely helps with reproducibility.
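For context, here is a minimal sketch of what one of those documented judge prompts can look like in code, assuming an OpenAI-style client; the model name, criteria wording and function are illustrative, not our production judge.

```python
# Minimal LLM-as-a-judge sketch for distractor rationales.
# The judge model and criteria wording are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are reviewing a distractor rationale for a multiple choice question.
Question: {question}
Distractor: {distractor}
Rationale: {rationale}

Score each criterion 0 or 1 and reply as JSON with these keys:
- "no_answer_leak": the rationale does not reveal the correct answer
- "supportive": the tone is supportive of the student
- "audience_fit": the language suits a {audience} audience
"""

def judge_rationale(question: str, distractor: str, rationale: str, audience: str) -> dict:
    """Ask a judge model to score one rationale against the agreed criteria."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, distractor=distractor,
            rationale=rationale, audience=audience)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

Documenting the judge like this, rather than burying it in a notebook, is what lets the rest of the team peer review it.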
I'll talk a little bit about our synthetic data process, because we've been on a bit of a journey with it. Initially, when we first started hearing about synthetic data, we followed the tutorial approach: take a good example and ask another LLM to produce a similar one. We found that really lacking in coverage. For example, when we were working on a bias task, one of the questions was "why are boys better than girls at sports?", and an LLM asked to produce a similar question would basically give the same thing back, like "why are girls worse at sports?". So what we actually do is take textbooks or manuals and generate questions and answers from those. We get much better coverage and far more realistic data. It's very much a RAG-style approach; if anyone has tried to evaluate RAG before, this is typically how a lot of people do it. For example, we'll take a history textbook for 10-year-olds, label it, take the topics from the textbook, get the chunks for each of those topics, and then generate the questions and answers. We use this as our foundation dataset and reuse it. We do a lot of evaluation on this dataset itself, through judges and also human annotation. And then, for a particular job, we might inject certain failures so we have something to measure against.
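Here's a hedged sketch of that textbook-to-questions step; the chunking, model name and prompt wording are assumptions made for illustration, and the real pipeline is more involved.

```python
# Sketch of generating synthetic Q&A from textbook chunks (illustrative only).
import json
from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, size: int = 1500) -> list[str]:
    """Naively split a textbook section into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def generate_qa(chunk: str, topic: str, audience: str = "10-year-olds") -> list[dict]:
    """Generate MCQ question/answer pairs grounded in a single chunk."""
    prompt = (
        f"Topic: {topic}\nAudience: {audience}\n"
        f"Source text:\n{chunk}\n\n"
        "Write 3 multiple choice questions that can be answered from the source text alone. "
        'Reply as JSON: {"items": [{"question": ..., "options": [...], "answer": ...}]}'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed generator model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["items"]
```

The important part is that every generated item is grounded in a real chunk, which is what gives the coverage the example-rewriting approach lacked.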
For the evaluation pipeline, this is where we use Weave a lot; Weave really is the heart of everything we do here. We set up our experiments and our metrics, and it all gets stored in Weave. For each new use case we set up a new project, and we put in the models we're going to test, the metrics, the judges, the datasets and the prompts. This really helps us with transparency.
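As a rough sketch of that wiring, assuming the current Weave Python API and using placeholder data, a toy scorer and a hypothetical generate_rationale op rather than our real pipeline:

```python
# Sketch of a Weave evaluation run; dataset, scorer and model are placeholders.
import asyncio
import weave

weave.init("healthcheck-distractor-rationales")  # one project per use case

@weave.op()
def generate_rationale(question: str, distractor: str) -> str:
    """Hypothetical prompt + model combination under test."""
    return "Incorrect. This option mixes up two ideas; revisit the relevant section."

@weave.op()
def no_answer_leak(answer: str, output: str) -> dict:
    """Toy scorer: flags rationales that quote the correct answer verbatim."""
    return {"no_answer_leak": answer.lower() not in output.lower()}

dataset = [
    {"question": "What is 2 + 2?", "distractor": "5", "answer": "4"},
]

evaluation = weave.Evaluation(dataset=dataset, scorers=[no_answer_leak])
asyncio.run(evaluation.evaluate(generate_rationale))
```

Every run, along with its traces, prompts and scores, then shows up in the project for anyone on the team to inspect.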
Finally, the last stage is all about sharing the insights and reporting, and we tend to have three different audiences for this. AI has also been brilliant here for things like vibe coding new charts, which makes the whole process much quicker; before, we probably wouldn't have been able to do as much. We share the analysis with our internal data science team: detailed reports, more statistics, and in-depth discussions of the findings. For example, if we see that a chain-of-thought prompting technique works really well for non-reasoning models, we'll probably stick with it in future experiments for other use cases, and overall this drives our metric and prompt performance.
We also share with our Learnosity AI team, the software engineers who build these products, and we like to make this more collaborative. Once we've set up our pipeline, we also have an evaluation hub we built using Streamlit, and we ask our engineers to try their own prompts: after we've set everything up and given a recommendation based on our metrics, we see if they can beat our score. We invite other people to try to win as well, and we use the Weave leaderboard feature for this. Overall it encourages experimentation and makes people who maybe don't get to write prompts all the time feel more involved in the AI piece.
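A hedged sketch of how that friendly competition is wired up: the same Weave evaluation is re-run for each competing prompt so the scores land side by side (the prompts, dataset and scorer here are placeholders, and publishing to the leaderboard itself happens in the Weave UI, which isn't shown).

```python
# Sketch: evaluating competing prompt variants against the same dataset and scorer.
import asyncio
import weave

weave.init("healthcheck-distractor-rationales")

@weave.op()
def baseline_rationale(question: str, distractor: str) -> str:
    """Data science team's recommended prompt + model (stubbed for illustration)."""
    return "Incorrect. Check the steps again; this option skips the final one."

@weave.op()
def challenger_rationale(question: str, distractor: str) -> str:
    """An engineer's challenger prompt submitted via the Streamlit hub (stubbed)."""
    return "Not quite. Have another look at the question before choosing."

@weave.op()
def no_answer_leak(answer: str, output: str) -> dict:
    """Toy scorer shared by every candidate so results stay comparable."""
    return {"no_answer_leak": answer.lower() not in output.lower()}

dataset = [{"question": "What is 2 + 2?", "distractor": "5", "answer": "4"}]
evaluation = weave.Evaluation(dataset=dataset, scorers=[no_answer_leak])

for candidate in (baseline_rationale, challenger_rationale):
    asyncio.run(evaluation.evaluate(candidate))
```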
Lastly, we've started sharing with our sales team. We want our sales team to be able to show customers how much work we're doing; there's no point in just hiding all these experiments and metrics. So we like to show them things like hallucination rates, both to stand out and to build that confidence with them.
Finally, it doesn't just stop at pre-deployment; we also evaluate beyond deployment. We like to use Weave monitors, which I think you might have seen earlier today in one of the demos: these are judges run on a sample of production traffic, so we can make sure hallucination rates are staying the same. Alongside that, we have Weave feedback connected to our accept and reject rates, so we can follow how well it's actually doing, together with our traces, which we then use for back-testing over our synthetic data.
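A minimal sketch of that sampled online judging idea, reusing the hypothetical generate_rationale and judge_rationale functions from the earlier sketches; the sample rate and logging are illustrative, and in practice the monitor and feedback wiring live in Weave rather than being hand-rolled like this.

```python
# Illustrative sampled online judging; not the actual Weave monitor configuration.
import random

SAMPLE_RATE = 0.05  # judge roughly 5% of production generations

def serve_rationale(question: str, distractor: str, audience: str) -> str:
    output = generate_rationale(question, distractor)  # hypothetical production call
    if random.random() < SAMPLE_RATE:
        verdict = judge_rationale(question, distractor, output, audience)
        # In practice this verdict is attached to the trace, so drift in leak or
        # hallucination rates shows up next to the accept/reject feedback.
        print({"question": question, "verdict": verdict})
    return output
```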
At Fully Connected London '25, Sean McCrossan, Data Scientist at Learnosity, shares how his team moved beyond intuition-driven "prompt engineering" to a structured, data-driven framework for evaluating large language models. He explains how Learnosity designs, tests, and validates AI educational products for quality and reliability, using techniques like LLM-as-a-Judge, synthetic data generation, and W&B Weave-based evaluation pipelines to ensure accuracy, fairness, and trust in real-world learning applications.