WEBVTT 00:00:05.200 --> 00:00:14.740 We welcome Laura Patuya from C Rake, presenting a structured evaluation framework for assessing LLM 00:00:14.740 --> 00:00:18.080 quality and performance in LCA contexts. 00:00:19.080 --> 00:00:22.480 So Laura, are you already in the Zoom meeting link? 00:00:23.200 --> 00:00:24.280 OK, good. 00:00:24.440 --> 00:00:25.560 Good afternoon, everyone. 00:00:25.960 --> 00:00:26.320 I'm Laura. 00:00:27.880 --> 00:00:36.073 Today, along with my colleague Jinja, I'm going to talk about how 00:00:36.073 --> 00:00:39.760 to evaluate LLMs at scale, as announced before. 00:00:41.120 --> 00:00:49.840 As we have heard this morning, LCA 00:00:50.680 --> 00:00:56.665 answers are quite subjective, so the question is how to evaluate an LLM 00:00:56.665 --> 00:00:58.640 in an LCA context. 00:01:03.280 --> 00:01:10.669 The first option is human evaluation: a human takes the LLM output and reviews it. It is very time consuming and expensive, 00:01:10.669 --> 00:01:18.215 if you think about it, like when we have a review by a colleague 00:01:18.215 --> 00:01:19.080 or a manager. 00:01:26.040 --> 00:01:31.920 Sorry, another option is to make a comparison with a ground truth. 00:01:32.320 --> 00:01:38.893 It's a classical machine learning strategy, but it is not really suitable here, because there is rarely just 00:01:38.893 --> 00:01:40.520 one single right answer. 00:01:41.240 --> 00:01:49.354 LCA is subjective, and in addition to that we don't have any ground truth database available yet, 00:01:49.354 --> 00:01:49.600 so. 00:01:50.080 --> 00:01:54.320 The third approach is the one we will rely on. 00:01:55.040 --> 00:02:01.760 That is evaluation based on criteria, with the perspective to use an LLM as a judge.
00:02:03.160 --> 00:02:11.360 So the objective of this work is to develop a structured and reproducible evaluation framework tailored to 00:02:11.360 --> 00:02:17.920 LCA, and it can be applied to different applications, like comparing completions from different LLMs, 00:02:18.320 --> 00:02:25.120 [inaudible], 00:02:26.040 --> 00:02:30.760 and, well, to track performance and problems during the training of the LLM. 00:02:32.560 --> 00:02:39.720 So let's get into this evaluation framework, as I said. 00:02:40.240 --> 00:02:44.960 We go step by step, and the first step is the 00:02:45.040 --> 00:02:45.880 question definition. 00:02:45.880 --> 00:02:53.012 So here we define the goal and the configuration we want to assess by choosing the specific LCA task, 00:02:53.012 --> 00:02:59.320 the LLM settings (for example, the model we evaluate) and the content to evaluate. 00:02:59.760 --> 00:03:08.316 And then we define, sorry, a set of questions to test the performance of this 00:03:08.316 --> 00:03:09.600 very configuration. 00:03:10.080 --> 00:03:11.720 And those questions are sent 00:03:12.160 --> 00:03:19.360 to the LLM under test, and its answers 00:03:20.040 --> 00:03:28.145 act as inputs which, together with other elements I will describe later on, are then provided to the judge 00:03:28.145 --> 00:03:29.800 for the evaluation. 00:03:31.040 --> 00:03:34.960 On top of that, we also have a 00:03:35.040 --> 00:03:42.979 calibration set, aiming at recalibrating the evaluation for a new configuration, to ensure that the 00:03:42.979 --> 00:03:45.320 judge works well and can be trusted. 00:03:46.640 --> 00:03:51.680 So let's look at the evaluation step. 00:03:52.520 --> 00:03:54.160 So for one configuration,
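The "question definition" step described above (an LCA task, the LLM settings, the content to evaluate, plus a question set) could be sketched as a small data structure. This is a minimal illustration only; the class and field names are invented, not the speakers' actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    text: str            # question sent to the LLM under test
    criteria: list[str]  # evaluation criteria the judge will later apply

@dataclass
class EvalConfig:
    lca_task: str        # the specific LCA task being assessed
    model: str           # LLM settings: which model is evaluated
    content: str         # the content to evaluate
    questions: list[Question] = field(default_factory=list)

# Hypothetical example configuration (all values are placeholders).
config = EvalConfig(
    lca_task="goal_and_scope_definition",
    model="example-llm-v1",
    content="bicycle frame, cradle-to-gate",
)
config.questions.append(
    Question(
        text="Define the functional unit for this product system.",
        criteria=["completeness", "consistency"],
    )
)
```

One configuration then bundles everything the evaluation run needs, which is what makes the results reproducible and comparable across runs.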
00:03:59.040 --> 00:04:08.917 the questions will get answers, and we then provide the judge LLM with some additional criteria for the evaluation. 00:04:08.917 --> 00:04:18.896 Some of them are generic and applicable to any LCA task, and some are tailored to the task 00:04:18.896 --> 00:04:19.720 we assess. 00:04:20.440 --> 00:04:24.800 We also think about the scoring scale, for example 00:04:26.480 --> 00:04:29.600 1-2-3-4 for each question. 00:04:30.320 --> 00:04:34.880 And then we can go to the gold standards. They are a pool of 00:04:35.200 --> 00:04:42.519 answers that reflect LCA subjectivity, and they help the judge to make the evaluation, but 00:04:42.519 --> 00:04:49.465 they are different from the questions we use in the evaluation. The judge will then assign to each 00:04:49.465 --> 00:04:52.080 criterion and question a score and a justification. 00:04:58.290 --> 00:05:05.632 So then we end up with a bunch of scores for each criterion, and then we compute simple statistics 00:05:05.632 --> 00:05:12.600 like the mean, the standard deviation, maybe other indicators we are interested in, and a weighted score. 00:05:12.600 --> 00:05:12.960 Why not? 00:05:13.760 --> 00:05:20.285 But how to tell whether the results are good is still an open question, and it raises the question of 00:05:20.285 --> 00:05:24.120 the acceptable level of quality and performance we might require. 00:05:25.080 --> 00:05:29.800 So if we plug in the numbers, is the score 00:05:30.040 --> 00:05:36.880 good enough for this configuration? For example, is it 00:05:40.240 --> 00:05:44.600 a sufficient level 00:05:45.040 --> 00:05:48.760 to achieve, and should the threshold be two, or 00:05:49.160 --> 00:05:49.560 three? 00:05:50.040 --> 00:05:50.800 I don't know yet. 00:05:54.440 --> 00:06:03.160 OK, I will now show some results from tests we have done on specific LCA tasks. 00:06:03.160 --> 00:06:04.640 So the first one is the
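The aggregation the speaker describes (per-criterion scores on a 1-4 scale, reduced to a mean, a standard deviation, and optionally a weighted overall score checked against a threshold) can be sketched as follows. The criterion names, weights, and the threshold value are made up for illustration; the acceptable threshold is precisely the open question raised in the talk.

```python
import statistics

# Judge scores per criterion on the 1-4 scale mentioned in the talk
# (example values, not real results).
scores = {
    "completeness": [3, 4, 2, 3],
    "consistency":  [4, 4, 3, 4],
}
weights = {"completeness": 0.6, "consistency": 0.4}  # assumed weights

# Simple statistics per criterion.
means = {c: statistics.mean(v) for c, v in scores.items()}
stdevs = {c: statistics.stdev(v) for c, v in scores.items()}

# An optional weighted overall score ("why not?").
overall = sum(weights[c] * means[c] for c in scores)

# The open question: what counts as acceptable? 2? 3? Assumed here: 3.0.
threshold = 3.0
acceptable = overall >= threshold
```

With these example numbers the overall score is 3.3, which would pass a threshold of 3 but the point of the talk is that choosing that threshold is still unresolved.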
00:06:05.640 --> 00:06:11.741 goal and scope definition. We saw that hallucination remains an issue there, and also that human 00:06:11.741 --> 00:06:17.599 validation is very challenging, because you need to read a whole 00:06:17.599 --> 00:06:23.640 goal and scope definition in a row. The second task we tested is the matching of items 00:06:23.640 --> 00:06:24.800 in the inventory database. 00:06:25.200 --> 00:06:29.560 And we saw that you can have many hallucinations if you don't provide 00:06:30.320 --> 00:06:34.960 the list of all the processes and flows to the LLM; otherwise it 00:06:35.320 --> 00:06:44.200 will just make up some flow, process, and geography combinations. And the last task we tested is 00:06:45.160 --> 00:06:52.984 inventory generation, and our conclusion is that it is super challenging to have a real gold standard for 00:06:52.984 --> 00:06:54.920 processes. It is almost im- 00:06:55.040 --> 00:06:58.000 possible, and even more so for technologies. 00:06:58.440 --> 00:07:06.383 We also learned that human validation cannot be done by just anyone: it has to be done by an expert on 00:07:06.383 --> 00:07:08.840 the process. And also, 00:07:10.080 --> 00:07:17.526 we think a full inventory for a process may be a bit ambitious at this stage for an LLM, and it may 00:07:17.526 --> 00:07:23.920 be that it is better suited for structuring and similar tasks. So far we have 00:07:24.160 --> 00:07:24.960 tested this manually, 00:07:25.160 --> 00:07:29.960 but we are currently working on a pipeline to automate it, 00:07:31.480 --> 00:07:39.902 to cover many more contexts and many more configurations. So the idea is to provide some information directly 00:07:39.902 --> 00:07:40.160 to 00:07:40.480 --> 00:07:44.800 a configuration file, which will be read by the code and then sent to the LLM.
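The configuration-file-driven pipeline mentioned above could look something like this: a run is declared in a small file, read by the code, and dispatched to the LLM. The keys and the `call_llm` stub are assumptions for illustration, not the speakers' actual pipeline.

```python
import json

# A run declared as configuration rather than code (keys are assumed).
CONFIG_TEXT = """
{
  "lca_task": "inventory_generation",
  "model": "example-llm-v1",
  "questions": ["List the main input flows for aluminium extrusion."],
  "scoring_scale": [1, 2, 3, 4]
}
"""

def call_llm(model: str, prompt: str) -> str:
    # Placeholder for the real LLM call the pipeline would make.
    return f"[{model} answer to: {prompt}]"

# The code reads the configuration and sends each question to the LLM.
config = json.loads(CONFIG_TEXT)
answers = [call_llm(config["model"], q) for q in config["questions"]]
```

Keeping the run definition in a file rather than in code is what lets the team scale to "many more contexts and many more configurations" without touching the pipeline itself.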
00:07:45.080 --> 00:07:53.320 And so all the prompts, the outputs generated, and the judge's scores will then be handled automatically, 00:07:53.320 --> 00:08:01.560 and a colleague is currently working on that, to gather a larger number of answers and feedback from the framework. 00:08:05.720 --> 00:08:12.240 So as a discussion, we can highlight maybe the strengths of the framework we propose. 00:08:12.240 --> 00:08:17.640 It systematically reveals performance trends across LCA contexts 00:08:17.640 --> 00:08:24.466 and LLM configurations. It can be applied to any LCA task, as long as you provide the appropriate 00:08:24.466 --> 00:08:25.760 criteria for it. 00:08:26.320 --> 00:08:32.640 It supports both human and LLM judges, but we recommend the LLM as a judge to scale the evaluation. 00:08:33.560 --> 00:08:39.564 It also reduces the risk of the LLM overfitting to the evaluation questions and answers, because if the ground 00:08:39.564 --> 00:08:43.040 truth were made available, then the LLM could learn from it. 00:08:43.120 --> 00:08:49.416 So here the gold standards are meant to guide the judge, but are not directly used to evaluate the 00:08:49.416 --> 00:08:49.920 outputs. 00:08:49.920 --> 00:08:54.240 And the evaluation questions should be updated regularly to avoid this overfitting. 00:08:55.080 --> 00:09:00.606 And this framework will be ready to use very soon, but of course it will need some 00:09:00.606 --> 00:09:01.360 improvements. 00:09:01.640 --> 00:09:08.608 So as future work we will need to test it on a larger set of configurations, to strengthen also 00:09:08.608 --> 00:09:08.960 the, 00:09:08.960 --> 00:09:11.200 sorry, the criteria definition. 00:09:11.800 --> 00:09:17.057 We also need to investigate what an acceptable minimum quality threshold is, and to develop 00:09:17.057 --> 00:09:20.600 indicators to support the interpretation of the different scores.
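The point about gold standards guiding the judge without acting as an answer key can be sketched as a judge-prompt builder: the gold standards illustrate acceptable (subjective) answers to questions other than the one being scored. The prompt wording and function name here are invented for illustration; only the mechanism matches what the talk describes.

```python
def build_judge_prompt(question: str, answer: str,
                       criteria: list[str], gold_standards: list[str]) -> str:
    """Assemble a judge prompt where gold standards are reference material,
    not the expected answer to the question under evaluation."""
    examples = "\n".join(f"- {g}" for g in gold_standards)
    checks = ", ".join(criteria)
    return (
        "You are judging an LLM answer for an LCA task.\n"
        "Reference examples of acceptable answers to OTHER questions "
        f"(for calibration only, not an answer key):\n{examples}\n"
        f"Score the answer below from 1 to 4 on each criterion ({checks}), "
        "with a short justification per criterion.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
    )

# Hypothetical usage with placeholder content.
prompt = build_judge_prompt(
    question="Define the system boundary for this study.",
    answer="Cradle-to-gate, excluding capital goods.",
    criteria=["completeness", "clarity"],
    gold_standards=["A functional unit of 1 km driven is acceptable here."],
)
```

Because the gold standards never answer the scored question directly, publishing or rotating the evaluation questions does not leak a ground truth the evaluated LLM could memorize.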
00:09:21.280 --> 00:09:26.637 It would also be interesting to compare the LLM-as-a-judge approach to a human review, to kind of 00:09:26.637 --> 00:09:27.280 validate it. 00:09:28.360 --> 00:09:35.324 And creating a collaborative ground truth database would be crucial to reflect the 00:09:35.324 --> 00:09:38.880 field's subjectivity and to help make LLMs smarter. 00:09:39.280 --> 00:09:39.720 Thank you.