WEBVTT 00:00:05.200 --> 00:00:14.740 We welcome Laura Patuya from C Rake, presenting a structured evaluation framework for assessing LLM 00:00:14.740 --> 00:00:18.080 quality and performance in LCA contexts. 00:00:19.080 --> 00:00:22.480 So Laura, are you already in the Zoom meeting link? 00:00:23.200 --> 00:00:24.280 OK, good. 00:00:24.440 --> 00:00:25.560 Good afternoon, everyone. 00:00:25.960 --> 00:00:26.320 I'm Laura. 00:00:27.880 --> 00:00:36.073 Today, along with my colleague Jinja, I'm going to talk about how 00:00:36.073 --> 00:00:39.760 to evaluate LLMs at scale, as announced before. 00:00:41.120 --> 00:00:49.840 As we have heard this morning, LCA 00:00:50.680 --> 00:00:56.665 answers are quite subjective, so the question is how to evaluate an LLM 00:00:56.665 --> 00:00:58.640 in an LCA context. 00:01:03.280 --> 00:01:10.669 The first option is human evaluation: a human takes the LLM output and reviews it. It is very time consuming and expensive, 00:01:10.669 --> 00:01:18.215 if you think about it, like when we have a review by a colleague 00:01:18.215 --> 00:01:19.080 or a manager. 00:01:26.040 --> 00:01:31.920 Sorry, another option is to make a comparison with a ground truth. 00:01:32.320 --> 00:01:38.893 It's a classical machine learning strategy, but it is not really suitable here, because there is rarely just 00:01:38.893 --> 00:01:40.520 one single right answer. 00:01:41.240 --> 00:01:49.354 LCA is subjective, and in addition to that we don't have any ground truth database available yet, 00:01:49.354 --> 00:01:49.600 so. 00:01:50.080 --> 00:01:54.320 The third approach is the one we will rely on. 00:01:55.040 --> 00:02:01.760 That is evaluation based on criteria, with the perspective to use an LLM as a judge.
00:02:03.160 --> 00:02:11.360 So the objective of this work is to develop a structured and reproducible evaluation framework tailored to 00:02:11.360 --> 00:02:17.920 LCA, and it can be applied to different applications, like comparing completions from different LLMs, 00:02:18.320 --> 00:02:25.120 [inaudible], 00:02:26.040 --> 00:02:30.760 and, well, to track performance and problems during the training of the LLM. 00:02:32.560 --> 00:02:39.720 So let's get into this evaluation framework, as I said. 00:02:40.240 --> 00:02:44.960 We go step by step, and the first step is the 00:02:45.040 --> 00:02:45.880 question definition. 00:02:45.880 --> 00:02:53.012 So here we define the goal and the configuration we want to assess by choosing the specific LCA task, 00:02:53.012 --> 00:02:59.320 the LLM settings (for example, the model we evaluate) and the content to evaluate. 00:02:59.760 --> 00:03:08.316 And then we define, sorry, a set of questions to test the performance of this 00:03:08.316 --> 00:03:09.600 very configuration. 00:03:10.080 --> 00:03:11.720 And those questions are sent 00:03:12.160 --> 00:03:19.360 to the LLM under test, and its answers 00:03:20.040 --> 00:03:28.145 act as inputs which, together with other elements I will describe later on, are then provided to the judge 00:03:28.145 --> 00:03:29.800 for the evaluation. 00:03:31.040 --> 00:03:34.960 On top of that, we also have a 00:03:35.040 --> 00:03:42.979 calibration set, aiming at recalibrating the evaluation for a new configuration, to ensure that the 00:03:42.979 --> 00:03:45.320 judge works well and can be trusted. 00:03:46.640 --> 00:03:51.680 So let's look at the evaluation step. 00:03:52.520 --> 00:03:54.160 So for one configuration,
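The "question definition" step described above (an LCA task, the LLM settings, the content to evaluate, plus a question set) could be sketched as a small data structure. This is a minimal illustration only; the class and field names are invented, not the speakers' actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    text: str            # question sent to the LLM under test
    criteria: list[str]  # evaluation criteria the judge will later apply

@dataclass
class EvalConfig:
    lca_task: str        # the specific LCA task being assessed
    model: str           # LLM settings: which model is evaluated
    content: str         # the content to evaluate
    questions: list[Question] = field(default_factory=list)

# Hypothetical example configuration (all values are placeholders).
config = EvalConfig(
    lca_task="goal_and_scope_definition",
    model="example-llm-v1",
    content="bicycle frame, cradle-to-gate",
)
config.questions.append(
    Question(
        text="Define the functional unit for this product system.",
        criteria=["completeness", "consistency"],
    )
)
```

One configuration then bundles everything the evaluation run needs, which is what makes the results reproducible and comparable across runs.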
00:03:59.040 --> 00:04:08.917 the questions will get answers, and we then provide the judge LLM with some additional criteria for the evaluation. 00:04:08.917 --> 00:04:18.896 Some of them are generic and applicable to any LCA task, and some are tailored to the task 00:04:18.896 --> 00:04:19.720 we assess. 00:04:20.440 --> 00:04:24.800 We also think about the scoring scale, for example 00:04:26.480 --> 00:04:29.600 1-2-3-4 for each question. 00:04:30.320 --> 00:04:34.880 And then we can go to the gold standards. They are a pool of 00:04:35.200 --> 00:04:42.519 answers that reflect LCA subjectivity, and they help the judge to make the evaluation, but 00:04:42.519 --> 00:04:49.465 they are different from the questions we use in the evaluation. The judge will then assign to each 00:04:49.465 --> 00:04:52.080 criterion and question a score and a justification. 00:04:58.290 --> 00:05:05.632 So then we end up with a bunch of scores for each criterion, and then we compute simple statistics 00:05:05.632 --> 00:05:12.600 like the mean, the standard deviation, maybe other indicators we are interested in, and a weighted score. 00:05:12.600 --> 00:05:12.960 Why not? 00:05:13.760 --> 00:05:20.285 But how to tell whether the results are good is still an open question, and it raises the question of 00:05:20.285 --> 00:05:24.120 the acceptable level of quality and performance we might require. 00:05:25.080 --> 00:05:29.800 So if we plug in the numbers, is the score 00:05:30.040 --> 00:05:36.880 good enough for this configuration? For example, is it 00:05:40.240 --> 00:05:44.600 a sufficient level 00:05:45.040 --> 00:05:48.760 to achieve, and should the threshold be two, or 00:05:49.160 --> 00:05:49.560 three? 00:05:50.040 --> 00:05:50.800 I don't know yet. 00:05:54.440 --> 00:06:03.160 OK, I will now show some results from tests we have done on specific LCA tasks. 00:06:03.160 --> 00:06:04.640 So the first one is the
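The aggregation the speaker describes (per-criterion scores on a 1-4 scale, reduced to a mean, a standard deviation, and optionally a weighted overall score checked against a threshold) can be sketched as follows. The criterion names, weights, and the threshold value are made up for illustration; the acceptable threshold is precisely the open question raised in the talk.

```python
import statistics

# Judge scores per criterion on the 1-4 scale mentioned in the talk
# (example values, not real results).
scores = {
    "completeness": [3, 4, 2, 3],
    "consistency":  [4, 4, 3, 4],
}
weights = {"completeness": 0.6, "consistency": 0.4}  # assumed weights

# Simple statistics per criterion.
means = {c: statistics.mean(v) for c, v in scores.items()}
stdevs = {c: statistics.stdev(v) for c, v in scores.items()}

# An optional weighted overall score ("why not?").
overall = sum(weights[c] * means[c] for c in scores)

# The open question: what counts as acceptable? 2? 3? Assumed here: 3.0.
threshold = 3.0
acceptable = overall >= threshold
```

With these example numbers the overall score is 3.3, which would pass a threshold of 3 but the point of the talk is that choosing that threshold is still unresolved.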
00:06:05.640 --> 00:06:11.741 goal and scope definition. We saw that hallucination remains an issue there, and also that human 00:06:11.741 --> 00:06:17.599 validation is very challenging, because you need to read a whole 00:06:17.599 --> 00:06:23.640 goal and scope definition in a row. The second task we tested is the matching of items 00:06:23.640 --> 00:06:24.800 in the inventory database. 00:06:25.200 --> 00:06:29.560 And we saw that you can have many hallucinations if you don't provide 00:06:30.320 --> 00:06:34.960 the list of all the processes and flows to the LLM; otherwise it 00:06:35.320 --> 00:06:44.200 will just make up some flow, process, and geography combinations. And the last task we tested is 00:06:45.160 --> 00:06:52.984 inventory generation, and our conclusion is that it is super challenging to have a real gold standard for 00:06:52.984 --> 00:06:54.920 processes. It is almost im- 00:06:55.040 --> 00:06:58.000 possible, and even more so for technologies. 00:06:58.440 --> 00:07:06.383 We also learned that human validation cannot be done by just anyone: it has to be done by an expert on 00:07:06.383 --> 00:07:08.840 the process. And also, 00:07:10.080 --> 00:07:17.526 we think a full inventory for a process may be a bit ambitious at this stage for an LLM, and it may 00:07:17.526 --> 00:07:23.920 be that it is better suited for structuring and similar tasks. So far we have 00:07:24.160 --> 00:07:24.960 tested this manually, 00:07:25.160 --> 00:07:29.960 but we are currently working on a pipeline to automate it, 00:07:31.480 --> 00:07:39.902 to cover many more contexts and many more configurations. So the idea is to provide some information directly 00:07:39.902 --> 00:07:40.160 to 00:07:40.480 --> 00:07:44.800 a configuration file, which will be read by the code and then sent to the LLM.
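The configuration-file-driven pipeline mentioned above could look something like this: a run is declared in a small file, read by the code, and dispatched to the LLM. The keys and the `call_llm` stub are assumptions for illustration, not the speakers' actual pipeline.

```python
import json

# A run declared as configuration rather than code (keys are assumed).
CONFIG_TEXT = """
{
  "lca_task": "inventory_generation",
  "model": "example-llm-v1",
  "questions": ["List the main input flows for aluminium extrusion."],
  "scoring_scale": [1, 2, 3, 4]
}
"""

def call_llm(model: str, prompt: str) -> str:
    # Placeholder for the real LLM call the pipeline would make.
    return f"[{model} answer to: {prompt}]"

# The code reads the configuration and sends each question to the LLM.
config = json.loads(CONFIG_TEXT)
answers = [call_llm(config["model"], q) for q in config["questions"]]
```

Keeping the run definition in a file rather than in code is what lets the team scale to "many more contexts and many more configurations" without touching the pipeline itself.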
00:07:45.080 --> 00:07:53.320 And so all the prompts, the outputs generated, and the judge's scores will then be handled automatically, 00:07:53.320 --> 00:08:01.560 and a colleague is currently working on that, to gather a larger number of answers and feedback from the framework. 00:08:05.720 --> 00:08:12.240 So as a discussion, we can highlight maybe the strengths of the framework we propose. 00:08:12.240 --> 00:08:17.640 It systematically reveals performance trends across LCA contexts 00:08:17.640 --> 00:08:24.466 and LLM configurations. It can be applied to any LCA task, as long as you provide the appropriate 00:08:24.466 --> 00:08:25.760 criteria for it. 00:08:26.320 --> 00:08:32.640 It supports both human and LLM judges, but we recommend the LLM as a judge to scale the evaluation. 00:08:33.560 --> 00:08:39.564 It also reduces the risk of the LLM overfitting to the evaluation questions and answers, because if the ground 00:08:39.564 --> 00:08:43.040 truth were made available, then the LLM could learn from it. 00:08:43.120 --> 00:08:49.416 So here the gold standards are meant to guide the judge, but are not directly used to evaluate the 00:08:49.416 --> 00:08:49.920 outputs. 00:08:49.920 --> 00:08:54.240 And the evaluation questions should be updated regularly to avoid this overfitting. 00:08:55.080 --> 00:09:00.606 And this framework will be ready to use very soon, but of course it will need some 00:09:00.606 --> 00:09:01.360 improvements. 00:09:01.640 --> 00:09:08.608 So as future work we will need to test it on a larger set of configurations, to strengthen also 00:09:08.608 --> 00:09:08.960 the, 00:09:08.960 --> 00:09:11.200 sorry, the criteria definition. 00:09:11.800 --> 00:09:17.057 We also need to investigate what an acceptable minimum quality threshold is, and to develop 00:09:17.057 --> 00:09:20.600 indicators to support the interpretation of the different scores.
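The point about gold standards guiding the judge without acting as an answer key can be sketched as a judge-prompt builder: the gold standards illustrate acceptable (subjective) answers to questions other than the one being scored. The prompt wording and function name here are invented for illustration; only the mechanism matches what the talk describes.

```python
def build_judge_prompt(question: str, answer: str,
                       criteria: list[str], gold_standards: list[str]) -> str:
    """Assemble a judge prompt where gold standards are reference material,
    not the expected answer to the question under evaluation."""
    examples = "\n".join(f"- {g}" for g in gold_standards)
    checks = ", ".join(criteria)
    return (
        "You are judging an LLM answer for an LCA task.\n"
        "Reference examples of acceptable answers to OTHER questions "
        f"(for calibration only, not an answer key):\n{examples}\n"
        f"Score the answer below from 1 to 4 on each criterion ({checks}), "
        "with a short justification per criterion.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
    )

# Hypothetical usage with placeholder content.
prompt = build_judge_prompt(
    question="Define the system boundary for this study.",
    answer="Cradle-to-gate, excluding capital goods.",
    criteria=["completeness", "clarity"],
    gold_standards=["A functional unit of 1 km driven is acceptable here."],
)
```

Because the gold standards never answer the scored question directly, publishing or rotating the evaluation questions does not leak a ground truth the evaluated LLM could memorize.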
00:09:21.280 --> 00:09:26.637 It would also be interesting to compare the LLM-as-a-judge approach to a human review, to kind of 00:09:26.637 --> 00:09:27.280 validate it. 00:09:28.360 --> 00:09:35.324 And creating a collaborative ground truth database would be crucial to reflect the 00:09:35.324 --> 00:09:38.880 field's subjectivity and to help make LLMs smarter. 00:09:39.280 --> 00:09:39.720 Thank you.