Show simple item record

dc.contributor.advisor: Teso, Estefano
dc.contributor.advisor: Agerri Gascón, Rodrigo
dc.contributor.advisor: Vankov, Ivan
dc.contributor.author: Oliva, María Paz
dc.date.accessioned: 2025-04-03T12:28:26Z
dc.date.available: 2025-04-03T12:28:26Z
dc.date.issued: 2025-04-28
dc.identifier.uri: http://hdl.handle.net/10810/73106
dc.description.abstract: Retrieval-Augmented Generation (RAG) represents the state-of-the-art approach to question answering (QA) tasks in the scientific domain. This system combines a powerful generative component, capable of producing grammatically sound and readable answers, with a retrieval component that efficiently locates specific information within a large corpus of documents. As such, RAG systems are particularly well suited to address the complexities inherent in this task. However, evaluating the accuracy and quality of the generated answers remains a significant challenge. The aim of this thesis was to find an effective method for assessing RAG performance in a scientific QA task. To this end, we conducted an extensive review of the automatic evaluation metrics currently in use. The most common approach compares generated answers with a reference produced by humans. Such a comparison can focus on the form (lexical similarity), on the content (semantic similarity), or on a deeper analysis through the use of Large Language Models (model-based). Each of these approaches has well-documented advantages and drawbacks, making it necessary to rigorously test their reliability and effectiveness in this context. To explore this, we selected representative metrics from each category and designed three progressively complex experiments to challenge them and analyze their behavior: 1) Can the metrics distinguish between correct and incorrect answers? 2) Can they differentiate between answers of varying quality, particularly when they deviate in form or content from the reference? 3) Do the metrics align with human preferences? The strengths and limitations of these metrics were examined empirically. Findings showed that at the most basic level, distinguishing clearly correct from incorrect answers, all metrics performed well, though to varying degrees. However, when faced with more nuanced challenges, such as differentiating variations in form or content between higher- and lower-quality answers, both lexical and semantic similarity metrics struggled. In contrast, model-based metrics demonstrated greater flexibility and reliability. Nevertheless, in the final experiment, none of the evaluation methods, across all categories, aligned consistently with human judgment; in fact, most of the metrics diverged from human preferences. Consequently, no metric met performance expectations in all scenarios. Nonetheless, we were able to provide a comprehensive analysis of their behavior, strengths, and limitations. In conclusion, we propose that, for assessing the performance of a RAG system in scientific QA, model-based metrics appear to be the most effective, particularly in distinguishing correct from incorrect answers and in differentiating varying levels of answer quality. However, further research is needed to better align these metrics with human judgment. Moreover, our findings suggest that relying solely on human-generated reference answers as benchmarks may not effectively capture human preferences; instead, future evaluation frameworks could integrate human preferences directly into the evaluation process. By shedding light on the performance of current evaluation methods and by advocating a shift toward model-based metrics that better incorporate human preferences, this thesis aims to contribute to the field of QA evaluation and to guide future research towards more reliable and robust evaluation frameworks.
dc.language.iso: eng
dc.rights: info:eu-repo/semantics/openAccess
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/es/
dc.title: Evaluation Metrics for Retrieval Augmented Generation in the Scientific Domain
dc.type: info:eu-repo/semantics/masterThesis
dc.rights.holder: Atribución-NoComercial-CompartirIgual 3.0 España (Attribution-NonCommercial-ShareAlike 3.0 Spain)
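
The abstract above contrasts three families of reference-based metrics: lexical similarity (form), semantic similarity (content), and model-based evaluation with Large Language Models. As a minimal sketch of the first two, the Python snippet below scores a single generated answer against a human reference. It is an illustrative assumption, not the evaluation pipeline used in the thesis; the embedding model name ("all-MiniLM-L6-v2") and the toy answer/reference strings are placeholders chosen for demonstration.

# Minimal sketch: lexical-overlap F1 (form) vs. embedding cosine similarity
# (content) between a generated answer and a human reference.
# Assumption: the sentence-transformers package and the model choice are
# illustrative, not the configuration used in the thesis.
from collections import Counter

from sentence_transformers import SentenceTransformer, util


def lexical_f1(candidate: str, reference: str) -> float:
    """Token-level F1 overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def semantic_similarity(candidate: str, reference: str,
                        model_name: str = "all-MiniLM-L6-v2") -> float:
    """Cosine similarity between sentence embeddings of candidate and reference."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode([candidate, reference])
    return float(util.cos_sim(embeddings[0], embeddings[1]))


if __name__ == "__main__":
    reference = "The mitochondrion produces most of the cell's ATP."
    candidate = "Most of a cell's ATP is generated by the mitochondria."
    print("lexical F1:", round(lexical_f1(candidate, reference), 3))
    print("semantic similarity:", round(semantic_similarity(candidate, reference), 3))

A paraphrased but correct answer, such as the one in this toy example, typically scores lower on the lexical measure than on the embedding-based one, which is the kind of form/content deviation probed in the second experiment described in the abstract.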

