Show simple item record

dc.contributor.advisor: Teso, Estefano
dc.contributor.advisor: Agerri Gascón, Rodrigo
dc.contributor.advisor: Vankov, Ivan
dc.contributor.author: Oliva, María Paz
dc.date.accessioned: 2025-04-03T12:28:26Z
dc.date.available: 2025-04-03T12:28:26Z
dc.date.issued: 2025-04-28
dc.identifier.uri: http://hdl.handle.net/10810/73106
dc.description.abstract: Retrieval-Augmented Generation (RAG) represents the state-of-the-art approach to question answering (QA) tasks in the scientific domain. This system combines a powerful generative component, capable of producing grammatically sound and readable answers, with a retrieval component that efficiently locates specific information within a large corpus of documents. As such, RAG systems are particularly well suited to address the complexities inherent in this task. However, evaluating the accuracy and quality of the generated answers remains a significant challenge. The aim of this thesis was to find an effective method for assessing RAG performance in a scientific QA task. To this end, we conducted an extensive review of the automatic evaluation metrics currently in use. The most common approach compares generated answers with a reference produced by humans. Such a comparison can focus on the form (lexical similarity), on the content (semantic similarity), or on a deeper analysis through the use of Large Language Models (model-based). Each of these approaches has well-documented advantages and drawbacks, making it necessary to rigorously test their reliability and effectiveness in this context. To explore this, we selected representative metrics from each category and designed three progressively complex experiments to challenge them and analyze their behavior: 1) Can the metrics distinguish between correct and incorrect answers? 2) Can they differentiate between answers of varying quality, particularly when they deviate in form or content from the reference? 3) Do the metrics align with human preferences? The strengths and limitations of these metrics were examined empirically. Findings showed that at the most basic level, distinguishing clearly correct from incorrect answers, all metrics performed well, though to varying degrees. However, when faced with more nuanced challenges, such as differentiating variations in form or content between higher- and lower-quality answers, both lexical and semantic similarity metrics struggled. In contrast, model-based metrics demonstrated greater flexibility and reliability. Nevertheless, in the final experiment, none of the evaluation methods, across all categories, aligned consistently with human judgment; in fact, most of the metrics diverged from human preferences. Consequently, no metric met performance expectations in all scenarios. Nonetheless, we were able to provide a comprehensive analysis of their behavior, strengths, and limitations. In conclusion, we propose that, for assessing the performance of a RAG system in scientific QA, model-based metrics appear to be the most effective, particularly in distinguishing correct from incorrect answers and in differentiating varying levels of answer quality. However, further research is needed to better align these metrics with human judgment. Moreover, our findings suggest that relying solely on human-generated reference answers as benchmarks may not effectively capture human preferences; instead, future evaluation frameworks could integrate human preferences directly into the evaluation process. By shedding light on the performance of current evaluation methods and by advocating a shift toward model-based metrics that better incorporate human preferences, this thesis aims to contribute to the field of QA evaluation and to guide future research towards more reliable and robust evaluation frameworks.
dc.language.iso: eng
dc.rights: info:eu-repo/semantics/openAccess
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/es/
dc.title: Evaluation Metrics for Retrieval Augmented Generation in the Scientific Domain
dc.type: info:eu-repo/semantics/masterThesis
dc.rights.holder: Atribución-NoComercial-CompartirIgual 3.0 España (Attribution-NonCommercial-ShareAlike 3.0 Spain)
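
The abstract above contrasts three families of reference-based metrics: lexical similarity (form), semantic similarity (content), and model-based evaluation with Large Language Models. As a minimal sketch of the first two, the Python snippet below scores a single generated answer against a human reference. It is an illustrative assumption, not the evaluation pipeline used in the thesis; the embedding model name ("all-MiniLM-L6-v2") and the toy answer/reference strings are placeholders chosen for demonstration.

# Minimal sketch: lexical-overlap F1 (form) vs. embedding cosine similarity
# (content) between a generated answer and a human reference.
# Assumption: the sentence-transformers package and the model choice are
# illustrative, not the configuration used in the thesis.
from collections import Counter

from sentence_transformers import SentenceTransformer, util


def lexical_f1(candidate: str, reference: str) -> float:
    """Token-level F1 overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def semantic_similarity(candidate: str, reference: str,
                        model_name: str = "all-MiniLM-L6-v2") -> float:
    """Cosine similarity between sentence embeddings of candidate and reference."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode([candidate, reference])
    return float(util.cos_sim(embeddings[0], embeddings[1]))


if __name__ == "__main__":
    reference = "The mitochondrion produces most of the cell's ATP."
    candidate = "Most of a cell's ATP is generated by the mitochondria."
    print("lexical F1:", round(lexical_f1(candidate, reference), 3))
    print("semantic similarity:", round(semantic_similarity(candidate, reference), 3))

A paraphrased but correct answer, such as the one in this toy example, typically scores lower on the lexical measure than on the embedding-based one, which is the kind of form/content deviation probed in the second experiment described in the abstract.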

