Evaluation Metrics for Retrieval Augmented Generation in the Scientific Domain
Date: 2025-04-28
Author: Oliva, María Paz
Abstract
Retrieval-Augmented Generation (RAG) represents the state-of-the-art approach to question-answering (QA) tasks in the scientific domain. A RAG system combines a powerful generative component, capable of producing grammatically sound and readable answers, with a retrieval component that efficiently locates specific information within a large corpus of documents. As such, RAG systems are particularly well suited to the complexities inherent in this task.
However, evaluating the accuracy and quality of the generated answers remains a significant
challenge.
The aim of this thesis was to find an effective method for assessing RAG performance in a
scientific QA task. To this end, we conducted an extensive review of the automatic evaluation metrics currently in use. The most common approach compares generated answers with a reference produced by humans. Such a comparison can focus on form (lexical similarity), on content (semantic similarity), or on a deeper analysis through the use of Large Language Models (model-based metrics). Each of these approaches has well-documented advantages and drawbacks,
making it necessary to rigorously test their reliability and effectiveness in this context.
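To make the lexical-similarity category concrete, the following is a minimal sketch of one common reference-based metric: token-overlap F1 between a generated answer and a human reference (semantic metrics would instead compare embeddings, and model-based metrics would prompt an LLM to judge the answer). The function name and tokenization are illustrative, not taken from the thesis itself.

```python
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a human reference,
    a classic lexical-similarity score for QA evaluation."""
    gen = generated.lower().split()
    ref = reference.lower().split()
    if not gen or not ref:
        return float(gen == ref)
    # Count tokens shared between the two answers (multiset intersection).
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A metric of this kind illustrates the category's main drawback discussed below: a correct answer phrased differently from the reference scores poorly, because only surface tokens are compared.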
To explore this, we selected representative metrics from each category and designed three
progressively complex experiments to challenge them and analyze their behavior: 1) Can the
metrics distinguish between correct and incorrect answers? 2) Can they differentiate between
answers of varying quality, particularly when they deviate in form or content from the reference?
3) Do the metrics align with human preferences?
The strengths and limitations of these metrics were examined empirically. Findings showed that at the most basic level, distinguishing clearly correct from incorrect answers, all metrics performed well, albeit to varying degrees. When faced with more nuanced challenges, however, such as differentiating variations in form or content between higher- and lower-quality answers, both lexical and semantic similarity metrics struggled, whereas model-based metrics demonstrated greater flexibility and reliability. Nevertheless, in the final experiment none of the evaluation methods, across all categories, aligned consistently with human judgment; in fact, most of the metrics diverged from human preferences.
Consequently, no metric met performance expectations in all scenarios. Nonetheless, we were
able to provide a comprehensive analysis of their behavior, strengths, and limitations. In conclusion, we propose that, for assessing the performance of a RAG system in scientific QA, model-based metrics appear to be the most effective, particularly in distinguishing correct from incorrect answers and in differentiating varying levels of answer quality. However, further research
is needed to better align these metrics with human judgment. Moreover, findings suggest that
relying solely on human-generated reference answers as benchmarks may not effectively capture
human preferences. Instead, future evaluation frameworks could integrate human preferences
directly into the evaluation process.
By shedding light on the performance of current evaluation methods and advocating for a
shift toward model-based metrics that better incorporate human preferences, this thesis aims
to contribute to the field of QA evaluation and guide future research towards developing more
reliable and robust evaluation frameworks.