A large reproducible benchmark of ontology-based methods and word embeddings for word similarity
Date
2020-09-30
Author
Goikoetxea Salutregi, Josu
Lastra Díaz, Juan José
Agirre Bengoa, Eneko
Taieb, Mohamed Ali Hadj
García Serrano, Ana
Ben Aouicha, Mohamed
Sánchez, David
Information Systems 96 (2021), Article ID 101636
Abstract
This work is a companion reproducibility paper for the experiments and results reported in Lastra-Díaz
et al. (2019a). It is based on the evaluation of a companion reproducibility dataset with the HESML
V1R4 library and ReproZip, a long-term reproducibility tool. Human similarity and relatedness
judgements between concepts underlie most cognitive capabilities, such as categorization, memory,
decision-making and reasoning. For this reason, research on methods for estimating the degree of
similarity and relatedness between words and concepts has received considerable attention in the
fields of artificial intelligence and cognitive science. However, despite this extensive research
effort, the field lacks a self-contained, reproducible and extensible collection of benchmarks that
could become a de facto standard for large-scale experimentation in this line of research.
To bridge this reproducibility gap, this work introduces a set of reproducible experiments on word
similarity and relatedness by providing a detailed reproducibility protocol together with a set of
software tools and a self-contained reproducibility dataset, which together allow all experiments
and results in our aforementioned work to be reproduced exactly. Our primary work introduces the
largest, most detailed and reproducible experimental survey on word similarity and relatedness
reported in the literature, based on the implementation of all evaluated methods on the same
software platform. Our reproducible experiments evaluate most of the methods in the families of
ontology-based semantic similarity measures and word embedding models. We also detail how to extend
our experiments to evaluate experimental setups not considered in the original survey. Finally, we
provide a corrigendum for a mismatch in the MC28 similarity scores used in our original experiments.
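As a minimal illustration of the reproduction workflow described above, the commands below sketch how a self-contained ReproZip package could be unpacked and re-run with the standard reprounzip client. The package filename and target directory are illustrative assumptions, not the names used by the actual reproducibility dataset.

    # Install the ReproZip unpacker and its Docker plugin (standard PyPI packages).
    pip install reprounzip reprounzip-docker
    # Unpack the experiment bundle into a working directory (filename illustrative).
    reprounzip docker setup word_similarity_experiments.rpz ./experiment
    # Re-run the traced experiments exactly as originally captured.
    reprounzip docker run ./experiment

The docker unpacker is one of several reprounzip backends; the same setup/run pattern applies to the directory and vagrant unpackers, depending on the host environment.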