
dc.contributor.author: Azkune Galparsoro, Gorka
dc.contributor.author: Salaberria Saizar, Ander
dc.contributor.author: Agirre Bengoa, Eneko
dc.date.accessioned: 2024-05-06T18:02:02Z
dc.date.available: 2024-05-06T18:02:02Z
dc.date.issued: 2024-02
dc.identifier.citation: Neural Networks 170: 215-226 (2024)
dc.identifier.issn: 1879-2782
dc.identifier.issn: 0893-6080
dc.identifier.uri: http://hdl.handle.net/10810/67552
dc.description.abstract: This paper shows that text-only Language Models (LMs) can learn to ground spatial relations such as "left of" or "below" if they are provided with explicit location information for objects and are properly trained to leverage those locations. We perform experiments on a verbalized version of the Visual Spatial Reasoning (VSR) dataset, where images are coupled with textual statements containing real or fake spatial relations between two objects in the image. We verbalize the images using an off-the-shelf object detector, adding location tokens to every object label to represent its bounding box in textual form. Given the small size of VSR, we do not observe any improvement when using locations alone, but pretraining the LM on a synthetic dataset we derive automatically improves results significantly when location tokens are used. We thus show that locations allow LMs to ground spatial relations, with our text-only LMs outperforming Vision-and-Language Models and setting a new state of the art for the VSR dataset. Our analysis shows that our text-only LMs can generalize, to some extent, beyond the relations seen in the synthetic dataset, learning more useful information than that encoded in the spatial rules we used to create the synthetic dataset itself.
dc.description.sponsorship: Ander is funded by a PhD grant from the Basque Government (PRE_2021_2_0143). This work is partially supported by the Ministry of Science and Innovation of the Spanish Government (AWARE project TED2021-131617B-I00, DeepKnowledge project PID2021-127777OB-C21) and the Basque Government (IXA excellence research group IT1570-22).
dc.language.iso: eng
dc.publisher: Elsevier
dc.relation: info:eu-repo/grantAgreement/MICINN/TED2021-131617B-I00
dc.relation: info:eu-repo/grantAgreement/MICINN/PID2021-127777OB-C21
dc.rights: info:eu-repo/semantics/openAccess
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/es/
dc.subject: spatial grounding
dc.subject: language models
dc.subject: deep learning
dc.title: Grounding spatial relations in text-only language models
dc.type: info:eu-repo/semantics/article
dc.rights.holder: © 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
dc.rights.holder: Attribution 3.0 Spain
dc.relation.publisherversion: https://www.sciencedirect.com/science/article/pii/S089360802300655X
dc.identifier.doi: 10.1016/j.neunet.2023.11.031
dc.departamentoes: Ciencia de la computación e inteligencia artificial
dc.departamentoes: Lenguajes y sistemas informáticos
dc.departamentoeu: Hizkuntza eta sistema informatikoak
dc.departamentoeu: Konputazio zientziak eta adimen artifiziala

