
dc.contributor.author: Azkune Galparsoro, Gorka
dc.contributor.author: Salaberria Saizar, Ander
dc.contributor.author: Agirre Bengoa, Eneko
dc.date.accessioned: 2024-05-06T18:02:02Z
dc.date.available: 2024-05-06T18:02:02Z
dc.date.issued: 2024-02
dc.identifier.citation: Neural Networks 170: 215-226 (2024)
dc.identifier.issn: 1879-2782
dc.identifier.issn: 0893-6080
dc.identifier.uri: http://hdl.handle.net/10810/67552
dc.description.abstract: This paper shows that text-only Language Models (LMs) can learn to ground spatial relations such as "left of" or "below" if they are provided with explicit location information for objects and are properly trained to leverage those locations. We perform experiments on a verbalized version of the Visual Spatial Reasoning (VSR) dataset, where images are coupled with textual statements containing real or fake spatial relations between two objects in the image. We verbalize the images using an off-the-shelf object detector, adding location tokens to every object label to represent its bounding box in textual form. Given the small size of VSR, we do not observe any improvement when using locations alone, but pretraining the LM on a synthetic dataset we derive automatically improves results significantly when location tokens are used. We thus show that locations allow LMs to ground spatial relations, with our text-only LMs outperforming Vision-and-Language Models and setting a new state of the art for the VSR dataset. Our analysis shows that our text-only LMs can generalize, to some extent, beyond the relations seen in the synthetic dataset, learning more useful information than that encoded in the spatial rules we used to create the synthetic dataset itself.
dc.description.sponsorship: Ander is funded by a PhD grant from the Basque Government (PRE_2021_2_0143). This work is partially supported by the Ministry of Science and Innovation of the Spanish Government (AWARE project TED2021-131617B-I00, DeepKnowledge project PID2021-127777OB-C21) and the Basque Government (IXA excellence research group IT1570-22).
dc.language.iso: eng
dc.publisher: Elsevier
dc.relation: info:eu-repo/grantAgreement/MICINN/TED2021-131617B-I00
dc.relation: info:eu-repo/grantAgreement/MICINN/PID2021-127777OB-C21
dc.rights: info:eu-repo/semantics/openAccess
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/es/
dc.subject: spatial grounding
dc.subject: language models
dc.subject: deep learning
dc.title: Grounding spatial relations in text-only language models
dc.type: info:eu-repo/semantics/article
dc.rights.holder: © 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
dc.rights.holder: Attribution 3.0 Spain
dc.relation.publisherversion: https://www.sciencedirect.com/science/article/pii/S089360802300655X
dc.identifier.doi: 10.1016/j.neunet.2023.11.031
dc.departamentoes: Ciencia de la computación e inteligencia artificial
dc.departamentoes: Lenguajes y sistemas informáticos
dc.departamentoeu: Hizkuntza eta sistema informatikoak
dc.departamentoeu: Konputazio zientziak eta adimen artifiziala

