dc.contributor.author | Artetxe Zurutuza, Mikel | |
dc.contributor.author | Labaka Intxauspe, Gorka | |
dc.contributor.author | Agirre Bengoa, Eneko | |
dc.date.accessioned | 2024-10-16T18:30:00Z | |
dc.date.available | 2024-10-16T18:30:00Z | |
dc.date.issued | 2018 | |
dc.identifier.citation | Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing : 3632-3642 (2018) | es_ES |
dc.identifier.uri | http://hdl.handle.net/10810/69988 | |
dc.description.abstract | While modern machine translation has relied on large parallel corpora, a recent line of work has managed to train Neural Machine Translation (NMT) systems from monolingual corpora only (Artetxe et al., 2018c; Lample et al., 2018). Despite the potential of this approach for low-resource settings, existing systems are far behind their supervised counterparts, limiting their practical interest. In this paper, we propose an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems. Our method profits from the modular architecture of SMT: we first induce a phrase table from monolingual corpora through cross-lingual embedding mappings, combine it with an n-gram language model, and fine-tune hyperparameters through an unsupervised MERT variant. In addition, iterative backtranslation improves results further, yielding, for instance, 14.08 and 26.22 BLEU points in WMT 2014 English-German and English-French, respectively, an improvement of more than 7-10 BLEU points over previous unsupervised systems, and closing the gap with supervised SMT (Moses trained on Europarl) down to 2-5 BLEU points. Our implementation is available at https://github.com/artetxem/monoses. | es_ES |
dc.description.sponsorship | This research was partially supported by the Spanish MINECO (TUNER TIN2015-65308-C51-R, MUSTER PCIN-2015-226 and TADEEP TIN2015-70214-P, cofunded by EU FEDER), the UPV/EHU (excellence research group), and the NVIDIA GPU grant program. Mikel Artetxe enjoys a doctoral grant from the Spanish MECD. | es_ES |
dc.language.iso | eng | es_ES |
dc.publisher | ACL | es_ES |
dc.rights | info:eu-repo/semantics/openAccess | es_ES |
dc.rights.uri | http://creativecommons.org/licenses/by/3.0/es/ | * |
dc.title | Unsupervised Statistical Machine Translation | es_ES |
dc.type | info:eu-repo/semantics/conferenceObject | es_ES |
dc.rights.holder | (c) 2018 The authors under the Creative Commons Attribution 4.0 International (CC BY 4.0) | es_ES |
dc.relation.publisherversion | https://doi.org/10.18653/v1/D18-1399 | es_ES |
dc.identifier.doi | 10.18653/v1/D18-1399 | |
dc.departamentoes | Lenguajes y sistemas informáticos | es_ES |
dc.departamentoeu | Hizkuntza eta sistema informatikoak | es_ES |