Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish

Ustaszewski, Michael

Date

2016-11-30

URI

http://hdl.handle.net/10810/19647

Abstract

In morphologically complex languages, many high-level tasks in natural language processing rely on accurate morphosyntactic analyses of the input. However, in light of the risk of error propagation in present-day pipeline architectures for basic linguistic pre-processing, the state of the art for morphosyntactic tagging is still not satisfactory. The main obstacle here is data sparsity inherent to natural language in general and highly inflected languages in particular. In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a supervised probabilistic tagger with morphologically informed features. Our evaluations on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful to increase cross-domain performance of taggers, and to alleviate the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages.
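
The idea of feeding cluster information to a supervised tagger can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not name the clustering algorithm or the tagger, so the sketch assumes hierarchical clusters encoded as bit-string paths, and the names `CLUSTERS`, `PREFIX_LENGTHS`, and `token_features` are hypothetical. It shows how a word form absent from the labelled data can still receive informative features when it shares a cluster prefix with a known form.

```python
from typing import Dict, List

# Hypothetical cluster map: word form -> bit-string path in a cluster hierarchy.
# In practice such a map would be induced from a large unlabelled corpus;
# the entries below are made up for illustration.
CLUSTERS: Dict[str, str] = {
    "psami": "011010",   # 'dogs', instrumental plural
    "kotami": "011011",  # 'cats', instrumental plural
    "idzie": "1100",     # 'goes'
}

# Prefix lengths are an assumption; shorter prefixes give coarser classes.
PREFIX_LENGTHS = (2, 4, 6)


def token_features(tokens: List[str], i: int, clusters: Dict[str, str]) -> Dict[str, str]:
    """Build a feature dictionary for the token at position i."""
    word = tokens[i].lower()
    feats = {
        "word": word,
        "suffix3": word[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }
    path = clusters.get(word)
    if path is not None:
        # One feature per prefix length; a path shorter than n is used whole.
        for n in PREFIX_LENGTHS:
            feats[f"cluster{n}"] = path[:n]
    return feats


if __name__ == "__main__":
    # Suppose 'psami' occurs in the labelled data and 'kotami' does not.
    seen = token_features(["z", "psami"], 1, CLUSTERS)
    unseen = token_features(["z", "kotami"], 1, CLUSTERS)
    print(seen["cluster4"], unseen["cluster4"])
```

In this toy setting the two forms differ in their `word` feature but agree on `cluster2` and `cluster4`, which is the kind of shared evidence a probabilistic tagger could exploit for out-of-vocabulary words.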

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International