dc.contributor.advisor: Del Pozo Echezarreta, Arantza
dc.contributor.advisor: González Dios, Itziar
dc.contributor.author: García Montero, Eneritz
dc.date.accessioned: 2020-11-26T17:21:00Z
dc.date.available: 2020-11-26T17:21:00Z
dc.date.issued: 2020-11-26
dc.date.submitted: 2020-09-21
dc.identifier.uri: http://hdl.handle.net/10810/48626
dc.description.abstract: In some areas of Natural Language Processing (NLP), such as Information Retrieval (IR) and Question-Answering (QA) systems, words have traditionally been used in the development of expansion techniques. This Master's thesis presents two approaches for developing expansion techniques in the field of Dialogue Systems (DS), aimed more specifically at the understanding module of a chatbot for urban transport in Donostia (Gipuzkoa). The first approach uses word vectors to extract semantically similar terms, in this case the pre-trained FastText embeddings for Spanish, and the second approach uses word sense disambiguation to extract synonyms from a lexical database, in this case the Spanish WordNet. To this end, a collaborative task was designed in which the corpus is built by collecting inputs from a hypothetical real-life scenario. In addition, two sets of experiments were carried out to identify out-of-domain inputs. In the first phase, a scoring system was developed in which the corpus terms are ranked using Term Frequency-Inverse Document Frequency (TF-IDF), and the scoring system is then completed by means of cosine similarity. In the second phase, the previous scoring system is formalized by preparing and stratifying three datasets. These datasets were used to train a linear regressor and a Support Vector Machine with a linear kernel. According to the results, the pre-trained vectors are more faithful to the real input. However, lexical databases add wider linguistic coverage to the hypothetical expanded corpus. Finally, regarding domain discrimination, the results favor the dataset containing the largest number of terms extracted with TF-IDF. [es_ES]
dc.description.abstract: Text expansion techniques have been used in some subfields of Natural Language Processing (NLP) such as Information Retrieval and Question-Answering Systems. This Master's Thesis presents two approaches for expansion within the context of Dialogue Systems (DS), more precisely for the Natural Language Understanding (NLU) module of a chatbot for the urban transportation domain in San Sebastian (Gipuzkoa). The first approach uses word vectors to obtain semantically similar terms, while the second one involves synonym extraction from a lexical database. For this purpose, a corpus composed of real case scenario inputs has been exploited. Furthermore, the qualitative analysis of the implemented expansion techniques revealed a need to filter out-of-domain inputs. In relation to this problem, two different sets of experiments have been carried out. First, the feasibility of using Term Frequency-Inverse Document Frequency (TF-IDF) and cosine similarity as discrimination features was explored. Then, linear regression and Support Vector Machine (SVM) classifiers were trained and tested. Results show that pre-trained word embedding expansion constitutes a more faithful representation of real case scenario inputs, whereas lexical database expansion adds a wider linguistic coverage to a hypothetically expanded version of the corpus. For out-of-domain detection, increasing the number of features improves both linear regression and SVM classification results. [es_ES]
dc.language.iso: eus [es_ES]
dc.rights: info:eu-repo/semantics/openAccess [es_ES]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/es/ [*]
dc.title: Coping with Data Scarcity: First Steps towards Word Expansion for a Chatbot in the Urban transportation Domain [es_ES]
dc.type: info:eu-repo/semantics/masterThesis [es_ES]
dc.rights.holder: Atribución-NoComercial-CompartirIgual 3.0 España [*]
dc.departamentoes: Lenguajes y sistemas informáticos [es_ES]
dc.departamentoeu: Hizkuntza eta sistema informatikoak [es_ES]



Except where otherwise noted, this item's license is described as Atribución-NoComercial-CompartirIgual 3.0 España