dc.contributor.author: Peñagarikano Badiola, Mikel
dc.contributor.author: Varona Fernández, Amparo
dc.contributor.author: Bordel García, German
dc.contributor.author: Rodríguez Fuentes, Luis Javier
dc.date.accessioned: 2023-08-01T08:47:56Z
dc.date.available: 2023-08-01T08:47:56Z
dc.date.issued: 2023-07-23
dc.identifier.citation: Applied Sciences 13(14) : (2023) // Article ID 8492 [es_ES]
dc.identifier.issn: 2076-3417
dc.identifier.uri: http://hdl.handle.net/10810/62077
dc.description.abstract: In this paper, a semisupervised speech data extraction method is presented and applied to create a new dataset designed for the development of fully bilingual Automatic Speech Recognition (ASR) systems for Basque and Spanish. The dataset is drawn from an extensive collection of Basque Parliament plenary sessions containing frequent code switching. Since session minutes are not exact, only the most reliable speech segments are kept for training. To that end, we use phonetic similarity scores between nominal and recognized phone sequences. The process starts with baseline acoustic models trained on generic out-of-domain data, then iteratively updates the models with the extracted data and applies the updated models to refine the training dataset until the observed improvement between two iterations becomes small enough. A development dataset, involving five plenary sessions not used for training, has been manually audited for tuning and evaluation purposes. Cross-validation experiments (with 20 random partitions) have been carried out on the development dataset, using the baseline and the iteratively updated models. On average, Word Error Rate (WER) decreases from 16.57% (baseline) to 4.41% (first iteration) and further to 4.02% (second iteration), which corresponds to relative WER reductions of 73.4% and 8.8%, respectively. When considering only Basque segments, WER decreases on average from 16.57% (baseline) to 5.51% (first iteration) and further to 5.13% (second iteration), which corresponds to relative WER reductions of 66.7% and 6.9%, respectively. As a result of this work, a new bilingual Basque–Spanish resource has been produced based on Basque Parliament sessions, including 998 h of training data (audio segments + transcriptions), a development set (17 h long) designed for tuning and evaluation under a cross-validation scheme, and a fully bilingual trigram language model. [es_ES]
dc.description.sponsorship: This work was partially funded by the Spanish Ministry of Science and Innovation (OPEN-SPEECH project, PID2019-106424RB-I00) and by the Basque Government under the general support program to research groups (IT-1704-22). [es_ES]
dc.language.iso: eng [es_ES]
dc.publisher: MDPI [es_ES]
dc.relation: info:eu-repo/grantAgreement/MICINN/PID2019-106424RB-I00 [es_ES]
dc.rights: info:eu-repo/semantics/openAccess [es_ES]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: automatic speech recognition [es_ES]
dc.subject: multilingual speech [es_ES]
dc.subject: low-resource languages [es_ES]
dc.subject: code switching [es_ES]
dc.subject: semisupervised learning [es_ES]
dc.subject: spoken language resources [es_ES]
dc.title: Semisupervised Speech Data Extraction from Basque Parliament Sessions and Validation on Fully Bilingual Basque–Spanish ASR [es_ES]
dc.type: info:eu-repo/semantics/article [es_ES]
dc.date.updated: 2023-07-28T12:22:26Z
dc.rights.holder: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). [es_ES]
dc.relation.publisherversion: https://www.mdpi.com/2076-3417/13/14/8492 [es_ES]
dc.identifier.doi: 10.3390/app13148492
dc.departamentoes: Electricidad y electrónica
dc.departamentoeu: Elektrizitatea eta elektronika
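
The relative WER reductions quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names `relative_reduction` and `stepwise` are hypothetical and not taken from the paper, and the only data used are the WER percentages stated in the abstract.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return 100.0 * (before - after) / before


def stepwise(wers):
    """Relative reduction between each pair of consecutive WER values."""
    return [round(relative_reduction(a, b), 1) for a, b in zip(wers, wers[1:])]


# WER values (in %) as quoted in the abstract:
# baseline, first iteration, second iteration.
all_segments = [16.57, 4.41, 4.02]
basque_only = [16.57, 5.51, 5.13]

print(stepwise(all_segments))  # [73.4, 8.8]
print(stepwise(basque_only))   # [66.7, 6.9]
```

Each reduction is measured relative to the previous iteration rather than to the baseline, which is why the second-iteration figures (8.8% and 6.9%) are much smaller than the first.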


Except where otherwise noted, this item's license is described as © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).