Noisy speech recognition using Kaldi and neural architectures
Date
2018-02
Author
González Docasal, Ander
Abstract
[EN] The goal of an Automatic Speech Recognition (ASR) system is to transform a set of acoustic features into a sequence of words. It consists of several parts: the feature extraction stage, which extracts information from the speech signal; the acoustic model, in charge of the conversion from speech to phonemes; and the language model, which transforms the detected phonemes into the most probable sequence of words.
Throughout their history, these systems were built with statistical methods, mainly Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM). In recent years, however, the use of neural architectures such as Deep, Convolutional and Recurrent Neural Networks (DNN, CNN and RNN) has improved the achieved results significantly. Moreover, freely available tools have allowed ASR research to develop quickly. Kaldi is one of the best-known and most widely used ASR systems. It includes a set of neural network packages (nnet1, nnet2 and nnet3) which can be used to implement the acoustic model. These are fast, accurate and able to handle huge databases, since they distribute the load across clusters of machines. However, Kaldi’s slow development cycle means that new neural architectures may be introduced many years after their publication.
Therefore, in this work we replace the neural acoustic model of Kaldi with our own implementations written in TensorFlow. TensorFlow has the largest community of users and the best support among the available deep learning libraries. By replacing the acoustic model of Kaldi with different architectures and testing their performance on the well-known Aurora-4 database, we managed to reduce the Word Error Rate (WER) by 3.17 percentage points from a baseline of 15.14 % when using a CNN architecture. In addition, focusing on just the clean subset of the Test part of the database, a further improvement was achieved by implementing a CNN + RNN structure, from a 4.54 % WER with the CNNs alone to 4.13 % with this architecture.
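The WER figures quoted above can be read against the conventional definition of the metric: the word-level edit distance (substitutions, deletions and insertions) between the recognised and reference transcripts, divided by the number of reference words. The abstract does not say how scoring was carried out, so the following is only a minimal sketch under that assumption; the function names `word_error_rate` and `relative_reduction` are hypothetical and are not taken from Kaldi or from the thesis.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (S + D + I) / N using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline` (both in the same unit)."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy transcripts (illustrative only, not Aurora-4 data):
    # one substitution ("sat" -> "sit") and one deletion ("the").
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Clean-subset figures from the abstract: 4.54 % (CNN) vs 4.13 % (CNN + RNN).
    print(relative_reduction(4.54, 4.13))
```

With the clean-subset figures from the abstract, the absolute gain of 0.41 points corresponds to a relative WER reduction of roughly 9 %, which is why it helps to state whether an improvement is absolute or relative.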
This work is therefore believed to improve on the results obtained by one of the most widely used ASR tools simply by implementing more advanced deep learning techniques, which can be executed by more powerful, dedicated external programs.
For future work, further analysis of more complex convolutional networks could lead to better performance on this particular database and, in general, in noisy environments. Finally, further improvement of convolutional and recurrent architectures is suggested for clean, noise-free conditions, since they have been shown to obtain the best results in these specific circumstances.
[EU] The goal of an Automatic Speech Recognition (ASR) system is to convert a set of acoustic features into a sequence of words. It is made up of the following parts: feature extraction, which extracts the linguistic information from the audio signal as vectors of acoustic features; the acoustic model, responsible for converting the acoustic vectors into phonemes; and the language model, which returns the most probable word sequence given the detected phonemes.
Throughout their history, these systems were built using statistical methods, mainly Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM). In recent years, however, the use of neural architectures such as Deep, Convolutional and Recurrent Neural Networks (DNN, CNN and RNN) has significantly improved on the earlier results. Kaldi is one of the best-known and most widely used ASR systems. It includes several packages that implement neural networks (nnet1, nnet2 and nnet3). These can be used to implement the acoustic model because they are fast, accurate and able to handle large databases, the latter by distributing the load across clusters of machines.
However, because of Kaldi's slow development cycle, new neural architectures are not incorporated until many years after their publication.
Therefore, in this work Kaldi's acoustic model is replaced with our own implementations written in TensorFlow.
TensorFlow has the largest user community and the best support compared with the other deep learning libraries. By replacing Kaldi's acoustic model with different architectures and evaluating on the Aurora-4 database, the initial Word Error Rate (WER) of 15.14 % was improved by 3.17 percentage points when training with Convolutional Neural Networks. Likewise, when focusing only on the clean subset of the Test set, the results were improved even further by implementing a CNN + RNN structure; specifically, the 4.54 % WER obtained with CNNs alone was reduced to 4.13 % with this architecture.
Therefore, this work shows that the results obtained with one of the most widespread ASR systems can be improved simply by implementing more advanced deep learning techniques. It also shows that these techniques can be executed by other, more powerful dedicated programs.
For future work, a deeper analysis of more complex CNNs could lead to better performance of the ASR system on this particular database and, in general, in noisy environments. When working in clean conditions, however, the focus should be on CNNs and RNNs, since these obtained the best results under such conditions.