OkwuGbé: End-to-End Speech Recognition for Fon and Igbo

Bonaventure F. P. Dossou* (Jacobs University Bremen, f.dossou@jacobs-university.de)
Chris C. Emezue* (Technical University of Munich, chris.emezue@tum.de)
* These authors contributed equally to this work.

arXiv:2103.07762v2 [cs.CL] 16 Mar 2021

Abstract

Language is inherent and compulsory for human communication. Whether expressed in a written or spoken way, it ensures understanding between people of the same and different regions. With the growing awareness and effort to include more low-resourced languages in NLP research, African languages have recently been a major subject of research in machine translation and other text-based areas of NLP. However, there is still very little comparable research in speech recognition for African languages. Interestingly, some of the unique properties of African languages affecting NLP, like their diacritical and tonal complexities, have a major root in their speech, suggesting that careful speech interpretation could provide more intuition on how to deal with the linguistic complexities of African languages for text-based NLP. OkwuGbé is a step towards building speech recognition systems for African low-resourced languages. Using Fon and Igbo as our case study, we conduct a comprehensive linguistic analysis of each language and describe the creation of end-to-end, deep neural network-based speech recognition models for both languages. We present a state-of-the-art ASR model for Fon, as well as benchmark ASR results for Igbo. Our linguistic analyses of Fon and Igbo provide valuable insights and guidance for the creation of speech recognition models for other African low-resourced languages, as well as for future NLP research on Fon and Igbo. The source code of the Fon and Igbo models has been made publicly available.

1 Introduction

OkwuGbé = Okwu (Igbo: speech) + Gbé (Fon: languages)

OkwuGbé, the union of two words from Igbo (Okwu) and Fon (Gbé), means "the speech of languages", and signifies studying and integrating automatic speech recognition across several African languages in an effort to unify them.

In the past decade, African languages received very little to no attention in natural language processing (NLP) research (Joshi et al., 2020a; Andrew Caines, 2019), prompting recent efforts geared towards improving the state of African languages in NLP (∀ et al., 2020b; Abbott and Martinus, 2018; Siminyu et al., 2020; ∀ et al., 2020a). However, little work is being done on speech for these African languages, as more emphasis is placed on their text. Due to the largely acoustic nature of African languages (mostly tonal, diacritical, etc.), a careful speech analysis of African languages could provide better insight for text-based NLP involving African languages, as well as supplement the textual data needed for machine translation or language modelling. This is what inspired OkwuGbé and its focus on automatic speech recognition.

Automatic speech recognition (ASR, or speech-to-text) is a language technology where spoken words are identified, interpreted and converted to text. ASR is changing the way information is accessed, processed, and used. In recent years, ASR has achieved state-of-the-art performance for most Western and Asian languages such as English, French, Chinese and Japanese, due to the availability of large quantities of quality speech resources. African languages, on the other hand, still lack ASR applications.
This is mainly due to the lack or unavailability of speech resources for most African languages (ALs). In this paper, we introduce ASR systems for two low-resourced languages: Fon and Igbo. We show that using end-to-end deep neural networks (E2E DNNs) with Connectionist Temporal Classification (CTC) (Graves et al., 2006a) allows us to achieve promising results without using language models (LMs), which usually require huge amounts of data for training. We also demonstrate that leveraging the attention mechanism (Bahdanau et al., 2016) improves the performance of the acoustic models.

In section 2, we give an overview of the Fon and Igbo languages. We then discuss related work in section 3 and examine the data and data processing techniques we employed in this research in section 4. In section 5, we explore the model architectures used for our experiments, and we present our evaluation in section 6.

2 Overview of Fon and Igbo

In this section, we give an extensive overview of both languages. Table 1 summarises our analysis for the reader.

2.1 Fon

Fon (also known as Fongbe) is a native language of the Benin Republic, spoken by more than 2.2 million people in Benin, Nigeria, and Togo (Eberhard et al., 2020). Fon belongs to the Gbe branch of the Niger-Congo language family, and is a tonal, isolating and "left-behind" language according to Joshi et al. (2020b), with a basic Subject-Verb-Object (SVO) word order. There are currently about 53 different dialects of the Fon language spoken throughout Benin (Lefebvre and Brousseau, 2002; Capo, 1991; Eberhard et al., 2020). Its alphabet is based on the Latin alphabet, with the addition of the letters ɔ, ɖ, ɛ, and the digraphs gb, hw, kp, ny, and xw. There are 10 vowel phonemes in Fon: 6 said to be closed (among them i, u, ĩ, ũ) and 4 said to be open (ɛ, ɔ, a, ã). There are 22 consonants (m, b, n, ɖ, p, t, d, c, j, k, g, kp, gb, f, v, s, z, x, h, xw, hw, w).

Fon has two phonemic tones: high and low. High is realized as rising (low-high) after a consonant. Basic disyllabic words have all four possibilities: high-high, high-low, low-high, and low-low. In longer phonological words, like verb and noun phrases, a high tone tends to persist until the final syllable. If that syllable has a phonemic low tone, it becomes falling (high-low). Low tones disappear between high tones, but their effect remains as a downstep. Rising tones (low-high) simplify to high after high (without triggering downstep) and to low before high (Lefebvre and Brousseau, 2002; Capo, 1991).

Fon makes extensive use of a rich system of tense and aspect markers, expresses many semantic features through lexical items, and the periphrastic constructions often used are of a more agglutinative nature (Capo, 1986). Fon nominals are generally preceded by a prefix consisting of a vowel (e.g., the word aɖú, 'tooth'). The quality of this vowel is restricted to the subset of non-nasal vowels (Capo, 1991; Duthie and Vlaardingerbroek, 1981).

Reduplication is a morphological process in which the root or stem of a word, or part of it, is repeated. Fon, like the other Gbe languages, makes extensive use of reduplication in the formation of new words, especially in deriving nouns, adjectives, and adverbs from verbs. For instance, the verb lã, which means "to cut" (both in Fon and Ewe), is nominalized by reduplication, yielding lãlã, "the act of cutting". Triplication is used to intensify the meaning of adjectives and adverbs (Capo, 1991; Duthie and Vlaardingerbroek, 1981).
2.2 Igbo

Igbo is a native language of the Igbo people, an ethnic group majorly located in the southeastern part of Nigeria, in states such as Abia, Anambra, Ebonyi, Enugu, and Imo, as well as in the northeast of Delta state and the southeast of Rivers state. Outside Nigeria, it is spoken to a small extent in Cameroon and Equatorial Guinea. Igbo belongs to the Benue-Congo group of the Niger-Congo language family and is spoken by over 27 million people (Eberhard et al., 2020). There are approximately 30 Igbo dialects, some of which are not mutually intelligible. To illustrate the complexity of Igbo, we quote Nwaozuzu (2008): "...almost every community living as few as three kilometers apart has its few linguistic peculiarities. If these tiny peculiarities are isolated and considered to be able to assign linguistic dependence to each of these communities, we shall therefore be boasting of not less than one thousand languages in what we now know as the Igbo language."

This large number of dialects and peculiarities inspired the development of a standardized spoken and written Igbo in 1962, called Standard Igbo (Ohiri-Aniche, 2007), which is what we refer to when we say "Igbo". However, studies have shown that there are many sounds (mainly consonants) found in some other dialects of Igbo which are lacking in the Standard Igbo orthography. For example, Achebe et al. (2011) discovered about 50 unique speech sounds in Igbo. Morphologically, Igbo is an agglutinating language, with compounding word formation: e.g., ugbo (vehicle) + igwe (iron) = ugboigwe (locomotive). Igbo also uses reduplication, like Fon. Igbo has 28 consonants and 8 vowels, totalling 36 letters of the alphabet. The sound system of Igbo consists of eight vowel phonemes and 28 consonant phonemes (Ikekeonwu, 1999).

There are four different types of tones in the Igbo language (Odinye and Udechukwu, 2016): high tone (/), low tone (\), down step (−) (Rice, 1992), and down drift (−). Down drift is only observed in Igbo sentences, because one can raise or lower the pitch before a sentence is completed. Tone is an integral part of a word in Igbo. It is the interface of phonology and syntax in Igbo because it performs both lexical and grammatical functions (Nkamigbo, 2012). Igbo has three syllable types: consonant + vowel (the most common syllable type), vowel, or syllabic nasal.

Table 1: Summary analysis of Fon and Igbo

Characteristic | Fon | Igbo
Spoken where | mostly in Benin; some parts of Nigeria and Togo | mostly in southeastern Nigeria; a little in Equatorial Guinea and Cameroon
Speakers (Eberhard et al., 2020) | 2.2 million | 22 million
Language family tree | Niger-Congo > Atlantic-Congo > Volta-Congo > Kwa > Gbe > Fon | Niger-Congo > Atlantic-Congo > Volta-Congo > Volta-Niger > Igboid > Igbo
Language structure | Isolating language | Agglutinating language
Alphabet structure | 32 letters: 22 consonants, 10 vowels | 36 letters: 28 consonants, 8 vowels
Special letters besides Latin | ɔ, ɖ, ɛ, ã, gb, hw, kp, ny, and xw | ch, gb, gh, gw, kp, kw, nw, ny, and sh
Tonal? | Yes. 3 tones: high (/), low (\) and down step (−) | Yes. 4 tones: high (/), low (\), down step (−), and down drift (−)
Phoneme structure | 10 vowel phonemes and 22 consonant phonemes. Nasalization is present | 28 consonant phonemes and 8 vowel phonemes. Nasalization is present
Number of dialects | about 53 | about 30
Reduplication? | Yes, especially in deriving nouns, adjectives, and adverbs from verbs | Yes, sometimes in compounding word formation: e.g., ugbo (vehicle) + igwe (iron) = ugboigwe (locomotive)
Code-switching? | No | Yes
Code-switching, the act of "alternation of two languages during speech" (Poplack, 1979), is very common among Igbo-English bilingual speakers, making it an interesting feature for speech recognition research. We therefore examine it in more depth. Obiamalu and Mbagwu (2007, 2010) conducted extensive research on code-switching among Igbo speakers and classified it into three types: borrowing, quasi-borrowing and true code-switching (see Table 2). Borrowing in Igbo arises when words from English are inserted into Igbo during speech and go through phonological and morphological transformation (mark -> maakigo, table -> tebulu). This is usually because the speaker cannot quickly find the Igbo equivalent of the word, or such an equivalent does not exist; examples 1 and 2 in Table 2 illustrate this. In quasi-borrowing, the Igbo equivalents of the English words exist, but the English words are more often used by both monolinguals and bilinguals. They may or may not be assimilated into Igbo, as in borrowing; see examples 3 and 4. The third situation, called true code-switching, occurs when the speaker purposely chooses to use the English word, even though the Igbo equivalent is known and commonly used. This is most common among Igbo-English bilinguals; examples 5 and 6 are typical.

Table 2: Code-switching types and examples. Adapted from Obiamalu and Mbagwu (2007).

Type | Examples (Igbo / English) | Explanation
borrowing | 1. Ọ maakigo (mark) ule ahụ. / He has marked the examination. 2. Ọ dị na tebulu (table). / It is on the table. | The words 'mark' and 'table' have been borrowed and assimilated into Igbo because equivalents are not readily available in Igbo.
quasi-borrowing | 3. Obi zụrụ car ọhụrụ. / Obi bought a new car. 4. Obi zụrụ ụgbọala ọhụrụ. / Obi bought a new car. | Even though Igbo has a word for 'car', some bilinguals still use the English word.
true code-switching | 5. Fela na ecriticize onye ọbụla. / Fela criticizes everybody. 6. Jesus turnụrụ water ọ ghọrọ wine. / Jesus turned water into wine. | These cases are true code-switching because the Igbo words for 'criticize', 'turn', 'water' and 'wine' are readily available in Igbo, but the speaker chooses to use the English equivalents.

3 Related Works

In this section, we review related work in terms of data resources, model architectures, and the state of ASR research for Fon and Igbo.

Previous work according to data resources: Xu et al. (2020) classified previous work on ASR, according to data resources, into rich-resource, low-resource and extremely low-resource settings, as shown in Table 3. In the rich-resource setting, a large amount of paired speech and text data is available for training, amounting to hundreds of hours from multiple speakers. Furthermore, a pronunciation lexicon is also leveraged during training for better results. These are the languages with ASR models already deployed in industry; English is a main example of this setting. In the low-resource setting, there are only about a dozen minutes of single-speaker high-quality paired data and a few hours of multi-speaker low-quality paired data: far less paired data than in the rich-resource setting. In the extremely low-resource setting, which is where our work lies, there is little to no paired speech data, very low online presence, and sometimes no developed pronunciation lexicon or language model to improve ASR models. Some of these languages also have only a small amount of unpaired multi-speaker data.
This is the case for many African languages.

Table 3: Data resources used to build ASR models and the corresponding related work in the different settings. Adapted from Xu et al. (2020).

Data | Rich-Resource | Low-Resource | Extremely Low-Resource
pronunciation lexicon | ✓ | ✓ | ✗
paired data (single-speaker, high quality) | dozens of minutes | several minutes | ✗
paired data (multi-speaker, high quality) | hundreds of hours | dozens of hours | several hours
unpaired speech (single-speaker, high quality) | ✓ | dozens of hours | ✗
unpaired speech (multi-speaker, low quality) | ✓ | ✓ | dozens of hours
unpaired text | ✓ | ✓ | very few
ASR (related work) | Chorowski et al. (2014a,b); Chiu et al. (2018); Chan et al. (2016); Hannun et al. (2014); Li et al. (2019); Mamyrbayev et al. (2020); Hori et al. (2019); Rosenberg et al. (2019); Schneider et al. (2019) | Tjandra et al. (2017) | Our work; Laleye et al. (2016); Xu et al. (2020); Baevski et al. (2020); Liu et al. (2020); Ren et al. (2019)

Previous work according to model architecture: While traditional phonetic-based approaches (hidden Markov models) have produced considerable results in the past, we focus on end-to-end speech recognition with deep learning (Chorowski et al., 2014a,b; Hannun et al., 2014; Amodei et al., 2015; Chan et al., 2016; Chiu et al., 2018), because such models have been shown to produce better results, with little dependence on handcrafted features and phoneme dictionaries. Chorowski et al. (2014b) introduced an end-to-end continuous ASR system using a bidirectional recurrent neural network (RNN) encoder with an RNN decoder that aligns the input and output sequences using the attention mechanism; the model achieved a word error rate (WER) of 18.57% on the TIMIT data set. Hannun et al. (2014) and Amodei et al. (2015) presented state-of-the-art ASR systems using E2E DNNs. They introduced a system that does not use any hand-designed language component, nor even the concept of a "phoneme". Their result was achieved, as the authors stated in their original paper, through a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques and language models. Following the promising features that E2E DNNs offer, Mamyrbayev et al. (2020) showed in their recent studies that using them with CTC works without the direct inclusion of language models.

State of ASR resources for African languages: Open-sourced data is one of the driving forces of research in any deep learning field, including ASR, because it fosters experimenting, training and developing better models. The Open Speech and Language Resources platform (OpenSLR, http://openslr.org/index.html) hosts open-sourced speech and language resources, such as training corpora for speech recognition, for public use. It currently covers many languages (both high- and low-resource). However, we discovered that of the roughly 2000 African languages, only Yoruba (Gutkin et al., 2020), Nigerian Pidgin and four South African languages (Afrikaans, Sesotho, Setswana, isiXhosa) (van Niekerk et al., 2017) were present as of the time of writing. Furthermore, they contain very few samples (tens to a few hundred hours of audio), compared to their high-resource counterparts (thousands to millions of audio hours). This scarcity of open resources for the development of ASR for low-resourced African languages is one of the major factors affecting the low state of ASR research in African languages.
Although there may be some non-open resources for some of these African languages, they come with huge licensing fees, among other limiting factors.

State of ASR research for Fon and Igbo: Fon, unlike Igbo, has little to no digital presence. With few speakers and almost no online presence, there has understandably been very little tonal analysis or ASR research for this language, and the few studies that exist are mostly by researchers who are native speakers. To the best of our knowledge, the only notable effort to build an ASR system for Fon is that of Laleye (2016) and Laleye et al. (2016), which reached a word error rate (WER) of 14.83%. This result was achieved by building two LMs, and only after normalizing and removing the diacritics, whose crucial importance for both performant ASR and neural machine translation (NMT) has been demonstrated by Orife (2018) and Dossou and Emezue (2020); we return to this in section 4.3.2. Their best model with diacritics scored a WER of 44.04%. Igbo, on the other hand, has seen a lot of tonal and speech analysis research in the past decade, but no public research on E2E DNN ASR, to the best of our knowledge. We opine that this is largely because 1) many older works on Igbo focused solely on tonal analysis (Odinye and Udechukwu, 2016; Nkamigbo, 2012), and 2) there is a lack of open-source speech data to encourage further research on exploring ASR with deep learning methods, which are known to be data-hungry.

4 Speech-to-Text Corpora and Data Preprocessing

4.1 Fon Speech-to-Text Corpus

We got our speech data set for Fon from the existing Fon speech corpus (https://paperswithcode.com/dataset/fongbe-speech-recognition), which was built through the tedious task of recording texts pronounced by native speakers of Fongbe (8 women and 20 men) in a noiseless environment. The recordings are sampled at a frequency of 16 kHz. The 28 native speakers spoke around 1500 phrases from the daily-conversation domain. The recordings were made with the LigAikuma Android application (https://lig-aikuma.imag.fr/). The minimum length of a speech sample is 2 seconds and the maximum is 5 seconds, giving an average content length of 4 seconds. Overall, around 10 hours of speech data were collected. The global data set has been split into training, validation and test sets: the training set contains 8 hours of speech (8235 speech samples), the validation set contains 1500 speech samples, and the test set contains 669 speech samples. The text corpus of 1500 sentences used to build the speech data set was scraped from BéninLangues (https://beninlangues.com/fongbe).

4.2 Igbo Speech-to-Text Corpus

It was very hard to find a data set of Igbo audio samples and their transcripts. We realized that there is a great lacuna: even though there has been much research on Igbo phonology, there have really been no (public) efforts to gather a speech-to-text data set for the language. The data set for our experiments on Igbo was obtained under a license from the Linguistic Data Consortium (LDC2019S16: IARPA Babel Igbo Language Pack) (Nikki et al., 2019). It contains approximately 207 hours of Igbo conversational and scripted telephone speech collected in 2014 and 2015, along with corresponding transcripts. The data set (hereafter called IgboDataset) is made up of telephone calls representing the Owerri, Onitsha, and Ngwa dialects spoken in Nigeria, sampled at 8 kHz, with a few samples at 48 kHz.
The gender distribution among speakers is approximately equal, and speakers' ages range from 16 to 67 years. The telephone calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle. The diacritics were originally removed from the transcripts.

Unlike the Fon data set, which is modern and contains very clean audio, IgboDataset has old speech patterns and contains many noisy audio samples. Therefore, we had to implement a number of cleaning strategies, such as filtering based on word length, upsampling, and exploring different mel-spectrogram settings and numbers of Fast Fourier Transform (FFT) bins (the FFT is an algorithm that computes the discrete Fourier transform, described in section 4.3.1, of a sequence). Our cleaning strategies gave us a reduced data set of 2.5 hours, which we split into train, dev and test sets of 4000, 100 and 100 audio samples respectively. To test the importance of our pre-cleaning, we trained the model on both the large uncleaned data set and our cleaned version (results are discussed in section 6).

4.3 Data Preprocessing

4.3.1 Speech Preprocessing

Speech signals are made up of amplitudes and frequencies. The amplitude alone only tells us about the loudness of the recording, which is not very informative. To get more information from our speech samples, we decided to map them into the frequency domain. Two well-known techniques for converting speech data from the time domain to the frequency domain are the Fourier Transform (FT) and the Discrete Fourier Transform (DFT) (Boashash, 2003; Bracewell, 2000). The FT converts a continuous signal from the time domain to the frequency domain: it decomposes the signal into its frequency components, giving the frequencies present in the signal and their respective magnitudes. The DFT, similarly, converts a sequence, considered as a discrete signal, into its frequency components. However, applying only the FFT gives frequency values without any time information.

To preserve frequency, time and amplitude information about the speech samples in a reasonable and adequate range, we decided to use mel-spectrograms. The mel-scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another (Stevens et al., 1937). It is constructed such that sounds that are equally spaced on the mel-scale also "sound" equally spaced to human listeners. A popular formula to convert a frequency f in hertz into m mels is the O'Shaughnessy formula (O'Shaughnessy, 1987):

m = 2595 * log10(1 + f/700)

A spectrogram (see http://www.glottopedia.org/index.php/Spectrogram) is an image: a three-dimensional (3D) representation that displays how the energy of the frequency components of the speech changes over time. The abscissa represents time, the ordinate represents frequency, and amplitude is shown by the darkness of a given frequency at a particular time: low amplitudes are represented with a light blue color, and very high amplitudes are represented by dark red.

There are two types of spectrograms: broad-band spectrograms and narrow-band spectrograms (see http://www.cas.usf.edu/~frisch/SPA3011_L07.html).

• Broad-band spectrograms have higher temporal resolution, allowing the detection of changes in frequency over small intervals of time.
However, they usually do not allow good frequency distinctions, as the time interval for each spectrum is small.

• Narrow-band spectrograms have higher frequency resolution and a larger time interval for every spectrum than broad-band spectrograms: this allows the detection of very small differences in frequency. Moreover, they show individual harmonic structures, which are vibration frequency folds of the speech, as horizontal striations.

A mel-spectrogram is hence a spectrogram on the mel-scale. In our study, we decided to use narrow-band mel-spectrograms as input features for the model. We used 512 as the length of the FFT window, 512 as the hop length (the number of samples between successive frames), and a Hann window whose size is set to the length of the FFT window. For handling the audio data, we used the torchaudio utility from PyTorch (Paszke et al., 2019). We used Spectrogram Augmentation (SpecAugment) (Park et al., 2019) as a form of data augmentation: we cut out random blocks of consecutive time and frequency dimensions. Mel-spectrograms were generated from each speech sample with the following fine-tuned hyper-parameters:

• sample_rate: sample rate of the audio signal, set to 16000 Hz (16 kHz) for Fon and 8000 Hz (8 kHz) for Igbo.
• n_mels: number of mel filterbanks, set to 128 for Fon and 64 for Igbo.

4.3.2 Text Preprocessing

Scholars like Orife (2018) and Dossou and Emezue (2020) have shown in their studies that keeping the diacritics improves lexical disambiguation and provides more morphological information to neural machine translation models. Additionally, diacritics convey the pronunciation tone and the sound to be produced, leading to an improved understanding of sentences and their contexts. A diacritic is a glyph added to a letter or basic glyph. Diacritics appear above or below a letter, or in some other position such as within the letter or between two letters. Some diacritical marks, such as the acute (´) and grave (`), are often called accents. A good example of their importance in Fon and Igbo can be seen in Table 5, where we show that removing the diacritics from a word can lead to ambiguity and confusion about the intended word. Therefore, we preprocessed the textual data without removing the diacritics. The results of the text preprocessing on the Fon data set are presented in Table 4. For Igbo, unfortunately, the IgboDataset was originally stripped of its diacritics, so we were not able to encode any diacritical information.

Table 4: Fon sentences before and after preprocessing, with their English translations (the preprocessed text is identical to the source).

Lang. | Source Text | English Translation
Fon | xovɛ sin mì tlala kɛnklɛn bo ná mì nùɖé | I'm super hungry, I'm starving. Please give me some food.
Fon | a ná gbɔ nù ɖú bó gbɔ sin nù bó gbɔ xó ɖɔ káká yi azãn tɔn gbè | You will not be eating, neither drinking nor speaking, for the next three days.

Table 5: Examples of tonal inflection with Fon and Igbo.

(a) Fon: to (ambiguous and uncertain) | tó (ear) | tò (sea) | tô (country)
(b) Igbo: akwa (ambiguous and uncertain) | ákwà (cloth) | ākwá (egg) | ákwá (to cry)
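To make the speech pre-processing of Section 4.3.1 concrete before moving on to the models, the sketch below shows how the narrow-band mel-spectrogram features and SpecAugment-style masking could be produced with torchaudio. The hyper-parameter values are the ones listed above (Fon settings); the masking widths, the resampling step and the featurize helper are illustrative assumptions rather than the exact code used in our experiments.

```python
import torch
import torchaudio

# Hyper-parameters from Section 4.3.1 (Fon values; Igbo uses 8 kHz and 64 mel bands).
SAMPLE_RATE = 16000
N_FFT = 512
HOP_LENGTH = 512
N_MELS = 128

# Narrow-band mel-spectrogram: Hann window with length equal to the FFT window.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH,
    n_mels=N_MELS,
    window_fn=torch.hann_window,
)

# SpecAugment-style masking: cut out random blocks of frequency and time
# (mask widths here are illustrative, not the values used in our experiments).
train_augment = torch.nn.Sequential(
    torchaudio.transforms.FrequencyMasking(freq_mask_param=15),
    torchaudio.transforms.TimeMasking(time_mask_param=35),
)

def featurize(wav_path: str, train: bool = True) -> torch.Tensor:
    """Load a speech sample and return a (channels, n_mels, time) mel-spectrogram."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != SAMPLE_RATE:  # e.g. the few 48 kHz Igbo telephone calls
        waveform = torchaudio.transforms.Resample(sr, SAMPLE_RATE)(waveform)
    spec = mel_transform(waveform)
    return train_augment(spec) if train else spec
```

At 16 kHz with a hop length of 512 samples, each spectrogram frame covers 32 ms, so a typical 4-second Fon utterance yields roughly 125 frames.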
5 Model Architectures and Experiments

5.1 Preliminaries

We take a moment to briefly define the task of ASR mathematically. We have a training set of n samples: χ = {(x(1), y(1)), (x(2), y(2)), ..., (x(n), y(n))}. Each utterance x(i), sampled from the training set, is a time series of length T(i), where every time-slice is a vector of audio features x(i)_t, t = 0, ..., T(i) − 1. The goal of ASR is to generate a transcript ŷ(i) for each utterance x(i). In order to achieve this, we use an architecture consisting of one or more recurrent neural networks (RNNs), since they are best equipped for time-series data.

In order to generate ŷ(i) for a given utterance x(i), the RNN models the probability of picking a character from the character set. Mathematically, at each output time-step t, the RNN makes a prediction over characters, p(c_t | x(i)), where c_t is either a character of the alphabet (including its diacritics) or the blank symbol. In Fon, for example, due to our inclusion of the diacritics, we have c_t ∈ {a, b, c, ..., z, à, á, ā, ă, è, é, ē, ĕ, ì, í, î, ï, ĭ, ó, ŏ, ò, ū, ŭ, ù, ú, the Fon letters ɔ and ɛ with their tonal variants, ɖ, full stop, apostrophe, comma, space, blank}. The apostrophe, white-space, comma, and full-stop characters were added to denote word boundaries. For Igbo, c_t ∈ {a, b, ..., z, full stop, apostrophe, comma, blank}, which is the same as the character set for English. This constraint arose because the speech data set for Igbo was already stripped of diacritics, as explained in section 4.2.

5.2 Model Architecture

Related work has shown that we can increase model capacity, in order to efficiently learn from large speech data sets, by adding more hidden layers rather than making each layer larger. Graves et al. (2013) explored increasing the number of consecutive bidirectional recurrent layers, and Amodei et al. (2015) proposed Deep Speech 2, which, among a number of optimization techniques, extensively applied batch normalization (Ioffe and Szegedy, 2015) to deep RNNs. Furthermore, Chorowski et al. (2014a) showed that the use of the Bahdanau (additive) attention mechanism (Bahdanau et al., 2016) could reduce the phoneme or word error rate of an ASR model. This is possible because the attention mechanism forces the decoder to make a monotonic alignment, and hence improves the predictions. Our model architecture, shown in Figure 1, draws inspiration from these research findings. While our model at its core is similar to Deep Speech 2, our key improvements are:

• the exploration of the combination of Bidirectional Long Short-Term Memory networks (BiLSTMs) and Bidirectional Gated Recurrent Units (BiGRUs) for low-resource ASR;
• the integration of the Bahdanau attention mechanism, whose effect we demonstrate on Fon.

Our model has two main neural network modules: N blocks of Residual Convolutional Neural Networks (rCNNs) (He et al., 2015, 2016) and M blocks each of BiLSTMs and BiGRUs.

[Figure 1: Architecture of the best model for Fon and Igbo, with an expansion of each component of the rCNN block and BiGRU block.]
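Before describing each block in detail, the sketch below shows one way the pipeline of Figure 1 could be assembled in PyTorch. It is a minimal illustration under simplifying assumptions: the class names, the single convolutional stem, the channel width and the dropout value are ours, and the per-block attention step of Section 5.2.1 is omitted here and described below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualCNN(nn.Module):
    """One rCNN block: two CNN layers with layer normalization, GeLU, dropout and a skip connection."""
    def __init__(self, channels: int, n_feats: int, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(n_feats)
        self.norm2 = nn.LayerNorm(n_feats)
        self.cnn1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.cnn2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                     # x: (batch, channels, n_feats, time)
        residual = x
        x = self.norm1(x.transpose(2, 3)).transpose(2, 3)     # layer norm over the mel-feature axis
        x = self.cnn1(self.dropout(F.gelu(x)))
        x = self.norm2(x.transpose(2, 3)).transpose(2, 3)
        x = self.cnn2(self.dropout(F.gelu(x)))
        return x + residual                                    # residual connection

class OkwuGbeASR(nn.Module):
    """N rCNN blocks -> fully connected flattening -> BiLSTM encoder -> BiGRU decoder -> character scores."""
    def __init__(self, n_chars: int, n_feats: int = 128, n_rcnn: int = 5, n_rnn: int = 3, hidden: int = 512):
        super().__init__()
        self.stem = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.rcnns = nn.Sequential(*[ResidualCNN(32, n_feats) for _ in range(n_rcnn)])
        self.flatten = nn.Linear(32 * n_feats, hidden)
        self.encoder = nn.LSTM(hidden, hidden, num_layers=n_rnn, bidirectional=True, batch_first=True)
        self.decoder = nn.GRU(2 * hidden, hidden, num_layers=n_rnn, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_chars)

    def forward(self, spec):                                   # spec: (batch, 1, n_feats, time)
        x = self.rcnns(self.stem(spec))                        # (batch, 32, n_feats, time)
        b, c, f, t = x.size()
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)         # (batch, time, features)
        x = self.flatten(x)
        x, _ = self.encoder(x)                                 # BiLSTM blocks (encoder)
        x, _ = self.decoder(x)                                 # BiGRU blocks (decoder; attention omitted here)
        return F.log_softmax(self.classifier(x), dim=-1)       # per-frame character log-probabilities
```

The per-frame log-probabilities over the character set (including the blank symbol) are what the CTC loss described below consumes.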
Each rCNN block is made of two CNN layers, two dropout layers and two normalization layers (Ba et al., 2016) for the CNN inputs. We leverage the power of convolutional neural networks (CNNs) to extract abstract features from the spectrograms of the speech samples. The RNNs process the abstract features produced by the rCNNs step by step, making a prediction for each frame while using context from previous frames. We use BiRNNs (Schuster and Paliwal, 1997) because we want the context of not only the frames before each timestep, but also the following ones; this helps the model make better predictions. In our scenario, the BiLSTMs and BiGRUs act as encoder and decoder blocks respectively. Each block produces outputs and hidden states that are fed to the next block. The last hidden state of the last BiGRU block is used to compute the attention weights and the context vector, which is concatenated to the BiGRU output to serve as the final output. Dropout layers are stacked between and at the output of each block to prevent overfitting (Srivastava et al., 2014).

The output of the model is a probability matrix over characters, which is fed into a greedy decoder. We implemented the greedy decoder suggested by Graves et al. (2006b) to extract what the model believes are the highest-probability characters that were spoken. This simple decoder, albeit without linguistic information, has been shown to produce useful transcriptions (Zenkel et al., 2017). Our model is trained to predict the probability distribution of every character of the alphabet at each timestep from the narrow-band mel-spectrogram we feed it. Traditional ASR models require aligning the transcribed text to the speech before training, and the model is trained to predict specific labels at specific timesteps. With the CTC loss function (Graves et al., 2006a), however, this step is skipped and the model learns to align the transcript itself during training.

5.2.1 Implementing the attention mechanism

Recall that we have 5 blocks of rCNNs and 3 blocks each of BiLSTMs and BiGRUs. We modified the Bahdanau attention slightly and implemented it as shown in Figure 1, in the following steps:

• An input x from the stacked rCNN blocks is fed to the BiLSTM (encoder) and first layer-normalized. The output is passed through a GeLU activation function, whose output is fed to the BiLSTM layers of the current block. Within the current encoder block, each BiLSTM layer produces an output, a hidden state and a cell state. The encoder output is finally passed through a dropout layer and used as input for the next encoder block. The output of the last encoder is used as input for the first BiGRU (decoder) block.

• The input of the decoder goes successively through a normalization layer and a GeLU activation function. The output is then fed to the BiGRU layers of the current decoder, which produce the decoder output and a hidden state. In standard NLP settings, or more specifically in neural machine translation (NMT), the hidden state is a 2-dimensional tensor. That is not the case here, since our initial input features from the stack of rCNN layers are 4-dimensional tensors of shape (batch, channel, feature, time), and the output from the stack of encoders is a 3-dimensional tensor of shape (batch_size, number of features, hidden size).
• Using the shapes of the hidden state h and the decoder output x, three dense layers are created, as in the following PyTorch-style pseudo-code:

    w1 = nn.Linear(x.size(2), x.size(2) // 2)
    w2 = nn.Linear(h.size(2), h.size(2))
    v = nn.Linear(x.size(2) // 2, x.size(2))

While adapting the NMT formulation to our setting, we encountered a few cases where the shapes of the decoder output and the hidden state did not match. To resolve this, we padded the hidden state tensor along the second axis (axis = 1) with the required number of zeros. This was done for two reasons: 1) the padding does not affect the original hidden state values, and 2) tanh(0) = 0, so zero acts as a neutral element. Once the decoder output and the hidden state tensor have compatible shapes, we use them to compute the attention scores s in the conventional way:

    m = nn.Tanh()
    s = v(m(w1(x) + w2(h)))

• The attention weights are then computed by passing s through a softmax operation. The context vector cv, obtained by multiplying the attention weights by x, is concatenated to the decoder output x to form the attention features:

    n = nn.Softmax(dim=-1)
    attention_weights = n(s)
    cv = attention_weights * x
    x = torch.cat((cv, x), dim=-1)  # concatenate along the feature axis

• The output of the current decoder block is the attention output passed through a dropout layer, and it is taken as input for the next decoder block.

5.3 Experiments

Throughout our experiments, we explored different model architectures with various numbers of convolutional and bidirectional recurrent layers. The best ASR model, shown in Figure 1, has 5 blocks of rCNNs and 3 blocks each of BiLSTMs and BiGRUs, with attention incorporated into each component of the BiGRU block. We also used a form of batch normalization throughout the model. We got the best evaluation results with the following hyper-parameters:

• max learning_rate: 5e-4 (Fon), 3e-4 (Igbo)
• batch_size: 20 (Fon), 20 (Igbo)
• (N, M): (5, 3) for both Fon and Igbo
• embedding_size: 512
• epochs: 500 (Fon) and 1000 (Igbo), with early stopping after 100 epochs
• activation function: GeLU (Hendrycks and Gimpel, 2016)
• optimizer: AdamW (Loshchilov and Hutter, 2019) for Fon, Nesterov accelerated gradient descent (Nesterov, 1983) for Igbo

We used two metrics to evaluate the models: the Character Error Rate (CER) and the Word Error Rate (WER). The WER uses the Levenshtein distance (Levenshtein, 1966) to compare the reference text and the hypothesis text at the word level. Even with a low CER, the WER can be high; the lower the WER, the better the model. During training, we used the WER on the validation data set to select the best weights and parameters.

6 Results

We present our findings using 5 blocks of rCNNs with:

• 3 blocks solely of BiGRUs
• 3 blocks solely of BiLSTMs
• 3 blocks each of BiLSTMs and BiGRUs
• 3 blocks each of BiLSTMs and BiGRUs + attention mechanism

Table 6 presents the results of the different model architectures on the test data sets for Fon and Igbo.

Table 6: CER (%) and WER (%) of different models on the Fon and Igbo (cleaned) test data sets. Igbo values in parentheses are for the original, uncleaned data; the last row reports the WER of the best model with diacritics from Laleye et al. (2016).

Model | Fon CER | Fon WER | Igbo CER | Igbo WER
(rCNN) + BiGRUs | 22.0831 | 59.66 | |
(rCNN) + BiLSTMs | 24.2783 | 61.46 | |
(rCNN) + BiLSTMs + BiGRUs | 16.9581 | 47.05 | 56.00 | 64.00
(rCNN) + BiLSTMs + BiGRUs + Attn | 18.7976 | 42.50 | 50.12 (92.67) | 55.03 (97.99)
Laleye et al. (2016) | - | 44.09 | - | -

6.1 Results for Fon

Implementing the attention mechanism reduced the WER by almost 5 percentage points (from 47.05% to 42.50%). Our Fon ASR model outperformed the current Fon ASR model with diacritics of Laleye et al. (2016).
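The Fon transcriptions discussed next (Table 7) were produced with the greedy CTC decoder described in Section 5.2: at each frame we take the most probable character, collapse consecutive repeats, and remove the blank symbol. A minimal sketch of this decoding step is shown below; the helper name, the character-map format and the position of the blank index are illustrative assumptions, not the exact implementation in our released code.

```python
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, idx_to_char: dict, blank: int = 0) -> str:
    """Greedy CTC decoding: argmax per frame, collapse repeated characters, drop blanks.

    log_probs: (time, n_chars) per-frame character log-probabilities from the model.
    idx_to_char: mapping from character indices to characters (blank assumed at index `blank`).
    """
    best_path = torch.argmax(log_probs, dim=-1).tolist()   # most probable character index per frame
    decoded, previous = [], blank
    for index in best_path:
        if index != previous and index != blank:            # collapse repeats, skip the blank symbol
            decoded.append(idx_to_char[index])
        previous = index
    return "".join(decoded)
```

Applied to the per-utterance log-probabilities produced by the model, this yields character strings like those shown in Table 7.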
Table 7 shows some decoded predictions and targets from the Fon ASR model, which are very similar. Common mistakes happen most often at the character level, where a character is either omitted, added or replaced by another one. The native speakers included in this study testified that the mismatched words or characters are often practically indistinguishable in speech. The model source code is open-sourced at https://github.com/bonaventuredossou/fonasr.

Table 7: Decoded predictions and targets of the best Fon ASR model.

Decoded Predictions (Fon) | Decoded Targets (Fon)
tɔ ce xwe yɔyɔ din tɔn ɔ ci gblagadaa eo | tɔ ce xwe yɔyɔ din tɔn ɔ ci gblagadaa eo
mi sa aakpan nu mi | mi sa akpan nu mi
fitɛ a gosin xwe yi gbe e | fitɛ a go sin xwe yin gbe e
kpo kpɛɖé akwɛ cɛ gbadé jí ɖaximɛ | kpo kpɛɖe akwɛ jɛ gbadé ji ɖaximɛ

6.2 Results for Igbo

An important observation shown in Table 6 is the effect of the state of the audio samples on the model's ability to learn: on the large IgboDataset, with background noise, uneven audio lengths, a low sampling rate, etc., the model found it very difficult to learn the speech representations. Taking time to sieve through the data (Section 4.2) mitigated this issue by helping the model learn the abstract features better, albeit on a small training set. While the model is currently still training for more epochs (with the hope of improving), our preliminary results serve as a benchmark for ASR on Igbo. The source code for the model can be accessed at https://github.com/chrisemezue/IgboASR.

In Table 6, one may also observe the large difference between the CER and the WER for Fon, unlike for Igbo. We strongly believe that this is because the character set for Fon contains all the possible diacritics for each letter of the Fon alphabet, making it extremely large (compared to the set of Igbo characters, which carries no diacritical information). To further support this claim, a close observation of the targets and predictions in Table 7 reveals that the errors are mostly due to the omission or mismatch of diacritics on characters (e.g., the mismatch between 'e' and 'é' in row 4, or the space added between 'go' and 'sin' in row 3).

7 Future Work

Our work shows promising results considering the small training sizes, and we have presented a state-of-the-art ASR model for Fon. As future paths to improve the proposed models, we are exploring approaches such as leveraging language models, deeper model structures, transformers, and crowd-sourcing or compiling speech-to-text data sets for Igbo and Fon. For Igbo, the next stage involves incorporating diacritical information into the ASR model; we have begun by gathering a new speech data set which includes the diacritics.

8 Acknowledgements

We are grateful to Professor Graham Neubig of Carnegie Mellon University for coming to our aid by providing us with an Amazon EC2 instance for training our models when we were very low on computational resources. We also thank Dr Frejus Laleye for giving us access to the Fon data set, and Dr Iroro Orife for his guidance on designing the ASR model and cleaning the IgboDataset.

References

Jade Z. Abbott and Laura Martinus. 2018. Towards neural machine translation for African languages. CoRR, abs/1811.05467.

Ike Achebe, Clara Ikekeonwu, Cecilia Eme, Nolue Emenanjo, and Nganga Wanjiku. 2011. A composite synchronic alphabet of Igbo dialects (CSAID). IADP, New York.
Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. 2015. Deep Speech 2: End-to-end speech recognition in English and Mandarin.

Andrew Caines. 2019. The geographic diversity of NLP conferences.

Jimmy Ba, J. Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. ArXiv, abs/1607.06450.

Alexei Baevski, Michael Auli, and Abdelrahman Mohamed. 2020. Effectiveness of self-supervised pre-training for speech recognition.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2016. Neural machine translation by jointly learning to align and translate.

B. Boashash. 2003. Time-Frequency Signal Analysis and Processing: A Comprehensive Reference. Oxford: Elsevier Science.

R. N. Bracewell. 2000. The Fourier Transform and Its Applications. Boston: McGraw-Hill.

Hounkpati B. C. Capo. 1986. Renaissance du gbe, une langue de l'Afrique occidentale: étude critique sur les langues ajatado, l'ewe, le fon, le gen, l'aja, le gun, etc. Université du Bénin, Institut national des sciences de l'éducation.

Hounkpati B. C. Capo. 1991. A Comparative Phonology of Gbe. Foris Publications.

William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP.

Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani. 2018. State-of-the-art speech recognition with sequence-to-sequence models.

Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014a. End-to-end continuous speech recognition using attention-based recurrent NN: First results.

Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014b. End-to-end continuous speech recognition using attention-based recurrent NN: First results.

Bonaventure F. P. Dossou and Chris C. Emezue. 2020. FFR v1.1: Fon-French neural machine translation.

Alan S. Duthie and R. K. Vlaardingerbroek. 1981. Bibliography of Gbe (Ewe, Gen, Aja, Xwala, Fon, Gun, etc.): publications "on" and "in" the language. Basler Afrika Bibliographien.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig (eds.). 2020. Ethnologue: Languages of the World, twenty-third edition.

∀, Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Tajudeen Kolawole, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddee Hassan Muhammad, Salomon Kabongo, Salomey Osei, et al. 2020a. Participatory research for low-resourced machine translation: A case study in African languages. Findings of EMNLP.

∀, Iroro Orife, Julia Kreutzer, Blessing Sibanda, Daniel Whitenack, Kathleen Siminyu, Laura Martinus, Jamiil Toure Ali, Jade Abbott, Vukosi Marivate, Salomon Kabongo, Musie Meressa, Espoir Murhabazi, Orevaoghene Ahia, Elan van Biljon, Arshath Ramkilowan, Adewale Akinfaderin, Alp Öktem, Wole Akin, Ghollah Kioko, Kevin Degila, Herman Kamper, Bonaventure Dossou, Chris Emezue, Kelechi Ogueji, and Abdallah Bashir. 2020b. Masakhane – machine translation for Africa.
G.O. Obiamalu and D.U. Mbagwu. 2007. Code-switching: Insights from code-switched English/Igbo expressions. Awka Journal of Linguistics and Languages, 3:51–53.

A. Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006a. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 369–376, New York, NY, USA. Association for Computing Machinery.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006b. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 369–376, New York, NY, USA. Association for Computing Machinery.

Alexander Gutkin, Işın Demirşahin, Oddur Kjartansson, Clara Rivera, and Kọ́lá Túbọ̀sún. 2020. Developing an open-source corpus of Yoruba speech. In Proceedings of Interspeech 2020, pages 404–408, Shanghai, China. International Speech Communication Association (ISCA).

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. 2014. Deep Speech: Scaling up end-to-end speech recognition.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs).

T. Hori, R. Astudillo, T. Hayashi, Y. Zhang, S. Watanabe, and J. Le Roux. 2019. Cycle-consistency training for end-to-end speech recognition. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6271–6275.

Clara Ikekeonwu. 1999. Igbo. In Handbook of the International Phonetic Association.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020a. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020b. The state and fate of linguistic diversity and inclusion in the NLP world.

F. A. A. Laleye, L. Besacier, E. C. Ezin, and C. Motamed. 2016. First automatic Fongbe continuous speech recognition system: Development of acoustic models and language models. In 2016 Federated Conference on Computer Science and Information Systems (FedCSIS), pages 477–482.

Frejus Adissa Akintola Laleye. 2016. Contributions to the study of and to automatic speech recognition in Fongbe. PhD thesis, Université du Littoral Côte d'Opale.

Daniel van Niekerk, Charl van Heerden, Marelie Davel, Neil Kleynhans, Oddur Kjartansson, Martin Jansche, and Linne Ha. 2017. Rapid development of TTS corpora for four South African languages. In Proc. Interspeech 2017, pages 2178–2182, Stockholm, Sweden.
Adams Nikki, Bills Aric, Conners Thomas, David Anne, Dubinski Eyal, Fiscus Jonathan G., Gann Ketty, Harper Mary, Kaiser-Schatzlein Alice, Kazi Michael, Malyska Nicolas, Melot Jennifer, Onaka Akiko, Paget Shelley, Ray Jessica, Richardson Fred, Rytting Anton, and Sinney Shen. 2019. IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c LDC2019S16. Web download.

Linda Chinelo Nkamigbo. 2012. A phonetic analysis of Igbo tone. ISCA Archive, The Third International Symposium on Tonal Aspects of Languages.

G.I. Nwaozuzu. 2008. Dialects of the Igbo Language. University of Nigeria Press.

G. Obiamalu and Davidson U. Mbagwu. 2010. Motivations for code-switching among Igbo-English bilinguals: A linguistic and socio-psychological survey. OGIRISI: A New Journal of African Studies, 5:27–39.

Sunny Odinye and Gladys Udechukwu. 2016. Igbo and Chinese tonal systems: A comparative analysis. Ogirisi: A New Journal of African Studies, 1:48.

Claire Lefebvre and Anne-Marie Brousseau. 2002. A Grammar of Fongbe. Mouton de Gruyter.

Chinyere Ohiri-Aniche. 2007. Stemming the tide of centrifugal forces in Igbo orthography. Dialectical Anthropology, 31(4):423–436.

V. I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707.

Iroro Orife. 2018. Attentive sequence-to-sequence learning for diacritic restoration of Yorùbá language text.

Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. 2019. Neural speech synthesis with transformer network. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):6706–6713.

Douglas O'Shaughnessy. 1987. Speech Communication: Human and Machine. Journal of the Acoustical Society of America.

Alexander H. Liu, Tao Tu, Hung-yi Lee, and Lin-shan Lee. 2020. Towards unsupervised speech recognition and synthesis with quantized speech representation learning.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. Interspeech 2019.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization.

Orken Mamyrbayev, Keylan Alimhan, Bagashar Zhumazhanov, Tolganay Turdalykyzy, and Farida Gusmanova. 2020. End-to-end speech recognition in agglutinative languages. In Intelligent Information and Database Systems, pages 391–401, Cham. Springer International Publishing.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library.

Y. Nesterov. 1983. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2).

S. Poplack. 1979. "Sometimes I'll start a sentence in Spanish y termino en español": Toward a typology of code-switching.

Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Almost unsupervised text to speech and automatic speech recognition. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5410–5419. PMLR.

Keren Rice. 1992. Language, 68(1):149–156.

Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye Jia, Pedro Moreno, Yonghui Wu, and Zelin Wu. 2019. Speech recognition with augmented synthesized speech.
Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. In Proc. Interspeech 2019, pages 3465–3469.

Mike Schuster and Kuldip Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45:2673–2681.

Kathleen Siminyu, Sackey Freshia, Jade Abbott, and Vukosi Marivate. 2020. AI4D – African language dataset challenge.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Stanley Smith Stevens, John Volkmann, and Edwin B. Newman. 1937. A scale for the measurement of the psychological magnitude pitch. Journal of the Acoustical Society of America.

Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2017. Listening while speaking: Speech chain by deep learning.

Jin Xu, Xu Tan, Yi Ren, Tao Qin, Jian Li, Sheng Zhao, and Tie-Yan Liu. 2020. LRSpeech: Extremely low-resource speech synthesis and recognition.

Thomas Zenkel, Ramon Sanabria, Florian Metze, Jan Niehues, Matthias Sperber, Sebastian Stüker, and Alex Waibel. 2017. Comparison of decoding strategies for CTC acoustic models.