OkwuGbé: End-to-End Speech Recognition for Fon and Igbo
Bonaventure F. P. Dossou∗
Jacobs University Bremen
f.dossou@jacobs-university.de

Chris C. Emezue∗
Technical University of Munich
chris.emezue@tum.de

∗ These authors contributed equally to this work.
Abstract
Language is inherent and compulsory for human communication. Whether expressed in a
written or spoken way, it ensures understanding between people of the same and different
regions. With the growing awareness and effort to include more low-resourced languages
in NLP research, African languages have recently been a major subject of research in
machine translation, and other text-based areas of NLP. However, there is still very little
comparable research in speech recognition for
African languages. Interestingly, some of the
unique properties of African languages affecting NLP, like their diacritical and tonal complexities, have a major root in their speech,
suggesting that careful speech interpretation
could provide more intuition on how to deal
with the linguistic complexities of African languages for text-based NLP. OkwuGbé is a step
towards building speech recognition systems
for African low-resourced languages. Using
Fon and Igbo as our case study, we conduct a
comprehensive linguistic analysis of each language and describe the creation of end-to-end,
deep neural network-based speech recognition
models for both languages. We present a state-of-the-art ASR model for Fon, as well as benchmark ASR results for Igbo. Our linguistic analyses (for Fon and Igbo) provide valuable insights and guidance for the creation of speech recognition models for other African low-resourced languages, and can guide future NLP research on Fon and Igbo. The source code for the Fon and Igbo models has been made publicly available.
1 Introduction
OkwuGbé = Okwu (speech, from Igbo) + Gbé (languages, from Fon)
OkwuGbé, the union of two words from Igbo (Okwu) and Fon (Gbé), means "the speech of languages", and signifies studying and integrating
automatic speech recognition to several African
languages in an effort to unify them. In the past decade, African languages received very little to no attention in natural language processing (NLP) research (Joshi et al., 2020a; Caines, 2019), prompting recent efforts geared towards improving the state of African languages in NLP (∀ et al., 2020b; Abbott and Martinus, 2018; Siminyu et al., 2020; ∀ et al., 2020a). However, little work has been done on speech for these African languages, as more emphasis is placed on their text. Due to the largely acoustic nature of African languages (mostly tonal, diacritical, etc.), a careful speech analysis of African languages could provide better insight for text-based NLP involving them, as well as supplement the textual data needed for machine translation or language modelling. This is what inspired OkwuGbé and its focus on automatic speech recognition.
Automatic speech recognition (ASR, or speech-to-text) is a language technology whereby spoken words are identified, interpreted and converted to text. ASR is changing the way information is accessed, processed, and used. In recent years, ASR has achieved state-of-the-art performance for most Western and Asian languages, such as English, French, Chinese and Japanese, due to the availability of large quantities of high-quality speech resources. African languages, on the other hand, still lack ASR applications, mainly due to the scarcity or unavailability of speech resources for most African languages (ALs). In this paper, we introduce ASR systems for two low-resourced languages: Fon and Igbo. We show that using end-to-end deep neural networks (E2E DNN) with Connectionist Temporal Classification (CTC) (Graves et al., 2006a) allows us to achieve promising results without using language models (LMs), which usually require huge amounts of training data. We also demonstrate that leveraging the attention mechanism (Bahdanau et al., 2016) improves the performance of acoustic models.
In section 2, we give an overview of the Fon and Igbo languages. We then discuss related work in section 3, and examine the data and data-processing techniques we employed in section 4. In section 5, we describe the model architectures used for our experiments, and in section 6 we present our evaluation.
2 Overview of Fon and Igbo
In this section, we give an extensive overview of
both languages. Table 1 aims to summarise our
analysis for the reader.
2.1 Fon
Fon (also known as Fongbe) is a native language of the Benin Republic, spoken by more than 2.2 million people in Benin, Nigeria, and Togo (Eberhard et al., 2020). Fon belongs to the Gbe group of the Niger-Congo language family, and is a tonal, isolating and "left-behind" language according to Joshi et al. (2020b), with a basic Subject-Verb-Object (SVO) word order. There are currently about 53 different dialects of Fon spoken throughout Benin (Lefebvre and Brousseau, 2002; Capo, 1991; Eberhard et al., 2020).
Its alphabet is based on the Latin alphabet, with the addition of the letters ɔ, ɖ, ɛ, and the digraphs gb, hw, kp, ny, and xw. There are 10 vowel phonemes in Fon: 6 said to be closed [i, u, ĩ, ũ], and 4 said to be opened [ɛ, ɔ, a, ã]. There are 22 consonants (m, b, n, ɖ, p, t, d, c, j, k, g, kp, gb, f, v, s, z, x, h, xw, hw, w). Fon has two
phonemic tones: high and low. High is realized as rising (low–high) after a consonant. Basic disyllabic words have all four possibilities: high-high, high-low, low-high, and low-low. In longer
phonological words, like verb and noun phrases,
a high tone tends to persist until the final syllable. If that syllable has a phonemic low tone,
it becomes falling (high–low). Low tones disappear between high tones, but their effect remains
as a downstep. Rising tones (low–high) simplify
to high after high (without triggering downstep)
and to low before high (Lefebvre and Brousseau,
2002; Capo, 1991).
Fon makes extensive use of a rich system of tense and aspect markers, expresses many semantic features through lexical items, and its frequently used periphrastic constructions are of a more agglutinative nature (Capo, 1986). Fon nominals are generally preceded by a prefix consisting of a vowel (e.g., the word aɖú: 'tooth'). The quality of this vowel is restricted to the subset of non-nasal vowels (Capo, 1991; Duthie and Vlaardingerbroek, 1981).
Reduplication is a morphological process in
which the root or stem of a word, or part of it, is repeated. Fon, like the other Gbe languages, makes
extensive use of reduplication in the formation of
new words, especially in deriving nouns, adjectives, and adverbs from verbs. For instance, the
verb lã, which means to cut (both in Fon and Ewe), is nominalized by reduplication, yielding lãlã: the act of cutting. Triplication is used to intensify the
meaning of adjectives and adverbs (Capo, 1991;
Duthie and Vlaardingerbroek, 1981).
2.2 Igbo
Igbo is a native language of the Igbo people, an ethnic group mainly located in southeastern Nigeria, in states such as Abia, Anambra, Ebonyi, Enugu, and Imo, as well as in the northeast of Delta state and the southeast of Rivers state. Outside Nigeria, it is spoken to a small extent in Cameroon and Equatorial Guinea. Igbo belongs to
the Benue-Congo group of the Niger-Congo language family and is spoken by over 27 million
people (Eberhard et al., 2020). There are approximately 30 Igbo dialects, some of which are not
mutually intelligible. To illustrate the complexity
of Igbo, we quote Nwaozuzu (2008): "...almost
every community living as few as three kilometers
apart has its few linguistic peculiarities. If these
tiny peculiarities are isolated and considered to
be able to assign linguistic dependence to each of
these communities, we shall therefore be boasting
of not less than one thousand languages in what
we now know as the Igbo language."
This large number of dialects and peculiarities
inspired the development of a standardized spoken
and written Igbo in 1962, called the Standard Igbo
(Ohiri-Aniche, 2007) (which we will refer to when
we say "Igbo"). However, studies have shown that
there are many sounds (mainly consonants) found
in some other dialects of Igbo which are lacking
in the Standard Igbo orthography. For example,
Achebe et al. (2011) discovered about 50 unique
speech sounds in Igbo. Morphologically, Igbo is an agglutinating language with compounding word formation: e.g., ugbo (vehicle) + igwe (iron) = ugboigwe (locomotive). Igbo also uses reduplication, like Fon.
Characteristic | Fon | Igbo
Spoken where | Mostly in Benin; some parts of Nigeria and Togo | Mostly in southeastern Nigeria; a little in Equatorial Guinea and Cameroon
Speakers (Eberhard et al., 2020) | 2.2 million | 22 million
Language family tree | Niger-Congo → Atlantic-Congo → Volta-Congo → Kwa → Gbe → Fon | Niger-Congo → Atlantic-Congo → Volta-Congo → Volta-Niger → Igboid → Igbo
Language structure | Isolating language | Agglutinating language
Alphabet structure | 32 letters: 22 consonants, 10 vowels | 36 letters: 28 consonants, 8 vowels
Special letters besides Latin | ɔ, ɖ, ɛ, ã, gb, hw, kp, ny, and xw | ch, gb, gh, gw, kp, kw, nw, ny, and sh
Tonal? | Yes. 3 tones: high (/), low (\), and downstep (−) | Yes. 4 tones: high (/), low (\), downstep (−), and downdrift (−)
Phoneme structure | 10 vowel phonemes and 22 consonant phonemes; nasalization is present | 28 consonant phonemes and 8 vowel phonemes; nasalization is present
Number of dialects | about 53 | about 30
Reduplication? | Yes, especially in deriving nouns, adjectives, and adverbs from verbs | Yes, sometimes in compounding word formation: e.g., ugbo (vehicle) + igwe (iron) = ugboigwe (locomotive)
Code-switching? | No | Yes

Table 1: Summary analysis of Fon and Igbo
Igbo has 28 consonants and 8 vowels, totalling 36 letters in its alphabet.
The sound system of Igbo consists of eight
vowel phonemes, and 28 consonant phonemes
(Ikekeonwu, 1999). There are four types of tones in the Igbo language (Odinye and Udechukwu, 2016): the high tone (/), the low tone (\), the downstep (−) (Rice, 1992), and the downdrift (−). Downdrift is only observed in Igbo sentences, because the pitch can be raised or lowered before a sentence is completed. Tone is an integral part of a word in Igbo. It is the interface of phonology and syntax in Igbo because it performs both lexical and grammatical functions (Nkamigbo, 2012). Igbo has three syllable types: consonant + vowel (the most common syllable type), vowel, or syllabic nasal.
Code-switching, the act of “alternation of two
languages during speech” (Poplack, 1979), is very
common among Igbo-English bilingual speakers,
making it an interesting feature for speech recognition research. We therefore examine it more closely.
Obiamalu and Mbagwu (2007, 2010) conducted extensive research on code-switching among Igbo speakers, classifying it into three types: borrowing, quasi-borrowing and true code-switching (see Table 2). Borrowing in Igbo arises when English words are inserted into Igbo during speech and go through phonological and morphological transformation (mark -> maakigo, table -> tebulu). This usually happens because the speaker cannot quickly find the Igbo equivalent of the word, or such an equivalent does not exist; this is illustrated by examples 1 and 2. In quasi-borrowing, the Igbo equivalents of the English words exist, but the English words are more often used by both monolinguals and bilinguals; they may or may not be assimilated into Igbo, as in borrowing (examples 3 and 4). The third situation, true code-switching, occurs when the speaker purposely chooses to use the English word, even though the Igbo equivalent is known and always used. This is most common among Igbo-English bilinguals (examples 5 and 6).
Type: borrowing
Examples (Igbo | English): 1. Ọ maakigo (mark) ule ahụ. | He has marked the examination. 2. Ọ dị na tebulu (table). | It is on the table.
Explanation: The words 'mark' and 'table' have been borrowed and assimilated into Igbo because no Igbo equivalents are readily available.

Type: quasi-borrowing
Examples (Igbo | English): 3. Obi zụrụ car ọhụrụ. | Obi bought a new car. 4. Obi zụrụ ụgbọala ọhụrụ. | Obi bought a new car.
Explanation: Even though Igbo has a word for 'car', some bilinguals still use the English word.

Type: true code-switching
Examples (Igbo | English): 5. Fela na ecriticize onye ọbụla. | Fela criticizes everybody. 6. Jesus turnụrụ water ọ ghọrọ wine. | Jesus turned water into wine.
Explanation: These cases are true code-switching because the Igbo words for 'criticize', 'turn', 'water' and 'wine' are readily available in Igbo, but the speaker chooses the English equivalents.

Table 2: Code-switching types and examples. Adapted from Obiamalu and Mbagwu (2007)
3 Related Work

In this section, we review related work according to data resources, model architectures, and the state of ASR research for Fon and Igbo.
Previous works according to data resources: Xu et al. (2020) classified previous works on ASR, according to data resources, into rich-resource, low-resource and extremely low-resource settings, as shown in Table 3.
In the rich-resource setting, a large amount of paired speech and text data is available for training, amounting to hundreds of hours of speech from multiple speakers. Furthermore, a pronunciation lexicon is also leveraged during training for better results. These are the languages with ASR models already deployed in industry; English is a prime example. In the low-resource setting, there are only about a dozen minutes of single-speaker high-quality paired data, and a few hours of multi-speaker low-quality paired data. Compared to the rich-resource setting, these resources contain far less paired data.
In the extremely low-resource setting, where our work lies, there is little to no paired speech data, very low online presence, and sometimes no developed pronunciation lexicon or language model to improve ASR models. Some of these languages also have only small amounts of unpaired multi-speaker data. This is the case for many African languages.
Previous works according to model architecture: While traditional phonetic-based approaches (Hidden Markov Models) have produced considerable results in the past, we focus on end-to-end speech recognition with deep learning (Chorowski et al., 2014a,b; Hannun et al., 2014; Amodei et al., 2015; Chan et al., 2016; Chiu et al., 2018), which has been shown to produce better results with little dependence on hand-crafted features and phoneme dictionaries.
Chorowski et al. (2014b) introduced an end-to-end continuous ASR system using a bidirectional recurrent neural network (RNN) encoder with an RNN decoder that aligns the input and output sequences using the attention mechanism. The model achieved a word error rate (WER) of 18.57% on the TIMIT dataset. Hannun et al. (2014) and Amodei et al. (2015) presented state-of-the-art ASR systems using E2E DNNs. They introduced a system that does not use any hand-designed language component, nor even the concept of a "phoneme". Their result was achieved, as the authors stated in their original paper, through a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques and language models.
Following the promising features that E2E DNNs offer, Mamyrbayev et al. (2020) showed in recent studies that using them with CTC works without the need to directly include language models.
State of ASR resources for African languages: Open-sourced data is one of the driving forces of research in any deep learning field, including ASR, because it fosters experimenting with, training and developing better models. The Open Speech and Language Resources platform (OpenSLR)1 hosts open-sourced speech and language resources, such as training corpora for speech recognition, for public use. It currently covers many languages (both high- and low-resource). However, we discovered that, of the roughly 2,000 African languages, only Yoruba (Gutkin et al., 2020), Nigerian Pidgin and four South African

1 http://openslr.org/index.html
Setting | Rich-Resource | Low-Resource | Extremely Low-Resource
pronunciation lexicon | ✓ | ✓ | ✗
paired data (single-speaker, high-quality) | dozens of minutes | several minutes | ✗
paired data (multi-speaker, high-quality) | hundreds of hours | dozens of hours | several hours
unpaired speech (single-speaker, high-quality) | ✓ | dozens of hours | ✗
unpaired speech (multi-speaker, low-quality) | ✓ | ✓ | dozens of hours
unpaired text | ✓ | ✓ | very few
ASR related work | (Chorowski et al., 2014a; Chiu et al., 2018; Chan et al., 2016; Hannun et al., 2014; Li et al., 2019; Mamyrbayev et al., 2020; Chorowski et al., 2014b; Hori et al., 2019; Rosenberg et al., 2019; Schneider et al., 2019) | (Tjandra et al., 2017) | Our work; Laleye et al. (2016); Xu et al. (2020); Baevski et al. (2020); Liu et al. (2020); Ren et al. (2019)

Table 3: Data resources used to build ASR models and the corresponding related works in the different settings. Adapted from (Xu et al., 2020)
languages (Afrikaans, Sesotho, Setswana, isiXhosa) (van Niekerk et al., 2017) are present as of the time of writing. Furthermore, they contain very few samples (tens to a few hundred hours of audio), compared to their high-resource counterparts (thousands to millions of hours). This scarcity of open resources for the development of ASR for low-resourced African languages is one of the major factors behind the low state of ASR research in African languages. Although there may be some non-open resources for some of these languages, they come with huge licensing fees, among other limiting factors.
State of ASR research for Fon and Igbo: Fon, unlike Igbo, has little to no digital presence. With comparatively few speakers and almost no online presence, there has understandably been very little tonal-analysis or ASR research on this language. The few works that exist are mostly by researchers who are native speakers of the language.

To the best of our knowledge, the only notable effort to build an ASR system for Fon is that of Laleye (2016); Laleye et al. (2016), with a word error rate (WER) of 14.83%. This result was achieved by building two LMs, and only after normalizing and removing the diacritics, whose crucial importance for both performant ASR and neural machine translation (NMT) has been shown by Orife (2018) and Dossou and Emezue (2020). This will be discussed later in section 4.3.2. Their best model with diacritics scored a WER of 44.04%.
Igbo, on the other hand, has seen a lot of tonal and speech analysis research in the past decade, but, to the best of our knowledge, no public research on E2E DNN ASR. We opine that this is largely because 1) many older works on Igbo focused solely on tonal analysis (Odinye and Udechukwu, 2016; Nkamigbo, 2012), and 2) there is a lack of open-source speech data to encourage further research exploring ASR with deep learning methods, which are known to be data-hungry.
4 Speech-to-Text Corpora and Data Preprocessing

4.1 Fon Speech-to-Text Corpus

We obtained our speech dataset for Fon from the existing Fon speech corpus2, which was built through the tedious task of recording texts pronounced by 28 native speakers of Fongbe (8 women and 20 men) in a noiseless environment. The recordings are sampled at a frequency of 16 kHz. The 28 native speakers spoke around 1,500 phrases (daily-conversation domain). The recordings were made with the Lig-Aikuma3 Android application. The minimum length of a speech sample is 2 seconds and the maximum is 5 seconds, with an average content length of 4 seconds. Overall, around 10 hours of speech data were collected.
The global dataset has been split into training, validation and test sets. The training set contains 8 hours of speech (8,235 speech samples), the
validation set contains 1,500 speech samples, and the test set contains 669 speech samples. The text corpus of 1,500 sentences used to build the speech dataset was scraped from BéninLangues4.

2 https://paperswithcode.com/dataset/fongbe-speech-recognition
3 https://lig-aikuma.imag.fr/
4 https://beninlangues.com/fongbe
4.2 Igbo Speech-to-Text Corpus
It was very hard to find a dataset of Igbo audio samples and their transcripts. We realized that there is a great lacuna: even though there has been much research on Igbo phonology, there has not really been any public effort to gather a speech-to-text dataset for the language.
The dataset for our experiments on Igbo was obtained through a license from the Linguistic Data Consortium (LDC2019S16: IARPA Babel Igbo Language Pack) (Nikki et al., 2019). It contains approximately 207 hours of Igbo conversational and scripted telephone speech, collected in 2014 and 2015, along with corresponding transcripts. The dataset (hereafter called IgboDataset) is made up of telephone calls representing the Owerri, Onitsha, and Ngwa dialects spoken in Nigeria, sampled at 8 kHz, with a few calls sampled at 48 kHz. The gender distribution among speakers is approximately equal; speakers' ages range from 16 to 67 years. The telephone calls were made using different telephones (e.g., mobile, landline) from a variety of environments, including the street, a home or office, a public place, and inside a vehicle. The diacritics were originally removed from the transcripts.
Unlike the Fon dataset, which is modern and contains very clean audio, IgboDataset has old speech patterns and contains many noisy audio samples. We therefore had to implement a number of cleaning strategies (like filtering based on word length, upsampling, and exploring different numbers of mel-spectrogram units and Fast Fourier Transform (FFT) bins). The FFT is an algorithm that computes the discrete Fourier transform (DFT, described in section 4.3.1) of a sequence. Our cleaning strategies gave us a reduced dataset of 2.5 hours, which we split into train, dev and test sets of 4,000, 100 and 100 audio samples respectively. To test the importance of our pre-cleaning, we trained the model on both the large uncleaned dataset and our cleaned version (results are discussed in section 6).
4.3 Data Preprocessing
4.3.1 Speech Preprocessing
Speech signals are made up of amplitudes and frequencies. Amplitudes only convey the loudness of the recording, which is not very informative on its own. To extract more information from our speech samples, we decided to map them into the frequency domain. Two well-known techniques for converting speech data from the time domain to the frequency domain are the Fourier Transform (FT) and the Discrete Fourier Transform (DFT) (Boashash, 2003; Bracewell, 2000).

The FT is a mathematical operation that converts a continuous signal from the time domain to the frequency domain: it decomposes a continuous signal into its frequency components, giving the frequencies present in the signal and their respective magnitudes. The DFT, similarly, converts a sequence, considered as a discrete signal, into its frequency components.
However, applying only the FFT gives frequency values without any time information. To preserve frequency, time and amplitude information about the speech samples within a reasonable and adequate range, we decided to use mel-spectrograms.
The mel-scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another (Stevens et al., 1937). It is constructed such that sounds of equal distance from each other on the mel-scale also "sound" equally distant from one another to human listeners. A popular formula to convert a frequency f in hertz into m mels is the O'Shaughnessy formula (O'Shaughnessy, 1987), defined as

m = 2595 * log10(1 + f / 700).
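As a small worked example, the formula can be implemented directly (the function name below is our own):

import math

def hz_to_mel(f: float) -> float:
    # O'Shaughnessy's formula: convert a frequency in hertz to mels.
    return 2595.0 * math.log10(1.0 + f / 700.0)

# e.g. hz_to_mel(700.0) ≈ 781.2 mels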
A spectrogram5 is an image, a three-dimensional (3D) representation that displays how the energy of the frequency components of the speech changes over time. The abscissa represents time, the ordinate represents frequency, and amplitude is shown by the darkness of a given frequency at a particular time: low amplitudes are represented with a light-blue color, and very high amplitudes with dark red.

5 http://www.glottopedia.org/index.php/Spectrogram
There are two types of spectrograms: broad-band spectrograms and narrow-band spectrograms6.

• Broad-band spectrograms have higher temporal resolution, allowing the detection of changes in frequency over small intervals of time. However, they usually do not allow fine frequency distinctions, as the time interval for each spectrum is small.

• Narrow-band spectrograms have higher frequency resolution and a larger time interval for each spectrum than broad-band spectrograms, which allows the detection of very small differences in frequency. Moreover, they show individual harmonic structures, which are vibration frequency folds of the speech, as horizontal striations.

6 http://www.cas.usf.edu/~frisch/SPA3011_L07.html
A mel-spectrogram is hence a spectrogram with the mel-scale. In our study, we used narrow-band mel-spectrograms as input features for the model. We used 512 as the length of the FFT window, 512 as the hop length (the number of samples between successive frames), and a Hann window whose size is set to the length of the FFT window.
For handling the audio data, we used the torchaudio utility from PyTorch (Paszke et al., 2019). We used Spectrogram Augmentation (SpecAugment) (Park et al., 2019) as a form of data augmentation: we cut out random blocks of consecutive time and frequency dimensions. Mel-spectrograms were generated from each speech sample with the following fine-tuned hyper-parameters (a minimal code sketch follows the list):
• sample_rate: sample rate of audio signal, set
to 16000 hertz (16 kHz) for Fon and 8000
hertz (8 kHz) for Igbo.
• n_mels: number of mel filterbanks, set to 128
for Fon and 64 for Igbo.
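To make this feature pipeline concrete, below is a minimal torchaudio sketch of the mel-spectrogram and SpecAugment setup described above. The parameter values follow the text for Fon (16 kHz, 128 mel filterbanks, FFT window and hop length of 512, Hann window by default); the SpecAugment masking sizes and the file name are illustrative assumptions, not our exact values.

import torch
import torchaudio

# Mel-spectrogram features for Fon (use sample_rate=8000, n_mels=64 for Igbo).
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=512,        # length of the FFT window
    hop_length=512,   # number of samples between successive frames
    n_mels=128,       # number of mel filterbanks
)

# SpecAugment: cut out random blocks along the frequency and time axes.
# The mask sizes below are illustrative assumptions.
spec_augment = torch.nn.Sequential(
    torchaudio.transforms.FrequencyMasking(freq_mask_param=15),
    torchaudio.transforms.TimeMasking(time_mask_param=35),
)

waveform, sr = torchaudio.load("sample.wav")  # hypothetical audio file
features = mel_transform(waveform)            # shape: (channel, n_mels, time)
features = spec_augment(features)             # applied during training only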
4.3.2 Text Preprocessing

Scholars like Orife (2018) and Dossou and Emezue (2020) have shown in their studies that keeping the diacritics reduces lexical ambiguity and provides more morphological information to neural machine translation models. Additionally, diacritics convey the pronunciation tone and the sound generated, leading to an improved understanding of the sentences and their contexts.

Lang. | Source Text (preprocessed text same as the source) | English Translation
Fon | xovɛ sin mì tlala kɛnklɛn bo ná mì nùɖé | I'm super hungry, I'm starving. Please give me some food
Fon | a ná gbɔ nù ɖú bó gbɔ sin nù bó gbɔ xó ɖɔ káká yi azãn tɔn gbè | You will not be eating, neither drinking nor speaking for the next three days

Table 4: Fon sentences before and after preprocessing, with their English translations

(a) Fon: to (ambiguous and uncertain) -> tó (ear), tò (sea), tô (country)
(b) Igbo: akwa (ambiguous and uncertain) -> ákwà (cloth), ākwá (egg), ákwá (to cry)

Table 5: Examples of tonal inflection in Fon and Igbo
A diacritic is a glyph added to a letter or basic glyph. Diacritics appear above or below a letter, or in some other position such as within the letter or between two letters. Some diacritical marks, such as the acute (´) and grave (`), are often called accents. For Fon and Igbo, a good example of their importance can be seen in Table 5, where we demonstrate that removing diacritics from a word can lead to ambiguity and confusion about the word's meaning. Therefore, we preprocessed the textual data without removing the diacritics. The results of the text preprocessing on the Fon dataset are presented in Table 4. For Igbo, unfortunately, the IgboDataset was originally stripped of its diacritics, so we were not able to encode any diacritical information.
5 Model Architectures and Experiments
5.1 Preliminaries
We take a moment to briefly define the task of ASR mathematically. We have a training set of n samples: χ = {(x(1), y(1)), (x(2), y(2)), ..., (x(n), y(n))}. Each utterance x(i), sampled from the training set, is a time-series of length T(i), where every time-slice is a vector of audio features, x_t(i), t = 0, ..., T(i) − 1. The goal of ASR is to generate a transcript ŷ(i) for each utterance x(i). In order to achieve this,
we use an architecture consisting of one or more recurrent neural networks (RNNs), since they are best equipped for time-series data.

Figure 1: Architecture of the best model for Fon and Igbo, with an expansion of each component of the rCNN block and the BiGRU block.
In order to generate ŷ(i) for a given utterance x(i), the RNN models the probability of picking a character from the character set. Mathematically, at each output time-step t the RNN makes a prediction over characters, p(c_t | x(i)), where c_t is either a character of the alphabet (including its diacritized forms) or the blank symbol. In Fon, for example, due to our inclusion of the diacritics, we have c_t ∈ {a, b, c, ..., z, à, á, ā, ă, è, é, ē, ĕ, ì, í, î, ï, ĭ, ó, ŏ, ò, ū, ŭ, ù, ú, ɔ and its diacritized forms, ɖ, ɛ and its diacritized forms, full stop, apostrophe, comma, space, blank}. The apostrophe, white-space, comma and full-stop characters were added to denote word boundaries. For Igbo, c_t ∈ {a, b, ..., z, full stop, apostrophe, comma, blank}, which is the same as the character set for English. This constraint arose because the Igbo speech dataset had already been stripped of its diacritics, as explained in section 4.2.
5.2 Model Architecture
Related works have shown that we can increase model capacity, in order to learn efficiently from large speech datasets, by adding more hidden layers rather than making each layer larger. Graves et al. (2013) explored increasing the number of consecutive bidirectional recurrent layers, and Amodei et al. (2015) proposed Deep Speech 2, which, among a number of optimization techniques, extensively applies batch normalization (Ioffe and Szegedy, 2015) to deep RNNs.
Furthermore, Chorowski et al. (2014a) showed that the use of the Bahdanau (additive) attention mechanism (Bahdanau et al., 2016) could reduce the phoneme or word error rate (WER) of an ASR model. This is possible because the attention mechanism forces the decoder to make monotonic alignments and hence improves the predictions.
Our model architecture, shown in Figure 1, draws inspiration from these research findings. While our model at its core is similar to Deep Speech 2, our key improvements are:

• the exploration of the combination of Bidirectional Long Short-Term Memory networks (BiLSTMs) and Bidirectional Gated Recurrent Units (BiGRUs) for low-resource ASR;

• the integration of the Bahdanau attention mechanism, whose effect we demonstrate on Fon.
Our model has two main neural network modules: N blocks of Residual Convolutional Neural Networks (rCNNs) (He et al., 2015, 2016), and M blocks each of BiLSTMs and BiGRUs. Each rCNN block is made of two CNN layers, two dropout layers, and two normalization layers (Ba et al., 2016) for the CNN inputs. We leveraged the power of convolutional neural networks (CNNs) to extract abstract features from the spectrograms the speech is converted into. The RNNs process the abstract features produced by the rCNNs step by step, making a prediction for each frame while using context from previous frames. We use BiRNNs (Schuster and Paliwal, 1997) because we want the context not only of the frames before each timestep, but of the following frames as well. This helps the model make better predictions.
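To make the rCNN block concrete, the following is a minimal PyTorch sketch of one block as drawn in Figure 1 (layer norm, GeLU, dropout, CNN, applied twice, with a residual connection); the channel count, kernel size and dropout rate are illustrative assumptions rather than our exact settings.

import torch
import torch.nn as nn

class ResidualCNNBlock(nn.Module):
    # One rCNN block: (LayerNorm -> GeLU -> Dropout -> Conv2d) twice,
    # with a residual (skip) connection around the whole block.
    def __init__(self, channels: int = 32, kernel: int = 3, dropout: float = 0.1, n_feats: int = 128):
        super().__init__()
        self.norm1 = nn.LayerNorm(n_feats)
        self.norm2 = nn.LayerNorm(n_feats)
        self.cnn1 = nn.Conv2d(channels, channels, kernel, padding=kernel // 2)
        self.cnn2 = nn.Conv2d(channels, channels, kernel, padding=kernel // 2)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, channel, feature, time)
        residual = x
        # LayerNorm normalizes over the feature axis, so move it last.
        out = self.norm1(x.transpose(2, 3)).transpose(2, 3)
        out = self.cnn1(self.dropout1(nn.functional.gelu(out)))
        out = self.norm2(out.transpose(2, 3)).transpose(2, 3)
        out = self.cnn2(self.dropout2(nn.functional.gelu(out)))
        return out + residual  # residual connection (He et al., 2015)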
In our setting, the BiLSTMs and BiGRUs act respectively as encoder and decoder blocks. Each block successively produces outputs and hidden states that are fed to the next block. The last hidden state of the last BiGRU block is used to compute the attention weights and the context vector, which is concatenated with the BiGRU output to serve as the final output. Dropout layers are stacked between blocks and at each block's output to prevent overfitting (Srivastava et al., 2014).
The output of the model is a probability matrix over characters, which is fed into a greedy decoder. We implemented the greedy decoder suggested by Graves et al. (2006b) to extract what the model believes are the highest-probability characters spoken. This simple decoder, albeit without linguistic information, has been shown to produce useful transcriptions (Zenkel et al., 2017).
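A minimal sketch of such a greedy (best-path) CTC decoder is shown below; the character-set layout and blank index are assumptions for illustration.

import torch

def greedy_ctc_decode(log_probs: torch.Tensor, charset: list, blank: int = 0) -> str:
    # Take the argmax character at every timestep, collapse repeats,
    # then drop blank symbols.
    best_path = torch.argmax(log_probs, dim=-1)  # (time,)
    decoded, prev = [], blank
    for idx in best_path.tolist():
        if idx != prev and idx != blank:         # collapse repeats, skip blanks
            decoded.append(charset[idx])
        prev = idx
    return "".join(decoded)

# Hypothetical usage with model output of shape (time, num_chars):
# transcript = greedy_ctc_decode(model_log_probs, chars)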
Our model is trained to predict the probability distribution over every character of the alphabet at each timestep from the narrow-band mel-spectrogram we feed it. Traditional ASR models require aligning the transcribed text to the speech before training, and the model is trained to predict specific labels at specific timesteps. With the CTC loss function (Graves et al., 2006a), however, this alignment step is skipped and the model directly learns to align the transcript itself during training.
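In PyTorch this corresponds to training with the built-in nn.CTCLoss, as in the following sketch (all shapes are illustrative):

import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0)

# 50 frames, batch of 4, 40-character vocabulary (illustrative values).
log_probs = torch.randn(50, 4, 40, requires_grad=True).log_softmax(2)  # (time, batch, chars)
targets = torch.randint(1, 40, (4, 30), dtype=torch.long)              # character indices (no blanks)
input_lengths = torch.full((4,), 50, dtype=torch.long)                 # frames per utterance
target_lengths = torch.randint(10, 30, (4,), dtype=torch.long)         # transcript lengths

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # CTC marginalizes over alignments; no pre-aligned labels are needed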
5.2.1 Implementing the attention mechanism

Recall that we have 5 blocks of rCNNs and 3 blocks each of BiLSTMs and BiGRUs. The Bahdanau attention is slightly modified and implemented as illustrated in Figure 1, in the following steps:
• An input x from the stacked blocks of rCNNs is fed to the BiLSTM (encoder) and first layer-normalized. The output is then passed through a GeLU activation function, whose output is fed to the BiLSTM layers of the current block. Within the current encoder block, each BiLSTM layer produces an output, a hidden state and a cell state. The encoder output is finally passed through a dropout layer and used as input for the next encoder block. The output of the last encoder block is used as input for the first BiGRU (decoder) block.
• The input of the decoder goes successively through a normalization layer and a GeLU activation function. The output is then fed to the BiGRU layers of the current decoder block, which produce the decoder output and a hidden state. In common NLP settings, and more specifically in neural machine translation (NMT), the hidden state is a 2-dimensional tensor. That is not the case here, since our initial input features from the stack of rCNN layers are 4-dimensional tensors of shape (batch, channel, feature, time), and the output from the stack of encoders is a 3-dimensional tensor of shape (batch_size, number of features, hidden_size).
• Using the shapes of the hidden state h and the decoder output x, three dense layers are created using the following PyTorch pseudo-code:

w1 = nn.Linear(x.size(2), x.size(2) // 2)
w2 = nn.Linear(h.size(2), h.size(2))
v = nn.Linear(x.size(2) // 2, x.size(2))
While adapting this NMT concept to our setting, we encountered a few cases where the shapes of the decoder output and the hidden state mismatched. To resolve this, we padded the hidden state tensor along the second axis (axis = 1) with the required number of zeros. This was done for two reasons:

1. the concatenation does not affect the original hidden state tensor;
2. tanh(x | x = 0) = 0, so 0 acts as a neutral element.

Once the decoder output and the hidden state tensor have compatible shapes, we use them to compute the attention scores s in the conventional way:

m = nn.Tanh()
s = v(m(w1(x) + w2(h)))
• The attention weights are then computed by passing s through a softmax operation. The context vector cv, obtained by multiplying the attention weights by x, is concatenated with the decoder output x to get the attention features:

n = nn.Softmax(dim=1)
attention_weights = n(s)
cv = attention_weights * x
x = torch.cat((cv, x), dim=-1)
• The output of the current decoder block is the attention output passed through a dropout layer, which is taken as input for the next decoder block (a consolidated sketch of these steps follows this list).
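For concreteness, the steps above can be assembled into a single attention module. The following is a minimal sketch under assumed shapes — x of shape (batch, length, d) and the zero-padded hidden state h of shape (batch, length, d // 2), as implied by the layer sizes above; it is illustrative, not our exact training code.

import torch
import torch.nn as nn

class ModifiedBahdanauAttention(nn.Module):
    # Assumption: h has already been zero-padded so that its feature
    # size equals d // 2, making w1(x) + w2(h) shape-compatible.
    def __init__(self, d: int):
        super().__init__()
        self.w1 = nn.Linear(d, d // 2)
        self.w2 = nn.Linear(d // 2, d // 2)
        self.v = nn.Linear(d // 2, d)
        self.tanh = nn.Tanh()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        s = self.v(self.tanh(self.w1(x) + self.w2(h)))  # alignment scores
        weights = self.softmax(s)                       # attention weights
        cv = weights * x                                # context vector
        return torch.cat((cv, x), dim=-1)               # attention features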
5.3 Experiments

Throughout our experiments, we explored different model architectures with various numbers of convolutional and bidirectional recurrent layers. The best ASR model, shown in Figure 1, has 5 blocks of rCNNs and 3 blocks each of BiLSTMs and BiGRUs, with attention incorporated into each component of the BiGRU block. We also used a form of batch normalization throughout the model.
We got the best evaluation results with the following hyper-parameters:
• max learning_rate: 5e-4 (for Fon), 3e-4 (for
Igbo)
• batch_size: 20 (for Fon), 20 (for Igbo).
• (N, M): (5, 3) for Fon and Igbo.
• embedding_size: 512
• epochs: 500 (for Fon) and 1000 (for Igbo),
with early stopping after 100 epochs.
• activation_function: GeLU (Hendrycks and
Gimpel, 2016)
• optimizer: AdamW (Loshchilov and Hutter,
2019) (Fon), Nesterov accelerated descent
(Nesterov, 1983) (Igbo)
We used two metrics to evaluate the models: the Character Error Rate (CER) and the WER. WER uses the Levenshtein distance (Levenshtein, 1966) to compare the reference text and the hypothesis text at the word level. Even with a low CER, the WER can be high; hence, the lower the WER, the better the model. During training, we used the WER on the validation set to select the best weights and parameters.
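For reference, here is a minimal sketch of WER computed via the Levenshtein distance over words (our own illustrative implementation):

def wer(reference: str, hypothesis: str) -> float:
    # Word Error Rate: Levenshtein distance between the word sequences,
    # normalized by the number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g., on row 3 of Table 7:
# wer("fitɛ a go sin xwe yin gbe", "fitɛ a gosin xwe yi gbe")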
6 Results
Models | Fon CER | Fon WER | Igbo CER | Igbo WER
(rCNN) +BiGRUs | 22.0831 | 59.66 | |
+BiLSTMs | 24.2783 | 61.46 | |
+BiLSTMs+BiGRUs | 16.9581 | 47.05 | 56.00 | 64.00
+BiLSTMs+BiGRUs+Attn | 18.7976 | 42.50 | 50.12 (92.67) | 55.03 (97.99)
(Laleye et al., 2016)7 | - | 44.09 | - | -

Table 6: CER (%) and WER (%) of different models on the Fon and Igbo (original and cleaned) test datasets. For Igbo, values in parentheses were obtained without cleaning.
We present our findings using 5 blocks of rCNNs with:
• 3 blocks solely of BiGRUs
• 3 blocks solely of BiLSTMs
• 3 blocks each of BiLSTMs and BiGRUs
• 3 blocks each of BiLSTMs and BiGRUs + Attention mechanism.
Table 6 presents the results of the different model architectures on the test datasets for Fon and Igbo.
6.1 Results for Fon

We show that implementing the attention mechanism reduced the WER by about 5%. Our Fon ASR model outperforms the current best Fon ASR model with diacritics of Laleye et al. (2016).

Table 7 shows some decoded predictions and targets from the Fon ASR model, which are very similar. Common mistakes (colored) happen most often at the character level, where a character is either omitted, added, or replaced by another. The native speakers included in this study testified that those mismatched words or characters are often practically indistinguishable in speech.

7 WER of the best model with diacritics from (Laleye et al., 2016)
Fon Decoded Predictions | Fon Decoded Targets
tɔ ce xwe yɔyɔ din tɔn ɔ ci gblagadaa | tɔ ce xwe yɔyɔ din tɔn ɔ ci gblagadaa
eo mi sa aakpan nu mi | eo mi sa akpan nu mi
fitɛ a gosin xwe yi gbe | fitɛ a go sin xwe yin gbe
e kpo kpɛɖé | e kpo kpɛɖe
akwɛ cɛ gbadé jí ɖaximɛ | akwɛ jɛ gbadé ji ɖaximɛ

Table 7: Decoded predictions and targets of the best Fon ASR model
speaking. The model source code is open-sourced
at: https://github.com/bonaventuredossou/fonasr
6.2 Results for Igbo

An important observation from Table 6 is the effect of the state of the audio samples on the model's ability to learn: on the large, uncleaned IgboDataset, with background noise, uneven audio lengths, low sampling rate, etc., the model found it very difficult to learn the speech representations. Taking time to sieve through the data (Section 4.2) mitigated this issue by helping the model learn the abstract features better, albeit on a small training set. While the model is currently still training for more epochs (with the hope of improving), our preliminary results serve as a benchmark for ASR on Igbo.
The source code for the model can be accessed at
https://github.com/chrisemezue/IgboASR.
In Table 6, one may observe the large difference between the CER and WER for Fon, unlike Igbo. We strongly believe this is because the character set for Fon contains all the possible diacritics for each letter of the Fon alphabet, making it extremely large (compared to the Igbo character set, which has no diacritical information). To further support this claim, a close look at the targets and predictions in Table 7 reveals that the errors are mostly due to omission or mismatch of character diacritics ('é' predicted instead of 'e' in row 4, or the space between 'go' and 'sin' dropped in row 3).
7 Future Work

Our work shows promising results considering the small training sizes, and we have presented a state-of-the-art ASR model for Fon. As future pathways to improve the proposed models, we are exploring approaches such as leveraging language models, deeper model structures, transformers, and crowd-sourcing/compiling speech-to-text datasets for Igbo and Fon.

For Igbo, the next stage involves incorporating diacritical information into the ASR model. We have begun by gathering a new speech dataset which includes the diacritics.
8 Acknowledgements

We are grateful to Professor Graham Neubig of Carnegie Mellon University for coming to our aid by providing us with an Amazon EC2 instance for training our models when we were very low on computational resources. We also thank Dr Frejus Laleye for giving us access to the Fon dataset, and Dr Iroro Orife for his guidance on designing the ASR model and cleaning the IgboDataset.
References
Jade Z. Abbott and Laura Martinus. 2018. Towards
neural machine translation for african languages.
CoRR, abs/1811.05467.
Ike Achebe, Clara Ikekeonwu, Cecilia Eme, Nolue
Emenanjo, and Nganga Wanjiku. 2011. A composite synchronic alphabet of igbo dialects (csaid).
IADP, New York.
Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl
Case, Jared Casper, Bryan Catanzaro, Jingdong
Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun,
Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan
Raiman, Sanjeev Satheesh, David Seetapun, Shubho
Sengupta, Yi Wang, Zhiqian Wang, Chong Wang,
Bo Xiao, Dani Yogatama, Jun Zhan, and Zhenyao
Zhu. 2015. Deep speech 2: End-to-end speech
recognition in english and mandarin.
Andrew Caines. 2019. The Geographic Diversity of
NLP Conferences.
Jimmy Ba, J. Kiros, and Geoffrey E. Hinton. 2016.
Layer normalization. ArXiv, abs/1607.06450.
Alexei Baevski, Michael Auli, and Abdelrahman Mohamed. 2020. Effectiveness of self-supervised pre-training for speech recognition.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2016. Neural machine translation by jointly
learning to align and translate.
B. Boashash. 2003. Time-Frequency Signal Analysis
and Processing: A Comprehensive Reference. Oxford: Elsevier Science.
R. N Bracewell. 2000. The Fourier Transform and Its
Applications. Boston: McGraw-Hill.
Hounkpati B. C. Capo. 1986. Renaissance du gbe, une
langue de l’Afrique occidentale: étude critique sur
les langues ajatado, l’ewe, le fon, le gen, laja, le
gun, etc. Université du Bénin, Institut national des
sciences de l’éducation.
Hounkpati B. C. Capo. 1991. A comparative phonology of Gbe. Foris Publications.
William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol
Vinyals. 2016. Listen, attend and spell: A neural
network for large vocabulary conversational speech
recognition. In ICASSP.
Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani. 2018. State-of-the-art speech recognition with sequence-to-sequence models.
Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho,
and Yoshua Bengio. 2014a. End-to-end continuous
speech recognition using attention-based recurrent
nn: First results.
Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho,
and Yoshua Bengio. 2014b. End-to-end continuous
speech recognition using attention-based recurrent
nn: First results.
Bonaventure F. P. Dossou and Chris C. Emezue. 2020.
Ffr v1.1: Fon-french neural machine translation.
Alan S. Duthie and R. K. Vlaardingerbroek. 1981. Bibliography of GBE: (Ewe, Gen, Aja, Xwala, Fon,
Gun, etc.): publications "on" and "in" the language.
Basler Afrika Bibliographien.
David M. Eberhard, Gary F. Simons, and Charles
D. Fennig (eds.). 2020. Ethnologue: Languages of
the world. twenty-third edition.
∀, Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa
Matsila, Timi Fasubaa, Tajudeen Kolawole, Taiwo
Fagbohungbe, Solomon Oluwole Akinola, Shamsuddee Hassan Muhammad, Salomon Kabongo, Salomey Osei, et al. 2020a. Participatory research for
low-resourced machine translation: A case study in
african languages. Findings of EMNLP.
∀, Iroro Orife, Julia Kreutzer, Blessing Sibanda, Daniel
Whitenack, Kathleen Siminyu, Laura Martinus,
Jamiil Toure Ali, Jade Abbott, Vukosi Marivate, Salomon Kabongo, Musie Meressa, Espoir Murhabazi,
Orevaoghene Ahia, Elan van Biljon, Arshath Ramkilowan, Adewale Akinfaderin, Alp Öktem, Wole
Akin, Ghollah Kioko, Kevin Degila, Herman Kamper, Bonaventure Dossou, Chris Emezue, Kelechi
Ogueji, and Abdallah Bashir. 2020b. Masakhane –
machine translation for africa.
G.O. Obiamalu and D.U. Mbagwu. 2007. Code-switching: Insights from code-switched English/Igbo expressions. Awka Journal of Linguistics and Languages, 3:51–53.
A. Graves, Abdel rahman Mohamed, and Geoffrey E.
Hinton. 2013. Speech recognition with deep recurrent neural networks. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing,
pages 6645–6649.
Alex Graves, Santiago Fernández, Faustino Gomez,
and Jürgen Schmidhuber. 2006a. Connectionist
temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on
Machine Learning, ICML ’06, page 369–376, New
York, NY, USA. Association for Computing Machinery.
Alex Graves, Santiago Fernández, Faustino Gomez,
and Jürgen Schmidhuber. 2006b. Connectionist
temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on
Machine Learning, ICML ’06, page 369–376, New
York, NY, USA. Association for Computing Machinery.
Alexander Gutkin, Işın Demirşahin, Oddur Kjartansson, Clara Rivera, and Kọ́lá Túbọ̀sún. 2020. Developing an open-source corpus of Yoruba speech. In Proceedings of Interspeech 2020, pages 404–408, Shanghai, China. International Speech Communication Association (ISCA).
Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and
Andrew Y. Ng. 2014. Deep speech: Scaling up end-to-end speech recognition.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun. 2015. Deep residual learning for image recognition.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun. 2016. Identity mappings in deep residual networks.
Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus).
T. Hori, R. Astudillo, T. Hayashi, Y. Zhang, S. Watanabe, and J. Le Roux. 2019. Cycle-consistency training for end-to-end speech recognition. In ICASSP
2019 - 2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP),
pages 6271–6275.
Clara Ikekeonwu. 1999. Igbo. In Handbook of the International Phonetic Association.
Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by
reducing internal covariate shift.
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika
Bali, and Monojit Choudhury. 2020a. The state and
fate of linguistic diversity and inclusion in the NLP
world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics,
pages 6282–6293, Online. Association for Computational Linguistics.
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika
Bali, and Monojit Choudhury. 2020b. The state and
fate of linguistic diversity and inclusion in the nlp
world.
F. A. A. Laleye, L. Besacier, E. C. Ezin, and C. Motamed. 2016. First automatic fongbe continuous
speech recognition system: Development of acoustic models and language models. In 2016 Federated
Conference on Computer Science and Information
Systems (FedCSIS), pages 477–482.
Frejus Adissa Akintola Laleye. 2016. Contributions to the study and automatic recognition of speech in Fongbe. PhD thesis, Université du Littoral Côte d'Opale.
Daniel van Niekerk, Charl van Heerden, Marelie Davel,
Neil Kleynhans, Oddur Kjartansson, Martin Jansche,
and Linne Ha. 2017. Rapid development of TTS corpora for four South African languages. In Proc. Interspeech 2017, pages 2178–2182, Stockholm, Sweden.
Adams Nikki, Bills Aric, Conners Thomas, David
Anne, Dubinski Eyal, Fiscus Jonathan G., Gann
Ketty, Harper Mary, Kaiser-Schatzlein Alice, Kazi
Michael, Malyska Nicolas, Melot Jennifer, Onaka
Akiko, Paget Shelley, Ray Jessica, Richardson Fred,
Rytting Anton, and Sinney Shen. 2019. IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c LDC2019S16. Web download.
Linda Chinelo Nkamigbo. 2012. A phonetic analysis
of igbo tone. ISCA Archive, The Third International
Symposium on Tonal Aspects of Languages.
G.I. Nwaozuzu. 2008. Dialects of the Igbo Language.
University of Nigeria Press.
G. Obiamalu and Davidson U. Mbagwu. 2010. Motivations for code-switching among igboenglish bilinguals: A linguistic and sociopsychological survey.
OGIRISI: a New Journal of African Studies, 5:27–
39.
Sunny Odinye and Gladys Udechukwu. 2016. Igbo
and chinese tonal systems: a comparative analysis.
Ogirisi: A new Journal of African Studies, Volume
1:48.
Claire Lefebvre and Anne-Marie Brousseau. 2002. A
grammar of Fongbe. Mouton de Gruyter.
Chinyere Ohiri-Aniche. 2007. Stemming the tide of
centrifugal forces in igbo orthography. Dialectical
Anthropology, 31(4):423–436.
V. I. Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet
Physics Doklady, 10:707.
Iroro Orife. 2018. Attentive sequence-to-sequence
learning for diacritic restoration of yorùbá language
text.
Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and
Ming Liu. 2019. Neural speech synthesis with transformer network. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):6706–6713.
Douglas O’Shaughnessy. 1987. Speech communication: human and machine. Journal of the Acoustical
Society of America.
Alexander H. Liu, Tao Tu, Hung yi Lee, and Lin shan
Lee. 2020. Towards unsupervised speech recognition and synthesis with quantized speech representation learning.
Daniel S. Park, William Chan, Yu Zhang, ChungCheng Chiu, Barret Zoph, Ekin D. Cubuk, and
Quoc V. Le. 2019. Specaugment: A simple data augmentation method for automatic speech recognition.
Interspeech 2019.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled
weight decay regularization.
Orken Mamyrbayev, Keylan Alimhan, Bagashar Zhumazhanov, Tolganay Turdalykyzy, and Farida Gusmanova. 2020. End-to-end speech recognition in
agglutinative languages. In Intelligent Information and Database Systems, pages 391–401, Cham.
Springer International Publishing.
Adam Paszke, Sam Gross, Francisco Massa, Adam
Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca
Antiga, Alban Desmaison, Andreas Köpf, Edward
Yang, Zach DeVito, Martin Raison, Alykhan Tejani,
Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An
imperative style, high-performance deep learning library.
Y. Nesterov. 1983. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²).
S. Poplack. 1979. “Sometimes I'll start a sentence in Spanish y termino en español”: Toward a typology of code-switching.
Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao,
and Tie-Yan Liu. 2019. Almost unsupervised text
to speech and automatic speech recognition. In Proceedings of the 36th International Conference on
Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5410–5419. PMLR.
Keren Rice. 1992. Language, 68(1):149–156.
Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran,
Ye Jia, Pedro Moreno, Yonghui Wu, and Zelin Wu.
2019. Speech recognition with augmented synthesized speech.
Steffen Schneider, Alexei Baevski, Ronan Collobert,
and Michael Auli. 2019. wav2vec: Unsupervised
Pre-Training for Speech Recognition. In Proc. Interspeech 2019, pages 3465–3469.
Mike Schuster and Kuldip Paliwal. 1997. Bidirectional
recurrent neural networks. Signal Processing, IEEE
Transactions on, 45:2673 – 2681.
Kathleen Siminyu, Sackey Freshia, Jade Abbott, and
Vukosi Marivate. 2020. Ai4d – african language
dataset challenge.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky,
Ilya Sutskever, and Ruslan Salakhutdinov. 2014.
Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res.,
15(1):1929–1958.
Stanley Smith Stevens, John Volkmann, and Edwin B
Newman. 1937. A scale for the measurement of
the psychological magnitude pitch. Journal of the
Acoustical Society of America.
Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura.
2017. Listening while speaking: Speech chain by
deep learning.
Jin Xu, Xu Tan, Yi Ren, Tao Qin, Jian Li, Sheng Zhao, and Tie-Yan Liu. 2020. LRSpeech: Extremely low-resource speech synthesis and recognition.
Thomas Zenkel, Ramon Sanabria, Florian Metze, Jan
Niehues, Matthias Sperber, Sebastian Stüker, and
Alex Waibel. 2017. Comparison of decoding strategies for ctc acoustic models.