License: CC BY 4.0
arXiv:2401.01053v3 [cs.CL] 10 Jan 2024

Cheetah: Natural Language Generation for 517 African Languages

Ife Adebara$^{1,\star}$   AbdelRahim Elmadany$^{1,\star}$   Muhammad Abdul-Mageed$^{1,2}$
$^{1}$Deep Learning & Natural Language Processing Group, The University of British Columbia
$^{2}$Department of Natural Language Processing & Department of Machine Learning, MBZUAI
{ife.adebara@,a.elmadany@,muhammad.mageed@}ubc.ca
Abstract

Low-resource African languages pose unique challenges for natural language processing (NLP) tasks, including natural language generation (NLG). In this paper, we develop Cheetah, a massively multilingual NLG language model for African languages. Cheetah supports 517 African languages and language varieties, allowing us to address the scarcity of NLG resources and provide a solution to foster linguistic diversity. We demonstrate the effectiveness of Cheetah through comprehensive evaluations across six generation downstream tasks. In five of the six tasks, Cheetah significantly outperforms other models, showcasing its remarkable performance for generating coherent and contextually appropriate text in a wide range of African languages. We additionally conduct a detailed human evaluation to delve deeper into the linguistic capabilities of Cheetah. The introduction of Cheetah has far-reaching benefits for linguistic diversity. By leveraging pretrained models and adapting them to specific languages, our approach facilitates the development of practical NLG applications for African communities. The findings of this study contribute to advancing NLP research in low-resource settings, enabling greater accessibility and inclusion for African languages in a rapidly expanding digital landscape. We will publicly release our models for research (https://github.com/UBC-NLP/Cheetah).

$^{\star}$Authors contributed equally.

1 Introduction

The linguistic diversity present in African languages poses unique challenges for NLG systems. With over 2,000 languages spoken across the African continent Eberhard et al. (2021), the need for effective NLG solutions that can accommodate this rich linguistic ecosystem cannot be over-emphasized. This is especially important because traditional NLG approaches have primarily focused on high-resource languages, such as English and French, due to the availability of large-scale datasets and resources. Consequently, low-resource languages, including numerous African languages, have been marginalized in NLG research and development. Developing robust NLG systems for the diverse needs of African communities is challenging due to the scarcity of extensive language datasets, limited linguistic research, and variations across these languages.

Figure 1: Cheetah is trained on 517 African languages and language varieties across 14 language families. The languages are domiciled in 50 of 54 African countries and are written in six different scripts.

To address these challenges, recent advancements in language modeling and transfer learning techniques have shown promise in supporting NLG in low-resource languages. Pretrained language models, such as GPT-3 Radford et al. (2018, 2019); Brown et al. (2020), mT5 Xue et al. (2021), and mT0 Muennighoff et al. (2022), have demonstrated remarkable capabilities in understanding and generating human-like text. These models capture the statistical regularities and syntactic structures of the languages they are trained on, making them valuable starting points for supporting NLG in low-resource settings.

In this paper, we present a pioneering work on NLG in African languages by introducing Cheetah: a novel language model (LM) specifically designed to support 517 African languages and language varieties. To the best of our knowledge, Cheetah supports the largest number of African languages and language varieties. Leveraging a vast corpus of text data collected from diverse sources, Cheetah learns some of the intricate linguistic information that characterizes each African language. The contributions of this research are threefold. First, we address the scarcity of NLG resources for African languages by providing a comprehensive language model that covers a wide range of languages spoken on the continent. Second, we demonstrate the efficacy of our approach through extensive evaluations across six downstream task clusters. Each cluster includes multiple languages, showcasing the model’s ability to generate coherent and contextually appropriate text in different African languages. Third, we perform a fine-grained human analysis of Cheetah using a controlled machine translation (MT) test set. This uncovers model behaviour that is not visible with automatic metrics. By supporting NLG in African languages, we foster linguistic diversity, empower African communities to express themselves in their native languages, and bridge the digital divide. This paper serves as a foundational step towards promoting Afrocentric NLP Adebara and Abdul-Mageed (2022) that prioritizes inclusivity and cultural preservation in language technology, emphasizing the importance of catering to the unique linguistic needs of diverse populations.

The rest of the paper is organized as follows: In Section 2, we discuss related work. We provide details of Cheetah in Section 3. In Section 4, we describe AfroNLG, the benchmark we create for evaluation. We present the performance of Cheetah in Section 5 and compare it to other multilingual models. We present controlled test sets in Section 5.1. We conclude in Section 6, and outline a number of limitations and use cases for our work in Section 7 and Section 8.

2 Literature Review

One of the challenges in NLG is to generate coherent and semantically meaningful text. Various approaches have been proposed, including template-based Becker (2002); Van Deemter et al. (2005), rule-based Dušek and Jurčíček (2015); van Miltenburg et al. (2020), and statistical approaches Li et al. (2016). More recently, deep learning approaches Sutskever et al. (2014) including the transformer model Vaswani et al. (2017) have achieved SoTA results in various NLG tasks such as text summarization Shi et al. (2021) and machine translation Vaswani et al. (2017).

While these models have shown impressive results, they often require a large amount of training data and computing resources. However, only a few African languages have benefited from these advancements due to inadequate data. To address this issue, researchers have proposed transfer learning-based approaches, where a pretrained model is finetuned for a specific NLG task. Transfer learning Raffel et al. (2020); He et al. (2022); Ruder et al. (2019) has enabled the use of low-resource languages on various NLP tasks. Due to the lack of adequate (or good quality) pretraining data Kreutzer et al. (2021), transfer learning is often the most accessible method for only a few low-resource languages, leaving behind a vast majority of extremely low-resource languages. This is because about 90% of the world’s languages are claimed to be either left-behinds, in that it is probably impossible to build NLP resources for them, or scraping-bys with no labelled datasets Joshi et al. (2020). For the left-behinds, labelled and unlabelled data are unavailable and even transfer learning approaches are beyond reach, while the scraping-by languages have no labelled data with which to evaluate model performance.

2.1 Language Models

Only a few African languages have benefited from the recent advancement of language models (LM) due to inadequate data sizes. We now describe encoder-decoder LMs that support NLP tasks in African languages. We describe these under two broad headings: massively multilingual models and African models. We summarize the models and African languages they cover in Table 1.

Multilingual Models. Massively multilingual models such as mBART (Liu et al., 2020), mT0 Muennighoff et al. (2022), and mT5 (Xue et al., 2021) are trained on several languages. However, in most cases, only a few African languages are represented. Among the mentioned models, mT0 is pretrained on the highest number of African languages ($n=13$).

African Models. Adelani et al. (2022) use pretrained T5, mT5, and mBART models and develop AfriByT5, AfriMT5, and AfriMBART, respectively, to investigate machine translation in zero-shot and out-of-domain settings. The authors experiment on 17 African languages and demonstrate that further pretraining is effective for adding new languages to pretrained models. Jude Ogundepo et al. (2022) train AfriTeVa, an encoder-decoder language model, from scratch on 10 African languages and English using training objectives similar to those of the T5 model.

| Category | LM | Lang/Total | African Languages | Families |
|---|---|---|---|---|
| Multilingual | mBART | 3/50 | afr, swh, yor | 2 |
| | mT0 | 14/101 | afr, amh, hau, ibo, lin, mlg, nyj, orm, sot, sna, som, swh, xho, yor, and zul | 4 |
| | mT5 | 12/101 | afr, amh, nya, hau, ibo, mlg, sna, som, swh, xho, yor, and zul | 3 |
| African | AfriTeVa | 10/10 | gaz, amh, Gahuza, hau, ibo, pcm, som, swa, tir, and yor | 3 |
| | AfriMT5 | 17/17 | bam, bbj, ewe, fon, hau, ibo, lug, luo, pcm, mos, swa, tsn, twi, wol, yor, zul | 3 |
| | AfriByT5 | 17/17 | bam, bbj, ewe, fon, hau, ibo, lug, luo, pcm, mos, swa, tsn, twi, wol, yor, zul | 3 |
| | AfriMBART | 17/17 | afr, amh, nya, hau, orm, som, swh, xho | 3 |
| | Cheetah | 517/517 | Includes 517 African languages | 14 |

Table 1: Comparison with available encoder-decoder models in which African languages are represented. Lang/Total: the number of African languages relative to the total number of languages covered by the pretrained language model. Families: the number of language families covered.

African Natural Language Understanding. Several works attempt to improve the performance on African NLU tasks by proposing multilingual and African-dedicated models such as mBERT Devlin et al. (2019), XLM-R Conneau et al. (2020), AfriBERTa Ogueji et al. (2021), AfroLM Dossou et al. (2022), Afro-XLM-R Alabi et al. (2022), KinyaBERT Nzeyimana and Niyongabo Rubungo (2022), and SERENGETI Adebara et al. (2023).

| Category | Benchmark | Reference | Task | Lang/Total | Datasets | Tasks |
|---|---|---|---|---|---|---|
| Multilingual | FLoRES200 | Costa-jussà et al. (2022) | MT | 52/200 | Wiki | 1 |
| | GEMv1 | Gehrmann et al. (2021) | DRG, DT, RES, TS, SMP | 10/52 | 18 | 13 |
| | GEMv2 | Gehrmann et al. (2021) | DRG, DT, PPH, QA, RES, TS, SLG, SMP, TS | 10/52 | 50 | 9 |
| | IndicNLG | Kumar et al. (2022) | BG, HG, SUM, PARA, QA | 0/11 | 5 | 5 |
| | IndoNLG | Cahyawijaya et al. (2021) | SUM, QA, Chit-Chat | 0/3 | 5 | 3 |
| | NLLB M.D. | Costa-jussà et al. (2022) | MT | 2/8 | Wiki | 1 |
| | NLLB S.D. | Costa-jussà et al. (2022) | MT | 2/8 | Wiki | 1 |
| | Toxicity200 | Costa-jussà et al. (2022) | MT | 50/200 | Wiki | 1 |
| | XGLUE | Liang et al. (2020) | NER, POS, MLQA, PAWS-X, XNLI, NC, QADSM, WPR, QAM, QG, NTG | 1/19 | 19 | 11 |
| African | AfroMT | Reid et al. (2021a) | MT | 8/8 | 5 | 1 |
| | Menyo-20k | Adelani et al. (2021) | MT | 1/2 | 6 | 1 |
| | AfroNLG | Our Work | Cloze, CS, MT, QA, TG, SUM, PARA | 517/527 | 67 | 7 |

Table 2: A comparison of AfroNLG with other multilingual benchmarks. MT: Machine Translation, QA: Question Answering, CS: Code-Switching, TG: Title Generation, SUM: Summarization, PARA: Paraphrase, NER: Named Entity Recognition, POS: Part-Of-Speech Tagging, MLQA: Multilingual Question Answering, PAWS-X: Parallel Aggregated Word Scrambling for Cross-Lingual Understanding, XNLI: Cross-Lingual Natural Language Inference, NC: News Classification, QADSM: Query-AD Matching, WPR: Web Page Ranking, QAM: QA Matching, NTG: News Title Generation, BG: WikiBio Biography Generation, HG: Headline Generation, S.D.: Seed Data, M.D.: Multi Domain, DRG: Dialogue Response Generator, DT: Data-to-Text, RES: Reasoning, TS: Text Summarization, SMP: Text Simplification, PPH: Paraphrase, SLG: Slide Generation.

2.2 Benchmarks

Multiple benchmarks have been developed for NLG. However, only a few of Africa’s 2,000 languages have been supported to date. In most cases, the benchmarks support only the machine translation task. We provide a brief overview under two headings: African and multilingual. We summarize key information about each benchmark in Table 2.

African Benchmarks. AfroMT Reid et al. (2021a) is a multilingual machine translation benchmark. It consists of translation tasks between English and eight African languages — Afrikaans, Xhosa, Zulu, Rundi, Sesotho, Swahili, Bemba, and Lingala. Menyo-20k Adelani et al. (2021) is an MT evaluation benchmark for English-Yorùbá.

Multilingual with African Languages. FLoRES-200 Costa-jussà et al. (2022); Guzmán et al. (2019) is an evaluation benchmark that provides MT evaluation support in 200 languages, including 52 African languages. GEM Gehrmann et al. (2021, 2022), referred to as a “living” benchmark, comprises 40 tasks and supports 52 languages, including 10 African languages. NLLB Seed Data Costa-jussà et al. (2022) is a set of professionally translated sentences sampled from Wikipedia. It consists of around six thousand sentences in 39 languages, which include 8 African languages. Similarly, NLLB Multi Domain Costa-jussà et al. (2022) is an MT evaluation benchmark made from a set of professionally translated sentences in the news and health domains. It consists of approximately 3,000 sentences in each domain and supports 8 languages, including 2 African languages. Toxicity-200 Costa-jussà et al. (2022) is a benchmark for evaluating the presence of toxic items in MT output. It provides support for 50 African languages. XGLUE Liang et al. (2020) is a cross-lingual, multi-task benchmark created with multilingual and bilingual corpora. It supports 19 languages, of which one, Swahili, is African.

3 Cheetah

3.1 Pretraining Data

We are guided by three main principles in developing this data: quality, linguistic diversity, and coverage.

Quality. Developing NLP technologies for low-resource languages poses a significant challenge due to the limited availability of high-quality training data. To address this issue, we manually curated a diverse corpus spanning multiple domains, including news articles, health documents, religious texts, legal documents, and social media feeds. This manual curation was necessary because no existing datasets were available for the majority of the languages we aimed to support, and we wanted to ensure the use of reliable, high-quality data.

Coverage. In all, we train Cheetah using a 42G multi-domain corpus across 517 African languages and language varieties. The languages are spoken in 50 of 54 African countries and they are written with five scripts. This provides support to at least 500M Africans.

Linguistic Diversity. The inclusion of languages from various domains, geographical regions, and linguistic typologies, along with the utilization of reliable data sources, contributes to enhancing the robustness and quality of Cheetah. Our data consists of languages from 14 language families in Africa written in five different orthographies. Furthermore, our data spans languages with a vast array of exotic linguistic features including tone, vowel and consonant harmony, reduplication, word orders, and word classes.

We provide further details on the data used for pretraining in Section A in the Appendix.

3.2 Implementation Details

Vocabulary. We use SentencePiece Kudo and Richardson (2018) to encode text as WordPiece tokens Sennrich et al. (2016) with 250K WordPieces. We also include data covering the ten top spoken languages globally: Arabic, English, French, German, Greek, Italian, Portuguese, Russian, Spanish, and Turkish. We use Wikipedia dumps for these ten languages, with 1M sentences for each language. However, we use this data only for building the vocabulary.

Model Architecture. We pretrain Cheetah using the encoder-decoder architecture Xue et al. (2021). Each of the encoder and decoder components is similar in size and configuration to T5, with 12 layers, each with 12 attention heads, and 768 hidden units for the base model. In total, this results in a model with $\sim$580 million parameters. We provide further details in Table 3.

| Model | Size | Params | No. Heads | No. Layers | D_model | Vocab | Seq. Len | Batch Size | #Train Steps | #Langs | #African Langs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mT0 | base | 580M | 12 | 12 | 768 | ~250K | 1024 | 1024 | UNK | 101 | 13 |
| mT5 | base | 580M | 12 | 12 | 768 | 250K | 1024 | 1024 | 1M | 101 | 13 |
| AfriMT5 | base | 580M | UNK | UNK | UNK | UNK | UNK | 2048 | UNK | 17 | 17 |
| AfriTeVa | base | 229M | 12 | 12 | 768 | 40K | 512 | 256 | 500K | 10 | 10 |
| Cheetah | base | 580M | 12 | 12 | 768 | 250K | 1024 | 1024 | 1M | 527 | 517 |

Table 3: Parameters of Cheetah compared with other models.
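As a rough sanity check on the reported size, the sketch below counts the weights of a T5-style encoder-decoder from the dimensions stated above (12 layers, 768 hidden units, 250K vocabulary). The feed-forward width of 2,048, the gated feed-forward block, the untied input/output embeddings, and the absence of bias terms are assumptions borrowed from the public mT5-base configuration; they are not given in the text, and `t5_style_param_count` is a hypothetical helper.

```python
def t5_style_param_count(vocab_size, d_model, d_ff, n_layers,
                         gated_ffn=True, tied_embeddings=False):
    """Approximate weight count of a T5-style encoder-decoder.

    Assumes bias-free linear layers and n_heads * d_kv == d_model.
    The tiny relative-position-bias tables are ignored.
    """
    attn = 4 * d_model * d_model                    # Q, K, V, and output projections
    ffn = (3 if gated_ffn else 2) * d_model * d_ff  # gated FFN has two input projections
    enc_layer = attn + ffn + 2 * d_model            # self-attention, FFN, 2 layer norms
    dec_layer = 2 * attn + ffn + 3 * d_model        # self- and cross-attention, FFN, 3 norms
    embed = vocab_size * d_model
    lm_head = 0 if tied_embeddings else vocab_size * d_model
    final_norms = 2 * d_model                       # one after each stack
    return embed + lm_head + n_layers * (enc_layer + dec_layer) + final_norms


# Dimensions from the text; d_ff=2048 is an assumption (mT5-base value).
total = t5_style_param_count(vocab_size=250_000, d_model=768, d_ff=2048, n_layers=12)
print(f"{total / 1e6:.0f}M parameters")
```

Under these assumptions, roughly two thirds of the weights sit in the input and output embedding matrices, which is why a 250K-entry vocabulary dominates the size of a base-sized model.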

Objective. We use an unsupervised (denoising) objective. The main idea is to feed the model with masked (corrupted) versions of the original sentence, and train it to reconstruct the original sequence. The denoising objective Xue et al. (2021) works by randomly sampling and dropping out 15% of tokens in the input sequence. All consecutive spans of dropped-out tokens are then replaced by a single sentinel token.
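A minimal sketch of the corruption step described above, assuming whitespace-separated tokens and T5-style sentinel names (`<extra_id_N>`), neither of which is specified in the text. The target format used here (each sentinel followed by the tokens it replaced) follows the usual T5 convention and is likewise an assumption; `span_corrupt` is a hypothetical helper, not the authors' implementation.

```python
import random


def span_corrupt(tokens, noise_rate=0.15, seed=0):
    """Drop ~noise_rate of the tokens and replace each run of dropped tokens
    with a single sentinel. Returns (corrupted_input, target)."""
    rng = random.Random(seed)
    n_drop = max(1, round(len(tokens) * noise_rate))
    dropped = set(rng.sample(range(len(tokens)), n_drop))

    inputs, targets = [], []
    sentinel_id = 0
    prev_dropped = False
    for i, tok in enumerate(tokens):
        if i in dropped:
            if not prev_dropped:
                # Start of a new dropped span: emit one sentinel on both sides.
                sentinel = f"<extra_id_{sentinel_id}>"
                sentinel_id += 1
                inputs.append(sentinel)
                targets.append(sentinel)
            targets.append(tok)
            prev_dropped = True
        else:
            inputs.append(tok)
            prev_dropped = False
    return inputs, targets


example = "the quick brown fox jumps over the lazy dog near the quiet river bank".split()
corrupted, target = span_corrupt(example)
print(" ".join(corrupted))
print(" ".join(target))
```

Because adjacent dropped tokens share one sentinel, the corrupted input is shorter than the original whenever two or more sampled positions are consecutive.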

Pretraining Procedure. For pretraining Cheetah, we use a learning rate of 0.01, a batch size of 1,024 sequences, and a maximum sequence length of 1,024. We pretrain each model for 1M steps. We train our models on Google Cloud TPU with 128 cores (v3-128) from TensorFlow Research Cloud (TFRC; https://sites.research.google/trc/about/).
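The stated batch size, sequence length, and step count imply an upper bound on the number of token positions processed during pretraining. The arithmetic below is a sketch that assumes every sequence is filled to the maximum length; padding or shorter sequences would lower the figure.

```python
batch_size = 1_024        # sequences per step (from the text)
max_seq_len = 1_024       # tokens per sequence (from the text)
train_steps = 1_000_000   # pretraining steps (from the text)

tokens_per_step = batch_size * max_seq_len
max_tokens_seen = tokens_per_step * train_steps  # upper bound, assumes no padding

print(f"{tokens_per_step:,} token positions per step")
print(f"{max_tokens_seen:.3e} token positions over pretraining (upper bound)")
```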

4 AfroNLG Benchmark

We create AfroNLG, a multi-lingual, multi-task benchmark comprising 67 test sets across six task clusters. Specifically, AfroNLG includes the following: cloze tasks, machine translation, paraphrase, question answering, summarization, and title generation. AfroNLG supports 527 languages, including 517 African languages and language varieties and the top 10 world languages. To the best of our knowledge, this is the most extensive benchmark to date for African languages. Table 2 shows, at a glance, how our benchmark compares to other benchmarks. We provide the details of each task cluster and its datasets in what follows. For detailed statistics about the task clusters, we refer to Appendix B.

Cloze Test. In order to comprehensively evaluate Cheetah across all the languages it was pretrained on, we employ cloze tasks as our evaluation approach and perform two cloze-task experiments. These tasks assess the model’s ability to fill in missing information. In the first cloze task, which we henceforth call mask-one, we randomly mask only one token in each sentence. In the second cloze task, which we call mask-at-least-one, we randomly mask at least one token and not more than 10% of the tokens in each sentence. For each of the 517 languages, we construct a cloze-task dataset comprising 200 data points in the Train set, 100 examples in the Test set, and 50 data points in the Dev set. We ensure that there is no overlap between the data used for the cloze tasks and the pretraining data. We show an example of our cloze task in Figure 2.

Figure 2: Examples from the mask-one and mask-at-least-one cloze task data.
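The two masking schemes can be sketched as follows. This is an illustration under stated assumptions, not the authors' data pipeline: tokens are whitespace-separated, the `<mask>` placeholder and both function names are hypothetical, and for sentences shorter than ten tokens the "at least one" requirement is taken to override the 10% cap.

```python
import random

MASK = "<mask>"  # hypothetical placeholder; the actual mask token is not given in the text


def mask_one(tokens, rng):
    """Mask exactly one randomly chosen token."""
    idx = rng.randrange(len(tokens))
    masked = list(tokens)
    masked[idx] = MASK
    return masked, [tokens[idx]]


def mask_at_least_one(tokens, rng, max_ratio=0.10):
    """Mask between 1 and max_ratio * len(tokens) randomly chosen tokens."""
    upper = max(1, int(len(tokens) * max_ratio))  # never fewer than one mask
    k = rng.randint(1, upper)
    positions = sorted(rng.sample(range(len(tokens)), k))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK
    return masked, [tokens[p] for p in positions]


rng = random.Random(0)
sentence = [f"w{i}" for i in range(30)]  # stand-in for a tokenized sentence
print(mask_one(sentence, rng))
print(mask_at_least_one(sentence, rng))
```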

Machine Translation. We include only datasets pertaining to African languages in our benchmark. In selecting the languages for our MT benchmark, we strive to keep datasets that have been used in any published machine translation task. This allows us to cover a diverse set of languages and compare our models to existing SoTA across a large number of language pairs. Our benchmark thus contains data from Afro-MT (https://github.com/machelreid/afromt) Reid et al. (2021b), Lafand-MT (https://github.com/masakhane-io/lafand-mt) Adelani et al. (2022), PidginUNMT (https://github.com/keleog/PidginUNMT) Ogueji and Ahia (2019), and SALT (https://github.com/SunbirdAI/salt) Akera et al. (2022). The datasets we consider make up 35 language pairs.

Paraphrase. A paraphrase task aims to create semantically similar and fluent paraphrases given an input text Chen et al. (2023); Palivela (2021). We use the TaPaCo dataset Scherrer (2020) for our paraphrase generation benchmark. TaPaCo is a freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. The dataset has four African languages: Afrikaans, Berber (a macro-language), Amazigh, and Kirundi.

Question Answering. The QA task aims to provide answers to questions based on a knowledge base, also referred to as contexts. We use the TyDi QA dataset (https://github.com/google-research-datasets/tydiqa) Clark et al. (2020). The dataset has a primary task and a gold passage task. In our benchmark, we only include the gold passage task, where a correct answer is predicted from a passage containing one answer, similar to the existing reading comprehension task.
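The results table reports F1 for this task. The text does not say how F1 is computed; the sketch below shows the token-overlap F1 commonly used for reading-comprehension benchmarks, as an assumption. Lowercasing and whitespace tokenization are simplifications, and `token_f1` is a hypothetical helper.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer string."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("mji wa Nairobi", "Nairobi"))
```

A prediction that contains the reference answer plus extra words keeps full recall but loses precision, so it receives partial credit rather than zero.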

Summarization. Summarization is the task of generating an abridged version of a text, while capturing the salient ideas and the intended information from the original text Nallapati et al. (2016); King et al. (2022). We use the subset of XL-Sum Hasan et al. (2021), an abstractive summarization dataset, that consists of African languages including Amharic, Hausa, Igbo, Kirundi, Oromo, Pidgin, Somali, Swahili, Tigrinya, and Yorùbá. We also develop new test sets using data we crawled from the web, which are non-overlapping with XL-Sum. Specifically, we crawl data from BBC and Voice of Africa (webpages) for Hausa, Ndebele, and Swahili.
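Summarization results are reported with ROUGE-L, which scores a candidate summary by its longest common subsequence (LCS) with the reference. The sketch below is a simplified sentence-level variant assuming whitespace tokenization and a balanced F-measure; the exact scorer and settings used for the reported numbers are not stated in the text, and both function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Balanced ROUGE-L F-measure between two whitespace-tokenized strings."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("a b c d", "a c d e"))
```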

Title Generation. The title generation task returns a single sentence title for a given article. Similar to the summarization task, we use XL-Sum to create a news title generation dataset. We also collect a new test set for title generation across 15 languages. The dataset comprises $\sim$6,000 BBC and Voice of Africa articles, non-overlapping with XL-Sum, and is particularly useful for zero-shot title generation.

5 Evaluation and Results

| Cluster | Task | Metric | mT0 | mT5 | Afri-MT5 | AfriTeVa | Cheetah |
|---|---|---|---|---|---|---|---|
| Machine Translation (MT) | English → Afrikaans | Bleu | 20.38±0.3 | 12.35±1.1 | 7.12±2.67 | 7.75±1.67 | 19.72±0.75 |
| | English → Bemba | Bleu | 19.19±0.3 | 12.28±0.48 | 11.73±12.3 | 20.5±0.87 | 18.9±1.22 |
| | English → Lingala | Bleu | 15.98±1.16 | 14.12±0.56 | 14.32±12.74 | 13.88±1.04 | 9.64±1.11 |
| | English → Rundi | Bleu | 12.26±0.47 | 8.82±0.43 | 9.57±0.42 | 7.83±1.04 | 10.54±0.54 |
| | English → Sesotho | Bleu | 11.04±1.2 | 12.74±0.75 | 10.0±1.79 | 10.76±1.4 | 13.3±1.38 |
| | English → Swahili | Bleu | 10.59±1.84 | 9.33±0.58 | 3.08±0.57 | 7.24±0.46 | 11.08±0.61 |
| | English → Xhosa | Bleu | 10.04±0.98 | 8.25±0.7 | 3.86±1.35 | 7.5±0.32 | 12.34±0.51 |
| | English → Zulu | Bleu | 17.65±1.86 | 17.97±1.69 | 1.9±1.11 | 13.45±1.81 | 19.49±1.16 |
| | English → Hausa | Bleu | 5.06±0.21 | 4.96±0.16 | 0.85±0.04 | 7.32±0.00 | 9.22±0.08 |
| | English → Igbo | Bleu | 13.05±0.17 | 11.57±0.23 | 1.12±0.09 | 12.34±0.23 | 16.75±0.26 |
| | English → Luganda | Bleu | 2.17±2.77 | 3.33±0.35 | 0.09±0.01 | 4.21±0.77 | 9.75±0.01 |
| | English → N. Pidgin | Bleu | 33.17±0.28 | 32.65±0.19 | 2.39±0.23 | 9.39±0.18 | 32.64±0.14 |
| | English → Swahili | Bleu | 22.04±2.89 | 23.2±0.23 | 2.79±0.08 | 22.39±0.28 | 28.11±0.14 |
| | English → Zulu | Bleu | 6.83±0.29 | 0.58±1.37 | 0.4±0.03 | 4.45±0.37 | 11.75±0.38 |
| | English → Twi | Bleu | 3.4±0.12 | 1.23±0.03 | 0.03±0.0 | 1.68±0.94 | 4.64±0.13 |
| | English → Yoruba | Bleu | 5.42±0.85 | 2.58±3.1 | 0.04±0.0 | 3.63±4.01 | 7.83±0.14 |
| | English → Zulu | Bleu | 10.28±0.49 | 1.31±2.26 | 0.14±0.03 | 3.8±4.2 | 12.13±0.1 |
| | French → Bambara | Bleu | 2.0±2.6 | 0.37±0.19 | 0.15±0.01 | 3.18±0.18 | 3.06±0.27 |
| | French → Ghomálá’ | Bleu | 0.4±0.09 | 0.33±0.01 | 0.07±0.0 | 0.96±0.01 | 0.28±0.25 |
| | French → Ewe | Bleu | 0.7±0.35 | 0.31±0.36 | 0.09±0.07 | 0.84±0.16 | 3.47±0.03 |
| | French → Fon | Bleu | 0.69±0.31 | 0.8±0.13 | 1.52±0.06 | 1.73±0.53 | 1.29±0.16 |
| | French → Moore | Bleu | 0.27±0.06 | 0.12±0.05 | 0.19±0.02 | 0.47±0.04 | 1.66±0.86 |
| | French → Wolof | Bleu | 4.02±0.12 | 0.3±0.05 | 0.11±0.01 | 3.08±0.25 | 3.01±0.07 |
| | English → N. Pidgin (UNMT) | Bleu | 27.44±0.26 | 23.42±1.61 | 7.05±1.37 | 22.54±0.84 | 26.56±0.04 |
| | Acholi → English | Bleu | 16.41±0.08 | 11.16±4.77 | 4.9±0.11 | 8.37±8.12 | 19.33±0.1 |
| | Acholi → Lugbara | Bleu | 2.57±0.21 | 1.48±1.31 | 2.44±0.37 | 8.29±0.14 | 7.21±0.69 |
| | Acholi → Luganda | Bleu | 3.64±0.07 | 1.74±0.12 | 0.92±0.01 | 5.53±0.34 | 8.03±0.38 |
| | Acholi → Nyankore | Bleu | 2.17±0.14 | 0.79±0.51 | 0.46±0.03 | 4.26±0.54 | 5.1±0.14 |
| | Acholi → Ateso | Bleu | 1.64±2.34 | 1.94±0.25 | 4.9±0.11 | 7.74±0.33 | 6.33±0.6 |
| | English → Lugbara | Bleu | 6.19±6.33 | 8.38±0.49 | 5.93±0.22 | 10.95±0.32 | 11.61±0.28 |
| | English → Luganda | Bleu | 12.08±0.03 | 10.58±0.25 | 2.59±0.73 | 12.41±0.35 | 17.12±0.16 |
| | English → Nyankore | Bleu | 6.46±0.08 | 5.69±0.02 | 1.4±0.39 | 7.88±0.18 | 9.04±0.24 |
| | English → Ateso (salt) | Bleu | 10.24±0.06 | 8.28±0.19 | 4.91±0.59 | 11.64±0.49 | 11.12±0.38 |
| | Lugbara → Ateso | Bleu | 2.21±0.35 | 1.5±0.2 | 2.22±0.15 | 6.67±0.32 | 3.68±0.31 |
| | Luganda → Lugbara | Bleu | 3.96±0.57 | 2.61±0.12 | 3.44±0.32 | 8.05±0.23 | 7.99±0.47 |
| | Luganda → Ateso | Bleu | 4.47±0.08 | 3.01±0.16 | 2.5±0.22 | 8.17±0.18 | 8.13±0.33 |
| | Nyankore → Lugbara | Bleu | 3.45±0.29 | 2.1±0.32 | 2.6±0.29 | 7.5±0.09 | 7.29±0.09 |
| | Nyankore → Luganda | Bleu | 8.54±0.17 | 6.91±0.23 | 2.01±0.25 | 6.77±6.73 | 6.25±10.26 |
| | Nyankore → Ateso | Bleu | 3.33±0.11 | 2.25±0.23 | 2.12±0.4 | 6.27±0.12 | 6.36±0.4 |
| Paraphrase | Multilingual | Bleu | 41.79±0.28 | 41.75±0.21 | 34.72±0.51 | 43.02±1.25 | 43.23±0.09 |
| | Berber | Bleu | 44.84±0.31 | 44.03±0.24 | 36.08±0.83 | 46.41±0.71 | 46.0±0.27 |
| | Kabyle | Bleu | 25.91±0.13 | 25.32±0.46 | 11.56±0.73 | 16.06±14.79 | 26.27±0.56 |
| Question Answering | QA Swahili | F1 | 79.84±0.19 | 72.04±0.54 | 0 | 62.64±0.78 | 71.98±1.18 |
| Summarization | Multilingual | RougeL | 22.31±0.12 | 22.23±0.04 | 5.34±0.48 | 18.97±0.06 | 24.86±0.02 |
| | Amharic | RougeL | 13.81±0.04 | 13.09±0.03 | 4.4±1.07 | 8.29±0.51 | 15.09±0.1 |
| | Igbo | RougeL | 18.9±0.73 | 13.22±0.46 | 14.24±0.39 | 16.05±0.49 | 17.36±0.43 |
| | Oromo | RougeL | 11.28±0.03 | 10.51±0.07 | 3.52±0.49 | 7±1.73 | 14.53±0.1 |
| | Rundi | RougeL | 19.63±0.01 | 18.02±0.13 | 11.82±0.39 | 16.13±0.03 | 22.57±0.04 |
| | Swahili | RougeL | 26.38±0.02 | 24.81±0.11 | 15.07±0.17 | 21.59±0.13 | 29.05±0.13 |
| | Yoruba | RougeL | 21.57±0.05 | 20.06±0.12 | 13.52±0.18 | 17.3±0.11 | 22.49±0.0 |
| | Hausa | RougeL | 26.46±0.06 | 25.76±0.02 | 19.96±0.26 | 25.19±0.11 | 30.07±0.31 |
| | Nigerian Pidgin | RougeL | 26.54±0.05 | 25.79±0.1 | 14.28±1.23 | 20.29±0.12 | 27.08±0.02 |
| | Somali | RougeL | 20.69±0.08 | 19.21±0.06 | 13.62±0.81 | 19.27±0.18 | 23.92±0.04 |
| | Tigrinya | RougeL | 15.84±0.13 | 13.93±0.11 | 6.53±0.42 | 10.07±0.09 | 16.88±0.12 |
| Title Generation | Multilingual | Bleu | 6.53±0.02 | 6.65±0.08 | 0.1±0.02 | 5.2±0.02 | 7.52±0.07 |
| | Amharic | Bleu | 3.13±0.23 | 2.65±0.68 | 0.34±0.14 | 2.31±0.14 | 4.34±0.34 |
| | Igbo | Bleu | 6.95±0.13 | 6.9±0.22 | 0.77±0.12 | 4.61±0.14 | 8.47±0.07 |
| | Oromo | Bleu | 1.1±1.84 | 2.66±0.19 | 0.21±0.06 | 1.54±0.17 | 3.26±0.21 |
| | Rundi | Bleu | 4.4±0.28 | 4.13±0.22 | 0.84±0.07 | 3.33±0.23 | 6.05±0.5 |
| | Swahili | Bleu | 9.1±0.23 | 9.31±0.11 | 1.22±0.09 | 7.01±0.09 | 10.59±0.6 |
| | Yoruba | Bleu | 6.8±0.16 | 7.23±0.59 | 0.34±0.05 | 5.04±2.0 | 7.97±0.32 |
| | Hausa | Bleu | 8.11±0.24 | 7.3±0.34 | 2.59±0.01 | 6.69±0.18 | 8.48±0.23 |
| | Nigerian Pidgin | Bleu | 6.75±0.6 | 3.96±4.3 | 0.89±0.02 | 4.72±0.84 | 6.22±0.28 |
| | Somali | Bleu | 3.37±0.21 | 3.31±0.16 | 0.38±0.11 | 2.82±0.47 | 5.25±0.14 |
| | Tigrinya | Bleu | 2.99±0.1 | 2.94±1.09 | 0.7±0.18 | 1.92±0.26 | 5.1±0.05 |
| Cloze-task | Mask-one | Bleu | 13.61±0.91 | 8.18±3.94 | 0.00±0.00 | 8.36±3.42 | 13.98±0.32 |
| | Mask-at-least-one | Bleu | 2.36±0.11 | 2.66±0.09 | 0.93±0.12 | 0.68±0.09 | 7.07±0.09 |
AfroNLG Score 12.56 11.05 5.15 10.84 14.25
Table 4: Average performance of finetuned African and multilingual models across three runs on AfroNLG benchmark test sets.

We evaluate Cheetah on six task clusters of the AfroNLG benchmark and compare its performance to that of mT0, mT5, Afri-MT5, and AfriTeVa. We report results in Table 4. For all models, we finetune on the training data split (Train) for 20 epochs with an early stopping of 5 epochs, a learning rate of 5e-5, a batch size of 16, and a sequence length of 512. All experiments were performed on 4 GPUs (Nvidia V100). We report the results of each experiment as an average of three runs, each with a different seed (specifically, seed values 41, 1512, and 20235). For multilingual datasets in each task cluster, we show evaluation results per language. Cheetah outperforms other models on many languages across the six task clusters. We provide detailed information on model performance next.
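
The "mean ± deviation over three seeded runs" convention used in Table 4 can be sketched as follows. This is only an illustration: the helper name `summarize_runs` and the scores are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say how the spread is computed.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed scores.

    Assumption: the spread is the sample standard deviation; the paper
    does not state which estimator it uses.
    """
    return round(mean(scores), 2), round(stdev(scores), 2)


# Hypothetical scores keyed by the three seeds; these are NOT real results.
runs = {41: 10.0, 1512: 11.0, 20235: 12.0}
avg, spread = summarize_runs(list(runs.values()))
print(f"{avg}±{spread}")
```

A large spread relative to the mean (as in a few Table 4 cells) signals that at least one seed diverged, which is worth checking before comparing averages.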

Cloze Test. Cheetah outperforms all other models on both cloze tasks, as shown in Table 4. We show the results for each language that is supported by the models compared in Table D.1 and Table D.2. The performance of all models on mask-one is better than the performance on mask-at-least-one, reflecting how increasing the number of masked tokens makes the task more challenging. It is also important to mention that, since evaluation is based on BLEU, it does not credit correct synonyms that each model may have generated to replace the masked tokens.
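
To illustrate why an exact-match metric gives no credit for synonyms, the sketch below computes clipped unigram precision, which is the 1-gram component of BLEU (full BLEU also uses higher-order n-grams and a brevity penalty, omitted here). The function name and the example sentences are hypothetical and are not drawn from the benchmark.

```python
from collections import Counter


def clipped_unigram_precision(hypothesis, reference):
    """Fraction of hypothesis tokens that also occur in the reference.

    Each token's count is clipped by its count in the reference, as in
    the modified n-gram precision used by BLEU (here for n = 1 only).
    """
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())
    return overlap / total


reference = "the child is happy"
exact = clipped_unigram_precision("the child is happy", reference)
synonym = clipped_unigram_precision("the child is glad", reference)
print(exact, synonym)  # the synonym "glad" is scored as a plain mismatch
```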

Machine Translation. Cheetah sets a new SOTA on 23 tasks, surpassing previous models. The mT0 and AfriTeVa models also demonstrate strong performance on six languages. Notably, pairs with French as the source language tend to yield the lowest BLEU scores, indicating relatively lower translation quality. On the other hand, the language pair involving English to Nigerian Pidgin, specifically on LafandMT and PidginUNMT, showcases the highest BLEU scores. We assume that the similarity between Nigerian Pidgin and English contributes favourably to translation quality in these scenarios. We also report CHRF and CHRF++ results in Table B.3 and Table B.4 in the Appendix.

Paraphrase. In the three paraphrase tasks, Cheetah demonstrates remarkable superiority over all other models. Specifically, we achieve an impressive ROUGE score of 46.0 on the Berber paraphrase task, surpassing the second-best model by a margin of approximately two points.

Question Answering. In the task of question answering, mT0 exhibits superior performance compared to both Cheetah and other models. While mT5 achieves the second-highest performance, Cheetah attains the third-highest performance in this task.

Summarization. Cheetah sets a new SOTA on 11 languages, outperforming other models by an average margin of at least three points. Detailed results can be found in Table 4.

Title Generation. On the title generation task, Cheetah sets a new SOTA on 11 languages. We report results in Table 4.

5.1 Investigating linguistic capabilities

In order to further test the utility of our models, we use grammar templates to construct test data in English. We use nine linguistic rules and 19 lexical items to generate 152 sentences. Next, we use our model to translate from source to target and manually evaluate the quality of the generated data. We design new evaluation metrics, faithfulness and fluency, for the manual evaluation. A detailed description follows.

Grammar templates. We use grammar templates McCoy et al. (2019) developed with context-free grammars (CFG) on the source side to construct controlled test sets in English. We use CFG on the source side alone because constituents and constituent order differ across languages. We adopt this method for two reasons. First, utilizing grammar templates provides a standardized framework that ensures that the same grammatical phenomena are tested consistently. By employing a uniform approach, we can effectively isolate and evaluate specific linguistic features, facilitating a more rigorous and meaningful comparison of language model performance. Second, grammar templates exhibit a high degree of flexibility, allowing for easy modification and extension to encompass a wide range of linguistic phenomena. This adaptability not only facilitates the incorporation of new linguistic features but also enables the evolution of our test sets to match the dynamic landscape of natural language processing research.

Other alternatives to templates include using parsed corpora Bender et al. (2011) or naturally occurring sentences. For the languages we explore, there are no good quality parsers, making automatic parsing inaccessible for this analysis. Furthermore, when a corpus is parsed automatically, the likelihood of encountering parsing errors escalates with the intricacy of the sentence structure Bender et al. (2011); Marvin and Linzen (2018). Conversely, if the test set exclusively comprises sentences with accurate gold parses, sourcing an ample quantity of instances showcasing syntactic complexities becomes an arduous task Marvin and Linzen (2018). Furthermore, the utilization of naturally occurring sentences introduces potential complications that might confound the interpretation of experiments Ettinger et al. (2018).

The templates include transitive and intransitive structures, negative and affirmative structures, and structures with gender and number. Table 5 provides examples of generated sentences using the templates (the entire generated grammar is available at our GitHub: anonymous link).

| Category | Example |
| --- | --- |
| Intransitive | He left |
| Intransitive + Negation | We did not leave |
| Transitive | You left Lagos |
| Transitive + Negation | She did not leave them |
Table 5: Some examples of sentences generated with the templates
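
Template-based generation of this kind can be sketched with a small context-free grammar that is expanded exhaustively. The grammar below is a hypothetical toy built around the Table 5 examples; it is not the nine-rule, 19-item grammar used in the paper, and the names `GRAMMAR` and `expand` are illustrative only.

```python
from itertools import product

# Hypothetical toy grammar: nonterminal -> list of productions.
# Any symbol that is not a key is treated as a terminal string.
GRAMMAR = {
    "S": [["SUBJ", "VP"]],
    "SUBJ": [["He"], ["We"], ["You"], ["She"]],
    "VP": [
        ["left"],                  # intransitive
        ["did not leave"],         # intransitive + negation
        ["left", "OBJ"],           # transitive
        ["did not leave", "OBJ"],  # transitive + negation
    ],
    "OBJ": [["Lagos"], ["them"]],
}


def expand(symbol, grammar):
    """Return every terminal string derivable from `symbol`."""
    if symbol not in grammar:  # terminal: nothing left to rewrite
        return [symbol]
    sentences = []
    for production in grammar[symbol]:
        # Expand each right-hand-side symbol, then combine all choices.
        parts = [expand(child, grammar) for child in production]
        sentences.extend(" ".join(combo) for combo in product(*parts))
    return sentences


sentences = expand("S", GRAMMAR)
print(len(sentences))
print(sentences[:3])
```

Because the grammar is enumerated exhaustively, adding a lexical item or a rule changes the test set in a predictable way, which is the flexibility argued for above.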

Inference. We test three of our finetuned machine translation models with the generated dataset. This allows us to evaluate how much linguistic information the models have acquired during pretraining and finetuning. Specifically, we use the English→Hausa, English→Swahili, and English→Yorùbá models based on mT0, mT5, AfriTeVa, and Cheetah that were finetuned on the LafandMT dataset. We do not include Afri-MT5 in this analysis because it has very low scores across several tasks, as shown in Table 4. Notably, Hausa, Swahili, and Yorùbá have distinct typologies, and the performance of each model on each language gives further insight into performance across varying typological features (see Section C for details). Table 6 shows some linguistic differences between the three languages. This method can be generalized to any African language.

| Lang. | Family | # Tone | Gender | Morphology |
| --- | --- | --- | --- | --- |
| Hausa | Afro-Asiatic | Two | Two | Isolating |
| Swahili | N.C. Bantu | None | Five | Agglutinative |
| Yorùbá | N.C. Non-Bantu | Three | None | Isolating |
Table 6: Some linguistic differences between Hausa, Swahili, and Yorùbá. N.C. refers to Niger-Congo.

5.2 Human evaluation

To evaluate the effectiveness of each model across different languages, we assess the generated output’s faithfulness and fluency using a five-point Likert scale. Faithfulness measures how accurately a model’s output corresponds to the input sentence, while fluency assesses the grammatical coherence and plausibility of the generated output. We use both metrics because a model can produce coherent output that may not be faithful to the input sentence. This way, if faithfulness penalizes a model for outputs that are not true to the input or that include additional unnecessary information, fluency complements our evaluation of the quality of the same model if the output is fluent. For each grammar category, we return the average Likert score for each language and each model.
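
The per-category averaging described above can be sketched as a simple group-by over annotated outputs. The record layout, the function name `average_likert`, and the ratings are hypothetical; the paper does not describe its aggregation code.

```python
from collections import defaultdict
from statistics import mean


def average_likert(ratings):
    """Average 1-5 Likert scores per (model, language, category).

    `ratings` is an iterable of (model, language, category, score) tuples;
    this layout is an assumption made for illustration.
    """
    buckets = defaultdict(list)
    for model, language, category, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of Likert range: {score}")
        buckets[(model, language, category)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Hypothetical faithfulness ratings; NOT taken from the paper's annotations.
faithfulness = [
    ("Cheetah", "hau", "Intransitive", 5),
    ("Cheetah", "hau", "Intransitive", 4),
    ("mT5", "hau", "Intransitive", 3),
]
averages = average_likert(faithfulness)
print(averages[("Cheetah", "hau", "Intransitive")])
```

Running the same function once on faithfulness scores and once on fluency scores keeps the two metrics separate, matching the way they are reported.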

Figure 3: Faithfulness and fluency for Hausa, Swahili, and Yorùbá

5.3 Annotation

We annotated each model’s output for faithfulness and fluency. For Hausa and Yorùbá, two expert annotators evaluated the model’s output for faithfulness and fluency. We ensured that each annotator has native speaker competency in reading and writing (while some had a linguistic background). We gave specific annotation instructions (See Section E in the Appendix) to ensure the values are not assigned arbitrarily. We also ensured that the annotators do not know who created which models to prevent any biases. We report the Kappa scores for inter-annotator agreement in Table 7. For Swahili, only one annotator made it to the final annotation task since we could not acquire high quality annotations from other annotators. The Swahili annotator who did the final annotation is a university lecturer who teaches Swahili and has a Ph.D. in linguistics.

| Model | hau Faith. | hau Flu. | yor Faith. | yor Flu. |
| --- | --- | --- | --- | --- |
| mT0 | 90.54 | 97.62 | 96.57 | 93.92 |
| mT5 | 93.51 | 96.48 | 82.23 | 81.10 |
| AfriTeVa | 87.27 | 96.94 | 88.56 | 84.73 |
| Cheetah | 96.61 | 97.26 | 87.11 | 92.64 |
Table 7: Kappa scores for Faithfulness (i.e., Faith.) and Fluency (i.e., Flu.) across the four models for Hausa (hau) and Yorùbá (yor), the two languages with two annotators.
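
For readers unfamiliar with the statistic, the sketch below computes Cohen's kappa for two annotators from first principles. This is an assumption about the variant: the text says only "Kappa scores" and does not specify whether an unweighted or weighted (ordinal) kappa was used, nor how the values were scaled to the percentages shown in Table 7. The function name and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two annotators over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each annotator's label distribution.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical Likert ratings from two annotators; NOT the paper's data.
annotator_1 = [5, 5, 4, 3, 5, 2]
annotator_2 = [5, 4, 4, 3, 5, 2]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 4))
```

Because Likert labels are ordinal, a weighted kappa that penalizes a 5-versus-4 disagreement less than a 5-versus-1 disagreement would be a reasonable alternative.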

5.4 Fluency and Faithfulness Performance

We report the distribution of faithfulness and fluency scores across all models and languages in Figure 3. Overall, Cheetah produces more faithful and more fluent outputs than other models on all languages. We now go on to provide detailed analysis of model performance.

Intransitives In the case of Hausa examples, all three models manage to produce intransitive examples. However, Cheetah consistently appends objects to these intransitive examples. This inclination to add objects might stem from biases within the data used for pretraining or finetuning Cheetah. Nevertheless, it is worth noting that Cheetah outperforms other models by generating more fluent and more faithful Hausa outputs. In the Swahili context, all models successfully generate intransitive translations, with model errors primarily related to tense. This performance discrepancy in Swahili can be attributed to its agglutinative structure, with models potentially lacking exposure to a comprehensive range of grammatical features during pretraining or finetuning. In the context of Yorùbá, all models consistently incorporate at least one object in each intransitive case. Notably, mT0 generates an output without an object approximately 5.88% of the time. This may be because intransitive sentences inherently lack a clear direct object, making it more challenging for machine translation models to grasp context and select the accurate translation. In certain instances, some intransitive phrases can be polysemous, further complicating the translation process. Intransitive English verbs do not always retain their intransitive nature in Yorùbá. Furthermore, transitives with optional/truncated objects tend to have a compulsory object in Yorùbá. This phenomenon potentially contributes to the models’ tendency to append objects to intransitive Yorùbá phrases. For instance, whereas the intransitive "slept" in "John slept" maps to the intransitive form "John sùn" in Yorùbá, the intransitive verb "prayed" in "John prayed" becomes "John gbàdúrà", a transitive verb in Yorùbá. On the other hand, the transitive verb "ate" in "John ate" has an optional/truncated object in English but becomes "John jẹun", a transitive with an obligatory object. In Yorùbá, both "ate" and "prayed" are transitive verbs that require an object: "jẹun" is derived from "jẹ" (eat) and "oúnjẹ" (food), while "gbàdúrà" is derived from "gbà" (collect) and "àdúrà" (prayer). We report the distribution of scores in Figure F.1.

Transitives In the context of transitives, Cheetah stands out as the top-performing model across all three languages, as illustrated in Figure 3. Cheetah demonstrates the capability to provide three distinct semantic senses for the polysemous transitive verb "treated", whereas the other models typically produce only a single semantic interpretation. In Swahili examples, certain instances exhibit the deletion or simplification of object markers in an ungrammatical manner. For a visual representation of the annotation of intransitive sentences in Yorùbá, please refer to Figure F.3. Figure F.2 shows the distribution of model performance on transitives.

Negative In the context of Yorùbá, all models are able to produce the correct negation marker, including the correct tone marks. The tone patterns on negation markers may vary based on the context of words before and after the negation marker, and it was interesting to see these variations in the models’ outputs. Despite this, mT0, mT5, and AfriTeVa have a tendency to output the negation of the antonym of the verb in each sentence rather than the negation of the verb. Cheetah also makes similar mistakes about 5% of the time.

Affirmative The models generally perform better on the affirmative examples than on the negated examples. However, in the context of Hausa, mT5, mT0, and AfriTeVa consistently output the antonym of the verb to be negated. For instance, the models return "Sara left" rather than "Sara did not leave". In the Swahili examples, we also find cases of double negation (which is not grammatically correct in Swahili). We show the distribution of results in Figure F.5 and Figure F.4.

Gender/Agreement We find interesting cases of gender in the models’ output. For example, whereas Yorùbá grammar does not distinguish gender, Cheetah uses Arábìrin (female) before every occurrence of the name "Sara" to indicate that it has a high probability of being feminine (see Figure F.3). It is important to mention that "Fred" is not annotated this way. For Hausa, which requires agreement between the gender of the noun and the verb, we find Cheetah outperforming both mT0 and mT5 significantly. AfriTeVa, however, has very low accuracy in the context of gender. Furthermore, mT0, mT5, and Cheetah return connotations of love and relationships for each example where a male and a female pronoun co-occur, cross-lingually.

Number Cheetah significantly outperforms all three models in accurately assigning appropriate number markers. We also find that when translating the word "you" into Hausa, Swahili, or Yorùbá, all four models use either singular or plural forms. We assume that this is due to the fact that the second person in English (i.e., "you") can be either singular or plural, while each of these languages has a different word for the singular and plural forms.

6 Conclusion

In this work, we introduced Cheetah, a massively multilingual language model designed for African natural language generation. We also propose a new African language generation benchmark, dubbed AfroNLG. Our evaluation benchmark is both sizeable and diverse. We evaluate Cheetah on AfroNLG comparing it to three other models, two multilingual and one dedicated to African languages. The performance of Cheetah surpasses that of all other models we evaluate. This is demonstrated by its superior AfroNLG score, which is approximately three times better than the combined performance of other models. Furthermore, Cheetah outperforms all other models across 48 out of 65 test sets spanning six task clusters. We further analyze our model’s robustness to lexical complexity and carry out human evaluation to inspect the model’s performance on a controlled test set. Again, our results underscore the superiority of our model.

7 Limitations

We identify the following limitations for our work:

  1. The limitations of our language model include the limited scope of our evaluation. Future work should focus on increasing the subset of languages evaluated manually in order to ensure quality. We believe automatic analyses are not sufficient for development of models that get deployed in particular applications.

  2. Another limitation is related to our inability to perform extensive analysis of biases and hateful speech present in our pretraining data. Again, this is due to relatively restricted access to native speakers (and even automated tools) to perform this analysis. As a result, we cannot fully ensure that our models are free from biases and socially undesirable effects. Therefore, it is important that these models be used with care and caution, and be analyzed for biases and socially undesirable effects before use.

  3. Additionally, due to unavailability of sufficient computing resources, we were unable to evaluate larger multilingual language models.

8 Ethics Statement and Wider Impacts

Cheetah aligns with Afrocentric NLP, where the needs of African people are taken into consideration when developing technology. We believe Cheetah will be useful not only to speakers of the languages supported, but also to researchers of African languages such as anthropologists and linguists. We discuss below some use cases for Cheetah and offer a number of broad impacts.

  1. Cheetah aims to address the lack of access to technology in about 90% of the world’s languages, which automatically discriminates against native speakers of those languages. More precisely, it does so by focusing on Africa. To the best of our knowledge, Cheetah is the first massively multilingual PLM developed for African languages and language varieties. A model with knowledge of 517 African languages is by far the largest to date for African NLP.

  2. Cheetah enables improved access to important information for the African community in Indigenous African languages. This is especially beneficial for people who may not be fluent in other languages. This will potentially connect more people globally.

  3. Cheetah affords opportunities for language preservation for many African languages. To the best of our knowledge, Cheetah covers languages that have not been used for any NLP task until now. We believe that it can help encourage continued use of these languages in several domains, as well as trigger future development of language technologies for many of these languages.

  4. Although LMs are useful for a wide range of applications, they can also be misused. Cheetah is developed using publicly available datasets that may carry biases. Although we strive to perform analyses and diagnostic case studies to probe performance of our models, our investigations are by no means comprehensive, nor do they guarantee absence of bias in the data. In particular, we do not have access to native speakers of most of the languages covered. This hinders our ability to investigate samples from each (or at least the majority) of the languages.

References

  • Abadji et al. (2021) Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2021. Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus. Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-9) 2021. Limerick, 12 July 2021 (Online-Event), pages 1 – 9, Mannheim. Leibniz-Institut für Deutsche Sprache.
  • Adebara and Abdul-Mageed (2022) Ife Adebara and Muhammad Abdul-Mageed. 2022. Towards afrocentric NLP for African languages: Where we are and where we can go. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3814–3841, Dublin, Ireland. Association for Computational Linguistics.
  • Adebara et al. (2023) Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, and Alcides Alcoba Inciarte. 2023. Serengeti: Massively multilingual language models for africa.
  • Adelani et al. (2022) David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Emezue, Colin Leong, Michael Beukman, Shamsuddeen Muhammad, Guyo Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ajibade, Tunde Ajayi, Yvonne Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi, Fatoumata Ouoba Kabore, Godson Kalipe, Derguene Mbaye, Allahsera Auguste Tapo, Victoire Memdjokam Koagne, Edwin Munkoh-Buabeng, Valencia Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda, Andiswa Bukula, and Sam Manthalu. 2022. A few thousand translations go a long way! leveraging pre-trained models for African news translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3053–3070, Seattle, United States. Association for Computational Linguistics.
  • Adelani et al. (2021) David Adelani, Dana Ruiter, Jesujoba Alabi, Damilola Adebonojo, Adesina Ayeni, Mofe Adeyemi, Ayodele Esther Awokoya, and Cristina España-Bonet. 2021. The effect of domain and diacritics in Yoruba–English neural machine translation. In Proceedings of Machine Translation Summit XVIII: Research Track, pages 61–75, Virtual. Association for Machine Translation in the Americas.
  • Akera et al. (2022) Benjamin Akera, Jonathan Mukiibi, Lydia Sanyu Naggayi, Claire Babirye, Isaac Owomugisha, Solomon Nsumba, Joyce Nakatumba-Nabende, Engineer Bainomugisha, Ernest Mwebaze, and John Quinn. 2022. Machine translation for african languages: Community creation of datasets and models in uganda.
  • Alabi et al. (2022) Jesujoba O. Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. 2022. Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4336–4349, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  • Becker (2002) Tilman Becker. 2002. Practical, template–based natural language generation with TAG. In Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+6), pages 80–83, Universitá di Venezia. Association for Computational Linguistics.
  • Bender et al. (2011) Emily M. Bender, Dan Flickinger, Stephan Oepen, and Yi Zhang. 2011. Parser evaluation over local and non-local deep dependencies in a large corpus. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 397–408, Edinburgh, Scotland, UK. Association for Computational Linguistics.
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA. Curran Associates Inc.
  • Cahyawijaya et al. (2021) Samuel Cahyawijaya, Genta Indra Winata, Bryan Wilie, Karissa Vincentio, Xiaohong Li, Adhiguna Kuncoro, Sebastian Ruder, Zhi Yuan Lim, Syafri Bahar, Masayu Khodra, Ayu Purwarianti, and Pascale Fung. 2021. IndoNLG: Benchmark and resources for evaluating Indonesian natural language generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8875–8898, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Chen et al. (2023) Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, and Diyi Yang. 2023. An Empirical Survey of Data Augmentation for Limited Data Learning in NLP. Transactions of the Association for Computational Linguistics, 11:191–211.
  • Clark et al. (2020) Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.
  • Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
  • Costa-jussà et al. (2022) Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Dossou et al. (2022) Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Oreen Yousuf, Salomey Osei, Abigail Oppong, Iyanuoluwa Shode, Oluwabusayo Olufunke Awoyomi, and Chris Chinenye Emezue. 2022. Afrolm: A self-active learning-based multilingual pretrained language model for 23 african languages.
  • Dryer and Haspelmath (2013) Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
  • Dušek and Jurčíček (2015) Ondřej Dušek and Filip Jurčíček. 2015. Training a natural language generator from unaligned data. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 451–461, Beijing, China. Association for Computational Linguistics.
  • Eberhard et al. (2021) David M. Eberhard, Gary F. Simons, and Charles D. Fennig (eds). 2021. Ethnologue: Languages of the world. Twenty-fourth edition, Dallas, Texas: SIL International.
  • Ettinger et al. (2018) Allyson Ettinger, Ahmed Elgohary, Colin Phillips, and Philip Resnik. 2018. Assessing composition in sentence vector representations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1790–1801, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Gehrmann et al. (2021) Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Andre Niyongabo Rubungo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pages 96–120, Online. Association for Computational Linguistics.
  • Gehrmann et al. (2022) Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina Mcmillan-major, Anna Shvets, Ashish Upadhyay, Bernd Bohnet, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Hiroaki Hayashi, Jekaterina Novikova, Jenna Kanerva, Jenny Chim, Jiawei Zhou, Jordan Clive, Joshua Maynez, João Sedoc, Juraj Juraska, Kaustubh Dhole, Khyathi Raghavi Chandu, Laura Perez Beltrachini, Leonardo F . R. Ribeiro, Lewis Tunstall, Li Zhang, Mahim Pushkarna, Mathias Creutz, Michael White, Mihir Sanjay Kale, Moussa Kamal Eddine, Nico Daheim, Nishant Subramani, Ondrej Dusek, Paul Pu Liang, Pawan Sasanka Ammanamanchi, Qi Zhu, Ratish Puduppully, Reno Kriz, Rifat Shahriyar, Ronald Cardenas, Saad Mahamood, Salomey Osei, Samuel Cahyawijaya, Sanja Štajner, Sebastien Montella, Shailza Jolly, Simon Mille, Tahmid Hasan, Tianhao Shen, Tosin Adewumi, Vikas Raunak, Vipul Raheja, Vitaly Nikolaev, Vivian Tsai, Yacine Jernite, Ying Xu, Yisi Sang, Yixin Liu, and Yufang Hou. 2022. GEMv2: Multilingual NLG benchmarking in a single line of code. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 266–281, Abu Dhabi, UAE. Association for Computational Linguistics.
  • Guzmán et al. (2019) Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc’Aurelio Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6098–6111, Hong Kong, China. Association for Computational Linguistics.
  • Hasan et al. (2021) Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. 2021. XL-sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online. Association for Computational Linguistics.
  • He et al. (2022) Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2022. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations.
  • Jaggar (2017) Philip J. Jaggar. 2017. The Hausa “Grade 5” verb: Morphosyntactic preliminaries, 1 edition, pages 18–27. Harrassowitz Verlag.
  • Joshi et al. (2020) Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.
  • Jude Ogundepo et al. (2022) Odunayo Jude Ogundepo, Akintunde Oladipo, Mofetoluwa Adeyemi, Kelechi Ogueji, and Jimmy Lin. 2022. AfriTeVA: Extending “small data” pretraining approaches to sequence-to-sequence models. In Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 126–135, Hybrid. Association for Computational Linguistics.
  • King et al. (2022) Daniel King, Zejiang Shen, Nishant Subramani, Daniel S. Weld, Iz Beltagy, and Doug Downey. 2022. Don’t say what you don’t know: Improving the consistency of abstractive summarization by constraining beam search. In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 555–571, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
  • Kreutzer et al. (2021) Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suárez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, and Mofetoluwa Adeyemi. 2021. Quality at a glance: An audit of web-crawled multilingual datasets. arXiv preprint arXiv:2103.12028.
  • Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
  • Kumar et al. (2022) Aman Kumar, Himani Shrotriya, Prachi Sahu, Amogh Mishra, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M. Khapra, and Pratyush Kumar. 2022. IndicNLG benchmark: Multilingual datasets for diverse NLG tasks in Indic languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5363–5394, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Li et al. (2016) Xiao Li, Kees van Deemter, and Chenghua Lin. 2016. Statistics-based lexical choice for NLG from quantitative information. In Proceedings of the 9th International Natural Language Generation conference, pages 104–108, Edinburgh, UK. Association for Computational Linguistics.
  • Liang et al. (2020) Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Zhang, Rahul Agrawal, Edward Cui, Sining Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Daniel Campos, Rangan Majumder, and Ming Zhou. 2020. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6008–6018, Online. Association for Computational Linguistics.
  • Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
  • Marvin and Linzen (2018) Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics.
  • McCoy et al. (2019) Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.
  • Muennighoff et al. (2022) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2022. Crosslingual generalization through multitask finetuning.
  • Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.
  • Nzeyimana and Niyongabo Rubungo (2022) Antoine Nzeyimana and Andre Niyongabo Rubungo. 2022. KinyaBERT: a morphology-aware Kinyarwanda language model. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5347–5363, Dublin, Ireland. Association for Computational Linguistics.
  • Ogueji and Ahia (2019) Kelechi Ogueji and Orevaoghene Ahia. 2019. Pidginunmt: Unsupervised neural machine translation from west african pidgin to english. arXiv preprint arXiv:1912.03444.
  • Ogueji et al. (2021) Kelechi Ogueji, Yuxin Zhu, and Jimmy Lin. 2021. Small data? no problem! exploring the viability of pretrained multilingual language models for low-resourced languages. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 116–126, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Palivela (2021) Hemant Palivela. 2021. Optimization of paraphrase generation and identification using language models in natural language processing. International Journal of Information Management Data Insights, 1(2):100025.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Reid et al. (2021a) Machel Reid, Junjie Hu, Graham Neubig, and Yutaka Matsuo. 2021a. AfroMT: Pretraining strategies and reproducible benchmarks for translation of 8 african languages. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic.
  • Reid et al. (2021b) Machel Reid, Junjie Hu, Graham Neubig, and Yutaka Matsuo. 2021b. AfroMT: Pretraining strategies and reproducible benchmarks for translation of 8 African languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1306–1320, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Ruder et al. (2019) Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. 2019. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Scherrer (2020) Yves Scherrer. 2020. TaPaCo: A corpus of sentential paraphrases for 73 languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6868–6873, Marseille, France. European Language Resources Association.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
  • Shi et al. (2021) Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, and Chandan K. Reddy. 2021. Neural abstractive text summarization with sequence-to-sequence models. ACM/IMS Trans. Data Sci., 2(1).
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, page 3104–3112. MIT Press.
  • Van Deemter et al. (2005) Kees Van Deemter, Emiel Krahmer, and Mariët Theune. 2005. Real versus template-based natural language generation: A false opposition? Comput. Linguist., 31(1):15–24.
  • van Miltenburg et al. (2020) Emiel van Miltenburg, Chris van der Lee, Thiago Castro-Ferreira, and Emiel Krahmer. 2020. Evaluation rules! on the use of grammars and rule-based systems for NLG evaluation. In Proceedings of the 1st Workshop on Evaluating NLG Evaluation, pages 17–27, Online (Dublin, Ireland). Association for Computational Linguistics.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6000–6010, Red Hook, NY, USA. Curran Associates Inc.
  • Xue et al. (2021) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.

Appendix A Pretraining Data

We provide details of our pretraining data below.

Religious Domain. Our religious data is taken from online Bibles, Qurans, and data crawled from the Jehovah’s Witnesses website. We also include religious texts from the Book of Mormon.

News Domain. We collect data from online newspapers Adebara and Abdul-Mageed (2022) and news sites such as Voice of America, Voice of Nigeria, BBC, Global Voices, and DW. We collect local newspapers in 27 languages from across Africa.

Government Documents. We collect government documents from the South African Centre for Digital Language Resources (SADiLaR) and the Universal Declaration of Human Rights (UDHR) in multiple languages.

Health Documents. We collect multiple health documents from the Department of Health, State Government of Victoria, Australia. We collect documents in Amharic, Dinka, Harari, Oromo, Somali, Swahili, and Tigrinya.

Existing Corpora. We collect corpora available on the web for different African languages, including Project Gutenberg for Afrikaans; South African News data for Sepedi and Setswana; and OSCAR Abadji et al. (2021) for Afrikaans, Amharic, Somali, Swahili, Oromo, Malagasy, and Yoruba. We also use Tatoeba for Afrikaans, Amharic, Bemba, Igbo, Kanuri, Kongo, Luganda, Malagasy, Sepedi, Ndebele, Kinyarwanda, Somali, Swahili, Tsonga, Xhosa, Yoruba, and Zulu; Swahili Language Modelling Data for Swahili; the Ijdutse corpus for Hausa; Data4Good corpora for Luganda; CC-100 for Amharic, Fulah, Igbo, Yoruba, Hausa, Tswana, Lingala, Luganda, Afrikaans, Somali, Swahili, Swati, North Sotho, Oromo, Wolof, Xhosa, and Zulu; Afriberta-Corpus for Afaan / Oromo, Amharic, Gahuza, Hausa, Igbo, Pidgin, Somali, Swahili, Tigrinya, and Yoruba; and mC4 for Afrikaans, Amharic, Hausa, Igbo, Malagasy, Chichewa, Shona, Somali, Sepedi, Swahili, Xhosa, Yoruba, and Zulu.

Appendix B AfroNLG Benchmark

We report statistics of the AfroNLG benchmark in Tables B.1 and B.2.

Dataset Pairs Train Dev Test
Lafand eng-hau 5,866 1,301 1,501
eng-ibo 6,945 1,457 1,412
eng-lug 4,076 1,501 1,501
eng-pcm 4,791 1,485 1,565
eng-swa 30,783 1,792 1,836
eng-tsn 2,101 1,343 1,501
eng-twi 3,338 1,285 1,501
eng-yor 6,645 1,545 1,559
eng-zul 3,540 1,462 1,001
fra-bam 3,014 1,501 1,501
fra-bbj 2,233 1,134 1,431
fra-ewe 2,027 1,415 1,564
fra-fon 2,638 1,228 1,580
fra-mos 2,494 1,493 1,575
fra-wol 3,361 1,507 1,501
AfroMT eng-afr 25,799 3,226 3,226
eng-bem 12,043 1,506 1,506
eng-lin 17,679 2,211 2,210
eng-run 12,475 1,560 1,560
eng-sot 28,844 3,607 3,606
eng-swa 28,084 3,511 3,512
eng-xho 26,091 3,263 3,262
eng-zul 29,127 3,641 3,642
PidginUNMT eng-pcm 1,682 211 211
SALT All-pairs 20,006 2,501 2,502
Table B.1: Statistics of the MT data in our benchmark. Each SALT language pair (All-pairs) has the same amount of data. The pairs are ach-eng, ach-lgg, ach-lug, ach-nyn, ach-teo, eng-lgg, eng-lug, eng-nyn, eng-teo, lgg-teo, lug-lgg, lug-teo, nyn-lgg, nyn-lug, and nyn-teo.
Task Cluster Test Set Source Train Dev Test
Cloze test 517 languages Ours 103,400 25,850 51,700
Paraphrase Multilingual†† Scherrer (2020) 22,390 2,797 2,794
Berber 17,607 2,200 2,200
Kabyle 4,441 555 555
Question Answering Swahili Clark et al. (2020) 49,881 499 n/a
Summarization Multilingual† Hasan et al. (2021) 63,040 7,875 7,875
Amharic 5,761 719 719
Igbo 4,183 522 522
Oromo 6,063 757 757
Rundi 5,746 718 718
Swahili 7,898 987 987
Yorùbá 6,350 793 793
Hausa 6,418 802 802
Nigerian Pidgin 9,208 1,151 1,151
Somali 5,962 745 745
Tigrinya 5,451 681 681
Multilingual⋆† Ours n/a n/a 428
Title Generation Multilingual† Hasan et al. (2021) 63,040 7,875 7,875
Amharic 5,761 719 719
Igbo 4,183 522 522
Oromo 6,063 757 757
Rundi 5,746 718 718
Swahili 7,898 987 987
Yorùbá 6,350 793 793
Hausa 6,418 802 802
Nigerian Pidgin 9,208 1,151 1,151
Somali 5,962 745 745
Tigrinya 5,451 681 681
Multilingual⋆ Ours n/a n/a 5,899
Table B.2: Statistics of the data in our benchmark. †† includes amh, ber, kab, and run. † includes amh, ibo, orm, run, swa, yor, hau, pcm, som, and tir. ⋆† is a newly created summarization test set covering ‘hau’, ‘nde’ (zero-shot), and ‘swa’. ⋆ is a newly created test set across 15 languages: ‘amh’, ‘gag’ (zero-shot), ‘hau’, ‘ibo’, ‘pcm’, ‘som’, ‘swa’, ‘tir’, ‘yor’, ‘kin’ (zero-shot), ‘afr’, ‘mlg’ (zero-shot), ‘orm’, ‘nde’ (zero-shot), and ‘sna’ (zero-shot).
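As a quick sanity check on the cloze-test row of Table B.2, the sketch below divides each split size by the 517 languages. The split sizes are copied from the table; the reading that every language contributes the same number of examples is our assumption, and the function name is hypothetical.

```python
# Cloze-test split sizes copied from Table B.2.
NUM_LANGUAGES = 517
CLOZE_SPLITS = {"train": 103_400, "dev": 25_850, "test": 51_700}


def per_language_sizes(splits, num_languages):
    """Return examples per language for each split, assuming an even division."""
    sizes = {}
    for name, total in splits.items():
        per_lang, remainder = divmod(total, num_languages)
        if remainder:
            raise ValueError(f"{name} split is not evenly divisible")
        sizes[name] = per_lang
    return sizes


if __name__ == "__main__":
    # Prints {'train': 200, 'dev': 50, 'test': 100}
    print(per_language_sizes(CLOZE_SPLITS, NUM_LANGUAGES))
```

All three splits divide evenly, which is consistent with (but does not prove) a fixed per-language budget of 200 training, 50 development, and 100 test examples.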

B.1 CHRF and CHRF++ Results

Task Metric mT0 mT5 afri-mt5 AfriTeVa Cheetah
Translate English to Afrikaans Chrf 26.97±4.75 26.11±4.12 14.66±8.79 20.75±4.02 39.88±0.81
Translate English to Bemba Chrf 10.27±0.89 6.39±1.96 20.23±13.97 9.94±10.05 15.76±0.19
Translate English to Rundi Chrf 21.51±1.39 17.56±3.13 24.91±3.59 31.58±2.33 28.65±3.55
Translate English to Sesotho Chrf 21.08±3.54 12.08±10.91 23.75±4.77 29.57±1.61 29.05±2.41
Translate English to Swahili Chrf 23.26±0.16 20.35±4.87 24.60±0.2 20.5±4.88 37.24±0.04
Translate English to Xhosa Chrf 27.44±3.1 25.88±4.94 34.97±2.49 20.25±15.35 33.45±0.21
Translate English to Zulu Chrf 27.12±3.49 21.54±2.16 37.8±1.41 25.39±16.55 43.75±0.11
Translate English to Hausa Chrf 28.53±0.26 27.65±0.53 19.99±0.42 31.68±0.29 34.9±0.32
Translate English to Igbo Chrf 40.31±0.17 37.18±0.34 22.01±0.7 33.24±0.23 44.37±0.31
Translate English to Luganda Chrf 25.94±2.41 23.33±0.31 15.57±1.45 24.16±2.55 36.22±0.09
Translate English to N. Pidgin Chrf 63.49±0.05 63.9±0.1 24.79±0.68 53.76±0.01 62.95±0.17
Translate English to Swahili Chrf 50.52±3.33 51.76±0.12 21.00±0.7 44.84±0.33 56.36±0.15
Translate English to Setswana Chrf 30.89±0.36 16.62±0.22 13.17±1.73 23.75±0.45 35.87±0.64
Translate English to Twi Chrf 23.56±0.24 15.8±1.29 12.74±1.33 17.47±3.26 25.89±0.2
Translate English to Yoruba Chrf 19.41±1.97 16.51±0.38 11.49±0.29 20.62±0.36 25.09±0.07
Translate English to Zulu Chrf 35.4±1.27 16.13±7.84 15.04±1.1 12.75±0.56 38.81±0.21
Translate French to Bambara Chrf 16.49±0.39 7.44±1.12 10.16±1.58 19.41±0.53 19.91±0.05
Translate French to Ghomálá’ Chrf 8.3±0.76 6.53±0.57 6.72±3.75 13.16±0.4 8.57±3.15
Translate French to Ewe Chrf 10.19±2.32 5.46±3.02 6.96±3.02 13.44±1.64 21.6±0.22
Translate French to Fon Chrf 5.67±2.65 6.09±0.72 5.82±1.58 11.88±1.83 12.71±0.41
Translate French to Moore Chrf 7.86±1.43 5.16±2.20 7.79±0.97 11.42±0.7 12.34±0.56
Translate French to Wolof Chrf 17.55±0.2 3.15±0.12 11.26±1.91 17.58±0.44 16.67±0.21
Translate English to N. Pidgin (pidginUNMT) Chrf 41.83±0.17 37.12±0.77 21.65±1.33 39.04±0.50 40.2±0.17
Translate Acholi to English Chrf 39.12±0.1 33.07±5.49 21.65±1.33 34.19±0.06 42.17±0.05
Translate Acholi to Lugbara Chrf 25.05±0.85 20.61±5.92 28.71±0.34 34.01±0.29 32.31±1.11
Translate Acholi to Luganda Chrf 22.13±0.63 25.75±0.02 24.31±0.1 32.77±0.68 37.34±0.47
Translate Acholi to Nyankore Chrf 27.52±0.45 20.03±3.88 24.50±0.02 32.39±0.92 35.0±0.33
Translate Acholi to Ateso Chrf 26.0±1.99 22.16±1.63 28.33±0.01 35.37±0.61 34.62±1.05
Translate English to Lugbara Chrf 38.84±0.01 37.12±0.77 39.11±0.01 38.94±0.3 40.2±0.17
Translate English to Luganda Chrf 43.71±0.08 41.05±0.19 35.34±1.11 43.14±0.22 49.38±0.02
Translate English to Nyankore Chrf 40.43±0.21 38.38±0.13 36.8±0.07 40.36±0.17 43.67±0.32
Translate English to Ateso (salt) Chrf 41.98±0.13 38.91±0.05 39.76±1.35 42.1±0.42 42.96±0.48
Translate Lugbara to Ateso Chrf 22.67±1.51 20.47±0.7 28.13±0.58 34.3±0.64 29.04±0.3
Translate Luganda to Lugbara Chrf 28.65±1.5 25.74±0.5 30.87±0.12 34.26±0.24 34.94±0.6
Translate Luganda to Ateso Chrf 31.74±0.22 27.66±0.64 34.04±0.01 37.19±0.07 39.05±0.49
Translate Nyankore to Lugbara Chrf 27.47±0.45 24.63±0.76 15.01±0.01 33.17±0.21 33.2±0.19
Translate Nyankore to Luganda Chrf 39.34±0.14 37.34±0.16 35.26±0.13 40.48±0.63 45.29±0.01
Translate Nyankore to Ateso Chrf 28.6±0.11 24.64±1.05 30.69±0.16 34.37±0.14 35.52±0.64
Average 28.07 23.88 22.62 28.77 34.08
Table B.3: Performance of various models on MT data using CHRF
Task Metric mT0 mT5 afri-mt5 AfriTeVa Cheetah
Translate English to Afrikaans Chrf++ 22.86±3.74 22.32±2.80 11.62±6.72 17.27±2.91 34.02±0.7
Translate English to Bemba Chrf++ 9.04±0.79 5.46±1.78 23.65±1.87 7.85±7.45 13.9±0.13
Translate English to Rundi Chrf++ 18.06±1.16 14.41±2.53 20.36±2.88 25.39±1.57 23.94±3.03
Translate English to Sesotho Chrf++ 17.34±3.09 10.2±8.75 19.31±3.94 23.85±1.43 23.9±2.03
Translate English to Swahili Chrf++ 18.5±0.31 16.28±4.48 19.42±2.2 16.16±3.93 30.6±0.11
Translate English to Xhosa Chrf++ 21.34±2.66 19.96±4.05 26.94±1.92 15.76±11.49 27.0±1.01
Translate English to Zulu Chrf++ 21.14±2.6 17.32±3.17 28.97±1.14 19.29±12.69 40.97±1.10
Translate English to Hausa Chrf++ 25.98±0.27 25.22±0.5 18.28±0.41 28.56±0.22 32.23±0.29
Translate English to Igbo Chrf++ 37.82±0.15 34.8±0.32 20.25±0.68 29.89±0.22 41.87±0.31
Translate English to Luganda Chrf++ 23.15±2.19 20.74±0.36 13.43±1.28 20.27±2.21 33.12±0.08
Translate English to N. Pidgin Chrf++ 60.57±0.15 60.12±0.07 23.85±0.64 49.72±0.36 59.74±0.18
Translate English to Swahili Chrf++ 47.67±3.33 48.95±0.13 19.01±1.69 40.84±0.31 53.67±0.15
Translate English to Setswana Chrf++ 29.02±0.35 14.87±0.16 11.77±1.61 21.25±0.36 34.05±0.64
Translate English to Twi Chrf++ 21.25±0.22 13.63±1.18 11.7±1.13 15.39±3.02 23.96±0.2
Translate English to Yoruba Chrf++ 18.41±1.89 15.47±0.4 10.19±0.25 18.99±0.27 24.1±0.06
Translate English to Zulu Chrf++ 30.99±1.13 13.86±6.85 11.34±2.1 10.58±0.77 34.31±0.2
Translate French to Bambara Chrf++ 15.75±0.36 6.8±0.97 10.2±1.41 18.28±0.49 19.65±0.14
Translate French to Ghomálá’ Chrf++ 7.0±0.77 5.64±0.44 5.84±3.04 11.13±0.34 7.28±2.83
Translate French to Ewe Chrf++ 9.09±2.21 4.75±2.76 6.56±3.19 11.72±1.4 20.53±0.23
Translate French to Fon Chrf++ 5.24±2.33 5.57±0.63 5.28±1.38 10.94±1.93 11.76±0.45
Translate French to Moore Chrf++ 7.08±1.33 4.63±2.02 7.18±0.79 10.31±0.64 11.2±0.54
Translate French to Wolof Chrf++ 16.27±0.24 2.65±0.11 10.23±1.73 15.73±0.33 15.58±0.19
Translate English to N. Pidgin (pidginUNMT) Chrf++ 42.12±0.18 37.67±1.64 22.53±1.31 28.38±0.98 39.58±0.49
Translate Acholi to English Chrf++ 37.96±0.1 27.18±0.36 28.24±0.38 31.83±0.07 41.06±0.06
Translate Acholi to Lugbara Chrf++ 23.41±0.84 19.57±5.04 27.18±0.36 31.45±0.29 30.68±1.02
Translate Acholi to Luganda Chrf++ 25.67±0.34 19.59±0.56 21.52±0.02 28.52±0.63 33.93±0.48
Translate Acholi to Nyankore Chrf++ 24.02±0.41 17.35±3.35 21.38±0.23 27.73±0.84 31.04±0.29
Translate Acholi to Ateso Chrf++ 23.65±1.87 20.07±1.53 25.81±0.04 31.56±0.57 31.83±0.99
Translate English to Lugbara Chrf++ 36.83±0.03 38.3±0.13 37.29±0.12 34.3±0.77 35.85±0.01
Translate English to Luganda Chrf++ 40.1±0.06 37.56±0.19 32.18±1.05 38.28±0.2 45.82±0.04
Translate English to Nyankore Chrf++ 35.93±0.18 34.07±0.12 32.59±0.05 34.88±0.15 39.17±0.33
Translate English to Ateso (salt) Chrf++ 37.98±0.11 38.93±0.01 36.83±1.23 37.85±0.4 39.87±0.47
Translate Lugbara to Ateso Chrf++ 20.55±1.38 18.54±0.65 25.6±0.64 30.48±0.59 26.43±0.32
Translate Luganda to Lugbara Chrf++ 26.79±1.49 23.94±0.48 29.13±0.11 31.56±0.24 33.04±0.58
Translate Luganda to Ateso Chrf++ 28.94±0.22 25.11±0.59 31.26±0.01 33.18±0.05 35.99±0.45
Translate Nyankore to Lugbara Chrf++ 22.89±0.73 25.75±0.44 12.07±0.11 30.54±0.2 31.35±0.2
Translate Nyankore to Luganda Chrf++ 35.7±0.12 33.73±0.15 31.99±0.07 35.74±0.54 41.63±0.0
Translate Nyankore to Ateso Chrf++ 26.03±0.08 22.35±0.98 28.05±0.09 30.53±0.13 32.65±0.62
Average 25.58 21.67 20.50 25.16 31.24
Table B.4: Performance of various models on MT data using CHRF++
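To make the two metrics in Tables B.3 and B.4 concrete, the following is a minimal, simplified sketch of a sentence-level chrF score and its chrF++ variant. It assumes the common settings of character n-grams up to order 6, word n-grams up to order 2 for chrF++, and β = 2; the function names are ours. It is not the evaluation script used for the tables and is not expected to reproduce those numbers exactly.

```python
from collections import Counter


def ngrams(items, n):
    """Count the n-grams of a sequence (characters or words)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def chrf(hypothesis, reference, char_order=6, word_order=0, beta=2.0):
    """Simplified sentence-level chrF (word_order=0) or chrF++ (word_order=2)."""
    precisions, recalls = [], []
    # Character n-grams are taken over the text with whitespace removed.
    units = [
        (list("".join(hypothesis.split())), list("".join(reference.split())), char_order),
        (hypothesis.split(), reference.split(), word_order),
    ]
    for hyp_items, ref_items, max_n in units:
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_items, n), ngrams(ref_items, n)
            if not hyp_counts or not ref_counts:
                continue  # skip orders longer than either sequence
            overlap = sum((hyp_counts & ref_counts).values())
            precisions.append(overlap / sum(hyp_counts.values()))
            recalls.append(overlap / sum(ref_counts.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta=2 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Calling `chrf(hyp, ref)` gives the character-only score, and `chrf(hyp, ref, word_order=2)` adds word unigrams and bigrams. Because whole-word matches are harder to obtain than character matches, the chrF++ value is usually lower, in line with the lower averages in Table B.4.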

Appendix C Linguistic Details

Morphology. Both Hausa and Swahili are classified as agglutinative languages Jaggar (2017); Dryer and Haspelmath (2013), characterized by the systematic addition of prefixes, suffixes, and other affixes to roots or stems. These affixes encode grammatical meanings such as tense, case, mood, person, and number. Conversely, Yorùbá has an analytic structure, relying on word order and discrete function words to mark grammatical relationships, with minimal use of inflection or affixation. The following are examples from the generated (1) Hausa, (2) Swahili, and (3) Yorùbá output, respectively.

(1) a. Bai barshi ba
       neg.masculine leave at-all
       ‘He did not leave him’
    b. Bata barshi ba
       neg.feminine leave at-all
       ‘She did not leave him’

(2) a. Ha-ku-mu-a-cha
       3pl.sg.sub-neg-3pl.sg.obj-leave
       ‘He did not leave him’
    b. Ha-ku-mu-a-cha
       3pl.sg.sub-neg-3pl.sg.obj-leave
       ‘She did not leave him’

(3) a. Òhun ò kúrò lọ́dọ̀ ẹ̀
       3pl.sg.sub neg leave from 3pl.sg.obj
       ‘He did not leave him’
    b. Òhun ò kúrò lọ́dọ̀ ẹ̀
       3pl.sg.sub neg leave from 3pl.sg.obj
       ‘She did not leave him’

Phonology. Yorùbá and Hausa are tonal languages, in which pitch distinctions differentiate words. However, Hausa has a simpler tone system than Yorùbá, and in most cases tone is not marked in Hausa orthography; only dictionaries and pedagogical materials indicate tone in text. Yorùbá, on the other hand, has three tones, and marking them in the orthography significantly reduces ambiguity Adebara and Abdul-Mageed (2022). Swahili, in contrast, has no tones.
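The effect of tone marking on ambiguity can be illustrated with a short sketch. The code below removes the acute and grave accents that mark high and low tone in Yorùbá orthography while keeping the dot below, and shows three distinct words collapsing into one spelling. The word forms are commonly cited illustrative examples chosen by us, not drawn from our data, and the function name is hypothetical.

```python
import unicodedata

# Combining marks used for Yorùbá tone: grave (low) and acute (high).
TONE_MARKS = {"\u0300", "\u0301"}


def strip_tone_marks(text):
    """Remove tone diacritics but keep other marks such as the dot below."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


# Illustrative word forms (our own choice), written with escapes for clarity.
words = {
    "\u1ecdk\u1ecd": "husband",        # ọkọ, no tone mark on either vowel
    "\u1ecdk\u1ecd\u0301": "hoe",      # ọkọ́, high tone on the final vowel
    "\u1ecdk\u1ecd\u0300": "vehicle",  # ọkọ̀, low tone on the final vowel
}

# Without tone marks, all three words share a single spelling.
collapsed = {strip_tone_marks(word) for word in words}

if __name__ == "__main__":
    print(len(words), "words ->", len(collapsed), "spelling")
    print(strip_tone_marks("Yorùbá"))  # prints "Yoruba"
```

A model trained on text without tone marks has to recover such distinctions from context alone, which is one reason orthographic tone marking matters more for Yorùbá than for Hausa.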

Appendix D Cloze Task Results

We provide results on the performance of each model on individual languages. We use a dash ’-’ to indicate that a specific model does not support a language.

ISO MT0 MT5 AfriMT5 AfriTeVa Cheetah
afr 0 0 - - 20.45
amh 0 0 - 0 0
bam - - 0 - 0
bbj - 5.21 0 - 8.45
ewe - - 0 - 0
fon - - 0 - 0
hau 0 0 0 0 13.41
ibo 0 0 0 0 0
lin 0 - - - 25.35
lug - - 0 - 0
luo - - 0 - 9.35
mos - - 0 - 14.53
mlg 0 0 - - 15.65
nya - - - - 7.64
nyj - - -
orm 0 - - - 0
pcm - - 0 0 10.10
sna 0 0 - - 0
som 0 0 - 0 10.39
sot 4.69 - - - 15.23
swa - - 0 0 7.02
swh - -
tir - - - - 6.33
tsn - - 0 - 0
twi - - 0 - 0
wol - - 0 - 0
xho 0 0 - - 6.92
yor 0 3.61 0 0 6.42
zul 0 0 0 - 8.05
Table D.1: BLEU scores for the mask-one cloze task on the union of languages represented in the four models we compare Cheetah with. Red marks zero-shot performance greater than 0.
ISO MT0 MT5 AfriMT5 AfriTeVa Cheetah
afr 0 0 - - 0
amh 0 0 - 0 0
bam - - 0 - 0
bbj - - 0 - 0
ewe - - 0 - 0
fon - - 0 - 0
hau 0 - 0 0 6
ibo 0 - 0 0 8
lin 0 - - - 0
lug - - 0 - 0
luo - - 0 - 0
mos - - 0 - 0
mlg 0 - - - 0
nya 0 0 - - 12
nyj - - -
orm 0 - 0 - 0
pcm - - 0 0 0
sna 0 0 - - 0
som 0 0 - 0 4
sot - - - - 10
swa - - 0 0 12
swh - -
tir - - 0 0 0
tsn - - 0 - 0
twi - - 0 - 0
wol - - 0 - 0
xho 0 0 0 - 6
yor 0 0 0 0 0
zul 0 0 0 - 0
Table D.2: BLEU scores for the mask-at-least-one cloze task on the union of languages represented in the four models we compare Cheetah with.
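To make the two task names concrete, the following is a hedged sketch of how mask-one and mask-at-least-one inputs could be built from a sentence. It is not the authors' implementation: word-level masking, the `<mask>` placeholder, the `rate` parameter, and all function names are assumptions made here for illustration.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not the model's actual sentinel


def mask_one(words, rng):
    """Hide exactly one randomly chosen word."""
    i = rng.randrange(len(words))
    masked = list(words)
    masked[i] = MASK
    return masked, [words[i]]


def mask_at_least_one(words, rng, rate=0.3):
    """Hide each word with probability `rate`, but never fewer than one."""
    positions = [i for i in range(len(words)) if rng.random() < rate]
    if not positions:
        # Fall back to a single mask so the example is never left unmasked.
        positions = [rng.randrange(len(words))]
    masked = [MASK if i in positions else w for i, w in enumerate(words)]
    answers = [words[i] for i in positions]
    return masked, answers


def fill(masked, answers):
    """Put the answers back in order; used to check the masking is lossless."""
    remaining = iter(answers)
    return [next(remaining) if w == MASK else w for w in masked]


rng = random.Random(0)  # fixed seed for reproducibility
words = "Bai barshi ba".split()
one_masked, one_answers = mask_one(words, rng)
many_masked, many_answers = mask_at_least_one(words, rng)
print(" ".join(one_masked), "->", one_answers)
print(" ".join(many_masked), "->", many_answers)
```

In a setup like this, the model's predictions for the hidden words would be compared against `answers`, for example with BLEU as in the tables above.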

Appendix E Annotation

We gave the following annotation rules to our annotators: Faithfulness refers to how close the model output is to the English source sentence. It should be annotated with a value between 1 and 5. Faithfulness should be evaluated independently of the fluency of the model output. Below are detailed explanations of the faithfulness scale:

  • Give a value 1 if the model output is not related to the source sentence.

  • Give a value 2 if the model output is the opposite of the source sentence.

  • Give a value 3 if the model output is somewhat related to the source sentence. It should have some words or phrases that make it related to the source.

  • Give a value 4 if the model output is closely related but changes the meaning slightly (e.g., a difference in gender, number, etc.).

  • Give a value 5 if the model output is an exact translation.

Fluency is how grammatically correct the model output is. Faithfulness and fluency should be judged independently. That is, even if the output is not faithful, don’t use that to determine the fluency score, and vice versa. Here are detailed explanations of how to assign the values:

  • Give a value 1 if the model output is completely ungrammatical and nonsensical.

  • Give a value 2 if the model output is reasonable but includes some foreign words or gibberish.

  • Give a value 3 if the model output contains some grammatical phrases but also contains some ungrammatical phrases.

  • Give a value 4 if the model output is almost grammatical (but may have a few errors like spelling mistakes).

  • Give a value 5 if the model output is very fluent and reads like what a native speaker would say.
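The two scales above can be encoded as a small data structure for checking and summarizing annotations. This is a sketch only: the names `FAITHFULNESS`, `FLUENCY`, and `summarize` are hypothetical, and averaging the scores is an assumption made here, not a description of how the paper aggregates its human evaluation.

```python
from statistics import mean

# Short labels paraphrasing the 1-5 scales given to annotators.
FAITHFULNESS = {
    1: "not related to the source sentence",
    2: "opposite of the source sentence",
    3: "somewhat related to the source sentence",
    4: "closely related, meaning changed slightly",
    5: "exact translation",
}
FLUENCY = {
    1: "completely ungrammatical and nonsensical",
    2: "reasonable, with some foreign words or gibberish",
    3: "mix of grammatical and ungrammatical phrases",
    4: "almost grammatical, a few errors",
    5: "very fluent, native-like",
}


def summarize(annotations):
    """Validate (faithfulness, fluency) pairs and return their mean scores.

    The two criteria are scored independently, so each is checked against
    its own scale and averaged separately (averaging is an assumption).
    """
    for faithfulness, fluency in annotations:
        if faithfulness not in FAITHFULNESS:
            raise ValueError(f"invalid faithfulness score: {faithfulness}")
        if fluency not in FLUENCY:
            raise ValueError(f"invalid fluency score: {fluency}")
    return {
        "faithfulness": mean(f for f, _ in annotations),
        "fluency": mean(g for _, g in annotations),
    }


# Made-up scores for three outputs, for illustration only.
summary = summarize([(5, 4), (3, 3), (4, 5)])
print(summary)
```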

Appendix F Results on Quality Evaluation

Figure F.1: Faithfulness and fluency for Intransitives in Hausa, Swahili, and Yorùbá
Figure F.2: Faithfulness and fluency for Transitives in Hausa, Swahili, and Yorùbá
Figure F.3: Performance on some intransitive examples in the Yorùbá test set. Correct words have no highlight, plausible words or phrases are highlighted in yellow, and wrong words or phrases are highlighted in grey. We use plausible to refer to words or phrases that can be used in place of the gold or that give additional information.
Figure F.4: Faithfulness and fluency for Intransitives + Negation in Hausa, Swahili, and Yorùbá
Figure F.5: Faithfulness and fluency for Transitives + Negation in Hausa, Swahili, and Yorùbá