arXiv:2304.09972v2 [cs.CL] 20 Sep 2023

MasakhaNEWS: News Topic Classification for African languages

David Ifeoluwa Adelani1*, Marek Masiak1*, Israel Abebe Azime2, Jesujoba Oluwadara Alabi2, Atnafu Lambebo Tonja3,6, Christine Mwase4, Odunayo Ogundepo5, Bonaventure F. P. Dossou6,7,8,9, Akintunde Oladipo5, Doreen Nixdorf, Chris Chinenye Emezue9,10, Sana Sabah al-azzawi11, Blessing K. Sibanda, Davis David12, Lolwethu Ndolela, Jonathan Mukiibi13, Tunde Oluwaseyi Ajayi14, Tatiana Moteu Ngoli15, Brian Odhiambo, Abraham Toluwase Owodunni, Nnaemeka C. Obiefuna, Muhidin Mohamed16, Shamsuddeen Hassan Muhammad17, Teshome Mulugeta Ababu18, Saheed Salahudeen Abdullahi19, Mesay Gemeda Yigezu3, Tajuddeen Gwadabe, Idris Abdulmumin20, Mahlet Taye Bame, Oluwabusayo Olufunke Awoyomi21, Iyanuoluwa Shode22, Tolulope Anu Adelani, Habiba Abdulganiy Kailani, Abdul-Hakeem Omotayo23, Adetola Adeeko, Afolabi Abeeb, Anuoluwapo Aremu, Olanrewaju Samuel24, Clemencia Siro25, Wangari Kimotho26, Onyekachi Raphael Ogbu, Chinedu E. Mbonu27, Chiamaka I. Chukwuneke27,28, Samuel Fanijo29, Jessica Ojo, Oyinkansola F. Awosan, Tadesse Kebede Guge30, Sakayo Toadoum Sari26,31, Pamela Nyatsine, Freedmore Sidume32, Oreen Yousuf, Mardiyyah Oduwole33, Kanda P. Tshinu, Ussen Kimanuka34, Thina Diko, Siyanda Nxakama, Sinodos Gebre18, Abdulmejid Tuni Johar, Shafie Abdi Mohamed34, Fuad Mire Hassan35, Moges Ahmed Mehamed36, Evrard Ngabire37, Jules Jules, Ivan Ssenkungu, and Pontus Stenetorp1

∀Masakhane NLP, Africa, 1University College London, United Kingdom, 2Saarland University, Germany, 3Instituto Politécnico Nacional, Mexico, 4Fudan University, China, 5University of Waterloo, Canada, 6Lelapa AI, 7McGill University, Canada, 8Mila Quebec AI Institute, Canada, 9Lanfrica, 10Technical University of Munich, Germany, 11Luleå University of Technology, Sweden, 12Tanzania Data Lab, Tanzania, 13Makerere University, Uganda, 14Insight Centre for Data Analytics, Ireland, 15Paderborn University, Germany, 16Aston University, UK, 17University of Porto, Portugal, 18Dire Dawa University, Ethiopia, 19Kaduna State University, Nigeria, 20Ahmadu Bello University, Nigeria, 21The College of Saint Rose, USA, 22Montclair State University, USA, 23University of California, Davis, 24University of Rwanda, Rwanda, 25University of Amsterdam, The Netherlands, 26AIMS, Cameroon, 27Nnamdi Azikiwe University, Nigeria, 28Lancaster University, United Kingdom, 29Iowa State University, USA, 30Haramaya University, Ethiopia, 31AIMS, Senegal, 32BIUST, Botswana, 33NOUN, Nigeria, 34PAUSTI, Kenya, 34Jamhuriya University, Somalia, 35Somali National University, 36Wuhan University of Technology, China, 37Deutschzentrum an der Universität Burundi

*Equal contribution. Correspondence: d.adelani@ucl.ac.uk

Abstract

Despite representing roughly a fifth of the world population, African languages are underrepresented in NLP research, in part due to a lack of datasets. While there are individual language-specific datasets for several tasks, only a handful of tasks (e.g. named entity recognition and machine translation) have datasets covering geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS, the largest dataset for news topic classification, covering 16 languages widely spoken in Africa. We provide and evaluate a set of baseline models by training classical machine learning models and fine-tuning several language models.
Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (MAD-X), pattern exploiting training (PET), prompting language models (ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and the co:here embedding API). Our evaluation in a few-shot setting shows that with as little as 10 examples per label, we achieve more than 90% (i.e. 86.0 F1 points) of the performance of fully supervised training (92.6 F1 points) by leveraging the PET approach. Our work shows that existing supervised approaches work well for all African languages and that language models with only a few supervised samples can reach competitive performance, both findings which demonstrate the applicability of existing NLP techniques to African languages.

1 Introduction

News topic classification is a text classification task in NLP that involves categorizing news articles into different categories like sports, business, entertainment, and politics. It has shaped the development of several machine learning algorithms over the years, such as topic modeling (Blei et al., 2001; Dieng et al., 2020) and deep learning models (Zhang et al., 2015; Joulin et al., 2017). Similarly, news topic classification is a popular downstream task for evaluating the performance of large language models (LLMs) in both fine-tuning and prompt-tuning setups (Yang et al., 2019; Sun et al., 2019; Brown et al., 2020; Liu et al., 2023).

Despite the popularity of the task in benchmarking LMs, most of these evaluations have only been performed on English and a few other high-resource languages, and it is unclear how this approach extends to pre-trained multilingual language models for low-resource languages. For instance, BLOOM (Scao et al., 2022) was pre-trained on 46 languages, including 22 African languages (mostly from the Niger-Congo family), yet an extensive evaluation on this set of African languages was not performed due to a lack of evaluation datasets. In general, only a handful of NLP tasks, such as machine translation (Adelani et al., 2022a; NLLB-Team et al., 2022), named entity recognition (Adelani et al., 2021, 2022b), and sentiment classification (Muhammad et al., 2023), have standardized benchmark datasets covering several geographically and typologically diverse African languages. Another popular task that can be used for evaluating the downstream performance of language models is news topic classification, but human-annotated datasets for benchmarking topic classification of African languages are scarce.

In this paper, we address two problems: the lack of evaluation datasets and the lack of extensive evaluation of LMs for African languages. We create a large-scale news topic classification dataset covering 16 typologically diverse languages widely spoken in Africa, including English and French, with the same label categories across all languages. Our dataset is also suitable for the news headline generation task (Aralikatte et al., 2023), a special type of text summarization. We provide several baseline models using both classical machine learning approaches and fine-tuned LMs. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning
(e.g. 5 examples per label), such as cross-lingual parameter-efficient fine-tuning (MAD-X; Pfeiffer et al., 2020), pattern exploiting training (PET; Schick and Schütze, 2021a), prompting the ChatGPT LLM, prompt-free sentence transformer fine-tuning (SetFit; Tunstall et al., 2022a), and the co:here embedding API.

Our evaluation in a zero-shot setting shows the potential of prompting ChatGPT for news topic classification for low-resource African languages. We found that GPT-3.5-Turbo achieves impressive results for languages that make use of the Latin script, but performs poorly for languages written in non-Latin scripts, like Amharic and Tigrinya. GPT-4, however, was able to overcome this challenge, with impressive performance on non-Latin scripts that matches the results of cross-lingual transfer from a related African language. In a few-shot setting, we show that with as little as 10 examples per label, we achieve more than 90% (i.e. 86.0 F1 points) of the performance of fully supervised training (92.6 F1 points) by leveraging the PET approach. We hope this encourages the NLP community to benchmark and evaluate LLMs on more low-resource languages. For reproducibility, we release our data and code under an academic license or CC BY-NC 4.0 on GitHub.1

1 https://github.com/masakhane-io/masakhane-news

2 Related Work

News topic classification, an application of text classification, is a popular task in natural language processing. There are various news topic classification datasets, including BBC News (Greene and Cunningham, 2006), AG News (Zhang et al., 2015), and the multimodal N24News (Wang et al., 2022), all of which are English datasets. In addition, there is IndicNLP News (Kunchukuttan et al., 2020), a multilingual dataset for Indian languages. For African languages, only a handful of human-annotated datasets exist, such as the Hausa & Yorùbá dataset (Hedderich et al., 2020) (covering only news headlines), the KINNEWS & KIRNEWS datasets for Kinyarwanda and Kirundi (Niyongabo et al., 2020), and Tigrinya News (Fesseha et al., 2021). Others are semi-automatically created using pre-defined topics from news websites, like Amharic news (Azime and Mohammed, 2021) and the ANTC dataset (Alabi et al., 2022), which covers five African languages (Lingala, Somali, Naija, Malagasy, and isiZulu). These datasets, however, have limitations due to the fact that they were created with little or no human supervision and use different labeling schemes. In contrast, in this work we present news topic classification data for 16 typologically diverse African languages with a consistent labeling scheme across all languages.

Prompting language models using manually designed prompts to guide text generation has recently been applied to a myriad of NLP tasks, including topic classification. Models such as GPT-3 (Brown et al., 2020) and T5 (Raffel et al., 2020; Sanh et al., 2022) are able to learn structural and semantic relationships between words and have shown impressive results even in multilingual scenarios when tuned for different tasks (Chung et al., 2022; Muennighoff et al., 2023). One approach to prompt-tuning a language model for topic classification is to design a "template" for classification and insert a sequence of text into the template (Gao et al., 2021; Shin et al., 2020). There are also approaches to few-shot learning without prompting. One of them is SetFit (Tunstall et al., 2022a), which takes advantage of sentence transformers to generate dense representations for input sequences.
These representations are then passed through a classifier to predict class labels. The sentence transformers are trained on a few examples using contrastive learning, where positive and negative training pairs are sampled by in-class and out-of-class sampling. Another common approach is Pattern-Exploiting Training, also known as PET (Schick and Schütze, 2021a). PET is a semi-supervised training approach that uses restructured input sequences to condition language models to better understand a given task, while iPET (Schick and Schütze, 2021b) is an iterative variant of PET that has been shown to perform even better.

3 Languages

Table 1 presents the languages covered in MasakhaNEWS, along with information on their language families, their primary geographic regions in Africa, and their number of speakers. Our dataset consists of a total of 16 typologically diverse languages, selected based on the availability of publicly available news corpora in each language, the availability of native-speaking annotators, geographical diversity and, most importantly, because they are widely spoken in Africa. English and French are official languages in 42 African countries, Swahili is native to 12 countries, and Hausa is native to 6 countries. In terms of geographical diversity, we have four languages spoken in West Africa, seven languages spoken in East Africa, two languages spoken in Central Africa (i.e. Lingala and Kiswahili), and two spoken in Southern Africa (i.e. chiShona and isiXhosa). We also cover four language families: Niger-Congo (8), Afro-Asiatic (5), Indo-European (2), and English Creole (1). The only English creole language is Nigerian-Pidgin, also known as Naija. Each language is spoken by at least 10 million people, according to Ethnologue (Eberhard et al., 2021).

4 MasakhaNEWS dataset

4.1 Data Source

The data used in this study were sourced from multiple reputable news outlets. The collection process involved crawling the British Broadcasting Corporation (BBC) and Voice of America (VOA) websites. We crawled between 2k and 12k articles per language, depending on the number of articles available on the websites. Some of the websites already provide pre-defined categories, which we used to filter out articles that do not belong to the categories we planned to annotate. We took inspiration for our news categorization from BBC English, which has six pre-defined and well-defined categories ("business", "entertainment", "health", "politics", "sports", and "technology"), each with over 500 articles. For English, we only crawled articles belonging to these categories, while for the other languages we crawled all articles. Our target was to have around 3,000 articles for annotation, but three languages (Lingala, Rundi, and Somali) have fewer than that. Table 1 shows the news source per language and the number of articles crawled.

4.2 Data Annotation

We recruited volunteers from the Masakhane community, an African grassroots community focused on advancing NLP for African languages.2 The annotators were asked to label 3k articles using eight categories: "business", "entertainment", "health", "politics", "religion", "sports", "technology", and "uncategorized". Six of the categories are based on the major BBC English news categories; the "religion" label was added since many African news websites frequently cover this topic.

2 All annotators were included as authors of the paper.
| Language | Family/branch | Region | # speakers | News Source | # articles |
|---|---|---|---|---|---|
| Amharic (amh) | Afro-Asiatic / Ethio-Semitic | East Africa | 57M | BBC | 8,204 |
| English (eng) | Indo-European / Germanic | Across Africa | 1268M | BBC | 5,073 |
| French (fra) | Indo-European / Romance | Across Africa | 277M | BBC | 5,683 |
| Hausa (hau) | Afro-Asiatic / Chadic | West Africa | 77M | BBC | 6,965 |
| Igbo (ibo) | Niger-Congo / Volta-Niger | West Africa | 31M | BBC | 4,628 |
| Lingala (lin) | Niger-Congo / Bantu | Central Africa | 40M | VOA | 2,022 |
| Luganda (lug) | Niger-Congo / Bantu | East Africa | 11M | Gambuuze | 2,621 |
| Naija (pcm) | English Creole | West Africa | 121M | BBC | 7,783 |
| Oromo (orm) | Afro-Asiatic / Cushitic | East Africa | 37M | BBC | 7,782 |
| Rundi (run) | Niger-Congo / Bantu | East Africa | 11M | BBC | 2,995 |
| chiShona (sna) | Niger-Congo / Bantu | Southern Africa | 11M | VOA & Kwayedza | 11,146 |
| Somali (som) | Afro-Asiatic / Cushitic | East Africa | 22M | BBC | 2,915 |
| Kiswahili (swa) | Niger-Congo / Bantu | East & Central Africa | 71M-106M | BBC | 6,431 |
| Tigrinya (tir) | Afro-Asiatic / Ethio-Semitic | East Africa | 9M | BBC | 4,372 |
| isiXhosa (xho) | Niger-Congo / Bantu | Southern Africa | 19M | Isolezwe | 24,658 |
| Yorùbá (yor) | Niger-Congo / Volta-Niger | West Africa | 46M | BBC | 6,974 |

Table 1: Languages covered in MasakhaNEWS and data sources, including language family, primary region, number of L1 & L2 speakers, and number of articles crawled from each news source.

Other articles that do not belong to the first seven categories are assigned the "uncategorized" label.

For each language, the annotation followed two stages. In the first stage, we randomly shuffled the entire dataset and asked annotators to label the first 200 articles manually. In the second stage, we made use of active learning: we combined the first 200 annotated articles with articles with pre-defined labels where available, and trained a classifier (by fine-tuning AfroXLMR-base (Alabi et al., 2022)). We then ran predictions on the rest of the articles and asked annotators to correct the mistakes of the classifier. This approach helped to speed up the annotation process.

Annotation tool. We made use of an in-house annotation tool to label the articles; Appendix A shows an example of its interface. To reduce annotator effort, we asked annotators to label articles based on the headline instead of the entire article. However, since some headlines are not very descriptive, we concatenated the headline and the first two sentences of the news text to provide additional context to annotators.

Inter-annotator agreement score. We report the Fleiss kappa score (Fleiss, 1971) to measure annotation agreement. Table 2 shows that all languages have a moderate to almost perfect Fleiss kappa score (i.e. 0.55-0.89), which indicates high agreement among the annotators recruited for each language. Languages with only one annotator (i.e. Luganda and Rundi) were excluded from this evaluation.

Deciding a single label per article. After annotation, we assigned the final label to each article by majority voting: a label is assigned only if a minimum of two annotators agree on it. We made exceptions only for Luganda and Rundi, since each had a single annotator. The final dataset for each language consists of a minimum of 72 articles per topic and a maximum of 500, except for English, where the classes are roughly balanced. We excluded infrequent labels so as not to have a highly unbalanced dataset. The choice of a minimum of 72 articles ensures a minimum of 50 articles in the training set.3

3 We require 50 instances per class for the 50-shot few-shot experiments in §6.2.2.
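To make this aggregation rule concrete, here is a minimal sketch in Python (the function name and structure are illustrative, not taken from the released codebase; the single-annotator exception for Luganda and Rundi is handled separately, as described above):

```python
from collections import Counter

def final_label(annotations):
    """Majority vote over one article's annotations.

    A label is assigned only if at least two annotators agree on it;
    otherwise the article is left unresolved (None).
    """
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= 2 else None

# e.g. three annotators, two agree on "health":
print(final_label(["health", "health", "politics"]))  # -> "health"
print(final_label(["health", "sports", "politics"]))  # -> None
```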
Our target was to have at least four topics per language, each with a minimum of 72 articles. This worked smoothly except for two languages, Lingala ("politics", "health" and "sports") and chiShona ("business", "health" and "politics"), where only three topics had more than 72 articles. To obtain more articles per class for Lingala, we resolved annotation conflicts between the Lingala annotators so as to obtain more labels for the "business" category. For chiShona, this still left infrequent classes, so we crawled additional "sports" articles from a local chiShona website (Kwayedza), followed by manual filtering of unrelated sports news.

Data Split. Table 2 provides the data split per language, together with the distribution of articles by topic. We divided the annotated data into TRAIN, DEV and TEST splits following a 70% / 10% / 20% split ratio.

| Language | Train/Dev/Test | # topics | # bus | # ent | # health | # pol | # rel | # sport | # tech | # annotators | Fleiss Kappa |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Amharic (amh) | 1311/188/376 | 4 | 404 | - | 500 | 500 | - | 471 | - | 5 | 0.81 |
| English (eng) | 3309/472/948 | 6 | 799 | 750 | 746 | 821 | - | 1000 | 613 | 7 | 0.81 |
| French (fra) | 1476/211/422 | 5 | 500 | - | 500 | 500 | - | 500 | 109 | 3 | 0.83 |
| Hausa (hau) | 2219/317/637 | 7 | 399 | 500 | 493 | 500 | 493 | 497 | 291 | 5 | 0.85 |
| Igbo (ibo) | 1356/194/390 | 6 | 292 | 366 | 424 | 500 | 73 | 285 | - | 4 | 0.65 |
| Lingala (lin) | 608/87/175 | 4 | 82 | - | 193 | 500 | - | 95 | - | 2 | 0.56 |
| Luganda (lug) | 771/110/223 | 5 | 169 | - | 228 | 500 | 91 | 116 | - | 1 | - |
| Oromo (orm) | 1015/145/292 | 4 | - | 119 | 447 | 500 | - | 386 | - | 3 | 0.63 |
| Naija (pcm) | 1060/152/305 | 5 | 97 | 460 | 159 | 309 | - | 492 | - | 4 | 0.66 |
| Rundi (run) | 1117/159/322 | 6 | 76 | 158 | 372 | 500 | 73 | 419 | - | 1 | - |
| chiShona (sna) | 1288/185/369 | 4 | 500 | - | 425 | 500 | - | 417 | - | 3 | 0.63 |
| Somali (som) | 1021/148/294 | 7 | 114 | 139 | 354 | 500 | 73 | 148 | 135 | 3 | 0.55 |
| Kiswahili (swa) | 1658/237/476 | 7 | 316 | 98 | 500 | 500 | 292 | 500 | 165 | 4 | 0.72 |
| Tigrinya (tir) | 947/137/272 | 6 | 80 | 167 | 395 | 500 | - | 125 | 89 | 2 | 0.63 |
| isiXhosa (xho) | 1032/147/297 | 5 | 72 | 500 | 100 | 308 | - | 496 | - | 3 | 0.89 |
| Yorùbá (yor) | 1433/206/411 | 5 | - | 500 | 398 | 500 | 317 | 335 | - | 5 | 0.80 |

Table 2: MasakhaNEWS dataset: size of the annotated data, number of articles per news topic (business, entertainment, health, politics, religion, sport, technology; a dash indicates the topic is absent for that language), number of annotators, and Fleiss kappa agreement.

5 Baseline Experiments

We trained baseline text classification models on the concatenation of the news headline and the news text, using several different approaches.

5.1 Baseline Models

We trained three classical ML models, Naive Bayes, a multi-layer perceptron (MLP), and XGBoost, using the popular sklearn toolkit (Pedregosa et al., 2011). We employed the CountVectorizer method to represent the text data, which converts a collection of text documents into a matrix of token counts, giving us numerical feature vectors for the text.
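As an illustration of this classical pipeline, a minimal scikit-learn sketch follows (the toy Swahili examples and default model settings are illustrative, not the tuned configuration behind Table 3):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Token-count features (CountVectorizer) feeding an MLP, as described above.
texts = ["Timu ya taifa imeshinda mechi ya kwanza",
         "Rais ametangaza bajeti mpya ya serikali"]
labels = ["sports", "politics"]

clf = make_pipeline(CountVectorizer(), MLPClassifier(max_iter=300))
clf.fit(texts, labels)
print(clf.predict(["Wachezaji wameanza mazoezi leo"]))
```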
Furthermore, we fine-tune nine kinds of multilingual text encoders. Seven of them are BERT/RoBERTa-based, i.e. XLM-R (base & large) (Conneau et al., 2020), AfriBERTa-large (Ogueji et al., 2021), RemBERT (Chung et al., 2021), AfroXLMR (base & large) (Alabi et al., 2022), and AfroLM (Dossou et al., 2022); the other two are mDeBERTaV3 (He et al., 2021a) and LaBSE (Feng et al., 2022). mDeBERTaV3 pre-trains a DeBERTa-style model (He et al., 2021b) with the replaced token detection objective proposed in ELECTRA (Clark et al., 2020). LaBSE, on the other hand, is a multilingual sentence transformer model that is popular for mining parallel corpora for machine translation.

Finally, we fine-tuned four multilingual text-to-text (T2T) models: mT5-base (Xue et al., 2021), Flan-T5-base (Chung et al., 2022), AfriMT5-base (Adelani et al., 2022a), and AfriTeVA-base (Jude Ogundepo et al., 2022). The fine-tuning and evaluation of the multilingual text encoders and T2T models were performed using HuggingFace Transformers (Wolf et al., 2020) and PyTorch Lightning.4 The models were fine-tuned on an Nvidia V100 GPU for 20 epochs, with a batch size of 32, a learning rate of 1e-5/5e-5, and a maximum sequence length of 256.

4 https://pypi.org/project/pytorch-lightning/

The LMs evaluated were either massively multilingual (i.e. typically trained on over 100 languages from around the world) or African-centric (i.e. trained mostly on languages spoken in Africa). The African-centric multilingual text encoders are all modeled after XLM-R: AfriBERTa was pre-trained from scratch on 11 African languages; AfroXLMR was adapted to African languages by fine-tuning the original XLM-R model on 17 African languages and 3 languages commonly spoken in Africa; and AfroLM was pre-trained on 23 African languages utilizing active learning.

Similar to the multilingual text encoders, the T2T models used in this study were pre-trained on hundreds of languages, and they are all based on the T5 model (Raffel et al., 2020), an encoder-decoder model trained with a span-mask denoising objective. mT5 is a multilingual version of T5, and Flan-T5 was fine-tuned on multiple tasks using T5 as a base. The study also included adaptations of the original models, such as AfriMT5-base, as well as AfriTeVA-base, a T5 model pre-trained on 10 African languages.
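For reference, a minimal sketch of this fine-tuning recipe with HuggingFace Transformers, using the hyperparameters above. The dataset repository and field names are assumptions based on the released data and may differ; dataset loading, not the training recipe, is the speculative part:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed: the Hausa subset on the Hugging Face Hub with "headline",
# "text" and a ClassLabel "label" field.
data = load_dataset("masakhane/masakhanews", "hau")
tokenizer = AutoTokenizer.from_pretrained("Davlan/afro-xlmr-base")

def preprocess(batch):
    # Headline and body are concatenated, as in the baseline setup.
    merged = [h + " " + t for h, t in zip(batch["headline"], batch["text"])]
    return tokenizer(merged, truncation=True, max_length=256)

data = data.map(preprocess, batched=True)
num_labels = data["train"].features["label"].num_classes
model = AutoModelForSequenceClassification.from_pretrained(
    "Davlan/afro-xlmr-base", num_labels=num_labels)

args = TrainingArguments(output_dir="afroxlmr-news", num_train_epochs=20,
                         per_device_train_batch_size=32, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=data["train"],
        eval_dataset=data["validation"], tokenizer=tokenizer).train()
```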
5.2 Baseline Results

Table 3 shows the results of training several models on the TRAIN split and evaluating on the TEST split for each language. Our evaluation shows that classical ML models are in general worse than fine-tuned multilingual LMs on average; however, the gap is sometimes small for LMs when a language was not covered during pre-training. For example, MLP, NaiveBayes and XGBoost have better performance than AfriBERTa on fra and sna, since these languages were not seen during the pre-training of the LM. Similarly, AfroLM had a worse result for fra for the same reason. On average, XLM-R-base, AfroLM, mDeBERTaV3 and XLM-R-large gave 83.0 F1, 86.1 F1, 86.0 F1, and 86.1 F1 respectively, worse than the other LMs (87.8-92.6 F1), either because they do not cover some of the African languages during pre-training (see Table 6) or because they have been pre-trained on little data (e.g. AfroLM was pre-trained on less than 0.8GB despite seeing 23 African languages during pre-training). Larger models that cover more languages, such as LaBSE and RemBERT, performed better than the smaller models; for example, LaBSE achieved a gain of over 2.5 F1 points over AfriBERTa. The best results are achieved by AfroXLMR-base/large, with over 4.0 F1 improvement over AfriBERTa. The larger variant gave the overall best result, due to its size. The AfroXLMR models benefited from being pre-trained on most of the languages we evaluated on. We also tried multilingual T2T models, but none of them reached the performance of AfroXLMR-large despite their larger sizes. We observe the same trend there, namely that the adapted mT5 model (i.e. AfriMT5) gives better results than mT5, similar to how AfroXLMR gives better results than XLM-R. We found Flan-T5-base to be competitive with AfriMT5 despite it seeing few African languages; however, its performance was very low for languages that use the Ge'ez script, like amh and tir, since the model does not support Ge'ez.

| Model | Size | amh | eng | fra | hau | ibo | lin | lug | orm | pcm | run | sna | som | swa | tir | xho | yor | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *classical ML* | | | | | | | | | | | | | | | | | | |
| MLP | <20K | 92.0 | 88.2 | 84.6 | 86.7 | 80.1 | 84.3 | 82.2 | 86.7 | 93.5 | 85.9 | 92.6 | 71.1 | 77.9 | 81.9 | 94.5 | 89.3 | 85.7 |
| NaiveBayes | <20K | 91.8 | 83.7 | 84.3 | 85.3 | 79.8 | 82.8 | 84.0 | 85.6 | 92.8 | 79.9 | 91.5 | 74.8 | 76.6 | 71.4 | 91.0 | 84.0 | 83.7 |
| XGBoost | <20K | 90.1 | 86.0 | 81.2 | 84.7 | 78.6 | 74.8 | 83.8 | 83.2 | 93.3 | 79.2 | 94.3 | 68.5 | 74.9 | 75.2 | 91.1 | 85.2 | 82.8 |
| *multilingual text encoders* | | | | | | | | | | | | | | | | | | |
| AfriBERTa | 126M | 90.6 | 88.9 | 76.4 | 89.2 | 87.3 | 87.0 | 85.1 | 89.4 | 98.1 | 91.3 | 89.3 | 83.9 | 83.3 | 87.0 | 86.9 | 90.3 | 87.8 |
| XLM-R-base | 270M | 90.9 | 90.6 | 90.4 | 88.4 | 82.5 | 87.9 | 65.3 | 82.2 | 97.8 | 85.9 | 88.9 | 73.8 | 85.6 | 54.6 | 78.6 | 84.5 | 83.0 |
| AfroXLMR-base | 270M | 94.2 | 92.2 | 92.5 | 91.0 | 90.7 | 93.0 | 89.4 | 92.1 | 98.2 | 91.4 | 95.4 | 85.2 | 88.2 | 86.5 | 94.7 | 93.0 | 91.7 |
| AfroLM | 270M | 90.3 | 87.7 | 77.5 | 88.3 | 85.4 | 85.7 | 88.0 | 83.5 | 95.9 | 86.8 | 92.5 | 72.0 | 83.2 | 83.5 | 91.4 | 86.5 | 86.1 |
| mDeBERTa | 276M | 91.7 | 90.8 | 89.2 | 88.6 | 88.3 | 81.6 | 65.7 | 84.7 | 96.8 | 89.4 | 93.9 | 72.0 | 84.6 | 78.7 | 90.5 | 89.3 | 86.0 |
| LaBSE | 471M | 92.5 | 91.6 | 90.9 | 90.0 | 91.6 | 89.6 | 86.8 | 86.7 | 98.4 | 91.1 | 94.6 | 82.1 | 87.6 | 83.8 | 94.7 | 92.1 | 90.3 |
| XLM-R-large | 550M | 93.1 | 92.2 | 91.4 | 90.6 | 84.2 | 91.8 | 73.9 | 88.4 | 98.4 | 87.0 | 88.9 | 76.1 | 85.6 | 62.7 | 89.2 | 84.5 | 86.1 |
| AfroXLMR-large | 550M | 94.4 | 93.1 | 91.1 | 92.2 | 93.4 | 93.7 | 89.9 | 92.1 | 98.8 | 92.7 | 95.4 | 86.9 | 87.7 | 89.5 | 97.3 | 94.0 | 92.6 |
| RemBERT | 559M | 92.4 | 92.4 | 90.8 | 90.5 | 91.1 | 91.5 | 86.7 | 88.7 | 98.2 | 90.6 | 93.9 | 75.9 | 86.7 | 69.9 | 92.5 | 93.0 | 89.1 |
| *multilingual text-to-text LMs* | | | | | | | | | | | | | | | | | | |
| AfriTeVa-base | 229M | 87.0 | 80.3 | 71.9 | 85.8 | 79.9 | 82.8 | 60.2 | 82.9 | 95.2 | 80.0 | 84.4 | 58.0 | 80.7 | 55.2 | 69.4 | 86.4 | 77.5 |
| mT5-base | 580M | 78.2 | 89.8 | 59.0 | 82.7 | 76.8 | 80.8 | 75.0 | 79.2 | 96.1 | 85.7 | 90.4 | 75.0 | 76.1 | 65.1 | 71.8 | 86.2 | 80.0 |
| Flan-T5-base | 580M | 54.5 | 92.4 | 88.9 | 84.5 | 86.6 | 90.6 | 84.1 | 85.8 | 97.8 | 87.3 | 90.6 | 76.0 | 79.0 | 41.5 | 90.8 | 88.0 | 82.4 |
| AfriMT5-base | 580M | 90.2 | 90.3 | 87.4 | 87.9 | 88.0 | 88.6 | 84.8 | 83.9 | 96.6 | 91.0 | 91.5 | 77.8 | 84.4 | 80.8 | 91.6 | 88.8 | 87.7 |

Table 3: Baseline results on MasakhaNEWS. We compare several ML approaches using both classical ML and LMs. Averages are over 5 runs. Evaluation is based on weighted F1-score.

[Figure 1: Comparison of article content type used for training news topic classification models. We report the average weighted F1 across all languages when either headline or headline+text is used. Headline vs. headline+text: MLP 78.3 vs. 85.7; NaiveBayes 73.4 vs. 83.7; XGBoost 73.1 vs. 82.8; AfroXLMR-base 89.4 vs. 91.7; AfroXLMR-large 91.1 vs. 92.6.]
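All scores in Tables 3-5 are weighted F1, i.e. per-class F1 averaged with class-frequency weights, which accounts for the unbalanced topic sizes in Table 2. For reference, a minimal sketch of how this metric is computed (the toy labels are illustrative):

```python
from sklearn.metrics import f1_score

y_true = ["politics", "health", "health", "sports"]
y_pred = ["politics", "health", "sports", "sports"]

# Per-class F1 scores are averaged, weighted by class frequency in y_true.
print(f1_score(y_true, y_pred, average="weighted"))
```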
Headline-only training. We compare our results using headline+text (as shown in Table 3) with training on the article headline only. With this shorter content, we find that fine-tuned LMs still give impressive performance, while classical ML methods struggle. Figure 1 shows the result of our comparison. AfroXLMR-base and AfroXLMR-large improve by 2.3 and 1.5 F1 points respectively when we use headline+text instead of headline only. Classical ML models improve the most when we make use of headline+text instead of headline: MLP, NaiveBayes and XGBoost improve by large margins (i.e. 7.4-9.7 F1 points). Thus, for the remainder of this paper, we make use of headline+text. Appendix B provides the per-language breakdown for the comparison of headline and headline+text.

| Model | SRC LANG | amh | eng | fra | hau | ibo | lin | lug | orm | pcm | run | sna | som | swa | tir | xho | yor | AVG | AVGsrc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fine-tune (AfroXLMR-base) | hau | 81.8 | 89.5 | 72.9 | 91.5 | 83.2 | 74.4 | 57.5 | 63.3 | 93.2 | 81.6 | 85.5 | 63.3 | 80.7 | 73.2 | 77.4 | 80.4 | 77.4 | 76.2 |
| Fine-tune (AfroXLMR-base) | swa | 78.8 | 82.4 | 86.7 | 80.8 | 81.5 | 74.5 | 66.5 | 63.8 | 92.7 | 86.2 | 83.6 | 74.7 | 87.3 | 71.8 | 72.6 | 80.4 | 79.7 | 79.1 |
| MAD-X | hau | 81.0 | 79.5 | 72.2 | 90.3 | 87.4 | 82.6 | 84.4 | 80.2 | 91.2 | 76.0 | 89.9 | 66.5 | 81.2 | 72.6 | 82.8 | 87.4 | 81.6 | 81.0 |
| MAD-X | swa | 91.0 | 80.9 | 86.1 | 81.2 | 83.0 | 85.0 | 75.1 | 82.6 | 94.2 | 86.9 | 90.1 | 74.6 | 88.4 | 77.6 | 80.7 | 88.8 | 84.1 | 84.0 |
| PET | None | 67.2 | 53.3 | 51.7 | 42.1 | 50.4 | 28.6 | 27.0 | 43.9 | 63.1 | 57.9 | 62.2 | 39.2 | 53.8 | 45.2 | 56.0 | 49.7 | 49.5 | 49.7 |
| SetFit | None | 75.8 | 61.6 | 60.1 | 53.3 | 53.1 | 59.6 | 40.1 | 38.9 | 72.0 | 55.1 | 66.6 | 49.4 | 55.2 | 37.8 | 49.3 | 63.7 | 55.7 | 55.9 |
| ChatGPT (GPT-3.5-Turbo), Mar 23 version | None | 33.3 | 79.3 | 67.6 | 59.4 | 65.0 | 62.3 | 59.4 | 62.9 | 93.2 | 73.6 | 73.0 | 62.0 | 69.3 | 41.4 | 73.9 | 80.1 | 66.0 | 66.2 |
| ChatGPT (GPT-3.5-Turbo), May 24 version | None | 36.1 | 79.5 | 69.6 | 70.1 | 78.3 | 75.1 | 64.7 | 72.0 | 93.1 | 82.2 | 84.5 | 72.3 | 75.9 | 45.0 | 78.0 | 81.7 | 72.4 | 72.3 |
| GPT-4, May 24 version | None | 88.5 | 79.1 | 84.0 | 82.6 | 77.9 | 70.0 | 96.2 | 88.6 | 90.8 | 77.3 | 75.0 | 76.7 | 83.1 | 83.7 | 81.7 | 82.5 | 77.3 | 76.5 |

Table 4: Zero-shot learning on MasakhaNEWS. We compare several approaches, such as cross-lingual fine-tuning, MAD-X, PET, SetFit, and prompting GPT-3.5-Turbo/GPT-4. We excluded the source languages hau and swa from the average (AVGsrc).

6 Zero-shot and Few-shot transfer

6.1 Methods

Here, we compare different zero-shot and few-shot methods:

Fine-tune: fine-tune on a source language and evaluate on a target language, using AfroXLMR-base. This is only used in the zero-shot setting.

MAD-X (Pfeiffer et al., 2020, 2021): a parameter-efficient approach for cross-lingual transfer leveraging the modularity and portability of adapters (Houlsby et al., 2019). We followed the same zero-shot setup as Alabi et al. (2022); however, we make use of hau and swa as source languages since they cover all the news topics used by the other languages. The setup is as follows: (1) we train language adapters on monolingual news corpora of our focus languages, performing language adaptation on news text to match the domain of our dataset, similar to Alabi et al. (2022); (2) we train a task adapter on the source-language labelled data using the source language adapter; (3) we substitute the source language adapter with the target language adapter to run prediction on the target language test set, while retaining the task adapter. A minimal sketch of this setup follows.
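The sketch below uses the adapters library (formerly adapter-transformers); the adapter names, paths, and configurations are placeholders, and the paper's exact settings may differ:

```python
from adapters import AutoAdapterModel
from adapters.composition import Stack

model = AutoAdapterModel.from_pretrained("Davlan/afro-xlmr-base")

# (1) Language adapters, assumed to have been trained beforehand on
#     monolingual news text; the local paths are placeholders.
model.load_adapter("adapters/hau_news", load_as="hau")
model.load_adapter("adapters/yor_news", load_as="yor")

# (2) A task adapter plus classification head, trained on labelled Hausa
#     data while stacked on the (frozen) Hausa language adapter.
model.add_adapter("news_task")
model.add_classification_head("news_task", num_labels=7)
model.train_adapter("news_task")  # freeze everything except the task adapter
model.active_adapters = Stack("hau", "news_task")
# ... run a standard training loop on the Hausa TRAIN split ...

# (3) Zero-shot transfer: swap in the target language adapter, keep the
#     task adapter, and predict on the target test set.
model.active_adapters = Stack("yor", "news_task")
```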
PET/iPET (Schick and Schütze, 2021a,b), also known as (Iterative) Pattern-Exploiting Training: a semi-supervised approach that makes use of a few labelled examples and a prompt/pattern given to an LM for few-shot learning. It involves three steps: (1) design a prompt/pattern and a verbalizer (which maps each label to a word from the LM vocabulary); (2) train an LM on each pattern based on the few labelled examples; (3) distill the knowledge of the LMs on unlabelled data. PET therefore leverages unlabelled examples to improve few-shot learning. iPET, in turn, repeats steps 2 and 3 iteratively. We make use of the same set of patterns used for the AG News English dataset (Zhang et al., 2015) provided by the PET/iPET authors: (1) P1(x) = "____ : a b", (2) P2(x) = "a (____) b", (3) P3(x) = "____ - a b", (4) P4(x) = "a b (____)", (5) P5(x) = "____ News: a b", and (6) P6(x) = "[Category: ____] a b", where a is the news headline and b is the news text. In evaluation, we take the average over all patterns.

SetFit (Tunstall et al., 2022b): a few-shot learning framework based on sentence transformer models (Reimers and Gurevych, 2019), like LaBSE, which follows two steps. Step 1 fine-tunes the sentence transformer model on a few labelled examples with contrastive learning, where positive pairs are K examples from a class c, and negative pairs are labelled examples paired with random examples from other classes. This contrastive learning approach enlarges the size of the training data in few-shot scenarios. In Step 2, the fine-tuned sentence transformer model is used to extract a rich sentence representation for each labelled example, followed by logistic regression for classification. The advantage of this approach is that it is fast and, unlike PET, requires no prompt. We use SetFit in both the zero-shot and few-shot settings. For the zero-shot setting, SetFit creates N dummy examples per label (we set N = 8, as in the SetFit paper) of the form "this sentence is {}", where {} can be any news topic, such as "sports".

Co:here multilingual sentence transformer: co:here introduced a multilingual embedding model, multilingual-22-12,5 which supports over a hundred languages, including most of the languages in MasakhaNEWS. We use this only in the few-shot setting.

OpenAI ChatGPT API:6 ChatGPT is an LLM trained on a large amount of text to predict the next word, like GPT-3 (Brown et al., 2020), followed by instruction-tuning on a set of prompts with human feedback. It leverages Reinforcement Learning from Human Feedback (RLHF), similar to InstructGPT (Ouyang et al., 2022), to make the LLM interact in a conversational way. We prompt the OpenAI API, based on GPT-3.5-Turbo and GPT-4, to categorize articles into news topics. For the prompting, we make use of a simple template from Sanh et al. (2022): 'Is this a piece of news regarding {{"business, entertainment, health, politics, religion, sports or technology"}}? {{INPUT}}'. We use the first 100 tokens of headline+text as {{INPUT}}. The completion of the LLM can be a single word, a sentence, or multiple sentences; we check whether a descriptive word relating to any of the news topics has been predicted. For example, "economy", "economic", or "finance" is mapped to "business" news (see the sketch at the end of this subsection). We provide more details on the ChatGPT evaluation in Appendix C.

For all few-shot settings, we tried K samples/shots per class, where K = 5, 10, 20, 50. We make use of LaBSE as the sentence transformer for SetFit, and AfroXLMR-large as the LM for PET.

5 https://docs.cohere.ai/docs/text-classification-with-classify
6 https://openai.com/blog/chatgpt
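To make the prompting-and-mapping procedure concrete, here is a minimal sketch using the OpenAI Python package (pre-1.0 interface, matching the API available at the time). The keyword lists for mapping free-form completions back to topics are illustrative, following the "economy/finance to business" example above, and are not the authors' full mapping:

```python
import openai  # pre-1.0 interface; openai.api_key must be set beforehand

TOPICS = ["business", "entertainment", "health", "politics",
          "religion", "sports", "technology"]
# Illustrative keyword lists for completions that do not name a topic directly.
KEYWORDS = {"business": ["economy", "economic", "finance"],
            "sports": ["sport", "football"]}

def classify(article, model="gpt-3.5-turbo"):
    # Template from Sanh et al. (2022); `article` is the first ~100 tokens
    # of headline+text.
    prompt = ('Is this a piece of news regarding {"business, entertainment, '
              'health, politics, religion, sports or technology"}? ' + article)
    reply = openai.ChatCompletion.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    completion = reply["choices"][0]["message"]["content"].lower()
    # Map the free-form completion back to a topic label.
    for topic in TOPICS:
        if topic in completion or any(k in completion
                                      for k in KEYWORDS.get(topic, [])):
            return topic
    return None  # no recognisable topic word in the completion
```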
6.2 Results

6.2.1 Zero-shot evaluation

GPT-3.5-Turbo performs poorly on non-Latin scripts. Table 4 shows the results of the zero-shot evaluation using Fine-tune, MAD-X, PET, SetFit and GPT-3.5-Turbo (March 2023 version). Our results show that cross-lingual zero-shot transfer from a source language with the same domain and task (i.e. Fine-tune & MAD-X) gives superior results (+11 F1) compared to PET, SetFit, and GPT-3.5-Turbo. GPT-3.5-Turbo was nevertheless more than +9.0 F1 points better than SetFit and PET, showing the capabilities of instruction-tuned LLMs over smaller LMs. However, the results of GPT-3.5-Turbo were poor (< 42.0) for languages like Amharic and Tigrinya that make use of the non-Latin Ge'ez script, while the languages that make use of the Latin script all score over 59.0. Surprisingly, some results of GPT-3.5-Turbo are comparable to the Fine-tune approach for some languages (English, Luganda, Oromo, Naija, Somali, isiXhosa, and Yorùbá), without leveraging any additional technique apart from prompting the LLM.

GPT-3.5-Turbo evaluation improves with newer versions. We repeated the GPT-3.5-Turbo evaluation using a newer version (May 24, 2023 version); our results show a significant improvement for 14 (out of 16) languages in our evaluation. This implies that newer versions of the model are better at the news topic classification task than older ones.

GPT-4 overcomes the limited non-Latin capabilities of GPT-3.5-Turbo. We also evaluated GPT-4 on the 16 languages in the zero-shot setting. Our results show a significant improvement in performance over GPT-3.5-Turbo, by over +9 points. Surprisingly, GPT-4 was able to overcome the limitation of GPT-3.5-Turbo for languages with a non-Latin script (i.e. Amharic and Tigrinya), with impressive performance matching that of the cross-lingual transfer experiments from a related African language (i.e. Fine-tune hau/swa to xx and MAD-X hau to xx). The large performance gap between GPT-3.5-Turbo and GPT-4 may be because the former is a distilled version of a more powerful model, created to reduce inference cost, which also significantly affected its performance on non-Latin scripts.7,8 Alternatively, GPT-4 may simply be a bigger and better model with stronger multilingual and non-Latin capabilities.

Leveraging labelled data from other languages is more effective. In general, when no labelled data is available for the target language, it may be advantageous to leverage knowledge from other languages with available training data. Also, we observe that Swahili (swa) achieves better results as a source language than Hausa (hau), especially when transferring to fra (+13.8), lug (+9.0), and eng (+3.6).

7 https://arstechnica.com/information-technology/2023/07/is-chatgpt-getting-worse-over-time-study-claims-yes-but-others-arent-sure/
8 https://platform.openai.com/docs/models/gpt-3-5
The reason for the impressive performance from Swahili to Luganda might be that both languages belong to the same Greater Lake Bantu language sub-group, but it is unclear why Hausa gave worse results than Swahili when adapting to English or French. However, with a few examples, the PET and SetFit methods are powerful without leveraging training data or models from other languages.

6.2.2 Few-shot evaluation

Table 5 shows the results of the few-shot learning approaches. With only 5 shots, we find all the few-shot approaches to be better than the usual Fine-tune baselines for most languages. However, as the number of shots increases, the Fine-tune baselines reach comparable results to SetFit and the co:here API, especially for K = 20, 50 shots. In contrast, we found that PET achieved very impressive results already with 5 shots (81.9 F1 on average), matching the performance of SetFit/co:here API with 50 shots. The results get even better with more shots, i.e. (K = 10, 86.0 F1), (K = 20, 87.9 F1), and (K = 50, 89.9 F1). Surprisingly, with 50 shots, PET gave results competitive with the fully supervised setting (i.e. fine-tuning on all TRAIN data), which achieved 92.6 F1 (see Table 3).

| Model | amh | eng | fra | hau | ibo | lin | lug | orm | pcm | run | sna | som | swa | tir | xho | yor | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Fine-tune (AfroXLMR-large)* | | | | | | | | | | | | | | | | | |
| 5-shots | 68.4 | 55.1 | 58.0 | 35.8 | 71.3 | 52.7 | 29.2 | 39.2 | 92.5 | 71.2 | 70.2 | 18.1 | 42.5 | 30.2 | 46.5 | 62.7 | 52.7 |
| 10-shots | 75.5 | 75.2 | 65.9 | 64.6 | 86.1 | 72.6 | 31.3 | 56.8 | 95.8 | 87.3 | 80.8 | 38.9 | 73.8 | 36.3 | 61.7 | 69.4 | 67.0 |
| 20-shots | 88.5 | 85.6 | 78.3 | 85.2 | 90.4 | 80.8 | 48.4 | 41.1 | 97.4 | 90.0 | 92.3 | 63.6 | 82.9 | 67.3 | 83.1 | 84.3 | 78.7 |
| 50-shots | 91.4 | 87.5 | 86.9 | 88.8 | 87.3 | 91.0 | 75.2 | 71.3 | 96.4 | 89.8 | 95.5 | 85.3 | 86.6 | 86.2 | 94.1 | 90.2 | 87.7 |
| *Fine-tune (LaBSE)* | | | | | | | | | | | | | | | | | |
| 5-shots | 71.6 | 67.4 | 61.3 | 60.7 | 63.6 | 65.9 | 59.5 | 43.3 | 86.5 | 65.6 | 83.1 | 25.4 | 49.1 | 36.1 | 46.0 | 71.2 | 59.7 |
| 10-shots | 79.0 | 77.1 | 76.8 | 79.7 | 77.1 | 70.2 | 68.3 | 58.5 | 94.5 | 81.9 | 84.8 | 44.8 | 77.2 | 51.8 | 69.9 | 79.8 | 73.2 |
| 20-shots | 90.3 | 84.7 | 83.1 | 85.1 | 82.0 | 82.2 | 70.4 | 72.3 | 95.5 | 86.0 | 90.6 | 66.6 | 84.3 | 69.0 | 80.5 | 86.0 | 81.8 |
| 50-shots | 89.6 | 86.3 | 85.6 | 87.1 | 86.4 | 88.4 | 80.6 | 77.8 | 96.7 | 87.9 | 93.0 | 80.1 | 85.3 | 79.6 | 87.4 | 88.6 | 86.3 |
| *PET* | | | | | | | | | | | | | | | | | |
| 5-shots | 89.9 | 80.8 | 72.3 | 82.6 | 85.0 | 82.9 | 79.0 | 89.2 | 94.5 | 87.7 | 88.9 | 69.5 | 79.6 | 59.7 | 84.3 | 84.0 | 81.9 |
| 10-shots | 91.1 | 81.7 | 83.3 | 86.6 | 86.1 | 87.6 | 84.0 | 91.8 | 96.6 | 90.8 | 91.4 | 74.9 | 81.1 | 69.2 | 88.9 | 90.5 | 86.0 |
| 20-shots | 92.7 | 86.4 | 82.8 | 89.1 | 88.6 | 89.2 | 83.8 | 94.9 | 96.7 | 88.7 | 93.3 | 81.6 | 83.5 | 72.4 | 91.5 | 91.0 | 87.9 |
| 50-shots | 92.9 | 89.2 | 89.1 | 90.9 | 90.6 | 89.6 | 86.7 | 96.0 | 97.2 | 90.9 | 94.8 | 84.2 | 84.2 | 76.4 | 93.5 | 92.4 | 89.9 |
| *SetFit* | | | | | | | | | | | | | | | | | |
| 5-shots | 68.3 | 69.6 | 64.3 | 76.0 | 78.9 | 48.3 | 28.9 | 38.8 | 91.2 | 74.8 | 85.8 | 68.9 | 76.8 | 73.1 | 84.0 | 60.2 | 68.0 |
| 10-shots | 84.8 | 82.0 | 80.5 | 79.4 | 71.4 | 77.8 | 49.5 | 57.3 | 92.8 | 83.8 | 89.2 | 65.1 | 81.2 | 64.9 | 83.6 | 76.5 | 76.2 |
| 20-shots | 87.9 | 78.5 | 83.9 | 83.3 | 81.8 | 86.6 | 71.7 | 61.0 | 97.4 | 87.0 | 83.2 | 69.4 | 79.2 | 64.9 | 78.4 | 85.0 | 80.0 |
| 50-shots | 88.6 | 76.6 | 83.8 | 83.0 | 77.3 | 81.9 | 60.8 | 63.6 | 93.6 | 85.6 | 90.6 | 67.9 | 76.5 | 69.8 | 83.8 | 86.0 | 79.3 |
| *Cohere sentence embedding API* | | | | | | | | | | | | | | | | | |
| 5-shots | 66.0 | 65.9 | 60.2 | 74.2 | 72.0 | 69.8 | 50.2 | 50.0 | 74.0 | 61.2 | 78.1 | 52.8 | 67.7 | 60.1 | 68.3 | 71.9 | 65.2 |
| 10-shots | 80.1 | 72.5 | 71.4 | 80.4 | 75.7 | 78.4 | 65.5 | 57.2 | 84.9 | 78.2 | 85.0 | 60.4 | 73.8 | 59.8 | 83.2 | 80.1 | 74.2 |
| 20-shots | 87.6 | 78.0 | 78.4 | 82.9 | 77.7 | 86.9 | 70.2 | 63.9 | 88.7 | 82.7 | 86.6 | 65.3 | 79.0 | 64.8 | 88.2 | 83.9 | 79.1 |
| 50-shots | 90.2 | 80.9 | 83.2 | 85.6 | 81.9 | 87.7 | 78.0 | 70.6 | 94.9 | 84.1 | 90.5 | 68.9 | 77.6 | 72.8 | 90.4 | 88.4 | 82.9 |

Table 5: Few-shot learning on MasakhaNEWS. We compare several few-shot learning approaches: fine-tuning, PET, SetFit and the Cohere embedding API.
It is important to note that PET makes use of additional unlabelled data, while SetFit and the Cohere API do not. In general, our results highlight the importance of obtaining a few labelled examples for a new language we are adapting to, even as few as 10 examples per class, which is typically not time-consuming to annotate (Lauscher et al., 2020; Hedderich et al., 2020).

7 Conclusion

In this paper, we created the largest news topic classification dataset for 16 typologically diverse languages spoken in Africa. We provide an extensive evaluation in both fully supervised and few-shot learning settings. Furthermore, we study different techniques for adapting prompt-based tuning and prompt-free methods of LMs to African languages. Our experimental results show that prompting LLMs like ChatGPT performs poorly on the simple task of text classification for several under-resourced African languages, especially those with non-Latin scripts. Furthermore, we showed the potential of prompt-based few-shot learning approaches like PET (based on smaller LMs) for African languages. Our work shows that existing supervised approaches work well for all African languages and that language models with only a few supervised samples can reach competitive performance, both findings which demonstrate the applicability of existing NLP techniques to African languages. In the future, we plan to extend this dataset to more African languages, include the evaluation of other multilingual LLMs like BLOOM, mT0 (Muennighoff et al., 2022) and XGLM (Lin et al., 2022), and extend the analysis to other text classification tasks like sentiment classification (Shode et al., 2022, 2023; Muhammad et al., 2023).

8 Limitations

One major limitation of our work is that we did not extensively evaluate the performance of the ChatGPT LLM on several African languages and tasks such as question answering and text generation. Our evaluation is limited to text classification and may not generalize to other tasks. However, we suspect that if it performs poorly on text classification, the results may be even worse on more difficult NLP tasks. There is also the challenge that our results may not be fully reproducible, since we use the ChatGPT API, where the underlying LLMs are often updated or improved over time; support for non-Latin scripts may improve significantly within a few months. This limitation also applies to the co:here embedding API.

9 Ethics Statement

Our work aims to provide a benchmark dataset for African languages. We do not see any potential harms in using our news topic classification datasets and models to train ML models: the annotated dataset is based on the news domain, the articles are publicly available, and we believe the dataset and news topic annotation are unlikely to cause unintended harm. We also do not see any privacy risks in using our dataset and models, because they are based on the news domain.

Acknowledgments

We would like to thank Yuxiang Wu for the suggestions on the few-shot experiments. We are grateful for the feedback from the anonymous reviewers of AfricaNLP and IJCNLP-AACL that helped improve this draft. David Adelani acknowledges the support of the DeepMind Academic Fellowship programme. This work was supported in part by Oracle Cloud credits and related resources provided by Oracle. Finally, we are grateful to OpenAI for providing API credits through their Researcher Access API programme to Masakhane for the evaluation of the GPT-3.5 and GPT-4 large language models.
References David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Emezue, Colin Leong, Michael Beukman, Shamsuddeen Muhammad, Guyo Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ajibade, Tunde Ajayi, Yvonne Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi, Fatoumata Ouoba Kabore, Godson Kalipe, Derguene Mbaye, Allahsera Auguste Tapo, Victoire Memdjokam Koagne, Edwin Munkoh-Buabeng, Valencia Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda, Andiswa Bukula, and Sam Manthalu. 2022a. A few thousand translations go a long way! leveraging pre-trained models for African news translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3053–3070, Seattle, United States. Association for Computational Linguistics. David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D’souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen H. Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Aremu Anuoluwapo, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Rabiu Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dibora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, and Salomey Osei. 2021. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9:1116–1131. David Ifeoluwa Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba O. Alabi, Shamsuddeen H. Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, Edwin Munkoh-Buabeng, Victoire M. Koagne, Allahsera Auguste Tapo, Tebogo Macucwa, Vukosi Marivate, Elvis Mboning, Tajuddeen Gwadabe, Tosin Adewumi, Orevaoghene Ahia, Joyce NakatumbaNabende, Neo L. Mokono, Ignatius Ezeani, Chiamaka Chukwuneke, Mofetoluwa Adeyemi, Gilles Q. Hacheme, Idris Abdulmumin, Odunayo Ogundepo, Oreen Yousuf, Tatiana Moteu Ngoli, and Dietrich Klakow. 2022b. MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4488–4508, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Jesujoba O. Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. 2022. 
Adapting pretrained language models to African languages via multilingual adaptive fine-tuning. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4336–4349, Gyeongju, Republic of Korea. International Committee on Computational Linguistics. Rahul Aralikatte, Ziling Cheng, Sumanth Doddapaneni, and Jackie Chi Kit Cheung. 2023. Vārta: A largescale headline-generation dataset for indic languages. ArXiv, abs/2305.05858. Israel Abebe Azime and Nebil Mohammed. 2021. An amharic news text classification dataset. CoRR, abs/2103.05639. David Blei, Andrew Ng, and Michael Jordan. 2001. Latent dirichlet allocation. In Advances in Neural Information Processing Systems, volume 14. MIT Press. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc. Hyung Won Chung, Thibault Fevry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. 2021. Rethinking embedding coupling in pre-trained language models. In International Conference on Learning Representations. Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models. Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre- training text encoders as discriminators rather than generators. In ICLR. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440– 8451, Online. Association for Computational Linguistics. Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei. 2020. Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8:439–453. Bonaventure FP Dossou, Atnafu Lambebo Tonja, Oreen Yousuf, Salomey Osei, Abigail Oppong, Iyanuoluwa Shode, Oluwabusayo Olufunke Awoyomi, and Chris Chinenye Emezue. 2022. Afrolm: A self-active learning-based multilingual pretrained language model for 23 african languages. arXiv preprint arXiv:2211.03263. David M. Eberhard, Gary F. Simons, and Charles D. Fennig. 2021. Ethnologue: Languages of the world. twenty-third edition. Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics. 
Awet Fesseha, Shengwu Xiong, Eshete Derb Emiru, Moussa Diallo, and Abdelghani Dahou. 2021. Text classification based on convolutional neural networks and word embedding for low-resource languages: Tigrinya. Information, 12(2). Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382. Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics. Derek Greene and Pádraig Cunningham. 2006. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, page 377–384, New York, NY, USA. Association for Computing Machinery. Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021a. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. ArXiv, abs/2111.09543. Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021b. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations. Michael A. Hedderich, David Adelani, Dawei Zhu, Jesujoba Alabi, Udia Markus, and Dietrich Klakow. 2020. Transfer learning and distant supervision for multilingual transformer models: A study on African languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2580–2591, Online. Association for Computational Linguistics. Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning. Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics. Odunayo Jude Ogundepo, Akintunde Oladipo, Mofetoluwa Adeyemi, Kelechi Ogueji, and Jimmy Lin. 2022. AfriTeVA: Extending "small data" pretraining approaches to sequence-to-sequence models. In Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 126–135, Hybrid. Association for Computational Linguistics. Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. AI4Bharat-IndicNLP corpus: Monolingual corpora and word embeddings for Indic languages. arXiv preprint arXiv:2005.00085. Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. 2020. From zero to hero: On the limitations of zero-shot language transfer with multilingual Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4483–4499, Online. Association for Computational Linguistics.
Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, and Xian Li. 2022. Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9). Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111, Toronto, Canada. Association for Computational Linguistics. Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2022. Crosslingual generalization through multitask finetuning. Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Sa'id Ahmad, Meriem Beloucif, Saif Mohammad, Sebastian Ruder, Oumaima Hourrane, Pavel Brazdil, Felermino Dário Mário António Ali, Davis Davis, Salomey Osei, Bello Shehu Bello, Falalu Ibrahim, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Belay, Wendimu Baye Messelle, Hailu Beshada Balcha, Sisay Adugna Chala, Hagos Tesfahun Gebremichael, Bernard Opoku, and Steven Arthur. 2023. AfriSenti: A Twitter sentiment analysis benchmark for African languages. Rubungo Andre Niyongabo, Qu Hong, Julia Kreutzer, and Li Huang. 2020. KINNEWS and KIRNEWS: Benchmarking cross-lingual text classification for Kinyarwanda and Kirundi. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5507–5521, Barcelona, Spain (Online). International Committee on Computational Linguistics. NLLB-Team, Marta Ruiz Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Alison Youngblood, Bapi Akula, Loïc Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon L. Spruit, C. Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation. ArXiv, abs/2207.04672. Kelechi Ogueji, Yuxin Zhu, and Jimmy Lin. 2021. Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. 2022. Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2021. UNKs everywhere: Adapting multilingual language models to new scripts. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10186–10203, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. 2022. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
Teven Le Scao, Angela Fan, Christopher Akiki, Elizabeth-Jane Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Rose Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa Etxabe, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris C. Emezue, Christopher Klamm, Colin Leong, Daniel Alexander van Strien, David Ifeoluwa Adelani, Dragomir R. Radev, Eduardo G. Ponferrada, Efrat Levkovizh, ..., Younes Belkada, and Thomas Wolf. 2022. BLOOM: A 176B-parameter open-access multilingual language model. ArXiv, abs/2211.05100.
Timo Schick and Hinrich Schütze. 2021a. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.
Timo Schick and Hinrich Schütze. 2021b. It's not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339–2352, Online. Association for Computational Linguistics.
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.
Iyanuoluwa Shode, David Ifeoluwa Adelani, and Anna Feldman. 2022. YOSM: A new Yoruba sentiment corpus for movie reviews.
Iyanuoluwa Shode, David Ifeoluwa Adelani, Jing Peng, and Anna Feldman. 2023. NollySenti: Leveraging transfer learning and machine translation for Nigerian movie sentiment classification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 986–998, Toronto, Canada. Association for Computational Linguistics.
Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics.
Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, and Oren Pereg. 2022. Efficient few-shot learning without prompts. ArXiv, abs/2209.11055.
Zhen Wang, Xu Shan, Xiangxie Zhang, and Jie Yang. 2022. N24News: A new dataset for multimodal news classification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6768–6775, Marseille, France. European Language Resources Association.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Neural Information Processing Systems.
Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.

A Annotation Tool

Figure 2 shows the interface of our in-house annotation tool.

Figure 2: Interface of our in-house annotation tool. Annotators can correct the pre-defined category assigned and also edit their annotation.

B Comparing different article content types

Table 7 compares training on the news headline only against training on headline+text. We find a significant improvement on average when we make use of headline+text for training, across all models and languages, especially for the classical ML methods (MLP, NaiveBayes, and XGBoost); a minimal code sketch of this comparison is given after Table 7.

C ChatGPT Evaluation

We prompted ChatGPT for news topic classification using the following template: 'Is this a piece of news regarding {{"business, entertainment, health, politics, religion, sports or technology"}}? {{INPUT}}'. The completion may take different forms, e.g. a single word, a sentence, or multiple sentences. Examples of such predictions are:

1. sports
2. This is a piece of news regarding sports.
3. This is a piece of sports news regarding the CHAN 2021 football tournament in Cameroon. It reports that the Mali national football team has advanced to the semi-finals after defeating the Congo national team in a match that ended in a penalty shootout.
4. This is a piece of news regarding sports. It talks about the recent match between Tunisia and Angola in the African Cup of Nations. Both teams scored a goal, and the article mentions some of the details of the game, such as the penalty and missed chances.
5. I'm sorry, but I'm having trouble understanding this piece of news as it appears to be in a language I don't recognize. Can you please provide me with news in English so I can assist you better?

To extract the right category, we make use of a simple verbalizer that maps each news topic to several indicative words (capitalization ignored), such as:

(a) 'business': {'business', 'finance', 'economy', 'economics'}
(b) 'entertainment': {'entertainment', 'music'}
(c) 'health': {'health'}
(d) 'politics': {'politics', 'political'}
(e) 'religion': {'religion'}
(f) 'sports': {'sports', 'sport'}
(g) 'technology': {'technology'}

When the right category is not obvious, as in example (5) above, we choose a random category before computing the F1-score. A sketch of this extraction step follows.
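The prompt construction and verbalizer above fit in a few lines of Python. The sketch below is an illustration only, not our exact evaluation script: the names build_prompt and extract_label are ours, and the substring-matching rule is one plausible reading of "indicative words (capitalization ignored)".

```python
import random

LABELS = ["business", "entertainment", "health", "politics",
          "religion", "sports", "technology"]

# Verbalizer: indicative words (capitalization ignored) per category,
# mirroring items (a)-(g) above.
VERBALIZER = {
    "business": {"business", "finance", "economy", "economics"},
    "entertainment": {"entertainment", "music"},
    "health": {"health"},
    "politics": {"politics", "political"},
    "religion": {"religion"},
    "sports": {"sports", "sport"},
    "technology": {"technology"},
}

def build_prompt(article: str) -> str:
    """Fill the classification template with the article text."""
    topics = ("business, entertainment, health, politics, "
              "religion, sports or technology")
    return f'Is this a piece of news regarding {{"{topics}"}}? {article}'

def extract_label(completion: str) -> str:
    """Map a free-form completion to a category via the verbalizer.

    Falls back to a random category when no indicative word is found,
    as described above for computing the F1-score.
    """
    text = completion.lower()
    for label, keywords in VERBALIZER.items():
        if any(word in text for word in keywords):
            return label
    return random.choice(LABELS)
```

For example, extract_label("This is a piece of news regarding sports.") returns 'sports', while the unrecognized-language completion in example (5) triggers the random fallback.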
| LLM | LLM size | # Lang. | # African Lang. | Focus languages covered |
|---|---|---|---|---|
| XLM-R-base/large | 270M/550M | 100 | 8 | amh, eng, fra, hau, orm, som, swa, xho |
| AfriBERTa-large | 126M | 11 | 11 | amh, hau, ibo, orm, pcm, run, swa, tir, yor |
| mDeBERTa | 276M | 110 | 8 | amh, eng, fra, hau, orm, swa, xho |
| RemBERT | 575M | 110 | 12 | amh, eng, fra, hau, ibo, sna, swa, xho, yor |
| AfriTeVa-base | 229M | 11 | 11 | amh, run, hau, ibo, orm, pcm, swa, tir, yor |
| AfroXLMR-base/large | 270M/550M | 20 | 17 | amh, eng, fra, hau, ibo, orm, pcm, run, sna, swa, xho, yor |
| AfriMT5-base | 580M | 20 | 17 | amh, eng, fra, hau, ibo, orm, pcm, run, sna, swa, xho, yor |
| FlanT5-base | 580M | 60 | 5 | eng, fra, ibo, swa, yor |

Table 6: Languages covered by different multilingual models and their sizes.

Headline:

| Model | size | amh | eng | fra | hau | ibo | lin | lug | orm | pcm | run | sna | som | swa | tir | xho | yor | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MLP | <20K | 86.7 | 72.6 | 69.8 | 80.4 | 77.8 | 79.4 | 74.6 | 81.9 | 87.5 | 73.8 | 84.9 | 71.4 | 69.3 | 80.7 | 79.1 | 83.0 | 78.3 |
| NaiveBayes | <20K | 88.8 | 71.6 | 70.0 | 76.6 | 75.8 | 74.0 | 74.6 | 74.2 | 82.6 | 64.3 | 79.5 | 61.7 | 60.6 | 66.0 | 72.5 | 81.4 | 73.4 |
| XGBoost | <20K | 83.6 | 71.3 | 67.8 | 77.4 | 71.3 | 76.7 | 68.7 | 77.7 | 80.8 | 71.3 | 84.6 | 63.4 | 66.4 | 62.1 | 69.4 | 77.5 | 73.1 |
| AfroXLMR-base | 270M | 91.8 | 87.0 | 92.0 | 89.2 | 87.8 | 89.0 | 87.4 | 87.4 | 97.4 | 87.8 | 94.5 | 85.9 | 85.0 | 85.7 | 93.5 | 88.6 | 89.4 |
| AfroXLMR-large | 550M | 93.0 | 89.3 | 91.8 | 91.0 | 90.7 | 91.4 | 87.7 | 90.9 | 98.2 | 89.3 | 95.9 | 87.1 | 86.6 | 88.5 | 96.2 | 90.3 | 91.1 |

Headline+Text:

| Model | size | amh | eng | fra | hau | ibo | lin | lug | orm | pcm | run | sna | som | swa | tir | xho | yor | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MLP | <20K | 92.0 | 88.2 | 84.6 | 86.7 | 80.1 | 84.3 | 82.2 | 86.7 | 93.5 | 85.9 | 92.6 | 71.1 | 77.9 | 81.9 | 94.5 | 89.3 | 85.7 |
| NaiveBayes | <20K | 91.8 | 83.7 | 84.3 | 85.3 | 79.8 | 82.8 | 84.0 | 85.6 | 92.8 | 79.9 | 91.5 | 74.8 | 76.6 | 71.4 | 91.0 | 84.0 | 83.7 |
| XGBoost | <20K | 90.1 | 86.0 | 81.2 | 84.7 | 78.6 | 74.8 | 83.8 | 83.2 | 93.3 | 79.2 | 94.3 | 68.5 | 74.9 | 75.2 | 91.1 | 85.2 | 82.8 |
| AfroXLMR-base | 270M | 94.2 | 92.2 | 92.5 | 91.0 | 90.7 | 93.0 | 89.4 | 92.1 | 98.2 | 91.4 | 95.4 | 85.2 | 88.2 | 86.5 | 94.7 | 93.0 | 91.7 |
| AfroXLMR-large | 550M | 94.4 | 93.1 | 91.1 | 92.2 | 93.4 | 93.7 | 89.9 | 92.1 | 98.8 | 92.7 | 95.4 | 86.9 | 87.7 | 89.5 | 97.3 | 94.0 | 92.6 |

Table 7: Baseline results on MasakhaNEWS. We compare different article content types (i.e. headline and headline+text) used to train news topic classification models. Average is over 5 runs. Evaluation is based on weighted F1-score.
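To make the comparison in Table 7 concrete, the sketch below shows how one of the classical baselines could be trained on either content type with scikit-learn (Pedregosa et al., 2011). It is a hypothetical minimal pipeline, not our exact setup: the dict keys 'headline', 'text', and 'label', the TF-IDF featurization, and the Naive Bayes classifier are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_and_evaluate(train_rows, test_rows, content_type="headline+text"):
    """Train a Naive Bayes topic classifier on one article content type.

    Each row is assumed to be a dict with 'headline', 'text', and
    'label' keys.
    """
    def render(row):
        if content_type == "headline":
            return row["headline"]
        # headline+text: concatenate the headline with the article body.
        return row["headline"] + " " + row["text"]

    X_train = [render(r) for r in train_rows]
    y_train = [r["label"] for r in train_rows]
    X_test = [render(r) for r in test_rows]
    y_test = [r["label"] for r in test_rows]

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    # Weighted F1, matching the evaluation metric used in Table 7.
    return f1_score(y_test, preds, average="weighted")
```

Running the function once with content_type="headline" and once with content_type="headline+text" reproduces the comparison in spirit: the article body gives the bag-of-words features far more signal, which is consistent with the 7–10 point average gains observed for the classical models in Table 7.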