Abstract
Characterisation of SARS-CoV-2 genetic diversity through space and time can reveal trends in virus importation and domestic circulation, and permit the exploration of questions regarding the early transmission dynamics. Here we present a detailed description of SARS-CoV-2 genomic epidemiology in Ecuador, one of the hardest hit countries during the early stages of the COVID-19 pandemic. We generate and analyse 160 whole genome sequences sampled from all provinces of Ecuador in 2020. Molecular clock and phylgeographic analysis of these sequences in the context of global SARS-CoV-2 diversity enable us to identify and characterise individual transmission lineages within Ecuador, explore their spatiotemporal distributions, and consider their introduction and domestic circulation. Our results reveal a pattern of multiple international importations across the country, with apparent differences between key provinces. Transmission lineages were mostly introduced before the implementation of non-pharmaceutical interventions (NPIs), with differential degrees of persistence and national dissemination.
Introduction
The rapid generation of substantial numbers of virus genomic sequences during the COVID-19 pandemic is without precedent. During 2020, laboratories and institutes around the world produced and shared over 300000 whole SARS-CoV-2 genome sequences in the GISAID repository (1), providing an unparalleled data set that permits detailed analyses of virus transmission and dissemination. These achievements have provided insights into the sources of SARS-CoV-2 importation and early transmission dynamics in individual countries and geographical regions (2–5), and have enabled the exploration of viral transmission history at a global scale (6). Phylogenetic methods, including molecular clock models and phylogeographic and phylodynamic methods, are now used routinely to analyse such genomic data from emerging outbreaks (7). The resolution level of evolutionary and transmission history obtained using these methods is contingent on the virus’ evolutionary rate and the depth and representativeness of sampling of cases across space and time (8). While heterogeneous sampling and sequencing among countries can bias and affect the output of some phylogeographic methods (9, 10), general trends in the transmission of viral lineages can still be inferred from smaller samples of genomic sequences from individual locations.
The utility of pathogen genomic surveillance during outbreaks has developed during various past emerging epidemics (11–13) and has gained further momentum during the current global health crisis. Information about epidemiological trends can be effectively complemented with genomic analyses in order to understand case-specific (14) and general transmission patterns (15). This framework can be extended to account for other factors that affect the spread of pathogens, ranging from human mobility on a global scale (16) to particular social networks (17). The analysis of local-and national-scale data sets during the current pandemic has provided insights into the processes affecting the introduction and circulation of the virus into new locations (18) and provided genomic context for other data sources (19–21). Indeed, the integration and analysis of multiple data sources about an emerging epidemic has the potential to compensate for surveillance blind spots and better understand poorly sampled outbreaks (22).
The COVID-19 epidemic in Ecuador was marked by a dramatic and widely publicised early phase (23) with an estimated basic reproductive number (R0) of 3.54 (24). Ecuador is a small middle-income South American country with the seventh largest population in the continent; half of the country’s population lives in Guayas province (host of the country’s most populated city, Guayaquil) and Pichincha province (host of the country’s second most populated city, the capital Quito; Fig 1A). The first case was reported in the country on 27th February 2020 (a patient who returned from abroad through Guayaquil with date of symptom onset of February 15) and was followed by the declaration of a National Health Emergency on 11th March 2020. Public health interventions were implemented shortly thereafter: mass gatherings were restricted on March 13, and a partial lockdown that included the closure of international borders was implemented on March 17. Finally, a full lockdown that included a curfew and the limitation of domestic mobility in private and public vehicles came into effect on March 25 (24). The country’s port city of Guayaquil was the first epicentre of the epidemic, facing a severe increase in the numbers of cases between late February and early April. The province of Guayas reached its highest effective reproductive number Rt (defined as the average number of secondary cases caused by a primary cases at a point in time t; 25) on March 14 (Rt estimates vary between 3.96 and 4.91) with 1462 cases reported that day (26) and reported a cumulative incidence of 146.94 cases per 100000 people by April 18 (24). The actual number of cases are likely to have been much higher when evaluated through the lens of excess mortality data (as obtained from the National Institute of Statistics and Census; Instituto Nacional de Estadística y Censos, INEC), and would explain why local diagnostic and healthcare services became rapidly overwhelmed (23, 24). After the peak and decline of the epidemic’s first wave in Ecuador, restrictions were maintained during April and May and progressively relaxed over the following months, as the epicentre of Ecuador’s epidemic moved to the capital city of Quito, located in Pichincha province (which on 23rd July 2020 overtook Guayaquil as the city with the greatest number of COVID-19 confirmed cases). The last restrictions were finally lifted on September 13, although use of personal protective equipment and social distancing guidelines remained in place.
To date, the source and diversity of circulating transmission lineages in Ecuador and their reach across the country remain unexplored. International importations are expected to have played an important role in seeding transmission chains in Ecuador, as observed in other countries (15). We undertake phylogenetic analyses of 160 SARS-CoV-2 whole genome sequences sampled from Ecuadorian cases and place them within the context of global viral genetic diversity in order to characterise the genomic epidemiology of SARS-CoV-2 in the country. We identify introduction events and transmission lineages within Ecuador and investigate their spatio-temporal distribution, and we hypothesise about the role of domestic seeding on viral transmission dynamics within Ecuador.
Methods
Genomic sequencing of SARS-CoV-2 samples from Ecuador
Clinical samples were collected from patients with a laboratory confirmed SARS-CoV-2 infection in Ecuador during 2020 (Fig 1A-C). Samples collected by the Microbiology Institute at Universidad San Francisco de Quito (IM-USFQ) were obtained from third-level hospitals (i.e. specialised tertiary referral hospitals) across all 24 provinces in the country without unified selection criteria (27). Samples collected by the Omics Sciences Laboratory at Universidad de Especialidades Espiritu Santo (UEES) were obtained from samples collected from the laboratory’s diagnostic service and selected at random for sequencing. Samples collected by the National Institute of Investigations in Public Health (Instituto Nacional de Investigación en Salud Pública, INSPI) were obtained from the national epidemiological SARS-CoV-2 surveillance system. Samples collected by the Biomedical Research Unit at Zurita & Zurita Laboratories (ZZL) were obtained from community patients from Quito who presented clinical signs of reinfection. This complete Ecuadorian sample set was collected between March 9 and December 9 2020, with limited representation during the early months of the epidemic when compared to excess mortality data (Fig 1C).
From these samples we generated 160 complete SARS-CoV-2 genomic sequences using different methodologies. IM-USFQ generated 108 whole genome sequences using Oxford Nanopore MinION sequencing and the ARTIC Network primer scheme approach as previously described (27). UEES generated 33 whole genome sequences through Illumina sequencing on a MiniSeq platform (Illumina, San Diego, CA). INSPI generated 15 sequences either in collaboration with Charité - Berlin University of Medicine (through Illumina sequencing) or on site at the Centre for Multidisciplinary Research of the Direction of Research, Development and Innovation (through Oxford Nanopore MinION sequencing as described in 28). Zurita & Zurita Laboratories generated 4 sequences through Illumina sequencing on a MiSeq platform (Illumina, San Diego, CA). Details of sample collection, sequencing and genome assembly are summarised in Table S1. Sample collection dates and the province-level geographical location of residence of the patient were included as metadata for all sequences in the country.
To determine the viral genetic diversity circulating in the country during the sampling period, all sequences from Ecuador were phylogenetically assigned under the global Pango lineage system using the Pangolin v2.2.2 tool (https://virological.org/t/pangolin-web-application-release/482).
Global SARS-CoV-2 data sets
The Ecuadorian virus sequences were analysed in the context of global SARS-CoV-2 genomic diversity by including all high-quality SARS-CoV-2 genome sequences and their accompanying metadata available in GISAID (1) on January 1 2021 (sequences were retained if they were > 29000 nucleotides long and < 5% of the sequence was missing). Sequences without a complete sample collection date or not attributed to human hosts were excluded, yielding a total of 218771 sequences from samples collected from December 1, 2019 up until December 10, 2020.
The large number of SARS-CoV-2 genomes generated during 2020 makes full-scale phylogenomic analyses computationally prohibitive. We therefore subsampled sequences from the abovementioned full data set (i.e. all GISAID sequences included in our analyses, excluding the complete set of Ecuadorian sequences) using two approaches. First, we randomly sampled one sequence per country per day from the full data set over the complete sampling period, to create a “systematically-subsampled data set” (comprised of 8606 sequences). In parallel, we arbitrarily generated three “randomly-subsampled data sets” consisting of 8606 randomly chosen sequences from the full data set, to match the size of the systematically-subsampled data set. These randomly-subsampled data sets were used to evaluate the performance of the background SARS-CoV-2 sequences as the genomic context for the identification of transmission lineages within Ecuador (see the Phylogenetic identification of transmission lineages section below). Finally, we added the sequences from Ecuador to each data set, resulting in a total of 8766 sequences per data set.
Each data set was aligned to the Wuhan-Hu-1 (GenBank accession: MN908947.3) reference genome sequence (29) using Minimap 2.17 (30) to generate multiple sequence alignments. Sites containing >90% gaps relative to the sequences in their respective alignment were masked, whilst the untranscribed terminal regions (UTRs) were trimmed. After masking and trimming, the resulting alignments had a final length of 29,409 nucleotides, with the shortest partial genome sequences being cut down to 28,955 nucleotides long.
Phylogenetic identification of transmission lineages
We followed a similar rationale and methodology to that described in du Plessis et al (15) to identify local transmission lineages. Phylogenetically linked sequences were inferred to have descended from a common ancestor if they were associated with a single inferred introduction event into Ecuador from an international location (5, 15). Ecuadorian transmission lineages therefore correspond to lineages of sequences sampled within the country that descend from a node inferred to have also occurred in Ecuador, which must in turn have descended from outside of the country. Given the unstructured sampling of the Ecuadorian sequences, some transmission lineages will likely correspond to epidemiologically linked cases (i.e. targeted investigation of epidemiological clusters); these have been identified as such in the text whenever the information was available.
Maximum likelihood (ML) phylogenetic trees were estimated from the systematically-subsampled data set and the randomly-subsampled data sets using IQtree 2.1.1 (31) under a GTR+Γ substitution model. Node support was estimated through an SH-like approximate Likelihood Ratio Test (aLRT) (32). The tree for the systematically-subsampled data set was re-rooted by heuristically searching for the root placement that minimises the mean squared residual of a regression of sequence sample date against root-to-tip genetic distance, calculated using TempEst v1.5.3 (33), to maximise the temporal signal of the data set. The same regression was used to assess the clock-like behaviour of the data set.
Subsequent analyses in our pipeline require an evolutionary rate estimation. We performed an exploratory analysis on a random selection of 866 genomes from the systematically-subsampled data set (~10% of the sequences, ensuring that representatives of the earliest and latest collection dates were included) in order to estimate the evolutionary rate of the data set over the sampling period. We used BEAST v.1.10.4 (34) to obtain a clock rate estimate using the HKY substitution model and a strict molecular clock with a continuous-time Markov chain (CTMC) prior (35). We employed a Skygrid coalescent tree prior (36) that accounts for the 50 epidemiological weeks over which the genomes were sampled, plus a cut-off period that precedes the earliest collected SARS-CoV-2 sequences. Independent Markov Chain Monte Carlo (MCMC) chains were run for 40 million steps and subsequently combined after discarding the initial 10% of each run as a burn-in. Parameter convergence was assessed using the effective sample size (ESS) estimates of the combined chains using Tracer v1.7.1 (37).
The systematically subsampled data set was analysed with BEAST v1.10.5 (https://github.com/beast-dev/beast-mcmc/releases/tag/v1.10.5pre_thorney) using a newly implemented method that estimates the tree likelihood at each point of the MCMC chain. This approach takes a data tree (the aforementioned re-rooted ML tree) instead of an alignment and significantly reduces analysis time by using a simple model to rescale the ML phylogeny into a time-calibrated tree (see 38). Under this approach, the likelihood of each branch length is defined as a function of a Poisson distribution with a mean directly proportional to the clock rate (38, 39); we therefore used a rate of 6.28×10−4 substitutions/site/year, based on the median clock rate estimate obtained from our exploratory analysis. We also defined a coalescent Skygrid prior, similar to the one described for the exploratory analysis. A version of the data tree re-scaled to time by BEAST was used as a starting tree, and independent MCMC chains were run for 100 million steps and combined after discarding 10% of each run as burn-in.
To identify nodes associated with transmission lineages in Ecuador, we used a discrete phylogeographic model consisting of a two-state discrete trait analysis (DTA) (40) implemented in BEAST v1.10.4 (34). Tips were assigned to one of two possible states (Ecuador vs non-Ecuador) and reconstruction of ancestral node states was undertaken using an asymmetric substitution model (40). The estimation of the source location for all nodes was performed on an empirical distribution of 500 time-calibrated trees obtained in the previous step. The expected number of DTA transitions between international locations and Ecuador were estimated using a robust counting approach (41). Two independent MCMC chains of 5 million steps each were combined for this analysis, after discarding the first 500000 steps of each run as burn-in. A Maximum Clade Credibility (MCC) tree was generated from the DTA posterior tree distribution by sampling 1000 trees from the combined MCMC runs in TreeAnnotator v.1.10. Each internal node was assigned a posterior probability for its inferred location, and these were used to evaluate uncertainty regarding the assignment of potential transmission lineages in Ecuador.
Transmission lineages and transmission lineage groups
All phylogenetic clusters of sequences from Ecuador were inspected visually on the MCC tree to assign individual transmission lineages. The nomenclature of these country-specific transmission lineages followed a one-letter code in alphabetical order, defined by the earliest sample collection date in each transmission lineage. General features of each transmission lineage were summarised, such as the earliest and latest sample in each lineage, the number of provinces in which the transmission lineage had been identified and the number of sequences belonging to said lineage (used as a proxy of transmission lineage size). The consistency with which sequences were grouped into these transmission lineages was evaluated by visually inspecting the ML trees for the randomly-subsampled data sets and comparing the clusters of Ecuadorian sequences to those from the DTA analysis.
In two instances, transmission lineages within Ecuador clustered together in the DTA analysis into monophyletic groups that also included some international sequences (see Results). This topology includes the possibility that these groups of transmission lineages in fact correspond to single introduction events, misidentified by our analyses as multiple introductions. This might indeed be the case given the variation in intensity of SARS-CoV-2 sampling across countries (6, 9), including Ecuador, and the limited genetic divergence observed in SARS-CoV-2 over the time span being analysed (42). We therefore identified these pairs of transmission lineages under a single letter (highlighted with an asterisk) for summarisation purposes, and evaluated and discussed their individual trajectories further.
Results
SARS-CoV-2 genetic diversity in Ecuador
The samples from laboratory-confirmed individuals obtained across mainland Ecuador (and one sample from the Galápagos Islands) were collected as they became available through different hospitals and laboratories and yielded representative genomes from all provinces in the country (Fig 1A). The number of sequences per province correlates with the cumulative laboratory-confirmed COVID-19 cases (Spearman’s rho = 0.597, p = 0.002) and number of deaths of patients with a positive or suspected-positive COVID-19 PCR test over the sampling period (Spearman’s rho = 0.492, p = 0.015), suggesting that the number of sequences per province is proportional to the number of infections. Despite the limited number of sequences from Ecuador, the representativeness of our sample is similar to that of other countries in the region. We estimate Ecuador produced 8 sequences for every 10000 reported cases or 12 sequences for every 1000 officially reported COVID-19 deaths. This is better than Peru (4 sequences per 10000 cases/10 sequences per 1000 deaths) or Brazil (3 sequences per 10000 cases/10 sequences per 1000 deaths) but less representative than Uruguay (159 sequences per 10000 cases, and a higher number of SARS-CoV-2 genome sequences than reported deaths) (Fig S1, File S1). It should be noted however that these estimates rely on the testing intensities between provinces in Ecuador and between different countries; limited testing in Ecuador could mean that the overall representation is lower than estimable from official reports.
Wide variation in the total numbers of cases across provinces in Ecuador is reflected in variation in number of sequences obtained. While heavily affected provinces such as Pichincha (72,305 confirmed cases until December 10) and Guayas (26,080 confirmed cases) account for larger numbers of sequences (47 and 18 for Guayas and Pichincha respectively), less affected provinces in the southern Highlands (Azuay – 12,670 confirmed cases, and Loja – 7252) and the Amazon (Morona Santiago – 3422, Napo – 1605, Orellana – 2100, Pastaza – 2360 and Zamora Chinchipe – 1628) are represented by few sequences (28 in total). The single sequence obtained from the Galápagos corresponds to the low number of cases there. Manabí province appears to be underrepresented (14,061 confirmed cases) while Imbabura (5695 confirmed cases) and Los Ríos (4707 confirmed cases) are represented by higher numbers of genomes per death (Fig. 1B). Sequence sampling rates for each province (excluding Galápagos) varied between 3 and 86 sequences per 1000 deaths (File S2). The temporal distribution of samples collected in Ecuador during 2020 do not strongly match trends in reported excess deaths. More samples were collected in July and August, but fewer genomes were sampled in the early epidemic months (March to May) despite the high number of excess deaths reported then (Fig 1C). Sequence representation is greater for the coastal provinces during the early months of the epidemic (March to June), when the epicentre of the epidemic was based in the port city of Guayaquil (in the Guayas province, (23), and shifted towards higher sampling in the highlands and Amazon provinces, as the epicentre of the epidemic shifted towards the capital city of Quito (in Pichincha province) and as more cases were reported in the Amazon.
Virus genomes from Ecuador were assigned to specific Pango lineages (43) using the pangolin tool (https://virological.org/t/pangolin-web-application-release/482). The genomes were assigned to 33 different global lineages and predominantly B.1.1.74 (39.4% of all Ecuadorian sequences; Fig. 1D), one of the lineages descended from B.1.1 which became one of the most dominant lineages during the early phase of the pandemic in Europe and North America (after the virus was introduced from Asia; 3, 43). The geographic distribution of the SARS-CoV-2 lineage diversity in the country shows distinctive patterns. While the majority of the lineages observed at low-frequencies in Ecuador were found in Pichincha, the heavily affected (and highly populated) province of Guayas (where Guayaquil is located) exhibits a predominance of the B.1.1.74 lineage (59.1% of all sequences from this province; Fig 1D). B.1.1.74 is also abundant in the provinces of Los Ríos, which neighbours Guayas (49%), and Imbabura, which neighbours Pichincha (38.9%). Other common lineages, such as B.1 (16.9% of all sequences from Ecuador) and B.1.1.1 (6.9% of all sequences from Ecuador), are distributed across various geographical regions.
Identification of Ecuadorian transmission lineages
We undertook exploratory phylogenetic analyses using different sequence subsampling schemes, as the exceptionally large number of available SARS-CoV-2 sequences prevents full analysis of the complete global data set. We estimated maximum likelihood (ML) trees of the Ecuador sequences in the context of different background data sets and performed Bayesian phylogenetic inference on a systematically-subsampled data set. The clustering patterns of Ecuadorian sequences in the ML trees showed some variation between data sets, but in the majority of cases remained consistent (File S3). We therefore derive our results from the systematically subsampled data set and discuss these in light of the randomly subsampled data sets.
We consistently found that a sizeable proportion of sequences from Ecuador did not cluster with other sequences from the country (54/160 sequences for the systematically subsampled data set, 48 to 51/160 for the randomly-subsampled data sets; File S3), and were therefore assigned as singletons and not associated with further virus spread within Ecuador detectable through genomic analysis. These singletons could in fact represent introduction events after which forward transmission in Ecuador did occur but this was not captured by the sample size of this study. We note that the majority of the singleton sequences were collected before mid-July (Fig S2); we speculate that they could represent predominantly early introduction events that occurred before the implementation of a national lockdown on March 16. While it is possible to establish a possible limit of dates on which each singleton was introduced, based on the last ancestral node inferred to have occurred outside of Ecuador, the precise importation date will fall somewhere between the inferred age of this preceding node and the collection date of the singleton sequence. The low sampling density in Ecuador and our subsampling schemes are likely to introduce uncertainty in estimating the age of these nodes and we therefore excluded these analyses from our results.
The remaining sequences (106/160 sequences for the systematically-subsampled data set) fell into two distinct categories. Firstly, 20 monophyletic clusters of Ecuadorian sequences were identified, capturing multiple introduction events and some local viral circulation patterns. These clusters were assigned to be separate Ecuadorian transmission lineages, named A through V (with exceptions detailed in the paragraph below). Each represents a single introduction event of the virus from an international destination, followed by local forward transmission in Ecuador (15).
Secondly, we identified two large monophyletic clusters that included sequences from international locations and Ecuador. These were not identified strictly as individual transmission lineages through our DTA approach, but rather as genetically similar groups of individual transmission lineages (Fig S3–S9). These results were likely driven by the phylogenetic placement of the non-Ecuadorian sequences resulting in the ancestral nodes being inferred to have existed outside of Ecuador (Fig 2A). While there is a possibility that these in fact represent multiple closely related yet independently introduced transmission lineages, we here label them as transmission lineage groups (highlighted with an asterisk, D* and H*) for summary purposes.
Table 1 provides details for each Ecuador transmission lineage (named sequentially according to the collection date of the earliest sequence in each lineage). We identify 82 (95% HPD: 81-84) SARS-CoV-2 introduction events from other countries into Ecuador through a robust counting approach (41). This estimate assumes that transmission lineage groups D* and H* are comprised of 2 and 3 individual transmission lineages respectively (with an additional singleton inferred as part of H*; Fig 2A). The detection lag (defined as the number of days between the inferred transmission lineage TMRCA and its earliest sampled sequence) ranged between 1 and 140 days (Table 1), with a median of 16 days (IQR: 7-31 days; Fig 2B, Fig S10).
Size and persistence of transmission lineages
Initial molecular clock analyses showed our data set contained strong temporal signal overall, although many sequences from Ecuador showed lower than average genetic divergence from the root (Fig 2C). The inferred TMRCAs of Ecuadorian transmission lineages ranged from February to November 2020 (Table 1); from this list, transmission lineages C and S are composed of pairs of sequences that share an epidemiological link.
The TMRCAs estimated for the two large transmission lineage groups D* and H* are the earliest in our data, however, these might not represent true lineage ancestors within Ecuador, because each group could represent more than one introduction from other countries. After excluding these larger groups, we still identified six transmission lineages for which the 95% HPDs of the TMRCA include a time point that predates the implementation of the national lockdown, set on March 16. Therefore, these transmission lineages likely correspond to introduction events that occurred before restrictions on incoming international flights were adopted in Ecuador. An additional four transmission lineages have TMRCA estimates between late March and November 2020. These may correspond to more recent introductions (following the progressive relaxation of the lockdown in Ecuador between May and September). The most likely exceptions to this are transmission lineages C (TMRCA: 2020.2412, 95% HPD: 2020.2328-2020.2446) and M (TMRCA: 2020.3278, 95% HPD: 2020.1864-2020.4187). Incomplete sampling of these lineages and detection lags could result in the date of introduction being substantially earlier than the date of the TMRCA (44), which would place the introduction date for these transmission lineages prior to the implementation of the lockdown.
Transmission lineages varied in size from sequence pairs (transmission lineages C, J, K, Q, R, S, T and U) to larger clusters of 16 to 21 sequences (transmission lineage group D*, depending on whether D* is considered as a single lineage or as multiple lineages). The number of sequences in each transmission lineage was correlated with the number of days between the earliest and most recent sampling dates of sequences within the lineage (assuming that D* and H* are composed of multiple individual transmission lineages each). However, this result could be driven by the single largest transmission lineage in the data set. A similar pattern is observed when considering the time between the inferred TMRCA and the most recent sampled sequence of each transmission lineage (equivalent to the persistence time plus the detection lag; Fig S11). We also observed that lineages that were detected earlier tended to be larger (contain more sequences) and persisted longer but were not more geographically widespread (Fig 3A; Fig S12–S14). However, transmission lineages first detected between June and August did appear to be found in a greater number of provinces (Fig 3A). A similar pattern is observed when considering the TMRCA of each transmission lineage, where a greater persistence was also observed for lineages with a TMRCA between May and July (Fig 3A; Fig S13).
Geographical distribution of transmission lineages
Singletons and transmission lineages were found across multiple provinces and regions of Ecuador (Fig 3B-3C). Singletons represent an important proportion of the sequences in various provinces across central Ecuador ranging between 33.3% in Tungurahua and 52.3% in Guayas (Fig 3B), an observation that is particularly important for provinces with large numbers of sequences (Fig S14). On the other hand, different transmission lineages were found either in single provinces or across multiple regions (Fig 3C). The large transmission lineage groups D* and H* include sequences from provinces across three geographical regions each (the coastal region, the highlands and the Amazon region), and potentially show internal seeding events of the virus across provincial boundaries (Fig 2A, 3C). Even when accounting for the possibility that these lineage groups are comprised of multiple transmission lineages, sequences from different provinces and regions clustered together (Fig 2A).
Consistent with the spatio-temporal sampling patterns (Fig 1C), the older transmission lineages (shown in darker blue in Fig 3B) were identified predominantly in provinces on the coast and highland regions, while younger transmission lineages (shown in lighter blue in Fig 3B) have been identified in specific provinces in the north and more broadly in the south. The first epicentre of the COVID-19 epidemic in Ecuador, the province of Guayas, is represented by a high frequency of singleton lineages, with a high diversity of individual transmission lineages first identified at different times during 2020. A similar pattern is observed for the second epicentre of the epidemic, the province of Pichincha, but with fewer different transmission lineages and less representation of the youngest transmission lineages (Fig 3A). We note that these patterns could be affected by differences in the number of sequences available for each province (Fig 1A; Fig S15).
Discussion
The early weeks of the COVID-19 epidemic in Ecuador were characterised by a severe spike in the number of cases in city of Guayaquil, the largest in the country located in the province of Guayas, and by high attack rates (i.e. new cases in a population at risk divided by the size of that population at risk) across various coastal provinces (24). The outbreak overwhelmed local healthcare systems, resulting in one of the highest excess death rates in the world during early 2020 (23). Information about the importation of SARS-CoV-2 into Ecuador and the domestic spread of the virus is needed to explain the drastic effects of the pandemic in the country during March and April 2020, and to explain the large difference in disease burden between Guayaquil and the capital, Quito.
An important determinant of the early dynamics of COVID-19 outbreaks has been human mobility and the number of introduction events of the virus into a new location with an immunologically naïve population, as was observed during the early stages of the emergence of SARS-CoV-2 in China (45). The large proportion of singletons observed in Guayas and various other coastal provinces, particularly given their tendency to occur early in the epidemic (Fig S2), could be suggestive of multiple independent introduction events with limited forward transmission. This would also explain why the earliest sequences in the local transmission lineages and the sequences assigned to the most common Pango lineage in Ecuador (B.1.1.74) were predominantly sampled in coastal provinces during the early weeks of the epidemic (Fig 2A, Fig S16).
Two additional factors support the hypothesis that Guayas played an important role in seeding of viral transmission to other regions in Ecuador: i) the city of Guayaquil hosts the second busiest international airport in the country and one of only two in the coastal region (the second international airport located in the province of Manabí hosts limited flights to a few international destinations; 46), and ii) the overall timing of the seeding events (which necessarily have to predate the inferred TMRCA of a lineage) corresponds to the school holiday period in the coastal region (February to April), when international travel and large social gatherings are more likely to occur. The inferred TMRCAs, which can serve as an upper bound for the true introduction dates of the largest lineages (assuming these are better represented with the available sequences), show that these transmission lineages most likely arrived in Ecuador before the date when NPIs were implemented and before travel restrictions came into place.
The genetic diversity of SARS-CoV-2 can be quantified using the Pango dynamic nomenclature system. Pango lineages reflect the history of significant events in the epidemic and geographic spread of the virus (47) and can be used to explore likely source locations of virus importations, for example, the high representation of lineages descended from B.1.1 in the province of Guayas (Fig 1D). The emergence of B.1.1 in Europe and North America around February 2020 (43) might suggest these regions contributed to seeding the epidemic in this province of Ecuador (despite the limitations on identifying exact source locations due to poor surveillance in many countries). B.1.1.74, the most prevalent lineage in Ecuador, is descended from B.1.1 and was more frequently sampled in Guayas during the early months of the epidemic (Fig S16), further revealing the role of locations where B.1.1 was prevalent as importation sources during the epidemic’s first wave. However, the high proportion of singletons observed in Guayas also suggests that onward transmission of introduced virus was less common. Insights from regions sampled at very high intensities, such as the United Kingdom, show that the majority of introductions lead to small, transient, dead-end transmission lineages, whereas a smaller number of introductions lead to larger and longer-lasting transmission lineages (15). If this phenomenon is a general property of the first wave of SARS-CoV-2 transmission, as appears to be the case given similar early observations in Brazil (5), Panama (48), Uruguay (49), Spain (50) and the Netherlands (51), we can expect that many of the introductions to Guayas led to few new cases, and that most of the ongoing transmission was derived from only a few introductions.
The larger transmission lineages identified here suggest that virus transmission was high between neighbouring and well-connected provinces. This might have been an important factor determining the transmission dynamics between the main cities in the country. In contrast to Guayaquil, Quito is the second-largest city in the country and presented a much less severe first epidemic wave, despite hosting the busiest international airport in the country. The city is located in the province of Pichincha, which exhibits a large proportion of singletons but fewer distinct transmission lineages overall. Pichincha also shows a lower diversity of global lineages and fewer different transmission lineages; furthermore, the transmission lineages observed in Pichincha are mostly not shared with Guayas. This suggests that either independent international introductions or domestic seeding events likely drove the early epidemic in Pichincha. Our phylogenetic analyses suggest that some transmission lineages in Pichincha were introduced from other provinces (Fig 2A, e.g. the monophyletic Pichincha clade in lineage group D*; Fig S6), suggesting that domestic travel might have played an important role in the establishment of SARS-CoV-2 transmission in this region. While it is also possible that a later burst of introductions of new lineages occurred from locations abroad, once international travel and lockdown measures were lifted, the sampling dates of singleton genomes from Pichincha (between March and July) suggest that introductions into Pichincha likely occurred before the lockdown rather than during the relaxation of NPIs. The Pichincha singletons account for the earliest sequences of this kind in our data set, but could represent instances where limited or no additional spread occurred following their introduction.
Ultimately, more comprehensive analyses on the sources and drivers of transmission would require a deeper sampling of key locations where transmission was high, and the inclusion of complementary data sources such as real-time mobility and transportation data, could provide a better overview of the forces shaping the observed viral genetic diversity in Ecuador. Provinces such as Azuay, Guayas and Pichincha represent the main air travel entry points, but the role of land mobility across the northern border with Colombia and the southern border with Peru should also be considered to further understand the role of other Latin American countries in regional viral transmission.
Our analysis highlights some important patterns but is limited by various factors. Most notably, the number of genome sequences in our study, although large by historical standards, is small compared to the new expectations set for virus genome sequencing during the COVID-19 pandemic. The sample size restricts our ability to infer further details about local virus transmission dynamics from sequence data alone. The trajectories of individual lineages, from their international sources to their spread across the country and their subsequent local circulation, are best analysed from larger data sets, or in conjunction with additional data sources to manage and ameliorate the potential consequences of sampling biases (10, 52). Incorporating data from self-reported travel histories and human mobility can help to maximise the utility of smaller samples, collected in settings where the sequencing of large numbers of genomes lies beyond local technological or financial capacity, or where high sampling densities are unfeasible (e.g. in remote locations).
The first year of the COVID-19 pandemic has shown how global connectivity plays a key role in the development of national epidemics caused by respiratory viruses, reminiscent of other pathogens such as Influenza viruses (16). Our results from Ecuador showcase the relevance of importations in establishing local viral circulation and the potential consequences of interprovincial mobility for highly connected locations. In particular, it shows that importations have been a common occurrence even after the implementation of lockdown measures and travel restrictions, and that seeding events across provinces can occur frequently. While air travel is limited between provinces, the connectivity provided by land travel can serve as a means for pathogen spread, highlighting the vulnerability of highly connected and remote locations alike. The notion that two large cities with busy international airports can manifest such different transmission dynamics and viral genetic diversity is also relevant, as it shows that a combination of multiple factors determines the outcome of an epidemic within a specific location. Ultimately, the type of interventions chosen to mitigate this high degree of connectivity, the necessity of an early implementation of these interventions and the adherence to these by the general population are paramount in determining their efficacy.
Data Availability
The newly generated sequences have been publicly shared through the GISAID platform. The data sets (including Ministry of Health and National Institute of Statistics and Census data) and code used to generate the analyses presented in this study, as well as a list of accession numbers for the Ecuador SARS-CoV-2 genome sequences analysed here are available through GitHub at https://github.com/BernardoGG/SARS-CoV-2_Genomic_lineages_Ecuador. We thank Andres N. Robalino and Carlos Oporto for their contribution to the open licence Ecuacovid repository (https://github.com/andrab/ecuacovid) used to extract epidemiological data for this study.
https://github.com/BernardoGG/SARS-CoV-2_Genomic_lineages_Ecuador
Data availability
The newly generated sequences have been publicly shared through the GISAID platform. The data sets (including Ministry of Health and National Institute of Statistics and Census data) and code used to generate the analyses presented in this study, as well as a list of accession numbers for the Ecuador SARS-CoV-2 genome sequences analysed here are available through GitHub at https://github.com/BernardoGG/SARS-CoV-2_Genomic_lineages_Ecuador. We thank Andrés N. Robalino and Carlos Oporto for their contribution to the open licence Ecuacovid repository (https://github.com/andrab/ecuacovid) used to extract epidemiological data for this study.
Supplementary Information
Acknowledgements
We would like to thank all the clinical personnel from numerous public and private healthcare institutions who provided access to laboratory-confirmed samples to generate SARS-CoV-2 genomic sequences. We also thank the Centre for Arbovirus Discovery, Diagnosis, Genomics and Epidemiology (CADDE), Eva Harris from the A2CARES network and Lourdes Torres, Diego Quiroga and Stella de la Torre for their methodological, logistical and financial support (through the USFQ Emergency Grants) towards the sequencing of viral genomes. We are grateful to all laboratories and institutions worldwide involved in the generation of virus genome data shared on GISAID. Financial support was provided by the Clarendon Fund and the Department of Zoology of the University of Oxford (D.S.C.); WT fellowship 204311/Z/16/Z and MRC-FAPESP awards MR/S0195/1 and 18/14389-0 (N.R.F.); Branco Weiss Fellowship and EU grant 874850 MOOD (M.U.G.K.); WT Collaborators Award 206298/Z/17/Z (J.T.M.); Leverhulme Trust ECR Fellowship ECF-2019-542 (M.E.Z.); the Oxford Martin School (O.G.P., M.U.G.K. and L.d.P.); and the NIH Global Health Equity Scholars award FIC D43TW010540 (P.C.). The authors report no conflicts of interest.