Author: Olivia Sochan


0. Installing Required Packages

Note: This set of code has already been run for you. You do not need to add it. You can silence the code using hashtags.

install.packages("ape")
install.packages("dplyr")
install.packages("phytools")
install.packages("tidyr")

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ggtree")
## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'ggtree'

1. Initiate the Project

library(ape)
library(dplyr)
library(ggplot2)
library(ggtree)
library(phytools)
library(tidyr)

2. Where Did COVID-19 Come From?

Coronaviruses are common pathogens in humans and mostly have benign effects (e.g., the common cold). But some past coronavirus strains have caused substantial illness in humans, including MERS (Middle East Respiratory Syndrome), which caused a limited outbreak in 2012, and SARS (Severe Acute Respiratory Syndrome), which caused a limited outbreak in 2002/2003. The ubiquity of coronaviruses raises questions about the origins of SARS-CoV-2, which is causing the current COVID-19 pandemic.


2.1. Hypothesis

Based on the information above and what you know about the evolution of HIV, phrase a hypothesis about the potential evolutionary origins of SARS-CoV-2.

SARS-CoV-2 likely emerged through natural zoonotic evolution, similar to HIV’s origin from primate viruses. Ancestral bat coronaviruses may have recombined and adapted in an intermediate animal—or during undetected early human infections—before spreading efficiently among humans in late 2019. This process of gradual mutation, recombination, and host adaptation is consistent with known viral evolution patterns observed in other pandemics.


2.2. Identifying the Evolutionary Origins of SARS-CoV-2

Phylogenetic analyses allow us to examine how organisms are related, and in the case of SARS-CoV-2, where its evolutionary origins might lie. In the modern age of DNA sequencing, virus genomes from the current pandemic have become rapidly available. In the section below, you will visualize a phylogenetic tree that is based on whole genome sequencing of different SARS-CoV-2 isolates, isolates of SARS-CoV (which caused the outbreak in 2002/2003), as well as other SARS-like coronaviruses that were isolated from a variety of host species. We will walk you through the visualization and annotation of the tree step by step in the following section.


2.2.1. Importing and Plotting a Phylogenetic Tree

Phylogenetic trees come in a wide variety of formats that are not easily readable with standard programs. In this exercise, we will use trees that are encoded in the Newick format (nwk file ending), and R can read and interpret such trees with the read.tree() function from the ape package. To load the phylogeny of the different coronaviruses, you can simply execute the code below, and a object cov.tree should show up in your Global Environment.

#This line simply reads in the phylogenetic tree (akin to the read.csv function for reading in tables)
cov.tree <- read.tree("coronaviruses_tree.nwk")

Once your tree is imported into R, you can visualize it using the ggtree() function from the ggtree package. ggtree is an extension of ggplot2, which you have already worked with. You only have to specify the data frame containing the phylogenetic tree (which we named cov.tree above). Like with ggplot(), you can add graphical elements using the plus sign. For example, the names of the different viral isolates are added to the tips of the phylogeny using geom_tiplab().

#You can plot phylogenetic trees with the ggtree function (it works just like ggplot). Note that "size=2" associated with geom_tiplab just modifies the font size
p1 <- ggtree(cov.tree) + geom_tiplab(size=2)
p1

At this point, we can see the tree structure. However, the tip labels in this case have limited information, because they are mostly just weird acronyms that researchers used to designate their samples. To address our question above, we need to add some additional information to our tree.


2.2.2. Importing Metadata

Files containing phylogenetic trees only include the names of the taxa included and the length and topology of the branches that connect them. Additional information about each sample (the so-called metadata) is stored in a separate file. You will see that the first column in the metadata file corresponds to the tip labels in the phylogeny. The metadata for this tree includes the country of origin, the host species, virus type, the authors that collected the original data, and the accession (a unique code that allows you to find the specific sample in designated databases). As other datasets you have worked with, the metadata is stored in a simple table that you can import with the read.csv() function.

#This line reads the metadata of samples included in the phylogenetic tree
cov.tree.meta<- read.csv("coronaviruses_tree_meta.csv")
cov.tree.meta
##                         Strain   country     host    virus.type         Author
## 1              bat_SL_CoVZXC21     China      Bat SARS-like CoV     Hu et al A
## 2               bat_SL_CoVZC45     China      Bat SARS-like CoV     Hu et al A
## 3    pangolin/Guangxi/P4L/2017     China Pangolin SARS-like CoV    Jiang et al
## 4    pangolin/Guangxi/P3B/2017     China Pangolin SARS-like CoV    Jiang et al
## 5    pangolin/Guangxi/P1E/2017     China Pangolin SARS-like CoV    Jiang et al
## 6    pangolin/Guangxi/P5L/2017     China Pangolin SARS-like CoV    Jiang et al
## 7    pangolin/Guangxi/P5E/2017     China Pangolin SARS-like CoV    Jiang et al
## 8    pangolin/Guangxi/P2V/2017     China Pangolin SARS-like CoV    Jiang et al
## 9  pangolin/Guangdong/P2S/2019     China Pangolin SARS-like CoV    Jiang et al
## 10      bat/Yunnan/RaTG13/2013     China      Bat SARS-like CoV      Zhu et al
## 11      bat/Yunnan/RmYN02/2019     China      Bat SARS-like CoV     Zhou et al
## 12                USA/CA1/2020       USA    Human    SARS-CoV-2   Uehara et al
## 13               Nepal/61/2020     Nepal    Human    SARS-CoV-2      Sah et al
## 14                USA/IL1/2020       USA    Human    SARS-CoV-2      Tao et al
## 15              Sweden/01/2020    Sweden    Human    SARS-CoV-2  Bengner et al
## 16             Wuhan-Hu-1/2019     China    Human    SARS-CoV-2  Zhang et al A
## 17            Wuhan/WIV05/2019     China    Human    SARS-CoV-2     Zhou et al
## 18         Japan/KY-V-029/2020     Japan    Human    SARS-CoV-2 Sekizuka et al
## 19                      HKU3_7                Bat SARS-like CoV    Lau et al A
## 20                      LYRa11     China      Bat SARS-like CoV       He et al
## 21      bat/Yunnan/RmYN01/2019     China      Bat SARS-like CoV     Zhou et al
## 22                    YNLF_34C     China      Bat SARS-like CoV    Lau et al B
## 23                       Rs672     China      Bat SARS-like CoV     Yuan et al
## 24                      Rs4081     China      Bat SARS-like CoV     Hu et al B
## 25                      As6526     China      Bat SARS-like CoV     Hu et al B
## 26                      Rs4247     China      Bat SARS-like CoV     Hu et al B
## 27                      Rs4237     China      Bat SARS-like CoV     Hu et al B
## 28                      Rf4092     China      Bat SARS-like CoV     Hu et al B
## 29                      Rs7327     China      Bat SARS-like CoV     Hu et al B
## 30                      Rs9401     China      Bat SARS-like CoV     Hu et al B
## 31                    RsSHC014     China      Bat SARS-like CoV     Ge et al A
## 32                      Rs4084     China      Bat SARS-like CoV     Hu et al B
## 33                        WIV1     China      Bat SARS-like CoV     Ge et al B
## 34                      Rs3367     China      Bat SARS-like CoV     Ge et al A
## 35                      Rs4231     China      Bat SARS-like CoV     Hu et al B
## 36                    civet007     China    Civet      SARS-CoV   Wang et al A
## 37                     PC4_227     China    Civet      SARS-CoV     Song et al
## 38                    civet020     China    Civet      SARS-CoV   Wang et al A
## 39                        A001     China    Civet      SARS-CoV   Wang et al B
## 40                      GZ0402     China    Human      SARS-CoV     Song et al
## 41                      GZ0401     China    Human      SARS-CoV     Song et al
## 42                        BJ04     China    Human      SARS-CoV       Wu et al
## 43                        BJ01     China    Human      SARS-CoV       Wu et al
## 44                      Sin852 Singapore    Human      SARS-CoV     Vega et al
## 45                      Sin842 Singapore    Human      SARS-CoV     Vega et al
## 46                      Sin846 Singapore    Human      SARS-CoV     Vega et al
## 47                    Sino1_11     China    Human      SARS-CoV  Zhang et al B
## 48                         TWH    Taiwan    Human      SARS-CoV    Shu et al A
## 49                         TWS    Taiwan    Human      SARS-CoV    Shu et al B
##         Accession
## 1        MG772934
## 2        MG772933
## 3  EPI_ISL_410538
## 4  EPI_ISL_410543
## 5  EPI_ISL_410539
## 6  EPI_ISL_410540
## 7  EPI_ISL_410541
## 8  EPI_ISL_410542
## 9  EPI_ISL_410544
## 10 EPI_ISL_402131
## 11 EPI_ISL_412977
## 12       MN994467
## 13       MT072688
## 14       MN988713
## 15       MT093571
## 16     WH-Human_1
## 17       MN996529
## 18       LC522972
## 19       GQ153542
## 20       KF569996
## 21 EPI_ISL_412976
## 22       KP886809
## 23       FJ588686
## 24       KY417143
## 25       KY417142
## 26       KY417148
## 27       KY417147
## 28       KY417145
## 29       KY417151
## 30       KY417152
## 31       KC881005
## 32       KY417144
## 33       KF367457
## 34       KC881006
## 35       KY417146
## 36       AY572034
## 37       AY613950
## 38       AY572038
## 39       FJ959407
## 40       AY613947
## 41       AY568539
## 42       AY279354
## 43       AY278488
## 44       AY559082
## 45       AY559081
## 46       AY559094
## 47       AY485277
## 48       AP006557
## 49       AP006560

2.2.3. Annotating the Phylogenetic Tree

We can use the metadata to annotate the phylogenetic tree in a way that allows us to address our central question. In the example below, I am using the variable virus.type to color the tip of the phylogeny (adding geom_tippoint()), and rather than labeling the tips with the name of the strain, I am labeling them with the host species the virus was isolated from (using geom_tiplab()).

#This line of code simply plots the same tree as above, but without the sample label
p2 <- ggtree(cov.tree)
## Warning in fortify(data, ...): Arguments in `...` must be used.
## ✖ Problematic arguments:
## • as.Date = as.Date
## • yscale_mapping = yscale_mapping
## • hang = hang
## ℹ Did you misspell an argument name?
#This codes adds the metadata to the tree
p2 %<+% cov.tree.meta + 
  geom_tiplab(aes(label=host), size=2) +
  geom_tippoint(aes(color=virus.type), na.rm = TRUE)


2.3. Interpretation

What does the phylogeny tell you about the evolutionary origins of SARS-CoV-2, which causes the current pandemic? Discuss the findings in the context of your hypothesis.

The phylogenetic tree shows that SARS-CoV-2 is most closely related to bat coronaviruses and more distantly related to SARS-CoV and pangolin viruses. This suggests that SARS-CoV-2 likely evolved from a bat coronavirus, possibly involving an intermediate animal before infecting humans. These findings support the hypothesis that SARS-CoV-2 emerged through natural zoonotic evolution rather than from a human source.


3. Molecular Clock: How Fast Do Mutations Accumulate?

Unlike most previous disease outbreaks, our ability to sequence a large number of virus genomes has allowed us to literally observe the evolution of the pathogen live. Visit the the Nextstrain graphical interface to see how the pathogen spread across the globe and how mutations accumulate as different viral clades diversify.

The neutral theory predicts that mutations steadily accumulate during the course of evolution. In this section, you will analyze data from 2,625 whole-genome sequences of SARS-CoV-2 that researchers have collected all over the world. The data set includes the original isolate from December 26 2019 and representative isolates up until July 6 2022.


3.1. COVID-19 Genomes

In the first step, let’s visualize the phylogenetic relationships among all of these strains in the context of where they were collected. Hold your breath… this will be a mess:) No surprise when you plot over 2,500 data points at once!

Note this uses the same approach as above. You import the tree and the metadata. Then you plot the tree and add the metadata information to the tree. The code below is complete and should run automatically. Note that the scale bar represents the number of mutations between isolates.

#Load the phylogenetic tree
cov19.tree <- read.tree("nextstrain_ncov_open_global_6m_tree.nwk")

#Load the corresponding metadata
cov19.tree.meta <- read.csv("nextstrain_ncov_open_global_6m_metadata.csv")

#Visualize the tree and color the tips by region
#Note that I am using a fan layout here rather than the traditional layout, because it spreads out the tips more
#Depending on your monitor size, you may want to adjust the size of the tip points using the size parameter (currently set to 1.5)
p3 <- ggtree(cov19.tree, layout="fan", open.angle=10)  + geom_treescale(offset=20)
p3 %<+% cov19.tree.meta +
  geom_tippoint(aes(color=region), na.rm = TRUE, size=1.5)


3.2. Quantifying Mutations

To examine how mutations accumulate during virus evolution, we need to tally up the number of nucleotide substitutions for each isolate relative to the original virus. Using the phylogenetic data we have available, this is really simple, because we can sum up the length of branches that separates each isolate from the root of the tree (i.e., the original virus sequence).

The code you need to do this is really simple; it requires a single function (distRoot()). However, I do not recommend that you run this on your computer! For 2,625 genomes, this calculation takes quite a while, and depending on your hardware, it might just crash your computer. So, I already ran this analysis for you, and you can simply import the results in the next section.

##YOU DO NOT NEED TO RUN THIS! THIS TAKES A LONG TIME TO RUN. THIS CODE IS HERE FOR ILLUSTRATION ONLY, IN CASE YOU ARE INTERESTED HOW IT WORKS

#library(adephylo)
#dist <- distRoot(cov19.tree, method = "patristic")
#ncov_mutations <- merge (cov19.tree.meta, dist, by.x = "strain", by.y = 0)
#ncov_mutations$date2 <- as.Date(ncov_mutations$date, format = "%Y-%m-%d")
#ref <- as.Date("2019-12-26", format = "%Y-%m-%d")
#ncov_mutations$time <- ncov_mutations$date2-ref

3.3 Plot the Data

If we want to know how fast SARS-CoV-2 accumulates mutations over time, we can estimate the genome-wide molecular clock (i.e., the rate at which mutations occur over time). This can be done by plotting the number of mutations separating any isolate from the root as a function of time and then calculating the slope of a regression line between time (in days) and the number of mutations. We will do this using virus isolates for the first two years of the pandemic.

First, let’s import and plot the data using a scatter plot. The data file “ncov_mutations.csv” includes the number of of mutations separating any isolate from the root (calculated from the phylogeny as described above) as well as the number of days that passed between the collection of the original isolate. Remember, to draw a regression line, you can add geom_smooth(method=lm, formula= y~0+x, se=FALSE) to the code that makes a scatter plot.

#Import data
cov19.mutations <- read.csv("ncov_mutations.csv")

#Select first two years of data
cov19.mutations2 <- cov19.mutations[which(cov19.mutations$DaysSinceRoot<730),]

#Plot data
ggplot(cov19.mutations2, aes(x=DaysSinceRoot, y=Mutations)) + 
  geom_point() +
  geom_smooth(method="lm", se=FALSE, formula=y~x-1) + 
  theme_classic() + 
  xlab("DaysSinceRoot") + 
  ylab("Mutations")


3.4. Estimate the Slope

The regression line essentially represents the average at which rate mutations accumulate. Numerically, the number of mutations per day (i.e., the slope of the regression line in the graph above) can be calculated with the code below.

#You can run a regression with the lm function, designating the relationship between the variables as "y ~ 0 + x". The 0 indicates that the regression line must go through the origin.
regression <- lm(cov19.mutations2$Mutations ~ 0 + cov19.mutations2$DaysSinceRoot)

#The slope of the regression (in number of mutations per day) can be found by calling regression$coefficients[1]. Multiplying that number by 365 gives us a substitution rate per year.
mutations.per.year <- 365*regression$coefficients[1]
mutations.per.year
## cov19.mutations2$DaysSinceRoot 
##                       23.89179

3.5. Interpretation


3.5.1. General Patterns

As you evaluate the analyses in this section, what do you notice? What happened to all the viruses that were highly similar to the original samples and common in the first 100 days of the pandemic?

The viruses that were highly similar to the original samples—those with few or no mutations—were common early in the pandemic (first 100 days) but disappeared over time as new mutations accumulated. This pattern suggests that the original viral lineages were gradually replaced by descendant variants that carried advantageous or neutral mutations, reflecting the ongoing natural evolution and diversification of SARS-CoV-2 in the human population.


3.5.2. Application of the Molecular Clock

The calculation of the rate of mutation accumulation was really straightforward for the first two years of the pandemic. However, let’s plot the data for the full range of the dataset. I am color coding the data for the first two years separately:

#Here, I am just making a new variable for first two years or not
cov19.mutations$group <- ifelse(cov19.mutations$DaysSinceRoot<730, "Early", "Late")

#Plot data
ggplot(cov19.mutations, aes(x=DaysSinceRoot, y=Mutations, color=group)) + 
  geom_point(aes()) +
  geom_smooth(data=subset(cov19.mutations,group=="Early"), method="lm", se=FALSE, formula=y~x-1, color="black") + 
  theme_classic() + 
  xlab("DaysSinceRoot") + 
  ylab("Mutations")

With your knowledge of the mutation rate, how do you interpret the results? Why are there apparent jumps in the number of mutations of isolates past day 730?

The graph shows that SARS-CoV-2 gained mutations at a steady rate over time, which matches its normal mutation rate of about 1–2 mutations per month. The sudden jumps in mutations after around day 730 happened when new variants appeared that had already built up many changes before spreading. These jumps show how new, more evolved versions of the virus replaced earlier ones.


4. Distribution of Mutations and Positive Selection

Now you saw how SARS-CoV-2 has continuously accumulated mutations as it was spreading across the globe. The next question of course is where in the viral genomes do these mutations occur?


4.1. Distribution of Mutations

Using the 2,625 genomes that were sequenced, we can literally tally up where in the genome we see the variable sites. Here, we are providing you with a data set that measured the genetic diversity at every site in the SARS-CoV-2 genome. Diversity is measured as the number of observed mutations per codon (events).

To properly visualize this type of data, it requires some R wizardry. You don’t need to know the details of how this code works, just press play and focus on interpreting the results of the graph. If you are interested in unpacking the code line by line, I’m happy to meet with you outside of class.

#Import data
diversity <- read.csv("nextstrain_ncov_open_global_6m_diversity_aa.csv")

diversity2 <- diversity %>% 
  group_by(gene) %>% 
  complete(position = 1:max(position), fill = list(events = 0))

#This is so
diversity3 <- diversity2 %>% 
  
  # Compute chromosome size
  group_by(gene) %>% 
  summarise(chr_len=max(position)) %>% 
  
  # Calculate cumulative position of each chromosome
  mutate(tot=cumsum(chr_len)-chr_len) %>%
  select(-chr_len) %>%
  
  # Add this info to the initial dataset
  left_join(diversity, ., by=c("gene"="gene")) %>%
  
  # Add a cumulative position of each SNP
  arrange(gene, position) %>%
  mutate( BPcum=position+tot)

axisdf = diversity3 %>% group_by(gene) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 )


ggplot(diversity3, aes(x=BPcum, y=events)) + 
  geom_point( aes(color=as.factor(gene))) + scale_color_manual(values = rep(c("red", "black"), 7 )) + 
  scale_x_continuous(label = axisdf$gene, breaks= axisdf$center) +
  coord_flip()+
  theme_classic() +
  xlab("Gene") +
  ylab("Diversity") +
  theme(legend.position = "none")

As you evaluate the graph, what do you notice? How do you interpret these results in the context of neutral evolution?

The graph shows that some viral genes, like ORF8 and ORF7b, have high genetic diversity, while others like M and N are much more conserved. This pattern suggests that mutations in less essential genes accumulate through neutral evolution—random changes that don’t affect the virus much. In contrast, core genes stay stable because harmful mutations are weeded out by natural selection.


4.2. Detecting the Footprint of Selection

Applying the principles of the neutral theory, we can also identify what parts of the viral genome are likely under selection. In the data set below, researchers quantified ⍵ for some of the larger genes in the SARS-CoV-2 genome. ⍵ is defined as dN/dS, so it is the ratio of rate of non-synonymous mutations to the rate of synonymous mutations. Plot the values of ⍵ for the different genes using a bar graph.

#Import data
dn.ds <- read.csv("ncov_dnds.csv")

ggplot(dn.ds, aes(x=gene, y=w)) +
    geom_bar(stat="identity") +
    theme_classic() +
    xlab("gene") +
    ylab("fitness")

4.3. Interpretation

Based on the information, what genes do you think are under positive selection? What do the results mean for our understanding of COVID-19?

The genes under positive selection are mainly the spike (S) gene, and possibly ORF1b and N. These genes are changing faster, likely because the virus is adapting to the human immune system and vaccines. This means SARS-CoV-2 is still evolving, so tracking these genes helps us stay prepared for new variants.


5. Resources


5.1. Data References

Phylogenetic trees and genetic diversity data of SARS-CoV-2 and other betacoronaviruses were retrieved from Nextstrain:

For additional information on Nextstrain, see:

Data for DN/DS ratios were kindly provided by Manuela Sironi and Diego Forni based on the following work:


5.2. Resources You Consulted

Consulting additional resources to solve this assignment is absolutely allowed, but failure to disclose those resources is plagiarism. Please list any collaborators you worked with and resources you used below or state that you have not used any.

Dr.Bonner