Note: This set of code has already been run for you. You do not need to add it. You can silence the code using hashtags.
install.packages("ape")
install.packages("dplyr")
install.packages("phytools")
install.packages("tidyr")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ggtree")
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'ggtree'
library(ape)
library(dplyr)
library(ggplot2)
library(ggtree)
library(phytools)
library(tidyr)
Coronaviruses are common pathogens in humans and mostly have benign effects (e.g., the common cold). But some past coronavirus strains have caused substantial illness in humans, including MERS (Middle East Respiratory Syndrome), which caused a limited outbreak in 2012, and SARS (Severe Acute Respiratory Syndrome), which caused a limited outbreak in 2002/2003. The ubiquity of coronaviruses raises questions about the origins of SARS-CoV-2, which is causing the current COVID-19 pandemic.
Based on the information above and what you know about the evolution of HIV, phrase a hypothesis about the potential evolutionary origins of SARS-CoV-2.
SARS-CoV-2 likely emerged through natural zoonotic evolution, similar to HIV’s origin from primate viruses. Ancestral bat coronaviruses may have recombined and adapted in an intermediate animal—or during undetected early human infections—before spreading efficiently among humans in late 2019. This process of gradual mutation, recombination, and host adaptation is consistent with known viral evolution patterns observed in other pandemics.
Phylogenetic analyses allow us to examine how organisms are related, and in the case of SARS-CoV-2, where its evolutionary origins might lie. In the modern age of DNA sequencing, virus genomes from the current pandemic have become rapidly available. In the section below, you will visualize a phylogenetic tree that is based on whole genome sequencing of different SARS-CoV-2 isolates, isolates of SARS-CoV (which caused the outbreak in 2002/2003), as well as other SARS-like coronaviruses that were isolated from a variety of host species. We will walk you through the visualization and annotation of the tree step by step in the following section.
Phylogenetic trees come in a wide variety of formats that are not
easily readable with standard programs. In this exercise, we will use
trees that are encoded in the Newick format (nwk file ending), and R can
read and interpret such trees with the read.tree() function
from the ape package. To load the phylogeny of the
different coronaviruses, you can simply execute the code below, and a
object cov.tree should show up in your Global
Environment.
#This line simply reads in the phylogenetic tree (akin to the read.csv function for reading in tables)
cov.tree <- read.tree("coronaviruses_tree.nwk")
Once your tree is imported into R, you can visualize it using the
ggtree() function from the ggtree package.
ggtree is an extension of ggplot2, which you
have already worked with. You only have to specify the data frame
containing the phylogenetic tree (which we named cov.tree
above). Like with ggplot(), you can add graphical elements
using the plus sign. For example, the names of the different viral
isolates are added to the tips of the phylogeny using
geom_tiplab().
#You can plot phylogenetic trees with the ggtree function (it works just like ggplot). Note that "size=2" associated with geom_tiplab just modifies the font size
p1 <- ggtree(cov.tree) + geom_tiplab(size=2)
p1
At this point, we can see the tree structure. However, the tip labels in this case have limited information, because they are mostly just weird acronyms that researchers used to designate their samples. To address our question above, we need to add some additional information to our tree.
Files containing phylogenetic trees only include the names of the
taxa included and the length and topology of the branches that connect
them. Additional information about each sample (the so-called metadata)
is stored in a separate file. You will see that the first column in the
metadata file corresponds to the tip labels in the phylogeny. The
metadata for this tree includes the country of origin, the host species,
virus type, the authors that collected the original data, and the
accession (a unique code that allows you to find the specific sample in
designated databases). As other datasets you have worked with, the
metadata is stored in a simple table that you can import with the
read.csv() function.
#This line reads the metadata of samples included in the phylogenetic tree
cov.tree.meta<- read.csv("coronaviruses_tree_meta.csv")
cov.tree.meta
## Strain country host virus.type Author
## 1 bat_SL_CoVZXC21 China Bat SARS-like CoV Hu et al A
## 2 bat_SL_CoVZC45 China Bat SARS-like CoV Hu et al A
## 3 pangolin/Guangxi/P4L/2017 China Pangolin SARS-like CoV Jiang et al
## 4 pangolin/Guangxi/P3B/2017 China Pangolin SARS-like CoV Jiang et al
## 5 pangolin/Guangxi/P1E/2017 China Pangolin SARS-like CoV Jiang et al
## 6 pangolin/Guangxi/P5L/2017 China Pangolin SARS-like CoV Jiang et al
## 7 pangolin/Guangxi/P5E/2017 China Pangolin SARS-like CoV Jiang et al
## 8 pangolin/Guangxi/P2V/2017 China Pangolin SARS-like CoV Jiang et al
## 9 pangolin/Guangdong/P2S/2019 China Pangolin SARS-like CoV Jiang et al
## 10 bat/Yunnan/RaTG13/2013 China Bat SARS-like CoV Zhu et al
## 11 bat/Yunnan/RmYN02/2019 China Bat SARS-like CoV Zhou et al
## 12 USA/CA1/2020 USA Human SARS-CoV-2 Uehara et al
## 13 Nepal/61/2020 Nepal Human SARS-CoV-2 Sah et al
## 14 USA/IL1/2020 USA Human SARS-CoV-2 Tao et al
## 15 Sweden/01/2020 Sweden Human SARS-CoV-2 Bengner et al
## 16 Wuhan-Hu-1/2019 China Human SARS-CoV-2 Zhang et al A
## 17 Wuhan/WIV05/2019 China Human SARS-CoV-2 Zhou et al
## 18 Japan/KY-V-029/2020 Japan Human SARS-CoV-2 Sekizuka et al
## 19 HKU3_7 Bat SARS-like CoV Lau et al A
## 20 LYRa11 China Bat SARS-like CoV He et al
## 21 bat/Yunnan/RmYN01/2019 China Bat SARS-like CoV Zhou et al
## 22 YNLF_34C China Bat SARS-like CoV Lau et al B
## 23 Rs672 China Bat SARS-like CoV Yuan et al
## 24 Rs4081 China Bat SARS-like CoV Hu et al B
## 25 As6526 China Bat SARS-like CoV Hu et al B
## 26 Rs4247 China Bat SARS-like CoV Hu et al B
## 27 Rs4237 China Bat SARS-like CoV Hu et al B
## 28 Rf4092 China Bat SARS-like CoV Hu et al B
## 29 Rs7327 China Bat SARS-like CoV Hu et al B
## 30 Rs9401 China Bat SARS-like CoV Hu et al B
## 31 RsSHC014 China Bat SARS-like CoV Ge et al A
## 32 Rs4084 China Bat SARS-like CoV Hu et al B
## 33 WIV1 China Bat SARS-like CoV Ge et al B
## 34 Rs3367 China Bat SARS-like CoV Ge et al A
## 35 Rs4231 China Bat SARS-like CoV Hu et al B
## 36 civet007 China Civet SARS-CoV Wang et al A
## 37 PC4_227 China Civet SARS-CoV Song et al
## 38 civet020 China Civet SARS-CoV Wang et al A
## 39 A001 China Civet SARS-CoV Wang et al B
## 40 GZ0402 China Human SARS-CoV Song et al
## 41 GZ0401 China Human SARS-CoV Song et al
## 42 BJ04 China Human SARS-CoV Wu et al
## 43 BJ01 China Human SARS-CoV Wu et al
## 44 Sin852 Singapore Human SARS-CoV Vega et al
## 45 Sin842 Singapore Human SARS-CoV Vega et al
## 46 Sin846 Singapore Human SARS-CoV Vega et al
## 47 Sino1_11 China Human SARS-CoV Zhang et al B
## 48 TWH Taiwan Human SARS-CoV Shu et al A
## 49 TWS Taiwan Human SARS-CoV Shu et al B
## Accession
## 1 MG772934
## 2 MG772933
## 3 EPI_ISL_410538
## 4 EPI_ISL_410543
## 5 EPI_ISL_410539
## 6 EPI_ISL_410540
## 7 EPI_ISL_410541
## 8 EPI_ISL_410542
## 9 EPI_ISL_410544
## 10 EPI_ISL_402131
## 11 EPI_ISL_412977
## 12 MN994467
## 13 MT072688
## 14 MN988713
## 15 MT093571
## 16 WH-Human_1
## 17 MN996529
## 18 LC522972
## 19 GQ153542
## 20 KF569996
## 21 EPI_ISL_412976
## 22 KP886809
## 23 FJ588686
## 24 KY417143
## 25 KY417142
## 26 KY417148
## 27 KY417147
## 28 KY417145
## 29 KY417151
## 30 KY417152
## 31 KC881005
## 32 KY417144
## 33 KF367457
## 34 KC881006
## 35 KY417146
## 36 AY572034
## 37 AY613950
## 38 AY572038
## 39 FJ959407
## 40 AY613947
## 41 AY568539
## 42 AY279354
## 43 AY278488
## 44 AY559082
## 45 AY559081
## 46 AY559094
## 47 AY485277
## 48 AP006557
## 49 AP006560
We can use the metadata to annotate the phylogenetic tree in a way
that allows us to address our central question. In the example below, I
am using the variable virus.type to color the tip of the
phylogeny (adding geom_tippoint()), and rather than
labeling the tips with the name of the strain, I am labeling them with
the host species the virus was isolated from (using
geom_tiplab()).
#This line of code simply plots the same tree as above, but without the sample label
p2 <- ggtree(cov.tree)
## Warning in fortify(data, ...): Arguments in `...` must be used.
## ✖ Problematic arguments:
## • as.Date = as.Date
## • yscale_mapping = yscale_mapping
## • hang = hang
## ℹ Did you misspell an argument name?
#This codes adds the metadata to the tree
p2 %<+% cov.tree.meta +
geom_tiplab(aes(label=host), size=2) +
geom_tippoint(aes(color=virus.type), na.rm = TRUE)
What does the phylogeny tell you about the evolutionary origins of SARS-CoV-2, which causes the current pandemic? Discuss the findings in the context of your hypothesis.
The phylogenetic tree shows that SARS-CoV-2 is most closely related to bat coronaviruses and more distantly related to SARS-CoV and pangolin viruses. This suggests that SARS-CoV-2 likely evolved from a bat coronavirus, possibly involving an intermediate animal before infecting humans. These findings support the hypothesis that SARS-CoV-2 emerged through natural zoonotic evolution rather than from a human source.
Unlike most previous disease outbreaks, our ability to sequence a large number of virus genomes has allowed us to literally observe the evolution of the pathogen live. Visit the the Nextstrain graphical interface to see how the pathogen spread across the globe and how mutations accumulate as different viral clades diversify.
The neutral theory predicts that mutations steadily accumulate during the course of evolution. In this section, you will analyze data from 2,625 whole-genome sequences of SARS-CoV-2 that researchers have collected all over the world. The data set includes the original isolate from December 26 2019 and representative isolates up until July 6 2022.
In the first step, let’s visualize the phylogenetic relationships among all of these strains in the context of where they were collected. Hold your breath… this will be a mess:) No surprise when you plot over 2,500 data points at once!
Note this uses the same approach as above. You import the tree and the metadata. Then you plot the tree and add the metadata information to the tree. The code below is complete and should run automatically. Note that the scale bar represents the number of mutations between isolates.
#Load the phylogenetic tree
cov19.tree <- read.tree("nextstrain_ncov_open_global_6m_tree.nwk")
#Load the corresponding metadata
cov19.tree.meta <- read.csv("nextstrain_ncov_open_global_6m_metadata.csv")
#Visualize the tree and color the tips by region
#Note that I am using a fan layout here rather than the traditional layout, because it spreads out the tips more
#Depending on your monitor size, you may want to adjust the size of the tip points using the size parameter (currently set to 1.5)
p3 <- ggtree(cov19.tree, layout="fan", open.angle=10) + geom_treescale(offset=20)
p3 %<+% cov19.tree.meta +
geom_tippoint(aes(color=region), na.rm = TRUE, size=1.5)
To examine how mutations accumulate during virus evolution, we need to tally up the number of nucleotide substitutions for each isolate relative to the original virus. Using the phylogenetic data we have available, this is really simple, because we can sum up the length of branches that separates each isolate from the root of the tree (i.e., the original virus sequence).
The code you need to do this is really simple; it requires a single
function (distRoot()). However, I do not recommend that you
run this on your computer! For 2,625 genomes, this calculation takes
quite a while, and depending on your hardware, it might just crash your
computer. So, I already ran this analysis for you, and you can simply
import the results in the next section.
##YOU DO NOT NEED TO RUN THIS! THIS TAKES A LONG TIME TO RUN. THIS CODE IS HERE FOR ILLUSTRATION ONLY, IN CASE YOU ARE INTERESTED HOW IT WORKS
#library(adephylo)
#dist <- distRoot(cov19.tree, method = "patristic")
#ncov_mutations <- merge (cov19.tree.meta, dist, by.x = "strain", by.y = 0)
#ncov_mutations$date2 <- as.Date(ncov_mutations$date, format = "%Y-%m-%d")
#ref <- as.Date("2019-12-26", format = "%Y-%m-%d")
#ncov_mutations$time <- ncov_mutations$date2-ref
If we want to know how fast SARS-CoV-2 accumulates mutations over time, we can estimate the genome-wide molecular clock (i.e., the rate at which mutations occur over time). This can be done by plotting the number of mutations separating any isolate from the root as a function of time and then calculating the slope of a regression line between time (in days) and the number of mutations. We will do this using virus isolates for the first two years of the pandemic.
First, let’s import and plot the data using a scatter plot. The data
file “ncov_mutations.csv” includes the number of of mutations separating
any isolate from the root (calculated from the phylogeny as described
above) as well as the number of days that passed between the collection
of the original isolate. Remember, to draw a regression line, you can
add geom_smooth(method=lm, formula= y~0+x, se=FALSE) to the
code that makes a scatter plot.
#Import data
cov19.mutations <- read.csv("ncov_mutations.csv")
#Select first two years of data
cov19.mutations2 <- cov19.mutations[which(cov19.mutations$DaysSinceRoot<730),]
#Plot data
ggplot(cov19.mutations2, aes(x=DaysSinceRoot, y=Mutations)) +
geom_point() +
geom_smooth(method="lm", se=FALSE, formula=y~x-1) +
theme_classic() +
xlab("DaysSinceRoot") +
ylab("Mutations")
The regression line essentially represents the average at which rate mutations accumulate. Numerically, the number of mutations per day (i.e., the slope of the regression line in the graph above) can be calculated with the code below.
#You can run a regression with the lm function, designating the relationship between the variables as "y ~ 0 + x". The 0 indicates that the regression line must go through the origin.
regression <- lm(cov19.mutations2$Mutations ~ 0 + cov19.mutations2$DaysSinceRoot)
#The slope of the regression (in number of mutations per day) can be found by calling regression$coefficients[1]. Multiplying that number by 365 gives us a substitution rate per year.
mutations.per.year <- 365*regression$coefficients[1]
mutations.per.year
## cov19.mutations2$DaysSinceRoot
## 23.89179
As you evaluate the analyses in this section, what do you notice? What happened to all the viruses that were highly similar to the original samples and common in the first 100 days of the pandemic?
The viruses that were highly similar to the original samples—those with few or no mutations—were common early in the pandemic (first 100 days) but disappeared over time as new mutations accumulated. This pattern suggests that the original viral lineages were gradually replaced by descendant variants that carried advantageous or neutral mutations, reflecting the ongoing natural evolution and diversification of SARS-CoV-2 in the human population.
The calculation of the rate of mutation accumulation was really straightforward for the first two years of the pandemic. However, let’s plot the data for the full range of the dataset. I am color coding the data for the first two years separately:
#Here, I am just making a new variable for first two years or not
cov19.mutations$group <- ifelse(cov19.mutations$DaysSinceRoot<730, "Early", "Late")
#Plot data
ggplot(cov19.mutations, aes(x=DaysSinceRoot, y=Mutations, color=group)) +
geom_point(aes()) +
geom_smooth(data=subset(cov19.mutations,group=="Early"), method="lm", se=FALSE, formula=y~x-1, color="black") +
theme_classic() +
xlab("DaysSinceRoot") +
ylab("Mutations")
With your knowledge of the mutation rate, how do you interpret the results? Why are there apparent jumps in the number of mutations of isolates past day 730?
The graph shows that SARS-CoV-2 gained mutations at a steady rate over time, which matches its normal mutation rate of about 1–2 mutations per month. The sudden jumps in mutations after around day 730 happened when new variants appeared that had already built up many changes before spreading. These jumps show how new, more evolved versions of the virus replaced earlier ones.
Now you saw how SARS-CoV-2 has continuously accumulated mutations as it was spreading across the globe. The next question of course is where in the viral genomes do these mutations occur?
Using the 2,625 genomes that were sequenced, we can literally tally up where in the genome we see the variable sites. Here, we are providing you with a data set that measured the genetic diversity at every site in the SARS-CoV-2 genome. Diversity is measured as the number of observed mutations per codon (events).
To properly visualize this type of data, it requires some R wizardry. You don’t need to know the details of how this code works, just press play and focus on interpreting the results of the graph. If you are interested in unpacking the code line by line, I’m happy to meet with you outside of class.
#Import data
diversity <- read.csv("nextstrain_ncov_open_global_6m_diversity_aa.csv")
diversity2 <- diversity %>%
group_by(gene) %>%
complete(position = 1:max(position), fill = list(events = 0))
#This is so
diversity3 <- diversity2 %>%
# Compute chromosome size
group_by(gene) %>%
summarise(chr_len=max(position)) %>%
# Calculate cumulative position of each chromosome
mutate(tot=cumsum(chr_len)-chr_len) %>%
select(-chr_len) %>%
# Add this info to the initial dataset
left_join(diversity, ., by=c("gene"="gene")) %>%
# Add a cumulative position of each SNP
arrange(gene, position) %>%
mutate( BPcum=position+tot)
axisdf = diversity3 %>% group_by(gene) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 )
ggplot(diversity3, aes(x=BPcum, y=events)) +
geom_point( aes(color=as.factor(gene))) + scale_color_manual(values = rep(c("red", "black"), 7 )) +
scale_x_continuous(label = axisdf$gene, breaks= axisdf$center) +
coord_flip()+
theme_classic() +
xlab("Gene") +
ylab("Diversity") +
theme(legend.position = "none")
As you evaluate the graph, what do you notice? How do you interpret these results in the context of neutral evolution?
The graph shows that some viral genes, like ORF8 and ORF7b, have high genetic diversity, while others like M and N are much more conserved. This pattern suggests that mutations in less essential genes accumulate through neutral evolution—random changes that don’t affect the virus much. In contrast, core genes stay stable because harmful mutations are weeded out by natural selection.
Applying the principles of the neutral theory, we can also identify what parts of the viral genome are likely under selection. In the data set below, researchers quantified ⍵ for some of the larger genes in the SARS-CoV-2 genome. ⍵ is defined as dN/dS, so it is the ratio of rate of non-synonymous mutations to the rate of synonymous mutations. Plot the values of ⍵ for the different genes using a bar graph.
#Import data
dn.ds <- read.csv("ncov_dnds.csv")
ggplot(dn.ds, aes(x=gene, y=w)) +
geom_bar(stat="identity") +
theme_classic() +
xlab("gene") +
ylab("fitness")
Based on the information, what genes do you think are under positive selection? What do the results mean for our understanding of COVID-19?
The genes under positive selection are mainly the spike (S) gene, and possibly ORF1b and N. These genes are changing faster, likely because the virus is adapting to the human immune system and vaccines. This means SARS-CoV-2 is still evolving, so tracking these genes helps us stay prepared for new variants.
Phylogenetic trees and genetic diversity data of SARS-CoV-2 and other betacoronaviruses were retrieved from Nextstrain:
For additional information on Nextstrain, see:
Data for DN/DS ratios were kindly provided by Manuela Sironi and Diego Forni based on the following work:
Consulting additional resources to solve this assignment is absolutely allowed, but failure to disclose those resources is plagiarism. Please list any collaborators you worked with and resources you used below or state that you have not used any.
Dr.Bonner