source(“https://bioconductor.org/biocLite.R”) biocLite(“ggtree”) In this tutorial, we are going to plot a tree and color its tips according to presence or absence of certain mutations.
Of course, you’d normally first collect sequences (for example from genbank, can be done in R), align them (using clustal, can be done in R) then find the best tree (using R, or using other programs such as PAUP or Phylip) and then plot the tree.
In this case, I wanted to recreate a tree from a different publication (Bloom et al 2010), and the authors send me the sequences they used to make the tree. I used Phylip (an online version) to find the best tree using a maximum likelihood method.
The paper is Bloom, Gong, Baltimore. 2010, Science, Permissive Secondary Mutations Enable the Evolution of Influenza Oseltamivir Resistance.
Look up the paper online and read the abstract. Which is the resistance mutation they talk about? Which pathogen becomes resistant to which drug with this mutation? Which are the two permissive mutations they talk about?
Now, you need to load a bunch of packages. If necessary, use install.packages() function to install these packages on your computer.
#install.packages("seqinr")
library(seqinr)
## Loading required package: ade4
library(ape)
##
## Attaching package: 'ape'
##
## The following objects are masked from 'package:seqinr':
##
## as.alignment, consensus
library("ggtree")
##
## Attaching package: 'ggtree'
##
## The following object is masked from 'package:ape':
##
## rotate
library("ggplot2")
Note that this tutorial is written in an rmd file or R markdown file. If you “knit” the file (there is a button that says “knit html”) just above this window. Try it out! You’ll get an html file that has all the comments, the code and the output in one file. Two things you need to keep in mind when you knit: 1. For the knitting to work, you should make sure you have the input files we will use and this RMD file in the same folder. 2. Any line that says install.packages() should be “commented out” by the hash or pound sign #
Try knitting the file.
Set the working directory so that you can easily access the files you need. This is not important for the knitting, but it is important for you to work smoothly on the code.
Set your working directory to the right folder, where you have the necessary files. You can do this by going to the “Session” menu and choose “set working directory” and “choose directory.” The code will appear in the console. Make sure you copy it into this RMD file.
#replace this piece of code with your own to go to the right working directory on your computer.
setwd("~/Desktop")
Read the fasta file with DNA sequences
NAseqs<-read.fasta("NA_alignment.fasta") #NA stands for neuraminidase, the gene for which we have the sequences
Use the class function to look at what class NAseqs is, and what class the different elements of NAseqs are. How many elements does the object NAseqs have? Put your code in the next chunk.
#write your code here
The paper (Bloom et al) found that amino acids 274, 222 and 234 are linked. I want to know which of the sequences have the ancestral or the mutated amino acid at these positions.
#First, we create three empty vectors
HasH274Y<-c(); HasR222Q<-c(); HasV234M<-c()
#Then, we want to look at the state of the three amino acids for each of the sequences, so we use a for loop that loops over all the sequences, from sequence 1 until sequence length(NAseqs).
for (i in 1:length(NAseqs)){
#Now, for each sequence, we check the status of amino acid 275 (this amino acid is refered to as 274 in the paper, because that's where it is in other influenza strains).
#If it has the Y, we add a 1 to the vector HasH274Y
if (translate(NAseqs[[i]])[275]=="Y"){HasH274Y<-c(HasH274Y,1)}
#If it doesn't have the Y, we add a 0 to the vector HasH274Y
else{HasH274Y<-c(HasH274Y,0)}
#Now, we look at amino acid 222
if (translate(NAseqs[[i]])[222]=="Q"){HasR222Q<-c(HasR222Q,1)}
else{HasR222Q<-c(HasR222Q,0)}
#Now we look at amino acid 234
if (translate(NAseqs[[i]])[234]=="M"){HasV234M<-c(HasV234M,1)}
else{HasV234M<-c(HasV234M,0)}
}
We now have three vectors of zero’s and ones. It is easier to work with them if they are organized in a dataframe. We’ll make a dataframe calles Genotypes.
Genotypes<-data.frame(labels(NAseqs),HasH274Y,HasR222Q,HasV234M,ColorTree="black",genotype=0,OrderInTree=0,stringsAsFactors = FALSE)
How many rows and columns does Genotypes have? Try using str(), summary(), class() and head() on the dataframe Genotypes.
#write your code here
Now we should change the values of Genotypes$ColorTree. If the sequence has H274Y, it is resistant to oseltamivir. If such a resistant virus has the other two mutations, it is highly fit. I want to color resistant and highly fit sequences red.
Genotypes$ColorTree[which(Genotypes$HasH274Y==1&Genotypes$HasR222Q==1&Genotypes$HasV234M==1)]<-"red"
Next, it is your task to change Genotypes$ColorTree so that 1. Sequences that have 274V but not the other two mutations should be orange 2. Sequences that have the two other mutations (222Q and 234M) but not 274V should be green 3. All other sequences should be grey
#write your code here
Next, we’ll read the tree I made using PhyML
PhyMLAATree<-read.tree("aaseqs_phy_phyml_tree.txt")
We can directly plot this tree using the plot() function, but the result is not so pretty.
plot(PhyMLAATree)
#We can use names(plot(x)) to find the attributes of the plot object.
names(plot(PhyMLAATree))
## [1] "type" "use.edge.length" "node.pos"
## [4] "node.depth" "show.tip.label" "show.node.label"
## [7] "font" "cex" "adj"
## [10] "srt" "no.margin" "label.offset"
## [13] "x.lim" "y.lim" "direction"
## [16] "tip.color" "Ntip" "Nnode"
## [19] "root.time"
# There is an attribute called show.tip.label, so let's set that to false for now
plot(PhyMLAATree,show.tip.label=FALSE)
#Next, let's "ladderize" the tree so it looks better.
plot(ladderize(PhyMLAATree),show.tip.label=FALSE)
#the function ggtree does these things automatically:
ggtree(PhyMLAATree)
We’ll stick with ggtree to plot our tree, we’ll also root the tree and add the colors we worked on to indicate which sequences carry the resistance mutation.
But before that, there is an important issue to deal with. We have information about the shape of the tree in the object PhyMLTree and information about the genotypes in the object Genotypes. However, the sequences are in a different order in PhyMLTree than in Genotypes. So if we want to add the colors based on what we know about the genotypes, we need to know which sequence in the tree corresponds to which sequence in the Genotypes dataframe. Look at the first five labels in both objects to see that they are different.
PhyMLAATree$tip.label[1:5]
## [1] "gb|CY0436231->1372|" "gb|CY0434671-1413|" "gb|CY0434031-1413|"
## [4] "gb|AB4653191->1410|" "gb|GQ4757501-1413|"
Genotypes$labels.NAseqs.[1:5]
## [1] "gb|CY033624:2-1414|" "gb|EU199388:1-1413|" "gb|EU516138:1-1413|"
## [4] "gb|CY026357:1-1413|" "gb|CY026533:2-1414|"
Before we do a comparison, we need to fix something small (but crucial!). The labels in PhyMLTree don’t have colons, but the labels in Genotypes do. Let’s make a new Genotypes column with names without colons so that we can compare them.
#first let's do this just for the first line
#gsub works like this gsub(pattern to find, pattern to replace with, object to do this to)
gsub(":","",Genotypes$labels.NAseqs.[1])
## [1] "gb|CY0336242-1414|"
#Now we can do this for the entire column, because gsub works on vectors
Genotypes$labelsNoColon<-gsub(":","",Genotypes$labels.NAseqs.)
OK, now we can compare the labels from the two objects. We’ll first create an empty vector and use it to store, for each sequence in the tree file, the position of the sequence in the Genotypes file
#Create empty vector
OrderSequences<-c()
#We use a for loop to look at each of the sequences in the tree object starting from sequence 1 until sequence length(PhyMLAATree$tip.label)
for (i in 1:length(PhyMLAATree$tip.label)){
#Find the position of the ith tiplabel
pos<-which(Genotypes$labelsNoColon==PhyMLAATree$tip.label[i])
#add this position to the empty vector
OrderSequences<-c(OrderSequences,pos)
} #repeat for each sequence
#Now we can use Genotypes$ColorTree[OrderSequences] to tell ggtree the color of the tips
ggtree(PhyMLAATree)+
geom_tippoint(color=Genotypes$ColorTree[OrderSequences], size=2)
Let’s properly root the tree and reduce the linewidths using the lwd option.
#where is the sequence that Bloom et al used as root?
which(PhyMLAATree$tip.label=="gb|CY00931810-1422|")
## [1] 52
ggtree(root(PhyMLAATree,52),lwd=.2)+
geom_tippoint(color=Genotypes$ColorTree[OrderSequences], size=1)
Compare the rooted tree with the unrooted tree. Do they look different? You can use the arrows above the graph in the lower right window to go back to previous figures.
Look at the help function for ggtree and find out how to - turn off ladderization - find out what a slanted tree is - not use branch lengths
#Put your code here
If you want to use the tree for a publication or paper, you should save it as a pdf or png
pdf("ggtree_tree.pdf")
ggtree(root(PhyMLAATree,52),lwd=.2)+
geom_tippoint(color=Genotypes$ColorTree[OrderSequences], size=1)
dev.off()
## quartz_off_screen
## 2
Now check that the file ggtree_tree.pdf has been created on your computer
Create a png file of the tree which is 800 by 800 pixels. Use the help function for the png() function.
Next, I wanted to highlight the part of the tree with sequences that have the 234M mutation and the 222Q mutations.
First we have to find outwhich node we’re looking for. I did this in a not-so-elegant way (but it works!). The trick is to print the tree in a large pdf and print the node labels. I can see that node 523 seems to be the common ancestor of all viruses that have 234M and 222Q
pdf("findnode.pdf",height=20, width=10)
ggtree(root(PhyMLAATree,52),lwd=.2)+
geom_text(aes(label=node, color=c(1),cex=0.3))+ #show the labels of the nodes
geom_tippoint(color=Genotypes$ColorTree[OrderSequences], size=3)
dev.off()
## quartz_off_screen
## 2
Now we can use the hilight() function on the tree to show the clade of all viruses that have node 523 as ancestor. This could be done by writing hilight(ggtree(ladderize etc etc etc)), but instead, I find it easier to save the tree plot to a variable called p and then to apply hilight to p.
p<-ggtree(root(PhyMLAATree,52),lwd=.2)+
geom_tippoint(color=Genotypes$ColorTree[OrderSequences], size=2)
hilight(p,node=523, fill="darkgreen", alpha=.2)
- Rewrite the code so that the q equals the rooted version of PhyMLAATree
root(PhyMLAATree,52) and p is the ggtree plot of q.
- Find out what the alpha attribute does
- Try out different colors for the filling of the highlighted clade
- Change the size of the tip points
- Make all tips purple
- Make a pdf of the best version of your tree
We are done with this tree-plotting tutorial. I hope you found it useful. - Which part of the tutorial did you find most interesting? Most confusing?