Editing Phylogenies

This is some basic code for messing with phylogenies and making them look nice in R. It uses a bastard combination of multiple packages:

library(ape)             # v3.4
library(ggtree)          # v1.2.10
library(ggplot2)         # v2.0.0
library(wesanderson)     # v0.3.2 only needed for certain colors
library(phangorn)        # v2.0.1

Here’s how you can read in a generic tree file in newick format:

tree=read.tree("/Volumes/hoekstrafs1/taste/trees/01_15_16/RAxML_bipartitions.ProteinTree")

If for whatever reason you have single nodes in the tree (nodes with a single descendant), here’s a quick ape function that takes care of that with magic:

tree<-collapse.singles(tree)

GGtree also has a specific function for reading in RAxML phylogenies. I’ve found that while this makes initital interpretation easier, it seriously limits downstream actions because the RAxML tree object is harder to use with all the packages here. Regardless, here’s how to read in a RAxML tree:

tree<-read.raxml("RAxML/RAxML_bipartitionsBranchLabels.firstML")

Let’s say you have ugly labels at the tips of your branches, like so:

head(tree$tip.label)

## [1] "BW_NW_006501958.1_638666_641604_ORF_1"
## [2] "CA_TRINITY_GG_101_c0_g1_i1_ORF_1"     
## [3] "PO_TRINITY_GG_90_c3_g1_i1_ORF_1"      
## [4] "LL_TRINITY_GG_87_c4_g1_i1_ORF_1"      
## [5] "LL_TRINITY_GG_286_c0_g1_i1_ORF_1"     
## [6] "BW_NW_006501958.1_646973_649914_ORF_1"

It’s easy to clean them up for easier presentation down the line:

seq = tree$tip.label  #Create vector of tip labels
dd = data.frame(seq)  #Make that the first column of a data frame

# Below, we add a column to the data frame called 'taxa'. Using grep, we can
# easily populate this column with the names we want to use for our final
# tree
dd$Species = 0
dd$Species[grep("PO", dd$seq)] = "Peromyscus polionotus"
dd$Species[grep("BW", dd$seq)] = "Peromyscus maniculatus"
dd$Species[grep("LL", dd$seq)] = "Peromyscus leucopus"
dd$Species[grep("CA", dd$seq)] = "Peromyscus californicus"
print(dd[1:4, ])

##                                     seq                 Species
## 1 BW_NW_006501958.1_638666_641604_ORF_1  Peromyscus maniculatus
## 2      CA_TRINITY_GG_101_c0_g1_i1_ORF_1 Peromyscus californicus
## 3       PO_TRINITY_GG_90_c3_g1_i1_ORF_1   Peromyscus polionotus
## 4       LL_TRINITY_GG_87_c4_g1_i1_ORF_1     Peromyscus leucopus

Now on to plotting. Here I’ve created a function that will assign colors to each taxon in the tree. Just input the dataframe you created earlier:

cls <- function(dd){
  ss<-sort(unique(dd$Species))
  colors<-setNames(palette()[1:length(ss)],ss)
  colors <- colors[dd$Species]
  return(colors)
}

colors = cls(dd)

And now we can plot a tree! Below is the most basic plotting function I use, from the ape package. You’ll notice that the tip labels are still the ugly ones from earlier; this is because I often use this method to very quickly plot a tree for my interpretation, rather than presenting to someone else. Plotting a nicely formatted tree takes a bit more tinkering, which I’ll show later.

plot(tree,tip.color=colors,cex=0.23,font=2,edge.width=c(1.5),no.margin=TRUE)
legend("topright",sort(unique(dd$Species)),fill=unique(colors))

This tree looks ugly as shit because of how it’s rooted. Midpoint can fix that:

tree = midpoint(tree)

If we want to plot shapes for tip labels instead of the species names themselves, we use the following ggtree code. Notice I still use “dd”" as the data frame.

p <- ggtree(tree)
p %<+% dd + geom_tippoint(aes(shape = Species, color = Species), alpha = 1, 
    size = 2.5) + geom_text2(aes(label = label, subset = !isTip), size = 1.5, 
    hjust = 1.2, vjust = -0.5) + theme(legend.position = "right")

# I set the argument 'subset' to !isTip, telling geom_text to only provide
# the text associated with the nodes (i.e. bootstrap values). If I didn't
# have a subset argument, the ugly sequence names from earlier would show up
# next to the shapes at each tip. You can change the position of the
# bootstap labels using hjust (horizontal shift) and vjust (vertical shift).
# These same arguments can be used with many GGtree & GGplot items.

With bigger trees, things start to get cluttered. Ape has a nice function called subtreeplot that allows you to interactively explore a phylogeny, clicking on subtrees that you would like to view. The following code…

subTree=subtreeplot(tree,wait=TRUE,cex=0.23,tip.color=colors)

…will display your starting phylogeny on the left, and a selected node on the right. Clicking on a node in the left hand phylogeny will allow you to zoom in on different parts of the tree. Be patient, this process can take a few seconds with each click. When you are done exploring, pressing esc will return the current subtree. You can mess with text size, which labels to display, and many more features for easier visualization using the same arguments that are available for plot.phylo.

After some exploring, you might want to create an annotated tree for presenting to others. Annotation with ggtree allows the most customization, but does require knowing the node numbers for each section of the tree. There are several ways we can find these node numbers. If you generated a subtree using the output of subtreeplot() above, getting the node is easy:

coolNode=subTree$name

If we are interested in finding which node has “A” and “B” as its descendants, we can do:

otherNode = MRCA(tree, tip = c("A", "B"))  #Finds the most recent common ancestor for tips A and B

Now that we have numbers for the two nodes we’re interested in, we can annotate them on the phylogeny. In this exercise, we’ll annotate coolNode and hilight otherNode. We’ll go back to shapes for now.

p <- ggtree(tree)
p %<+% dd + 
  #Use a shape for each species
  geom_tippoint(aes(shape=Species, color=Species), alpha=1,size=2.5) +  
  #highlight
  geom_hilight(node=otherNode,alpha=0.6) +   
  #annotate
  geom_cladelabel(node=coolNode,label="Look at this!",offset.text=0.1,barsize=2,geom="text") + 
  #add bootstrap values
  geom_text2(aes(label=label,subset=!isTip),size=2,col="#FF0000",hjust=1.2,vjust=1.2) + 
  #add a legend
  theme(legend.position=c(0.8,0.2),legend.key.size=unit(0.5,"cm"),legend.text=element_text(face="italic"))

If you want to see all the node numbers at once, it’s pretty straightforward:

ggtree(tree) + geom_text2(aes(label = node, subset = !isTip), col = "red", size = 2, 
    hjust = -0.3) + geom_tiplab(cex = 1)

This tree is extremely ugly, but it shows the ggtree assigned node numbers in red. If you can recognize which nodes you want to annotate based on the topology, then feel free to remove the geom_text argument above. Otherwise, simply export the tree as a pdf…

pdf("myUglyTree.pdf")
ggtree(tree) + geom_text(aes(label = label), col = "red", size = 3, hjust = -0.3) + 
    geom_text(aes(label = label), size = 1.3)
dev.off()

…open the pdf, and use Control+F to search for any sequence name in the node you wish to annotate. With the tree as a pdf, you can zoom in on portions of the phylogeny while maintaining resolution. Kind of like what we were doing earlier with subtreeplot, but more cumbersome.

This has just been a primer, as the options are endless with ggtree, and the amount to which you can customize your phylogeny is absurd. There is also specific formatting for some evolutionary software, which I’ll try to add to in the future.

Editing Phylogenies

Brock Wooldridge

January 15, 2016