Introduction

In this document, I will demonstrate how to construct and visualize a phylum-level phylogenetic tree to be implemented as part of the HGTsearch software. Given a set of ~100 orthologous genes along with their taxonomy lineage, the goal is to create a phylogenetic tree that demonstrates the evolutionary relationships between the gene of interest and its orthologous genes across different phyla. The Smith-Waterman (SW) alignment score is used to infer evolutionary distance between organisms.

Setting up

Packages

The following R packages will be needed for this task:

library(ggtree) #to visualize tree
library(gridExtra) #to arrange plots
library(magrittr) #pipe operator
library(dplyr) #data wrangling
library(ape) #create & manipulate tree
library(plotly) #interactive tree

Data

For this document, I will use the retrieved orthologous genes of the microorganism Acanthamoeba castellanii gene ACA1_245650 to demonstrate functionality. This table is available on the HGTsearch GitHub page here (NEED TO UPDATE LINK). The method is generally applicable to any gene of interest provided the orthologs table, which HGTsearch generates based on a user query.

#read in example table
keggTable <- read.csv('./ACA1_245650.csv', stringsAsFactors = FALSE)

#print first 3 rows in table
knitr::kable(head(keggTable, 3))

DB.Hit	Organism	Taxonomy	Length	Overlap	Identity….	SW.Score	Annotation
afo:Afer_2006	Acidimicrobium ferrooxidans	Prokaryotes;Bacteria;Actinobacteria;Acidimicrobium	1194	416	24.8	327	cellulose-binding family II; K01179 endoglucanase [EC:3.2.1.4]
dpp:DICPUDRAFT_154318	Dictyostelium purpureum (cellular slime mold)	Eukaryotes;Protists;Amoebozoa;Dictyostelium	470	303	25.7	316	hypothetical protein
cfl:Cfla_3024	Cellulomonas flavigena	Prokaryotes;Bacteria;Actinobacteria;Cellulomonas	829	341	25.2	297	glycoside hydrolase family 62; K01181 endo-1,4-beta-xylanase [EC:3.2.1.8]

Strategy 1

Constructing the tree

A modified version of the neighbor-joining method will be implemented to construct the tree. The approach is as follows:

calculate distance of each gene to target gene as inverse of SW score
group genes into phyla
branch length from target gene to a phylum is the minimum distance among genes in that phylum

1. Calculate gene evolutionary distances

For this step, we rely on the common assumption that more closely-related species will have higher alignment scores. Building upon this principle, we will use the inverse of the SW alignment score as an evolutionary distance metric. Note that these alignment scores are from each gene to the query gene of interest; a more accurate phylogeny would include the alignment of all the genes

#calculate inverse of SW score and store in table
keggTable$distance <- keggTable$SW.Score ^ (-1)

#show distribution of evolutionary distances
hist(keggTable$distance, main = "Distribution of inferred evolutionary distances")

2. Group genes into phyla

The next step is to group the genes into phyla since we would like to construct a phylum-level phylogenetic tree. This step could be easily modified to construct species-level or other taxonomy-level trees if desired. The taxonomy information is contained in the orthologs table as a single taxonomy lineage, with taxonomy levels separated by a semi-colon (e.g. Prokaryotes;Bacteria;Actinobacteria;Acidimicrobium), so we will need to extract the phylum from the lineage prior to grouping.

#remove entries from table which do not contain taxonomy information
keggTable <- keggTable[keggTable$Organism!="character(0)",]

#remove entries from the animal kingdom to keep only biologically-reasonable phyla
keggTable <- keggTable[-grep('Animals', keggTable$Taxonomy),]

#extract phylum from taxonomy lineage (always the third term in the lineage) and add to table
keggTable$phylum <- as.factor(apply(keggTable, 1, function(ortholog){
  strsplit(ortholog["Taxonomy"], split = ';')[[1]][3]
}))

#store all genes and phyla in a vector
taxonomies <- factor(c(as.character(keggTable$DB.Hit),"Acanthamoeba",levels(keggTable$phylum)))

#show first 4 phyla as example
head(keggTable$phylum, 4)

## [1] Actinobacteria Amoebozoa      Actinobacteria Actinobacteria
## 17 Levels: Actinobacteria Amoebozoa Armatimonadetes ... Unclassified Terrabacteria group

#show the 5 highest-occuring distinct phyla present among the 100 genes
knitr::kable(sort(table(keggTable$phylum), decreasing = TRUE)[1:5])

Var1	Freq
Actinobacteria	24
Ascomycetes	12
Eudicots	9
Betaproteobacteria	6
Firmicutes - Bacilli	6

3. Determine tree branches

Our tree will contain branches at two levels: first branching out to the phyla, and second branching from the phylum to its genes. We will construct a matrix that contains these edges. For visualization purposes later on, we will store the nodes as integers in this matrix.

#determine no.of phyla
phylaCount <- length(levels(keggTable$phylum))

#branches from root node (target gene) to each of the phyla
firstLevel <- matrix(c(rep("Acanthamoeba",phylaCount), levels(keggTable$phylum)), ncol = 2)

#branches from each phylum to its genes
secondLevel <- as.matrix(keggTable[,c('phylum', 'DB.Hit')])

#put both tables together for total edges matrix
branches <- rbind(firstLevel, secondLevel)

#convert branch to integers for technical visualization purposes
branches[,] <- as.integer(factor(branches[,], levels = taxonomies))
branches <- apply(branches, 2, as.integer)

4. Compute branch lengths

The branch lengths of the tree will be directly proportional to the evolutionary distance between the orthologous phyla and the target gene. The distance from the target gene to each phylum is the minimum distance of all genes within a phylum. The distance of a phylum to its genes is the same for all genes, which is the minimum distance in the table.

#group genes by phyla and store minimum distance from each phylum in a variable
phylaDistances <-  
  keggTable %>%
  group_by(phylum) %>%
  summarize(min = min(distance))

#show 4 closest phyla
knitr::kable(phylaDistances[order(phylaDistances$min),][1:4,])

phylum	min
Actinobacteria	0.0030581
Amoebozoa	0.0031646
Stramenopiles	0.0039526
Eudicots	0.0040000

#create vector with branch lengths
branchLengths <- c(phylaDistances$min, rep(min(phylaDistances[,2]),nrow(keggTable)))

#show distribution of evolutionary distances to phyla
hist(branchLengths, main = "Distribution of inferred evolutionary distances to phyla")

5. Create the tree

Now it’s time to actually create the tree! We will create an object of type phylo which will store our tree. phylo is the most common data type for representing phylogenetic trees in R. A phylo object is just a list of 4 items:

edge: a two-dimensional matrix which contains the ‘from’ node in the first column and the ‘to’ node in the second
tip.label: the names of the leaves in the tree
edge.length: a vector with the edge weights (branch lengths) of the tree; should have the same order as the edge table (optional)
Nnode: number of internal nodes in the tree (NOT leaves)

The phylo object will be constructed by putting together all the elements we calculated in the previous steps. Refer to the ape package manual for more functionality.

#create tree
tree <- list(
  "edge" = branches, # branches need to be stored as numbers for plotting
  "tip.label" = keggTable$Organism, #(Genus species)
  "edge.length" = branchLengths,
  "Nnode" = as.integer(length(unique(keggTable$phylum)) + 1) #no. of phyla + root node
)

#set class to 'phylo'
attr(tree, 'class') <- 'phylo'

#check if this is a valid phylo object
checkValidPhylo(tree)

## Starting checking the validity of tree...
## Found number of tips: n = 81 
## Found number of nodes: m = 18 
## Done.

Visualizing the tree

Basics

In this section, I will visualize the tree created in the previous section. I will use the ggtree package for this task. There are multiple ways the tree can be visualized, as exhibited below.

#store different kinds of trees
t1 <- ggtree(tree, layout = "circular") 
t2 <- ggtree(tree, layout = "rectangular") 
t3 <- ggtree(tree, layout = "slanted") 
t4 <- ggtree(tree, layout = "equal_angle") 

#plot
grid.arrange(t1,t2,t3,t4, nrow=2)

A circular tree will be implemented for the following steps.

Species

Use geom_tiplab() to add labels to tree leaves. geom_tiplab2() rotates the labels to make them more readable for the human eye on circular plots. Notice that the tree branches are all roughly of equal length due to similar SW scores across the top 100 orthologs.

ggtree(tree, layout = "circular") +
  geom_tiplab2(color="darkgreen")

Phyla

Use geom_nodepoint to add points at the internal nodes (the phyla in this case). Use geom_nodelab for adding labels to the nodes instead. Add a 2 to the end of most ggtree functions to automatically adjust label/point for circular trees. Oddly, to label nodes the value passed to it should be of the same length as the number of edges in the tree, which frankly does not make sense.

ggtree(tree, layout = "circular") +
  geom_tiplab2(color="darkgreen") +
  geom_nodelab2(aes(label = c(1:(nrow(keggTable)+1), as.character(phylaDistances$phylum))), 
                hjust = 1.15, vjust = -0.5)

Final Tree

As a reminder, in this tree, genes were segragated based on their phylum, with an artificial phylum node created from which all the genes belonging to the phylum fan out.

Nodes can be highlighted using geom_hilight. Clades can be highlighted using geom_cladelabel. Unfortunately, these commands cannot take more than one node at a time, so I used a loop to call it once for each phylum as a quick fix. It is worth noting that the node ids in phylo objects start with all the tips followed by the internal nodes. So, in this case we have 99 genes so the index of the first internal node is 100.

#group by clade for species color-coding
tree <- groupClade(tree, .node = (nrow(keggTable)+2):((nrow(keggTable)+phylaCount+1)))

#save initial tree
t <- ggtree(tree, aes(color = group),layout = "circular") +
  geom_tiplab2()

#add phylum color-coding and labeling one-by-one
for(i in (nrow(keggTable)+2):((nrow(keggTable)+phylaCount+1)))
  t <- t + 
  geom_cladelabel(label = phylaDistances$phylum[i-(nrow(keggTable)+1)],
                  node = i, color = "black", offset.text = -0.004, barsize = 0,
                  fontsize = 7,  family='Palatino', angle = "auto") +
  geom_hilight(node = i, fill = i)

#show tree
t

Next Steps

As always, there is room for improvement. The next steps to do for this tree are (1) to coordinate the colors of the species labels and the phylum highlights, which turned out to be a little trickier than I expected, (2) to align the phylum labels in a way that is easier to read, (3) and to automatically fit the phylum label font size to its highlighted area for a cleaner visualization (make the font size depend on the number of species in that phylum). Also considering moving the species labels into the ‘slices of the pie’ and the phylum labels outside instead.

Strategy 2

The previous strategy contains three main limitations:

Due to the inferred evolutionary distances between the phyla being very similar to each other, the circular tree does not accurately convey the evolutionary distances. The above tree for Acanthamoeba castelanii, for example, looks more like a pie chart than a phylogenetic tree.
The creation of artificial ‘phylum nodes’ from which the species fan out.
The presence of a root node representing the organism of interest from which the tree branches out to the phyla. The queried gene is not actually an ancestor of all the orthologs so the tree should be unrooted.

An alternative strategy is now considered that will attempt to address these limitations. The premise is the same, with the inverse of the alignment score being used to infer evolutionary distances. The modifications are the following:

In order to accentuate the differences in distances between the genes, hierachical clustering will be applied to the data.
To avoid creating artificial phylum nodes, separate trees will be constructed for each of the phyla and then they will be merged.
An unrooted tree layout will be used instead of the circular layout to avoid creating a fake root.

Constructing the tree

1. Calculate gene evolutionary distances and extract phyla

This step is carried out the same as in strategy 1. Note thst the Euglenozoa phylum only has 2 members and they both have the same SW score, thus their distance is zero and their eventual unrooted phylogenetic tree will be essentially empty.

#read in example table
keggTable <- read.csv('./ACA1_245650.csv', stringsAsFactors = FALSE)

#remove entries which do not contain taxonomy information
keggTable <- keggTable[keggTable$Organism!="character(0)",]

#remove entries from the animal kingdom to keep only biologically-reasonable phyla
keggTable <- keggTable[-grep('Animals', keggTable$Taxonomy),]

#calculate inverse of SW score and store in table
keggTable$distance <- keggTable$SW.Score ^ (-1)

#extract phylum from taxonomy lineage (always the third term in the lineage) and add to table
keggTable$phylum <- as.factor(apply(keggTable, 1, function(ortholog){
  strsplit(ortholog["Taxonomy"], split = ';')[[1]][3]
}))

2. Compute distance matrix

Using the base R function dist, a distance matrix will be computed for the entire dataset. This matrix contains the euclidean distances between all pairs of genes in the dataset based on the inferred evolutionary distances to the query and will be used for hierarchical clustering.

#compute distance matrix
distMatrix <- dist(keggTable$distance, method = "euclidean")

3. Hierarchical clustering

The distance matrix from the previous step is now subjected to hierarchical clustering using the complete linkage method (default clustering method in R). Note that hierarchical clustering creates a rooted tree by default which I will unroot prior to downstream analysis. Note that the plot function of the ape package automatically unroots the tree when asked to plot with the type parameter set to unrooted, ggtree however is not smart and will just plot the tree with the root even when asked to plot it as unrooted.

#perform hierarchical clustering
clustered <- hclust(distMatrix, method = "complete")

#convert clustering results to object of type 'phylo'
clustered <- as.phylo(clustered)

#unroot tree
clustered <- unroot(clustered)

#label leaves by the phylum they belong to
clustered$tip.label <- as.character(keggTable$phylum)

Visualize the tree

Clusters

ggtree(clustered, layout = "daylight") + geom_tiplab()

From this (rather messy) tree we can see that simply clustering by distance does not give us the desired tree. We want the tree to show the relationships between phyla, so the nodes in each of the main clades (branches) should all come from the same phylum, but that is not the case in the above tree. Put simply, the genes do not cluster by phylum. Since we want to construct a phylum-level tree that visualizes the present phyla, an alternative solution is needed.

Create & merge phyla trees

In this approach, I will create separate trees for each phylum and then merge them into one tree. First, create a phylo tree for each phylum:

#store names of all phyla in a vector
phyla <- as.data.frame(table(keggTable$phylum))

#group phyla with just one gene into a 'other' category
others <- phyla[phyla$Freq==1,1]

#remove those phyla from the original list
phyla <- phyla[!(phyla[,1] %in% others), 1]

#iterate over phyla, create a tree for each, and store in a list
trees <- lapply(phyla, function(phylum){
  #get inferred evolutionary distances for genes within phylum
  distances <- keggTable[keggTable$phylum == phylum,'distance']
  #compute distance matrix
  distMatrix <- dist(distances, method = 'euclidean')
  #perform hierarchical clustering
  clusters <- hclust(distMatrix, method = "complete")
  #convert clustering results to object of type 'phylo'
  clusters <- as.phylo(clusters)
  #label leaves with kegg id
  clusters$tip.label <- as.character(keggTable[keggTable$phylum == phylum,'DB.Hit'])
  #unroot tree if it has at least 3 nodes (else it is a straight line and passes an error)
  if(length(clusters$tip.label) > 2)
    clusters <- unroot(clusters)
  
  return(clusters)
})

#create tree for 'others' category and add to list of trees
othersTree <- keggTable[keggTable$phylum %in% others, 'distance'] %>%
  dist(method = 'euclidean') %>%
  hclust(method = 'complete') %>%
  as.phylo()
othersTree$tip.label <- as.character(keggTable[keggTable$phylum %in% others,'DB.Hit'])
trees[[length(trees)+1]] <- unroot(othersTree)

#convert list of trees to 'multiPhylo' object
class(trees) <- "multiPhylo"

The simplest way to merge trees is to add them together (e.g. tree1 + tree2). This attaches the second tree to the first tree at either the root node or a specified node. If neither are provided it attaches to the first internal node (the hypothetical root). An alternative approach is to use the bind.tree function which is simply a wrapper for the addition process with a couple of parameters that are helpful for rooted trees. In our case, we do not have a root (although some sort of artificial root seems to be created at the point of attachment), which makes visualizing the merged tree a little less intuitive. Once again, the goal is to avoid creating artificial phylum nodes.

Merge the phylum trees into one:

#merge trees
mergedTree <- trees[[1]]
for(i in 2:length(trees))
  mergedTree <- mergedTree + trees[[i]]

#check dimensions of merged tree (should have 81 tips after removing animals)
mergedTree

## 
## Phylogenetic tree with 81 tips and 46 internal nodes.
## 
## Tip labels:
##  afo:Afer_2006, cfl:Cfla_3024, ace:Acel_0617, plk:CIK06_24665, actn:L083_7998, ksk:KSE_74220, ...
## 
## Unrooted; includes branch lengths.

Visualize the tree

The ggtree package is used to visualize the tree. Setting the layout to ‘unrooted’ or ‘daylight’ does not make a difference, since the default method for plotting unrooted trees is the ‘daylight’ algorithm (equal_angle is the other popular algorithm).

#plot tree
ggtree(mergedTree, layout = "unrooted")

The above tree contains the result of merging the trees of all 13 phyla present in the data. A clear artificial root node forms at the point of intersection. This is undesirable since it falsely implies the existence of a universal common ancestor at that point which is not true. Although undesirable, the creation of an artificial root node is inevitable in order to construct one merged tree. I will set the distance of each tree to its root to be the average pairwise distance between the all the genes within that phylum, this seems to lead to a reasonable visualizationa although this might be changed in the future. These root nodes can be thought of as acting as points of attachment for the other trees, rather than an actual added data point. Note that due to the evolutionary distances being very close to each other, the visualized tree essentially appears to have one root node from which all clades branch out, when there are actually unique nodes that distinguish each tree, even if they are very close to each other, which is crucial for labeling the clades. But for visualization purposes, it seems for now that a universal root is unavoidable. This will likely be changed in the future when we compute all pairwise alignments between genes to construct the tree. Also, the equal_angle method will be used to visualize the tree instead of daylight, since the daylight algorithm gives an error when attempting to calculate the ‘angle changes’ of the tree when dealing with these added roots.

First, I will create the phylum trees again, but this time adding the root nodes. Next, I will merge the trees into one using the same method as before.

#store names of all phyla in a vector
phyla <- as.data.frame(table(keggTable$phylum))

#group phyla with just one gene into a 'other' category
others <- phyla[phyla$Freq==1,1]

#remove those phyla from the original list
phyla <- as.character(phyla[!(phyla[,1] %in% others), 1])

#initialize variable to keep track of trees with 3 nodes
threeNodes <- character()

#iterate over phyla, create a tree for each, and store in a list
trees <- lapply(phyla, function(phylum){
  #get inferred evolutionary distances for genes within phylum
  distances <- keggTable[keggTable$phylum == phylum,'distance']
  #compute distance matrix
  distMatrix <- dist(distances, method = 'euclidean')
  #perform hierarchical clustering
  tree <- hclust(distMatrix, method = "complete")
  #convert clustering results to object of type 'phylo'
  tree <- as.phylo(tree)
  #label leaves with kegg id
  tree$tip.label <- as.character(keggTable[keggTable$phylum == phylum,'DB.Hit'])
  #unroot tree if it has at least 3 nodes (else it is a straight line and passes an error)
  if(length(tree$tip.label) > 2)
    tree <- unroot(tree)
  #add distance metric to artificial root
  tree$root.edge <- mean(distMatrix)
  #add to list of trees with 3 nodes to allow cluster labeling later
  if(length(tree$tip.label) == 3)
    threeNodes <<- append(threeNodes, phylum)
  
  return(tree)
})

#create tree for 'others' category and add to list of trees
othersTree <- unroot(keggTable[keggTable$phylum %in% others, 'distance'] %>%
                       dist(method = 'euclidean') %>%
                       hclust(method = 'complete') %>%
                       as.phylo())
othersTree$tip.label <- as.character(keggTable[keggTable$phylum %in% others,'DB.Hit'])
othersTree$root.edge <- mean(othersTree$edge.length)
trees[[length(trees)+1]] <- othersTree
phyla <- append(phyla, 'Others')

#make sure the list is composed of actual trees (if all genes in a phylum have 0 distance to each other then no tree)
noTree <- integer()
for(i in 1:length(trees)){
  if(trees[[i]]$root.edge==0) 
    noTree <- append(noTree, i)
}
trees <- trees[-noTree]
phyla <- phyla[-noTree]

#convert list of trees to 'multiPhylo' object
class(trees) <- "multiPhylo"

#merge trees
mergedTree <- trees[[1]]
for(i in 2:length(trees))
  mergedTree <- mergedTree + trees[[i]]

Now, visualize the tree using the equal_angle algorithm. It is worth noting that for this dataset, phylum 1 (Actinobacteria) has 30% of all nodes and have internal edgs that are disproportionately longer than those in the other trees. Phylum 1 thus dominates the tree and leads to a lopsided tree. Visualizing the tree without that phylum leads to a more traditional-looking unrooted tree.

ggtree(mergedTree, layout = 'equal_angle')

Now, I will add colors to label the phyla. Getting the node IDs that identify the phyla is not straightforward. To circle each phylum, we need to get the indices of the first internal node within each phylum; this node is the artificial root of the phylum. The numbering of the node IDs is as follows: first come all the tips, followed by the trees root, followed by the internal nodes within each tree one-by-one. I will use the number of internal nodes within each phylum tree (stored in the previous chunk) to get the root node of each phylum.

In order to achieve better branch distribution, and thus a more clear tree, I will us the plot function of the ape package and pass that plot’s coordinates to ggtree which will be used to add the more visually-appealing annotations.

#create merged tree plot
plot <- ggtree(mergedTree, layout = "equal_angle")

#use ape's plot to get better branch distribution
plot(mergedTree, type='unrooted', show.tip=F)

apePlot <- get("last_plot.phylo", envir=.PlotPhyloEnv)
plot$data$x <- apePlot$xx
plot$data$y <- apePlot$yy

#determine 'phylum nodes'
#initialize vector
phylumRoots <- integer()
#first phylum node is the first node after the merged tree's root
phylumRoots[1] <- length(mergedTree$tip.label)+2
#each root node comes after all the internal nodes of the previous tree
for(i in 2:length(trees))
  phylumRoots <- append(phylumRoots, (phylumRoots[i-1] + trees[[i-1]]$Nnode))

#color phyla clades
for(i in 1:length(phyla)){
  plot <- plot + 
    geom_hilight_encircle(node = phylumRoots[i], fill = i, expand = 0.02) +
    geom_cladelabel(node = phylumRoots[i], label = phyla[i], geom = 'label',
                    color = i, barsize = 0, angle=45)
}

plot

There is room for improvement. Adjusting the size of the phylum ‘bubbles’ as well as aligning the phylum labels to be more readable is needed. Alternative methods for determining the distance to the root should also be considered.

Final Tree

The final tree implemented in HGTsearch uses an alternative visualization for the tree produced by strategy 2. It offers more visually-appealing annotations and an interactive element. ggtree is used to visualize the tree and plotly is used to add the interactive element.

The strategy for constructing the tree:

group genes by phylum, all phyla with just one gene are represented by an ‘others’ category
compute the inverse of the SW score to infer relative evolutionary distance to gene of interest
apply complete-linkage hierarchical clustering on the dissimilarity of the inverse of SW scores
create phylogenetic tree for each individual tree and merge them into one tree by addition

The strategy for visualizing the tree:

use the slanted layout within ggtree to visualize the tree
use plotly package to add interactive component which shows gene info upon mousing over leaves

Limitations to current tree:

gene adjacency matrix is inferred from distance to query gene rather than pairwise alignment among all pairs
final tree contains artificial root at merging point of phylum trees which incorrectly implies that the query gene is a common ancestor

Future work:

compute adjacency matrix by performing pairwise local alignment among all pairs of orthologs
identify strategy to remove artificial root while retaining phylum-level structure
add option for creating trees using neighbor-joining method in addition to hierarchical clustering
addd option to visualize tree at different taxonomic levels (kingdom/phylum/genus…)

Combined chunks of code

library(ggtree) #to visualize tree
library(magrittr) #pipe operator
library(dplyr) #data wrangling
library(ape) #create & manipulate tree
library(plotly) #interactive tree

#read in example table
table <- read.csv('./ACA1_245650.csv', stringsAsFactors = FALSE)

#remove entries which do not contain taxonomy information
table <- table[!(table$Taxonomy == 'character(0)'),]

#remove entries from the animal kingdom to keep only biologically-reasonable phyla
table <- table[-grep('Animals', table$Taxonomy),]

#replace NA in gene description with empty string
table[is.na(table$Annotation),'Annotation'] <- ""

#calculate inverse of SW score and store in table
table$distance <- table$SW.Score ^ (-1)

#extract phylum from taxonomy lineage (always the second term in the lineage) and add to table
table$phylum <- as.factor(apply(table, 1, function(ortholog){
  strsplit(ortholog["Taxonomy"], split = ';')[[1]][3]
}))

#store names of all phyla in a vector
phyla <- as.data.frame(table(table$phylum))

#group phyla with just one gene into a 'others' category
others <- phyla[phyla$Freq==1,1]

#remove those phyla from the original list
phyla <- as.character(phyla[!(phyla[,1] %in% others), 1])

#iterate over phyla, create a tree for each, and store in a list
trees <- lapply(phyla, function(phylum){
  #get inferred evolutionary distances for genes within phylum
  distances <- table[table$phylum == phylum,'distance']
  #compute distance matrix
  distMatrix <- dist(distances, method = 'euclidean')
  #perform hierarchical clustering
  tree <- hclust(distMatrix, method = "complete")
  #convert clustering results to object of type 'phylo'
  tree <- as.phylo(tree)
  #label leaves with kegg id
  tree$tip.label <- as.character(table[table$phylum == phylum,'DB.Hit'])
  #unroot tree if it has at least 3 nodes (else it is a straight line and passes an error)
  if(length(tree$tip.label) > 2)
    tree <- unroot(tree)
  #add distance metric to artificial root
  tree$root.edge <- mean(distMatrix)
  
  return(tree)
})

#create tree for 'others' category and add to list of trees
othersTree <- unroot(table[table$phylum %in% others, 'distance'] %>%
                       dist(method = 'euclidean') %>%
                       hclust(method = 'complete') %>%
                       as.phylo())
othersTree$tip.label <- as.character(table[table$phylum %in% others,'DB.Hit'])
othersTree$root.edge <- mean(othersTree$edge.length)
trees[[length(trees)+1]] <- othersTree
phyla <- append(phyla, 'Others')

#make sure the list is composed of actual trees (if all genes in a phylum have 0 distance to each other then no tree)
noTree <- integer()
for(i in 1:length(trees)){
  if(trees[[i]]$root.edge==0)
    noTree <- append(noTree, i)
}
trees <- trees[-noTree]
phyla <- phyla[-noTree]

#convert list of trees to 'multiPhylo' object
class(trees) <- "multiPhylo"

#merge trees
mergedTree <- trees[[1]]
for(i in 2:length(trees))
  mergedTree <- mergedTree + trees[[i]]

#determine 'phylum nodes'
#initialize vector
phylumRoots <- integer()
#first phylum node is the first node after the merged tree's root
phylumRoots[1] <- length(mergedTree$tip.label)+2
#each root node comes after all the internal nodes of the previous tree
for(i in 2:length(trees))
  phylumRoots <- append(phylumRoots, (phylumRoots[i-1] + trees[[i-1]]$Nnode))

#create merged tree plot
plot <- ggtree(mergedTree, layout = "slanted")

#color phyla clades
for(i in 1:length(phyla)){
  plot <- plot +
    geom_hilight_encircle(node = phylumRoots[i], fill = i) +
    geom_cladelabel(node = phylumRoots[i], label = phyla[i], offset.text = mean(mergedTree$edge.length),
                    color = i, align = TRUE, offset = mean(mergedTree$edge.length)) +
    geom_tiplab(size = 3)
}

#create table with data for interactive plot
interactData <- data_frame(ID = table$DB.Hit, Phylum = table$phylum,
                           Annot = paste('\nOrganism:',table$Organism,'\nDescription:',table$Annotation, '\n Lineage:',table$Taxonomy)) %>%
  inner_join(plot$data, c('ID' = 'label'))

#add interactive plots
plot <- plot + geom_point(data = interactData,
                          aes(x = x, y = y, color = Phylum, label = Annot))
plot <- ggplotly(plot, tooltip = 'label')

plot

Helpful references:
* Reference 1 - ggtree
* Reference 2 - phylogenetics basics in R

HGTsearch - Phylogenetic Tree

Ahmed Youssef

Dec 6, 2018

Introduction

Setting up

Packages

Data

Strategy 1

Constructing the tree

1. Calculate gene evolutionary distances

2. Group genes into phyla

3. Determine tree branches

4. Compute branch lengths

5. Create the tree

Visualizing the tree

Basics

Species

Phyla

Final Tree

Next Steps

Strategy 2

Constructing the tree

1. Calculate gene evolutionary distances and extract phyla

2. Compute distance matrix

3. Hierarchical clustering

Visualize the tree

Clusters

Create & merge phyla trees

Visualize the tree

Final Tree

Combined chunks of code