Fast computation of Faith's Phylogenetic Diversity (PD) measure

In my recent post about a faster method of computing Owen Petchey and Kevin Gaston's functional diversity measure, I mentioned that Petchey-Gaston FD is the same measure as Dan Faith's phylogenetic diversity measure, PD. They are simply computed using different types of tree structure. I also noted that my enhanced version of the Xtree function by Jens Schumacher can also be used to efficiently compute Faith's PD.

Today I am presenting some background to adapting Xtree to work with phylogenetic trees of class “phylo”. Before I start, there are some important differences between dendrogram and phylogenetic tree structures we need to understand. First, each internal node in a dendrogram always subtends (joins) two daughter branches, but in phylogenetic tree this restriction is relaxed. Theoretically, there can be an infinite number of daughter branches from an internal node in a phylogenetic tree.

Second, phylogenetic trees of class “phylo” make edge or branch lengths directly available by storing them in a matrix; with dendrograms of class “agnes” or class “hclust”, Jens Schumacher's algorithm must spend time computing edge lengths.

In the edge matrix in a “phylo” object, tips are indexed 1..Ntip (= number of tips), the root node has index Ntip + 1, and internal nodes are indexed from root.node to total.nodes. This is helpful because it lets us very efficiently cross-reference between edge indexes and the left and right terminal node indices of each edge. It also helps that objects of class “phylo” have methods to report the number of tips and edges in the stored tree structure and this lets us very quickly create the edge incidence matrix H1.

Because we already have edge indexes pre-defined, we can very quickly populate the entries in the H1 matrix for edges joining tips to their first internal node. We can also quickly set-up a vector of internal node indices in descending order so we can efficiently traverse tree from first nodes below tip nodes down to the root. We then simply work along the vector and descend the tree from penultimate internal nodes to the root.

As we descend, we can accumulate records of branches or edges traversed and update the H1 matrix. From Schumacher's original Xtree algorithm we borrow the principle that when we descend from tip (also referred to as a “terminal node”) level towards the root, row entries in the currently focussed edge (ie col) of H1 are equal to the sum of all H1 columns representing edges above the focal edge in the tree. Using the rowSums() function vectorises this step making it very fast.

Benchmarking

To see how much of a speed-up we can get with this adaptation of Xtree to phylogenetic diversity computations, I prepared a script to compare the execution times of PD computations by the distRoot function in the package “adephylo”, the pd function in the package “picante”, and of course, my FastXtreePhylo function. I chose to compute total PD for a entire phylogeny (=sum of all edge or branch lengths) and used the following approaches to compute exactly the same numerical result:

adephylo: Simply call the function distRoot and set method=“patristic”. The whole-tree PD is then the sum of the returned vector.

picante: There is no simple function call for picante because the function pd requires a community matrix and a phylogenetic tree of class “phylo”. To “trick” pd into give us the whole-tree PD value, we can supply it with a community matrix with the same number of rows as taxa or tips in the phylogentic tree, and populated with 1s on the main diagonal (all other values in the matrix set to 0). The whole-tree PD is then the sum of the column “PD” in the returned matrix.

FastXtreePhylo: FastXtreePhylo returns a list with two components: matrix H1 representing the paths through the tree from root to each tip, and h2.prime a numeric vector giving the length of each branch in the tree. Some matrix algebra and a summation of the resulting vector gives the whole-tree PD value.

These techniques were implemented in my script except I did not make the final summations. To provide test phylogenies, I have used the rtree function in package “ape” to create random phylogenetic trees of increasing size.

The results are quite astonishing. Here they are in tabular form showing total elapsed time in seconds reported by the function system.time():

## Tree size  picante.pd  adephylo.distRoot  FastXtreePhylo
## 100           0.508             0.461          0.004
## 200           1.866             1.384          0.009
## 500          14.586             7.196          0.050
## 1000         76.582            28.259          0.166
## 2000        421.838           117.925          0.560
## 3000       1215.157           259.919          1.152

No matter what the size of the tree, FastXtreePhylo is 2 or 3 orders of magnitude faster! Graphically the results are just as impressive:

plot of chunk unnamed-chunk-2

Source code for FastXtreePhylo and the benchmarking driver is provided below:

# Some code to perform benchmarking of methods to compute Faith's PD measure.
# (Includes source for FastXTreePhylo)
# 
# Computation of PD for a entire phylogeny (=sum of all edge or branch lengths)
# is possible using the following approaches in the packages adephylo and
# picante, and using my FastXtreePhylo function:
#
# adephylo: Simply call the function distRoot and set method="patristic". The
# whole-tree PD is then the sum of the returned vector.
#
# picante: There is no simple function call for picante because the function pd
# requires a community matrix and a phylogenetic tree of class "phylo". To
# "trick" function pd into give us the whole-tree PD value, we can supply it
# with a community matrix with the same number of rows as taxa or tips in the
# phylogentic tree, and populated with 1s on the main diagonal (all other values
# in the matrix set to 0). The whole-tree PD is then the sum of the column "PD"
# in the returned matrix.
#
# FastXtreePhylo: FastXtreePhylo returns a list with two components: matrix H1
# representing the paths through the tree from root to each tip, and h2.prime a
# numeric vector giving the length of each branch in the tree. Some matrix
# algebra and a summation of the resulting vector gives the whole-tree PD value.
#
# These techniques are implemented below except I have not made the final
# summations. To provide test phylogenies, I have used the rtree function in
# package "ape" to create random phylogenetic trees of increasing size.
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# 
# See http://www.gnu.org/licenses/ for a copy of the license.
#
# Peter D. Wilson 7 April 2014
#

####################### FastXtreePhylo #######################
#
# Based on an algorithm by Jens Schumacher developed in 2003. The
# orginal version can be found at:
#      http://www.thetrophiclink.org/resources/calculating-fd
# - last accessed 25 March 2014
#
# Adapted to work with objects of class "phylo" defined in the
# package ape. From the old FastXtree algorithm we borrow the principle
# that when we descend from tip (==terminal node) level towards the root,
# row entries in the currently focussed edge (ie col) of H1 are equal
# to the sum of all H1 columns representing edges above the focal edge
# in the tree. Using the rowSums() function vectorises this step making
# it very fast.
#
# PDW 10 March 2014

FastXtreePhylo <- function(thisTree)
{
  # In edge matrix in a phylo object, OTUs are indexed 1:Ntip, the root node has
  # index Ntip + 1, and internal nodes are indexed from root.node to 
  # total.nodes. This is clever because it lets us very efficiently
  # cross-reference between edge indicies and the left and right terminal node
  # indices of each edge. So, compute the key values:
  root.node <- Ntip(thisTree) + 1
  total.nodes <- max(thisTree$edge)

  # Matrix H1 of Schumacher's Xtree method, which is exactly the same as
  # David Nipperess's phylomatrix:
  H1 <- matrix(0,Ntip(thisTree),Nedge(thisTree),dimnames=list(thisTree$tip.label,NULL))

  # A short-cut: we can instantly populate H1 with terminal edges:
  rc.ind <- cbind("row"=1:Ntip(thisTree),"col"=which(thisTree$edge[,2] < root.node))
  H1[rc.ind] <- 1

  # Make a vector of internal node indices in descending order so we can traverse
  # tree from first nodes below tip nodes down to the root:
  internal.nodes <- seq(total.nodes,root.node,-1)

  # Now, visit each internal node accumulating subtended child edge records:
  for (thisNode in internal.nodes)
  {
    # Find the edge which has thisNode on its left:
    next.edge <- which(thisTree$edge[,2]==thisNode)

    # Which edges are children of the node to the right:
    child.edges <- which(thisTree$edge[,1]==thisNode)

    # Do the magic rowSums() trick to accumulate edges subtended by the current edge:
    H1[,next.edge] <- rowSums(H1[,child.edges])
  }

  # All done...package results as a named list:
  return(list("h2.prime"=thisTree$edge.length,"H1"=H1))  
}


##################### BENCHMARKING "MAIN" CODE ####################
treeSizes <- c(100,200,500,1000,2000,3000)

results <- matrix(0,length(treeSizes),3,
                  dimnames=list(as.character(treeSizes),
                  c("picante.pd","adephylo.distRoot","FastXtreePhylo")))

# Adapted from the example in help page for distRoot() in pacakage adephylo:
if(require(ape) && require(phylobase) && require(adephylo) && require(picante))
{
  treeSize <- 100
  #for (treeSize in treeSizes)
  {
    # Make a random phylogenetic tree using the rtree function in package ape:
    rt <- as(rtree(treeSize),"phylo")

    ade.time <- system.time(ade.PD <- distRoot(rt,method="patristic"))["elapsed"]

    # Create the dummy community matrix:
    commMat <- diag(1,treeSize,treeSize)
    colnames(commMat) <- rt$tip.label
    picante.time <- system.time(picante.PD <- pd(commMat,rt))["elapsed"]

    fastxtree.time <- system.time({fxtp <- FastXtreePhylo(rt)
                                   fxt.PD <- t(fxtp[["H1"]] %*% fxtp[["h2.prime"]])})["elapsed"]
    results[as.character(treeSize),] <- c(picante.time,ade.time,fastxtree.time)
  }
}

print(results)

plot(treeSizes,results[,"picante.pd"],pch=21,bg="skyblue3",
     xlab="Tree size (number of tips)",ylab="Execution time (seconds)")
lines(treeSizes,results[,"picante.pd"],lwd=2,col="skyblue3")
points(treeSizes,results[,"adephylo.distRoot"],pch=21,bg="plum3")
lines(treeSizes,results[,"adephylo.distRoot"],lwd=2,col="plum3")
points(treeSizes,results[,"FastXtreePhylo"],pch=21,bg="red3")
lines(treeSizes,results[,"FastXtreePhylo"],lwd=2,col="red3")
legend(250,1200,c("picante","adephylo","FastXtreePhylo"),
       pch=c(21,21,21),pt.bg=c("skyblue3","plum3","red3",pt.cex=1.2,bty="n"))

cat("*** End of processing.\n")

I hope you find this code useful…

Cheers,

Peter Wilson