Introduction to Distance-based phylogenetics trees in R

The data used in the 5 x 5 and 3 x 3 matrix comes from a MSA

of different chimpanzee subspecies. In the 5 x 5 matrix, the values being used to compute the trees are the Percent Identical (PID). The PID values came from 10 selected dimorphic loci in the MSA. In order to calculate the specific PID values we needed to compare the 10 selected dimorphic loci on the 5 subspecies with each species. A total of 10 PID were generated.
We calculated the PID based on the alleles that were similar within in each subspecies. Using the PID values we will then construct a dissimilarity matrix and use the packages ape and phangorn to help construct different trees.

Preliminaries

# Here are two packages needed to run r code chunks
# install.packages(ape)
# install.packages(phangorn)

# Analysis of Phylogenetics and Evolution packages ape and phangorn
library(ape)
library(phangorn)

Example 1: 3 Sequences

Similarity matrix

This matrix is based on the proportion of bases that are identical between sequence. This is often referred to as PID for Proportion Identical or Percentage Identical.

BLAST reports PID in its main output. PID is a very simple metric of similarity; more sophisticated measures are used in practice.

Make a similarity matrix with the matrix() command. Note that I have to declare the number of rows

# Bad matrix 1, does not declare the number of rows
matrix(c(1.0, 0.5, 0.3,
         0.5, 1.0, 0.4,
         0.3, 0.4, 1.0))
##       [,1]
##  [1,]  1.0
##  [2,]  0.5
##  [3,]  0.3
##  [4,]  0.5
##  [5,]  1.0
##  [6,]  0.4
##  [7,]  0.3
##  [8,]  0.4
##  [9,]  1.0
# Good matrix
matrix(c(1.0, 0.5, 0.3,
         0.5, 1.0, 0.4,
         0.3, 0.4, 1.0),
       nrow = 3, byrow = T)
##      [,1] [,2] [,3]
## [1,]  1.0  0.5  0.3
## [2,]  0.5  1.0  0.4
## [3,]  0.3  0.4  1.0

Store the matrix

my_sim_mat <- matrix(c(1.0, 0.5, 0.3,
         0.5, 1.0, 0.4,
         0.3, 0.4, 1.0),
       nrow = 3, byrow = T)

Label the matrix with row.names() and colnames()

  # The syntax can be confusing for row. names vs. col names declarations
row.names(my_sim_mat) <- c("G","T","M")
colnames(my_sim_mat) <- c("G","T","M")

row.names(my_sim_mat)
## [1] "G" "T" "M"
colnames(my_sim_mat)
## [1] "G" "T" "M"

Dissimilarity matrix

Similarity, dissimilarity, and distance are all related. Most methods use distance, not similarity.

We can do vectored math to recalculate the matrix

# By using vectored math, we can store the dissimilarity elements 
# in my_dist_mat matrix, which is 1 - PID.
my_dist_mat <- 1 - my_sim_mat

print("Here is the dissimilarity matrix")
## [1] "Here is the dissimilarity matrix"
my_dist_mat
##     G   T   M
## G 0.0 0.5 0.7
## T 0.5 0.0 0.6
## M 0.7 0.6 0.0
print("Here is the Similarity matrix for comparison")
## [1] "Here is the Similarity matrix for comparison"
my_sim_mat
##     G   T   M
## G 1.0 0.5 0.3
## T 0.5 1.0 0.4
## M 0.3 0.4 1.0

Convert to R’s distance format

# as.dist get's rid of values below the diagonal line of a matrix.
my_dist_mat2 <- as.dist(my_dist_mat) # diag = F is the default argument for as.dist().

my_dist_mat2
##     G   T
## T 0.5    
## M 0.7 0.6
#If the arg diag = T, this will return values at the diagonal line as well.
print("Comparing the distance matrix with arg diag = T")
## [1] "Comparing the distance matrix with arg diag = T"
my_dist_mat2 <- as.dist(my_dist_mat, diag = T)

my_dist_mat2
##     G   T   M
## G 0.0        
## T 0.5 0.0    
## M 0.7 0.6 0.0

Build a neighbor-joining (nj) tree

Neighbor Joining is one of the most common ways to build a tree using molecular data that’s been converted to sequences; its one of the options within BLAST.

Build the tree with nj()

# The neighboring joining construction can be done w/ ape package
# using the nj() function.
my_nj <- ape::nj(my_dist_mat2)

Plot the tree as an “unrooted” tree

# In order to get a unrooted tree, the arg "unrooted" must be set
plot(my_nj, "unrooted")

Plot the tree as an “rooted” tree

# rooted is the nj tree made by default w/ nj()
plot(my_nj)

UPGMA/WPGMA are other algorithms that work with distance matrices. They are not commonly used now but are useful for teaching because they can easily be done by hand on small datasets.

# Constructing a Unweighted Pair-Group Method with Arithmetic mean (upgma) tree w/ phangorn package
my_upgma <- phangorn::upgma(my_dist_mat2)

Plot the UPGMA tree

plot(my_upgma)

Compare the rooted NJ and the UPGMA

par(mfrow = c(1,2))
plot(my_nj)
plot(my_upgma)

WPGMA tree

# Constructing a Weighted Pair-Group Method with Arithmetic mean (wpgma) tree w/ phangorn package
plot(wpgma(my_dist_mat2))

Minimum evolution tree

# the fastme.ols function from the package ape
# can be used to construct a minimum evolution tree
plot(fastme.ols(my_dist_mat2))

Example 2: 5 Sequences

Build the 5 x 5 matrix.

Be sure to add the nrow = … statement.

# 5 x 5 Similarity matrix used from data from excel sheet

five_sim_matrix <- c(1, 0, 0, 0, 0,
                     1, 1, 0, 0,    0, 
                     0.7, 0.7, 1, 0, 0,
                     0.6, 0.6, 0.3, 1, 0,
                     0.4, 0.4, 0.1, 0.6, 1.0)

# Storing the above vector in a matrix sim.matrix
sim.matrix <- matrix(five_sim_matrix, nrow = 5, byrow = T)

# prints the matrix
sim.matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  1.0  0.0  0.0  0.0    0
## [2,]  1.0  1.0  0.0  0.0    0
## [3,]  0.7  0.7  1.0  0.0    0
## [4,]  0.6  0.6  0.3  1.0    0
## [5,]  0.4  0.4  0.1  0.6    1

Name the row and columns for the matrix

# The names stand for the respective Chimpanzee subspecies
# ME = Mahale_East
# B = Bodongo_East
# G = Gombe_East
# T = Tai_West
# MW = Mali_West
row.names(sim.matrix) <- c("ME","B","G", "T", "MW")
colnames(sim.matrix) <- c("ME","B","G", "T", "MW")

sim.matrix
##     ME   B   G   T MW
## ME 1.0 0.0 0.0 0.0  0
## B  1.0 1.0 0.0 0.0  0
## G  0.7 0.7 1.0 0.0  0
## T  0.6 0.6 0.3 1.0  0
## MW 0.4 0.4 0.1 0.6  1

Turn into a distance matrix. This is 2 steps and requires the as.dist() command

# Takes each element in five_dist_mat and subtracts the numeric value by 1
five_dist_mat  <- 1 - sim.matrix
#as.dist splits the matrix down the diagonal, so it gets rid of unneeded data
five_dist_mat  <- as.dist(sim.matrix, diag = F)

five_dist_mat
##     ME   B   G   T
## B  1.0            
## G  0.7 0.7        
## T  0.6 0.6 0.3    
## MW 0.4 0.4 0.1 0.6

Neighbor-Joining tree with n()

# Builds a near-jointed tree of the 5 x 5 
five_nj <- ape::nj(five_dist_mat)

Plot uprooted NJ tree

# Use the "uprooted" argument to get a uprooted NJ tree
plot(five_nj, "unrooted")

Plot rooted NJ tree

# Default tree is rooted
plot(five_nj)

Build UPGMA tree

# The phangorn package can generate a upgma tree
five_upgma <- phangorn::upgma(five_dist_mat)

Plot UPGMA tree

plot(five_upgma)

Compare rooted NJ and UPGMA plots

par(mfrow = c(1,2))
plot(five_nj)
plot(five_upgma)

Build WPGMA tree

five_wpgma <- wpgma(five_dist_mat)

Compare rooted WPGMA and UPGMA plots

par(mfrow = c(1, 2))
plot(five_wpgma)
plot(five_upgma)

Build Minimum evolution tree

plot(fastme.ols(my_dist_mat))