Excel Spreadsheet

In the Excel spreadsheet I chose 10 arbitrary polymorphic loci and created similarity and dissimilarity matrices, scoring each locus 1 if two samples were the same and 0 if they were different. I then calculated the similarity between each pair as score/N, where N is the number of loci in each row (10). The 5×5 similarity matrix is what I used in this code to create the phylogenetic tree.
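The score/N calculation described above can be sketched in base R. The allele calls and the names `taxon1` to `taxon3` below are made up for illustration; they are not the spreadsheet data.

```r
# Hypothetical allele calls: 3 taxa (rows) x 10 loci (columns).
# These values are invented for illustration, not the spreadsheet data.
alleles <- rbind(
  taxon1 = c("a", "a", "b", "a", "b", "b", "a", "a", "b", "a"),
  taxon2 = c("a", "a", "b", "a", "b", "a", "a", "b", "b", "a"),
  taxon3 = c("b", "a", "a", "a", "b", "a", "b", "b", "a", "a")
)

n_loci <- ncol(alleles)   # N = 10
taxa   <- rownames(alleles)

# Empty similarity matrix with taxon names on both axes
sim <- matrix(NA_real_, nrow = length(taxa), ncol = length(taxa),
              dimnames = list(taxa, taxa))

# Score 1 for each matching locus, 0 otherwise, then divide by N
for (i in taxa) {
  for (j in taxa) {
    sim[i, j] <- sum(alleles[i, ] == alleles[j, ]) / n_loci
  }
}

dissim <- 1 - sim   # dissimilarity is the complement of similarity
sim
dissim
```

Every taxon matches itself at all 10 loci, so the diagonal of `sim` is 1 and the diagonal of `dissim` is 0.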

Preliminaries

# install.packages("ape")
# install.packages("phangorn")

library(ape)
library(phangorn)

Example 1: 3 Sequences

Similarity matrix

This matrix is based on the proportion of bases that are identical between sequences. This is often referred to as PID, for Proportion Identical or Percentage Identical.

BLAST reports PID in its main output. PID is a very simple metric of similarity; more sophisticated measures are used in practice.

Make a similarity matrix with the matrix() command. Note that I have to declare the number of rows with nrow.

# Bad matrix 1
matrix(c(1.0, 0.5, 0.3,
         0.5, 1.0, 0.4,
         0.3, 0.4, 1.0)) # bad similarity matrix: nrow = 3 is missing
##       [,1]
##  [1,]  1.0
##  [2,]  0.5
##  [3,]  0.3
##  [4,]  0.5
##  [5,]  1.0
##  [6,]  0.4
##  [7,]  0.3
##  [8,]  0.4
##  [9,]  1.0
# without nrow this gives a single-column (9 x 1) matrix, not the 3 x 3 matrix we want


# Good matrix
matrix(c(1.0, 0.5, 0.3,
         0.5, 1.0, 0.4,
         0.3, 0.4, 1.0), 
       nrow = 3) #gives us a matrix
##      [,1] [,2] [,3]
## [1,]  1.0  0.5  0.3
## [2,]  0.5  1.0  0.4
## [3,]  0.3  0.4  1.0

Store the matrix

my_sim_mat <- matrix(c(1.0, 0.5, 0.3,
                       0.5, 1.0, 0.4,
                       0.3, 0.4, 1.0),
                     nrow = 3,
                     byrow = TRUE)
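The byrow argument matters more than this symmetric example suggests. As a small sketch with throwaway values, matrix() fills column by column unless byrow = TRUE is given; a symmetric matrix comes out the same either way, but a matrix typed in row by row with only one triangle filled in would not.

```r
# Throwaway values to show the fill order
vals <- 1:6

by_col <- matrix(vals, nrow = 2)                # default: fill down the columns
by_row <- matrix(vals, nrow = 2, byrow = TRUE)  # fill across the rows

by_col
by_row

# For a symmetric matrix the two fill orders give the same result
sym_vals <- c(1.0, 0.5, 0.3,
              0.5, 1.0, 0.4,
              0.3, 0.4, 1.0)
same_either_way <- identical(matrix(sym_vals, nrow = 3),
                             matrix(sym_vals, nrow = 3, byrow = TRUE))
same_either_way
```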

Label the matrix with row.names() and colnames()

row.names(my_sim_mat) <- c("G","T","M")
colnames(my_sim_mat) <- c("G","T","M")

Dissimilarity matrix

Similarity, dissimilarity, and distance are all related. Most methods use distance, not similarity.

We can do vectorized math to recalculate the matrix

my_dist_mat <- 1 - my_sim_mat # dissimilarity = 1 - similarity, e.g. G-M: 1 - 0.3 = 0.7 (70%)

Convert to R’s distance format

Use as.dist() to convert the matrix into a dist object.

my_dist_mat2 <- as.dist(my_dist_mat)
my_dist_mat
##     G   T   M
## G 0.0 0.5 0.7
## T 0.5 0.0 0.6
## M 0.7 0.6 0.0
is(my_dist_mat)
## [1] "matrix"    "array"     "mMatrix"   "structure" "vector"
class(my_dist_mat)
## [1] "matrix" "array"
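The output above inspects the plain matrix; the converted object can be inspected the same way. This is a minimal base R sketch that rebuilds the three-sequence matrix under new names (`sim`, `dist_mat`, `dist_obj`) so it runs on its own.

```r
# Rebuild the three-sequence similarity matrix from Example 1
sim <- matrix(c(1.0, 0.5, 0.3,
                0.5, 1.0, 0.4,
                0.3, 0.4, 1.0),
              nrow = 3,
              byrow = TRUE,
              dimnames = list(c("G", "T", "M"), c("G", "T", "M")))

dist_mat <- 1 - sim            # ordinary matrix of dissimilarities
dist_obj <- as.dist(dist_mat)  # R's compact distance format

class(dist_mat)   # a regular matrix
class(dist_obj)   # a "dist" object

# A dist object stores only the n * (n - 1) / 2 pairwise values,
# taken from the lower triangle in column order: G-T, G-M, T-M
length(dist_obj)
as.numeric(dist_obj)

# as.matrix() expands it back to a full symmetric matrix
round_trip <- as.matrix(dist_obj)
round_trip
```

Functions such as nj() and upgma() take the dist object, which is why the conversion step is needed.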

Build a neighbor-joining (nj) tree

Neighbor joining (NJ) is one of the most common ways to build a tree from molecular sequence data that has been converted to distances; it is one of the tree-building options within BLAST.

Build the tree with nj()

my_nj <- ape::nj(my_dist_mat2)

Plot the tree as an “unrooted” tree

plot(my_nj, "unrooted")

- rooted tree = ((G, T), M): G and T join first, and M branches off on its own

Plot the tree as a “rooted” tree

plot(my_nj)

- nj() does not root the tree, so the default plot only looks rooted; it is not actually rooted, and plotting it as an unrooted tree makes more sense

  • G-T = 0.5

  • T-M = 0.6

  • building a phylogenetic tree can be treated as a clustering (classification) problem

    • classification: sorting things into groups by their main differences, so that related things cluster together
  • considered a reduction method
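The notes above can be checked in code. This is a sketch using ape's is.rooted(), cophenetic(), and root(); it rebuilds the three-taxon distances under new names so it runs on its own, and treating M as the outgroup is an assumption made only for illustration.

```r
library(ape)

# Same three-taxon distances as in Example 1
d <- as.dist(matrix(c(0.0, 0.5, 0.7,
                      0.5, 0.0, 0.6,
                      0.7, 0.6, 0.0),
                    nrow = 3,
                    dimnames = list(c("G", "T", "M"), c("G", "T", "M"))))

tree <- nj(d)

# nj() returns an unrooted tree, even though plot() can draw it as if rooted
is.rooted(tree)

# Tip-to-tip path lengths measured along the branches of the tree
path_len <- cophenetic(tree)
path_len["G", "T"]
path_len["T", "M"]

# Assumption for illustration only: treat M as the outgroup
rooted_tree <- root(tree, outgroup = "M", resolve.root = TRUE)
is.rooted(rooted_tree)
plot(rooted_tree)
```

With only three taxa the path lengths along the NJ tree reproduce the input distances, which is where the G-T and T-M values in the notes come from.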

UPGMA/WPGMA are other algorithms that work with distance matrices. They are not commonly used now but are useful for teaching because they can easily be done by hand on small datasets.

  • UPGMA - unweighted pair group method with arithmetic mean

my_upgma <- phangorn::upgma(my_dist_mat2)

Plot the UPGMA tree

plot(my_upgma)
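Because UPGMA can be done by hand on a dataset this small, the arithmetic can be written out and cross-checked. This sketch uses only base R: hclust() with method = "average" applies the same average-linkage rule as UPGMA. The variable names are made up for this example.

```r
# Same three-taxon distances as in Example 1
d_mat <- matrix(c(0.0, 0.5, 0.7,
                  0.5, 0.0, 0.6,
                  0.7, 0.6, 0.0),
                nrow = 3,
                dimnames = list(c("G", "T", "M"), c("G", "T", "M")))

# Step 1: join the closest pair first (G and T)
first_join <- min(d_mat[lower.tri(d_mat)])

# Step 2: the distance from the new {G, T} cluster to M is the
# arithmetic mean of the G-M and T-M distances
gt_to_m <- mean(c(d_mat["G", "M"], d_mat["T", "M"]))

# Cross-check: average-linkage clustering merges at the same distances
hc <- hclust(as.dist(d_mat), method = "average")
hc$height

c(first_join, gt_to_m)
```

With three taxa, WPGMA gives the same result, because the first cluster holds just two taxa and the weighted and unweighted averages coincide.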

Compare the rooted NJ and the UPGMA trees. The NJ tree plotted as rooted does not look as realistic as the UPGMA tree.

par(mfrow = c(1,2))
plot(my_nj)
plot(my_upgma)

WPGMA tree

plot(wpgma(my_dist_mat2))

Minimum evolution tree

plot(fastme.ols(my_dist_mat2))

Example 2: 5 Sequences

Build the matrix.

Be sure to add the nrow = … statement.

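# Only the lower triangle is filled in; the 0.0s above the diagonal are
# placeholders, so byrow = TRUE is needed to keep the values in place
five_sim_mat <- matrix(c(1.0, 0.0, 0.0, 0.0, 0.0,
                         1.0, 1.0, 0.0, 0.0, 0.0,
                         0.8, 0.8, 1.0, 0.0, 0.0,
                         0.5, 0.5, 0.3, 1.0, 0.0,
                         0.3, 0.3, 0.1, 0.6, 1.0),
                       nrow = 5,
                       byrow = TRUE)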

Name things

row.names(five_sim_mat) <- c("ME", "B", "G", "T", "MW") #row names
colnames(five_sim_mat)  <- c("ME", "B", "G", "T", "MW") #column names

Turn into a distance matrix. This takes 2 steps and requires the as.dist() command, which reads only the lower triangle of the matrix.

five_dist_mat  <- 1 - five_sim_mat
five_dist_mat2 <- as.dist(five_dist_mat)
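Since only the lower triangle of the similarity matrix holds real values, it helps to confirm that as.dist() ignores the rest. This is a small base R sketch on a three-taxon matrix with made-up names (`lower_only`, `full`), not the five-taxon data.

```r
taxa <- c("G", "T", "M")

# What 1 - similarity looks like when only the lower triangle was filled in:
# the upper triangle ends up as 1.0 (1 minus the 0.0 placeholders)
lower_only <- matrix(c(0.0, 1.0, 1.0,
                       0.5, 0.0, 1.0,
                       0.7, 0.6, 0.0),
                     nrow = 3,
                     byrow = TRUE,
                     dimnames = list(taxa, taxa))

# The same distances written out as a full symmetric matrix
full <- matrix(c(0.0, 0.5, 0.7,
                 0.5, 0.0, 0.6,
                 0.7, 0.6, 0.0),
               nrow = 3,
               byrow = TRUE,
               dimnames = list(taxa, taxa))

isSymmetric(lower_only)   # the half-filled matrix is not symmetric

# as.dist() reads only the lower triangle, so both give the same distances
from_lower <- as.numeric(as.dist(lower_only))
from_full  <- as.numeric(as.dist(full))
from_lower
from_full
```

The placeholder values in the upper triangle therefore never reach nj() or upgma().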

Neighbor-Joining tree with nj()

five_nj <- nj(five_dist_mat2)

Plot unrooted NJ tree

plot(five_nj, "unrooted")

Plot rooted NJ tree

plot(five_nj)

Build UPGMA tree

five_upgma <- phangorn::upgma(five_dist_mat2)

Plot UPGMA tree

plot(five_upgma)

Compare rooted NJ and UPGMA plots

par(mfrow = c(1,2))
plot(five_nj)
plot(five_upgma)

Build WPGMA tree

plot(wpgma(five_dist_mat2))

Compare rooted WPGMA and UPGMA plots

par(mfrow = c(1,2))
plot(wpgma(five_dist_mat2))
plot(five_upgma)

Build Minimum evolution tree

plot(fastme.ols(five_dist_mat2))