choose 10 polymorphic loci from the alignment and tally up the number of alleles per locus, then make a matric to hold the pair-wise similarities.Use =if(X = Y, 1,0) to score the alignments automatically. Calculate the PID and fill out the similarity matrix. calculate the dis-similarity matrix.
library(ape)
library(phangorn)
This matrix is based on the proportion of bases that are identical between sequence. This is often referred to as PID for Proportion Identical or Percentage Identical.
BLAST reports PID in its main output. PID is a very simple metric of similarity; more sophisticated measures are used in pratice.
Make a similarity matrix with the matrix() command. Note that I have to declare the number of rows
# Bad matrix 1
matrix(c(1.0, 0.5, 0.3,
0.5, 1.0, 0.4,
0.3, 0.4, 1.0))
## [,1]
## [1,] 1.0
## [2,] 0.5
## [3,] 0.3
## [4,] 0.5
## [5,] 1.0
## [6,] 0.4
## [7,] 0.3
## [8,] 0.4
## [9,] 1.0
# Good matrix
matrix(c(1.0, 0.5, 0.3,
0.5, 1.0, 0.4,
0.3, 0.4, 1.0),
nrow = 3)
## [,1] [,2] [,3]
## [1,] 1.0 0.5 0.3
## [2,] 0.5 1.0 0.4
## [3,] 0.3 0.4 1.0
Store the matrix
my_sim_mat <- matrix(c( 1 , 0 , 0 , 0 , 0,
0.1 , 1 , 0 , 0 , 0,
0.5 , 0.4 , 1 , 0 , 0,
0.4 , 0.5 ,0.5, 1, 0,
0.4 , 0.5 ,0.7 ,0.6 , 1),
nrow = 5,
byrow = T)
Label the matrix with row.names() and colnames()
row.names(my_sim_mat) <- c("ME","B","G.","T.","MW")
colnames(my_sim_mat) <- c("ME", "B" ,"G.", "T.", "MW")
Similarity, disimilarity, and distance are all related. Most methods use distance, not similarity.
We can do vectorized math to recalculate the matrix
my_dist_mat <- 1-my_sim_mat
as.dist()
my_dist_mat2 <- as.dist(my_dist_mat)
my_dist_mat
## ME B G. T. MW
## ME 0.0 1.0 1.0 1.0 1
## B 0.9 0.0 1.0 1.0 1
## G. 0.5 0.6 0.0 1.0 1
## T. 0.6 0.5 0.5 0.0 1
## MW 0.6 0.5 0.3 0.4 0
my_dist_mat2
## ME B G. T.
## B 0.9
## G. 0.5 0.6
## T. 0.6 0.5 0.5
## MW 0.6 0.5 0.3 0.4
my_dist_mat ME B G. T. MW ME 1 0 0 0 0 B 0.1 1 0 0 0 G. 0.5 0.4 1 0 0 T. 0.4 0.5 0.5 1 0 MW 0.4 0.5 0.7 0.6 1
my_dist_mat2 G. T. MW G. 0.5
T. 0.4 0.5
MW 0.4 0.5 0.7
is(my_dist_mat2)
## [1] "dist"
class(my_dist_mat2)
## [1] "dist"
Neighbor Joining is one of the most common ways to build a tree using molecular data that’s been converted to sequences; its one of the options within BLAST.
Build the tree with nj()
my_nj <- ape::nj(my_dist_mat2)
Plot the tree as an “unrooted” tree
plot(my_nj, "unrooted")