R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

setwd("~/Downloads/mammals/")

Molecular ecology tutorial

This document follows the tutorial from http://www.molecularecologist.com/2016/02/quick-and-dirty-tree-building-in-r/.

Loading libraries

If one or more libraries are not installed, yet, it’s possible to install them with the install.packages("packagename") command.

We are going to use the ape, phangorn and seqinr packages:

# Load libraries
library(ape)
## Warning: package 'ape' was built under R version 3.4.2
library(phangorn)
## Warning: package 'phangorn' was built under R version 3.4.2
library(seqinr)
## Warning: package 'seqinr' was built under R version 3.4.1
## 
## Attaching package: 'seqinr'
## The following objects are masked from 'package:ape':
## 
##     as.alignment, consensus

Alignment File Formats and Conversion

Using the read.dna() function in the package ape, you’ll import your sequence data, choosing between “interleaved,” “sequential”, “clustal” and “fasta” formats.

From there, you’ll want to convert your alignment to a “phyDat” object for use in phangorn, the major package for treebuilding in R. Once it’s imported, it’s easy to subset your data and only include, say, the first ten taxa:

# Import data [mammals.dna=https://pastebin.com/8Mt3QUTV]
mammals <- read.dna("~/Downloads/mammals/TMEM55A.dna", format="interleaved")
mammals_phyDat <- phyDat(mammals, type = "DNA", levels = NULL)

# Subset (first ten)
mammals10 <- subset(mammals_phyDat, 1:10)
mammals10_phyDat <- phyDat(mammals10, type = "DNA", levels = NULL)

Model Testing and Distance Matrices

Reducing sequence alignments into a matrix of pairwise distances allows for the rapid estimation of phylogenetic trees (at the cost of the additional information those sequences contain).

To produce a distance matrix, however, you’ll need to decide on the model of nucleotide (or protein) evolution that best fits your data, performing a likelihood ratio test.

mt <- modelTest(mammals10)
## negative edges length changed to 0!
## [1] "JC+I"
## [1] "JC+G"
## [1] "JC+G+I"
## [1] "F81+I"
## [1] "F81+G"
## [1] "F81+G+I"
## [1] "K80+I"
## [1] "K80+G"
## [1] "K80+G+I"
## [1] "HKY+I"
## [1] "HKY+G"
## [1] "HKY+G+I"
## [1] "SYM+I"
## [1] "SYM+G"
## [1] "SYM+G+I"
## [1] "GTR+I"
## [1] "GTR+G"
## [1] "GTR+G+I"
print(mt)
##      Model df    logLik      AIC         AICw     AICc        AICcw
## 1       JC 17 -2737.793 5509.587 6.033120e-53 5510.396 7.330043e-53
## 2     JC+I 18 -2668.901 5373.803 1.843904e-23 5374.709 2.134824e-23
## 3     JC+G 18 -2666.866 5369.733 1.410908e-22 5370.639 1.633513e-22
## 4   JC+G+I 19 -2666.843 5371.686 5.312805e-23 5372.694 5.845196e-23
## 5      F81 20 -2736.655 5513.309 9.381663e-54 5514.425 9.781248e-54
## 6    F81+I 21 -2667.348 5376.697 4.337392e-24 5377.926 4.273318e-24
## 7    F81+G 21 -2665.319 5372.639 3.299603e-23 5373.867 3.250859e-23
## 8  F81+G+I 22 -2665.292 5374.585 1.247189e-23 5375.932 1.157894e-23
## 9      K80 18 -2689.537 5415.074 2.012508e-32 5415.980 2.330029e-32
## 10   K80+I 19 -2618.876 5275.753 3.605616e-02 5276.761 3.966931e-02
## 11   K80+G 19 -2616.487 5270.974 3.933050e-01 5271.982 4.327177e-01
## 12 K80+G+I 20 -2616.304 5272.608 1.737509e-01 5273.723 1.811513e-01
## 13     HKY 21 -2688.471 5418.941 2.910358e-33 5420.170 2.867365e-33
## 14   HKY+I 22 -2617.061 5278.121 1.103290e-02 5279.469 1.024297e-02
## 15   HKY+G 22 -2614.689 5273.378 1.181810e-01 5274.726 1.097196e-01
## 16 HKY+G+I 23 -2614.523 5275.047 5.131378e-02 5276.519 4.476553e-02
## 17     SYM 22 -2686.437 5416.874 8.183118e-33 5418.221 7.597231e-33
## 18   SYM+I 23 -2616.446 5278.893 7.501979e-03 5280.365 6.544636e-03
## 19   SYM+G 23 -2613.611 5273.223 1.277398e-01 5274.695 1.114387e-01
## 20 SYM+G+I 24 -2613.368 5274.736 5.995799e-02 5276.338 4.901150e-02
## 21     GTR 25 -2686.195 5422.390 5.189801e-34 5424.128 3.963749e-34
## 22   GTR+I 26 -2615.537 5283.074 9.270386e-04 5284.954 6.596538e-04
## 23   GTR+G 26 -2612.827 5277.655 1.393043e-02 5279.534 9.912489e-03
## 24 GTR+G+I 27 -2612.620 5279.241 6.303005e-03 5281.268 4.166601e-03
##         BIC
## 1  5588.664
## 2  5457.531
## 3  5453.461
## 4  5460.066
## 5  5606.341
## 6  5474.380
## 7  5470.322
## 8  5476.919
## 9  5498.802
## 10 5364.133
## 11 5359.354
## 12 5365.639
## 13 5516.624
## 14 5380.456
## 15 5375.713
## 16 5382.033
## 17 5519.208
## 18 5385.879
## 19 5380.209
## 20 5386.373
## 21 5538.679
## 22 5404.015
## 23 5398.596
## 24 5404.833
dna_dist <- dist.ml(mammals10, model="JC69")

Neighbor Joining, UPGMA, and Maximum Parsimony

Once you have a distance matrix, phangorn provides simple, quick functions for estimating trees from distance matrices using neighbor-joining and UPGMA algorithms, which can be visualized using the plot() function:

mammals_UPGMA <- upgma(dna_dist)
mammals_NJ  <- NJ(dna_dist)
plot(mammals_UPGMA, main="UPGMA")

plot(mammals_NJ, main = "Neighbor Joining")

But which of these trees is a better fit for your data?

Using the parsimony() function, you can compare their respective parsimony scores. The function optim.parsimony() lets you go a step further, searching treespace through nearest-neighbor interchange (NNI) and subtree pruning and regrafting (SPR). Alternatively, pratchet() allows you to perform the search with the parsimony ratchet algorithm.

parsimony(mammals_UPGMA, mammals10_phyDat)
## [1] 332
parsimony(mammals_NJ, mammals10_phyDat)
## [1] 331
mammals_optim <- optim.parsimony(mammals_NJ, mammals10_phyDat)
## Final p-score 311 after  2 nni operations
mammals_pratchet <- pratchet(mammals10)
## [1] "Best pscore so far: 311"
## [1] "Best pscore so far: 311"
## [1] "Best pscore so far: 311"
## [1] "Best pscore so far: 311"
## [1] "Best pscore so far: 311"
## [1] "Best pscore so far: 311"
## [1] "Best pscore so far: 311"
## [1] "Best pscore so far: 311"
## [1] "Best pscore so far: 311"
## [1] "Best pscore so far: 311"
plot(mammals_optim)

plot(mammals_pratchet)

Maximum Likelihood and Bootstrapping

Though more computationally intensive than distance-matrix methods, building trees with maximum likelihood methods allows you to include the full data from your sequence alignment in a statistical framework that estimates model parameters.

You can begin by computing the likelihood of a given tree with the function pml(). Then, the function optim.pml() will optimize tree topology and branch length for your selected model of nucleotide evolution.

fit <- pml(mammals_NJ, mammals10)
print(fit)
## 
##  loglikelihood: -2762.28 
## 
## unconstrained loglikelihood: -4145.952 
## 
## Rate matrix:
##   a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
## 
## Base frequencies:  
## 0.25 0.25 0.25 0.25
fitJC <- optim.pml(fit, model = "JC", rearrangement = "stochastic")
## optimize edge weights:  -2762.28 --> -2738.582 
## optimize edge weights:  -2738.582 --> -2738.571 
## optimize topology:  -2738.571 --> -2733.869 
## optimize topology:  -2733.869 --> -2731.714 
## optimize topology:  -2731.714 --> -2727.183 
## 3 
## optimize edge weights:  -2727.183 --> -2727.182 
## optimize topology:  -2727.182 --> -2712.432 
## optimize topology:  -2712.432 --> -2707.769 
## optimize topology:  -2707.769 --> -2704.282 
## 3 
## optimize edge weights:  -2704.282 --> -2704.282 
## optimize topology:  -2704.282 --> -2701.33 
## optimize topology:  -2701.33 --> -2701.33 
## 1 
## optimize edge weights:  -2701.33 --> -2701.33 
## optimize topology:  -2701.33 --> -2701.33 
## 0 
## [1] "Ratchet iteration  1 , best pscore so far: -2685.78143477337"
## [1] "Ratchet iteration  2 , best pscore so far: -2685.78143475434"
## [1] "Ratchet iteration  3 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  4 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  5 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  6 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  7 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  8 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  9 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  10 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  11 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  12 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  13 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  14 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  15 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  16 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  17 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  18 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  19 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  20 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  21 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  22 , best pscore so far: -2685.78143472712"
## [1] "Ratchet iteration  23 , best pscore so far: -2685.78143472712"
## optimize edge weights:  -2685.781 --> -2685.781 
## optimize topology:  -2685.781 --> -2685.781 
## 0 
## optimize edge weights:  -2685.781 --> -2685.781
logLik(fitJC)
## 'log Lik.' -2685.781 (df=17)
bs <- bootstrap.pml(fitJC, bs=100, optNni=TRUE, multicore=TRUE, control = pml.control(trace=0))
plotBS(midpoint(fitJC$tree), bs, p = 50, type="p")

Exporting Trees

Finally, unless you’re analyzing relationships among a very limited number of taxa, the system viewer in R is probably going to be insufficient for viewing and saving your tree. The function write.tree() allows you to export the output of your analyses in Newick format, useful for further tinkering in programs such as FigTree.

write.tree(bs, file="bootstrap_example.tre")