Basics of Sequence Data and Phylogenies in R

The analysis of sequence data, detailing the primary structure of DNA, RNA, and proteins is becoming commonplace across subdisciplines in biology, from molecular biology to macroevolution. While the development of tools for understanding sequence data remain at the cutting edge of bioinformatics and biostatistics, the R community has already contributed many packages that aid in the analysis of sequence data. In particular, this tutorial is focused on bringing aligned sequence data into R and using a variety of algorithms to construct phylogenetic trees modeling evolutionary relationships among the organisms represented by the sequences.

As usual, we will want to start with loading some useful packages related to sequences and phylogenies. You may need to install these packages before loading them.

Unfortunately, there has also been an update problem between R, Bioconductor, and some of the R packages, so we need to run a little update

source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")

Now we can load the packages:

require(ape)

## Loading required package: ape

## Warning: package 'ape' was built under R version 3.1.2

require(phangorn)

## Loading required package: phangorn
## 
## Attaching package: 'phangorn'
## 
## The following object is masked from 'package:ape':
## 
##     as.prop.part

Our data in this example are aligned ‘DNA barcode’ sequences from the cytochrome oxidase I (COI) gene for butterflies and moths (order Lepidoptera) collected at the Brown Family Environmental Center at Kenyon College, in the fall of 2015. The sequences were aligned using the Clustal algorithm at the European Bioinformatics Institute website. The alignment was then saved as a text file which can be read into R with the read.alignment() function.

butterfly.aln.phydat=read.phyDat("butterflyCOI.2015.clustalw", format="clustal")

summary(butterfly.aln.phydat)[1:5,]

##              Length Class    Mode     
## KC2015.00.05 "553"  "-none-" "numeric"
## KC2015.00.06 "553"  "-none-" "numeric"
## KC2015.05.08 "553"  "-none-" "numeric"
## KC2015.00.07 "553"  "-none-" "numeric"
## KC2015.05.04 "553"  "-none-" "numeric"

The phyDat alignment object is a list of the aligned sequences, each with 552 bp positions. Each sequence is identified with a sample code, which we want to match to a previously determined taxonomic identity. Those taxonomic names are in a .csv file which we also want to load.

ButterflyData <- read.csv("ButterflyData2015.csv")

Next, we need to extract the taxon labels based on the sequence codes. This process is a bit cumbersome, but it works…

butterfly.tips=as.character(ButterflyData$TaxonID[match(names(butterfly.aln.phydat), as.character(ButterflyData$SampleCode))])
#Some of the sequence codes are not in our database so we need to replace it again with the sequence code
for (i in which(is.na(butterfly.tips))) butterfly.tips[i]=names(butterfly.aln.phydat)[i]
butterfly.tips

##  [1] "Pieris.rapae.00.05"               
##  [2] "Pieris.rapae.00.06"               
##  [3] "Pieris.rapae.05.08"               
##  [4] "Pieris.rapae.00.07"               
##  [5] "Pieris.rapae.05.04"               
##  [6] "Pieris.rapae.01.06"               
##  [7] "Pieris.rapae.04.02"               
##  [8] "Pieris.rapae.01.16"               
##  [9] "Pieris.rapae.00.09"               
## [10] "Satyrium.edwardsii.04.08"         
## [11] "Pieris.rapae.05.02"               
## [12] "Pieris.rapae.02.02"               
## [13] "KC2015.xx.NC3"                    
## [14] "Phyciodes.tharos.00.08"           
## [15] "Chlosyne.nycteis.03.05"           
## [16] "KC2015.07.12"                     
## [17] "Phyciodes.tharos.07.02"           
## [18] "Bolaria.selene.05.13"             
## [19] "Phyciodes.tharos.06.11"           
## [20] "Phyciodes.tharos.08.16"           
## [21] "Phyciodes.tharos.08.22"           
## [22] "Phyciodes.tharos.04.06"           
## [23] "Boloria.bellona.01.02"            
## [24] "Boloria.bellona.02.10"            
## [25] "Boloria.bellona.07.05"            
## [26] "Boloria.bellona.08.07"            
## [27] "KC2015.08.23"                     
## [28] "Boloria.bellona.06.10"            
## [29] "Boloria.bellona.07.03"            
## [30] "KC2015.02.09"                     
## [31] "Boloria.bellona.08.10"            
## [32] "Belloria.bellona.03.04"           
## [33] "Boloria.bellona.08.17"            
## [34] "KC2015.06.04C"                    
## [35] "Boloria.bellona.05.06"            
## [36] "KC2015.06.08C"                    
## [37] "Clossiana.bellona.04.03"          
## [38] "Indet.01.13"                      
## [39] "Indet.00.10"                      
## [40] "Erynnis.juvenalis.juvenalis.06.12"
## [41] "Erynnis.horatius.08.13"           
## [42] "Erynnis.horatius.08.21"           
## [43] "Strymon.melinus.humuli.06.08"     
## [44] "KC2015.08.11"                     
## [45] "KC2015.02.xx"                     
## [46] "Eparhyreus.clarus.08.12"          
## [47] "KC2015.08.24"                     
## [48] "Pieris.rapae.00.13"               
## [49] "KC2015.02.07"                     
## [50] "Indet.01.04"                      
## [51] "Crambus.albellus .03.01"          
## [52] "Indet.03.02"                      
## [53] "Holomelina.aurantiaca.01.07"      
## [54] "Pholisora.catullus.01.10"         
## [55] "Holomelina.aurantiaca.01.11"      
## [56] "KC2015.02.13"                     
## [57] "Indet.05.15"                      
## [58] "Celastrina.neglecta.07.04"        
## [59] "Celastrina.neglecta.04.11"        
## [60] "Colias.philodice.02.08"           
## [61] "Colias.philodice.05.11"           
## [62] "Colias.philodice.08.20"           
## [63] "Colias.philodice.05.12"           
## [64] "KC2015.06.06C"                    
## [65] "Euphyes.vestris.01.12"            
## [66] "Ancyloxypha.numitor.07.06"        
## [67] "Indet.00.16"                      
## [68] "Lambdina.fervidaria.02.03"        
## [69] "Indet.01.09"                      
## [70] "KC2015.03.NC1"                    
## [71] "KC2015.03.NC2"                    
## [72] "KC2015.xx.NC1"                    
## [73] "KC2015.xx.NC2"                    
## [74] "KC2015.06.12C"                    
## [75] "Crambus.albellus .03.03"          
## [76] "Colias.philodice.02.12"           
## [77] "Chlosyne.nycteis.03.09"           
## [78] "KC2105.08.14"

Then we can attach the taxonomic labels to the sequences by replacing the names() of the alignment object.

names(butterfly.aln.phydat)=butterfly.tips

So now we have meaningful labels for each (or at least most) of the sequences, so we can think about visualizing the phylogenetic relationships between the individuals. Since we are only looking at the sequence information for the COI locus, the resulting phylogeny is a gene tree, rather than a species tree.

Phylogenetic trees can be modeled from sequence data using a variety of approaches. Here, we will begin with a ‘neighbor joining’ algorithm, which sequentially joins taxa based on their sequence similarity. We have to begin the process by calculating the pairwise ‘distance’ among all of the 87 sequences, using the dist.ml() function from phangorn.

butterfly.phydat.dist=dist.ml(butterfly.aln.phydat)

Next, we can construct and plot a neighbor joining phylogeny from the distance matrix using NJ() from ape.

butterfly.NJ=NJ(butterfly.phydat.dist)
plot.phylo(butterfly.NJ, use.edge.length=FALSE, cex=0.75)

plot of chunk unnamed-chunk-9

Now this tree is somewhat hard to read, but if you zoom in, you can see that in many but certainly not all cases, similar species are grouping together. Given the sequence we looked at above, how might we explain individuals from the same species coming out in different evolutionary COI lineages in the tree?

We can compare our neighbor joining tree to a tree constructed using a ‘parsimony’ algorithm, which is based on the assumption that the best phylogeny is the one that requires the minimum number of base pair changes.

butterfly.pars=pratchet(butterfly.aln.phydat) #prachet constructs the tree based on parsimony

## [1] "Best pscore so far: 1325"
## [1] "Best pscore so far: 1325"
## [1] "Best pscore so far: 1325"
## [1] "Best pscore so far: 1323"
## [1] "Best pscore so far: 1323"
## [1] "Best pscore so far: 1323"
## [1] "Best pscore so far: 1323"
## [1] "Best pscore so far: 1323"

plot.phylo(butterfly.pars, use.edge.length=FALSE, cex=0.75)

plot of chunk unnamed-chunk-10

Here, we can see that the topology (shape) of the tree changes, depending on the assumptions that we make about how the evolutionary process works at the sequence level. These are two of many approaches to constructing phylogenies and we have not touched on methods for comparing alternative models.

What we have done is provided some background and examples for working with sequence data and constructing and visualizing phylogenies in R.

Basics of Sequence Data and Phylogenies in R

Drew Kerkhoff

January 18, 2016