Loading Files and Trees

Loading in the required pacakges

Remember, in R specialized functions exist in packages. We need to install and load each package in R that we will use for today’s lesson.

###INSTALL
#install.packages("ape", quiet = TRUE)
#install.packages("phanghorn", quiet = TRUE)
#install.packages ("phytools", quiet=TRUE)
#if (!requireNamespace("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")

#BiocManager::install("msa")

### LOAD
library(ape, quietly = TRUE)
library(phangorn, quietly = TRUE)
library(phytools, quietly = TRUE)
library(msa, quietly = TRUE)

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:ape':
## 
##     complement

## The following object is masked from 'package:base':
## 
##     strsplit

1.Now, we have specialized functions for tree construction and manipulation installed. Now we are going to go on to the website genbank to download some sequences for tree construction.

2. Notice on the dropdown menu there are several possible choices such as nucleotide, protein, genes, or assembly. These are all different data repositories where we store information on various aspects of molecular biology. We are interested in mining nucleotide data for our organism of interest. On the top of the page, enter in vulpes.

3. Notice that there are definitely some fox genes there but there are way too many for what we can analyze. There are also many different kinds of genes when we would prefer to analyze only a single gene. So, lets try filtering it down to a single gene.

4. We now only have sequences for a single gene but we still have some bacterial samples present in our analysis. Let’s filter for only sequences in the Canidae (the family the foxes belong to) and see what we get back.

5. Great! Now we have only the COI gene for a group of dogs. Let’s download the sequences by using the upper right dropdown menu. As shown in the image.

6. Now that we have the sequence data, we still need to align the sequences together and then create trees from the aligned dataset. Before that, we have to import the sequences into R for alignment. We can do that by first changing our working directory to where the sequences are downloaded. This is very important! Many of the errors beginning student’s make is when they are attempting to analyze data but don’t realize they are in the wrong working directory.

## We will change our working directory to our Downloads folder because that is where the sequences are.
#setwd("Downloads")

## We can read in the data into R using a function from msa.
fox<- readDNAStringSet("fox.fasta")
fox

##   A DNAStringSet instance of length 57
##      width seq                                         names               
##  [1]   740 ATGTTCATTAATCGATGATT...CCTGAAGTCTACATTTTAAT KX156608.1 Vulpes...
##  [2]   740 ATGTTCATTAATCGATGATT...CCTGAAGTCTACATTTTAAT KX156607.1 Vulpes...
##  [3]   740 ATGTTCATTAATCGATGATT...CCTGAAGTCTACATTTTAAT KX156606.1 Vulpes...
##  [4]   740 ATGTTCATTAATCGATGATT...CCTGAAGTCTACATTTTAAT KX156605.1 Vulpes...
##  [5]   740 ATGTTCATTAATCGATGATT...CCTGAAGTCTACATTTTAAT KX156604.1 Vulpes...
##  ...   ... ...
## [53]   516 TATTCGGGGCATGAGCCGGT...ATCAGTCCTAATTACAGCTG KP992981.1 Vulpes...
## [54]   516 TATTCGGGGCATGAGCCGGT...ATCAGTCCTAATTACAGCTG KP992980.1 Vulpes...
## [55]   516 TATTCGGGGCATGAGCCGGT...ATCAGTCCTAATTACAGCTG KP992979.1 Vulpes...
## [56]   516 TATTCGGGGCATGAGCCGGT...ATCAGTTCTAATTACAGCCG KP992978.1 Vulpes...
## [57]   516 TATTCGGGGCATGAGCCGGT...ATCAGTCCTAATTACAGCTG KP992977.1 Vulpes...

7. We have now read the sequences into R and assigned them the object name fox. If you type fox to the screen you will see the sequences are now stored in R. Now, lets align the sequences together (this may take a while!!)

## Now, lets align them using a multiple sequence alignment algorithm clustal
fox_alignment<-msa(fox)

## use default substitution matrix

8. Now that the sequences are aligned, they need to be converted to a new format that can be handled by the phangorn/ape packages.

# Convert to a format that can be used for phylogenetic analysis.
fox_al2 <- msaConvert(fox_alignment, type="phangorn::phyDat")
fox_al2

## 57 sequences with 16822 character and 514 different site patterns.
## The states are a c g t

8. Finally, we can make a pairwise distance matrix between our sequences and create two trees: UPGMA and NJ. After plotting the trees, do you see any differences? What are they?

# Create a distance matrix of the sequences
dm <- dist.ml(fox_al2)

# Create a UPGMA tree for the group
treeUPGMA <- upgma(dm)

# Create a neighbor joining tree
treeNJ <- NJ(dm)

plot(treeUPGMA,main="UPGMA", cex=0.5)

plot(treeNJ,main="NJ", cex=0.5)

9. We can also look at which tree is the most parsimonious by evaluating the number of changes that occurr along each tree. Would you pick a tree with the lower or high parsimony score?

#Calculate the parsimony score for each tree
parsimony(treeUPGMA, fox_al2)

## [1] 4024

parsimony(treeNJ, fox_al2)

## [1] 4032

**9. Your new assignment is to go on genbank and create a NJ and UPGMA tree for your taxon of interest. Make sure there are not too many sequences or it will take forever to align and be cumbersome to plot. When you are finished, export the two trees as a PDF and send it to my email (mason . linscott [at] gmail.com)