Assignment: Your assignment is to use your notes from class - along with help from classmates, UTAs, and me - to turn this script into a fleshed-out description of what is going on.

This is a substantial project - we’ll work on it in steps over the rest of the unit.

We are currently focused on the overall process and will cover the details over the rest of this unit.

Your first assignment is to get this script to run from top to bottom by adding all of the missing R commands. Once you have done that, you can knit it into an HTML file and upload it to RPubs. (Note - you’ll need to add the YAML header!)

Your second assignment, which will be posted later, is to answer all the TODO and other prompts to add information. You can start on this, but you don’t have to do this on your first time through the code.

Delete all the prompts like TODO() as you compete them. Use RStudio’s search function to see if you’ve missed any - there are a LOT!

Add YAML header!!! Give it a title

A complete bioinformatics workflow in R

By: Nathan L. Brouwer

“Worked example: Building a phylogeny in R”

Introduction

The goal is to get the “big picture”

Describe how phylogeneies can be used in biology (readings will be assigned)

Phylogeneies are used to compare the genomes of different species and to explore evolutionary lineage among such species.

Vocab

Make a list of at least 10 vocab terms that are important (don’t have to define)

1. Fasta file

2. Reproducible workflow

3. Multiple Sequence Allignment “MSA”

4. Accession Number

5. Distance Matrix

6. Pairwise Alignment

7. Shroom

8. Indels

9. Sequence Logo

10. BLAST results

Key functions

Make a list of at least 5 key functions Put in the format of package::function

1.devtools::install_github(“brouwern/compbio4all”)

2.knitr::opts_chunk$set(echo = TRUE)

3.install.packages(“BiocManager”)

4.devtools::install_github(“YuLab-SMU/ggmsa”)

5.hShroom3 <- rentrez::entrez_fetch(db = “protein”, id = “NP_065910”, rettype = “fasta”)

Software Preliminaires

Add the necessary calls to library() to load call packages Indicate which packages cam from Bioconducotr, CRAN, and GitHub

Load packages into memory

# BioConductor
#install.packages("BiocManager")
library(BiocManager)
#BiocManager::install("Biostrings")
library(Biostrings)

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:grDevices':
## 
##     windows

## Loading required package: XVector

## Loading required package: GenomeInfoDb

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

#install.packages("devtools")
library("devtools")

## Loading required package: usethis

## 
## Attaching package: 'devtools'

## The following object is masked from 'package:BiocManager':
## 
##     install

#Github

#  ------------------------------
#devtools::install_github("brouwern/compbio4all")
library("compbio4all")

#CRAN

#install.packages("rentrez")
library(rentrez)

#install.packages("seqinr")
library(seqinr)

## 
## Attaching package: 'seqinr'

## The following object is masked from 'package:Biostrings':
## 
##     translate

#install.packages("ape")
library("ape")

## 
## Attaching package: 'ape'

## The following objects are masked from 'package:seqinr':
## 
##     as.alignment, consensus

## The following object is masked from 'package:Biostrings':
## 
##     complement

#Github

#devtools::install_github("YuLab-SMU/ggmsa")
library(ggmsa)

## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2

(Explain what is going on…) We are assigning the human shroom data to hShroom3; the funtion downloads data and assigns it to something.

#library(rentrez)
# Human shroom 3 (H. sapiens)
#hShroom3 <- rentrez::entrez_fetch(db = "protein", 
                          #id = "NP_065910", 
                          #rettype = "fasta")

slash n is new line character

hShroom3

## [1] ">NP_065910.3 protein Shroom3 [Homo sapiens]\nMMRTTEDFHKPSATLNSNTATKGRYIYLEAFLEGGAPWGFTLKGGLEHGEPLIISKVEEGGKADTLSSKL\nQAGDEVVHINEVTLSSSRKEAVSLVKGSYKTLRLVVRRDVCTDPGHADTGASNFVSPEHLTSGPQHRKAA\nWSGGVKLRLKHRRSEPAGRPHSWHTTKSGEKQPDASMMQISQGMIGPPWHQSYHSSSSTSDLSNYDHAYL\nRRSPDQCSSQGSMESLEPSGAYPPCHLSPAKSTGSIDQLSHFHNKRDSAYSSFSTSSSILEYPHPGISGR\nERSGSMDNTSARGGLLEGMRQADIRYVKTVYDTRRGVSAEYEVNSSALLLQGREARASANGQGYDKWSNI\nPRGKGVPPPSWSQQCPSSLETATDNLPPKVGAPLPPARSDSYAAFRHRERPSSWSSLDQKRLCRPQANSL\nGSLKSPFIEEQLHTVLEKSPENSPPVKPKHNYTQKAQPGQPLLPTSIYPVPSLEPHFAQVPQPSVSSNGM\nLYPALAKESGYIAPQGACNKMATIDENGNQNGSGRPGFAFCQPLEHDLLSPVEKKPEATAKYVPSKVHFC\nSVPENEEDASLKRHLTPPQGNSPHSNERKSTHSNKPSSHPHSLKCPQAQAWQAGEDKRSSRLSEPWEGDF\nQEDHNANLWRRLEREGLGQSLSGNFGKTKSAFSSLQNIPESLRRHSSLELGRGTQEGYPGGRPTCAVNTK\nAEDPGRKAAPDLGSHLDRQVSYPRPEGRTGASASFNSTDPSPEEPPAPSHPHTSSLGRRGPGPGSASALQ\nGFQYGKPHCSVLEKVSKFEQREQGSQRPSVGGSGFGHNYRPHRTVSTSSTSGNDFEETKAHIRFSESAEP\nLGNGEQHFKNGELKLEEASRQPCGQQLSGGASDSGRGPQRPDARLLRSQSTFQLSSEPEREPEWRDRPGS\nPESPLLDAPFSRAYRNSIKDAQSRVLGATSFRRRDLELGAPVASRSWRPRPSSAHVGLRSPEASASASPH\nTPRERHSVTPAEGDLARPVPPAARRGARRRLTPEQKKRSYSEPEKMNEVGIVEEAEPAPLGPQRNGMRFP\nESSVADRRRLFERDGKACSTLSLSGPELKQFQQSALADYIQRKTGKRPTSAAGCSLQEPGPLRERAQSAY\nLQPGPAALEGSGLASASSLSSLREPSLQPRREATLLPATVAETQQAPRDRSSSFAGGRRLGERRRGDLLS\nGANGGTRGTQRGDETPREPSSWGARAGKSMSAEDLLERSDVLAGPVHVRSRSSPATADKRQDVLLGQDSG\nFGLVKDPCYLAGPGSRSLSCSERGQEEMLPLFHHLTPRWGGSGCKAIGDSSVPSECPGTLDHQRQASRTP\nCPRPPLAGTQGLVTDTRAAPLTPIGTPLPSAIPSGYCSQDGQTGRQPLPPYTPAMMHRSNGHTLTQPPGP\nRGCEGDGPEHGVEEGTRKRVSLPQWPPPSRAKWAHAAREDSLPEESSAPDFANLKHYQKQQSLPSLCSTS\nDPDTPLGAPSTPGRISLRISESVLRDSPPPHEDYEDEVFVRDPHPKATSSPTFEPLPPPPPPPPSQETPV\nYSMDDFPPPPPHTVCEAQLDSEDPEGPRPSFNKLSKVTIARERHMPGAAHVVGSQTLASRLQTSIKGSEA\nESTPPSFMSVHAQLAGSLGGQPAPIQTQSLSHDPVSGTQGLEKKVSPDPQKSSEDIRTEALAKEIVHQDK\nSLADILDPDSRLKTTMDLMEGLFPRDVNLLKENSVKRKAIQRTVSSSGCEGKRNEDKEAVSMLVNCPAYY\nSVSAPKAELLNKIKEMPAEVNEEEEQADVNEKKAELIGSLTHKLETLQEAKGSLLTDIKLNNALGEEVEA\nLISELCKPNEFDKYRMFIGDLDKVVNLLLSLSGRLARVENVLSGLGEDASNEERSSLYEKRKILAGQHED\nARELKENLDRRERVVLGILANYLSEEQLQDYQHFVKMKSTLLIEQRKLDDKIKLGQEQVKCLLESLPSDF\nIPKAGALALPPNLTSEPIPAGGCTFSGIFPTLTSPL\n\n"

Humans have 3 shrooms

TODO:explain what cat() is doing # It starts a new line character; printing it out on the screen.

cat(hShroom3)

## >NP_065910.3 protein Shroom3 [Homo sapiens]
## MMRTTEDFHKPSATLNSNTATKGRYIYLEAFLEGGAPWGFTLKGGLEHGEPLIISKVEEGGKADTLSSKL
## QAGDEVVHINEVTLSSSRKEAVSLVKGSYKTLRLVVRRDVCTDPGHADTGASNFVSPEHLTSGPQHRKAA
## WSGGVKLRLKHRRSEPAGRPHSWHTTKSGEKQPDASMMQISQGMIGPPWHQSYHSSSSTSDLSNYDHAYL
## RRSPDQCSSQGSMESLEPSGAYPPCHLSPAKSTGSIDQLSHFHNKRDSAYSSFSTSSSILEYPHPGISGR
## ERSGSMDNTSARGGLLEGMRQADIRYVKTVYDTRRGVSAEYEVNSSALLLQGREARASANGQGYDKWSNI
## PRGKGVPPPSWSQQCPSSLETATDNLPPKVGAPLPPARSDSYAAFRHRERPSSWSSLDQKRLCRPQANSL
## GSLKSPFIEEQLHTVLEKSPENSPPVKPKHNYTQKAQPGQPLLPTSIYPVPSLEPHFAQVPQPSVSSNGM
## LYPALAKESGYIAPQGACNKMATIDENGNQNGSGRPGFAFCQPLEHDLLSPVEKKPEATAKYVPSKVHFC
## SVPENEEDASLKRHLTPPQGNSPHSNERKSTHSNKPSSHPHSLKCPQAQAWQAGEDKRSSRLSEPWEGDF
## QEDHNANLWRRLEREGLGQSLSGNFGKTKSAFSSLQNIPESLRRHSSLELGRGTQEGYPGGRPTCAVNTK
## AEDPGRKAAPDLGSHLDRQVSYPRPEGRTGASASFNSTDPSPEEPPAPSHPHTSSLGRRGPGPGSASALQ
## GFQYGKPHCSVLEKVSKFEQREQGSQRPSVGGSGFGHNYRPHRTVSTSSTSGNDFEETKAHIRFSESAEP
## LGNGEQHFKNGELKLEEASRQPCGQQLSGGASDSGRGPQRPDARLLRSQSTFQLSSEPEREPEWRDRPGS
## PESPLLDAPFSRAYRNSIKDAQSRVLGATSFRRRDLELGAPVASRSWRPRPSSAHVGLRSPEASASASPH
## TPRERHSVTPAEGDLARPVPPAARRGARRRLTPEQKKRSYSEPEKMNEVGIVEEAEPAPLGPQRNGMRFP
## ESSVADRRRLFERDGKACSTLSLSGPELKQFQQSALADYIQRKTGKRPTSAAGCSLQEPGPLRERAQSAY
## LQPGPAALEGSGLASASSLSSLREPSLQPRREATLLPATVAETQQAPRDRSSSFAGGRRLGERRRGDLLS
## GANGGTRGTQRGDETPREPSSWGARAGKSMSAEDLLERSDVLAGPVHVRSRSSPATADKRQDVLLGQDSG
## FGLVKDPCYLAGPGSRSLSCSERGQEEMLPLFHHLTPRWGGSGCKAIGDSSVPSECPGTLDHQRQASRTP
## CPRPPLAGTQGLVTDTRAAPLTPIGTPLPSAIPSGYCSQDGQTGRQPLPPYTPAMMHRSNGHTLTQPPGP
## RGCEGDGPEHGVEEGTRKRVSLPQWPPPSRAKWAHAAREDSLPEESSAPDFANLKHYQKQQSLPSLCSTS
## DPDTPLGAPSTPGRISLRISESVLRDSPPPHEDYEDEVFVRDPHPKATSSPTFEPLPPPPPPPPSQETPV
## YSMDDFPPPPPHTVCEAQLDSEDPEGPRPSFNKLSKVTIARERHMPGAAHVVGSQTLASRLQTSIKGSEA
## ESTPPSFMSVHAQLAGSLGGQPAPIQTQSLSHDPVSGTQGLEKKVSPDPQKSSEDIRTEALAKEIVHQDK
## SLADILDPDSRLKTTMDLMEGLFPRDVNLLKENSVKRKAIQRTVSSSGCEGKRNEDKEAVSMLVNCPAYY
## SVSAPKAELLNKIKEMPAEVNEEEEQADVNEKKAELIGSLTHKLETLQEAKGSLLTDIKLNNALGEEVEA
## LISELCKPNEFDKYRMFIGDLDKVVNLLLSLSGRLARVENVLSGLGEDASNEERSSLYEKRKILAGQHED
## ARELKENLDRRERVVLGILANYLSEEQLQDYQHFVKMKSTLLIEQRKLDDKIKLGQEQVKCLLESLPSDF
## IPKAGALALPPNLTSEPIPAGGCTFSGIFPTLTSPL

TODO: explain what this code chunk is doing ## Three different accession numbers among three species and is assigning it to different shroom titles.

# Mouse shroom 3a (M. musculus)
#mShroom3a <- rentrez::entrez_fetch(db = "protein", 
                          #id = "AAF13269", 
                          #rettype = "fasta")

# Human shroom 2 (H. sapiens)
#hShroom2 <- rentrez::entrez_fetch(db = "protein", 
                          #id = "CAA58534", 
                          #rettype = "fasta")


# Sea-urchin shroom
#sShroom <- rentrez::entrez_fetch(db = "protein", 
                         # id = "XP_783573", 
                        #  rettype = "fasta")

TODO: Explain what this code chunk is doing ## Shows the number of amino acids in the fasta file.

#nchar(hShroom3)
#nchar(mShroom3a)
#nchar(sShroom)
#nchar(hShroom2)

Cleaning/Prepping macromolecular sequences

TODO: Explain what this function does ## It cleans fastas and puts them in the right format.

library(compbio4all)
fasta_cleaner

## function (fasta_object, parse = TRUE) 
## {
##     fasta_object <- sub("^(>)(.*?)(\\n)(.*)(\\n\\n)", "\\4", 
##         fasta_object)
##     fasta_object <- gsub("\n", "", fasta_object)
##     if (parse == TRUE) {
##         fasta_object <- stringr::str_split(fasta_object, pattern = "", 
##             simplify = FALSE)
##     }
##     return(fasta_object[[1]])
## }
## <bytecode: 0x0000000043a46ab0>
## <environment: namespace:compbio4all>

#TODO: explain how to add the function to your R session even if you can’t download compbio4all ## Put the function in a package. #r} #fasta_cleaner <- function(fasta_object, parse = TRUE){

# fasta_object <- sub(“^(>)(.?)(\n)(.)(\n\n)”,“\4”,fasta_object) # fasta_object <- gsub(“”, "", fasta_object)

# if(parse == TRUE){ #fasta_object <- stringr::str_split(fasta_object, # pattern = "", # simplify = FALSE) # }

return(fasta_object[[1]])

#TODO: briefly explain what this code chunk is doing ## This puts the fastas in the correct fotmat.

#hShroom3  <- fasta_cleaner(hShroom3,  parse = F)
#mShroom3a <- fasta_cleaner(mShroom3a, parse = F)
#hShroom2  <- fasta_cleaner(hShroom2,  parse = F)
#sShroom   <- fasta_cleaner(sShroom,   parse = F)

#hShroom3

#nchar(hShroom3)

Pairwise Alignment of sequences (sibling of msa)

#TODO: give this a title. Explain what code below is doing ## It is figuring out the best way to line up the sequences.

library(Biostrings)
# add necessary function
#align.h3.vs.m3a <- Biostrings::pairwiseAlignment(
                  #hShroom3,
                  #mShroom3a)

#align.h3.vs.m3a

#TODO: In 1-2 sentence explain what this object shows ## It compares the two different shrooms of the two species and their sequence allignment. The dashes indicate “gab or indel” insertion/deletion.

#align.h3.vs.m3a

#TODO: explain what this is showing ## It is comparing the two sequences and counting the number of identical bases; percent identity; how similar they are.

# add necessary function
#Biostrings::pid(align.h3.vs.m3a)

#TODO: briefly explain what is going on here versus the previous code chunk ## Assigned new shrooms 2 and 3 to the title.

#align.h3.vs.h2 <- Biostrings::pairwiseAlignment(
                  #hShroom3,
                  #hShroom2)

#TODO: explain what is going on here and compare and contrast with previous ouput ## 70% identity for 1st and now a 30% identity; genes bewteen the two shrooms are 70% identical.

#pid(align.h3.vs.h2)
#score(align.h3.vs.h2)

#TODO: briefly explian the difference between the output of score() and pid() (can be very brief - we’ll get into the details later)

score() a summary of the quality of allignment, pid() percent identity

#Biostrings::pid(align.h3.vs.h2)

The shroom family of genes

#TODO: briefly explain why I have this whole table here ## Many shrooms of different species laying out the vector as a table for msa.

#shroom_table <- c("CAA78718" , "X. laevis Apx" ,         "xShroom1",
            #"NP_597713" , "H. sapiens APXL2" ,     "hShroom1",
            #"CAA58534" , "H. sapiens APXL",        "hShroom2",
           # "ABD19518" , "M. musculus Apxl" ,      "mShroom2",
           # "AAF13269" , "M. musculus ShroomL" ,   "mShroom3a",
           # "AAF13270" , "M. musculus ShroomS" ,   "mShroom3b",
           # "NP_065910", "H. sapiens Shroom" ,     "hShroom3",
           # "ABD59319" , "X. laevis Shroom-like",  "xShroom3",
           # "NP_065768", "H. sapiens KIAA1202" ,   "hShroom4a",
           # "AAK95579" , "H. sapiens SHAP-A" ,     "hShroom4b",
           # #"DQ435686" , "M. musculus KIAA1202" ,  "mShroom4",
          #  "ABA81834" , "D. melanogaster Shroom", "dmShroom",
          #  "EAA12598" , "A. gambiae Shroom",      "agShroom",
          #  "XP_392427" , "A. mellifera Shroom" ,  "amShroom",
          #  "XP_783573" , "S. purpuratus Shroom" , "spShroom") #sea urchin

#TODO: write a short sentence explaining what this next code chunk will do, then annotate each line with what was done. ## This will transfer the specific data from the shrooms to the table format.

# convert to XXXXXXXXXC
#shroom_table_matrix <- matrix(shroom_table,
                                  #byrow = T,
                                  #nrow = 14)
# convert to XXXXXXXXXC
#shroom_table <- data.frame(shroom_table_matrix, 
                     #stringsAsFactors = F)

# XXXXXXXXXC columns
#names(shroom_table) <- c("accession", "name.orig","name.new")

# Create simplified species names
#shroom_table$spp <- "Homo"
#shroom_table$spp[grep("laevis",shroom_table$name.orig)] <- "Xenopus"
#shroom_table$spp[grep("musculus",shroom_table$name.orig)] <- "Mus"
#shroom_table$spp[grep("melanogaster",shroom_table$name.orig)] <- "Drosophila"
#shroom_table$spp[grep("gambiae",shroom_table$name.orig)] <- "mosquito"
#shroom_table$spp[grep("mellifera",shroom_table$name.orig)] <- "bee"
#shroom_table$spp[grep("purpuratus",shroom_table$name.orig)] <- "sea urchin"

#TODO: in a brief sentence explain what this is doing ## It is putting the shroom vectors in a table format.

#shroom_table

XXXXXing multiple sequences

#TODO: in a brief sentence explain what the $ allows us to do ## Access all of the numbers for the different species.

#shroom_table$accession

#TODO: briefly explain what this chunk is doing and add the correct function ## This makes what we are doing reproducable; a long file with the 14 sequences.

# add necessary function
#shrooms <-rentrez::entrez_fetch(db = "protein", 
                         # id = shroom_table$accession, 
                         # rettype = "fasta")

#is(shrooms)
#nchar(shrooms)

#TODO: in a very brief sentence explain what this is doing. ## Making the above file look better with a new format.

#cat(shrooms)

#TODO: in a brief sentence explain what this is doing and if/how its different from the previous code chunks ## Assigning the data into a list data structure. ## Wrapper - relationship b/w entrez and compbio4all –> “dependency!!”

#shrooms_list <- compbio4all::entrez_fetch_list(db = "protein", 
                          #id = shroom_table$accession, 
                          #rettype = "fasta")

#is(shrooms_list)

#TODO: briefly explain what I am doing this ## Length of 14, units/parts.

#length(shrooms_list)

#TODO: briefly explain what I am doing this. We will get into the details of for() loops in R later in the semester. ## Building a msa from a list; for loops alow you to repeat a task, repeating cleaning.

`#{r} #for(i in 1:length(shrooms_list)){ #shrooms_list[[i]] <- fasta_cleaner(shrooms_list[[i]], parse = F) }



#TODO: summarize what is going on in this code chunk, then annotate each line of code with what its doing
## We are "cleaning" data.
```#{r}
# XXXXXXXXCX
#shrooms_vector <- rep(NA, length(shrooms_list))

# XXXXXXXXCX
for(i in 1:length(shrooms_vector)){
  #shrooms_vector[i] <- shrooms_list[[i]]
}

#  XXXXXXXXCX
#names(shrooms_vector) <- names(shrooms_list)

#TODO: explain what this is doing then add the necessary function. ## Converting the vector to a string set; amino acid ss. `#{r} # add necessary function #shrooms_vector_ss <- Biostrings::AAStringSet(shrooms_vector)



## MSA


##TODO: briefly summarize what this section of the document will do.  
Readings will be assigned to explain what MSAs are.

## Multiple sequcne allignments compare/align sequences to assess their similarities.


### Building an XXXXXXXX (MSA)



##TODO: briefly explain what this chunk does, then add the necessary function.
## msa will format the data in the ss format; reimplementation of an algorithm; assigning the msa to the new title.
``#`{r}
# add necessary function
#library(msa)
#shrooms_align <-msa(shrooms_vector_ss,
                     #method = "ClustalW")

Viewing an MSA

#TODO: briefly summarize what this section will do. ## Data preparation.

Viewing an MSA in R

#TODO: Briefly summarize what output is shown below ## It displays the raw output of R, somewhat unhelpful.

#shrooms_align

#TODO: briefly explain what is being done in this chunk. This is tricky (and annoying) so do your best

# WHAT IS THE LINE BELOW DOING? (its tricky - do your best)
#class(shrooms_align) <- "AAMultipleAlignment"
## Assigning the amino acid format/template to the shroom. 


# WHAT IS THE LINE BELOW DOING? This is simpler
#shrooms_align_seqinr <- msaConvert(shrooms_align, type = "seqinr::alignment")
## Simplifying the shrooms by comparing the msa's.

#TODO: what is the output this produces ## Displaying the current msa via the console. #{r, eval = F} #print_msa(alignment = shrooms_align_seqinr, #chunksize = 60)

Displaying an MSA XXXXXXXX

#TODO: explain this output and how its differnet from the prevoius ## Prints out the msa in a better, more attractive format.Produces one section of the shroom gene.

## add necessary function
library(ggmsa)
#ggmsa::ggmsa(shrooms_align,   # shrooms_align, NOT shrooms_align_seqinr
     # start = 2000, 
     # end = 2100)

Saving an MSA as PDF

TODO: explain what this command is doing. Add the package the function is coming from using :: notation This may not work for everyone. If its not working you can comment it out.

#msaPrettyPrint(shrooms_align,             # alignment
              # file = "shroom_msa.pdf",   # file name
              # y=c(2000, 2100),           # range
              # askForOverwrite=FALSE)

TODO: explain what this command is doing ## It is showing the location/path of the address for saving. “Get working directory.”

#getwd()

#list.files()

Part 2

#To make things easier we’ll move forward with just a subset of sequences:

#* XP_392427: amShroom (bee shroom) #* EAA12598: agShroom (mosquito shroom) #* ABA81834: dmShroom (Drosophila shroom) #* XP_783573: spShroom (sea urchin shroom) #* CAA78718: xShroom1 (frog shroom)

#Our main working object shrooms_vector_ss has the names of our genes listed

#names(shrooms_vector_ss)

#We can select the ones we want to focus on be first making a vector of the names

#names.focal <- c("XP_392427","EAA12598","ABA81834","XP_783573","CAA78718")

#We can use this vector and bracket notation to select the what we want from shrooms_vector_ss:

#shrooms_vector_ss[names.focal]

#Let’s assign the subset of sequences to a new object called shrooms_vector_ss_subset.

#shrooms_vector_ss_subset <- shrooms_vector_ss[names.focal]

#Let’s make another MSA with just this subset. If msa isn’t working for you you can comment this out.

#shrooms_align_subset <- msa(shrooms_vector_ss_subset,
                     #method = "ClustalW")

#To view it using ggmsa we need to do those annoying conversions again.

#class(shrooms_align_subset) <- "AAMultipleAlignment"
#shrooms_align_subset_seqinr <- msaConvert(shrooms_align_subset, type = "seqinr::alignment")

#Then we can plot it

#ggmsa::ggmsa(shrooms_align_subset,   # shrooms_align, NOT shrooms_align_seqinr
      #start = 2030, 
      #end = 2100)

#We can save our new smaller MSA like this. #{r, eval = F} #msaPrettyPrint(shrooms_align_subset, # alignment # file = "shroom_msa_subset.pdf", # file name # y=c(2030, 2100), # range # askForOverwrite=FALSE)

Genetic distances of sequence in subset

#While an MSA is a good way to examine a sequence its hard to assess all of the information visually. A phylogenetic tree allows you to summarize patterns in an MSA. The fastest way to make phylogenetic trees to is first summarize an MSA using a genetic distance matrix. The more amino acids that are identical to each other, the smaller the genetic distance is between them and the less evolution has occurred.

#We usually work in terms of difference or genetic distance (a.k.a. evolutionary distance), though often we also talk in terms of similarity or identity.

#Calculating genetic distance from an MSA is done using the seqinr::dist.alignment() function.

#shrooms_subset_dist <- seqinr::dist.alignment(shrooms_align_subset_seqinr, 
                                     #  matrix = "identity")

#This produces a “dist” class object.

#is(shrooms_subset_dist)
#class(shrooms_subset_dist)

#If you’ve been having trouble with the MSA software, the data necessary to build the distance matrix directly in R is in this code chunk (you can ignore the details).

#shrooms_subset_dist_alt <- matrix(data = NA,
    #                          nrow = 5, 
     #                         ncol = 5)

#distances <- c(0.8260049, 
    #           0.8478722, 0.9000568, 
    #           0.9244596, 0.9435187, 0.9372139, 
     #          0.9238779, 0.9370038, 0.9323225,0.9413209)
shrooms_subset_dist_alt[lower.tri(shrooms_subset_dist_alt)] <- distances

#seqnames <- c("EAA12598","ABA81834","XP_392427", "XP_783573","CAA78718")
#colnames(shrooms_subset_dist_alt) <- seqnames
#row.names(shrooms_subset_dist_alt)  <- seqnames
#shrooms_subset_dist_alt <- as.dist(shrooms_subset_dist_alt)
#shrooms_subset_dist <- shrooms_subset_dist_alt

#We’ve made a matrix using dist.alignment(); let’s round it off so its easier to look at using the round() function.

#shrooms_subset_dist_rounded <- round(shrooms_subset_dist,
                            #  digits = 3)

#If we want to look at it we can type #{r eval = T} #shrooms_subset_dist_rounded

#Not that we have 5 sequence, but the matrix is 4 x 4. This is because redundant information is dropped, including distances from one sequence to itself. This makes it so that the first column is EAA12598, but the first row is ABA81834.

Phylognetic trees of subset sequences (finally!)

#We got our sequences, built a multiple sequence alignment, and calculated the genetic distance between sequences. Now we are - finally - ready to build a phylogenetic tree.

#First, we let R figure out the structure of the tree. There are MANY ways to build phylogenetic trees. We’ll use a common one used for exploring sequences called neighbor joining algorithm via the function nj(). Neighbor joining uses genetic distances to cluster sequences into clades.

#nj() is simple function that takes only a single argument, a distance matrix.

# Note - not using rounded values
#tree_subset <- nj(shrooms_subset_dist)

Plotting phylogenetic trees

#Now we’ll make a quick plot of our tree using plot() (and add a little label using an important function called mtext()).

# plot tree
#plot.phylo(tree_subset, main="Phylogenetic Tree", 
         #   type = "unrooted", 
         #   use.edge.length = F)

# add label
#mtext(text = "Shroom family gene tree - UNrooted, no branch lengths")

#This is an unrooted tree with no outgroup defined. For the sake of plotting we’ve also ignored the evolutionary distance between the sequences, so the branch lengths don’t have meaning.

#To make a rooted tree we remove type = "unrooted. In the case of neighbor joining, the algorithm tries to figure out the outgroup on its own.

# plot tree
#plot.phylo(tree_subset, main="Phylogenetic Tree", 
 #           use.edge.length = F)

# add label
#mtext(text = "Shroom family gene tree - rooted, no branch lenths")

#We can include information about branch length by setting use.edge.length = ... to T.

# plot tree
#plot.phylo(tree_subset, main="Phylogenetic Tree", 
            use.edge.length = T)

# add label
#mtext(text = "Shroom family gene tree - rooted, with branch lenths")

#Now the length of the branches indicates the evolutionary distance between sequences and correlate to the distances reported in our distance matrix. The branches are all very long, indicating that these genes have been evolving independently for many millions of years.

#An important note: the vertical lines on the tree have no meaning, only the horizontal ones.

#Because the branch lengths are all so long I find this tree a bit hard to view when its rooted. Let’s make it unrooted again.

# plot tree
#plot.phylo(tree_subset, main="Phylogenetic Tree", 
     #      type = "unrooted",
      #      use.edge.length = T)

# add label
#mtext(text = "Shroom family gene tree - rooted, with branch lenths")

#Now you can see that the ABA and EAA sequences form a clade, and that the distance between them is somewhat smaller than the distance between other sequences. If we go back to our original distance matrix, we can see that the smallest genetic distance is between ABA and EAA at 0.826.

#shrooms_subset_dist_rounded

#We can confirm that this is the minimum using the min() function.

#min(shrooms_subset_dist_rounded)

A bioinformatics workflow

Robbie Conner

9/28/2021