Code By: Nathan L. Brouwer
In biology, phylogenies can be used to study evolutionary relationships among different species of organisms. Phylogenies can be made using morphological data or molecular data (nucleic acids or proteins). Scientists use phylogenetic trees to visualize evolutionary relationships among different species, and can also show evolutionary.
package, function, dependency, accession number, pairwise alignment, multiple sequence alignment, phylogenetic tree, reproducible workflow, YAML header, knitr header
rentrez::entrez_fetch Biostrings::pairwiseAlignment Biostrings::pid compbio4all::fasta_cleaner compbio4all::entrez_fetch_list compbio4all::print_msa
CRAN : install.packages(“PackageName”) function Bioconductor: first install.packages(“BiocManager”), then library(BiocManager), and then to install a package within Bioconductor use the install(“PackageName”) function Github: first install.packages(“devtools”), then library(devtools), and then install_github(“DeveloperName/PackageName”)
# github packages
library(compbio4all)
# CRAN packages
library(rentrez)
library(seqinr)
library(ape)
# Bioconductor packages
library(msa)
library(Biostrings)
In this chunk of code, you are using the entrez_fetch function to retrieve fasta files from NCBI databases on the internet. rentrez :: entrez_fetch
# Human shroom 3 (H. sapiens)
hShroom3 <- entrez_fetch(db = "protein",
id = "NP_065910",
rettype = "fasta")
The cat() function respects the newline character . Without cat(), Rstudio will not respect the newline character and the code will remain on one line.
cat(hShroom3)
## >NP_065910.3 protein Shroom3 [Homo sapiens]
## MMRTTEDFHKPSATLNSNTATKGRYIYLEAFLEGGAPWGFTLKGGLEHGEPLIISKVEEGGKADTLSSKL
## QAGDEVVHINEVTLSSSRKEAVSLVKGSYKTLRLVVRRDVCTDPGHADTGASNFVSPEHLTSGPQHRKAA
## WSGGVKLRLKHRRSEPAGRPHSWHTTKSGEKQPDASMMQISQGMIGPPWHQSYHSSSSTSDLSNYDHAYL
## RRSPDQCSSQGSMESLEPSGAYPPCHLSPAKSTGSIDQLSHFHNKRDSAYSSFSTSSSILEYPHPGISGR
## ERSGSMDNTSARGGLLEGMRQADIRYVKTVYDTRRGVSAEYEVNSSALLLQGREARASANGQGYDKWSNI
## PRGKGVPPPSWSQQCPSSLETATDNLPPKVGAPLPPARSDSYAAFRHRERPSSWSSLDQKRLCRPQANSL
## GSLKSPFIEEQLHTVLEKSPENSPPVKPKHNYTQKAQPGQPLLPTSIYPVPSLEPHFAQVPQPSVSSNGM
## LYPALAKESGYIAPQGACNKMATIDENGNQNGSGRPGFAFCQPLEHDLLSPVEKKPEATAKYVPSKVHFC
## SVPENEEDASLKRHLTPPQGNSPHSNERKSTHSNKPSSHPHSLKCPQAQAWQAGEDKRSSRLSEPWEGDF
## QEDHNANLWRRLEREGLGQSLSGNFGKTKSAFSSLQNIPESLRRHSSLELGRGTQEGYPGGRPTCAVNTK
## AEDPGRKAAPDLGSHLDRQVSYPRPEGRTGASASFNSTDPSPEEPPAPSHPHTSSLGRRGPGPGSASALQ
## GFQYGKPHCSVLEKVSKFEQREQGSQRPSVGGSGFGHNYRPHRTVSTSSTSGNDFEETKAHIRFSESAEP
## LGNGEQHFKNGELKLEEASRQPCGQQLSGGASDSGRGPQRPDARLLRSQSTFQLSSEPEREPEWRDRPGS
## PESPLLDAPFSRAYRNSIKDAQSRVLGATSFRRRDLELGAPVASRSWRPRPSSAHVGLRSPEASASASPH
## TPRERHSVTPAEGDLARPVPPAARRGARRRLTPEQKKRSYSEPEKMNEVGIVEEAEPAPLGPQRNGMRFP
## ESSVADRRRLFERDGKACSTLSLSGPELKQFQQSALADYIQRKTGKRPTSAAGCSLQEPGPLRERAQSAY
## LQPGPAALEGSGLASASSLSSLREPSLQPRREATLLPATVAETQQAPRDRSSSFAGGRRLGERRRGDLLS
## GANGGTRGTQRGDETPREPSSWGARAGKSMSAEDLLERSDVLAGPVHVRSRSSPATADKRQDVLLGQDSG
## FGLVKDPCYLAGPGSRSLSCSERGQEEMLPLFHHLTPRWGGSGCKAIGDSSVPSECPGTLDHQRQASRTP
## CPRPPLAGTQGLVTDTRAAPLTPIGTPLPSAIPSGYCSQDGQTGRQPLPPYTPAMMHRSNGHTLTQPPGP
## RGCEGDGPEHGVEEGTRKRVSLPQWPPPSRAKWAHAAREDSLPEESSAPDFANLKHYQKQQSLPSLCSTS
## DPDTPLGAPSTPGRISLRISESVLRDSPPPHEDYEDEVFVRDPHPKATSSPTFEPLPPPPPPPPSQETPV
## YSMDDFPPPPPHTVCEAQLDSEDPEGPRPSFNKLSKVTIARERHMPGAAHVVGSQTLASRLQTSIKGSEA
## ESTPPSFMSVHAQLAGSLGGQPAPIQTQSLSHDPVSGTQGLEKKVSPDPQKSSEDIRTEALAKEIVHQDK
## SLADILDPDSRLKTTMDLMEGLFPRDVNLLKENSVKRKAIQRTVSSSGCEGKRNEDKEAVSMLVNCPAYY
## SVSAPKAELLNKIKEMPAEVNEEEEQADVNEKKAELIGSLTHKLETLQEAKGSLLTDIKLNNALGEEVEA
## LISELCKPNEFDKYRMFIGDLDKVVNLLLSLSGRLARVENVLSGLGEDASNEERSSLYEKRKILAGQHED
## ARELKENLDRRERVVLGILANYLSEEQLQDYQHFVKMKSTLLIEQRKLDDKIKLGQEQVKCLLESLPSDF
## IPKAGALALPPNLTSEPIPAGGCTFSGIFPTLTSPL
In this chunk of code, R using the entrez_fetch function to retrieve the sequences of the 3 shroom proteins from the NCBI database, in FASTA format.
# Mouse shroom 3a (M. musculus)
mShroom3a <- entrez_fetch(db = "protein",
id = "AAF13269",
rettype = "fasta")
# Human shroom 2 (H. sapiens)
hShroom2 <- entrez_fetch(db = "protein",
id = "CAA58534",
rettype = "fasta")
# Sea-urchin shroom
sShroom <- entrez_fetch(db = "protein",
id = "XP_783573",
rettype = "fasta")
This chunk of code is telling you the number of amino acids in each protein. In general, the nchar() function tells you the number of characters in an object.
nchar(hShroom3)
## [1] 2070
nchar(mShroom3a)
## [1] 2083
nchar(sShroom)
## [1] 1758
nchar(hShroom2)
## [1] 1673
The fasta_cleaner function converts a FASTA file stored as an object into a vector, making it friendly to do alignments in R.
fasta_cleaner
## function (fasta_object, parse = TRUE)
## {
## fasta_object <- sub("^(>)(.*?)(\\n)(.*)(\\n\\n)", "\\4",
## fasta_object)
## fasta_object <- gsub("\n", "", fasta_object)
## if (parse == TRUE) {
## fasta_object <- stringr::str_split(fasta_object, pattern = "",
## simplify = FALSE)
## }
## return(fasta_object[[1]])
## }
## <bytecode: 0x7f99595b1660>
## <environment: namespace:compbio4all>
Use function() in order to define a new function that would essentially accomplish the same goal as the preexisting one contained in a different package. The function() function is is Base R so no packages are needed to use it.
fasta_cleaner <- function(fasta_object, parse = TRUE){
fasta_object <- sub("^(>)(.*?)(\\n)(.*)(\\n\\n)","\\4",fasta_object)
fasta_object <- gsub("\n", "", fasta_object)
if(parse == TRUE){
fasta_object <- stringr::str_split(fasta_object,
pattern = "",
simplify = FALSE)
}
return(fasta_object[[1]])
}
In this code chunk, R takes the fasta sequences of each shroom proteins and converts them into a character vector.
hShroom3 <- fasta_cleaner(hShroom3, parse = F)
mShroom3a <- fasta_cleaner(mShroom3a, parse = F)
hShroom2 <- fasta_cleaner(hShroom2, parse = F)
sShroom <- fasta_cleaner(sShroom, parse = F)
hShroom3
## [1] "MMRTTEDFHKPSATLNSNTATKGRYIYLEAFLEGGAPWGFTLKGGLEHGEPLIISKVEEGGKADTLSSKLQAGDEVVHINEVTLSSSRKEAVSLVKGSYKTLRLVVRRDVCTDPGHADTGASNFVSPEHLTSGPQHRKAAWSGGVKLRLKHRRSEPAGRPHSWHTTKSGEKQPDASMMQISQGMIGPPWHQSYHSSSSTSDLSNYDHAYLRRSPDQCSSQGSMESLEPSGAYPPCHLSPAKSTGSIDQLSHFHNKRDSAYSSFSTSSSILEYPHPGISGRERSGSMDNTSARGGLLEGMRQADIRYVKTVYDTRRGVSAEYEVNSSALLLQGREARASANGQGYDKWSNIPRGKGVPPPSWSQQCPSSLETATDNLPPKVGAPLPPARSDSYAAFRHRERPSSWSSLDQKRLCRPQANSLGSLKSPFIEEQLHTVLEKSPENSPPVKPKHNYTQKAQPGQPLLPTSIYPVPSLEPHFAQVPQPSVSSNGMLYPALAKESGYIAPQGACNKMATIDENGNQNGSGRPGFAFCQPLEHDLLSPVEKKPEATAKYVPSKVHFCSVPENEEDASLKRHLTPPQGNSPHSNERKSTHSNKPSSHPHSLKCPQAQAWQAGEDKRSSRLSEPWEGDFQEDHNANLWRRLEREGLGQSLSGNFGKTKSAFSSLQNIPESLRRHSSLELGRGTQEGYPGGRPTCAVNTKAEDPGRKAAPDLGSHLDRQVSYPRPEGRTGASASFNSTDPSPEEPPAPSHPHTSSLGRRGPGPGSASALQGFQYGKPHCSVLEKVSKFEQREQGSQRPSVGGSGFGHNYRPHRTVSTSSTSGNDFEETKAHIRFSESAEPLGNGEQHFKNGELKLEEASRQPCGQQLSGGASDSGRGPQRPDARLLRSQSTFQLSSEPEREPEWRDRPGSPESPLLDAPFSRAYRNSIKDAQSRVLGATSFRRRDLELGAPVASRSWRPRPSSAHVGLRSPEASASASPHTPRERHSVTPAEGDLARPVPPAARRGARRRLTPEQKKRSYSEPEKMNEVGIVEEAEPAPLGPQRNGMRFPESSVADRRRLFERDGKACSTLSLSGPELKQFQQSALADYIQRKTGKRPTSAAGCSLQEPGPLRERAQSAYLQPGPAALEGSGLASASSLSSLREPSLQPRREATLLPATVAETQQAPRDRSSSFAGGRRLGERRRGDLLSGANGGTRGTQRGDETPREPSSWGARAGKSMSAEDLLERSDVLAGPVHVRSRSSPATADKRQDVLLGQDSGFGLVKDPCYLAGPGSRSLSCSERGQEEMLPLFHHLTPRWGGSGCKAIGDSSVPSECPGTLDHQRQASRTPCPRPPLAGTQGLVTDTRAAPLTPIGTPLPSAIPSGYCSQDGQTGRQPLPPYTPAMMHRSNGHTLTQPPGPRGCEGDGPEHGVEEGTRKRVSLPQWPPPSRAKWAHAAREDSLPEESSAPDFANLKHYQKQQSLPSLCSTSDPDTPLGAPSTPGRISLRISESVLRDSPPPHEDYEDEVFVRDPHPKATSSPTFEPLPPPPPPPPSQETPVYSMDDFPPPPPHTVCEAQLDSEDPEGPRPSFNKLSKVTIARERHMPGAAHVVGSQTLASRLQTSIKGSEAESTPPSFMSVHAQLAGSLGGQPAPIQTQSLSHDPVSGTQGLEKKVSPDPQKSSEDIRTEALAKEIVHQDKSLADILDPDSRLKTTMDLMEGLFPRDVNLLKENSVKRKAIQRTVSSSGCEGKRNEDKEAVSMLVNCPAYYSVSAPKAELLNKIKEMPAEVNEEEEQADVNEKKAELIGSLTHKLETLQEAKGSLLTDIKLNNALGEEVEALISELCKPNEFDKYRMFIGDLDKVVNLLLSLSGRLARVENVLSGLGEDASNEERSSLYEKRKILAGQHEDARELKENLDRRERVVLGILANYLSEEQLQDYQHFVKMKSTLLIEQRKLDDKIKLGQEQVKCLLESLPSDFIPKAGALALPPNLTSEPIPAGGCTFSGIFPTLTSPL"
The code below is doing a pairwise alignment of the proteins hShroom3 and mShroom3a and storing the result in a variable called align.h3.vs.m3a.
# add necessary function
library(Biostrings)
align.h3.vs.m3a <- pairwiseAlignment(
hShroom3,
mShroom3a)
This object shows the global pairwise alignment between hShroom3 and mShroom3a, and also gives the score of the alignment. You can see all of the indels.
align.h3.vs.m3a
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: MMRTTEDFHKPSATLN-SNTATKGRYIYLEAFLE...KAGALALPPNLTSEPIPAGGCTFSGIFPTLTSPL
## subject: MK-TPENLEEPSATPNPSRTPTE-RFVYLEALLE...KAGAISLPPALTGHATPGGTSVFGGVFPTLTSPL
## score: 2189.934
This code chunk just displays the score of the alignment without the alignment itself.
# add necessary function
Biostrings:: score(align.h3.vs.m3a)
## [1] 2189.934
This code chunk shows all the indels (insertions or deletions) in addition to the alignment score whereas the previous code chunk just shows the score of the alignment.
align.h3.vs.h2 <- Biostrings::pairwiseAlignment(
hShroom3,
hShroom2)
The score() function only shows the alignment score while in the previous code chunk, the output included the full pairwise alignment and you could see all of the indels.
score(align.h3.vs.h2)
## [1] -5673.853
pid() calculates the percent or proportion of sequence identity for a pairwise sequence alignment, and doesn’t include indels. The score is a quantitative, numerical value and not a proportion.
Biostrings::pid(align.h3.vs.h2)
## [1] 33.83277
The full table is included here so that you can reference the accession numbers and names for each protein.
shroom_table <- c("CAA78718" , "X. laevis Apx" , "xShroom1",
"NP_597713" , "H. sapiens APXL2" , "hShroom1",
"CAA58534" , "H. sapiens APXL", "hShroom2",
"ABD19518" , "M. musculus Apxl" , "mShroom2",
"AAF13269" , "M. musculus ShroomL" , "mShroom3a",
"AAF13270" , "M. musculus ShroomS" , "mShroom3b",
"NP_065910", "H. sapiens Shroom" , "hShroom3",
"ABD59319" , "X. laevis Shroom-like", "xShroom3",
"NP_065768", "H. sapiens KIAA1202" , "hShroom4a",
"AAK95579" , "H. sapiens SHAP-A" , "hShroom4b",
#"DQ435686" , "M. musculus KIAA1202" , "mShroom4",
"ABA81834" , "D. melanogaster Shroom", "dmShroom",
"EAA12598" , "A. gambiae Shroom", "agShroom",
"XP_392427" , "A. mellifera Shroom" , "amShroom",
"XP_783573" , "S. purpuratus Shroom" , "spShroom") #sea urchin
In this code chunk, the set of values contained in shroom_table are converted into a matrix, and then a data frame, which is then made easier to read by naming columns and creating more simplified species names.
# convert to matrix
shroom_table_matrix <- matrix(shroom_table,
byrow = T,
nrow = 14)
# taking the values contained into shroom_table and creating a matrix using them
# convert to data frame
shroom_table <- data.frame(shroom_table_matrix,
stringsAsFactors = F)
# converted the matrix into a data frame
# naming columns
names(shroom_table) <- c("accession","name.orig","name.new")
# assinging names to each column in the table
# Create simplified species names
shroom_table$spp <- "Homo"
shroom_table$spp[grep("laevis",shroom_table$name.orig)] <- "Xenopus"
shroom_table$spp[grep("musculus",shroom_table$name.orig)] <- "Mus"
shroom_table$spp[grep("melanogaster",shroom_table$name.orig)] <- "Drosophila"
shroom_table$spp[grep("gambiae",shroom_table$name.orig)] <- "mosquito"
shroom_table$spp[grep("mellifera",shroom_table$name.orig)] <- "bee"
shroom_table$spp[grep("purpuratus",shroom_table$name.orig)] <- "sea urchin"
# updating the names of the species to more simplified ones in the column spp
This line of code is displaying the data frame containing shroom_table_matrix. In the table, the columns are accession, name.orig, name.new, and spp, all of which are different names/classifications of each protein.
shroom_table
## accession name.orig name.new spp
## 1 CAA78718 X. laevis Apx xShroom1 Xenopus
## 2 NP_597713 H. sapiens APXL2 hShroom1 Homo
## 3 CAA58534 H. sapiens APXL hShroom2 Homo
## 4 ABD19518 M. musculus Apxl mShroom2 Mus
## 5 AAF13269 M. musculus ShroomL mShroom3a Mus
## 6 AAF13270 M. musculus ShroomS mShroom3b Mus
## 7 NP_065910 H. sapiens Shroom hShroom3 Homo
## 8 ABD59319 X. laevis Shroom-like xShroom3 Xenopus
## 9 NP_065768 H. sapiens KIAA1202 hShroom4a Homo
## 10 AAK95579 H. sapiens SHAP-A hShroom4b Homo
## 11 ABA81834 D. melanogaster Shroom dmShroom Drosophila
## 12 EAA12598 A. gambiae Shroom agShroom mosquito
## 13 XP_392427 A. mellifera Shroom amShroom bee
## 14 XP_783573 S. purpuratus Shroom spShroom sea urchin
The $ allows us to specifically index certain rows or columns of data.
shroom_table$accession
## [1] "CAA78718" "NP_597713" "CAA58534" "ABD19518" "AAF13269" "AAF13270"
## [7] "NP_065910" "ABD59319" "NP_065768" "AAK95579" "ABA81834" "EAA12598"
## [13] "XP_392427" "XP_783573"
This code chunk fetches the protein sequences from the shroom table by indexing the accession numbers.
# add necessary function
library(rentrez)
shrooms <- entrez_fetch(db = "protein",
id = shroom_table$accession,
rettype = "fasta")
This code chunk is displaying all of the protein sequences with newlines.
cat(shrooms)
This code chunk uses the entrez_fetch_list function and all of the shroom protein sequences are displayed separated by the accession number, while the previous code chunk using the entrez_fetch function just displayed all of the amino acid sequences in one large chunk, not separated.shr
shrooms_list <- entrez_fetch_list(db = "protein",
id = shroom_table$accession,
rettype = "fasta")
is(shrooms_list)
## [1] "list" "vector" "list_OR_List" "vector_OR_Vector"
## [5] "vector_OR_factor"
length(shrooms_list)
## [1] 14
nchar(shrooms_list)
## CAA78718 NP_597713 CAA58534 ABD19518 AAF13269 AAF13270 NP_065910 ABD59319
## 1486 915 1673 1543 2083 1895 2070 1864
## NP_065768 AAK95579 ABA81834 EAA12598 XP_392427 XP_783573
## 1560 778 1647 750 2230 1758
This line of code returns the number of components in shrooms_list. We use it to double check and make sure the number of components are what we intended for it to be.
length(shrooms_list)
## [1] 14
The for loop makes the code more efficient so that you only have to write one line of code in order to clean all the fasta files in shrooms_listshr.
for(i in 1:length(shrooms_list)){
shrooms_list[[i]] <- fasta_cleaner(shrooms_list[[i]], parse = F)
}
In summary, this code chunk is reassigning the values of shrooms_list to a new variable called shrooms_vector, since the fasta cleaner function converted the object shrooms_list into a readable vector.
# XXXXXXXXCX
shrooms_vector <- rep(NA, length(shrooms_list))
# displays each component of shrooms_list
# XXXXXXXXCX
for(i in 1:length(shrooms_vector)){
shrooms_vector[i] <- shrooms_list[[i]]
}
# for each component of shrooms_list the code reassings it to shrooms_vector
# XXXXXXXXCX
names(shrooms_vector) <- names(shrooms_list)
# reassigns the names of shrooms_list to shrooms_vector
TODO: explain what this is doing then add the necessary function.
# add necessary function
shrooms_vector_ss <- Biostrings::sort(shrooms_vector)
In this section, the code builds a multiple sequence alignment in order to compare multiple or more than 2 protein or nucleic acid sequences.
TODO: briefly explain what this chunk does, then add the necessary function.
# add necessary function
library(seqinr)
shrooms_align <- msa(shrooms_vector_ss,
type = "protein", method = "ClustalW")
## use default substitution matrix
TODO: briefly summarize what this section will do.
TODO: Briefly summarize what output is shown below
shrooms_align
## CLUSTAL 2.1
##
## Call:
## msa(shrooms_vector_ss, type = "protein", method = "ClustalW")
##
## MsaAAMultipleAlignment with 14 rows and 2252 columns
## aln names
## [1] -------------------------...SVFGGVFPTLTSPL----------- AAF13269
## [2] -------------------------...SVFGGVFPTLTSPL----------- AAF13270
## [3] -------------------------...CTFSGIFPTLTSPL----------- NP_065910
## [4] -------------------------...NKS--LPPPLTSSL----------- ABD59319
## [5] -------------------------...------------------------- ABD19518
## [6] -------------------------...------------------------- CAA58534
## [7] -------------------------...------------------------- NP_065768
## [8] -------------------------...------------------------- AAK95579
## [9] -------------------------...LT----------------------- NP_597713
## [10] -------------------------...------------------------- CAA78718
## [11] -------------------------...------------------------- EAA12598
## [12] -------------------------...------------------------- ABA81834
## [13] MTELQPSPPGYRVQDEAPGPPSCPP...------------------------- XP_392427
## [14] -------------------------...AATSSSSNGIGGPEQLNSNATSSYC XP_783573
## Con -------------------------...------------------------- Consensus
TODO: briefly explain what is being done in this chunk. This is tricky (and annoying) so do your best
# WHAT IS THE LINE BELOW DOING? (its tricky - do your best)
class(shrooms_align) <- "AAMultipleAlignment"
# WHAT IS THE LINE BELOW DOING? This is simpler
shrooms_align_seqinr <- msaConvert(shrooms_align, type = "seqinr::alignment")
TODO: what is the output this produces
print_msa(alignment = shrooms_align_seqinr,
chunksize = 60)
TODO: explain this output and how its differnet from the prevoius
## add necessary function
ggmsa::ggmsa(shrooms_align,
start = 2000,
end = 2100)
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
#### Saving an MSA as PDF
TODO: explain what this command is doing. Add the package the function is coming from using :: notation This may not work for everyone. If its not working you can comment it out.
# msa::msaPrettyPrint
# msaPrettyPrint(shrooms_align, # alignment
# file = "shroom_msa.pdf", # file name
# y=c(2000, 2100), # range
# askForOverwrite=FALSE)
The command getwd returns a file path showing the current R working directory.
getwd()
## [1] "/Users/dsesha02"