“Worked example: Building a phylogeny in R”

Introduction

Modern phylogenetic studies uses nucleotide sequences to classify evolutionary relationships between species. Phylogenetic Trees are created by aligning the sequences from the same/similar genes of multiple species. The tree is a visualization of an evolutionary hypothesis.

These kinds of studies provide information about what genes diverge, how species evolved, how evolution took place/what differed in the evolutionary pathway of species. Biologists can use phylogenetics to fuel larger theories about how evolution and genetics changes through time and between species.

Vocab

phylogenetics
sequence alignment
function (in R environment)
nucleotide sequence
amino acid sequence
gene
pairwise v multiple sequence alignment
class (in R environment)
container (in R environment)
set vs list vs vector (in R environment)

Key functions

rentrez::entrez_fetch
compbio4all::fasta_cleaner
ggmsa::ggmsa
Biostrings::PairwiseAlignment
Biostrings::pid

Software Preliminaires

If necessary, be sure to download the packages before trying to load them into memory:

for github packages: devtools::install_github(“devloper_name/package”) requires devtools to be installed via install.packages()

for CRAN packages: install.packages(“package_name”)

for Bioconductor: install.packages(“BioManager”) BioManager::install(“package_name”)

Load packages into memory

# github packages
library(compbio4all)


# CRAN packages
library(rentrez)
library(seqinr)
library(ape)

# Bioconductor packages
library(msa)
library(Biostrings)

Retrieving macromolecular sequences

This section will show how to transfer a sequence from the online database to the project you are working on. This section will make extensive use of the rentrez package.

entrez_fetch is a function that takes a gene given by the id number (accession number) from the given database, then gives the output as a fasta file. We are essentially retrieving the necessary information from the database into a form that we can work with in R.

# Human shroom 3 (H. sapiens)
hShroom3 <- rentrez::entrez_fetch(db = "protein", 
                          id = "NP_065910", 
                          rettype = "fasta")

The cat() function helps convert the output into a readable format. For example, in a fasta file the line breaks are denoted by the new line character (‘’) instead of creating a new line, the cat() function adds the line break when it reads that character.

cat(hShroom3)

## >NP_065910.3 protein Shroom3 [Homo sapiens]
## MMRTTEDFHKPSATLNSNTATKGRYIYLEAFLEGGAPWGFTLKGGLEHGEPLIISKVEEGGKADTLSSKL
## QAGDEVVHINEVTLSSSRKEAVSLVKGSYKTLRLVVRRDVCTDPGHADTGASNFVSPEHLTSGPQHRKAA
## WSGGVKLRLKHRRSEPAGRPHSWHTTKSGEKQPDASMMQISQGMIGPPWHQSYHSSSSTSDLSNYDHAYL
## RRSPDQCSSQGSMESLEPSGAYPPCHLSPAKSTGSIDQLSHFHNKRDSAYSSFSTSSSILEYPHPGISGR
## ERSGSMDNTSARGGLLEGMRQADIRYVKTVYDTRRGVSAEYEVNSSALLLQGREARASANGQGYDKWSNI
## PRGKGVPPPSWSQQCPSSLETATDNLPPKVGAPLPPARSDSYAAFRHRERPSSWSSLDQKRLCRPQANSL
## GSLKSPFIEEQLHTVLEKSPENSPPVKPKHNYTQKAQPGQPLLPTSIYPVPSLEPHFAQVPQPSVSSNGM
## LYPALAKESGYIAPQGACNKMATIDENGNQNGSGRPGFAFCQPLEHDLLSPVEKKPEATAKYVPSKVHFC
## SVPENEEDASLKRHLTPPQGNSPHSNERKSTHSNKPSSHPHSLKCPQAQAWQAGEDKRSSRLSEPWEGDF
## QEDHNANLWRRLEREGLGQSLSGNFGKTKSAFSSLQNIPESLRRHSSLELGRGTQEGYPGGRPTCAVNTK
## AEDPGRKAAPDLGSHLDRQVSYPRPEGRTGASASFNSTDPSPEEPPAPSHPHTSSLGRRGPGPGSASALQ
## GFQYGKPHCSVLEKVSKFEQREQGSQRPSVGGSGFGHNYRPHRTVSTSSTSGNDFEETKAHIRFSESAEP
## LGNGEQHFKNGELKLEEASRQPCGQQLSGGASDSGRGPQRPDARLLRSQSTFQLSSEPEREPEWRDRPGS
## PESPLLDAPFSRAYRNSIKDAQSRVLGATSFRRRDLELGAPVASRSWRPRPSSAHVGLRSPEASASASPH
## TPRERHSVTPAEGDLARPVPPAARRGARRRLTPEQKKRSYSEPEKMNEVGIVEEAEPAPLGPQRNGMRFP
## ESSVADRRRLFERDGKACSTLSLSGPELKQFQQSALADYIQRKTGKRPTSAAGCSLQEPGPLRERAQSAY
## LQPGPAALEGSGLASASSLSSLREPSLQPRREATLLPATVAETQQAPRDRSSSFAGGRRLGERRRGDLLS
## GANGGTRGTQRGDETPREPSSWGARAGKSMSAEDLLERSDVLAGPVHVRSRSSPATADKRQDVLLGQDSG
## FGLVKDPCYLAGPGSRSLSCSERGQEEMLPLFHHLTPRWGGSGCKAIGDSSVPSECPGTLDHQRQASRTP
## CPRPPLAGTQGLVTDTRAAPLTPIGTPLPSAIPSGYCSQDGQTGRQPLPPYTPAMMHRSNGHTLTQPPGP
## RGCEGDGPEHGVEEGTRKRVSLPQWPPPSRAKWAHAAREDSLPEESSAPDFANLKHYQKQQSLPSLCSTS
## DPDTPLGAPSTPGRISLRISESVLRDSPPPHEDYEDEVFVRDPHPKATSSPTFEPLPPPPPPPPSQETPV
## YSMDDFPPPPPHTVCEAQLDSEDPEGPRPSFNKLSKVTIARERHMPGAAHVVGSQTLASRLQTSIKGSEA
## ESTPPSFMSVHAQLAGSLGGQPAPIQTQSLSHDPVSGTQGLEKKVSPDPQKSSEDIRTEALAKEIVHQDK
## SLADILDPDSRLKTTMDLMEGLFPRDVNLLKENSVKRKAIQRTVSSSGCEGKRNEDKEAVSMLVNCPAYY
## SVSAPKAELLNKIKEMPAEVNEEEEQADVNEKKAELIGSLTHKLETLQEAKGSLLTDIKLNNALGEEVEA
## LISELCKPNEFDKYRMFIGDLDKVVNLLLSLSGRLARVENVLSGLGEDASNEERSSLYEKRKILAGQHED
## ARELKENLDRRERVVLGILANYLSEEQLQDYQHFVKMKSTLLIEQRKLDDKIKLGQEQVKCLLESLPSDF
## IPKAGALALPPNLTSEPIPAGGCTFSGIFPTLTSPL

This block is retrieving more versions of the shroom gene from the same database, in the same format, as the first gene.

# Mouse shroom 3a (M. musculus)
mShroom3a <- rentrez::entrez_fetch(db = "protein", 
                          id = "AAF13269", 
                          rettype = "fasta")

# Human shroom 2 (H. sapiens)
hShroom2 <- rentrez::entrez_fetch(db = "protein", 
                          id = "CAA58534", 
                          rettype = "fasta")


# Sea-urchin shroom
sShroom <- rentrez::entrez_fetch(db = "protein", 
                          id = "XP_783573", 
                          rettype = "fasta")

The nchar function counts the number of characters in the given vector. This call is on the raw data, so it will include all of the characters that are used to depict formatting (like new line). This function can be helpful to call before and after data modification to confirm the transformation did what was expected.

nchar(hShroom3)

## [1] 2070

nchar(mShroom3a)

## [1] 2083

nchar(sShroom)

## [1] 1758

nchar(hShroom2)

## [1] 1673

Prepping macromolecular sequences

data cleaning

The fasta_cleaner function is given in the compbio4all package and can be used to “clean up” a fasta file. In other words, this function takes a fasta file and removes the unnecessary characters and formatting to create a more readable, clean output.

fasta_cleaner

## function (fasta_object, parse = TRUE) 
## {
##     fasta_object <- sub("^(>)(.*?)(\\n)(.*)(\\n\\n)", "\\4", 
##         fasta_object)
##     fasta_object <- gsub("\n", "", fasta_object)
##     if (parse == TRUE) {
##         fasta_object <- stringr::str_split(fasta_object, pattern = "", 
##             simplify = FALSE)
##     }
##     return(fasta_object[[1]])
## }
## <bytecode: 0x7fd8ef9eb5b0>
## <environment: namespace:compbio4all>

If downloading the package a function comes from isn’t going smoothly, you can use the script of the function and import it by 1. saving the file and calling it before using it so that it is loaded into memory (above) or 2. copy the script into the project you are working on (below). In either circumstance it is CRITICAL that you credit the work you are using.

For example, fasta_cleaner is from the compbio4all package. If downloading the package wasn’t working, I could go to the page on github and find the function I want to use. Here, the link is https://github.com/brouwern/compbio4all/blob/master/R/fasta_cleaner.R. since it isn’t from a package downloaded in memory, I would put somewhere (as a comment in the code or where I loaded the other packages) who wrote the function and where to find it/what package it was supposed to be in.

fasta_cleaner <- function(fasta_object, parse = TRUE){

  fasta_object <- sub("^(>)(.*?)(\\n)(.*)(\\n\\n)","\\4",fasta_object)
  fasta_object <- gsub("\n", "", fasta_object)

  if(parse == TRUE){
    fasta_object <- stringr::str_split(fasta_object,
                                       pattern = "",
                                       simplify = FALSE)
  }

  return(fasta_object[[1]])
}

Here we are cleaning each fasta file downloaded from the protein database and overwritting the existing variable to be the version of the file that we cleaned. This simply means using the same variable name (hShroom3, for example) and assigning it to what is produced from the fasta_cleaner function.

hShroom3  <- fasta_cleaner(hShroom3,  parse = F)
mShroom3a <- fasta_cleaner(mShroom3a, parse = F)
hShroom2  <- fasta_cleaner(hShroom2,  parse = F)
sShroom   <- fasta_cleaner(sShroom,   parse = F)

hShroom3

## [1] "MMRTTEDFHKPSATLNSNTATKGRYIYLEAFLEGGAPWGFTLKGGLEHGEPLIISKVEEGGKADTLSSKLQAGDEVVHINEVTLSSSRKEAVSLVKGSYKTLRLVVRRDVCTDPGHADTGASNFVSPEHLTSGPQHRKAAWSGGVKLRLKHRRSEPAGRPHSWHTTKSGEKQPDASMMQISQGMIGPPWHQSYHSSSSTSDLSNYDHAYLRRSPDQCSSQGSMESLEPSGAYPPCHLSPAKSTGSIDQLSHFHNKRDSAYSSFSTSSSILEYPHPGISGRERSGSMDNTSARGGLLEGMRQADIRYVKTVYDTRRGVSAEYEVNSSALLLQGREARASANGQGYDKWSNIPRGKGVPPPSWSQQCPSSLETATDNLPPKVGAPLPPARSDSYAAFRHRERPSSWSSLDQKRLCRPQANSLGSLKSPFIEEQLHTVLEKSPENSPPVKPKHNYTQKAQPGQPLLPTSIYPVPSLEPHFAQVPQPSVSSNGMLYPALAKESGYIAPQGACNKMATIDENGNQNGSGRPGFAFCQPLEHDLLSPVEKKPEATAKYVPSKVHFCSVPENEEDASLKRHLTPPQGNSPHSNERKSTHSNKPSSHPHSLKCPQAQAWQAGEDKRSSRLSEPWEGDFQEDHNANLWRRLEREGLGQSLSGNFGKTKSAFSSLQNIPESLRRHSSLELGRGTQEGYPGGRPTCAVNTKAEDPGRKAAPDLGSHLDRQVSYPRPEGRTGASASFNSTDPSPEEPPAPSHPHTSSLGRRGPGPGSASALQGFQYGKPHCSVLEKVSKFEQREQGSQRPSVGGSGFGHNYRPHRTVSTSSTSGNDFEETKAHIRFSESAEPLGNGEQHFKNGELKLEEASRQPCGQQLSGGASDSGRGPQRPDARLLRSQSTFQLSSEPEREPEWRDRPGSPESPLLDAPFSRAYRNSIKDAQSRVLGATSFRRRDLELGAPVASRSWRPRPSSAHVGLRSPEASASASPHTPRERHSVTPAEGDLARPVPPAARRGARRRLTPEQKKRSYSEPEKMNEVGIVEEAEPAPLGPQRNGMRFPESSVADRRRLFERDGKACSTLSLSGPELKQFQQSALADYIQRKTGKRPTSAAGCSLQEPGPLRERAQSAYLQPGPAALEGSGLASASSLSSLREPSLQPRREATLLPATVAETQQAPRDRSSSFAGGRRLGERRRGDLLSGANGGTRGTQRGDETPREPSSWGARAGKSMSAEDLLERSDVLAGPVHVRSRSSPATADKRQDVLLGQDSGFGLVKDPCYLAGPGSRSLSCSERGQEEMLPLFHHLTPRWGGSGCKAIGDSSVPSECPGTLDHQRQASRTPCPRPPLAGTQGLVTDTRAAPLTPIGTPLPSAIPSGYCSQDGQTGRQPLPPYTPAMMHRSNGHTLTQPPGPRGCEGDGPEHGVEEGTRKRVSLPQWPPPSRAKWAHAAREDSLPEESSAPDFANLKHYQKQQSLPSLCSTSDPDTPLGAPSTPGRISLRISESVLRDSPPPHEDYEDEVFVRDPHPKATSSPTFEPLPPPPPPPPSQETPVYSMDDFPPPPPHTVCEAQLDSEDPEGPRPSFNKLSKVTIARERHMPGAAHVVGSQTLASRLQTSIKGSEAESTPPSFMSVHAQLAGSLGGQPAPIQTQSLSHDPVSGTQGLEKKVSPDPQKSSEDIRTEALAKEIVHQDKSLADILDPDSRLKTTMDLMEGLFPRDVNLLKENSVKRKAIQRTVSSSGCEGKRNEDKEAVSMLVNCPAYYSVSAPKAELLNKIKEMPAEVNEEEEQADVNEKKAELIGSLTHKLETLQEAKGSLLTDIKLNNALGEEVEALISELCKPNEFDKYRMFIGDLDKVVNLLLSLSGRLARVENVLSGLGEDASNEERSSLYEKRKILAGQHEDARELKENLDRRERVVLGILANYLSEEQLQDYQHFVKMKSTLLIEQRKLDDKIKLGQEQVKCLLESLPSDFIPKAGALALPPNLTSEPIPAGGCTFSGIFPTLTSPL"

Aligning (Pairwise) sequences

Below we are creating a pairwise alignment between the human shroom3 gene and the mouse shroom 3a gene. We are using this to compare the two sequences to find how similar or different they are from eachother.

align.h3.vs.m3a <- Biostrings::pairwiseAlignment(
                  hShroom3,
                  mShroom3a)

This object shows a summary of the alignment, including the two sequences and the score of the alignment.

align.h3.vs.m3a

## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: MMRTTEDFHKPSATLN-SNTATKGRYIYLEAFLE...KAGALALPPNLTSEPIPAGGCTFSGIFPTLTSPL
## subject: MK-TPENLEEPSATPNPSRTPTE-RFVYLEALLE...KAGAISLPPALTGHATPGGTSVFGGVFPTLTSPL
## score: 2189.934

pid is percent identity. This function gives a quantitative measure of how similar two sequences are. This function also ignores indels within the sequences. In this case, human shroom3 and mouse shroom3a are 70% similar.

Biostrings::pid(align.h3.vs.m3a)

## [1] 70.56511

This pairwise alignment is between two human shroom genes: 3 and 2.

align.h3.vs.h2 <- Biostrings::pairwiseAlignment(
                  hShroom3,
                  hShroom2)

The alignment score for two of the human shroom genes is printed below. The negative number suggests a low similarity. When comparing human3 to mouse3a, the score was positive, suggesting there was a significant amount of similarity.

score(align.h3.vs.h2)

## [1] -5673.853

score() and pid() are calculated differently. Indels are ignored in pid and not in scores. The score is calculated by assigning a point value to each nucleotide/amino acid in the alignment (+1 for same, -1 for indel, 0 if different). pid takes the proportion of the matching elements in the alignment over the number of the elements in the sequence. pid can be done in proportion or percent as well as in similarity or dissimilarity.

Biostrings::pid(align.h3.vs.h2)

## [1] 33.83277

The shroom family of genes

This section is creating a dataframe of the genes to be analyzed. The first step is making a table that will be converting into the dataframe, and manipulating the data in the structure to work for what we want to do with it.

The table below has the 14 shroom genes of interest. Having this vector/table here allows us to create a dataframe that will give us an easy to read display of what data we are looking for. The columns are split between: Accession Number, Species, and Gene Name.

shroom_table <- c("CAA78718" , "X. laevis Apx" ,         "xShroom1",
            "NP_597713" , "H. sapiens APXL2" ,     "hShroom1",
            "CAA58534" , "H. sapiens APXL",        "hShroom2",
            "ABD19518" , "M. musculus Apxl" ,      "mShroom2",
            "AAF13269" , "M. musculus ShroomL" ,   "mShroom3a",
            "AAF13270" , "M. musculus ShroomS" ,   "mShroom3b",
            "NP_065910", "H. sapiens Shroom" ,     "hShroom3",
            "ABD59319" , "X. laevis Shroom-like",  "xShroom3",
            "NP_065768", "H. sapiens KIAA1202" ,   "hShroom4a",
            "AAK95579" , "H. sapiens SHAP-A" ,     "hShroom4b",
            #"DQ435686" , "M. musculus KIAA1202" ,  "mShroom4",
            "ABA81834" , "D. melanogaster Shroom", "dmShroom",
            "EAA12598" , "A. gambiae Shroom",      "agShroom",
            "XP_392427" , "A. mellifera Shroom" ,  "amShroom",
            "XP_783573" , "S. purpuratus Shroom" , "spShroom") #sea urchin

The below chunk of code is what will create the dataframe:

# convert to a Matrix
shroom_table_matrix <- matrix(shroom_table,
                                  byrow = T,
                                  nrow = 14)
# convert to a DataFrame
shroom_table <- data.frame(shroom_table_matrix, 
                     stringsAsFactors = F)

# creating columns
names(shroom_table) <- c("accession", "name.orig","name.new")

# Create simplified species names
shroom_table$spp <- "Homo"
shroom_table$spp[grep("laevis",shroom_table$name.orig)] <- "Xenopus"
shroom_table$spp[grep("musculus",shroom_table$name.orig)] <- "Mus"
shroom_table$spp[grep("melanogaster",shroom_table$name.orig)] <- "Drosophila"
shroom_table$spp[grep("gambiae",shroom_table$name.orig)] <- "mosquito"
shroom_table$spp[grep("mellifera",shroom_table$name.orig)] <- "bee"
shroom_table$spp[grep("purpuratus",shroom_table$name.orig)] <- "sea urchin"

When calling the variable by itself, the console will print the object stored in that variable. Here we are printing the dataframe we created above:

shroom_table

##    accession              name.orig  name.new        spp
## 1   CAA78718          X. laevis Apx  xShroom1    Xenopus
## 2  NP_597713       H. sapiens APXL2  hShroom1       Homo
## 3   CAA58534        H. sapiens APXL  hShroom2       Homo
## 4   ABD19518       M. musculus Apxl  mShroom2        Mus
## 5   AAF13269    M. musculus ShroomL mShroom3a        Mus
## 6   AAF13270    M. musculus ShroomS mShroom3b        Mus
## 7  NP_065910      H. sapiens Shroom  hShroom3       Homo
## 8   ABD59319  X. laevis Shroom-like  xShroom3    Xenopus
## 9  NP_065768    H. sapiens KIAA1202 hShroom4a       Homo
## 10  AAK95579      H. sapiens SHAP-A hShroom4b       Homo
## 11  ABA81834 D. melanogaster Shroom  dmShroom Drosophila
## 12  EAA12598      A. gambiae Shroom  agShroom   mosquito
## 13 XP_392427    A. mellifera Shroom  amShroom        bee
## 14 XP_783573   S. purpuratus Shroom  spShroom sea urchin

Aligning multiple sequences

The ‘$’ notation in R allows you to access the entire column of a dataframe. Below we are taking the accession column from the shroom_table dataframe and printing its contents.

shroom_table$accession

##  [1] "CAA78718"  "NP_597713" "CAA58534"  "ABD19518"  "AAF13269"  "AAF13270" 
##  [7] "NP_065910" "ABD59319"  "NP_065768" "AAK95579"  "ABA81834"  "EAA12598" 
## [13] "XP_392427" "XP_783573"

Using the ‘$’ notation from above, we can transfer the fasta files from each gene in one line of code, just as we did with each individual one before.

shrooms <- rentrez::entrez_fetch(db = "protein", 
                          id = shroom_table$accession, 
                          rettype = "fasta")

As stated above, cat() respects the formatting characters of fasta files that R does not recognize as formatting. Below, we are taking all of the fasta files we transferred and making R format them as they were meant to be.

cat(shrooms)

entrez_fetch_list is a wrapper function in compbio4all that modifies a the function entrez_fetch from the rentrez package. The important distinction is this function creates a list containing each of the 14 genes. The other function used above creates one long file with all of the sequences, here they are separated.

shrooms_list <- entrez_fetch_list(db = "protein", 
                          id = shroom_table$accession, 
                          rettype = "fasta")

The length of shrooms_list should be 14, running the length() function allows you to check the above call and make sure the information you have is what you expect. The 14 elements are the individual fasta files from each gene we are interested in.

length(shrooms_list)

## [1] 14

The for loop below is taking each element in shrooms_list (each fasta file taken from the database) and using the fasta_cleaner function to reformat the file.

for(i in 1:length(shrooms_list)){
  shrooms_list[[i]] <- fasta_cleaner(shrooms_list[[i]], parse = F)
}

The below code chunk is using the list created above to make a vector with the same data and organization.

# create a vector that is the length of shrooms_list and is filled with NA values
shrooms_vector <- rep(NA, length(shrooms_list))

# copy each element in shrooms_list into the corresponding index of shrooms_vector
for(i in 1:length(shrooms_vector)){
  shrooms_vector[i] <- shrooms_list[[i]]
}

#  give the column titles of shrooms_list to shrooms_vector
names(shrooms_vector) <- names(shrooms_list)

AAStringSet class is a container for storing sets of Amino Acid String objects. The function AAStringSet() is used to create that class. This class makes manipulation of the objects easy and efficient.

shrooms_vector_ss <- Biostrings::AAStringSet(shrooms_vector)

MSA

An MSA is a Multiple Sequence Alignment. Similar to the PairWise Alignments we were doing before, MSAs compare and align more than just 2 sequences. In the following section we will create one alignment that includes all 14 genes of interest.

Building a Multiple Sequence Alignment (MSA)

The msa function creates the multiple sequence alignment. The output of this code, as we will see below, creates a table that is hard to read and lacking the visual we are looking for. There is a lot happening in this chunk even though it appears like a simple taks (only two arguments!): it will take a little time for it to run.

shrooms_align <- msa(shrooms_vector_ss,
                     method = "ClustalW")

## use default substitution matrix

Viewing an MSA

Viewing an MSA can be done in a couple ways. This section is going to explore how different outputs look and the type of data preperation required to make the exact visual we want.

Viewing an MSA in R

TODO: The default MSA print structure in R is shown below. The table is sparesly filled and does not appear to have an agreed upon consensus sequence. Overall, the information given here is not sufficient or useful. But there is a way to fix that!

shrooms_align

## CLUSTAL 2.1  
## 
## Call:
##    msa(shrooms_vector_ss, method = "ClustalW")
## 
## MsaAAMultipleAlignment with 14 rows and 2252 columns
##      aln                                                   names
##  [1] -------------------------...------------------------- NP_065768
##  [2] -------------------------...------------------------- AAK95579
##  [3] -------------------------...SVFGGVFPTLTSPL----------- AAF13269
##  [4] -------------------------...SVFGGVFPTLTSPL----------- AAF13270
##  [5] -------------------------...CTFSGIFPTLTSPL----------- NP_065910
##  [6] -------------------------...NKS--LPPPLTSSL----------- ABD59319
##  [7] -------------------------...------------------------- CAA58534
##  [8] -------------------------...------------------------- ABD19518
##  [9] -------------------------...LT----------------------- NP_597713
## [10] -------------------------...------------------------- CAA78718
## [11] -------------------------...------------------------- EAA12598
## [12] -------------------------...------------------------- ABA81834
## [13] MTELQPSPPGYRVQDEAPGPPSCPP...------------------------- XP_392427
## [14] -------------------------...AATSSSSNGIGGPEQLNSNATSSYC XP_783573
##  Con -------------------------...------------------------- Consensus

The following steps are part of the data preparation process. The first statement is changing the class of shrooms_align from MsaAAMultipleAlignment to AAMultipleAlignment. This is essentially changing what we did above back to just the alignment and not the MSA. The second statement is creating a new variable that is converting shrooms_align to a class of alignment.

# changing the class of the shrooms_align object to AAMultipleAlignment
class(shrooms_align) <- "AAMultipleAlignment"

# assigning a new variable the new version of shrooms_align msa
shrooms_align_seqinr <- msaConvert(shrooms_align, type = "seqinr::alignment")

Below is the new version of our MSA. This output is still very hard to look at and interpret, but it shows that progress is being made.

print_msa(alignment = shrooms_align_seqinr, 
          chunksize = 60)

Displaying an MSA XXXXXXXX

ggmsa is a function that creates the diagram, colorcoded version of an MSA that we were looking for. The same 100 amino acids for each gene is listed and colored according the the amino acids chemical property. The organization of this chart compared to the others we’ve made is far superior because of how easy it is to look at and understand. Even without a visable concensous sequence, we can easily figure out what it might be. The readablity of this graph is what we were looking for.

ggmsa::ggmsa(shrooms_align,   # shrooms_align, NOT shrooms_align_seqinr
      start = 2000, 
      end = 2100)

## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2

Saving an MSA as PDF

If this function were to work, it would take the above MSA we made and create/save a pdf version of it to the current directory you are working in.

Currently it is throwing a texi2dvi error, so it is commented out for now.

'''
msa::msaPrettyPrint(shrooms_align,             # alignment
               file = "shroom_msa.pdf",   # file name
               y=c(2000, 2100),           # range
               askForOverwrite=FALSE)
'''

getwd() prints the current directory you are in ie your “Working Directory”. This is also the path in which the file is saved on your computer.

getwd()

## [1] "/Users/kaylaneal/Downloads/compbio"

Bioinformatic Workflow

A complete bioinformatics workflow in R