Assignment: Your assignment is to use your notes from class - along with help from classmates, UTAs, and me - to turn this script into a fleshed-out description of what is going on.

This is a substantial project - we’ll work on it in steps over the rest of the unit.

We are currently focused on the overall process and will cover the details over the rest of this unit.

Your first assignment is to get this script to run from top to bottom by adding all of the missing R commands. Once you have done that, you can knit it into an HTML file and upload it to RPubs. (Note - you’ll need to add the YAML header!)

Your second assignment, which will be posted later, is to answer all the TODO and other prompts to add information. You can start on this, but you don’t have to do this on your first time through the code.

Delete all the prompts like TODO() as you complete them. Use RStudio’s search function to see if you’ve missed any - there are a LOT!

A complete bioinformatics workflow in R

By: Nathan L. Brouwer

“Worked example: Building a phylogeny in R”

Introduction

Describe how phylogenies can be used in biology (readings will be assigned)

Vocab

Make a list of at least 10 vocab terms that are important (don’t have to define): argument, function, list, named list, named vector, for() loop, R console

Key functions

Make a list of at least 5 key functions. Put them in the format package::function().

rentrez::entrez_fetch(), compbio4all::entrez_fetch_list(), compbio4all::print_msa(), msa::msaConvert(), seqinr::dist.alignment()

Software Preliminaries

Add the necessary calls to library() to load all packages. Indicate which packages came from Bioconductor, CRAN, and GitHub.

Load packages into memory

# github packages
# install.packages("devtools")
# devtools::install_github("brouwern/compbio4all")
# devtools::install_github("YuLab-SMU/ggmsa")



# CRAN packages
library(rentrez)
library(seqinr)
library(ape)



# Bioconductor packages
# install.packages("BiocManager")
library(BiocManager)
# BiocManager::install("Biostrings")
library(Biostrings)
# BiocManager::install("msa")
library(msa)
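
Before calling library(), it can help to check which packages are actually installed. The following is a minimal sketch in base R; the needed vector is only an illustrative subset of the packages used in this script, and the object names are made up for this example.

```r
# Packages to check (an illustrative subset; edit to match your script)
needed <- c("rentrez", "seqinr", "ape")

# requireNamespace() returns TRUE if a package can be loaded and FALSE if not,
# without attaching it the way library() does
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed
missing_pkgs <- needed[!installed]

print(installed)
print(missing_pkgs)
```

Any package listed in missing_pkgs would need to be installed from its source (CRAN, Bioconductor, or GitHub) before library() will work.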

Downloading macromolecular sequences

TODO: Fill in the XXXXXs and write a 1-2 sentence of what is going on here.
Add the package that is where entrez_fetch is from using :: notation

# Human shroom 3 (H. sapiens)
hShroom3 <- rentrez::entrez_fetch(db = "protein", 
                          id = "NP_065910", 
                          rettype = "fasta")

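TODO: explain what cat() is doing. The cat() function prints the FASTA record to the screen and turns each newline character (\n) into an actual line break, so the header and sequence appear on separate lines the way they would on the NCBI website, rather than as one long string.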

cat(hShroom3)

## >NP_065910.3 protein Shroom3 [Homo sapiens]
## MMRTTEDFHKPSATLNSNTATKGRYIYLEAFLEGGAPWGFTLKGGLEHGEPLIISKVEEGGKADTLSSKL
## QAGDEVVHINEVTLSSSRKEAVSLVKGSYKTLRLVVRRDVCTDPGHADTGASNFVSPEHLTSGPQHRKAA
## WSGGVKLRLKHRRSEPAGRPHSWHTTKSGEKQPDASMMQISQGMIGPPWHQSYHSSSSTSDLSNYDHAYL
## RRSPDQCSSQGSMESLEPSGAYPPCHLSPAKSTGSIDQLSHFHNKRDSAYSSFSTSSSILEYPHPGISGR
## ERSGSMDNTSARGGLLEGMRQADIRYVKTVYDTRRGVSAEYEVNSSALLLQGREARASANGQGYDKWSNI
## PRGKGVPPPSWSQQCPSSLETATDNLPPKVGAPLPPARSDSYAAFRHRERPSSWSSLDQKRLCRPQANSL
## GSLKSPFIEEQLHTVLEKSPENSPPVKPKHNYTQKAQPGQPLLPTSIYPVPSLEPHFAQVPQPSVSSNGM
## LYPALAKESGYIAPQGACNKMATIDENGNQNGSGRPGFAFCQPLEHDLLSPVEKKPEATAKYVPSKVHFC
## SVPENEEDASLKRHLTPPQGNSPHSNERKSTHSNKPSSHPHSLKCPQAQAWQAGEDKRSSRLSEPWEGDF
## QEDHNANLWRRLEREGLGQSLSGNFGKTKSAFSSLQNIPESLRRHSSLELGRGTQEGYPGGRPTCAVNTK
## AEDPGRKAAPDLGSHLDRQVSYPRPEGRTGASASFNSTDPSPEEPPAPSHPHTSSLGRRGPGPGSASALQ
## GFQYGKPHCSVLEKVSKFEQREQGSQRPSVGGSGFGHNYRPHRTVSTSSTSGNDFEETKAHIRFSESAEP
## LGNGEQHFKNGELKLEEASRQPCGQQLSGGASDSGRGPQRPDARLLRSQSTFQLSSEPEREPEWRDRPGS
## PESPLLDAPFSRAYRNSIKDAQSRVLGATSFRRRDLELGAPVASRSWRPRPSSAHVGLRSPEASASASPH
## TPRERHSVTPAEGDLARPVPPAARRGARRRLTPEQKKRSYSEPEKMNEVGIVEEAEPAPLGPQRNGMRFP
## ESSVADRRRLFERDGKACSTLSLSGPELKQFQQSALADYIQRKTGKRPTSAAGCSLQEPGPLRERAQSAY
## LQPGPAALEGSGLASASSLSSLREPSLQPRREATLLPATVAETQQAPRDRSSSFAGGRRLGERRRGDLLS
## GANGGTRGTQRGDETPREPSSWGARAGKSMSAEDLLERSDVLAGPVHVRSRSSPATADKRQDVLLGQDSG
## FGLVKDPCYLAGPGSRSLSCSERGQEEMLPLFHHLTPRWGGSGCKAIGDSSVPSECPGTLDHQRQASRTP
## CPRPPLAGTQGLVTDTRAAPLTPIGTPLPSAIPSGYCSQDGQTGRQPLPPYTPAMMHRSNGHTLTQPPGP
## RGCEGDGPEHGVEEGTRKRVSLPQWPPPSRAKWAHAAREDSLPEESSAPDFANLKHYQKQQSLPSLCSTS
## DPDTPLGAPSTPGRISLRISESVLRDSPPPHEDYEDEVFVRDPHPKATSSPTFEPLPPPPPPPPSQETPV
## YSMDDFPPPPPHTVCEAQLDSEDPEGPRPSFNKLSKVTIARERHMPGAAHVVGSQTLASRLQTSIKGSEA
## ESTPPSFMSVHAQLAGSLGGQPAPIQTQSLSHDPVSGTQGLEKKVSPDPQKSSEDIRTEALAKEIVHQDK
## SLADILDPDSRLKTTMDLMEGLFPRDVNLLKENSVKRKAIQRTVSSSGCEGKRNEDKEAVSMLVNCPAYY
## SVSAPKAELLNKIKEMPAEVNEEEEQADVNEKKAELIGSLTHKLETLQEAKGSLLTDIKLNNALGEEVEA
## LISELCKPNEFDKYRMFIGDLDKVVNLLLSLSGRLARVENVLSGLGEDASNEERSSLYEKRKILAGQHED
## ARELKENLDRRERVVLGILANYLSEEQLQDYQHFVKMKSTLLIEQRKLDDKIKLGQEQVKCLLESLPSDF
## IPKAGALALPPNLTSEPIPAGGCTFSGIFPTLTSPL
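
To see why cat() is used here instead of just printing the object, the sketch below compares the two on a tiny FASTA-style string. The record toy_fasta is made up for illustration and is not a real sequence.

```r
# A tiny, made-up FASTA record (hypothetical; not a real protein)
toy_fasta <- ">toy_1 example protein\nMKTAYIAK\nQRQISFVK\n\n"

# print() shows one quoted string with the \n escape codes still visible
print(toy_fasta)

# cat() interprets each \n as a line break
cat(toy_fasta)

# Capture what cat() writes, one element per printed line
cat_lines <- capture.output(cat(toy_fasta))
```

The first three elements of cat_lines are the header and the two sequence lines, which is the same layout cat(hShroom3) produces for the real record.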

TODO: explain what this code chunk is doing. It downloads three more amino acid sequences from the NCBI Entrez protein database. The id = argument is the accession number of the sequence, db = is the Entrez database to search, and rettype = is the file format we want the function to return.

# Mouse shroom 3a (M. musculus)
mShroom3a <- rentrez::entrez_fetch(db = "protein", 
                          id = "AAF13269", 
                          rettype = "fasta")

# Human shroom 2 (H. sapiens)
hShroom2 <- rentrez::entrez_fetch(db = "protein", 
                          id = "CAA58534", 
                          rettype = "fasta")


# Sea-urchin shroom
sShroom <- rentrez::entrez_fetch(db = "protein", 
                          id = "XP_783573", 
                          rettype = "fasta")

TODO: Explain what this code chunk is doing. This code chunk counts the number of characters in each downloaded FASTA record. Because the sequences have not been cleaned yet, each count includes the header line and the newline characters as well as the amino acids, so it is somewhat larger than the true length of the protein.

nchar(hShroom3)

## [1] 2070

nchar(mShroom3a)

## [1] 2083

nchar(sShroom)

## [1] 1758

nchar(hShroom2)

## [1] 1673
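
The sketch below illustrates why nchar() on a raw FASTA record overcounts the sequence length. It uses a made-up record (toy_fasta) and two base R substitutions to strip the header and newlines; this is a simplified illustration, not the fasta_cleaner() function used later.

```r
# A tiny, made-up FASTA record (hypothetical; not a real protein)
toy_fasta <- ">toy_1 example protein\nMKTAYIAK\nQRQISFVK\n\n"

# Count every character: header, newlines, and amino acids
raw_length <- nchar(toy_fasta)

# Remove the header line, then remove all remaining newline characters
no_header <- sub("^>[^\n]*\n", "", toy_fasta)
seq_only  <- gsub("\n", "", no_header)

# Count only the amino acids
seq_length <- nchar(seq_only)

print(raw_length)
print(seq_length)
```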

Prepping macromolecular sequences

TODO: Explain what this function does. The fasta_cleaner() function takes a FASTA record and puts it into an appropriate format for sequence alignment. It removes all of the non-sequence data, such as the header line and the newline characters.

library(compbio4all)
fasta_cleaner

## function (fasta_object, parse = TRUE) 
## {
##     fasta_object <- sub("^(>)(.*?)(\\n)(.*)(\\n\\n)", "\\4", 
##         fasta_object)
##     fasta_object <- gsub("\n", "", fasta_object)
##     if (parse == TRUE) {
##         fasta_object <- stringr::str_split(fasta_object, pattern = "", 
##             simplify = FALSE)
##     }
##     return(fasta_object[[1]])
## }
## <bytecode: 0x7fd28c597a68>
## <environment: namespace:compbio4all>

TODO: explain how to add the function to your R session even if you can’t download compbio4all. You can define the function yourself by running the code below: you assign it to a name using <- function(){}, list the arguments inside the parentheses, and put the code instructions for the function inside the curly brackets.

fasta_cleaner <- function(fasta_object, parse = TRUE){

  fasta_object <- sub("^(>)(.*?)(\\n)(.*)(\\n\\n)","\\4",fasta_object)
  fasta_object <- gsub("\n", "", fasta_object)

  if(parse == TRUE){
    fasta_object <- stringr::str_split(fasta_object,
                                       pattern = "",
                                       simplify = FALSE)
  }

  return(fasta_object[[1]])
}
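
To show what the parse argument changes, here is a simplified base R stand-in for the cleaner. The name toy_cleaner and the record toy_fasta are made up for this sketch, and it uses strsplit() rather than stringr, so it is an illustration of the idea and not the compbio4all implementation.

```r
# Simplified stand-in for fasta_cleaner() (hypothetical helper, base R only)
toy_cleaner <- function(fasta_object, parse = TRUE) {
  # drop the header line (everything from ">" to the first newline)
  fasta_object <- sub("^>[^\n]*\n", "", fasta_object)
  # drop all remaining newline characters
  fasta_object <- gsub("\n", "", fasta_object)
  if (parse) {
    # split the sequence into a vector of single letters
    fasta_object <- strsplit(fasta_object, "")[[1]]
  }
  fasta_object
}

# A tiny, made-up FASTA record (not a real protein)
toy_fasta <- ">toy_1 example protein\nMKTAYIAK\nQRQISFVK\n\n"

clean_string <- toy_cleaner(toy_fasta, parse = FALSE)  # one long string
clean_vector <- toy_cleaner(toy_fasta, parse = TRUE)   # one letter per element

print(clean_string)
print(clean_vector)
```

With parse = FALSE the result is a single character string, which is the form used for alignment in this script; with parse = TRUE each amino acid becomes its own element.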

TODO: briefly explain what this code chunk is doing. fasta_cleaner() is applied to each of the four sequences to strip the header and newline characters. Setting parse = F keeps each sequence as a single character string rather than splitting it into individual letters.

hShroom3  <- fasta_cleaner(hShroom3,  parse = F)
mShroom3a <- fasta_cleaner(mShroom3a, parse = F)
hShroom2  <- fasta_cleaner(hShroom2,  parse = F)
sShroom   <- fasta_cleaner(sShroom,   parse = F)

hShroom3

## [1] "MMRTTEDFHKPSATLNSNTATKGRYIYLEAFLEGGAPWGFTLKGGLEHGEPLIISKVEEGGKADTLSSKLQAGDEVVHINEVTLSSSRKEAVSLVKGSYKTLRLVVRRDVCTDPGHADTGASNFVSPEHLTSGPQHRKAAWSGGVKLRLKHRRSEPAGRPHSWHTTKSGEKQPDASMMQISQGMIGPPWHQSYHSSSSTSDLSNYDHAYLRRSPDQCSSQGSMESLEPSGAYPPCHLSPAKSTGSIDQLSHFHNKRDSAYSSFSTSSSILEYPHPGISGRERSGSMDNTSARGGLLEGMRQADIRYVKTVYDTRRGVSAEYEVNSSALLLQGREARASANGQGYDKWSNIPRGKGVPPPSWSQQCPSSLETATDNLPPKVGAPLPPARSDSYAAFRHRERPSSWSSLDQKRLCRPQANSLGSLKSPFIEEQLHTVLEKSPENSPPVKPKHNYTQKAQPGQPLLPTSIYPVPSLEPHFAQVPQPSVSSNGMLYPALAKESGYIAPQGACNKMATIDENGNQNGSGRPGFAFCQPLEHDLLSPVEKKPEATAKYVPSKVHFCSVPENEEDASLKRHLTPPQGNSPHSNERKSTHSNKPSSHPHSLKCPQAQAWQAGEDKRSSRLSEPWEGDFQEDHNANLWRRLEREGLGQSLSGNFGKTKSAFSSLQNIPESLRRHSSLELGRGTQEGYPGGRPTCAVNTKAEDPGRKAAPDLGSHLDRQVSYPRPEGRTGASASFNSTDPSPEEPPAPSHPHTSSLGRRGPGPGSASALQGFQYGKPHCSVLEKVSKFEQREQGSQRPSVGGSGFGHNYRPHRTVSTSSTSGNDFEETKAHIRFSESAEPLGNGEQHFKNGELKLEEASRQPCGQQLSGGASDSGRGPQRPDARLLRSQSTFQLSSEPEREPEWRDRPGSPESPLLDAPFSRAYRNSIKDAQSRVLGATSFRRRDLELGAPVASRSWRPRPSSAHVGLRSPEASASASPHTPRERHSVTPAEGDLARPVPPAARRGARRRLTPEQKKRSYSEPEKMNEVGIVEEAEPAPLGPQRNGMRFPESSVADRRRLFERDGKACSTLSLSGPELKQFQQSALADYIQRKTGKRPTSAAGCSLQEPGPLRERAQSAYLQPGPAALEGSGLASASSLSSLREPSLQPRREATLLPATVAETQQAPRDRSSSFAGGRRLGERRRGDLLSGANGGTRGTQRGDETPREPSSWGARAGKSMSAEDLLERSDVLAGPVHVRSRSSPATADKRQDVLLGQDSGFGLVKDPCYLAGPGSRSLSCSERGQEEMLPLFHHLTPRWGGSGCKAIGDSSVPSECPGTLDHQRQASRTPCPRPPLAGTQGLVTDTRAAPLTPIGTPLPSAIPSGYCSQDGQTGRQPLPPYTPAMMHRSNGHTLTQPPGPRGCEGDGPEHGVEEGTRKRVSLPQWPPPSRAKWAHAAREDSLPEESSAPDFANLKHYQKQQSLPSLCSTSDPDTPLGAPSTPGRISLRISESVLRDSPPPHEDYEDEVFVRDPHPKATSSPTFEPLPPPPPPPPSQETPVYSMDDFPPPPPHTVCEAQLDSEDPEGPRPSFNKLSKVTIARERHMPGAAHVVGSQTLASRLQTSIKGSEAESTPPSFMSVHAQLAGSLGGQPAPIQTQSLSHDPVSGTQGLEKKVSPDPQKSSEDIRTEALAKEIVHQDKSLADILDPDSRLKTTMDLMEGLFPRDVNLLKENSVKRKAIQRTVSSSGCEGKRNEDKEAVSMLVNCPAYYSVSAPKAELLNKIKEMPAEVNEEEEQADVNEKKAELIGSLTHKLETLQEAKGSLLTDIKLNNALGEEVEALISELCKPNEFDKYRMFIGDLDKVVNLLLSLSGRLARVENVLSGLGEDASNEERSSLYEKRKILAGQHEDARELKENLDRRERVVLGILANYLSEEQLQDYQHFVKMKSTLLIEQRKLDDKIKLGQEQVKCLLESLPSDFIPKAGALALPPNLTSEPIPAGGCTFSGIFPTL
TSPL"

Aligning sequences

TODO: give this a title. Explain what the code below is doing. This code chunk takes hShroom3 and mShroom3a and finds the best global alignment between them using the pairwiseAlignment() function.

library(Biostrings)

align.h3.vs.m3a <- Biostrings::pairwiseAlignment(
                  hShroom3,
                  mShroom3a)

TODO: In 1-2 sentences explain what this object shows. This object is the pairwise alignment of hShroom3 and mShroom3a. It shows the two sequences lined up, with dashes marking gaps, along with a score that reflects how well they align.

align.h3.vs.m3a

## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: MMRTTEDFHKPSATLN-SNTATKGRYIYLEAFLE...KAGALALPPNLTSEPIPAGGCTFSGIFPTLTSPL
## subject: MK-TPENLEEPSATPNPSRTPTE-RFVYLEALLE...KAGAISLPPALTGHATPGGTSVFGGVFPTLTSPL
## score: 2189.934

TODO: explain what this is showing. The pid() function calculates the percent identity, a rough measure of similarity between the two aligned sequences. Here it indicates that Shroom3 from humans and Shroom3 from mice are about 71% identical.

# add necessary function
Biostrings::pid(align.h3.vs.m3a)

## [1] 70.56511
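
As a rough illustration of what a percent identity means, the sketch below computes one by hand for two short, made-up sequences that are already aligned (dashes are gaps). It uses the full alignment length as the denominator, which is only one of several possible definitions; Biostrings::pid() has a type argument that selects among them, so this is not necessarily the exact calculation behind the number above.

```r
# Two made-up sequences, already aligned by hand (hypothetical data)
aligned_a <- "MKT-AYIAK"
aligned_b <- "MKSPAY-AK"

# Split each aligned sequence into single characters
a_chars <- strsplit(aligned_a, "")[[1]]
b_chars <- strsplit(aligned_b, "")[[1]]

# A position is identical if both letters match and it is not a gap
is_identical <- a_chars == b_chars & a_chars != "-"
n_identical  <- sum(is_identical)

# Percent identity, using alignment length as the denominator (an assumption)
pid_toy <- 100 * n_identical / length(a_chars)

print(n_identical)
print(pid_toy)
```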

TODO: briefly explain what is going on here versus the previous code chunk This code chunk is aligning sequences as previously done, not calculating percentage identity just yet. Also, instead of comparing hShroom3 with mShroom3a, there is now a comparison between hShroom3 and hShroom2.

align.h3.vs.h2 <- Biostrings::pairwiseAlignment(
                  hShroom3,
                  hShroom2)

TODO: explain what is going on here and compare and contrast with previous output. The score() function differs from pid(): it extracts the alignment score directly, without the other information printed with the alignment. Both score() and pid() assess how well the two sequences align, however.

score(align.h3.vs.h2)

## [1] -5673.853

TODO: briefly explain the difference between the output of score() and pid() (can be very brief - we’ll get into the details later). score() returns a much lower number than pid(), and here it is even negative. score() is a raw alignment score, which mismatches and gaps can pull below zero, whereas pid() is a percentage. The difference may also be partly because of how indels are handled when calculating pid().

Biostrings::pid(align.h3.vs.h2)

## [1] 33.83277
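
The sketch below shows how an alignment score can be negative while the percent identity is still positive. It scores two made-up, hand-aligned sequences with a deliberately simple scheme (match +1, mismatch -1, gap -2). These values are assumptions chosen for illustration; they are not the substitution matrix or gap penalties that pairwiseAlignment() uses.

```r
# Two made-up sequences, already aligned by hand (hypothetical data)
aligned_a <- "MKT--AYIAK"
aligned_b <- "MR-SPAY--K"

a_chars <- strsplit(aligned_a, "")[[1]]
b_chars <- strsplit(aligned_b, "")[[1]]

# Classify every column of the alignment
is_gap      <- a_chars == "-" | b_chars == "-"
is_match    <- !is_gap & a_chars == b_chars
is_mismatch <- !is_gap & a_chars != b_chars

# Toy scoring scheme (assumed values, for illustration only)
match_score    <-  1
mismatch_score <- -1
gap_score      <- -2

score_toy <- sum(is_match) * match_score +
  sum(is_mismatch) * mismatch_score +
  sum(is_gap) * gap_score

# Percent identity over the full alignment length
pid_toy <- 100 * sum(is_match) / length(a_chars)

print(score_toy)
print(pid_toy)
```

Gaps are penalized in the score but simply count as non-identical positions in the percentage, so a gappy alignment can have a negative score and a positive percent identity at the same time.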

The shroom family of genes

TODO: briefly explain why I have this whole table here. This table lists the accession number, original name, and new name for each Shroom sequence we will work with, so that they can all be downloaded and labeled together.

shroom_table <- c("CAA78718" , "X. laevis Apx" ,         "xShroom1",
            "NP_597713" , "H. sapiens APXL2" ,     "hShroom1",
            "CAA58534" , "H. sapiens APXL",        "hShroom2",
            "ABD19518" , "M. musculus Apxl" ,      "mShroom2",
            "AAF13269" , "M. musculus ShroomL" ,   "mShroom3a",
            "AAF13270" , "M. musculus ShroomS" ,   "mShroom3b",
            "NP_065910", "H. sapiens Shroom" ,     "hShroom3",
            "ABD59319" , "X. laevis Shroom-like",  "xShroom3",
            "NP_065768", "H. sapiens KIAA1202" ,   "hShroom4a",
            "AAK95579" , "H. sapiens SHAP-A" ,     "hShroom4b",
            #"DQ435686" , "M. musculus KIAA1202" ,  "mShroom4",
            "ABA81834" , "D. melanogaster Shroom", "dmShroom",
            "EAA12598" , "A. gambiae Shroom",      "agShroom",
            "XP_392427" , "A. mellifera Shroom" ,  "amShroom",
            "XP_783573" , "S. purpuratus Shroom" , "spShroom") #sea urchin

TODO: write a short sentence explaining what this next code chunk will do, then annotate each line with what was done.

This code chunk turns the vector of shroom data into a data frame (table) with named columns and adds a column of simplified species names.

# convert to matrix
shroom_table_matrix <- matrix(shroom_table,     # creating a matrix from shroom data
                                  byrow = T,    # logical. Matrix filled by rows
                                  nrow = 14)    # number of rows is 14
# convert to data frame
shroom_table <- data.frame(shroom_table_matrix, #creating data frame 
                     stringsAsFactors = F)      

# name columns
names(shroom_table) <- c("accession", "name.orig","name.new") #naming the tables columns
 
# Create simplified species names
shroom_table$spp <- "Homo" # default every row to "Homo"; rows for other species are overwritten below
shroom_table$spp[grep("laevis",shroom_table$name.orig)] <- "Xenopus" # simplifying "laevis" to "Xenopus"
shroom_table$spp[grep("musculus",shroom_table$name.orig)] <- "Mus" #simplifying "musculus" to "Mus"
shroom_table$spp[grep("melanogaster",shroom_table$name.orig)] <- "Drosophila" #simplifying "melanogaster" to "Drosophila"
shroom_table$spp[grep("gambiae",shroom_table$name.orig)] <- "mosquito" #simplifying "gambiae" to "mosquito"
shroom_table$spp[grep("mellifera",shroom_table$name.orig)] <- "bee" #simplifying "mellifera" to "bee"
shroom_table$spp[grep("purpuratus",shroom_table$name.orig)] <- "sea urchin"# simplifying "purpuratus" to "sea urchin"

TODO: in a brief sentence explain what this is doing. It prints the table we have just created with the code chunk above.

shroom_table

##    accession              name.orig  name.new        spp
## 1   CAA78718          X. laevis Apx  xShroom1    Xenopus
## 2  NP_597713       H. sapiens APXL2  hShroom1       Homo
## 3   CAA58534        H. sapiens APXL  hShroom2       Homo
## 4   ABD19518       M. musculus Apxl  mShroom2        Mus
## 5   AAF13269    M. musculus ShroomL mShroom3a        Mus
## 6   AAF13270    M. musculus ShroomS mShroom3b        Mus
## 7  NP_065910      H. sapiens Shroom  hShroom3       Homo
## 8   ABD59319  X. laevis Shroom-like  xShroom3    Xenopus
## 9  NP_065768    H. sapiens KIAA1202 hShroom4a       Homo
## 10  AAK95579      H. sapiens SHAP-A hShroom4b       Homo
## 11  ABA81834 D. melanogaster Shroom  dmShroom Drosophila
## 12  EAA12598      A. gambiae Shroom  agShroom   mosquito
## 13 XP_392427    A. mellifera Shroom  amShroom        bee
## 14 XP_783573   S. purpuratus Shroom  spShroom sea urchin
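
The same vector-to-matrix-to-data-frame steps can be tried on a much smaller example. The sketch below uses two made-up rows (the IDs and gene names are hypothetical) so that each step is easy to inspect.

```r
# A flat vector holding two rows of three values each (made-up data)
toy_flat <- c("ID1", "H. sapiens geneA",  "hGeneA",
              "ID2", "M. musculus geneA", "mGeneA")

# Fill a matrix row by row, three columns per row
toy_matrix <- matrix(toy_flat, byrow = TRUE, ncol = 3)

# Convert to a data frame and name the columns
toy_table <- data.frame(toy_matrix, stringsAsFactors = FALSE)
names(toy_table) <- c("accession", "name.orig", "name.new")

# Default species, then overwrite rows whose original name matches a pattern
toy_table$spp <- "Homo"
toy_table$spp[grep("musculus", toy_table$name.orig)] <- "Mus"

print(toy_table)
```

grep() returns the row numbers where the pattern is found, and those numbers are then used inside the square brackets to pick which rows to overwrite.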

Downloading multiple sequences

TODO: in a brief sentence explain what the $ allows us to do. The $ operator lets us extract a single column from a data frame by name; here it pulls out the accession column so we can see all of the accession numbers in the shroom table.

shroom_table$accession

##  [1] "CAA78718"  "NP_597713" "CAA58534"  "ABD19518"  "AAF13269"  "AAF13270" 
##  [7] "NP_065910" "ABD59319"  "NP_065768" "AAK95579"  "ABA81834"  "EAA12598" 
## [13] "XP_392427" "XP_783573"

TODO: briefly explain what this chunk is doing and add the correct function. This passes the whole set of accession numbers to entrez_fetch(), which downloads all of the sequences in FASTA format and stores them together in an object named shrooms.

# add necessary function
shrooms <- rentrez::entrez_fetch         (db = "protein", 
                          id = shroom_table$accession, 
                          rettype = "fasta")

TODO: in a very brief sentence explain what this is doing. This prints all of the downloaded sequences, each with its header line and accession number, formatted as they would appear in a text editor.

cat(shrooms)

TODO: in a brief sentence explain what this is doing and if/how its different from the previous code chunks

entrez_fetch_list() is a wrapper function: it takes entrez_fetch() and modifies its behavior by writing code around it. It puts the output of entrez_fetch() into an R data structure called a list, with one element per sequence, unlike the previous code chunk, which returned all of the sequences together in one long character string.

library(rentrez)
shrooms_list<- compbio4all::entrez_fetch_list(db = "protein", 
                          id = shroom_table$accession, 
                          rettype = "fasta")

is(shrooms_list)

## [1] "list"             "vector"           "list_OR_List"     "vector_OR_Vector"
## [5] "vector_OR_factor"

length(shrooms_list)

## [1] 14

nchar(shrooms_list)

##  CAA78718 NP_597713  CAA58534  ABD19518  AAF13269  AAF13270 NP_065910  ABD59319 
##      1486       915      1673      1543      2083      1895      2070      1864 
## NP_065768  AAK95579  ABA81834  EAA12598 XP_392427 XP_783573 
##      1560       778      1647       750      2230      1758

entrez_fetch_list() from compbio4all calls entrez_fetch() from rentrez, so both packages are needed here.

TODO: briefly explain what I am doing this. length() gives the number of elements in the list. In this case, that is the number of protein sequences.

length(shrooms_list)

## [1] 14
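
The difference between counting elements and counting characters is easy to check on a small named list. The list below is made up for illustration; its names stand in for accession numbers.

```r
# A small named list of made-up sequences (hypothetical data)
toy_list <- list(ID1 = "MKT", ID2 = "MKTAY")

# length() counts the elements of the list (how many sequences)
n_sequences <- length(toy_list)

# nchar() applied to each element counts the characters in each sequence
n_characters <- sapply(toy_list, nchar)

print(n_sequences)
print(n_characters)
```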

TODO: briefly explain what I am doing this. We will get into the details of for() loops in R later in the semester. This code chunk loops over the list and cleans each sequence in turn with fasta_cleaner().

for(i in 1:length(shrooms_list)){
  shrooms_list[[i]] <- fasta_cleaner(shrooms_list[[i]], parse = F)
}

TODO: summarize what is going on in this code chunk, then annotate each line of code with what its doing. We need to take each of the sequences from our list and put it into a vector, in particular a named vector.

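# create an empty vector (filled with NA) with one slot per sequence
shrooms_vector <- rep(NA, length(shrooms_list))

# copy each sequence from the list into the matching slot of the vector
for(i in 1:length(shrooms_vector)){
  shrooms_vector[i] <- shrooms_list[[i]]
}

# name each element of the vector with its accession number
names(shrooms_vector) <- names(shrooms_list)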
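
The list-to-named-vector conversion can be checked on a small example. The sketch below repeats the same three steps on a made-up list and, for comparison, shows that base R's unlist() gives the same named vector in one call when every list element is a single string.

```r
# A small named list of made-up sequences (hypothetical data)
toy_list <- list(ID1 = "MKT", ID2 = "MKTAY", ID3 = "MK")

# Step 1: empty vector with one slot per list element
toy_vector <- rep(NA, length(toy_list))

# Step 2: copy each list element into the vector
for (i in seq_along(toy_list)) {
  toy_vector[i] <- toy_list[[i]]
}

# Step 3: carry the names over from the list
names(toy_vector) <- names(toy_list)

# Alternative: unlist() flattens the list and keeps the names
toy_vector_2 <- unlist(toy_list)

print(toy_vector)
print(toy_vector_2)
```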

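TODO: explain what this is doing then add the necessary function. This converts the named vector into an AAStringSet (amino acid string set), a Biostrings object for holding a set of protein sequences. The result is assigned to shrooms_vector_ss, where the _ss suffix marks it as a string set.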

# add necessary function
shrooms_vector_ss <- Biostrings::AAStringSet        (shrooms_vector)

MSA

TODO: briefly summarize what this section of the document will do.
Readings will be assigned to explain what MSAs are.

In this section, there will be an alignment performed between all of the sequences that were previously downloaded, which is then used to create a phylogenetic tree. This will indicate possible relations between genes within and between species.

Building a Multiple Sequence Alignment (MSA)

TODO: briefly explain what this chunk does, then add the necessary function.

This assigns the multiple sequence alignment, created by the ClustalW algorithm via msa(), to shrooms_align. Because the msa package includes ClustalW, we do not have to download the program separately and run it from the command line.

# add necessary function
shrooms_align <- msa(shrooms_vector_ss,
                     method = "ClustalW")

## use default substitution matrix

Viewing an MSA

TODO: briefly summarize what this section will do. This will allow us to visualize the MSA in a more accessible way.

Viewing an MSA in R

TODO: Briefly summarize what output is shown below This is the output from the msa, using the ClustalW. It is showing areas of similarity between the amino acid sequences.

shrooms_align

## CLUSTAL 2.1  
## 
## Call:
##    msa(shrooms_vector_ss, method = "ClustalW")
## 
## MsaAAMultipleAlignment with 14 rows and 2252 columns
##      aln                                                   names
##  [1] -------------------------...------------------------- NP_065768
##  [2] -------------------------...------------------------- AAK95579
##  [3] -------------------------...SVFGGVFPTLTSPL----------- AAF13269
##  [4] -------------------------...SVFGGVFPTLTSPL----------- AAF13270
##  [5] -------------------------...CTFSGIFPTLTSPL----------- NP_065910
##  [6] -------------------------...NKS--LPPPLTSSL----------- ABD59319
##  [7] -------------------------...------------------------- CAA58534
##  [8] -------------------------...------------------------- ABD19518
##  [9] -------------------------...LT----------------------- NP_597713
## [10] -------------------------...------------------------- CAA78718
## [11] -------------------------...------------------------- EAA12598
## [12] -------------------------...------------------------- ABA81834
## [13] MTELQPSPPGYRVQDEAPGPPSCPP...------------------------- XP_392427
## [14] -------------------------...AATSSSSNGIGGPEQLNSNATSSYC XP_783573
##  Con -------------------------...------------------------- Consensus
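
The last row of the output above is a consensus sequence. As a rough illustration of the idea, the sketch below builds a simple majority-rule consensus from three short, made-up sequences that are already aligned. The msa package applies its own rules to build its consensus row, so this is an assumption-laden simplification and not its method.

```r
# Three made-up sequences, already aligned (dashes are gaps; hypothetical data)
aligned <- c(seq1 = "MKT-AY",
             seq2 = "MKTPAY",
             seq3 = "MRT-AY")

# One row per sequence, one column per alignment position
aln_matrix <- do.call(rbind, strsplit(aligned, ""))

# For each column, pick the most frequent character (simple majority rule)
consensus_chars <- apply(aln_matrix, 2, function(column) {
  names(which.max(table(column)))
})

consensus <- paste(consensus_chars, collapse = "")

print(aln_matrix)
print(consensus)
```

In the real alignment most sequences have gaps at the very start and end, which is consistent with the consensus row showing dashes there.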

TODO: briefly explain what is being done in this chunk. This is tricky (and annoying) so do your best

# WHAT IS THE LINE BELOW DOING? (its tricky - do your best)
# The class of shrooms_align is being set to "AAMultipleAlignment" (amino acid multiple alignment). shrooms_align is already a type of AAMultipleAlignment.
class(shrooms_align) <- "AAMultipleAlignment"

# WHAT IS THE LINE BELOW DOING? This is simpler
# The alignment is being converted into the format defined by the seqinr package and saved under the new name shrooms_align_seqinr.
shrooms_align_seqinr <- msaConvert(shrooms_align, type = "seqinr::alignment")

TODO: what is the output this produces? It prints the full alignment to the R console, broken into chunks.

print_msa(alignment = shrooms_align_seqinr, 
          chunksize = 60)
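
The idea behind a chunk size is that a long sequence is printed in fixed-width pieces so it fits on the screen. The sketch below splits a made-up 130-character string into chunks of 60 using base R; it illustrates the concept only and is not how print_msa() is implemented.

```r
# A made-up 130-character sequence (hypothetical data)
long_seq <- paste(rep("ACDEFGHIKL", 13), collapse = "")

chunksize <- 60

# Start and end position of each chunk; the last chunk may be shorter
starts <- seq(1, nchar(long_seq), by = chunksize)
ends   <- pmin(starts + chunksize - 1, nchar(long_seq))

# Extract all chunks at once
chunks <- substring(long_seq, starts, ends)

# Print each chunk on its own line
cat(chunks, sep = "\n")
```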

Displaying an MSA XXXXXXXX

TODO: explain this output and how its different from the previous. It displays only the selected part of the alignment (columns 2000 to 2100) as a plot, whereas the previous chunk printed out the whole alignment as text.

## add necessary function
ggmsa::ggmsa(shrooms_align,   # shrooms_align, NOT shrooms_align_seqinr
      start = 2000, 
      end = 2100)

## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2

TODO: explain what this command is doing. getwd() (get working directory) shows the folder on your computer where R is currently reading and saving files.

getwd()

## [1] "/Users/gemmalouisefarquhar/Downloads"

Building a phylogeny in R

Gemma Farquhar

30/09/2021