Multiple sequence alignments line up different DNA/protein sequences and are a good way to compare them to determine whether a sequence is well conserved. A common way to do this is to use DNA/protein sequences (same gene) from different species and see how related they are.
# required packages
library(compbio4all)
library(rentrez)
library(seqinr)
library(msa)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:seqinr':
##
## translate
## The following object is masked from 'package:base':
##
## strsplit
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
I am just using a random sequence of DNA. I am also saving the length of the sequence since this will be useful later.
dio1_table<-c("NP_000783", "P49895","NA","Homo sapiens" , "Human", "DIO1",
"NP_001116123","NA", "NA", "Pan troglodytes" , "Chimpanzee","DIO1",
"NP_031886", "Q61153","NA","Mus musculus", "Mouse" ,"DIO1",
"NP_001091083","P24389","NA","Rattus norvegicus", "Rat", "Dio1",
"NP_001243226","Q2QEI3","NA","Xenopus tropicalis","Frog", "dio1",
"NP_001007284","F1R7E6","NA","Danio rerio", "Fish", "dio1")
# convert the vector to matrix using matrix()
dio1_table_matrix <- matrix(dio1_table, byrow = T, nrow = 6)
# convert the matrix to a dataframe using data.frame()
dio1_table <- as.data.frame(dio1_table_matrix, stringsAsFactors = F)
# name columns of dataframe using names() function
colnames(dio1_table) <- c("ncbi.protein.accession", "UniProt.id", "PDB", "species", "common.name", "gene.name")
# convert table to dataframe
dio1_table <- as.data.frame(dio1_table)
This ste gives us the FASTA sequences for each of the accessions specified in the table above (in list form).
dio1s_list <- compbio4all::entrez_fetch_list(db = "protein",
id = dio1_table$ncbi.protein.accession,
rettype = "fasta")
This is then turned into a vector
# creating a vector without sequences with NA at every position
dio1s_vector <- rep(NA, length(dio1s_list))
# populating the vector with the fasta sequences from shroom_list
for(i in 1:length(dio1s_list)){
dio1s_vector[i] <- dio1s_list[[i]]
}
# labeling the sequences with the names from the list.
names(dio1s_vector) <- names(dio1s_list)
The vector needs to be converted into the required data type for the msa function.
dio1s_vector_ss <- Biostrings::AAStringSet(dio1s_vector)
dio1s_msa <- msa(dio1s_vector_ss)
## use default substitution matrix
ggmsaWe can make the MSA look nice using ggmsa First, it needs to be converted into the requried class type and then it can be plotted.
It is easier to visualize a section of the alignment rather than the entire alignment.
class(dio1s_msa) <- "AAMultipleAlignment"
ggmsa(dio1s_msa, start = 100, end = 150)
For more information on this topic, see
TODO: find one resource related to this topic, such as those found on https://www.statmethods.net/index.html, https://r-charts.com/, http://www.r-tutor.com/, http://www.sthda.com/. (http://www.sthda.com/ is run by the author of ggpubr and has lots of resources for it).