FAQ: How do I make a multiple sequence alignment in R?

QUESTION: How do I make a multiple sequence alignment in R?

Multiple sequence alignments line up different DNA/protein sequences and are a good way to compare them to determine whether a sequence is well conserved. A common way to do this is to use DNA/protein sequences (same gene) from different species and see how related they are.

Necessary Libraries

# required packages
library(compbio4all)
library(rentrez)
library(seqinr)
library(msa)

## Loading required package: Biostrings

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:grDevices':
## 
##     windows

## Loading required package: XVector

## Loading required package: GenomeInfoDb

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:seqinr':
## 
##     translate

## The following object is masked from 'package:base':
## 
##     strsplit

library(ggmsa)

## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2

Data

I am just using a random sequence of DNA. I am also saving the length of the sequence since this will be useful later.

dio1_table<-c("NP_000783",   "P49895","NA","Homo sapiens" ,    "Human",      "DIO1",
              "NP_001116123","NA",    "NA", "Pan troglodytes" , "Chimpanzee","DIO1",
              "NP_031886",   "Q61153","NA","Mus musculus",      "Mouse"    ,"DIO1",
              "NP_001091083","P24389","NA","Rattus norvegicus", "Rat",      "Dio1",
              "NP_001243226","Q2QEI3","NA","Xenopus tropicalis","Frog",     "dio1",
              "NP_001007284","F1R7E6","NA","Danio rerio",       "Fish",     "dio1")

# convert the vector to matrix using matrix()
dio1_table_matrix <- matrix(dio1_table, byrow = T, nrow = 6)

# convert the matrix to a dataframe using data.frame()
dio1_table <- as.data.frame(dio1_table_matrix,  stringsAsFactors = F)

# name columns of dataframe using names() function
colnames(dio1_table) <- c("ncbi.protein.accession", "UniProt.id", "PDB", "species", "common.name", "gene.name")

# convert table to dataframe
dio1_table <- as.data.frame(dio1_table)

Get FASTA sequences

This ste gives us the FASTA sequences for each of the accessions specified in the table above (in list form).

dio1s_list <- compbio4all::entrez_fetch_list(db = "protein", 
                          id = dio1_table$ncbi.protein.accession, 
                          rettype = "fasta")

This is then turned into a vector

# creating a vector without sequences with NA at every position
dio1s_vector <- rep(NA, length(dio1s_list))

# populating the vector with the fasta sequences from shroom_list
for(i in 1:length(dio1s_list)){
  dio1s_vector[i] <- dio1s_list[[i]]
}

#  labeling the sequences with the names from the list.
names(dio1s_vector) <- names(dio1s_list)

Convert to AAStringSet

The vector needs to be converted into the required data type for the msa function.

dio1s_vector_ss <- Biostrings::AAStringSet(dio1s_vector)

Make the MSA

dio1s_msa <- msa(dio1s_vector_ss)

## use default substitution matrix

Plot using `ggmsa`

We can make the MSA look nice using ggmsa First, it needs to be converted into the requried class type and then it can be plotted.

It is easier to visualize a section of the alignment rather than the entire alignment.

class(dio1s_msa) <- "AAMultipleAlignment"
ggmsa(dio1s_msa, start = 100, end = 150)

Additional Reading

For more information on this topic, see

TODO: find one resource related to this topic, such as those found on https://www.statmethods.net/index.html, https://r-charts.com/, http://www.r-tutor.com/, http://www.sthda.com/. (http://www.sthda.com/ is run by the author of ggpubr and has lots of resources for it).

Keywords

multiple sequence alignment
msa()
ggmsa()
AAStringSet

FAQ: How do I make a multiple sequence alignment in R?

Ananya Venbakkam

12/17/2021

QUESTION: How do I make a multiple sequence alignment in R?

Necessary Libraries

Data

Get FASTA sequences

Convert to AAStringSet

Make the MSA

Plot using `ggmsa`

Additional Reading

Keywords

FAQ: How do I make a multiple sequence alignment in R?

Ananya Venbakkam

12/17/2021

QUESTION: How do I make a multiple sequence alignment in R?

Necessary Libraries

Data

Get FASTA sequences

Convert to AAStringSet

Make the MSA

Plot using ggmsa

Additional Reading

Keywords

Plot using `ggmsa`