Sequence dotplots in R

By: Avril Coghlan.

Adapted, edited and expanded: Nathan Brouwer under the Creative Commons 3.0 Attribution License (CC BY 3.0).

NOTE: I’ve added some new material that is rather terse and lacks explication.

Good sources of more info: https://omicstutorials.com/interpreting-dot-plot-bioinformatics-with-an-example/

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/650/Examples_interpretations_dot_plots.html

As a first step in comparing two protein, RNA or DNA sequences, it is a good idea to make a dotplot. A dotplot is a graphical method that compares two sequences and identifies regions of close similarity between them. It is essentially a two-dimensional matrix (like a grid), with one of the sequences being compared along the vertical axis and the other along the horizontal axis.

To make a simple dotplot that represents the similarity between two sequences, individual cells in the matrix are shaded black if the residues at the corresponding positions are identical, so that matching sequence segments appear as diagonal lines across the matrix. Identical proteins will have a line exactly on the main diagonal of the dotplot that spans the whole matrix.

For proteins that are not identical, but share regions of similarity, the dotplot will have shorter lines that may be on the main diagonal, or off the main diagonal of the matrix. In essence, a dotplot will reveal if there are any regions that are clearly very similar in two protein (or DNA) sequences.
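
The matrix idea described above can be sketched in a few lines of base R. This is only an illustration of the concept, not how seqinr::dotPlot() is implemented; the two short sequences and the object names (seq_a, seq_b, match_matrix) are made up for the example.

```r
# Two short made-up sequences (hypothetical, not real proteins)
seq_a <- c("M", "K", "V", "L", "A", "A")
seq_b <- c("M", "K", "V", "L", "G", "A")

# outer() compares every residue of seq_a with every residue of seq_b.
# The result is a logical matrix: one row per seq_a residue,
# one column per seq_b residue, TRUE where the residues are identical.
match_matrix <- outer(seq_a, seq_b, FUN = "==")
dimnames(match_matrix) <- list(seq_a, seq_b)

# Print the matrix as text: "*" for a match, "." for a mismatch
dot_text <- ifelse(match_matrix, "*", ".")
print(noquote(dot_text))
```

The run of matches for M, K, V and L falls on the main diagonal, which is the text equivalent of the diagonal line drawn in a dotplot.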

Preliminaries

library(compbio4all)
library(rentrez)

Visualizing two identical sequences

To help build our intuition about dotplots we’ll first look at some artificial examples. First, we’ll see what happens when we make a dotplot comparing the alphabet versus itself. The built-in LETTERS object in R contains the alphabet from A to Z. This is a sequence with no repeats.

# LETTERS is a built-in character vector: "A", "B", ..., "Z"
seqinr::dotPlot(LETTERS, LETTERS)

What we get is a perfect diagonal line.

Visualizing repeats

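Next we repeat the alphabet twice and compare the result with itself. The dotplot now looks like a 2 x 2 grid of the previous plot. The main diagonal is the identity line, and the additional off-diagonal lines appear because each letter in the first copy also matches the same letter in the second copy (A-A, B-B, C-C, and so on). This is the pattern that tandem repeats produce.

LETTERS.2.times <- c(LETTERS, LETTERS)  # the alphabet repeated twice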

seqinr::dotPlot(LETTERS.2.times, 
                LETTERS.2.times)

With three copies the dotplot looks like a 3 x 3 grid, and the repeats create further off-diagonal lines.

LETTERS.3.times <- c(LETTERS, LETTERS, LETTERS)  # the alphabet repeated three times

seqinr::dotPlot(LETTERS.3.times, 
                LETTERS.3.times)
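
To see where the diagonals of a repeat dotplot fall, we can tabulate the offset (column minus row) of every matching cell. This is a rough base-R sketch that does not use seqinr; the object names (repeated, hits, offsets) are chosen here for illustration.

```r
# The alphabet repeated three times
repeated <- rep(LETTERS, times = 3)

# Logical matrix: TRUE wherever two positions hold the same letter
match_matrix <- outer(repeated, repeated, FUN = "==")

# Row and column index of every matching cell
hits <- which(match_matrix, arr.ind = TRUE)

# Cells on the same diagonal share the same offset (column - row),
# so counting offsets shows how many diagonals there are and how long each is
offsets <- table(hits[, "col"] - hits[, "row"])
print(offsets)
```

There are five offsets (0, plus or minus 26, and plus or minus 52), one for each diagonal line in the plot. The main diagonal at offset 0 is the longest, and the lines get shorter as they move away from it.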

We can build a shorter repeated sequence in the same way: store the repeat unit in the vector seq.repeat, then use rep() to repeat it three times.

seq.repeat <- c("A", "C", "D", "E", "F", "G", "H", "I")

seq1 <- rep(seq.repeat, 3)

Make the dotplot:

seqinr::dotPlot(seq1, seq1)


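Next we make and plot a dotplot that contains an inversion. We start from the three-copy alphabet vector and reverse the middle copy with rev(); “invert” in the object name stands for “inversion”. Wherever the inverted copy is compared with a forward copy, the matches form a line that runs perpendicular to the main diagonal.

# middle copy of the alphabet is reversed
LETTERS.3.times.with.invert <- c(LETTERS, rev(LETTERS), LETTERS)

seqinr::dotPlot(LETTERS.3.times.with.invert, LETTERS.3.times.with.invert)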
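
The reason an inversion runs the “wrong way” across the plot can be checked with a small base-R sketch. The object names (forward, inverted, hits) are made up for this example, and the code does not use seqinr.

```r
forward  <- LETTERS       # A to Z
inverted <- rev(LETTERS)  # Z to A

# Row and column index of every cell where the two sequences match
hits <- which(outer(forward, inverted, FUN = "=="), arr.ind = TRUE)

# On a forward diagonal, column - row is constant.
# On an inverted (anti-)diagonal, row + column is constant instead.
index_sums <- unique(as.vector(hits[, "row"] + hits[, "col"]))
print(index_sums)
```

Every match has the same row + column sum, so all 26 matches lie on a single line perpendicular to the main diagonal.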

A translocation moves a segment of a sequence to a new position. To simulate one, we split the alphabet into three segments, join them back together in a different order, and compare the rearranged sequence with the original alphabet. The segments that moved appear as diagonal lines shifted off the main diagonal.

seg1 <- LETTERS[1:8]
seg2 <- LETTERS[9:18]
seg3 <- LETTERS[19:26]  # starts at 19 so that "R" is not included twice


# swap the last two segments to create the translocation
LETTERS.with.transloc <- c(seg1, seg3, seg2)

seqinr::dotPlot(LETTERS, LETTERS.with.transloc)
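
As a rough check of what a translocation does to positions, the sketch below builds its own rearranged alphabet (the last two segments swapped) and measures how far each letter has moved. The object names (original, rearranged, shift) and the choice of segments are assumptions made for this example only.

```r
original <- LETTERS

# Hypothetical rearrangement: letters 19-26 are moved in front of letters 9-18
rearranged <- c(LETTERS[1:8], LETTERS[19:26], LETTERS[9:18])

# Position of each original letter in the rearranged sequence
new_position <- match(original, rearranged)

# How far each letter moved; letters in the same segment move together
shift <- new_position - seq_along(original)

# rle() summarises the shifts as runs: one run per segment
print(rle(shift))
```

Each run corresponds to one diagonal segment in the dotplot. A shift of 0 stays on the main diagonal, and a non-zero shift gives a line displaced from it by that many positions.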

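Finally, we plot a dotplot of randomly ordered letters. Shuffling the alphabet is a crude way to simulate evolution, in the sense that two sequences have diverged until no shared order remains.

The sample() function draws letters at random. With replace = TRUE some letters are drawn more than once and others not at all:

sample(x = LETTERS, size = 26, replace = TRUE)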
##  [1] "B" "C" "O" "V" "Q" "A" "O" "K" "W" "Z" "B" "W" "Z" "X" "S" "X" "V" "A" "S"
## [20] "Z" "T" "X" "X" "R" "Y" "T"
# replace = FALSE shuffles the alphabet: each letter appears exactly once
letters.rand1 <- sample(x = LETTERS, size = 26, replace = FALSE)
letters.rand2 <- sample(x = LETTERS, size = 26, replace = FALSE)


seqinr::dotPlot(letters.rand1, letters.rand2)
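
Because sampling is random, the plot changes on every run. The sketch below fixes the random seed so the shuffles are reproducible and counts the matches directly in base R. The seed value and the object names (perm1, perm2, match_matrix) are arbitrary choices for this example.

```r
set.seed(42)  # arbitrary seed, used only to make the shuffles reproducible

# sample() with no size argument returns a random permutation
perm1 <- sample(LETTERS)
perm2 <- sample(LETTERS)

# TRUE wherever a letter of perm1 equals a letter of perm2
match_matrix <- outer(perm1, perm2, FUN = "==")

# Each letter occurs once in each shuffle, so every row and every
# column of the matrix contains exactly one match
total_matches <- sum(match_matrix)
print(total_matches)
print(all(rowSums(match_matrix) == 1))
print(all(colSums(match_matrix) == 1))
```

The dotplot of two shuffled alphabets therefore always contains 26 dots, but they are scattered rather than lined up along diagonals.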

Download sequences

Now we’ll make a real dotplot of the chorismate lyase proteins from two closely related species, Mycobacterium leprae and Mycobacterium ulcerans.

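Note: these are protein sequences, so we set db = "protein".

The sequences are downloaded as FASTA-formatted text, which includes a header line and newline characters. Before the two sequences can be compared they need to be cleaned: fasta_cleaner() removes the header and the newlines and splits each sequence into a character vector with one amino acid per element. Below we download both sequences and then clean them with the fasta_cleaner() function from the compbio4all package.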

# sequence 1: Q9CD83
leprae_fasta <- rentrez::entrez_fetch(db = "protein",
                                      id = "Q9CD83",
                                      rettype = "fasta")
# sequence 2: OIN17619.1
ulcerans_fasta <- rentrez::entrez_fetch(db = "protein",
                                        id = "OIN17619.1",
                                        rettype = "fasta")
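# Definition of fasta_cleaner(), shown for reference
fasta_cleaner <- function(fasta_object, parse = TRUE){

  # drop the header line (from ">" to the first newline)
  fasta_object <- sub("^(>)(.*?)(\\n)(.*)(\\n\\n)","\\4",fasta_object)
  # remove the remaining newline characters
  fasta_object <- gsub("\n", "", fasta_object)

  # optionally split the sequence into one character per element
  if(parse == TRUE){
    fasta_object <- stringr::str_split(fasta_object,
                                       pattern = "",
                                       simplify = FALSE)
  }

  return(fasta_object[[1]])
}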

leprae_vector   <- compbio4all::fasta_cleaner(leprae_fasta)
ulcerans_vector <- compbio4all::fasta_cleaner(ulcerans_fasta)
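
To see what the cleaning step does without downloading anything, here is a sketch that applies the same idea to a made-up FASTA record using only base R. It is an alternative written for illustration, not the fasta_cleaner() code itself; the record, its header, and the object names (toy_fasta, toy_vector) are hypothetical.

```r
# A made-up FASTA record: one header line, two sequence lines, trailing newlines
toy_fasta <- ">TOY1 hypothetical example record\nMTNRTLS\nREEIRKL\n\n"

# Split the text into lines
fasta_lines <- strsplit(toy_fasta, "\n")[[1]]

# Keep only sequence lines: drop the header (starts with ">") and empty lines
seq_lines <- fasta_lines[!startsWith(fasta_lines, ">") & nzchar(fasta_lines)]

# Join the sequence lines and split into one residue per element
toy_vector <- strsplit(paste(seq_lines, collapse = ""), "")[[1]]

print(toy_vector)
print(length(toy_vector))
```

A vector with one residue per element is the form of input used with seqinr::dotPlot() throughout this section, so it is worth checking the length and the first few elements of leprae_vector and ulcerans_vector in the same way before plotting.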