Sequence dotplots in R

By: Avril Coghlan.

Adapted, edited and expanded: Nathan Brouwer under the Creative Commons 3.0 Attribution License (CC BY 3.0).

NOTE: I’ve added some new material that is rather terse and lacks explication.

Good sources of more info: https://omicstutorials.com/interpreting-dot-plot-bioinformatics-with-an-example/

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/650/Examples_interpretations_dot_plots.html

As a first step in comparing two protein, RNA or DNA sequences, it is a good idea to make a dotplot. A dotplot is a graphical method for comparing two sequences and identifying regions of close similarity between them. It is essentially a two-dimensional matrix (like a grid), with one of the sequences being compared along the vertical axis and the other along the horizontal axis.

To make a simple dotplot representing the similarity between two sequences, individual cells in the matrix are shaded black where the residues are identical, so that matching sequence segments appear as diagonal runs of dots across the matrix. Identical proteins will have a line exactly on the main diagonal of the dotplot that spans the whole matrix.

For proteins that are not identical, but share regions of similarity, the dotplot will have shorter lines that may be on the main diagonal, or off the main diagonal of the matrix. In essence, a dotplot will reveal if there are any regions that are clearly very similar in two protein (or DNA) sequences.
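
The grid described above can be sketched in a few lines of base R. This is only an illustration of the idea, not a description of how seqinr::dotPlot() is implemented, and the vectors toy.seq1 and toy.seq2 are made-up example data.

```r
# Two short, made-up sequences (hypothetical example data)
toy.seq1 <- c("M", "K", "V", "L", "A")
toy.seq2 <- c("M", "K", "I", "L", "A")

# outer() compares every element of toy.seq1 with every element of toy.seq2;
# TRUE marks a cell that would be shaded in a dotplot
match.matrix <- outer(toy.seq1, toy.seq2, FUN = "==")
dimnames(match.matrix) <- list(toy.seq1, toy.seq2)
match.matrix

# Main diagonal: TRUE where the two sequences agree at the same position
diag(match.matrix)
```

The two toy sequences differ only at position 3, so the main diagonal is TRUE everywhere except in the third cell.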

Preliminaries

library(compbio4all)
library(rentrez)

Visualizing two identical sequences

To help build our intuition about dotplots we’ll first look at some artificial examples. First, we’ll see what happens when we make a dotplot comparing the alphabet versus itself. The built-in LETTERS object in R contains the alphabet from A to Z. This is a sequence with no repeats.

#LETTERS
seqinr::dotPlot(LETTERS, LETTERS)

What we get is a perfect diagonal line.

Visualizing repeats

Here we create a vector that contains the same sequence two times, so the dotplot essentially consists of a 2-by-2 arrangement of identical grids. There will be three lines: the main diagonal plus one parallel line on either side of it. Taking A as an example, there is a match at coordinate (1,1) and another at (1,27), because the sequence starts over again at position 27.

LETTERS.2.times  <- c(LETTERS, LETTERS)

seqinr::dotPlot(LETTERS.2.times, 
                LETTERS.2.times)

Now we create a vector that contains the LETTERS sequence three times. When we plot it, we see a result similar to the previous plot, but with two more lines, one further above and one further below the main diagonal. Using A as an example again, there are matches at (1,1), (1,27), and (1,53), because the three copies start at positions 1, 27 and 53.

LETTERS.3.times <- c(LETTERS, LETTERS, LETTERS)

seqinr::dotPlot(LETTERS.3.times,
                LETTERS.3.times)
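
The coordinates of the repeated matches can be checked directly. The sketch below uses which() to find every position of "A" in the tripled alphabet; the object name a.positions is introduced here only for illustration.

```r
# Rebuild the tripled alphabet so this block runs on its own
LETTERS.3.times <- c(LETTERS, LETTERS, LETTERS)

# Positions where "A" occurs: the start of each 26-letter copy
a.positions <- which(LETTERS.3.times == "A")
a.positions

# Offsets of the later copies relative to the first "A"; each nonzero offset
# corresponds to one line above and one line below the main diagonal
a.positions - a.positions[1]
```

The offsets of 26 and 52 account for the four off-diagonal lines that accompany the main diagonal.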

Here we create a vector containing a sequence and then produce a vector where we repeat the sequence a specified number of times using the rep() function.

seq.repeat <- c("A","C","D","E","F","G","H","I")

seq1 <- rep(seq.repeat, 3)

Make the dotplot:

seqinr::dotPlot(seq1,seq1)
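
As a side note on rep(), the function can repeat a vector in two different ways, and only one of them produces the tandem repeat used above. The following sketch contrasts the times and each arguments; the names tandem and stutter are made up for this example.

```r
seq.repeat <- c("A","C","D","E","F","G","H","I")

# times = 3 repeats the whole motif end to end (what rep(seq.repeat, 3) does)
tandem <- rep(seq.repeat, times = 3)

# each = 3 repeats every letter three times in a row instead
stutter <- rep(seq.repeat, each = 3)

length(tandem)
head(tandem, 10)
head(stutter, 10)
```

Both vectors have 24 elements, but only the first contains three intact copies of the motif.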

Inversion of Sequences

We create a vector by repeating the LETTERS sequence three times but reversing the middle copy so that it runs backwards. When we plot it, we will see two diagonal lines with a downward slope, which represent the matches involving the inverted copy.

In the object name below, “invert” stands for “inversion”.

LETTERS.3.times.with.invert <- c(LETTERS,
                                 rev(LETTERS),
                                 LETTERS)
LETTERS.3.times.with.invert

##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z" "Z" "Y" "X" "W" "V" "U" "T" "S" "R" "Q" "P" "O"
## [39] "N" "M" "L" "K" "J" "I" "H" "G" "F" "E" "D" "C" "B" "A" "A" "B" "C" "D" "E"
## [58] "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X"
## [77] "Y" "Z"

seqinr::dotPlot(LETTERS.3.times.with.invert,
                LETTERS.3.times.with.invert)
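
One way to see why the inverted copy produces lines running the other way is to look at the coordinates of the matches. The sketch below is a base-R illustration rather than anything seqinr does internally. It relies on the fact that matches along an inverted segment keep row + column constant, whereas forward matches keep row - column constant.

```r
LETTERS.3.times.with.invert <- c(LETTERS, rev(LETTERS), LETTERS)

# Logical matrix of all pairwise matches
match.matrix <- outer(LETTERS.3.times.with.invert,
                      LETTERS.3.times.with.invert,
                      FUN = "==")

# Row and column index of every match
hits <- which(match.matrix, arr.ind = TRUE)

# Count how many matches share each value of row + column
sum.counts <- table(hits[, "row"] + hits[, "col"])

# The values of row + column that collect the most matches
# identify the lines created by the inversion
sum.counts[sum.counts == max(sum.counts)]
```

Two values of row + column, 53 and 105, each collect 52 matches, which corresponds to the two lines with the opposite slope.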

Translocation

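We split the LETTERS sequence into three segments, which represent three different domains. We then rearrange the segments so that segment 3 sits in the middle, and make the dotplot. Note that the code below compares the rearranged sequence with itself; because no letter occurs twice, that plot contains only the main diagonal. Comparing the rearranged sequence with the original LETTERS would instead show the moved segments as off-diagonal lines.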

seg1 <- LETTERS[1:8]
seg2 <- LETTERS[9:18]
seg3 <- LETTERS[19:26]


LETTERS.with.transloc <-  c(seg1, seg3, seg2)

seqinr::dotPlot(LETTERS.with.transloc,
                LETTERS.with.transloc)
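
How far each segment moved can be worked out with match(). This is a small sketch; the names new.position and shift are introduced only for this example. The optional plot at the end compares the original alphabet with the rearranged one and is skipped if seqinr is not installed.

```r
seg1 <- LETTERS[1:8]
seg2 <- LETTERS[9:18]
seg3 <- LETTERS[19:26]
LETTERS.with.transloc <- c(seg1, seg3, seg2)

# Position of each original letter within the rearranged sequence
new.position <- match(LETTERS, LETTERS.with.transloc)

# How far each letter moved: 0 for seg1, +8 for seg2, -10 for seg3
shift <- new.position - seq_along(LETTERS)
rle(shift)

# Original versus rearranged sequence: each segment appears as its own
# diagonal piece, displaced by the shift computed above
if (requireNamespace("seqinr", quietly = TRUE)) {
  seqinr::dotPlot(LETTERS, LETTERS.with.transloc)
}
```

The run-length encoding shows three runs of lengths 8, 10 and 8, one per segment.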

Randomization

Here we use the sample() function, which randomly samples from a vector. When sampling with replacement (replace = TRUE), a letter that has been picked remains in the pool and can be picked again. When sampling without replacement (replace = FALSE), a letter that has been picked is removed from the pool and cannot be picked again, so sampling all 26 letters this way produces a random permutation of the alphabet.

sample(x = LETTERS, size = 26, replace = TRUE)

##  [1] "T" "Z" "F" "N" "A" "Y" "S" "X" "F" "J" "F" "S" "F" "H" "A" "B" "P" "Z" "I"
## [20] "W" "P" "V" "R" "P" "P" "L"

letters.rand1 <- sample(x = LETTERS, size = 26, replace = FALSE)
letters.rand2 <- sample(x = LETTERS, size = 26, replace = FALSE)


letters.rand1

##  [1] "T" "J" "Z" "R" "L" "K" "F" "E" "U" "X" "W" "I" "V" "M" "Q" "C" "P" "N" "O"
## [20] "D" "S" "H" "Y" "B" "A" "G"

seqinr::dotPlot(letters.rand1,
                letters.rand2)
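
Because sample() is random, the plot differs on every run. The sketch below uses set.seed() to make the result repeatable (the seed value 42 is arbitrary) and counts the dots that such a plot contains.

```r
set.seed(42)  # arbitrary seed, chosen only so the example is repeatable

letters.rand1 <- sample(x = LETTERS, size = 26, replace = FALSE)
letters.rand2 <- sample(x = LETTERS, size = 26, replace = FALSE)

# Both vectors are permutations of the alphabet, so every letter in one
# matches exactly one position in the other
match.matrix <- outer(letters.rand1, letters.rand2, FUN = "==")

sum(match.matrix)       # total number of dots
rowSums(match.matrix)   # dots per row
```

There are always 26 dots, one per row and one per column, but they are scattered rather than lined up along a diagonal.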

Download sequences

Now we’ll make a real dotplot of the chorismate lyase proteins from two closely related species, Mycobacterium leprae and Mycobacterium ulcerans.

# sequence 1: Q9CD83
leprae_fasta <- rentrez::entrez_fetch(db = "protein",
                                      id = "Q9CD83",
                                      rettype = "fasta")
# sequence 2: OIN17619.1
ulcerans_fasta <- rentrez::entrez_fetch(db = "protein",
                                        id = "OIN17619.1",
                                        rettype = "fasta")


leprae_vector <- compbio4all::fasta_cleaner(leprae_fasta, parse = TRUE)
ulcerans_vector <- compbio4all::fasta_cleaner(ulcerans_fasta, parse = TRUE)
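
dotPlot() expects each sequence as a vector of single characters, which is why the downloaded FASTA text is cleaned first. The internals of compbio4all::fasta_cleaner() are not shown in this document, so the base-R sketch below is only an assumption about the kind of steps involved, applied to a made-up FASTA record (toy.fasta is not a real protein).

```r
# A made-up FASTA record (hypothetical header and sequence)
toy.fasta <- ">toy_id example protein\nMKVLAAGIT\nPLWQ\n"

# Split the text into lines and drop the header line that starts with ">"
fasta.lines <- strsplit(toy.fasta, split = "\n", fixed = TRUE)[[1]]
sequence.lines <- fasta.lines[!startsWith(fasta.lines, ">")]

# Join the remaining lines and split them into one character per residue
toy.vector <- strsplit(paste(sequence.lines, collapse = ""), split = "")[[1]]

toy.vector
length(toy.vector)
```

The result is a character vector with one amino acid per element, the same shape as the LETTERS vectors used in the earlier examples.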

Creating a Dotplot

seqinr::dotPlot(leprae_vector,
                ulcerans_vector)

Here we see a dotplot of the M. leprae sequence compared with the M. ulcerans sequence. Dots are scattered all over the plot, because any given amino acid in one sequence matches the same amino acid at many unrelated positions in the other by chance. Against that background there is a clear diagonal line, which shows that the two proteins have matching amino acids at corresponding positions and therefore likely have similar amino acid sequences.
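
The scattered background dots can be reduced by comparing short windows of residues instead of single residues. seqinr::dotPlot() has the arguments wsize, wstep and nmatch for this. The sketch below uses two made-up sequences (toy.protein1 and a partly altered copy, toy.protein2) so that it runs without a download. The window settings are arbitrary illustrations, not recommended values, and the plots are skipped if seqinr is not installed.

```r
set.seed(1)  # arbitrary seed, for a repeatable illustration

amino.acids <- c("A","C","D","E","F","G","H","I","K","L",
                 "M","N","P","Q","R","S","T","V","W","Y")

# A random toy "protein" and a copy in which 20 positions are re-drawn
toy.protein1 <- sample(amino.acids, size = 100, replace = TRUE)
toy.protein2 <- toy.protein1
changed <- sample(100, size = 20)
toy.protein2[changed] <- sample(amino.acids, size = 20, replace = TRUE)

if (requireNamespace("seqinr", quietly = TRUE)) {
  # Default settings: every single matching residue produces a dot
  seqinr::dotPlot(toy.protein1, toy.protein2)

  # Windowed: a dot only where at least 3 of the 5 residues in a window match
  seqinr::dotPlot(toy.protein1, toy.protein2,
                  wsize = 5, wstep = 1, nmatch = 3)
}
```

With the windowed settings, isolated chance matches largely disappear while the diagonal remains, which makes the shared region easier to see.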

Introduction to dotplots in R- Portfolio Assignment

Colleen Petersen

10/26/2021