Sequence dotplots in R

By: Avril Coghlan.

Adapted, edited and expanded: Nathan Brouwer under the Creative Commons 3.0 Attribution License (CC BY 3.0).

NOTE: I’ve added some new material that is rather terse and lacks explication.

Good sources of more info: https://omicstutorials.com/interpreting-dot-plot-bioinformatics-with-an-example/

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/650/Examples_interpretations_dot_plots.html

As a first step in comparing two protein, RNA or DNA sequences, it is a good idea to make a dotplot. A dotplot is a graphical method for comparing two sequences and identifying regions of close similarity between them. It is essentially a two-dimensional matrix (like a grid) with one of the sequences being compared along the vertical axis and the other along the horizontal axis.

To make a simple dotplot representing the similarity between two sequences, individual cells in the matrix are shaded black if the residues at that row and column are identical, so that matching sequence segments appear as diagonal runs across the matrix. Identical proteins will have a line exactly on the main diagonal of the dotplot that spans the whole matrix.

For proteins that are not identical, but share regions of similarity, the dotplot will have shorter lines that may be on the main diagonal, or off the main diagonal of the matrix. In essence, a dotplot will reveal if there are any regions that are clearly very similar in two protein (or DNA) sequences.
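The grid idea can be sketched in base R without any plotting package. This is an illustration only, not how seqinr::dotPlot() is implemented; the toy sequences and the name match_matrix are made up for the example.

```r
# Two short toy sequences (hypothetical, chosen for illustration)
seq_x <- c("M", "K", "T", "A", "Y")
seq_y <- c("M", "K", "T", "A", "Y")

# Compare every residue of seq_x with every residue of seq_y.
# Cell [i, j] is TRUE when seq_x[i] and seq_y[j] are identical.
match_matrix <- outer(seq_x, seq_y, FUN = "==")

# Identical sequences with no repeated letters: TRUE only on the main diagonal
print(match_matrix)
```

Each TRUE cell corresponds to a cell that would be shaded in the dotplot.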

Preliminaries

library(compbio4all)
library(rentrez)

Visualizing two identical sequences

To help build our intuition about dotplots we’ll first look at some artificial examples. First, we’ll see what happens when we make a dotplot comparing the alphabet versus itself. The built-in LETTERS object in R contains the alphabet from A to Z. This is a sequence with no repeats.

# LETTERS is R's built-in vector of the 26 upper-case letters
seqinr::dotPlot(LETTERS, LETTERS) # plot the alphabet against itself

What we get is a perfect diagonal line.

Visualizing repeats

This code chunk makes a new object that stores the alphabet twice in a row and plots it against itself.

LETTERS.2.times <- rep(LETTERS, 2) # alphabet repeated twice

seqinr::dotPlot(LETTERS.2.times, 
                LETTERS.2.times)

This code chunk makes a new object that stores the alphabet three times in a row and plots it against itself.

LETTERS.3.times <- rep(LETTERS, 3) # alphabet repeated three times

seqinr::dotPlot(LETTERS.3.times, 
                LETTERS.3.times)

This code creates a sequence in which the following eight letters are repeated three times, and plots it against itself.

seq.repeat <- c("A","C","D","E","F","G","H","I")

# repeat the eight-letter motif three times
seq1 <- rep(seq.repeat, 3)

Make the dotplot:

# plot the repeated sequence against itself
seqinr::dotPlot(seq1, seq1)
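To see why repeats show up as lines parallel to the main diagonal, we can locate the matches directly. The following is a minimal base R sketch (not part of seqinr; match_matrix, hits and offsets are names made up here) that measures how far each match sits from the main diagonal.

```r
seq.repeat <- c("A","C","D","E","F","G","H","I")
seq1 <- rep(seq.repeat, 3)

# Logical matrix: TRUE where position i and position j hold the same letter
match_matrix <- outer(seq1, seq1, FUN = "==")

# Row and column index of every match
hits <- which(match_matrix, arr.ind = TRUE)

# Offset of each match from the main diagonal (column minus row)
offsets <- sort(unique(hits[, "col"] - hits[, "row"]))
print(offsets)
```

Every offset is a multiple of 8, the length of the repeated motif, which is why the extra diagonals are evenly spaced.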

Visualizing inverted repeats

This code chunk creates a sequence of the alphabet repeated three times, with the second repeat inverted (backwards). In the dotplot, the diagonals for the inverted repeat run in the opposite direction to the others.

“invert” means “inversion”

LETTERS.3.times.with.invert <- c(LETTERS, rev(LETTERS), LETTERS) # rev() reverses the middle copy

seqinr::dotPlot(LETTERS.3.times.with.invert, LETTERS.3.times.with.invert)
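The change in direction can be checked numerically. In this base R sketch (an illustration with made-up object names, not seqinr code), we look only at the block of the match matrix that compares the first copy of the alphabet with the inverted copy.

```r
x <- c(LETTERS, rev(LETTERS), LETTERS)
match_matrix <- outer(x, x, FUN = "==")

# Block comparing the first copy (rows 1-26) with the inverted copy (columns 27-52)
inverted_block <- match_matrix[1:26, 27:52]
hits <- which(inverted_block, arr.ind = TRUE)

# On an anti-diagonal, row + column is the same for every match
print(unique(hits[, "row"] + hits[, "col"]))
```

A constant row + column sum means the matches run from one corner of the block to the opposite corner, perpendicular to the main diagonal.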

This code chunk splits the alphabet contained in the LETTERS object into 3 chunks and then swaps the order of the second and third chunks, mimicking a translocation. This new sequence of letters is then plotted against the original LETTERS alphabet.

seg1 <- LETTERS[1:8]   # A to H
seg2 <- LETTERS[9:17]  # I to Q
seg3 <- LETTERS[18:26] # R to Z

LETTERS.with.transloc <- c(seg1, seg3, seg2) # seg2 and seg3 swapped

LETTERS.with.transloc

##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "I" "J"
## [20] "K" "L" "M" "N" "O" "P" "Q"

seqinr::dotPlot(LETTERS.with.transloc, LETTERS) # rearranged sequence versus the original alphabet
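The breaks in the diagonal can also be found without plotting. This is a small base R sketch (positions and breakpoints are names introduced here for illustration) that looks up where each letter of the rearranged sequence sits in the original alphabet.

```r
seg1 <- LETTERS[1:8]
seg2 <- LETTERS[9:17]
seg3 <- LETTERS[18:26]
LETTERS.with.transloc <- c(seg1, seg3, seg2)

# For each position in the rearranged sequence, find its position in LETTERS
positions <- match(LETTERS.with.transloc, LETTERS)
print(positions)

# Jumps in position mark the boundaries between the moved segments
breakpoints <- which(diff(positions) != 1)
print(breakpoints)
```

The positions increase by one within each segment and jump at the segment boundaries, which is where the diagonal in the dotplot shifts.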

Visualizing random sequences

This code uses the function sample(), which randomly selects elements from a set, in this case LETTERS. Sampling can happen with or without replacement. The first call below samples with replacement (replace = TRUE), so some letters appear more than once and others not at all. The two sequences used for the dotplot are sampled without replacement (replace = FALSE). Because each is exactly as long as LETTERS, each one is the alphabet in a different random order, and the two are plotted against each other.

sample(x = LETTERS, size = 26, replace = TRUE) # with replacement: letters can repeat

##  [1] "G" "P" "A" "C" "O" "N" "I" "Y" "O" "L" "K" "O" "S" "H" "O" "P" "J" "X" "P"
## [20] "Q" "O" "S" "Y" "J" "B" "R"

letters.rand1 <- sample(x = LETTERS, size = 26, replace = FALSE) # random ordering of the alphabet
letters.rand2 <- sample(x = LETTERS, size = 26, replace = FALSE) # a second, independent ordering

seqinr::dotPlot(letters.rand1, 
                letters.rand2)
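What the random dotplot should look like can be reasoned out from the match matrix. The sketch below is base R only; the seed value and the object names are arbitrary choices made for this illustration.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
rand1 <- sample(x = LETTERS, size = 26, replace = FALSE)
rand2 <- sample(x = LETTERS, size = 26, replace = FALSE)

match_matrix <- outer(rand1, rand2, FUN = "==")

# Each letter occurs exactly once in each sequence,
# so every row and every column contains exactly one match
print(table(rowSums(match_matrix)))

# Number of matches that happen to fall on the main diagonal
print(sum(diag(match_matrix)))
```

There are always 26 dots in total, but they are scattered rather than lined up, so no diagonal structure appears.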

Download sequences

Now we’ll make a real dotplot of the chorismate lyase proteins from two closely related species, Mycobacterium leprae and Mycobacterium ulcerans.

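Note: these are protein sequences, so db = "protein".

The first code chunk downloads each sequence in FASTA format with rentrez::entrez_fetch(). The second converts each FASTA record into a character vector with compbio4all::fasta_cleaner(), which is the form that dotPlot() expects.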

# sequence 1: Q9CD83 (M. leprae)
leprae_fasta <- rentrez::entrez_fetch(db = "protein",
                                      id = "Q9CD83",
                                      rettype = "fasta")
# sequence 2: OIN17619.1 (M. ulcerans)
ulcerans_fasta <- rentrez::entrez_fetch(db = "protein",
                                        id = "OIN17619.1",
                                        rettype = "fasta")

# clean each FASTA record and store it as a character vector
leprae_vector   <- compbio4all::fasta_cleaner(leprae_fasta)
ulcerans_vector <- compbio4all::fasta_cleaner(ulcerans_fasta)
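To give a feel for what the cleaning step involves, here is a rough base R sketch. It is not the actual code of compbio4all::fasta_cleaner(); the function clean_fasta_sketch() and the toy FASTA record are hypothetical and were made up for this illustration.

```r
# A made-up FASTA record: a header line starting with ">" followed by sequence lines
toy_fasta <- ">toy_id hypothetical protein\nMTNRTLSREE\nIRKLDRDLR\n"

# Hypothetical helper: drop the header, join the sequence lines,
# and split the result into one character per residue
clean_fasta_sketch <- function(fasta_text) {
  lines <- strsplit(fasta_text, "\n", fixed = TRUE)[[1]]
  seq_lines <- lines[!grepl("^>", lines)]        # keep only non-header lines
  seq_string <- paste(seq_lines, collapse = "")  # remove the line breaks
  strsplit(seq_string, "")[[1]]                  # one element per amino acid
}

toy_vector <- clean_fasta_sketch(toy_fasta)
print(toy_vector)
```

The result is a character vector with one amino acid per element, the same kind of object that is passed to dotPlot() in this tutorial.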

Investigating repeats in the chorismate lyase sequence from Mycobacterium leprae and Mycobacterium ulcerans

We can create a dotplot for two sequences using the dotPlot() function in the seqinr package.

A dotplot can also be made by plotting a single sequence against itself. This is frequently done to investigate a sequence for the presence of repeats. Here we plot the M. leprae sequence against the M. ulcerans sequence.

(Note: an older version of this exercise stated this kind of analysis wasn’t normally done; this was written last year before I knew of the use of dotplots for investigating sequence repeats.)

seqinr::dotPlot(leprae_vector, ulcerans_vector, wsize = 20, nmatch = 5) # M. leprae on the x-axis, M. ulcerans on the y-axis

The main pattern in this dotplot is the main diagonal, which represents similarity between the two sequences, along with many shorter diagonals running parallel to it, which represent repeated segments. Because these shorter diagonals are parallel to the main diagonal, the repeats occur at different places along the sequences but are not inverted.

In the dotplot above, the M. leprae sequence is plotted along the x-axis (horizontal axis), and the M. ulcerans sequence is plotted along the y-axis (vertical axis). The dotplot displays a dot at points where there is an identical amino acid in the two sequences.

For example, if amino acid 53 in the M. leprae sequence is the same amino acid (e.g. “W”) as amino acid 70 in the M. ulcerans sequence, then the dotplot will show a dot at the position in the plot where x = 53 and y = 70.

In this case you can see a lot of dots along a diagonal line, which indicates that the two protein sequences contain many identical amino acids at the same (or very similar) positions along their lengths. This is what you would expect, because we know that these two proteins are homologs (related proteins) because they share a close evolutionary history.
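The dotPlot() call for these two proteins sets wsize = 20 and nmatch = 5. As I understand the seqinr documentation, wsize is the size of a moving window and a dot is produced when the number of matches in a window is at least nmatch. Assuming that reading is right, the idea can be sketched in base R as follows. The function window_dots() and the toy sequence are hypothetical, and the details of how seqinr positions its windows may differ.

```r
# Hypothetical helper: TRUE at [i, j] when the windows starting at x[i] and y[j]
# share at least nmatch identical residues at matching offsets
window_dots <- function(x, y, wsize = 3, nmatch = 2) {
  nx <- length(x) - wsize + 1  # number of window start positions in x
  ny <- length(y) - wsize + 1  # number of window start positions in y
  dots <- matrix(FALSE, nrow = nx, ncol = ny)
  for (i in seq_len(nx)) {
    for (j in seq_len(ny)) {
      matches <- sum(x[i:(i + wsize - 1)] == y[j:(j + wsize - 1)])
      dots[i, j] <- matches >= nmatch
    }
  }
  dots
}

# Toy sequence (made up for illustration) containing a repeated segment
toy <- strsplit("MKTAYIAKQRMKTAYIAK", "")[[1]]
dots <- window_dots(toy, toy, wsize = 4, nmatch = 3)
print(dim(dots))
```

Requiring several matches per window filters out isolated chance matches, so longer runs of similarity stand out more clearly. With wsize = 1 and nmatch = 1 the sketch reduces to the simple one-residue comparison described at the start of this tutorial.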

Introduction to dotplots in R

Sadhika Sampath

10/27/2021