Sequence dotplots in R

By: Avril Coghlan.

Adapted, edited and expanded: Nathan Brouwer under the Creative Commons 3.0 Attribution License (CC BY 3.0).

NOTE: I’ve added some new material that is rather terse and lacks explication.

Good sources of more info: https://omicstutorials.com/interpreting-dot-plot-bioinformatics-with-an-example/

http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/650/Examples_interpretations_dot_plots.html

As a first step in comparing two protein, RNA or DNA sequences, it is a good idea to make a dotplot. A dotplot is a graphical method for comparing two sequences and identifying regions of close similarity between them. A dotplot is essentially a two-dimensional matrix (like a grid), with the two sequences being compared along the vertical and horizontal axes.

To make a simple dotplot representing the similarity between two sequences, individual cells in the matrix are shaded black if the residues at the corresponding positions are identical, so that matching sequence segments appear as diagonal lines across the matrix. Identical proteins will have a line exactly on the main diagonal of the dotplot that spans the whole matrix.

For proteins that are not identical, but share regions of similarity, the dotplot will have shorter lines that may be on the main diagonal, or off the main diagonal of the matrix. In essence, a dotplot will reveal if there are any regions that are clearly very similar in two protein (or DNA) sequences.
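
The grid described above can be sketched in a few lines of base R. This is only a minimal illustration of the idea, not how seqinr draws its plots; the toy sequences and the names seq_x, seq_y and match_matrix are made up for the example.

```r
# Two short toy sequences (hypothetical, for illustration only)
seq_x <- c("M", "K", "V", "L", "A")
seq_y <- c("M", "K", "I", "L", "A")

# outer() compares every residue of seq_x with every residue of seq_y;
# cell [i, j] is TRUE when seq_x[i] and seq_y[j] are identical
match_matrix <- outer(seq_x, seq_y, FUN = "==")
dimnames(match_matrix) <- list(seq_x, seq_y)

# Print the grid of TRUE/FALSE values
match_matrix

# Cells on the main diagonal: same residue at the same position
diag(match_matrix)
```

Each TRUE cell corresponds to a cell that would be shaded in the dotplot; the single mismatch (V versus I) leaves a gap in the diagonal.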

Preliminaries

library(compbio4all)
library(rentrez)

Visualizing two identical sequences

To help build our intuition about dotplots we’ll first look at some artificial examples. First, we’ll see what happens when we make a dotplot comparing the alphabet with itself. The built-in LETTERS object in R contains the alphabet from A to Z. This is a sequence with no repeats.

#LETTERS
seqinr::dotPlot(LETTERS, LETTERS)

What we get is a perfect diagonal line.

Visualizing repeats

The code below creates a character vector containing the LETTERS object twice back-to-back. We then compare this 52-character vector to itself on a dot plot. This helps us visualize what a repeating sequence looks like on a dot plot.

LETTERS.2.times <- c(LETTERS, LETTERS)

seqinr::dotPlot(LETTERS.2.times, LETTERS.2.times)

Now we’ll make a vector of LETTERS three times back-to-back-to-back and plot this against itself.

LETTERS.3.times <- c(LETTERS, LETTERS, LETTERS)
  
seqinr::dotPlot(LETTERS.3.times, LETTERS.3.times)
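
One way to see why repeats produce extra parallel lines is to record how far each match lies from the main diagonal. The sketch below does this in base R for the doubled alphabet; the names letters_twice, matches and offsets are illustrative and are not used elsewhere in this tutorial.

```r
# LETTERS repeated twice, as in the doubled-alphabet example above
letters_twice <- c(LETTERS, LETTERS)

# Row and column indices of every matching cell in the comparison grid
matches <- which(outer(letters_twice, letters_twice, FUN = "=="),
                 arr.ind = TRUE)

# Distance of each match from the main diagonal (column minus row)
offsets <- matches[, "col"] - matches[, "row"]

# Number of matching cells on each diagonal
table(offsets)
```

Every match falls at an offset of 0 (the main diagonal) or at plus or minus 26 (the length of the repeated unit), which is why the plot shows one long line flanked by two shorter parallel lines.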

In a similar fashion, the code below creates a triple repeat in the object seq1. It uses the repeat function rep() to list the eight letters stored in seq.repeat (A, then C through I) three times in tandem.

seq.repeat <- c("A","C","D","E","F","G","H","I")

seq1 <- rep(seq.repeat, 3)

Make the dotplot:

seqinr::dotPlot(seq1, seq1)

Visualizing Inversions

Now let’s visualize inversions. An inversion is a segment of sequence that has been flipped backwards and reinserted into the rest of the sequence. The code below creates a sequence with an inversion using the LETTERS object and the rev() function (short for “reverse”). The reversed copy of LETTERS is inserted in the middle of the sequence, between two forward copies.

In the object name below, “invert” is short for “inversion”.

LETTERS.3.times.with.invert <- c(LETTERS, rev(LETTERS), LETTERS)

seqinr::dotPlot(LETTERS.3.times.with.invert, LETTERS.3.times.with.invert)
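
The inverted copy shows up as a line running in the opposite direction to the main diagonal. A rough base R check of this is sketched below: along a forward diagonal the difference between column and row index is constant, while along a reversed diagonal their sum is constant. The names seq_inv, in_block and inv_matches are made up for this sketch.

```r
# Forward copy, inverted copy, forward copy (same layout as above)
seq_inv <- c(LETTERS, rev(LETTERS), LETTERS)

# Row and column indices of every matching cell
matches <- which(outer(seq_inv, seq_inv, FUN = "=="), arr.ind = TRUE)

# Keep matches between the first forward copy (rows 1-26)
# and the inverted copy (columns 27-52)
in_block <- matches[, "row"] <= 26 &
  matches[, "col"] > 26 & matches[, "col"] <= 52
inv_matches <- matches[in_block, ]

# Sum of indices: a single value means a reversed diagonal
unique(inv_matches[, "col"] + inv_matches[, "row"])

# Difference of indices: many values, so not a forward diagonal
length(unique(inv_matches[, "col"] - inv_matches[, "row"]))
```

All 26 matches in this block share the same index sum, so they lie on one line sloping the opposite way to the main diagonal.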

Visualizing translocations

Translocations are segments of sequence that have been removed from their original position and reinserted elsewhere. In this case, we will take letters 9 to 18, stored in object seg2, and place them at the end of the sequence of letters. Because the rearranged sequence is compared with itself here, the plot shows only the main diagonal, so a self-comparison like this one does not reveal the translocation.

seg1 <- LETTERS[1:8]
seg2 <- LETTERS[9:18]
seg3 <- LETTERS[19:26]

LETTERS.with.transloc <- c(seg1, seg3, seg2)

seqinr::dotPlot(LETTERS.with.transloc, LETTERS.with.transloc)
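
A translocation becomes visible when the rearranged sequence is compared with the original order rather than with itself. The sketch below, which assumes nothing beyond base R, tabulates where the matches fall in that comparison; with_transloc, matches and offsets are illustrative names.

```r
# Rebuild the three segments and the rearranged alphabet
seg1 <- LETTERS[1:8]
seg2 <- LETTERS[9:18]
seg3 <- LETTERS[19:26]
with_transloc <- c(seg1, seg3, seg2)

# Compare the original alphabet (rows) with the rearranged one (columns)
matches <- which(outer(LETTERS, with_transloc, FUN = "=="),
                 arr.ind = TRUE)

# Distance of each match from the main diagonal (column minus row)
offsets <- matches[, "col"] - matches[, "row"]

# One group of matches per segment, each on its own diagonal
table(offsets)
```

The matches split into three diagonals, one per segment, so plotting seqinr::dotPlot(LETTERS, with_transloc) should show three separate line segments instead of one unbroken diagonal.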

Visualizing Non-Homologous Sequences

Now let’s look at what two random, unrelated sequences look like plotted against one another. As an example, the first line of code below uses sample() to draw a random sample of 26 letters with replacement (your letters will differ because the draw is random). The code after that stores two random samples of 26 letters, drawn without replacement, in separate objects and plots them against one another. The result is noise.

sample(x = LETTERS, size = 26, replace = TRUE)

##  [1] "I" "O" "O" "K" "I" "Y" "I" "Z" "N" "B" "Y" "Z" "Q" "I" "I" "E" "J" "G" "L"
## [20] "P" "E" "V" "A" "W" "C" "Z"

# sampling all 26 letters without replacement shuffles the alphabet
letters.rand1 <- sample(x = LETTERS, size = 26, replace = FALSE)
letters.rand2 <- sample(x = LETTERS, size = 26, replace = FALSE)

seqinr::dotPlot(letters.rand1, letters.rand2)
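
The "noise" can be quantified with a small base R sketch. It assumes an arbitrary seed (42) purely so the shuffles are repeatable; the names rand1, rand2 and match_matrix are made up for the example.

```r
set.seed(42)  # arbitrary seed, chosen only for repeatability
rand1 <- sample(x = LETTERS, size = 26, replace = FALSE)
rand2 <- sample(x = LETTERS, size = 26, replace = FALSE)

# TRUE wherever the two shuffled alphabets have the same letter
match_matrix <- outer(rand1, rand2, FUN = "==")

# Each letter occurs once in each shuffle, so every row and every
# column of the grid contains exactly one match
sum(match_matrix)
all(rowSums(match_matrix) == 1)
all(colSums(match_matrix) == 1)

# Number of matches that are followed by another match one step
# down and to the right, i.e. that extend a diagonal run
n <- length(rand1)
sum(match_matrix[-1, -1] & match_matrix[-n, -n])
```

There are always exactly 26 dots, but very few of them line up into diagonal runs, which is what distinguishes unrelated sequences from related ones.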

Download sequences

Now we’ll make a real dotplot of the chorismate lyase proteins from two closely related species, Mycobacterium leprae and Mycobacterium ulcerans.

Note - these are protein sequences so db = “protein”

First we’ll use the entrez_fetch() function from the rentrez package to fetch the sequences we want to work with in FASTA format. entrez_fetch() requires us to specify the database to pull from (db = ), the accession number we want (id = ), and the file type (rettype = ). After we’ve stored both sequences in R objects, we can clean them up using fasta_cleaner() from the compbio4all package.

# sequence 1: Q9CD83
leprae_fasta <- rentrez::entrez_fetch(db = "protein",
                                      id = "Q9CD83",
                                      rettype = "fasta")
# sequence 2: OIN17619.1
ulcerans_fasta <- rentrez::entrez_fetch(db = "protein",
                                        id = "OIN17619.1",
                                        rettype = "fasta")

leprae_vector   <- compbio4all::fasta_cleaner(leprae_fasta)
ulcerans_vector <- compbio4all::fasta_cleaner(ulcerans_fasta)
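
To give a feel for what a cleaning step has to do before dotPlot() can use a downloaded record, here is a rough base R sketch on a made-up FASTA string. It is not the code inside fasta_cleaner(), and toy_fasta and the other names are hypothetical.

```r
# Hypothetical FASTA text: one header line starting with ">",
# followed by the sequence split across several lines
toy_fasta <- ">toy_id hypothetical protein\nMKVLAAGI\nVRDEW\n"

# Split into lines, drop the header, and join the sequence lines
fasta_lines <- strsplit(toy_fasta, split = "\n", fixed = TRUE)[[1]]
seq_lines   <- fasta_lines[!startsWith(fasta_lines, ">")]
seq_string  <- paste(seq_lines, collapse = "")

# Break the string into a vector with one residue per element,
# the same shape as the LETTERS vectors plotted earlier
toy_vector <- strsplit(seq_string, split = "")[[1]]

length(toy_vector)
head(toy_vector)
```

After cleaning, it is worth checking length() and head() on the real vectors as well, to confirm that the header and line breaks are gone.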

Dot Plots on Real Data

We can create a dotplot for two sequences using the dotPlot() function in the seqinr package.

Dotplots are also frequently made from a single sequence plotted against itself, to investigate the sequence for the presence of repeats, as in the artificial examples above. Here we plot the two chorismate lyase sequences against each other.

(Note - an older version of this exercise stated that single-sequence analysis wasn’t normally done; this was written last year before I knew of the use of dotplots for investigating sequence repeats.)

seqinr::dotPlot(leprae_vector, ulcerans_vector)

The main pattern here is that the two sequences are similar: the line of identity along the diagonal is mostly preserved. There are few off-diagonal lines, so there is little sign of segments that are repeated at different positions in the two sequences.

In the dotplot above, the M. leprae sequence is plotted along the x-axis (horizontal axis), and the M. ulcerans sequence is plotted along the y-axis (vertical axis). The dotplot displays a dot at points where there is an identical amino acid in the two sequences.

For example, if amino acid 53 in the M. leprae sequence is the same amino acid (e.g. “W”) as amino acid 70 in the M. ulcerans sequence, then the dotplot will show a dot at the position in the plot where x = 53 and y = 70.

In this case you can see a lot of dots along a diagonal line, which indicates that the two protein sequences contain many identical amino acids at the same (or very similar) positions along their lengths. This is what you would expect, because we know that these two proteins are homologs (related proteins) because they share a close evolutionary history.
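
Scattered single dots away from the diagonal are mostly chance matches. A common way to suppress them is to require several matches within a short window, which is the idea behind the wsize and nmatch arguments of dotPlot(). The base R sketch below illustrates that idea on two made-up toy sequences; it is an assumption-laden illustration and has not been checked to reproduce seqinr's output exactly.

```r
# Toy sequences (hypothetical): seq_b is seq_a with one substitution
seq_a <- strsplit("MKVLAAGIVRDEWMKV", split = "")[[1]]
seq_b <- strsplit("MKVLAAGIVKDEWMKV", split = "")[[1]]

wsize  <- 3  # window length
nmatch <- 3  # matches required within a window

m   <- outer(seq_a, seq_b, FUN = "==")
n_a <- length(seq_a)
n_b <- length(seq_b)

# For every window start (i, j), count the matches along the diagonal
# m[i, j], m[i + 1, j + 1], ..., m[i + wsize - 1, j + wsize - 1]
window_hits <- matrix(0L, nrow = n_a - wsize + 1, ncol = n_b - wsize + 1)
for (k in 0:(wsize - 1)) {
  window_hits <- window_hits +
    m[(1 + k):(n_a - wsize + 1 + k), (1 + k):(n_b - wsize + 1 + k)]
}

# Keep only window starts with enough matches
filtered <- window_hits >= nmatch

sum(m)         # dots with no filtering
sum(filtered)  # window starts that pass the filter
```

With the real data, a call such as seqinr::dotPlot(leprae_vector, ulcerans_vector, wsize = 3, nmatch = 3) applies the same kind of filter, so isolated dots drop out and the diagonal stands out more clearly.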

Introduction to dotplots in R

Nathan Brouwer

10/21/2021