By: Avril Coghlan.
Adapted, edited and expanded: Nathan Brouwer under the Creative Commons 3.0 Attribution License (CC BY 3.0).
NOTE: I’ve added some new material that is rather terse and lacks explication.
Good sources of more info: https://omicstutorials.com/interpreting-dot-plot-bioinformatics-with-an-example/
As a first step in comparing two protein, RNA or DNA sequences, it is a good idea to make a dotplot. A dotplot is a graphical method for comparing two sequences and identifying regions of close similarity between them. It is essentially a two-dimensional matrix (like a grid), with one of the sequences being compared along the horizontal axis and the other along the vertical axis.
To make a simple dotplot that represents the similarity between two sequences, individual cells in the matrix are shaded black if the residues at those two positions are identical, so that matching sequence segments appear as runs of dots forming diagonal lines across the matrix. Identical proteins will have a line exactly on the main diagonal of the dotplot that spans the whole matrix.
For proteins that are not identical but share regions of similarity, the dotplot will have shorter lines that may be on or off the main diagonal of the matrix. In essence, a dotplot reveals whether there are any regions that are clearly very similar in two protein (or DNA) sequences.
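The grid described above can be sketched in a few lines of base R. This is only an illustration of the idea, not how any particular package draws its plot; the two short sequences and the names seq_x, seq_y and dot_matrix are made up for the example.

```r
# Two short made-up sequences, stored as character vectors
# (one residue per element).
seq_x <- c("A", "C", "D", "E", "F")
seq_y <- c("A", "C", "D", "G", "F")

# outer() compares every residue of seq_x with every residue of seq_y.
# Rows correspond to positions in seq_x, columns to positions in seq_y.
dot_matrix <- outer(seq_x, seq_y, FUN = "==")
dimnames(dot_matrix) <- list(seq_x, seq_y)

# TRUE marks a cell that would be shaded in the dotplot.
print(dot_matrix)

# Row and column positions of the matching cells.
which(dot_matrix, arr.ind = TRUE)
```

The two sequences differ only at position 4, so every cell on the main diagonal is TRUE except that one.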
library(compbio4all)
library(rentrez)
To help build our intuition about dotplots we’ll first look at some artificial examples. First, we’ll see what happens when we make a dotplot comparing the alphabet with itself. The built-in LETTERS object in R contains the alphabet from A to Z. This is a sequence with no repeats.
# LETTERS   # uncomment to print the alphabet
# compare the alphabet with itself
seqinr::dotPlot(LETTERS, LETTERS)
What we get is a perfect diagonal line.
The vector created here contains the alphabet twice. Comparing it with itself gives the main diagonal plus shorter parallel lines, one for each place where one copy of the alphabet lines up with the other copy.
LETTERS.2.times <- c(LETTERS, LETTERS)
seqinr::dotPlot(LETTERS.2.times,
LETTERS.2.times)
This does the same thing as above, but with the alphabet repeated three times, which adds a second pair of parallel lines further from the main diagonal.
LETTERS.3.times <- c(LETTERS, LETTERS, LETTERS)
seqinr::dotPlot(LETTERS.3.times,
                LETTERS.3.times)
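To see where the parallel lines come from, one option is to measure how far each matching cell lies from the main diagonal. The sketch below uses base R only; letters_3x, dot_matrix, hits and offsets are names chosen for this example.

```r
# Alphabet repeated three times, as in the example above.
letters_3x <- rep(LETTERS, times = 3)

# TRUE wherever the letter in a row position equals the letter
# in a column position.
dot_matrix <- outer(letters_3x, letters_3x, FUN = "==")

# For every matching cell, compute its distance from the main diagonal.
hits <- which(dot_matrix, arr.ind = TRUE)
offsets <- hits[, "col"] - hits[, "row"]

# Each distinct offset is one diagonal line; the count is the number
# of dots on that line.
table(offsets)
```

The offsets are all multiples of 26, the length of the repeated unit: 0 is the main diagonal, and the lines at 26 and 52 (and their negatives) are the copies of the alphabet matching each other.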
This takes a short sequence and repeats it three times in a row using rep(). The resulting dotplot looks very similar to the one above.
seq.repeat <- c("A","C","D","E","F","G","H","I")
seq1 <- rep(seq.repeat, 3)
Make the dotplot:
seqinr::dotPlot(seq1,
                seq1)
This vector contains the alphabet three times, with the middle copy reversed. This kind of rearrangement is called an inversion. In the dotplot, the inverted copy shows up as lines running perpendicular to the main diagonal.
In the variable name, “invert” means “inversion”.
LETTERS.3.times.with.invert <- c(LETTERS, rev(LETTERS), LETTERS)
seqinr::dotPlot(LETTERS.3.times.with.invert,
                LETTERS.3.times.with.invert)
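A rough way to check how the inverted copy behaves is to look at the matching positions directly. In this base R sketch (seq_inv, dot_matrix, hits and first_vs_reversed are names made up for the example), matches between a forward copy and the reversed copy fall on a line where row + column is constant, rather than column - row.

```r
# Forward alphabet, reversed alphabet, forward alphabet.
seq_inv <- c(LETTERS, rev(LETTERS), LETTERS)

dot_matrix <- outer(seq_inv, seq_inv, FUN = "==")
hits <- which(dot_matrix, arr.ind = TRUE)

# Position 1 holds "A". It matches itself, the "A" at the end of the
# reversed block, and the "A" at the start of the third block.
which(dot_matrix[1, ])

# Matches between the first block (rows 1 to 26) and the reversed
# block (columns 27 to 52).
first_vs_reversed <- hits[hits[, "row"] <= 26 & hits[, "col"] %in% 27:52, ,
                          drop = FALSE]

# Row + column is the same for all of them, which is what makes the
# line run perpendicular to the main diagonal.
unique(first_vs_reversed[, "row"] + first_vs_reversed[, "col"])
```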
The sequences we study are often very long, so it is important to be able to compare segments of them. Another thing that DNA can do is translocate, where a segment moves to a new position. The code below breaks the alphabet into three segments, swaps the order of the last two to mimic a translocation, and makes a dotplot of the rearranged sequence.
seg1 <- LETTERS[1:8]
seg2 <- LETTERS[9:18]
seg3 <- LETTERS[19:26]   # start at 19 so "R" is not included in both seg2 and seg3
# swap the order of seg2 and seg3
LETTERS.with.transloc <- c(seg1, seg3, seg2)
seqinr::dotPlot(LETTERS.with.transloc,
                LETTERS.with.transloc)
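When no letter occurs more than once, comparing a rearranged sequence with itself gives only the main diagonal, so the moved segments are easier to see by comparing the original order against the rearranged one. The following is a minimal sketch in base R, assuming non-overlapping segments (seg3 starting at position 19); original, rearranged, hits and offsets are names chosen for this example.

```r
# Original order, and a copy with the last two segments swapped.
original <- LETTERS
seg1 <- LETTERS[1:8]
seg2 <- LETTERS[9:18]
seg3 <- LETTERS[19:26]   # assumption: no letter shared with seg2
rearranged <- c(seg1, seg3, seg2)

# Rows are positions in the original, columns are positions in the
# rearranged sequence.
dot_matrix <- outer(original, rearranged, FUN = "==")
hits <- which(dot_matrix, arr.ind = TRUE)
offsets <- hits[, "col"] - hits[, "row"]

# One diagonal line per segment: seg1 stays on the main diagonal
# (offset 0), while seg2 and seg3 are shifted in opposite directions.
table(offsets)

# Draw the comparison if the seqinr package is installed.
if (requireNamespace("seqinr", quietly = TRUE)) {
  seqinr::dotPlot(original, rearranged)
}
```

Each segment keeps its internal order, so it still forms a diagonal line; only its distance from the main diagonal changes.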
Sequences that are compared are often unrelated and share no pattern. The first line of code below draws a sample of the alphabet with replacement, so the same letter can be selected more than once. The samples used in the dotplot are drawn without replacement, so each letter is picked exactly once.
sample(x = LETTERS, size = 26, replace = TRUE)
## [1] "W" "N" "X" "U" "V" "V" "C" "N" "H" "U" "L" "W" "T" "K" "H" "H" "A" "R" "T"
## [20] "O" "R" "N" "E" "H" "J" "C"
# two independent random orderings of the alphabet
letters.rand1 <- sample(x = LETTERS, size = 26, replace = FALSE)
letters.rand2 <- sample(x = LETTERS, size = 26, replace = FALSE)
seqinr::dotPlot(letters.rand1,
                letters.rand2)
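Because sample() gives a different result on every run, it can help to fix the random seed while experimenting. The sketch below is one way to do that in base R; the seed value and the names rand1, rand2 and dot_matrix are arbitrary choices for this example.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

# Two independent random orderings of the alphabet
# (sample() without replacement returns a permutation).
rand1 <- sample(LETTERS)
rand2 <- sample(LETTERS)

dot_matrix <- outer(rand1, rand2, FUN = "==")

# Every letter occurs exactly once in each vector, so there is exactly
# one dot per row and per column, 26 in total.
sum(dot_matrix)

# Number of dots on each diagonal; the dots are scattered rather than
# lined up along a single diagonal.
hits <- which(dot_matrix, arr.ind = TRUE)
table(hits[, "col"] - hits[, "row"])
```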
Now we’ll make a real dotplot of the chorismate lyase proteins from two closely related species, Mycobacterium leprae and Mycobacterium ulcerans.
Note - these are protein sequences so db = “protein”
These steps download the sequences in FASTA format so that they can be compared using a dotplot.
# sequence 1: Q9CD83
leprae_fasta <- rentrez::entrez_fetch(db = "protein",
                                      id = "Q9CD83",
                                      rettype = "fasta")
# sequence 2: OIN17619.1
ulcerans_fasta <- rentrez::entrez_fetch(db = "protein",
                                        id = "OIN17619.1",
                                        rettype = "fasta")
# clean each FASTA record into a vector of amino acids for dotPlot()
leprae_vector <- compbio4all::fasta_cleaner(leprae_fasta)
ulcerans_vector <- compbio4all::fasta_cleaner(ulcerans_fasta)
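To give a rough idea of what the cleaning step has to accomplish, here is a base R sketch that turns a FASTA-formatted string into a vector with one residue per element. It is not the code of compbio4all::fasta_cleaner(); clean_fasta_sketch() is a hypothetical helper written for this example, and toy_fasta is a made-up record, not a real database entry.

```r
# A made-up FASTA record: a ">" header line followed by sequence lines,
# all in one string with "\n" marking the line breaks.
toy_fasta <- ">toy_id hypothetical protein [made-up organism]\nMTNRTLSREE\nIRKLDRDLRI\n"

# Hypothetical helper, for illustration only.
clean_fasta_sketch <- function(fasta_string) {
  # Drop the header: everything from ">" up to and including the
  # first line break.
  no_header <- sub("^>[^\n]*\n", "", fasta_string)
  # Remove the remaining line breaks so the sequence is one string.
  one_line <- gsub("\n", "", no_header, fixed = TRUE)
  # Split into a character vector with one residue per element.
  strsplit(one_line, split = "")[[1]]
}

toy_vector <- clean_fasta_sketch(toy_fasta)
length(toy_vector)
head(toy_vector)
```

A vector in this form, with one amino acid per element, is what the dotplot examples in this section pass to dotPlot().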
We can create a dotplot for two sequences using the dotPlot() function in the seqinr package.
A dotplot can also be created using only a single sequence, by comparing it with itself. This is frequently done to investigate a sequence for the presence of repeats.
(Note - an older version of this exercise stated that this kind of analysis wasn’t normally done; that was written last year, before I knew of the use of dotplots for investigating sequence repeats.)
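A single-sequence dotplot passes the same vector as both arguments. As a minimal sketch, here is a made-up sequence containing one tandem repeat (toy_seq and self_matrix are names chosen for this example); with the real data, the equivalent would be to pass leprae_vector twice.

```r
# Made-up sequence: two copies of "GHIKL" with unique flanking residues.
toy_seq <- strsplit("ACDEFGHIKLGHIKLMNPQR", split = "")[[1]]

# Self-comparison: the same vector on both axes.
self_matrix <- outer(toy_seq, toy_seq, FUN = "==")

# Matches away from the main diagonal reveal the repeat.
hits <- which(self_matrix, arr.ind = TRUE)
offsets <- hits[, "col"] - hits[, "row"]
table(offsets)

# Draw the plot if the seqinr package is installed.
if (requireNamespace("seqinr", quietly = TRUE)) {
  seqinr::dotPlot(toy_seq, toy_seq)
}
```

Besides the main diagonal (offset 0), there are short lines at offsets 5 and -5, which is the distance between the two copies of the repeated segment.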
# dotplot of M. leprae (x-axis) against M. ulcerans (y-axis)
seqinr::dotPlot(leprae_vector,
                ulcerans_vector)
There is a rough diagonal line, which means that many positions in the two sequences hold the same amino acid at the same, or nearly the same, position. This could suggest that one sequence is descended from the other or that they share a close common ancestor.
In the dotplot above, the M. leprae sequence is plotted along the x-axis (horizontal axis), and the M. ulcerans sequence is plotted along the y-axis (vertical axis). The dotplot displays a dot at points where there is an identical amino acid in the two sequences.
For example, if amino acid 53 in the M. leprae sequence is the same amino acid (e.g. “W”) as amino acid 70 in the M. ulcerans sequence, then the dotplot will show a dot at the position in the plot where x = 53 and y = 70.
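The mapping from sequence positions to plot coordinates can be sketched with two short made-up peptides (not the real chorismate lyase sequences). Here seq_x plays the role of the sequence on the x-axis and seq_y the one on the y-axis; all names in the block are chosen for this example.

```r
# Made-up peptides: seq_y contains the same motif as seq_x, shifted
# two positions to the right.
seq_x <- strsplit("MKTWLA", split = "")[[1]]    # "W" is residue 4
seq_y <- strsplit("GSMKTWLA", split = "")[[1]]  # "W" is residue 6

dot_matrix <- outer(seq_x, seq_y, FUN = "==")
hits <- which(dot_matrix, arr.ind = TRUE)

# One row per dot: x is the position in seq_x, y the position in seq_y.
coords <- data.frame(x = hits[, "row"],
                     y = hits[, "col"],
                     residue = seq_x[hits[, "row"]])
coords

# The dot for "W" sits at x = 4, y = 6.
coords[coords$residue == "W", ]
```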
In this case you can see a lot of dots along a diagonal line, which indicates that the two protein sequences contain many identical amino acids at the same (or very similar) positions along their lengths. This is what you would expect, because we know that these two proteins are homologs (related proteins) that share a close evolutionary history.