By: Avril Coghlan.
Adapted, edited and expanded: Nathan Brouwer under the Creative Commons 3.0 Attribution License (CC BY 3.0).
NOTE: I’ve added some new material that is rather terse and lacks explication.
Good sources of more info: https://omicstutorials.com/interpreting-dot-plot-bioinformatics-with-an-example/
As a first step in comparing two protein, RNA or DNA sequences, it is a good idea to make a dotplot. A dotplot is a graphical method for comparing two sequences and identifying regions of close similarity between them. It is essentially a two-dimensional matrix (like a grid) with one sequence along the vertical axis and the other along the horizontal axis.
To make a simple dotplot representing the similarity between two sequences, each cell in the matrix is shaded black if the residues at that row and column are identical, so that matching sequence segments appear as diagonal lines across the matrix. Identical proteins will have a line exactly on the main diagonal of the dotplot that spans the whole matrix.
For proteins that are not identical, but share regions of similarity, the dotplot will have shorter lines that may be on the main diagonal, or off the main diagonal of the matrix. In essence, a dotplot will reveal if there are any regions that are clearly very similar in two protein (or DNA) sequences.
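The grid described above can be sketched in base R without any plotting package. This is only an illustration of the idea, not how seqinr::dotPlot() is implemented; the two short sequences and the names seq_a, seq_b and dot_matrix are made up for the example.

```r
# Two short, made-up sequences that differ at position 3 (illustrative only)
seq_a <- c("M", "K", "V", "L", "A")
seq_b <- c("M", "K", "I", "L", "A")

# outer() compares every residue of seq_a with every residue of seq_b;
# TRUE marks a cell that would be shaded black in the dotplot
dot_matrix <- outer(seq_a, seq_b, FUN = "==")
dimnames(dot_matrix) <- list(seq_a, seq_b)
print(dot_matrix)

# Number of matches that fall on the main diagonal
sum(diag(dot_matrix))
```

Rows correspond to the first sequence and columns to the second, so the mismatch at position 3 shows up as a gap in the diagonal.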
library(compbio4all)
library(rentrez)
To help build our intuition about dotplots we’ll first look at some artificial examples. First, we’ll see what happens when we make a dotplot comparing the alphabet versus itself. The built-in LETTERS object in R contains the alphabet from A to Z. This is a sequence with no repeats.
# LETTERS   # uncomment to print the alphabet
seqinr::dotPlot(LETTERS, LETTERS)  # compare the alphabet with itself
What we get is a perfect diagonal line.
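Next, repeat the alphabet twice. This creates a 52x52 grid with 3 diagonal lines: the main diagonal plus one shorter line on each side of it. The plot is essentially a 2x2 tiling of the original 26x26 plot with its single diagonal.
# the alphabet repeated twice
LETTERS.2.times <- c(LETTERS, LETTERS)
seqinr::dotPlot(LETTERS.2.times,
                LETTERS.2.times)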
Repeating the alphabet three times works the same way, except that the result is a 78x78 grid with 5 diagonal lines, a 3x3 tiling of the original single-diagonal plot.
# the alphabet repeated three times
LETTERS.3.times <- c(LETTERS, LETTERS, LETTERS)
seqinr::dotPlot(LETTERS.3.times, LETTERS.3.times)
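The counts of 1, 3 and 5 diagonal lines can be checked numerically. The sketch below assumes that every match lies on a diagonal, which holds for these artificial sequences; count_diagonals is a hypothetical helper written for this example and is not part of seqinr.

```r
# Hypothetical helper: count the distinct diagonals in a self-comparison
count_diagonals <- function(x) {
  m <- outer(x, x, FUN = "==")          # TRUE where residues match
  hits <- which(m, arr.ind = TRUE)      # row/column index of every match
  # cells on the same diagonal share the same (column - row) offset
  length(unique(hits[, "col"] - hits[, "row"]))
}

# Alphabet repeated 1, 2 and 3 times
diag_counts <- sapply(1:3, function(n) count_diagonals(rep(LETTERS, n)))
diag_counts
# 1 3 5
```

In these examples a sequence made of n copies of a non-repeating unit gives 2n - 1 diagonal lines.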
The same pattern appears with a shorter repeated unit. Repeating the 8-letter segment below 3 times creates a 24x24 plot with 5 diagonal lines, just like the last example, because the pattern of letters is repeated 3 times.
seq.repeat <- c("A","C","D","E","F","G","H","I")
# repeat the segment 3 times
seq1 <- rep(seq.repeat, 3)
Make the dotplot:
seqinr::dotPlot(seq1, seq1)
An inversion is a segment that appears in reverse order (“invert” in the object name is short for “inversion”). Here rev() reverses the middle copy of the alphabet, creating an inversion in the sequence.
# the middle copy of the alphabet is reversed
LETTERS.3.times.with.invert <- c(LETTERS, rev(LETTERS), LETTERS)
seqinr::dotPlot(LETTERS.3.times.with.invert, LETTERS.3.times.with.invert)
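To see why an inversion looks different from a plain repeat, it may help to look at where the matches fall. The base R sketch below (with made-up object names x, m, block and hits) compares the first copy of the alphabet with the reversed copy: the matches share a constant row + column sum, so the line slopes the opposite way from the main diagonal.

```r
x <- c(LETTERS, rev(LETTERS), LETTERS)
m <- outer(x, x, FUN = "==")

# Rows 1-26: first (forward) copy; columns 27-52: reversed copy
block <- m[1:26, 27:52]
hits <- which(block, arr.ind = TRUE)

# On a forward diagonal (col - row) is constant;
# on a reversed (anti-)diagonal (row + col) is constant
anti_sum <- unique(hits[, "row"] + hits[, "col"])
fwd_offsets <- unique(hits[, "col"] - hits[, "row"])
anti_sum              # a single value: all matches lie on one anti-diagonal
length(fwd_offsets)   # many values: the matches do not share a forward diagonal
```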
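A translocation moves a segment to a new position. Here the alphabet is split into 3 domains and the last two are swapped. Comparing the original order (wt) against the rearranged sequence shows the moved segments as lines off the main diagonal.
seg1 <- LETTERS[1:8]
seg2 <- LETTERS[9:18]
seg3 <- LETTERS[19:26]  # starts at 19 so that seg2 and seg3 do not overlap
# original ("wild type") order
wt <- c(seg1, seg2, seg3)
# seg2 and seg3 swapped
LETTERS.with.transloc <- c(seg1, seg3, seg2)
# compare the original with the rearranged sequence
seqinr::dotPlot(wt, LETTERS.with.transloc)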
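How far each segment moves can also be read off numerically. This sketch assumes three non-overlapping segments and compares the original order against the rearranged order; the names seg_a, seg_b, seg_c, original and swapped are made up here so the block runs on its own.

```r
seg_a <- LETTERS[1:8]
seg_b <- LETTERS[9:18]
seg_c <- LETTERS[19:26]

original <- c(seg_a, seg_b, seg_c)
swapped  <- c(seg_a, seg_c, seg_b)   # last two segments exchanged

# rows index the original sequence, columns the rearranged one
m <- outer(original, swapped, FUN = "==")
hits <- which(m, arr.ind = TRUE)

# one (column - row) offset per diagonal line
offsets <- sort(unique(hits[, "col"] - hits[, "row"]))
offsets
# -10 0 8
```

An offset of 0 is the unmoved first segment on the main diagonal; the other two offsets are the swapped segments, each displaced by the length of the segment it traded places with.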
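Random sequences show what a dotplot looks like when there is no real similarity, as when comparing two evolutionarily unrelated sequences picked at random from a database. The sample() function draws letters at random; with replace = TRUE the same letter can be drawn more than once:
sample(x = LETTERS, size = 26, replace = TRUE)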
## [1] "O" "X" "C" "C" "A" "X" "B" "C" "V" "I" "B" "A" "W" "D" "U" "B" "N" "R" "M"
## [20] "U" "H" "T" "U" "R" "O" "X"
# replace = FALSE shuffles the alphabet, so each letter appears exactly once
letters.rand1 <- sample(x = LETTERS, size = 26, replace = FALSE)
letters.rand2 <- sample(x = LETTERS, size = 26, replace = FALSE)
seqinr::dotPlot(letters.rand1, letters.rand2)
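A quick check of what "no similarity" means here, as a base R sketch. The seed value and the names rand_a, rand_b and per_diagonal are arbitrary choices for this example. Because each shuffled sequence contains every letter exactly once, the plot always has exactly 26 dots, but they are scattered rather than lined up.

```r
set.seed(42)  # arbitrary seed, used only to make the example reproducible

rand_a <- sample(LETTERS, size = 26, replace = FALSE)
rand_b <- sample(LETTERS, size = 26, replace = FALSE)

m <- outer(rand_a, rand_b, FUN = "==")

# each letter occurs once per sequence, so every row and column holds one dot
n_dots <- sum(m)
n_dots
# 26

# number of dots sharing each diagonal; a real diagonal line needs many
hits <- which(m, arr.ind = TRUE)
per_diagonal <- table(hits[, "col"] - hits[, "row"])
max(per_diagonal)
```

For comparison, the alphabet versus itself puts all 26 dots on a single diagonal.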
Now we’ll make a real dotplot of the chorismate lyase proteins from two closely related species, Mycobacterium leprae and Mycobacterium ulcerans.
Note that these are protein sequences, so db = “protein”.
First download the FASTA records from the NCBI database and clean them. The cleaned sequences can then be compared in a dotplot.
# sequence 1: Q9CD83
leprae_fasta <- rentrez::entrez_fetch(db = "protein",
                                      id = "Q9CD83",
                                      rettype = "fasta")
# sequence 2: OIN17619.1
ulcerans_fasta <- rentrez::entrez_fetch(db = "protein",
                                        id = "OIN17619.1",
                                        rettype = "fasta")
# clean each FASTA record so it can be passed to dotPlot()
leprae_vector <- compbio4all::fasta_cleaner(leprae_fasta)
ulcerans_vector <- compbio4all::fasta_cleaner(ulcerans_fasta)
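To make the cleaning step less of a black box, here is a rough base R sketch of the kind of work it involves: dropping the header line, removing line breaks, and splitting the sequence into single residues. clean_fasta_sketch is a hypothetical helper written for this example and is not the compbio4all::fasta_cleaner() implementation; the FASTA record is made up.

```r
# Hypothetical helper; NOT the compbio4all implementation
clean_fasta_sketch <- function(fasta_string) {
  # split the raw text into lines
  lines <- strsplit(fasta_string, "\n", fixed = TRUE)[[1]]
  # drop the header line, which starts with ">"
  lines <- lines[!startsWith(lines, ">")]
  # join the sequence lines and split into single characters
  sequence <- paste(lines, collapse = "")
  strsplit(sequence, "")[[1]]
}

# A made-up FASTA record (not a real protein)
toy_fasta <- ">toy_id made-up example record\nMKVLA\nGHWT\n"
toy_vector <- clean_fasta_sketch(toy_fasta)
toy_vector
```

The result is a character vector with one residue per element, the same shape as the LETTERS vectors used in the artificial examples.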