The goal of this document is to demonstrate that, mutatis mutandis, the implementations of the Smith-Waterman algorithm in the textreuse package and the Biostrings package return the same results.
suppressPackageStartupMessages(library("Biostrings"))
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
library("textreuse")
library("stringr")
Biostrings works by defining a substitution matrix, since there is a limited vocabulary (A, C, G, T). Create the substitution matrix, with the same scores we will use for textreuse.
s <- nucleotideSubstitutionMatrix(match = 2, mismatch = -1, baseOnly = TRUE,
type = "DNA")
s
## A C G T
## A 2 -1 -1 -1
## C -1 2 -1 -1
## G -1 -1 2 -1
## T -1 -1 -1 2
Test the Biostrings implementation. Note that Biostrings package implements both a gap opening and a gap extension penalty. Set the gap extension penalty to 0 since textreuse only implements a gap opening penalty.
a_letters <- "ACCCGTAAAAAA"
b_letters <- "ACCGGTCCCCCC"
bio <- pairwiseAlignment(a_letters, b_letters, type = "local",
gapOpening = 1, gapExtension = 0, substitutionMatrix = s)
Textreuse expects words, not letters, so add a space between each letter.
letters_to_words <- function(x) {
x %>% str_split("") %>% `[[`(1) %>% str_c(collapse = " ")
}
a_words <- letters_to_words(a_letters)
b_words <- letters_to_words(b_letters)
tr <- align_local(a_words, b_words, match = 2, mismatch = -1, gap = -1)
Compare the two:
bio
## Local PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: [1] ACCCGT
## subject: [1] ACCGGT
## score: 9
tr
## TextReuse alignment
## Alignment score: 9
## Document A:
## A C C C # G T
##
## Document B:
## A C C # G G T
The two implementations get the same score, 9. They extract the same substring, ACCCGT and ACCGGT. The main difference is in representation. Biostrings does not mark substitutions: e.g., G for C in the fourth character above, which makes sense with its limited vocabulary. But textreuse marks all substitutions as an insertion and a deletion, to draw attention to the change. (The code for that feature.)