The goal of this document is to demonstrate that, mutatis mutandis, the implementations of the Smith-Waterman algorithm in the textreuse package and the Biostrings package return the same results.

suppressPackageStartupMessages(library("Biostrings"))
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
library("textreuse")
library("stringr")

Biostrings works by defining a substitution matrix, since there is a limited vocabulary (A, C, G, T). Create the substitution matrix, with the same scores we will use for textreuse.

s <- nucleotideSubstitutionMatrix(match = 2, mismatch = -1, baseOnly = TRUE,
                                  type = "DNA")
s
##    A  C  G  T
## A  2 -1 -1 -1
## C -1  2 -1 -1
## G -1 -1  2 -1
## T -1 -1 -1  2

Test the Biostrings implementation. Note that Biostrings package implements both a gap opening and a gap extension penalty. Set the gap extension penalty to 0 since textreuse only implements a gap opening penalty.

a_letters <- "ACCCGTAAAAAA"
b_letters <- "ACCGGTCCCCCC"
bio <- pairwiseAlignment(a_letters, b_letters, type = "local",
                         gapOpening = 1, gapExtension = 0, substitutionMatrix = s)

Textreuse expects words, not letters, so add a space between each letter.

letters_to_words <- function(x) {
 x %>% str_split("") %>% `[[`(1) %>% str_c(collapse = " ")
}
a_words <- letters_to_words(a_letters)
b_words <- letters_to_words(b_letters)
tr <- align_local(a_words, b_words, match = 2, mismatch = -1, gap = -1)

Compare the two:

bio
## Local PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: [1] ACCCGT 
## subject: [1] ACCGGT 
## score: 9
tr
## TextReuse alignment
## Alignment score: 9 
## Document A:
## A C C C # G T
## 
## Document B:
## A C C # G G T

The two implementations get the same score, 9. They extract the same substring, ACCCGT and ACCGGT. The main difference is in representation. Biostrings does not mark substitutions: e.g., G for C in the fourth character above, which makes sense with its limited vocabulary. But textreuse marks all substitutions as an insertion and a deletion, to draw attention to the change. (The code for that feature.)