Finding near-duplicated strings

Trimming whitespace

teststr <- c(" abc", "def ghi", "jkl ")
gsub("(^ +| +$)", "", teststr)
## [1] "abc"     "def ghi" "jkl"
library(stringr)
str_trim(teststr)
## [1] "abc"     "def ghi" "jkl"

(Note that the strtrim() function in base R does something completely different: it truncates each string to a given display width.)
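To make the difference concrete, here is a small sketch contrasting strtrim() with trimws(), a base R function (available since R 3.2.0) that strips surrounding whitespace without needing stringr. The vector name x is just an illustrative choice.

```r
x <- c(" abc", "def ghi", "jkl ")

# trimws() strips leading and trailing whitespace (base R, no packages needed)
trimmed <- trimws(x)
trimmed
## [1] "abc"     "def ghi" "jkl"

# strtrim() instead truncates each string to the given width (here 3)
truncated <- strtrim(x, 3)
truncated
## [1] " ab" "def" "jkl"
```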

Finding nearby strings

This Stack Overflow post points out the basic method.

It's worth trimming and case-folding (using tolower() or toupper()) first, so that differences in whitespace or capitalization don't count as edits.
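A minimal sketch of that normalization step, using made-up example data (raw and clean are illustrative names): after trimming and lower-casing, strings that differed only in whitespace or case become exact duplicates and can be caught with unique() or duplicated() before any fuzzy matching.

```r
raw <- c(" Apple", "apple ", "APPLE", "Banana")

# normalize: strip surrounding whitespace, then fold to lower case
clean <- tolower(trimws(raw))

unique(clean)
## [1] "apple"  "banana"

# flags every element that repeats an earlier one
duplicated(clean)
## [1] FALSE  TRUE  TRUE FALSE
```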

The agrep() function looks for approximate matches of a pattern within each element of a character vector; the adist() function computes pairwise edit (generalized Levenshtein) distances between strings.
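One practical difference between the two is worth a sketch: agrep() matches the pattern against any substring of each candidate, while adist() (with its default settings) compares whole strings. The example strings below are made up for illustration, and max.distance = 1 is an arbitrary choice of threshold.

```r
candidates <- c("abcd", "abce", "zzzx")

# allow at most one edit between the pattern and (part of) each candidate
hits <- agrep("abcf", candidates, max.distance = 1, value = TRUE)
hits
## [1] "abcd" "abce"

# whole-string edit distances from the pattern to each candidate
d <- drop(adist("abcf", candidates))
d
## [1] 1 1 4

# agrep() still matches when the near-match is buried inside a longer string...
sub_hit <- agrep("abcf", "xxabcdxx", max.distance = 1, value = TRUE)
# ...whereas the whole-string distance from adist() is much larger
whole_dist <- drop(adist("abcf", "xxabcdxx"))
```

For deduplication, where whole strings are being compared, adist() is therefore usually the safer basis for a threshold.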

teststr2 <- c("abcd", "abce", "zzzx")
agrep("abcf", teststr2, value = TRUE)
## [1] "abcd" "abce"
(distmat <- adist(teststr2, teststr2))
##      [,1] [,2] [,3]
## [1,]    0    1    4
## [2,]    1    0    4
## [3,]    4    4    0
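# label every pair of strings as "a:b"
strmat <- outer(teststr2, teststr2, paste, sep = ":")
# keep pairs at edit distance < 2, counting each pair once (upper triangle)
strmat[distmat < 2 & row(distmat) < col(distmat)]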
## [1] "abcd:abce"
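The same idea can be wrapped in a small helper that returns the close pairs as a data frame rather than pasted labels. This is only a sketch: the function name near_dup_pairs and its max_dist argument are hypothetical, not part of base R or any package.

```r
# Hypothetical helper: list all pairs of strings within max_dist edits
near_dup_pairs <- function(x, max_dist = 1) {
  d <- adist(x)  # pairwise edit distances
  # upper.tri() keeps each unordered pair once and drops the diagonal
  idx <- which(d <= max_dist & upper.tri(d), arr.ind = TRUE)
  data.frame(a = x[idx[, "row"]],
             b = x[idx[, "col"]],
             dist = d[idx],
             stringsAsFactors = FALSE)
}

res <- near_dup_pairs(c("abcd", "abce", "zzzx"))
res
##      a    b dist
## 1 abcd abce    1
```

Note that adist() builds a full n-by-n matrix, so this approach is only practical for moderately sized vectors.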