Regex Performance

Interfaces to at least four regular expression libraries are available to R users: PCRE and TRE, which are used by base R; ICU(4C), used by the stringi package; and Onigmo, via the ore package. Of these four, all but TRE have similar scope, being comparable to regular expression support in the Perl language. This notebook explores how the performance of the three remaining options measures up.
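
To make the difference in scope concrete, here is a small sketch of my own (not part of the benchmark): Perl-style lookbehind assertions are supported by PCRE but not, to the best of my knowledge, by TRE.

# perl=TRUE selects PCRE, which supports lookbehind assertions
grepl("(?<=a)b", "ab", perl=TRUE)
# The default TRE engine is expected to reject the same pattern
try(grepl("(?<=a)b", "ab"))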

Benchmarking regular expressions is notoriously difficult, as performance will depend on the exact test strings and regex patterns used. There are typically some pathological regular expressions which will degrade performance dramatically, but these are likely to be different for different libraries. So this benchmark only aims to be indicative rather than definitive. Nevertheless, it should give some idea about relative real-world performance.

Set-up

Code performance will always depend to some extent on the system on which it is running. For string functions it is also useful to know the locale that R is using. We can get information on both of these via R’s sessionInfo function.

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin17.2.0 (64-bit)
## Running under: macOS High Sierra 10.13.2
## 
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.3  backports_1.0.5 magrittr_1.5    rprojroot_1.2  
##  [5] tools_3.4.3     htmltools_0.3.6 Rcpp_0.12.14    stringi_1.1.6  
##  [9] rmarkdown_1.6   knitr_1.17      stringr_1.2.0   digest_0.6.12  
## [13] evaluate_0.10.1

It is also important to know the compiler flags used to build R, since optimisations can improve speed noticeably. These flags are also used to build packages, unless they are overridden. We can obtain them from R’s main Makeconf file.

cat(grep("^[#\\s]+configure", readLines(file.path(R.home("etc"),"Makeconf")), perl=TRUE, value=TRUE))
## # configure  '--prefix=/usr/local/Cellar/r/3.4.3' '--enable-memory-profiling' '--without-cairo' '--without-x' '--with-aqua' '--with-lapack' '--enable-R-shlib' 'SED=/usr/bin/sed' '--with-blas=-framework Accelerate' '--disable-java' 'PKG_CONFIG_PATH=/usr/local/opt/libpng/lib/pkgconfig:/usr/local/opt/pcre/lib/pkgconfig:/usr/local/opt/xz/lib/pkgconfig' 'PKG_CONFIG_LIBDIR=/usr/lib/pkgconfig:/usr/local/Homebrew/Library/Homebrew/os/mac/pkgconfig/10.13' 'CC=clang' 'LDFLAGS=-L/usr/local/opt/gettext/lib -L/usr/local/opt/readline/lib' 'CPPFLAGS=-I/usr/local/opt/gettext/include -I/usr/local/opt/readline/include' 'F77=/usr/local/bin/gfortran' 'CXX=clang++' 'OBJC=clang' 'FC=/usr/local/bin/gfortran'

Finally, we want to know the versions of the packages we are testing.

installed.packages()[c("base","ore","stringi"),"Version"]
##    base     ore stringi 
## "3.4.3" "1.6.0" "1.1.6"

When running the benchmark it is wise to build ore and stringi from source if possible, to ensure that the compiler flags match those used for base R.
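
On systems with the necessary build tools, something along the following lines should achieve this; type="source" simply forces a local build rather than a CRAN binary.

# Build ore and stringi locally, so that they pick up the same compiler
# flags as the base R installation
install.packages(c("ore","stringi"), type="source")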

Test cases

The test cases used here are heavily influenced by previous regex performance comparisons here and here. The text that we are searching through is the Project Gutenberg version of The Adventures of Sherlock Holmes, available for free as a UTF-8 encoded text file. The regular expressions tested are as follows:

Regex pattern                               Meaning
Sherlock                                    The literal string “Sherlock”
^Sherlock                                   “Sherlock” at the beginning of a line
Sherlock$                                   “Sherlock” at the end of a line
a[^x]{20}b                                  The letters “a” and “b”, separated by 20 other characters that aren’t “x”
Holmes|Watson                               Either of the strings “Holmes” or “Watson”
.{0,3}(Holmes|Watson)                       Zero to three characters, followed by either of the strings “Holmes” or “Watson”
[a-zA-Z]+ing                                Any word ending in “ing”
^([a-zA-Z]{0,4}ing)[^a-zA-Z]                Up to four letters followed by “ing” and then a non-letter, at the beginning of a line
[a-zA-Z]+ing$                               Any word ending in “ing”, at the end of a line
^[a-zA-Z ]{5,}$                             Lines consisting only of five or more letters and spaces
^.{16,20}$                                  Lines of between 16 and 20 characters
([a-f](.[d-m].){0,2}[h-n]){2}               Sequences of characters from certain sets (complex to explain!)
([A-Za-z]olmes|[A-Za-z]atson)[^a-zA-Z]      A word ending in “olmes” or “atson”, followed by a non-letter
"[^"]{0,30}[?!\.]"                          A quoted string of between 0 and 30 characters, ending with a punctuation mark
Holmes.{10,60}Watson|Watson.{10,60}Holmes   The names “Holmes” and “Watson” on the same line, separated by 10 to 60 other characters
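
The pattern described as “complex to explain” can at least be illustrated; the test string here is made up and is not part of the benchmark.

# "ahbi" satisfies ([a-f](.[d-m].){0,2}[h-n]){2}: each of the two repetitions
# uses zero copies of the inner group, leaving just [a-f] followed by [h-n]
regmatches("ahbi", regexpr("([a-f](.[d-m].){0,2}[h-n]){2}", "ahbi", perl=TRUE))  # returns "ahbi"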

We use the gregexpr (from base R), stri_locate_all_regex (from stringi) and ore.search (from ore) functions to find all matches, both considering the whole text as one long string, and with each line as a separate string. In the former case, the multiline option is given where needed to ensure that ^ and $ match the beginnings and ends of lines, rather than just the string as a whole. For ore we also consider the effect of precompiling the regular expression in each case. The microbenchmark package is used to run each regex on each library ten times, in a randomised order, and calculate run-times to microsecond precision.
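
To make the three interfaces concrete before the benchmark itself, here is a small sketch of the equivalent search calls on a made-up multiline string (assuming the three packages are loaded, as in the code below), including the multiline handling and precompilation just described.

txt <- "Sherlock spoke first.\nWatson listened.\nSherlock smiled."   # made-up example, not from the book

# Base R (PCRE): the (?m) prefix turns on multiline mode
gregexpr("(?m)^Sherlock", txt, perl=TRUE)

# stringi (ICU4C): multiline mode is set through the options object
stri_locate_all_regex(txt, "^Sherlock", opts_regex=stri_opts_regex(multiline=TRUE))

# ore (Onigmo): ^ and $ match at line boundaries by default
ore.search("^Sherlock", txt, all=TRUE)

# ore again, with the pattern compiled once up front
compiled <- ore("^Sherlock", encoding="UTF-8")
ore.search(compiled, txt, all=TRUE)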

R code

The following R code sets up the benchmark, loading in the target document and creating functions to run the searches and plot the results.

library(ore)
library(stringi)
library(microbenchmark)
library(methods)
library(ggplot2)

# Target document: The Adventures of Sherlock Holmes
text <- readLines("pg1661.txt", encoding="UTF-8")

regexes <- c("Sherlock",
             "^Sherlock",
             "Sherlock$",
             "a[^x]{20}b",
             "Holmes|Watson",
             ".{0,3}(Holmes|Watson)",
             "[a-zA-Z]+ing",
             "^([a-zA-Z]{0,4}ing)[^a-zA-Z]",
             "[a-zA-Z]+ing$",
             "^[a-zA-Z ]{5,}$",
             "^.{16,20}$",
             "([a-f](.[d-m].){0,2}[h-n]){2}",
             "([A-Za-z]olmes|[A-Za-z]atson)[^a-zA-Z]",
             "\"[^\"]{0,30}[?!\\.]\"",
             "Holmes.{10,60}Watson|Watson.{10,60}Holmes")

methods <- c("base (PCRE)", "ore", "ore (precompiled)", "stringi")

.makeResult <- function (benchmark, ...)
{
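    # Convert microbenchmark's nanosecond timings to microseconds, relabel the
    # expression factor with readable method names, and attach any extra
    # descriptor columns (such as the regex or task under test)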
    benchmark$time <- benchmark$time / 1000
    benchmark$method <- benchmark$expr
    levels(benchmark$method) <- methods
    return (cbind(benchmark, data.frame(...)))
}

# This function runs the benchmark and collates the results. With vector=TRUE,
# the lines are passed as a vector, otherwise they are concatenated into a
# single long string.
regexTest <- function (vector = FALSE)
{
    results <- NULL
    
    if (!vector)
        text <- paste(text, collapse="\n")
    
    for (regex in regexes)
    {
        if (vector)
        {
            # No-op cases
            stringiOpts <- NULL
            prefixedRegex <- regex
        }
        else
        {
            # Set the multiline options for PCRE and ICU4C, needed to ensure
            # that ^ and $ are interpreted correctly. Oniguruma does this by
            # default, so no special option is needed.
            stringiOpts <- stri_opts_regex(multiline=TRUE)
            prefixedRegex <- paste("(?m)", regex, sep="")
        }
        
        # Compile the regex for use with ore.search
        compiledRegex <- ore(regex, encoding="UTF-8")
        
        # Run the benchmark for ten iterations each, in a randomised order
        benchmark <- microbenchmark(gregexpr(prefixedRegex, text, perl=TRUE),
                                    ore.search(regex, text, all=TRUE),
                                    ore.search(compiledRegex, text, all=TRUE),
                                    stri_locate_all_regex(text, regex, opts_regex=stringiOpts),
                                    times=10L)
        
        # Append the results to the main data frame
        results <- rbind(results, .makeResult(benchmark,regex=regex))
    }
    
    results$regex <- factor(results$regex, levels=regexes)
    return(results)
}

# This function plots the benchmark results. If log=TRUE then the times are
# shown on a logarithmic axis.
plot <- function (results, log = TRUE)
{
    # Build the plot
    plot <- ggplot(results, aes(x=method,y=time,fill=method)) +
            stat_summary(fun.y="median", geom="bar") +
            geom_point(colour=rgb(0,0,0,0.5)) +
            coord_flip() +
            xlab("") +
            ylab(expression(paste("time, ",mu,"s")))
    
    if ("regex" %in% colnames(results))
        plot <- plot + facet_wrap(~regex) + theme(strip.text=element_text(size=rel(0.6)), legend.position=c(0.9,0.12))
    else
        plot <- plot + facet_wrap(~task) + theme(legend.position=c(0.85,0.3))
        
    # Use a log scale if requested
    if (log)
        return(plot + scale_y_log10())
    else
        return(plot)
}

Results

We start with the unvectorised case, where the search text is stored in a single string. The barplot below shows median run-times, with individual times shown by dots. The time axis is necessarily logarithmic, because performance varies widely between packages.

results <- regexTest(vector=FALSE)
plot(results)

We see that base R performs extremely badly in this case, worse than stringi or ore by a large margin. On a non-logarithmic scale the run-times for the latter two packages are so small as to be invisible. ore and stringi are generally closely matched, although clearly each package has some better and some worse cases, relative to the other. Precompiling each regular expression makes little difference when searching through a large document like this.

In the vectorised case base R with PCRE does somewhat better, but it is still consistently outperformed by the other two packages, sometimes by a lot. The results with and without log axes are shown below.

results <- regexTest(vector=TRUE)
plot(results)

plot(results, log=FALSE)

Again, the relative performance of ore and stringi is mixed, although there is some evidence that the timings for ore tend to be somewhat more variable.

Additional operations

In addition to simple search, the basic regex operation in which we look for one or more matches to a pattern, regex packages provide other standard operations. These include extracting the matches themselves (although this is performed by default in ore), splitting strings at matches to the pattern, and replacing each match with new text. After a brief illustrative sketch below, the benchmark code evaluates the three packages’ performance in these tasks, using a single regular expression.
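
The sketch applies each operation to a short made-up string, just to show the equivalent calls side by side; the string and replacement text are illustrative only.

txt <- "singing, ringing and dancing"   # made-up example string
regex <- "[a-zA-Z]+ing"

# Extracting matches
regmatches(txt, gregexpr(regex, txt, perl=TRUE))    # base R
ore.search(regex, txt, all=TRUE)                    # ore stores the matches in its result
stri_match_all_regex(txt, regex)                    # stringi

# Splitting at matches
strsplit(txt, regex, perl=TRUE)                     # base R
ore.split(regex, txt)                               # ore
stri_split_regex(txt, regex)                        # stringi

# Replacing each match
gsub(regex, "***", txt, perl=TRUE)                  # base R
ore.subst(regex, "***", txt, all=TRUE)              # ore
stri_replace_all_regex(txt, regex, "***")           # stringi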

taskTest <- function (regex, vector = FALSE)
{
    results <- NULL
    
    if (!vector)
        text <- paste(text, collapse="\n")
    
    if (vector)
    {
        # No-op cases
        stringiOpts <- NULL
        prefixedRegex <- regex
    }
    else
    {
        # Set the multiline options for PCRE and ICU4C, needed to ensure
        # that ^ and $ are interpreted correctly. Oniguruma does this by
        # default, so no special option is needed.
        stringiOpts <- stri_opts_regex(multiline=TRUE)
        prefixedRegex <- paste("(?m)", regex, sep="")
    }
    
    # Compile the regex for use with ore.search
    compiledRegex <- ore(regex, encoding="UTF-8")
    
    # Search only
    benchmark <- microbenchmark(gregexpr(prefixedRegex, text, perl=TRUE),
                                ore.search(regex, text, all=TRUE),
                                ore.search(compiledRegex, text, all=TRUE),
                                stri_locate_all_regex(text, regex, opts_regex=stringiOpts),
                                times=10L)
    results <- rbind(results, .makeResult(benchmark,task="search only"))
    
    # Extracting matches
    benchmark <- microbenchmark(regmatches(text, gregexpr(prefixedRegex, text, perl=TRUE)),
                                ore.search(regex, text, all=TRUE),
                                ore.search(compiledRegex, text, all=TRUE),
                                stri_match_all_regex(text, regex, opts_regex=stringiOpts),
                                times=10L)
    results <- rbind(results, .makeResult(benchmark,task="extracting matches"))
    
    # Splitting by regex
    benchmark <- microbenchmark(strsplit(text, prefixedRegex, perl=TRUE),
                                ore.split(regex, text),
                                ore.split(compiledRegex, text),
                                stri_split_regex(text, regex, opts_regex=stringiOpts),
                                times=10L)
    results <- rbind(results, .makeResult(benchmark,task="splitting by regex"))
    
    # Simple substitution
    benchmark <- microbenchmark(gsub(prefixedRegex, "*...ing*", text, perl=TRUE),
                                ore.subst(regex, "*...ing*", text, all=TRUE),
                                ore.subst(compiledRegex, "*...ing*", text, all=TRUE),
                                stri_replace_all_regex(text, regex, "*...ing*", opts_regex=stringiOpts),
                                times=10L)
    results <- rbind(results, .makeResult(benchmark,task="simple substitution"))
    
    # Back-referenced substitution
    benchmark <- microbenchmark(gsub(prefixedRegex, "*\\1*", text, perl=TRUE),
                                ore.subst(regex, "*\\1*", text, all=TRUE),
                                ore.subst(compiledRegex, "*\\1*", text, all=TRUE),
                                stri_replace_all_regex(text, regex, "*$1*", opts_regex=stringiOpts),
                                times=10L)
    results <- rbind(results, .makeResult(benchmark,task="back-referenced substitution"))
    
    results$task <- factor(results$task)
    return(results)
}

Again, we consider the unvectorised case first.

results <- taskTest("^([a-zA-Z]{0,4}ing)[^a-zA-Z]", vector=FALSE)
plot(results)

Once again, base R is heavily outperformed by stringi and ore. Moreover, unlike the other two packages, it cannot currently capture the parenthesised group, ([a-zA-Z]{0,4}ing), but only the match as a whole.
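
As a quick demonstration on a made-up line, both packages expose the captured group directly; groups and stri_match_first_regex are, I believe, the relevant accessors, so treat this as a sketch rather than the benchmark code.

line <- "Being a man of habit, he was rarely late."   # made-up example line
regex <- "^([a-zA-Z]{0,4}ing)[^a-zA-Z]"

groups(ore.search(regex, line))            # ore: the captured group, here "Being"
stri_match_first_regex(line, regex)[,2]    # stringi: column 2 holds the first group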

Let’s finally consider the vectorised case, with and without a log time scale.

results <- taskTest("^([a-zA-Z]{0,4}ing)[^a-zA-Z]", vector=TRUE)
plot(results)

plot(results, log=FALSE) + ylim(0, 3e4)
## Warning: Removed 20 rows containing non-finite values (stat_summary).
## Warning: Removed 20 rows containing missing values (geom_point).

For splitting by a regex, the three packages finally perform very similarly in this vectorised case. For substitution, stringi is currently the slowest of the three in this test, with base R and ore performing similarly.

Conclusions

As stated at the beginning of this notebook, definitive benchmarking of regular expression engines and their interfaces is very challenging. Nevertheless, on the basis of these results, it would seem that the CRAN packages ore and stringi substantially outperform base R with PCRE when working with long strings, or when the task at hand is primarily search. When performing substitutions on vectors of strings, ore may have an edge over stringi, but otherwise the two packages’ performance is similar. The best choice of package is therefore likely to depend largely on taste.