This document is the result of an assignment for IS 607 in CUNY’s Masters of Data Analytics program. Its intent is to illustrate the performance differences between similar functions available to R developers. The subject of my assignment is a comparison of the base R gsub() function and the stri_replace_all_fixed() function from the stringi package for string replacement operations.

In this performance test I’m using the iris data set, which is available in the base R installation. The first test below compares gsub() and stri_replace_all_fixed() when removing a single character from within a string (i.e., reducing the size of a string):

# Load required libraries
library(microbenchmark)
library(stringi)

# Load iris data set
data(iris)

# Run the perf test
microbenchmark(gsub("s", "", iris$Species), stri_replace_all_fixed(iris$Species, "s", ""))
## Unit: microseconds
##                                           expr     min      lq      mean
##                    gsub("s", "", iris$Species) 169.177 171.194 175.52155
##  stri_replace_all_fixed(iris$Species, "s", "")  66.938  68.785  76.42547
##   median      uq     max neval
##  172.276 174.745 220.477   100
##   70.172  71.509 499.089   100
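A timing comparison only means something if both calls produce the same result, so it may be worth checking that first. The snippet below is a minimal sketch of such a check and was not part of the original test run. The variable names (species, base_result, stringi_result) are my own, and the factor iris$Species is converted to character up front so that both functions receive identical input.

```r
library(stringi)

data(iris)

# iris$Species is a factor; convert it once so both functions see plain strings
species <- as.character(iris$Species)

# Remove every "s" with each function
base_result    <- gsub("s", "", species)
stringi_result <- stri_replace_all_fixed(species, "s", "")

# TRUE if the two character vectors match element for element
print(identical(base_result, stringi_result))

# The three distinct species names after removal
print(unique(base_result))
## [1] "etoa"      "vericolor" "virginica"
```

If identical() returns TRUE, the two expressions in the benchmark are doing equivalent work on this input.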

The timings from the first test show that stri_replace_all_fixed() is quite a bit faster than gsub(): its median run time is about 70 microseconds against about 172, or roughly 2.5 times faster. Let’s try another scenario where we are replacing a single character with a string (i.e., expanding the size of a string):

# Run the perf test
microbenchmark(gsub("s", "Hello, World!", iris$Species), stri_replace_all_fixed(iris$Species, "s", "Hello, World!"))
## Unit: microseconds
##                                                        expr     min
##                    gsub("s", "Hello, World!", iris$Species) 178.231
##  stri_replace_all_fixed(iris$Species, "s", "Hello, World!")  71.632
##        lq      mean   median       uq     max neval
##  179.6095 185.00180 180.3725 186.3245 238.854   100
##   73.0465  76.21637  74.4685  75.5670 149.541   100
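One caveat applies to both tests: gsub() treats its pattern as a regular expression by default, while stri_replace_all_fixed() matches the pattern literally. A closer like-for-like comparison could add gsub() with fixed = TRUE. The sketch below shows one way to set that up. It is an assumption about how such a test might look rather than something I ran for this assignment, so no timings are reported. The labels gsub_regex, gsub_fixed and stringi_fixed are illustrative names.

```r
library(microbenchmark)
library(stringi)

data(iris)

# Convert the factor once so no expression pays for the coercion
species <- as.character(iris$Species)

# Named expressions make the output table easier to read
fair_test <- microbenchmark(
  gsub_regex    = gsub("s", "Hello, World!", species),
  gsub_fixed    = gsub("s", "Hello, World!", species, fixed = TRUE),
  stringi_fixed = stri_replace_all_fixed(species, "s", "Hello, World!"),
  times = 100
)

print(fair_test)
```

Comparing the gsub_fixed and stringi_fixed rows would indicate how much of the gap comes from regular expression handling and how much from the implementations themselves.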

Again, stri_replace_all_fixed() is quite a bit faster than gsub(): a median of about 74 microseconds against about 180, or roughly 2.4 times faster. While not definitive, these tests suggest that R programmers should look beyond the gsub() function for higher-performing alternatives, especially when working with large amounts of string data.
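The iris data set has only 150 rows, so the tests above say little about large inputs. One way to probe that case is to repeat the species labels many times and rerun the benchmark. The sketch below does this under assumptions of my own: the repetition count of 1,000 and the reduced times = 20 are arbitrary choices, and species_big and scale_test are hypothetical names. I have not recorded results for it here.

```r
library(microbenchmark)
library(stringi)

data(iris)

# Build a larger input: 150 labels repeated 1,000 times gives 150,000 strings
species_big <- rep(as.character(iris$Species), times = 1000)

# Fewer repetitions than the default 100, since each run now takes longer
scale_test <- microbenchmark(
  gsub    = gsub("s", "", species_big),
  stringi = stri_replace_all_fixed(species_big, "s", ""),
  times = 20
)

print(scale_test)
```

Varying the times argument of rep() would show whether the relative gap between the two functions holds, widens or narrows as the input grows.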