This document is the result of an assignment from IS 607 of CUNY’s Masters of Data Analytics program. Its intent is to illustrate the performance differences between similar functions available to R developers. My assignment compares the performance of the base R gsub() function with the stri_replace_all_fixed() function from the stringi package for string replacement operations.
In this performance test I’m using the iris data set, which is available in the base R installation. The first test below compares gsub() and stri_replace_all_fixed() when removing a single character from within a string (i.e., reducing the size of the string):
# Load required libraries
library(microbenchmark)
library(stringi)
# Load iris data set
data(iris)
# Run the perf test
microbenchmark(gsub("s", "", iris$Species), stri_replace_all_fixed(iris$Species, "s", ""))
## Unit: microseconds
##                                           expr     min      lq      mean  median      uq     max neval
##                    gsub("s", "", iris$Species) 169.177 171.194 175.52155 172.276 174.745 220.477   100
##  stri_replace_all_fixed(iris$Species, "s", "")  66.938  68.785  76.42547  70.172  71.509 499.089   100
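Timings are only comparable if both calls do the same work, so a quick check that they return the same strings is worthwhile. The following is a minimal sketch, not part of the original test; the variable names species, base_result, and stringi_result are introduced here for illustration, and the explicit as.character() conversion is an assumption made so that both functions receive the same character vector (iris$Species is stored as a factor).

```r
library(stringi)
data(iris)

# iris$Species is a factor; convert it explicitly so both functions
# receive the same character vector (illustrative choice).
species <- as.character(iris$Species)

# Remove every "s" with each approach
base_result    <- gsub("s", "", species)
stringi_result <- stri_replace_all_fixed(species, "s", "")

# TRUE when both approaches produce exactly the same strings
identical(base_result, stringi_result)
```

If the results ever differed, the benchmark would be comparing different operations rather than different implementations of the same one.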
As the timings above show, stri_replace_all_fixed() is roughly 2.5 times faster than gsub() by median time (about 70 vs. 172 microseconds). Let’s try another scenario where we replace a single character with a longer string (i.e., expanding the size of the string):
# Run the perf test
microbenchmark(gsub("s", "Hello, World!", iris$Species), stri_replace_all_fixed(iris$Species, "s", "Hello, World!"))
## Unit: microseconds
##                                                        expr     min       lq      mean   median       uq     max neval
##                    gsub("s", "Hello, World!", iris$Species) 178.231 179.6095 185.00180 180.3725 186.3245 238.854   100
##  stri_replace_all_fixed(iris$Species, "s", "Hello, World!")  71.632  73.0465  76.21637  74.4685  75.5670 149.541   100
Again, stri_replace_all_fixed() is roughly 2.4 times faster than gsub() by median time (about 74 vs. 180 microseconds). While not definitive, these tests suggest that it behooves the R programmer to look beyond the gsub() function for higher-performing alternatives, especially when working with large amounts of string data.
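One caveat about the tests above: gsub() treats its pattern as a regular expression by default, while stri_replace_all_fixed() matches the pattern literally. Passing fixed = TRUE to gsub() gives a closer like-for-like comparison. The sketch below is not part of the original assignment. It assumes a larger, hypothetical input (species_big, built by repeating the 150 species labels 1,000 times) and an arbitrary choice of 20 repetitions; the labels gsub_regex, gsub_fixed, and stringi are illustrative names only. No results are shown because timings depend on the machine and package versions.

```r
library(microbenchmark)
library(stringi)
data(iris)

# Hypothetical larger input: the 150 species labels repeated 1,000 times
species_big <- rep(as.character(iris$Species), times = 1000)

# Compare regex gsub(), literal gsub(), and stringi on the same vector
bench <- microbenchmark(
  gsub_regex = gsub("s", "", species_big),
  gsub_fixed = gsub("s", "", species_big, fixed = TRUE),
  stringi    = stri_replace_all_fixed(species_big, "s", ""),
  times = 20
)

print(bench)
```

Running the comparison on a larger vector also helps show whether the gap seen on 150 short strings holds for the large string workloads mentioned above.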