The stringr package is part of the tidyverse collection of packages. Like the other tidyverse packages, it is actively maintained, so the results it produces can generally be relied upon.
You can find a cheat sheet for the stringr package here or on the tidyverse website. The R for Data Science textbook has a very good chapter on stringr. If you are going to do any substantial work in text analysis, we encourage you to read that chapter.
The stringr package is built on top of the stringi package. Both packages follow similar conventions, so familiarity with one makes it quick to learn the other. There are four main families of functions in stringr: character manipulation, whitespace tools, locale-sensitive operations, and pattern matching.
We load the stringr package (also available with the tidyverse package) and begin with an example. You will notice that most commands in this package begin with str_.
library(stringr)
This example is from the tidyverse website.
x <- c("why", "video", "cross", "extra", "deal", "authority")
To determine the length of a string, use str_length().
str_length(x)
## [1] 3 5 5 5 4 9
To find a sub-string, use the str_sub() command.
str_sub(x, 1, 3)
## [1] "why" "vid" "cro" "ext" "dea" "aut"
Note that str_sub() can also appear on the left-hand side of an assignment, in which case it modifies the string.
str_sub(x, 2, 3) <- "XY"
x
## [1] "wXY" "vXYeo" "cXYss" "eXYra" "dXYl" "aXYhority"
Since this command modifies x, we will redefine it here.
x <- c("why", "video", "cross", "extra", "deal", "authority")
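As a small additional sketch (not part of the original tidyverse example), str_sub() also accepts negative positions, which count backwards from the end of each string. The variable name last_three is only illustrative.

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# Negative positions count from the end of each string,
# so -3 to -1 selects the last three characters.
last_three <- str_sub(x, -3, -1)
last_three
## [1] "why" "deo" "oss" "tra" "eal" "ity"
```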
We can combine strings using the str_c() command. With the collapse argument, it joins the elements of a vector into a single string.
str_c(x, collapse = ", ")
## [1] "why, video, cross, extra, deal, authority"
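To make the difference between the sep and collapse arguments of str_c() concrete, here is a minimal sketch. The labels and joined vectors are made up for illustration.

```r
library(stringr)

# sep is placed between the vectors being combined, element by element.
labels <- str_c("item", 1:3, sep = "_")
labels
## [1] "item_1" "item_2" "item_3"

# collapse then joins the resulting elements into one string.
joined <- str_c(labels, collapse = " | ")
joined
## [1] "item_1 | item_2 | item_3"
```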
The stringr package can also add, remove, or change whitespace. The str_pad() command pads each string to a given width; by default, it pads on the left.
str_pad(x, 5)
## [1] "  why" "video" "cross" "extra" " deal" "authority"
The str_pad() command can also pad on the right, or on both sides ("both"), by changing the third argument (side).
str_pad(x, 5, "right")
## [1] "why  " "video" "cross" "extra" "deal " "authority"
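The following sketch shows two further uses of str_pad(): padding on both sides, and padding with a character other than a space through the pad argument. The ids vector is a hypothetical example, not data from the original text.

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# Pad on both sides so that every string is at least 7 characters wide.
# Strings already longer than the width are left unchanged.
centered <- str_pad(x, 7, side = "both")
str_length(centered)
## [1] 7 7 7 7 7 9

# Pad with zeros instead of spaces (left side is the default).
ids <- c("7", "42", "365")
padded_ids <- str_pad(ids, 3, pad = "0")
padded_ids
## [1] "007" "042" "365"
```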
To ensure that all of your strings are of the same length, combine the str_pad() and str_trunc() commands.
x %>% str_trunc(7) %>%
  str_pad(7, "right")
## [1] "why    " "video  " "cross  " "extra  " "deal   " "auth..."
The str_trim() command is the opposite of str_pad(). It removes whitespace from the start and end of a string. By default, it trims both sides, but you can choose to trim only the left or only the right.
y <- c(" a", "b ", " c ")
str_trim(y)
## [1] "a" "b" "c"
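As a brief illustration of one-sided trimming, the sketch below uses the side argument of str_trim() on the same vector y.

```r
library(stringr)

y <- c(" a", "b ", " c ")

# Remove only leading whitespace.
left_trimmed <- str_trim(y, side = "left")
left_trimmed
## [1] "a"  "b " "c "

# Remove only trailing whitespace.
right_trimmed <- str_trim(y, side = "right")
right_trimmed
## [1] " a" "b"  " c"
```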
You can use str_wrap() to modify existing whitespace in order to wrap a paragraph of text, such that the length of each line is as similar as possible.
jabberwocky <- str_c(
  "`Twas brillig, and the slithy toves ",
  "did gyre and gimble in the wabe: ",
  "All mimsy were the borogoves, ",
  "and the mome raths outgrabe. "
)
cat(str_wrap(jabberwocky, width = 40))
## `Twas brillig, and the slithy toves did
## gyre and gimble in the wabe: All mimsy
## were the borogoves, and the mome raths
## outgrabe.
A handful of stringr functions are locale-sensitive: they can behave differently depending on the language. These include the case transformation functions shown here and the ordering and sorting functions. (The locale defaults to English, but it can be changed through the locale argument by supplying an ISO 639 language code, such as "tr" for Turkish.)
z <- "Dsci 300 is the best."
str_to_upper(z)
## [1] "DSCI 300 IS THE BEST."
str_to_title(z)
## [1] "Dsci 300 Is The Best."
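To see the locale make a visible difference, the sketch below lower-cases a sentence in English and in Turkish, where the lower-case form of "I" is the dotless "ı". The sentence w is only an illustrative input.

```r
library(stringr)

w <- "I like horses."

# Default (English) locale.
lower_en <- str_to_lower(w)
lower_en
## [1] "i like horses."

# Turkish locale: "I" lower-cases to the dotless "ı" (U+0131).
lower_tr <- str_to_lower(w, locale = "tr")
lower_tr
## [1] "ı like horses."
```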
We can also reorder a character vector. The str_order() command returns the indices that would put the vector in sorted order. The str_sort() command returns the sorted vector itself.
str_order(x)
## [1] 6 3 5 4 2 1
str_sort(x)
## [1] "authority" "cross" "deal" "extra" "video" "why"
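The relationship between the two commands can be checked directly: indexing x by str_order(x) gives the same result as str_sort(x). The sketch also shows the decreasing argument of str_sort().

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# str_order() returns positions; using them as an index sorts the vector.
idx <- str_order(x)
sorted_by_index <- x[idx]
identical(sorted_by_index, str_sort(x))
## [1] TRUE

# Sort in reverse order.
reversed <- str_sort(x, decreasing = TRUE)
reversed
## [1] "why"       "video"     "extra"     "deal"      "cross"     "authority"
```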
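The main verbs that work with patterns are str_detect(), str_count(), str_locate(), str_extract(), str_match(), str_replace(), str_split(), and str_subset(). The examples below apply each of them in turn, using a regular expression that matches phone numbers.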
strings <- c(
"Not A Phone Number",
"236 233 4567",
"468-123-5523",
"Work: 250-762-5445; Home: 321.123.4567"
)
phone <- "([2-9][0-9][0-9])[- .]([0-9][0-9][0-9])[- .]([0-9][0-9][0-9][0-9])"
# The parentheses (capture groups) are not needed for matching. However, str_match() below uses them to return each part of the number.
str_detect(strings, phone)
## [1] FALSE TRUE TRUE TRUE
str_count(strings, phone)
## [1] 0 1 1 2
str_locate(strings, phone)
## start end
## [1,] NA NA
## [2,] 1 12
## [3,] 1 12
## [4,] 7 18
Note that str_locate() reports only the first match in each string. Where a string contains multiple matches, use str_locate_all() to find all of them.
str_locate_all(strings, phone)
## [[1]]
## start end
##
## [[2]]
## start end
## [1,] 1 12
##
## [[3]]
## start end
## [1,] 1 12
##
## [[4]]
## start end
## [1,] 7 18
## [2,] 27 38
str_extract(strings, phone)
## [1] NA "236 233 4567" "468-123-5523" "250-762-5445"
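Like str_locate(), str_extract() returns only the first match in each string. A minimal sketch of the corresponding str_extract_all() call, which returns every match as a list with one element per input string, is shown here using the same strings and phone pattern.

```r
library(stringr)

strings <- c(
  "Not A Phone Number",
  "236 233 4567",
  "468-123-5523",
  "Work: 250-762-5445; Home: 321.123.4567"
)
phone <- "([2-9][0-9][0-9])[- .]([0-9][0-9][0-9])[- .]([0-9][0-9][0-9][0-9])"

# One list element per input string, holding all of its matches.
all_numbers <- str_extract_all(strings, phone)

# The fourth string contains two phone numbers.
all_numbers[[4]]
## [1] "250-762-5445" "321.123.4567"
```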
str_match(strings, phone)
## [,1] [,2] [,3] [,4]
## [1,] NA NA NA NA
## [2,] "236 233 4567" "236" "233" "4567"
## [3,] "468-123-5523" "468" "123" "5523"
## [4,] "250-762-5445" "250" "762" "5445"
str_replace(strings, phone, "XXX-XXX-XXXX")
## [1] "Not A Phone Number"
## [2] "XXX-XXX-XXXX"
## [3] "XXX-XXX-XXXX"
## [4] "Work: XXX-XXX-XXXX; Home: 321.123.4567"
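In the output above, only the first phone number in the fourth string is replaced. The sketch below shows two variations: str_replace_all(), which replaces every match, and a replacement string that refers back to the capture groups in phone with \\1, \\2, and \\3. The "(area) prefix-line" output format is just an illustrative choice.

```r
library(stringr)

strings <- c(
  "Not A Phone Number",
  "236 233 4567",
  "468-123-5523",
  "Work: 250-762-5445; Home: 321.123.4567"
)
phone <- "([2-9][0-9][0-9])[- .]([0-9][0-9][0-9])[- .]([0-9][0-9][0-9][0-9])"

# Replace every match, not just the first one.
masked <- str_replace_all(strings[4], phone, "XXX-XXX-XXXX")
masked
## [1] "Work: XXX-XXX-XXXX; Home: XXX-XXX-XXXX"

# Reuse the three capture groups to reformat each number.
reformatted <- str_replace(strings[2:3], phone, "(\\1) \\2-\\3")
reformatted
## [1] "(236) 233-4567" "(468) 123-5523"
```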
str_split(c("a,b", "c,d,e"), ",")
## [[1]]
## [1] "a" "b"
##
## [[2]]
## [1] "c" "d" "e"
str_split("250-762-5445", "-", n = 2)
## [[1]]
## [1] "250" "762-5445"
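By default, str_split() returns a list. When a rectangular result is more convenient, one option is the simplify argument, sketched below, which returns a character matrix and fills missing pieces with empty strings.

```r
library(stringr)

# simplify = TRUE returns a matrix with one row per input string.
parts <- str_split(c("a,b", "c,d,e"), ",", simplify = TRUE)
parts
##      [,1] [,2] [,3]
## [1,] "a"  "b"  ""
## [2,] "c"  "d"  "e"
```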
str_subset(strings, phone)
## [1] "236 233 4567"
## [2] "468-123-5523"
## [3] "Work: 250-762-5445; Home: 321.123.4567"
Camm, Jeffrey D. Business Analytics. Third edition, Cengage, 2019.
“Introduction to Stringr.” Accessed May 3, 2021. Available Here.
“R for Data Science.” Accessed May 3, 2021. Available Here.
“Regular Expressions.” Accessed May 3, 2021. Available Here.
Sanchez, Gaston. Handling Strings With R. Accessed May 3, 2021. https://leanpub.com/r4strings.