The Idea

The stringr package is part of the tidyverse collection of R packages. Like the other tidyverse packages, it is actively maintained and widely used, which makes it a dependable foundation for working with text.

You can find a cheat sheet for the stringr package here or on the tidyverse website. The R for Data Science textbook has a very good chapter on stringr. If you plan to do any substantial work in text analysis, we encourage you to read that chapter.

stringr

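The stringr package is built on top of the stringi package. Both packages follow similar conventions, so familiarity with one makes the other quick to learn. There are four main families of functions in stringr: character manipulation, whitespace tools, locale-sensitive operations, and pattern matching. Each has its own section in this document.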

We load the stringr package (also available with the tidyverse package) and begin with an example. You will notice that most commands in this package begin with str_.

library(stringr)

Characters

This example is from the tidyverse website.

x <- c("why", "video", "cross", "extra", "deal", "authority")

To determine the length of a string, use str_length().

str_length(x) 
## [1] 3 5 5 5 4 9

To extract a substring, use the str_sub() command with a start and an end position.

str_sub(x, 1, 3)
## [1] "why" "vid" "cro" "ext" "dea" "aut"
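Positions passed to str_sub() can also be negative, in which case they count backwards from the end of the string. A minimal sketch, reusing the same x vector (the name last_three is ours, chosen for illustration):

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# Negative positions count from the last character,
# so -3 to -1 selects the final three characters of each string.
last_three <- str_sub(x, -3, -1)
last_three
## [1] "why" "deo" "oss" "tra" "eal" "ity"
```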

The str_sub() command can also appear on the left side of an assignment, in which case it modifies the string in place.

str_sub(x, 2, 3) <- "XY"
x
## [1] "wXY"       "vXYeo"     "cXYss"     "eXYra"     "dXYl"      "aXYhority"

Since this command modifies x, we will redefine it here.

x <- c("why", "video", "cross", "extra", "deal", "authority")

We can join strings using the str_c() command. The collapse argument combines a whole vector into a single string.

str_c(x, collapse = ", ")
## [1] "why, video, cross, extra, deal, authority"
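The str_c() command also takes a sep argument, which behaves differently from collapse. The sketch below contrasts the two; the vectors first and second are made up for illustration.

```r
library(stringr)

first <- c("why", "video")
second <- c("cross", "extra")

# sep is placed between the arguments, element by element.
pairs <- str_c(first, second, sep = "-")
pairs
## [1] "why-cross"   "video-extra"

# collapse then joins the resulting vector into a single string.
joined <- str_c(first, second, sep = "-", collapse = ", ")
joined
## [1] "why-cross, video-extra"
```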

Whitespace

We can add, remove or change whitespace in this library. By default, the str_pad() command pads on the left.

str_pad(x, 5)
## [1] "  why"     "video"     "cross"     "extra"     " deal"     "authority"

The str_pad() command can also pad on the right ("right") or on both sides ("both") by changing the third argument, side.

str_pad(x, 5, "right")
## [1] "why  "     "video"     "cross"     "extra"     "deal "     "authority"
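Two further options are worth a quick look: padding on both sides, and padding with a character other than a space through the pad argument. A small sketch, with made-up inputs:

```r
library(stringr)

# Pad on both sides: "why" has 3 characters, so the 4 spaces are split evenly.
centered <- str_pad("why", 7, side = "both")
centered
## [1] "  why  "

# Pad with "0" instead of a space, here to build zero-padded IDs.
ids <- str_pad(c("7", "42", "365"), 3, pad = "0")
ids
## [1] "007" "042" "365"
```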

To ensure that all of your strings are of the same length, combine the str_pad() and str_trunc() commands.

x %>% str_trunc(7) %>%
        str_pad(7, "right")
## [1] "why    " "video  " "cross  " "extra  " "deal   " "auth..."
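The str_trunc() command places its ellipsis on the right by default. As a sketch of the other options, the side argument controls which end is cut and the ellipsis argument controls the marker text:

```r
library(stringr)

# side = "left" drops characters from the start and keeps the end.
left_cut <- str_trunc("authority", 7, side = "left")
left_cut
## [1] "...rity"

# An empty ellipsis gives a plain cut at the requested width.
plain_cut <- str_trunc("authority", 7, ellipsis = "")
plain_cut
## [1] "authori"
```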

The str_trim() command is the opposite of str_pad(). It removes whitespace from the start and end of each string. By default it trims both sides, but you can set the side argument to "left" or "right" to trim only one.

y <- c("    a", "b   ", "    c    ")
str_trim(y)
## [1] "a" "b" "c"
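To see the effect of the side argument, the sketch below trims the same y vector on one side at a time:

```r
library(stringr)

y <- c("    a", "b   ", "    c    ")

# Remove leading whitespace only; trailing spaces are kept.
left_only <- str_trim(y, side = "left")
left_only
## [1] "a"     "b   "  "c    "

# Remove trailing whitespace only; leading spaces are kept.
right_only <- str_trim(y, side = "right")
right_only
## [1] "    a" "b"     "    c"
```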

You can use str_wrap() to modify existing whitespace in order to wrap a paragraph of text, such that the length of each line is as similar as possible.

jabberwocky <- str_c(
  "`Twas brillig, and the slithy toves ",
  "did gyre and gimble in the wabe: ",
  "All mimsy were the borogoves, ",
  "and the mome raths outgrabe. "
)
cat(str_wrap(jabberwocky, width = 40))
## `Twas brillig, and the slithy toves did
## gyre and gimble in the wabe: All mimsy
## were the borogoves, and the mome raths
## outgrabe.
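Note that str_wrap() returns a single string with newline characters inserted, which is why the example prints it with cat(). The sketch below splits the wrapped text back into lines to inspect their lengths (the names wrapped and lines are ours):

```r
library(stringr)

jabberwocky <- str_c(
  "`Twas brillig, and the slithy toves ",
  "did gyre and gimble in the wabe: ",
  "All mimsy were the borogoves, ",
  "and the mome raths outgrabe. "
)

# The result is one string; the line breaks are embedded "\n" characters.
wrapped <- str_wrap(jabberwocky, width = 40)

# Split on the newlines and measure each line.
lines <- str_split(wrapped, "\n")[[1]]
str_length(lines)
```

Every word here is far shorter than 40 characters, so each line should fit within the requested width.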

Locale Sensitive Operations

A handful of stringr functions are locale-sensitive: they behave differently depending on the language rules in effect. These include the case transformation functions and the ordering and sorting functions. (The locale defaults to English, "en", so that results are the same on every system. To change it, pass a locale made up of a two-letter ISO 639-1 language code, optionally followed by an ISO 3166 country code.)

z <- "Dsci 300 is the best."
str_to_upper(z)
## [1] "DSCI 300 IS THE BEST."
str_to_title(z)
## [1] "Dsci 300 Is The Best."
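A common illustration of locale sensitivity is Turkish, which has both a dotted and a dotless i. The sketch below compares the default English rules with the Turkish locale, "tr":

```r
library(stringr)

# English rules (the default): "i" becomes a plain capital "I".
upper_en <- str_to_upper("i")
upper_en
## [1] "I"

# Turkish rules: "i" becomes a dotted capital "İ" (U+0130).
upper_tr <- str_to_upper("i", locale = "tr")
upper_tr
## [1] "İ"
```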

We can also reorder a character vector. The str_order() command returns the indices that would put the vector in sorted order, and the str_sort() command returns the sorted vector itself.

str_order(x)
## [1] 6 3 5 4 2 1
str_sort(x)
## [1] "authority" "cross"     "deal"      "extra"     "video"     "why"
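The two commands are closely related: indexing a vector by its str_order() result gives the same answer as str_sort(). The sketch below checks this, and also shows the decreasing and numeric arguments. The files vector is a made-up example.

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# Indexing with str_order() reproduces str_sort().
same <- identical(x[str_order(x)], str_sort(x))
same
## [1] TRUE

# decreasing = TRUE reverses the order.
reversed <- str_sort(x, decreasing = TRUE)
reversed
## [1] "why"       "video"     "extra"     "deal"      "cross"     "authority"

# numeric = TRUE compares embedded digits as numbers, so 2 sorts before 10.
files <- c("file10", "file2", "file1")
natural <- str_sort(files, numeric = TRUE)
natural
## [1] "file1"  "file2"  "file10"
```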

Pattern Matching

There are eight main verbs that work with patterns:

  • str_detect(x, pattern) tells you if there’s any match to the pattern.
strings <- c(
  "Not A Phone Number", 
  "236 233 4567", 
  "468-123-5523", 
  "Work: 250-762-5445; Home: 321.123.4567"
)
phone <- "([2-9][0-9][0-9])[- .]([0-9][0-9][0-9])[- .]([0-9][0-9][0-9][0-9])"
#  Note that the () are unnecessary above.  However, we will take advantage of them in an example below.

str_detect(strings, phone)
## [1] FALSE  TRUE  TRUE  TRUE
  • str_count(x, pattern) counts the number of matches in each string.
str_count(strings, phone)
## [1] 0 1 1 2
  • str_locate(x, pattern) gives the position of the FIRST match (both start and end values).
str_locate(strings, phone)
##      start end
## [1,]    NA  NA
## [2,]     1  12
## [3,]     1  12
## [4,]     7  18

Where a string contains multiple matches, use str_locate_all().

str_locate_all(strings, phone)
## [[1]]
##      start end
## 
## [[2]]
##      start end
## [1,]     1  12
## 
## [[3]]
##      start end
## [1,]     1  12
## 
## [[4]]
##      start end
## [1,]     7  18
## [2,]    27  38
  • str_extract(x, pattern) extracts the text of the FIRST match. We can use str_extract_all() to extract all matches.
str_extract(strings, phone)
## [1] NA             "236 233 4567" "468-123-5523" "250-762-5445"
  • str_match(x, pattern) extracts the parts of the FIRST match defined by parentheses, with the full match in the first column. str_match_all() does the same for all matches.
str_match(strings, phone)
##      [,1]           [,2]  [,3]  [,4]  
## [1,] NA             NA    NA    NA    
## [2,] "236 233 4567" "236" "233" "4567"
## [3,] "468-123-5523" "468" "123" "5523"
## [4,] "250-762-5445" "250" "762" "5445"
  • str_replace(x, pattern, replacement) replaces the FIRST match with new text. str_replace_all() replaces all matches.
str_replace(strings, phone, "XXX-XXX-XXXX")
## [1] "Not A Phone Number"                    
## [2] "XXX-XXX-XXXX"                          
## [3] "XXX-XXX-XXXX"                          
## [4] "Work: XXX-XXX-XXXX; Home: 321.123.4567"
  • str_split(x, pattern) splits a string into multiple pieces. Using the n argument, we can cap the number of pieces; any remaining text stays in the last piece.
str_split(c("a,b", "c,d,e"), ",")
## [[1]]
## [1] "a" "b"
## 
## [[2]]
## [1] "c" "d" "e"
str_split("250-762-5445", "-", n=2)
## [[1]]
## [1] "250"      "762-5445"
  • str_subset(x, pattern) returns the elements of x that contain a match.
str_subset(strings, phone)
## [1] "236 233 4567"                          
## [2] "468-123-5523"                          
## [3] "Work: 250-762-5445; Home: 321.123.4567"
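Several of the verbs above have an _all variant that handles every match rather than only the first. The sketch below applies two of them to the same strings and phone pattern, and then uses the parenthesised groups as back-references (\\1, \\2, \\3) in a replacement. The target format "(XXX) XXX-XXXX" is our choice for illustration.

```r
library(stringr)

strings <- c(
  "Not A Phone Number",
  "236 233 4567",
  "468-123-5523",
  "Work: 250-762-5445; Home: 321.123.4567"
)
phone <- "([2-9][0-9][0-9])[- .]([0-9][0-9][0-9])[- .]([0-9][0-9][0-9][0-9])"

# str_extract_all() returns a list holding every match for each input string.
all_numbers <- str_extract_all(strings, phone)
all_numbers[[4]]
## [1] "250-762-5445" "321.123.4567"

# str_replace_all() masks every match, not just the first.
masked <- str_replace_all(strings, phone, "XXX-XXX-XXXX")
masked[4]
## [1] "Work: XXX-XXX-XXXX; Home: XXX-XXX-XXXX"

# \\1, \\2 and \\3 in the replacement refer to the three groups in the pattern.
reformatted <- str_replace_all(strings, phone, "(\\1) \\2-\\3")
reformatted
## [1] "Not A Phone Number"
## [2] "(236) 233-4567"
## [3] "(468) 123-5523"
## [4] "Work: (250) 762-5445; Home: (321) 123-4567"
```

Strings with no match come back unchanged from the replace functions, and as an empty character vector from str_extract_all().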

Citations

Camm, Jeffrey D. Business Analytics. Third edition, Cengage, 2019.

“Introduction to Stringr.” Accessed May 3, 2021. Available Here.

“R for Data Science.” Accessed May 3, 2021. Available Here.

“Regular Expressions.” Accessed May 3, 2021. Available Here.

Sanchez, Gaston. Handling Strings With R. Accessed May 3, 2021. https://leanpub.com/r4strings.