stringr

The stringr package is built on top of the stringi package. The two packages share similar conventions, so familiarity with one makes the other quick to learn. There are four main families of functions in stringr:

Character manipulation. Functions that let you manipulate individual characters within the strings of a character vector.
Whitespace tools. These add, remove, and manipulate whitespace.
Locale sensitive operations. Operations whose behavior varies from locale to locale.
Pattern matching functions. These recognize four engines of pattern description, the most common being regular expressions.
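To make the four engines concrete, here is a minimal sketch using the stringr helpers regex(), fixed(), coll(), and boundary(). The example strings and variable names are made up for illustration.

```r
library(stringr)

# A bare pattern is treated as a regular expression: "." matches any character.
as_regex <- str_detect("abc", ".")
as_regex
## [1] TRUE

# fixed() matches the pattern literally, so "." only matches a period.
as_fixed <- str_detect("abc", fixed("."))
as_fixed
## [1] FALSE

# regex() exposes options such as case-insensitive matching.
no_case <- str_detect("ABC", regex("b", ignore_case = TRUE))
no_case
## [1] TRUE

# coll() compares text using locale-aware collation rules.
collated <- str_detect("abc", coll("B", ignore_case = TRUE))
collated
## [1] TRUE

# boundary() matches boundaries between characters, words, lines, or sentences.
n_words <- str_count("one two three", boundary("word"))
n_words
## [1] 3
```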

We load the stringr package (it is also loaded as part of the tidyverse) and begin with an example. You will notice that most functions in this package begin with str_.

library(stringr)

Characters

This example is from the tidyverse website.

x <- c("why", "video", "cross", "extra", "deal", "authority")

To determine the length of a string, use str_length().

str_length(x)

## [1] 3 5 5 5 4 9

To extract a substring by position, use the str_sub() command.

str_sub(x, 1, 3)

## [1] "why" "vid" "cro" "ext" "dea" "aut"
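The str_sub() command also accepts negative positions, which count back from the end of each string. A short sketch, reusing the same x (the name last_three is only for illustration):

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# -3 is the third character from the end, -1 is the last character.
last_three <- str_sub(x, -3, -1)
last_three
## [1] "why" "deo" "oss" "tra" "eal" "ity"
```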

The assignment form of str_sub() also modifies the string in place.

str_sub(x, 2, 3) <- "XY"
x

## [1] "wXY"       "vXYeo"     "cXYss"     "eXYra"     "dXYl"      "aXYhority"

Since this command modifies x, we will redefine it here.

x <- c("why", "video", "cross", "extra", "deal", "authority")

We can join strings using the str_c() command. The collapse argument combines all elements of the vector into a single string.

str_c(x, collapse = ", ")

## [1] "why, video, cross, extra, deal, authority"
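The str_c() command can also join several vectors element by element, with sep placed between the pieces. The sketch below uses made-up labels to show sep alone and then sep together with collapse.

```r
library(stringr)

# Element-wise join: the length-one input "item" is recycled.
labels <- str_c("item", 1:3, sep = "_")
labels
## [1] "item_1" "item_2" "item_3"

# sep joins the pieces of each element, then collapse joins the elements.
one_string <- str_c("item", 1:3, sep = "_", collapse = "; ")
one_string
## [1] "item_1; item_2; item_3"
```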

Whitespace

We can add, remove, or change whitespace with this package. By default, the str_pad() command pads on the left.

str_pad(x, 5)

## [1] "  why"     "video"     "cross"     "extra"     " deal"     "authority"

The str_pad() command can also pad on the right, or on both the left and right ("both"), by changing the third argument, side.

str_pad(x, 5, "right")

## [1] "why  "     "video"     "cross"     "extra"     "deal "     "authority"
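Padding is not limited to spaces. As a small sketch, the pad argument of str_pad() sets the padding character. The inputs here are invented for illustration.

```r
library(stringr)

# Pad evenly on both sides with a custom character.
boxed <- str_pad("deal", 8, side = "both", pad = "*")
boxed
## [1] "**deal**"

# Left-pad numbers stored as text with zeros to a fixed width.
ids <- str_pad(c("7", "42"), 3, pad = "0")
ids
## [1] "007" "042"
```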

To ensure that all of your strings are of the same length, combine the str_pad() and str_trunc() commands.

x %>% str_trunc(7) %>%
        str_pad(7, "right")

## [1] "why    " "video  " "cross  " "extra  " "deal   " "auth..."

The str_trim() command is the opposite of str_pad(). It removes whitespace from the start and end of a string. By default it trims both sides, but you can use the side argument to trim only the left or only the right.

y <- c("    a", "b   ", "    c    ")
str_trim(y)

## [1] "a" "b" "c"
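To trim only one side, pass side to str_trim(). The related str_squish() command (available in recent stringr versions) also collapses repeated internal whitespace. A brief sketch, with an invented second example string:

```r
library(stringr)

y <- c("    a", "b   ", "    c    ")

# Remove only the leading whitespace; trailing spaces are kept.
left_trimmed <- str_trim(y, side = "left")
left_trimmed
## [1] "a"     "b   "  "c    "

# str_squish() trims both ends and reduces inner runs of spaces to one.
squished <- str_squish("  too   many    spaces  ")
squished
## [1] "too many spaces"
```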

You can use str_wrap() to modify existing whitespace in order to wrap a paragraph of text, such that the length of each line is as similar as possible.

jabberwocky <- str_c(
  "`Twas brillig, and the slithy toves ",
  "did gyre and gimble in the wabe: ",
  "All mimsy were the borogoves, ",
  "and the mome raths outgrabe. "
)
cat(str_wrap(jabberwocky, width = 40))

## `Twas brillig, and the slithy toves did
## gyre and gimble in the wabe: All mimsy
## were the borogoves, and the mome raths
## outgrabe.

Locale Sensitive Operations

A handful of stringr functions are locale-sensitive: they perform differently in different regions of the world. These include the case transformation functions and the string ordering and sorting functions. (Note that the locale can be changed to account for regional language differences. The default is English, but we can supply an ISO 639 language code, optionally followed by an ISO 3166 country code.)

z <- "Dsci 300 is the best."
str_to_upper(z)

## [1] "DSCI 300 IS THE BEST."

str_to_title(z)

## [1] "Dsci 300 Is The Best."
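To see why the locale matters, consider Turkish, which has both a dotted and a dotless i. This is a minimal sketch using the locale argument with the language code "tr". The results are compared against Unicode escapes so the example does not depend on how the console displays non-ASCII text.

```r
library(stringr)

# English rules (the default): "i" becomes a plain capital I.
english_upper <- str_to_upper("i")
english_upper
## [1] "I"

# Turkish rules: "i" becomes the dotted capital I (U+0130).
turkish_upper <- str_to_upper("i", locale = "tr")
identical(turkish_upper, "\u0130")
## [1] TRUE

# Turkish rules: "I" becomes the dotless lowercase i (U+0131).
turkish_lower <- str_to_lower("I", locale = "tr")
identical(turkish_lower, "\u0131")
## [1] TRUE
```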

We can also sort a character vector. The str_order() command returns the indices that would put the vector in sorted order. The str_sort() command returns the sorted vector itself.

str_order(x)

## [1] 6 3 5 4 2 1

str_sort(x)

## [1] "authority" "cross"     "deal"      "extra"     "video"     "why"
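The two commands are related: indexing x by str_order(x) gives the same result as str_sort(x). The sketch below also shows the decreasing and numeric arguments. The file names are made up for illustration.

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# Indexing by the order reproduces the sorted vector.
same_result <- identical(x[str_order(x)], str_sort(x))
same_result
## [1] TRUE

# Reverse alphabetical order.
reversed <- str_sort(x, decreasing = TRUE)
reversed
## [1] "why"       "video"     "extra"     "deal"      "cross"     "authority"

# numeric = TRUE sorts embedded digits as numbers rather than as text.
files <- c("file10", "file2", "file1")
natural <- str_sort(files, numeric = TRUE)
natural
## [1] "file1"  "file2"  "file10"
```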

Pattern Matching

There are eight main verbs that work with patterns:

str_detect(x, pattern) tells you if there’s any match to the pattern.

strings <- c(
  "Not A Phone Number", 
  "236 233 4567", 
  "468-123-5523", 
  "Work: 250-762-5445; Home: 321.123.4567"
)
phone <- "([2-9][0-9][0-9])[- .]([0-9][0-9][0-9])[- .]([0-9][0-9][0-9][0-9])"
# Note that the parentheses are not needed for matching. However, we will take advantage of them in an example below.

str_detect(strings, phone)

## [1] FALSE  TRUE  TRUE  TRUE
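The phone pattern can be written more compactly with regular expression quantifiers, where {n} repeats the preceding character class n times. A sketch of an equivalent pattern (the name phone_compact is only for illustration):

```r
library(stringr)

strings <- c(
  "Not A Phone Number",
  "236 233 4567",
  "468-123-5523",
  "Work: 250-762-5445; Home: 321.123.4567"
)

# [0-9]{2} means two digits in a row, [0-9]{3} three, and so on.
phone_compact <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

detected <- str_detect(strings, phone_compact)
detected
## [1] FALSE  TRUE  TRUE  TRUE
```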

str_count(x, pattern) counts the number of matches.

str_count(strings, phone)

## [1] 0 1 1 2

str_locate(x, pattern) gives the position of the FIRST match (both start and end values).

str_locate(strings, phone)

##      start end
## [1,]    NA  NA
## [2,]     1  12
## [3,]     1  12
## [4,]     7  18
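The start and end positions can be passed back to str_sub() to pull out the matched text. This is a small sketch of that round trip. It should give the same result as str_extract(), with NA where nothing matched.

```r
library(stringr)

strings <- c(
  "Not A Phone Number",
  "236 233 4567",
  "468-123-5523",
  "Work: 250-762-5445; Home: 321.123.4567"
)
phone <- "([2-9][0-9][0-9])[- .]([0-9][0-9][0-9])[- .]([0-9][0-9][0-9][0-9])"

# Locate the first match in each string, then cut it out by position.
loc <- str_locate(strings, phone)
via_positions <- str_sub(strings, loc[, "start"], loc[, "end"])
via_positions
## [1] NA             "236 233 4567" "468-123-5523" "250-762-5445"
```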

Where a string contains multiple matches, use str_locate_all().

str_locate_all(strings, phone)

## [[1]]
##      start end
## 
## [[2]]
##      start end
## [1,]     1  12
## 
## [[3]]
##      start end
## [1,]     1  12
## 
## [[4]]
##      start end
## [1,]     7  18
## [2,]    27  38

str_extract(x, pattern) extracts the text of the FIRST match. We can use str_extract_all() to extract all matches.

str_extract(strings, phone)

## [1] NA             "236 233 4567" "468-123-5523" "250-762-5445"
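For comparison, here is a short sketch of str_extract_all(), which returns a list with one character vector per input string. Strings with no match give an empty vector rather than NA.

```r
library(stringr)

strings <- c(
  "Not A Phone Number",
  "236 233 4567",
  "468-123-5523",
  "Work: 250-762-5445; Home: 321.123.4567"
)
phone <- "([2-9][0-9][0-9])[- .]([0-9][0-9][0-9])[- .]([0-9][0-9][0-9][0-9])"

all_numbers <- str_extract_all(strings, phone)

# The fourth string contains two phone numbers.
all_numbers[[4]]
## [1] "250-762-5445" "321.123.4567"

# Flatten the list to get every number found across all strings.
flat <- unlist(all_numbers)
flat
## [1] "236 233 4567" "468-123-5523" "250-762-5445" "321.123.4567"
```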

str_match(x, pattern) extracts the parts of the FIRST match defined by parentheses. str_match_all() does the same for all matches.

str_match(strings, phone)

##      [,1]           [,2]  [,3]  [,4]  
## [1,] NA             NA    NA    NA    
## [2,] "236 233 4567" "236" "233" "4567"
## [3,] "468-123-5523" "468" "123" "5523"
## [4,] "250-762-5445" "250" "762" "5445"
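Because the result is a character matrix, a single capture group can be selected by column. A minimal sketch that pulls out the area codes (column 2, the first set of parentheses):

```r
library(stringr)

strings <- c(
  "Not A Phone Number",
  "236 233 4567",
  "468-123-5523",
  "Work: 250-762-5445; Home: 321.123.4567"
)
phone <- "([2-9][0-9][0-9])[- .]([0-9][0-9][0-9])[- .]([0-9][0-9][0-9][0-9])"

# Column 1 is the full match; columns 2 to 4 are the parenthesized groups.
area_codes <- str_match(strings, phone)[, 2]
area_codes
## [1] NA    "236" "468" "250"

# str_match_all() returns one matrix per string, with a row per match.
both_codes <- str_match_all(strings[4], phone)[[1]][, 2]
both_codes
## [1] "250" "321"
```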

str_replace(x, pattern, replacement) replaces the FIRST match with new text. str_replace_all() replaces all matches.

str_replace(strings, phone, "XXX-XXX-XXXX")

## [1] "Not A Phone Number"                    
## [2] "XXX-XXX-XXXX"                          
## [3] "XXX-XXX-XXXX"                          
## [4] "Work: XXX-XXX-XXXX; Home: 321.123.4567"
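The replacement text can refer back to the parenthesized groups as \\1, \\2, and \\3, which is one reason to keep the parentheses in the pattern. The sketch below reformats the numbers and then masks every match with str_replace_all().

```r
library(stringr)

strings <- c(
  "Not A Phone Number",
  "236 233 4567",
  "468-123-5523",
  "Work: 250-762-5445; Home: 321.123.4567"
)
phone <- "([2-9][0-9][0-9])[- .]([0-9][0-9][0-9])[- .]([0-9][0-9][0-9][0-9])"

# Rebuild each number in a uniform format from its three groups.
reformatted <- str_replace(strings[2:3], phone, "(\\1) \\2-\\3")
reformatted
## [1] "(236) 233-4567" "(468) 123-5523"

# Mask both numbers in the fourth string, not just the first.
masked <- str_replace_all(strings[4], phone, "XXX-XXX-XXXX")
masked
## [1] "Work: XXX-XXX-XXXX; Home: XXX-XXX-XXXX"
```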

str_split(x, pattern) splits a string into multiple pieces. Using the n argument, we can set the maximum number of pieces the string is split into.

str_split(c("a,b", "c,d,e"), ",")

## [[1]]
## [1] "a" "b"
## 
## [[2]]
## [1] "c" "d" "e"

str_split("250-762-5445", "-", n=2)

## [[1]]
## [1] "250"      "762-5445"
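When a rectangular result is more convenient than a list, str_split() can return a character matrix via simplify = TRUE. A short sketch, in which shorter rows are padded with empty strings:

```r
library(stringr)

# One row per input string, one column per piece.
pieces <- str_split(c("a,b", "c,d,e"), ",", simplify = TRUE)
pieces
##      [,1] [,2] [,3]
## [1,] "a"  "b"  ""
## [2,] "c"  "d"  "e"
```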

str_subset(x, pattern) returns the elements of x that contain a match.

str_subset(strings, phone)

## [1] "236 233 4567"                          
## [2] "468-123-5523"                          
## [3] "Work: 250-762-5445; Home: 321.123.4567"
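In other words, str_subset() is a shortcut for indexing with str_detect(). The sketch below checks that equivalence and also uses negate = TRUE to keep the non-matching elements instead. The negate argument is assumed to be available, which requires a reasonably recent stringr (1.4.0 or later).

```r
library(stringr)

strings <- c(
  "Not A Phone Number",
  "236 233 4567",
  "468-123-5523",
  "Work: 250-762-5445; Home: 321.123.4567"
)
phone <- "([2-9][0-9][0-9])[- .]([0-9][0-9][0-9])[- .]([0-9][0-9][0-9][0-9])"

# Subsetting by a logical vector gives the same elements.
equivalent <- identical(str_subset(strings, phone), strings[str_detect(strings, phone)])
equivalent
## [1] TRUE

# negate = TRUE keeps the elements that do NOT match (assumes stringr >= 1.4.0).
no_phone <- str_subset(strings, phone, negate = TRUE)
no_phone
## [1] "Not A Phone Number"
```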

Character Strings with stringr

OC Data Science
