Working with strings

stringr is a powerful package for working with strings. First, two important points should be considered. This package uses regular expressions (regex) to identify and match patterns in a string. You can refer to the regex handout to read more about them. Of particular significance is that stringr uses strings to represent regexes. So, whenever a pattern needs a regex escape such as \. (a literal dot) or \d (a digit), we must double the backslash in the string and write "\\." or "\\d". In a normal string, a single backslash is enough for an escape such as \n.

Another important point here is that the printed representation of a string in R is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines().
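The two points above can be seen in a small sketch. The pattern and the test strings are made up for illustration; only print(), writeLines(), and str_detect() are assumed.

```r
library(stringr)

# A regex that matches a literal dot is \. and, written as an R string, "\\."
dot_pattern <- "a\\.b"

print(dot_pattern)       # printed representation keeps the escape: "a\\.b"
writeLines(dot_pattern)  # raw contents of the string: a\.b

# The pattern matches "a.b" but not "axb", because the dot is escaped
dot_hits <- str_detect(c("a.b", "axb"), dot_pattern)
dot_hits                 # TRUE FALSE
```

Without the escape, the pattern "a.b" would match both strings, since an unescaped dot matches any character.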

Here I introduce some useful functions.

String length

str_length(): Technically, this returns the number of “code points” in a string. One code point usually corresponds to one character, but not always.
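A short sketch of what "not always" means here. The example strings are my own; the accented letter is written once as a single code point and once as a plain letter followed by a combining accent.

```r
library(stringr)

# Length of each string; NA stays NA
lens <- str_length(c("a", "R for data science", NA))
lens         # 1 18 NA

# The same visible character, stored in two different ways
composed   <- "\u00e9"    # one code point
decomposed <- "e\u0301"   # "e" plus a combining acute accent
accent_lens <- str_length(c(composed, decomposed))
accent_lens  # 1 2
```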

Combine strings

str_c(): Joins two or more vectors element-wise into a single character vector, optionally inserting sep between input vectors. If collapse is not NULL, it will be inserted between elements of the result, returning a character vector of length 1.

sep is placed between the corresponding elements of the input vectors. collapse joins the elements of the resulting vector into a single string.
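The difference between sep and collapse is easiest to see side by side. This is a minimal sketch with made-up inputs.

```r
library(stringr)

# sep goes between the individual strings that are passed in
pair <- str_c("x", "y", sep = ", ")
pair          # "x, y"

# With vectors, sep works element-wise and the result stays a vector
elementwise <- str_c(c("a", "b", "c"), c("1", "2", "3"), sep = "_")
elementwise   # "a_1" "b_2" "c_3"

# collapse joins the elements of one vector into a single string
collapsed <- str_c(c("a", "b", "c"), collapse = ", ")
collapsed     # "a, b, c"

# Both together: first sep within each element, then collapse across elements
both <- str_c(c("a", "b"), c("1", "2"), sep = "_", collapse = " + ")
both          # "a_1 + b_2"
```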

Substrings

str_sub(): Extracts and replaces substrings from a character vector, given start and end positions.

x <- c("Apple", "Banana", "Pear")  # example vector
str_sub(x, 2, 5)

str_sub() won’t fail if the string is too short: it will just return as much as possible.
The assignment form of str_sub() can be combined with other functions, such as str_to_lower(), to modify certain parts of a string.

str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
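Here is a small, self-contained sketch of these behaviours. The vector is an example of my own, and the negative positions (which count from the end of the string) are shown as an extra.

```r
library(stringr)

fruits_demo <- c("Apple", "Banana", "Pear")

first_three <- str_sub(fruits_demo, 1, 3)
first_three   # "App" "Ban" "Pea"

# Negative positions count backwards from the end of each string
last_three <- str_sub(fruits_demo, -3, -1)
last_three    # "ple" "ana" "ear"

# A string that is too short returns as much as possible, without an error
short <- str_sub("a", 1, 5)
short         # "a"

# Assignment form: lower-case only the first letter of each element
str_sub(fruits_demo, 1, 1) <- str_to_lower(str_sub(fruits_demo, 1, 1))
fruits_demo   # "apple" "banana" "pear"
```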

str_to_lower(), str_to_upper(), and str_to_title() change the case of a string. str_sort() sorts a character vector, and str_order() returns the order of its elements. All of these functions also take a locale option.
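A quick sketch of these helpers, with inputs of my own. The last call uses the Turkish locale only as an illustration of why locale matters (Turkish has a dotted and a dotless i); its result depends on the locale rules, so I do not annotate it.

```r
library(stringr)

title <- str_to_title("working with strings")
title      # "Working With Strings"

unsorted <- c("banana", "apple", "cherry")

sorted <- str_sort(unsorted)
sorted     # "apple" "banana" "cherry"

# str_order() gives the positions that would sort the vector
ordering <- str_order(unsorted)
ordering   # 2 1 3

# Case conversion follows the rules of the given locale
upper_en <- str_to_upper("i")
upper_en   # "I"
upper_tr <- str_to_upper("i", locale = "tr")
upper_tr
```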

Remove whitespace

str_trim(): Removes whitespace from start and end of string; str_squish() also reduces repeated whitespace inside a string.
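A minimal sketch of the difference, using a made-up string with extra spaces on both sides and in the middle:

```r
library(stringr)

messy <- "  too   much   space  "

trimmed <- str_trim(messy)
trimmed       # "too   much   space"

# side can be "both" (the default), "left", or "right"
left_only <- str_trim(messy, side = "left")
left_only     # "too   much   space  "

squished <- str_squish(messy)
squished      # "too much space"
```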

The above functions do not accept regexes; they work on plain strings. The functions that I introduce below use regexes to work with strings.

Detect matches, their frequencies, and locations

str_detect(): Detect the presence or absence of a pattern in a string.

Important point: str_detect() returns a logical vector indicating, for each element of a character vector, whether it matches the pattern.

str_count(): Count the number of matches in a string. It returns the number of matches per string.

str_subset(): Keep the strings that match a pattern. It returns the whole matching strings, not just the matched part.

str_which(): Find the indices of the elements of a string vector that match a pattern.

str_locate(): Locate the position of a pattern in a string. It gives the start and end positions of the first match in each string; str_locate_all() gives them for every match.

Here is an example to show how these functions work:

# The pattern matches if the word has a vowel
head(str_detect(words, "[aieou]"))

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

head(str_count(words, "[aieou]"))

## [1] 1 2 3 4 2 3

head(str_subset(words, "[aieou]"))

## [1] "a"        "able"     "about"    "absolute" "accept"   "account"

head(str_which(words, "[aieou]"))

## [1] 1 2 3 4 5 6

head(str_locate(words, "[aieou]"))

##      start end
## [1,]     1   1
## [2,]     1   1
## [3,]     1   1
## [4,]     1   1
## [5,]     1   1
## [6,]     1   1

str_count() returns integers, and the logical vector returned by str_detect() is coerced to numbers in a numeric context (TRUE becomes 1 and FALSE becomes 0). So, we can use statistical functions like sum() and mean() with both. For instance:

# How many common words start with t?
sum(str_detect(words, "^t"))

## [1] 65

# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))

## [1] 1.991837

Note the difference. With str_detect(), sum() gives the number of words that match the pattern and mean() gives their proportion. With str_count(), sum() and mean() give the total and the average number of matches per word.
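The same difference on a tiny vector, where the numbers are easy to check by hand. The vector is my own example rather than the built-in words data.

```r
library(stringr)

fruits_demo <- c("apple", "banana", "pear", "kiwi")

has_a <- str_detect(fruits_demo, "a")   # one logical per string
has_a         # TRUE TRUE TRUE FALSE

n_a <- str_count(fruits_demo, "a")      # matches per string
n_a           # 1 3 1 0

sum(has_a)    # 3    strings that contain an "a"
mean(has_a)   # 0.75 proportion of strings that contain an "a"
sum(n_a)      # 5    total number of "a"s
mean(n_a)     # 1.25 average number of "a"s per string

matching <- str_subset(fruits_demo, "a")
matching      # "apple" "banana" "pear"

positions <- str_which(fruits_demo, "a")
positions     # 1 2 3
```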

Extract individual and group matches

str_extract(): Extract matching patterns from a string. It returns only the matched text of the first match in each string (NA if there is no match). To get the whole string containing the match, you should use str_subset().

str_extract_all() returns a list. If you use simplify = TRUE, str_extract_all() will return a matrix with short matches expanded to the same length as the longest.

str_match(): Extract matched groups from a string. We can use () to group parts of a pattern. str_extract() returns only the complete match as a single string, so to get each individual component we should use str_match(). Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group.
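To compare the three functions, here is a sketch on three short phrases of my own, using the same article-plus-noun pattern as the tidyr example in this section.

```r
library(stringr)

phrases <- c("the cat", "a dog and the bird", "none")
article_noun <- "(a|the) ([^ ]+)"

# First complete match per string, NA where nothing matches
first_match <- str_extract(phrases, article_noun)
first_match   # "the cat" "a dog" NA

# Every match per string, returned as a list
all_matches <- str_extract_all(phrases, article_noun)
all_matches[[2]]   # "a dog" "the bird"

# simplify = TRUE gives a matrix, padded with "" where there are fewer matches
as_matrix <- str_extract_all(phrases, article_noun, simplify = TRUE)
dim(as_matrix)     # 3 2

# Complete match in column 1, then one column per group
groups <- str_match(phrases, article_noun)
groups[2, ]        # "a dog" "a" "dog"
```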

Hint: If your data is in a tibble, it’s often easier to use tidyr::extract(). It works like str_match() but requires you to name the matches, which are then placed in new columns. Example:

tibble(sentence = sentences) %>% 
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)", 
    remove = FALSE
  )
#> # A tibble: 720 x 3
#>   sentence                                    article noun   
#>   <chr>                                       <chr>   <chr>  
#> 1 The birch canoe slid on the smooth planks.  the     smooth 
#> 2 Glue the sheet to the dark blue background. the     sheet  
#> 3 It's easy to tell the depth of a well.      the     depth  
#> 4 These days a chicken leg is a rare dish.    a       chicken
#> 5 Rice is often served in round bowls.        <NA>    <NA>   
#> 6 The juice of lemons makes fine punch.       <NA>    <NA>   
#> # … with 714 more rows

str_extract_all() and str_match_all() return lists, so it is important to be comfortable working with lists.

Replace patterns

str_replace(): Replace matched patterns in a string.

We can use a fixed string as the replacement. We can also use regexes and backreferences.

For instance:

# Switch the first and last letters in words.
# We use backreferences to do this task.
head(str_replace(words, "(^.)(.+)(.$)", "\\3\\2\\1"))

## [1] "a"        "ebla"     "tboua"    "ebsoluta" "tccepa"   "tccouna"
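A few more replacement patterns in one sketch. The inputs are made up, and str_replace_all() is included as an addition to these notes: it is the variant of str_replace() that replaces every match instead of only the first.

```r
library(stringr)

# str_replace() changes only the first match in each string
first_only <- str_replace("banana", "a", "A")
first_only   # "bAnana"

# str_replace_all() changes every match
every <- str_replace_all("banana", "a", "A")
every        # "bAnAnA"

# Backreferences \\1 and \\2 refer to the first and second group
swapped <- str_replace("John Smith", "(\\w+) (\\w+)", "\\2, \\1")
swapped      # "Smith, John"

# A named vector performs several replacements in one call
spelled <- str_replace_all("1 house 2 cars", c("1" = "one", "2" = "two"))
spelled      # "one house two cars"
```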

Splitting strings

str_split(): Split a string up into pieces. It returns a list. Like the other stringr functions that return a list, you can use simplify = TRUE to return a matrix.

You can also request a maximum number of pieces with the n argument.
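A minimal sketch of the three options, using comma-separated strings of my own:

```r
library(stringr)

# The result is a list with one element per input string
pieces <- str_split("a,b,c", ",")
pieces[[1]]    # "a" "b" "c"

# n limits the number of pieces; the last piece keeps the rest of the string
two_pieces <- str_split("a,b,c", ",", n = 2)
two_pieces[[1]]   # "a" "b,c"

# simplify = TRUE returns a matrix, padded with "" for shorter strings
split_matrix <- str_split(c("a,b,c", "d,e"), ",", simplify = TRUE)
dim(split_matrix)   # 2 3
split_matrix[2, ]   # "d" "e" ""
```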

Stringr

Hossein Kermani

May 23, 2021