Working with strings
stringr is a powerful package for working with strings. Two points are worth noting first. The package uses regular expressions (regex) to identify and match patterns in a string; see the regex handout to read more about them. Of particular significance is that stringr uses strings to represent regexes. So whenever a regex needs a backslash, for example \. to match a literal dot, we must write a double backslash in the R string: "\\.". In an ordinary string, an escape such as \n needs just one backslash.
Another important point is that the printed representation of a string in R is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines().
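As a small illustration of both points, here is a sketch that assumes only that stringr is installed; the variable name dot is made up for the example.

```r
library(stringr)

# A regex that matches a literal dot: the regex is \. and the R string is "\\."
dot <- "\\."
print(dot)        # printed representation shows the escape: "\\."
writeLines(dot)   # raw contents of the string: \.

escaped   <- str_detect(c("a.b", "ab"), dot)   # TRUE FALSE
unescaped <- str_detect(c("a.b", "ab"), ".")   # TRUE TRUE: a bare . matches any character
escaped
unescaped
```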
Here I introduce some useful functions.
String length
str_length(): Technically this returns the number of “code points” in a string. One code point usually corresponds to one character, but not always.
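A short sketch of the code point behaviour; the example strings are chosen here for illustration. A u with an umlaut can be stored as one code point or as a plain u followed by a combining accent, and the two forms have different lengths.

```r
library(stringr)

basic <- str_length(c("a", "stringr", NA))   # 1 7 NA
basic

# Two ways of writing the same visible character
u <- c("\u00fc", "u\u0308")
lens <- str_length(u)   # 1 2: the second form is u plus a combining accent
lens
```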
Combine strings
str_c(): Joins two or more vectors element-wise into a single character vector, optionally inserting sep between input vectors. If collapse is not NULL, it will be inserted between elements of the result, returning a character vector of length 1.
sep is placed between the individual strings that are joined element-wise. collapse is used to turn a vector of strings into a single string.
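The difference between sep and collapse is easiest to see side by side. A minimal sketch with made-up inputs:

```r
library(stringr)

# sep goes between the input vectors, element by element
with_sep <- str_c("x", c("a", "b", "c"), sep = "-")
with_sep        # "x-a" "x-b" "x-c"

# collapse joins the elements of the result into one string
collapsed <- str_c(c("a", "b", "c"), collapse = ", ")
collapsed       # "a, b, c"

# both together: join element-wise first, then collapse
both <- str_c("x", c("a", "b", "c"), sep = "-", collapse = " + ")
both            # "x-a + x-b + x-c"
```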
Substrings
str_sub(): Extract and replace substrings from a character vector.
str_sub(x, 2, 5)
str_sub() won’t fail if the string is too short: it will just return as much as possible. str_sub() can be combined with other functions such as str_to_lower() to modify certain parts of a string.
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
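A runnable sketch of extracting and assigning with str_sub(); the vector x is made-up data.

```r
library(stringr)

x <- c("Apple", "Banana", "Fig")

# Characters 2 to 5; "Fig" is too short, so it returns what it has
middle <- str_sub(x, 2, 5)
middle   # "pple" "anan" "ig"

# Assignment form: lower-case only the first letter of each string
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x        # "apple" "banana" "fig"
```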
The functions str_to_lower(), str_to_upper(), and str_to_title() do what their names say. The same goes for str_sort() and str_order(). These functions also take a locale argument.
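A brief sketch of the case and sorting helpers, with the locale set explicitly to "en". The vector fruit_names is made up for the example.

```r
library(stringr)

upper <- str_to_upper("hello world")   # "HELLO WORLD"
title <- str_to_title("hello world")   # "Hello World"

fruit_names <- c("banana", "apple", "cherry")
sorted <- str_sort(fruit_names, locale = "en")    # the sorted strings
ord    <- str_order(fruit_names, locale = "en")   # the indices that would sort them
sorted   # "apple" "banana" "cherry"
ord      # 2 1 3
```

str_sort() returns the sorted values, while str_order() returns positions, so fruit_names[ord] gives the same result as sorted.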
Remove whitespace
str_trim(): Removes whitespace from the start and end of a string; str_squish() also reduces repeated whitespace inside a string.
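A minimal sketch contrasting the two, using a made-up string:

```r
library(stringr)

messy <- "  too   many    spaces  "

trimmed  <- str_trim(messy)     # only the ends are cleaned
squished <- str_squish(messy)   # ends cleaned and inner runs reduced to one space
trimmed    # "too   many    spaces"
squished   # "too many spaces"
```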
The functions above do not take regexes; they work on the strings directly. The functions introduced below use regexes to work with strings.
Detect matches, their frequencies, and locations
str_detect(): Detect the presence or absence of a pattern in a string.
Important point: str_detect() returns a logical vector indicating whether each element of a character vector matches the pattern.
str_count(): Count the number of matches in a string. It returns the number of matches per string.
str_subset(): Keep the strings that match a pattern. It returns the whole strings that contain a match, not just the matched part.
str_which(): Find the positions (indices) of the matching strings in a character vector.
str_locate(): Locate the position of a pattern within a string. It gives the start and end positions of the first match in each string; str_locate_all() gives them for every match.
Here is an example to show how these functions work:
# the pattern matches if the word contains a vowel
head(str_detect(words, "[aieou]"))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
head(str_count(words, "[aieou]"))
## [1] 1 2 3 4 2 3
head(str_subset(words, "[aieou]"))
## [1] "a" "able" "about" "absolute" "accept" "account"
head(str_which(words, "[aieou]"))
## [1] 1 2 3 4 5 6
head(str_locate(words, "[aieou]"))
##      start end
## [1,]     1   1
## [2,]     1   1
## [3,]     1   1
## [4,]     1   1
## [5,]     1   1
## [6,]     1   1
str_detect() returns a logical vector, and R treats TRUE as 1 and FALSE as 0 in arithmetic. str_count() returns integers. So we can use numeric summaries such as sum() and mean() with both. For instance:
# How many common words start with t?
sum(str_detect(words, "^t"))
## [1] 65
# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
## [1] 1.991837
Note the difference: summing str_detect() counts the number of words that match the pattern, whereas str_count() gives the number of matches within each word, which can then be summed or averaged.
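The same functions on a tiny made-up vector, where the results are easy to check by eye. This sketch uses its own data rather than words, and one string deliberately has no vowel.

```r
library(stringr)

x <- c("apple", "sky", "banana")   # made-up data
vowel <- "[aeiou]"

has_vowel <- str_detect(x, vowel)   # TRUE FALSE TRUE
n_vowels  <- str_count(x, vowel)    # 2 0 3
kept      <- str_subset(x, vowel)   # "apple" "banana"
idx       <- str_which(x, vowel)    # 1 3
loc       <- str_locate(x, vowel)   # first vowel per string; NA where there is none
loc

n_matching <- sum(has_vowel)   # 2 strings contain a vowel
n_total    <- sum(n_vowels)    # 5 vowels across all strings
```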
Extract individual and group matches
str_extract(): Extract matching patterns from a string. It returns only the matched part, taking the first match in each string. To get the whole string containing the match, use str_subset().
str_extract_all() returns a list with every match for each string. If you use simplify = TRUE, str_extract_all() returns a matrix instead, with short rows padded to the same length as the longest.
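A small sketch of the three return shapes. The input vector and the colour pattern are made up for the example.

```r
library(stringr)

x <- c("red and blue", "green", "no colour")
colour <- "red|blue|green"

first_only <- str_extract(x, colour)       # one value per string: "red" "green" NA
as_list    <- str_extract_all(x, colour)   # list: c("red", "blue"), "green", character(0)
as_matrix  <- str_extract_all(x, colour, simplify = TRUE)   # 3 x 2 matrix padded with ""

first_only
as_list
as_matrix
```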
str_match(): Extract matched groups from a string. We can use parentheses () to group parts of a pattern. str_extract() gives us only the complete match as one string, even when the pattern contains groups. To get each individual component, use str_match(). Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group.
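To compare the two, here is a sketch with an article-plus-word pattern on made-up phrases:

```r
library(stringr)

phrases <- c("the cat sat", "a dog", "birds sing")   # made-up data
pattern <- "(a|the) ([^ ]+)"

whole <- str_extract(phrases, pattern)   # complete match only: "the cat" "a dog" NA
m     <- str_match(phrases, pattern)     # matrix: complete match, then one column per group

whole
m
m[, 2]   # first group (the article): "the" "a" NA
m[, 3]   # second group (the next word): "cat" "dog" NA
```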
Hint: If your data is in a tibble, it’s often easier to use tidyr::extract(). It works like str_match() but requires you to name the matches, which are then placed in new columns. Example:
library(tidyverse)  # provides tibble(), %>%, tidyr, and stringr's sentences data
tibble(sentence = sentences) %>%
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)",
    remove = FALSE
  )
#> # A tibble: 720 x 3
#>   sentence                                    article noun
#>   <chr>                                       <chr>   <chr>
#> 1 The birch canoe slid on the smooth planks.  the     smooth
#> 2 Glue the sheet to the dark blue background. the     sheet
#> 3 It's easy to tell the depth of a well.      the     depth
#> 4 These days a chicken leg is a rare dish.    a       chicken
#> 5 Rice is often served in round bowls.        <NA>    <NA>
#> 6 The juice of lemons makes fine punch.       <NA>    <NA>
#> # … with 714 more rows
str_extract_all() and str_match_all() return lists, so it helps to be comfortable working with lists.
Replace patterns
str_replace(): Replace the first matched pattern in each string.
We can use a fixed string as the replacement. We can also use regexes and backreferences.
For instance:
# Switch the first and last letters in words.
# We use backreferences to do this.
head(str_replace(words, "(^.)(.+)(.$)", "\\3\\2\\1"))
## [1] "a" "ebla" "tboua" "ebsoluta" "tccepa" "tccouna"
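A further sketch on made-up strings, showing a fixed replacement and a backreference. It also uses str_replace_all(), the stringr variant that replaces every match rather than only the first.

```r
library(stringr)

x <- c("apple", "pear")   # made-up data

first_vowel <- str_replace(x, "[aeiou]", "-")       # only the first vowel in each string
all_vowels  <- str_replace_all(x, "[aeiou]", "-")   # every vowel
first_vowel   # "-pple" "p-ar"
all_vowels    # "-ppl-" "p--r"

# Backreferences: \\1 and \\2 refer to the first and second groups
swapped <- str_replace("hello world", "(\\w+) (\\w+)", "\\2 \\1")
swapped       # "world hello"
```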
Splitting strings
str_split(): Split a string up into pieces. It returns a list. Like the other stringr functions that return a list, you can use simplify = TRUE to return a matrix.
You can also request a maximum number of pieces with the n argument.
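A minimal sketch of the three forms, using made-up comma-separated strings:

```r
library(stringr)

x <- c("a,b,c", "d,e")   # made-up data

pieces <- str_split(x, ",")                    # list: c("a", "b", "c") and c("d", "e")
mat    <- str_split(x, ",", simplify = TRUE)   # 2 x 3 matrix, short rows padded with ""
capped <- str_split("a,b,c", ",", n = 2)       # at most 2 pieces: "a" and "b,c"

pieces
mat
capped
```

With n = 2 the split stops after the first separator, so the remainder of the string stays together in the last piece.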