Introduction

I am practicing working with strings using the online version of the book “R for Data Science.”

Packages

library(stringr)
## Warning: package 'stringr' was built under R version 4.0.4
library(stringi)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.4
## v tidyr   1.1.2     v forcats 0.5.1
## v readr   1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

14.2.5 Exercise

  1. In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

    paste() joins its arguments into a single string, separating them with a space by default, while paste0() uses no separator by default. Both are equivalent to str_c() from stringr, whose default separator is also empty, so it behaves most like paste0(). paste() and paste0() convert NA to the string “NA” and concatenate it with the other strings, whereas str_c() returns NA whenever any of its inputs is NA.
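
The differences described above can be checked directly. This is a minimal sketch with made-up inputs; the variable names are only for illustration.

```r
library(stringr)

# Default separators: paste() inserts a space, paste0() and str_c() insert nothing.
spaced   <- paste("a", "b")     # "a b"
unspaced <- paste0("a", "b")    # "ab"
joined   <- str_c("a", "b")     # "ab"

# NA handling: paste() and paste0() turn NA into the text "NA",
# while str_c() propagates the missing value.
paste_na  <- paste("a", NA)     # "a NA"
paste0_na <- paste0("a", NA)    # "aNA"
str_c_na  <- str_c("a", NA)     # NA

# str_replace_na() gives str_c() the paste0()-style behaviour when that is wanted.
str_c_text <- str_c("a", str_replace_na(NA_character_))  # "aNA"
```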

  2. In your own words, describe the difference between the sep and collapse arguments to str_c().

    The sep and collapse arguments control what is placed between strings when str_c() combines them. sep is the string inserted between the arguments passed to str_c(), element by element, so the result is still a vector. collapse is the string used to join the elements of that resulting vector into a single string.
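
A short sketch of the two arguments side by side, using made-up vectors named first and second:

```r
library(stringr)

first  <- c("a", "b", "c")
second <- c("x", "y", "z")

# sep goes between the arguments, element by element; the result is a vector.
by_sep <- str_c(first, second, sep = "-")                  # "a-x" "b-y" "c-z"

# collapse joins the elements of one vector into a single string.
by_collapse <- str_c(first, collapse = ", ")               # "a, b, c"

# Both together: sep is applied first, then collapse.
both <- str_c(first, second, sep = "-", collapse = ", ")   # "a-x, b-y, c-z"
```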

  3. Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

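d <- ceiling(str_length("character") / 2)
str_sub("character", d, d)
## [1] "a"

If the string has an even number of characters, there is no single middle character. With n characters, ceiling(n / 2) and floor(n / 2) both give n / 2, which selects the first of the two middle characters; the second is at position n / 2 + 1, and str_sub(x, n / 2, n / 2 + 1) returns both.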
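
One possible way to handle odd and even lengths in a single call is sketched below. The helper name middle_chars() is hypothetical, and returning both middle characters for even lengths is a choice made here, not something the exercise prescribes.

```r
library(stringr)

# Hypothetical helper: returns the middle character for odd lengths
# and the two middle characters for even lengths.
middle_chars <- function(x) {
  n <- str_length(x)
  # For odd n both positions coincide; for even n they are n/2 and n/2 + 1.
  str_sub(x, ceiling(n / 2), floor(n / 2) + 1)
}

odd_result  <- middle_chars("character")  # "a"
even_result <- middle_chars("even")       # "ve"
```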

  4. What does str_wrap() do? When might you want to use it?

    str_wrap() breaks a long string into lines of at most a given width by inserting newlines at whitespace. You would use it when you want to format long text as a paragraph, for example in a printed message or a plot label.

  5. What does str_trim() do? What’s the opposite of str_trim()?

    str_trim() removes whitespace from the start and end of a string. str_pad() is the opposite of str_trim(): it adds whitespace (or another padding character) until the string reaches a given width.
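
A small sketch of padding, trimming, and wrapping on made-up inputs. The wrapped text is printed rather than shown here, since the exact line breaks depend on the wrapping algorithm.

```r
library(stringr)

# str_pad() adds whitespace up to a total width; str_trim() removes it again.
padded  <- str_pad("abc", width = 7, side = "both")  # "  abc  "
trimmed <- str_trim(padded)                           # "abc"
left    <- str_trim("  abc  ", side = "left")        # "abc  "

# str_wrap() inserts newlines so that lines stay within the requested width.
wrapped <- str_wrap("The quick brown fox jumps over the lazy dog", width = 20)
writeLines(wrapped)
```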

  6. Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.

b <- str_c(c("a", "b", "c"), collapse = ", ")
str_replace(b, ", c$", ", and c")
## [1] "a, b, and c"
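
The exercise asks for a function that also copes with vectors of length 0, 1, or 2. A minimal sketch follows; the name str_commasep() is hypothetical, and returning an empty string for an empty vector is an assumption about what the caller would want.

```r
library(stringr)

# Hypothetical helper, not part of stringr.
str_commasep <- function(x) {
  n <- length(x)
  if (n == 0) {
    ""                                        # assumption: empty input gives ""
  } else if (n == 1) {
    x
  } else if (n == 2) {
    str_c(x[[1]], "and", x[[2]], sep = " ")   # no comma between two items
  } else {
    # Join all but the last item with commas, then append ", and <last>".
    head_part <- str_c(x[-n], collapse = ", ")
    str_c(head_part, ", and ", x[[n]])
  }
}

str_commasep(c("a", "b", "c"))  # "a, b, and c"
```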

14.3.1.1 Exercises

  1. Explain why each of these strings don’t match a \: "\", "\\", "\\\".

    "\" In R source, the single backslash escapes the closing quote, so the string literal is never terminated and no pattern is created.

    "\\" The two backslashes produce a single backslash in the regular expression. There it acts as the escape character for whatever follows, so it does not match a literal backslash in a string.

    "\\\" The first two backslashes produce a single backslash, and the third escapes the closing quote, so the string literal is again unterminated. Matching a literal backslash takes four backslashes in the string: "\\\\".

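The two layers of escaping can be seen by printing a string with writeLines(), which shows the characters it really contains. A small sketch on a made-up string x:

```r
library(stringr)

x <- "a\\b"          # the string a\b: one backslash, written as two in R source
writeLines(x)        # prints a\b
n_chars <- str_length(x)                   # 3

# The regex \\ matches one literal backslash; as an R string that is "\\\\".
with_regex <- str_detect(x, "\\\\")        # TRUE

# fixed() skips the regex layer, so only the string-level escape is needed.
with_fixed <- str_detect(x, fixed("\\"))   # TRUE
```
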
  2. How would you match the sequence "'\?

str_view("\"'\\", "\"'\\\\", match = TRUE)
  3. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

    The pattern matches a literal dot followed by any character, repeated three times, as in “.a.b.c”. As a string it is written "\\..\\..\\..", with each backslash doubled.
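
A quick check of that pattern against a few made-up test strings:

```r
library(stringr)

pattern <- "\\..\\..\\.."
writeLines(pattern)   # prints the regex itself: \..\..\..

candidates <- c(".a.b.c", "x.1.2.3y", "abc", "a.b.c")
hits <- str_detect(candidates, pattern)
# TRUE TRUE FALSE FALSE: "a.b.c" has only two dots that are followed by a character.
```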

14.3.2.1 Exercises

  1. How would you match the literal string "$^$"?
str_view("$^$", "^\\$\\^\\$$")
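
Because $ and ^ are anchors in a regular expression, they have to be escaped to be matched literally. A small sketch comparing an escaped, fully anchored pattern with fixed(), on made-up inputs:

```r
library(stringr)

# Escaped $ ^ $ between real anchors: matches only the exact string $^$.
exact <- "^\\$\\^\\$$"
exact_hits <- str_detect(c("$^$", "ab$^$cd"), exact)   # TRUE FALSE

# fixed() treats the pattern as plain text, so it finds $^$ anywhere in the string.
anywhere <- str_detect("ab$^$cd", fixed("$^$"))        # TRUE
```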
  2. Given the corpus of common words in stringr::words, create regular expressions that find all words that:

    1. Start with “y”.
str_view_all(words, "^y", match = TRUE)
2) End with “x”
str_view_all(words, "x$", match = TRUE)
3) Are exactly three letters long. (Don’t cheat by using str_length()!)
str_view_all(words, "^...$", match = TRUE)
4) Have seven letters or more.

Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.
str_view_all(words, ".......+", match = TRUE)

14.3.3.1 Exercises

  1. Create regular expressions to find all words that:

    1. Start with a vowel.
str_view_all(words,"^[aeiou]", match = TRUE )
2) That only contain consonants. (Hint: thinking about matching “not”-vowels.)
str_view_all(words,"^[^aeiou]*$", match = TRUE )
3) End with ed, but not with eed.
str_view_all(words,"[^e]ed$", match = TRUE )
4) End with ing or ise.
str_view_all(words,"ing$|ise$", match = TRUE )
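  2. Empirically verify the rule “i before e except after c”.
str_view_all(words, "cei|[^c]ie", match = TRUE)  # words that follow the rule
str_view_all(words, "cie|[^c]ei", match = TRUE)  # words that break the rule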
  3. Is “q” always followed by a “u”?

    In the words dataset the letter q is always followed by the letter u.

str_view_all(words, "q[^u]", match = TRUE)
  4. Write a regular expression that matches a word if it’s probably written in British English, not American English.

    “ou|yse$”

    The word color in British English is written as colour, and the word analyze is written as analyse. This is only a rough heuristic: “ou” also matches words that are spelled the same way in both varieties, such as “about” and “house”.

  5. Create a regular expression that will match telephone numbers as commonly written in your country.

str_view("301-123-4567","\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d")

14.3.4.1 Exercises

  1. Describe the equivalents of ?, +, * in {m,n} form.

    ? would be equivalent to {0,1} because the expression matches zero or one time.

    + would be equivalent to {1,} because the expression matches one or more times.

    * would be equivalent to {0,} because the expression matches zero or more times.
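
These equivalences can be checked by comparing match results on a few inputs. A minimal sketch; the helper same_matches() and the test vector x are made up for illustration.

```r
library(stringr)

x <- c("", "a", "aa", "aaa")

# Hypothetical helper: TRUE when two patterns agree on every element of x.
same_matches <- function(p1, p2) {
  identical(str_detect(x, p1), str_detect(x, p2))
}

optional     <- same_matches("^a?$", "^a{0,1}$")  # ?  versus {0,1}
one_or_more  <- same_matches("^a+$", "^a{1,}$")   # +  versus {1,}
zero_or_more <- same_matches("^a*$", "^a{0,}$")   # *  versus {0,}
```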

  2. Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

    1. ^.*$

    The expression matches any string: the start of the string, followed by zero or more characters (other than a newline), followed by the end of the string.

    2. "\\{.+\\}"

    The string defines the regular expression \{.+\}, which matches an opening curly bracket, one or more characters, and a closing curly bracket.

    3. \d{4}-\d{2}-\d{2}

    The expression matches a sequence of four digits followed by a dash, followed by two digits followed by a dash, and followed by two digits, such as a date written like 2021-04-14.

    4. "\\\\{4}"

    The string defines the regular expression \\{4}, which matches four backslashes in a row.
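
Each of these patterns can be tried on a made-up input. In the sketch below every pattern is written as an R string, so each regex backslash is doubled.

```r
library(stringr)

any_string <- str_detect("anything at all", "^.*$")        # TRUE

braces     <- str_detect(c("{abc}", "{}"), "\\{.+\\}")     # TRUE FALSE

date_like  <- str_detect(c("2021-04-14", "21-4-14"),
                         "\\d{4}-\\d{2}-\\d{2}")           # TRUE FALSE

# "a\\\\\\\\b" is the string a\\\\b, i.e. four backslashes between a and b.
four_slash <- str_detect(c("a\\\\\\\\b", "a\\\\b"), "\\\\{4}")  # TRUE FALSE
```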
  3. Create regular expressions to find all words that:

    1. Start with three consonants.
str_view_all(words, "^[^aeiou]{3}", match = TRUE)
2) Have three or more vowels in a row.
str_view_all(words, "[aeiou]{3,}", match = TRUE)
3) Have two or more vowel-consonant pairs in a row.
str_view_all(words, "([aeiou][^aeiou]){2,}", match = TRUE)

  4. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.

14.3.5.1 Exercise

  1. Describe, in words, what these expressions will match:

    1. (.)\1\1

    The expression will match the same character three times in a row, as in “aaa”.

    2. "(.)(.)\\2\\1"

    The expression will match a pair of characters followed by the same pair in reverse order, as in “abba”.

    3. (..)\1

    The expression will match any two characters repeated immediately, as in “a1a1”.

    4. "(.).\\1.\\1"

    The expression will match a character that occurs three times, with each occurrence separated from the next by exactly one other character, as in “abaca”.

    5. "(.)(.)(.).*\\3\\2\\1"

    The expression will match three characters, followed by zero or more characters of any kind, followed by the same three characters in reverse order, as in “abcxyzcba”. The match can occur anywhere in the string because the pattern is not anchored.
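
The backreference patterns can be tried on small made-up inputs. In this sketch every pattern is written as an R string, so \1 becomes "\\1".

```r
library(stringr)

triple     <- str_detect(c("aaa", "aab"), "(.)\\1\\1")          # TRUE FALSE
mirrored   <- str_detect(c("abba", "abab"), "(.)(.)\\2\\1")     # TRUE FALSE
pair_twice <- str_detect(c("abab", "abba"), "(..)\\1")          # TRUE FALSE
every_2nd  <- str_detect(c("abaca", "abcde"), "(.).\\1.\\1")    # TRUE FALSE
reversed   <- str_detect(c("abcxyzcba", "abccba", "abcabc"),
                         "(.)(.)(.).*\\3\\2\\1")                # TRUE TRUE FALSE
```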

  2. Construct regular expressions to match words that:

1) Start and end with the same character.
str_view_all(words,"^(.)((.*\\1$)|\\1?$)",  match = TRUE )
2) Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
str_view_all(words, "(..).*\\1", match = TRUE)
3) Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
str_view_all(words, "(.).*\\1.*\\1", match = TRUE)
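
A sketch that checks patterns for these three tasks on a few hand-picked words. The variable names are made up, and the last two lines contrast a pattern that allows any gap between the repeated letters with one that requires a gap of exactly one character.

```r
library(stringr)

same_ends     <- "^(.)((.*\\1$)|\\1?$)"
repeated_pair <- "(..).*\\1"
three_times   <- "(.).*\\1.*\\1"

ends_hits  <- str_detect(c("a", "abba", "ab"), same_ends)        # TRUE TRUE FALSE
pair_hits  <- str_detect(c("church", "chair"), repeated_pair)   # TRUE FALSE
three_hits <- str_detect(c("eleven", "available", "apple"),
                         three_times)                            # TRUE TRUE FALSE

# With exactly one character between repeats, "available" is missed,
# because its three a's are not evenly spaced.
strict_hits <- str_detect(c("eleven", "available"), "(.).\\1.\\1")  # TRUE FALSE
```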

14.4.1.1 Exercises

  1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

    1. Find all words that start or end with x.
str_view_all(words, "^x|x$", match = TRUE)
xstart <- str_detect(words, "^x")
xend <- str_detect(words, "x$")
words[xstart|xend]
## [1] "box" "sex" "six" "tax"
2) Find all words that start with a vowel and end with a consonant.
str_view_all(words, "^[aeiou].*[^aeiou]$", match = TRUE)
vow <- str_detect(words, "^[aieou]")
con <- str_detect(words, "[^aeiou]$")
words[vow & con] %>%
  head(10)
##  [1] "about"   "accept"  "account" "across"  "act"     "actual"  "add"    
##  [8] "address" "admit"   "affect"
3) Are there any words that contain at least one of each different vowel?
a <- str_detect(words, "[a]")
e <- str_detect(words, "[e]")
i <- str_detect(words, "[i]")
o <- str_detect(words, "[o]")
u <- str_detect(words, "[u]")
words[a & e & i & o & u]
## character(0)
  2. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
max(str_count(words, "[aeiou]"))
## [1] 5
words[str_count(words, "[aeiou]") == 5]
## [1] "appropriate" "associate"   "available"   "colleague"   "encourage"  
## [6] "experience"  "individual"  "television"
prop <- str_count(words, "[aeiou]")/str_length(words)
words[prop == 1]
## [1] "a"

14.4.2.1 Exercises

  1. In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c("\\b(", str_c(colours, collapse = "|"), ")\\b")

more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
  2. From the Harvard sentences data, extract:

    1. The first word from each sentence.
str_extract(sentences, "[A-Za-z]+") %>%
  head(10)
##  [1] "The"   "Glue"  "It"    "These" "Rice"  "The"   "The"   "The"   "Four" 
## [10] "Large"
2) All words ending in ing.
str_extract_all(sentences, "\\b[A-Za-z]+ing\\b") %>%  # every match, not just the first per sentence
  unlist() %>%                                        # drops sentences with no match
  head(10)
3) All plurals.
str_extract(sentences,"\\b[A-Za-z]{3,}s\\b") %>%
  head(10)
##  [1] "planks"    NA          NA          "days"      "bowls"     "lemons"   
##  [7] NA          "hogs"      "hours"     "stockings"

14.4.3.1 Exercises

  1. Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.
num <-  "\\b(one|two|three|four|five|six|seven|eight|nine|ten) ([^ ]+)"

has_num <- sentences %>%
  str_subset(num) %>%
  head(10)

has_num %>%
  str_match(num)
##       [,1]           [,2]    [,3]      
##  [1,] "seven books"  "seven" "books"   
##  [2,] "two met"      "two"   "met"     
##  [3,] "two factors"  "two"   "factors" 
##  [4,] "three lists"  "three" "lists"   
##  [5,] "seven is"     "seven" "is"      
##  [6,] "two when"     "two"   "when"    
##  [7,] "ten inches."  "ten"   "inches." 
##  [8,] "one war"      "one"   "war"     
##  [9,] "one button"   "one"   "button"  
## [10,] "six minutes." "six"   "minutes."
  2. Find all contractions. Separate out the pieces before and after the apostrophe.
str_match(sentences, "([A-Za-z]+)'([A-Za-z]+)") %>%
  head(10)
##       [,1]   [,2] [,3]
##  [1,] NA     NA   NA  
##  [2,] NA     NA   NA  
##  [3,] "It's" "It" "s" 
##  [4,] NA     NA   NA  
##  [5,] NA     NA   NA  
##  [6,] NA     NA   NA  
##  [7,] NA     NA   NA  
##  [8,] NA     NA   NA  
##  [9,] NA     NA   NA  
## [10,] NA     NA   NA

14.4.4.1 Exercises

  1. Replace all forward slashes in a string with backslashes.
str_replace_all("a/b/c", "/", "\\\\")
## [1] "a\\b\\c"
  2. Implement a simple version of str_to_lower() using replace_all().
upper_to_lower <- setNames(letters, LETTERS)  # named vector: "A" = "a", ..., "Z" = "z"
str_replace_all("The Apple", upper_to_lower)
## [1] "the apple"
  3. Switch the first and last letters in words. Which of those strings are still words?

    Some examples of strings which are still words are sub, tub, read, and lead.

words %>% 
  str_replace("^([A-Za-z])(.*)([A-Za-z])$", "\\3\\2\\1") %>%
  head(10)
##  [1] "a"        "ebla"     "tboua"    "ebsoluta" "tccepa"   "tccouna" 
##  [7] "echieva"  "scrosa"   "tca"      "ectiva"
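
To see which swapped strings are still entries in stringr::words, the swapped vector can be intersected with the original list. A minimal sketch; swapped and still_words are names chosen here, and the result only covers words that appear in this particular word list.

```r
library(stringr)

# Swap the first and last letters of every word with at least two letters.
swapped <- str_replace(words, "^([A-Za-z])(.*)([A-Za-z])$", "\\3\\2\\1")

# Keep only the swapped strings that are themselves in the word list.
still_words <- intersect(swapped, words)
head(still_words, 10)
```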

14.4.5.1 Exercises

  1. Split up a string like “apples, pears, and bananas” into individual components.
x <- "apples, pears, and bananas"
str_split(x, ", +(and +)?")
## [[1]]
## [1] "apples"  "pears"   "bananas"
  2. Why is it better to split up by boundary("word") than " "?

    It is better to split up by boundary("word") because it finds word boundaries instead of looking for one fixed character. Splitting on " " leaves punctuation attached to the words and produces empty strings wherever two spaces occur in a row, while boundary("word") returns only the words.
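
The difference shows up on a sentence with punctuation and a double space. A small sketch on a made-up string:

```r
library(stringr)

x <- "This is a sentence.  This is another."

by_space <- str_split(x, " ")[[1]]
# "This" "is" "a" "sentence." "" "This" "is" "another."

by_word <- str_split(x, boundary("word"))[[1]]
# "This" "is" "a" "sentence" "This" "is" "another"
```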

  3. What does splitting with an empty string ("") do? Experiment, and then read the documentation.

    It splits every part of the string into individual characters.

x <- "apples, pears, and bananas"
str_split(x, "")
## [[1]]
##  [1] "a" "p" "p" "l" "e" "s" "," " " "p" "e" "a" "r" "s" "," " " "a" "n" "d" " "
## [20] "b" "a" "n" "a" "n" "a" "s"

14.5.1 Exercises

  1. How would you find all strings containing \ with regex() vs. with fixed()?
str_view_all(c("go\\od", "good"), "\\\\")
str_view_all(c("go\\od", "good"), fixed("\\"))
  2. What are the five most common words in sentences?
tibble(list = unlist(str_extract_all(sentences, boundary("word"))))%>%
  mutate(list = str_to_lower(list)) %>%
  count(list, sort = TRUE) %>%
  head(5)
## # A tibble: 5 x 2
##   list      n
##   <chr> <int>
## 1 the     751
## 2 a       202
## 3 of      132
## 4 to      123
## 5 and     118

14.7.1 Exercises

  1. Find the stringi functions that:

    1. Count the number of words.

      stri_count_words()

    2. Find duplicated strings.

      stri_duplicated()

    3. Generate random text.

      stri_rand_strings()

      stri_rand_shuffle()

      stri_rand_lipsum()

  2. How do you control the language that stri_sort() uses for sorting?

    The language can be controlled by using the opts_collator argument in stri_sort().
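
A sketch of both ways to pass a locale, assuming the ICU data shipped with stringi includes the Polish and Slovak locales. The word pair is the kind of case where collation rules differ, since Slovak treats “ch” as a separate letter that sorts after “h”; no output is shown because it depends on the installed ICU version.

```r
library(stringi)

x <- c("hladny", "chladny")

# Collator settings can be given directly; they are forwarded to stri_opts_collator().
polish_order <- stri_sort(x, locale = "pl_PL")

# Or they can be bundled explicitly through the opts_collator argument.
slovak_order <- stri_sort(x, opts_collator = stri_opts_collator(locale = "sk_SK"))
```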

Source

Wickham, H., & Grolemund, G. (2017). R for data science. Retrieved April 14, 2021, from https://r4ds.had.co.nz/strings.html