14.4.1.1 Exercises

1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.
  1. Find all words that start or end with x.
# single
words[str_detect(words, "^x|x$")]
# combination
words[str_detect(words, "^x") | str_detect(words, "x$")]
  2. Find all words that start with a vowel and end with a consonant.
# single
str_view(words, "^[aiueo].*[^aiueo]$", match = TRUE)
# combination
words[str_detect(words, "^[aiueo]") & str_detect(words, "[^aiueo]$")]
  3. Are there any words that contain at least one of each different vowel?
# single (one lookahead per vowel)
words[str_detect(words, "^(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)")]
# combination
words[str_detect(words, "a") & str_detect(words, "i") & str_detect(words, "u") &
      str_detect(words, "e") & str_detect(words, "o")]
2. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
df <- tibble(
  word = words
)
df <- df %>% mutate(number = str_count(word, "[aiueo]"), prop = number / str_length(word))
# highest number
df %>% filter(number == max(df$number))
## # A tibble: 8 x 3
##   word        number  prop
##   <chr>        <int> <dbl>
## 1 appropriate      5 0.455
## 2 associate        5 0.556
## 3 available        5 0.556
## 4 colleague        5 0.556
## 5 encourage        5 0.556
## 6 experience       5 0.5  
## 7 individual       5 0.5  
## 8 television       5 0.5
# highest proportion
df %>% filter(prop == max(df$prop))
## # A tibble: 1 x 3
##   word  number  prop
##   <chr>  <int> <dbl>
## 1 a          1     1
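
As a quick sanity check for exercise 1, the sketch below compares a single regular expression with the combined str_detect() calls. The vector sample_words is a hypothetical stand-in for stringr::words, chosen so that one entry contains all five vowels.

```r
library(stringr)

# Hypothetical sample; the exercises themselves use stringr::words.
sample_words <- c("xylem", "box", "apple", "education", "tree")

# Start or end with x: one regex vs. two str_detect() calls.
single   <- str_detect(sample_words, "^x|x$")
combined <- str_detect(sample_words, "^x") | str_detect(sample_words, "x$")
identical(single, combined)

# All five vowels: one lookahead per vowel vs. five str_detect() calls.
vowels       <- c("a", "e", "i", "o", "u")
all_single   <- str_detect(sample_words, "^(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)")
all_combined <- Reduce(`&`, lapply(vowels, function(v) str_detect(sample_words, v)))
sample_words[all_single]
```

The lookahead version is compact but harder to read; the combination of str_detect() calls is usually easier to maintain.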

14.4.2.1 Exercises

1. In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.
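color <- c("red", "orange", "yellow", "green", "blue", "purple")
# wrap the alternation in word boundaries so "red" no longer matches inside "flickered"
color_match <- str_c("\\b(", str_c(color, collapse = "|"), ")\\b")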
2. From the Harvard sentences data, extract:
  1. The first word from each sentence.
str_extract(sentences, "^[a-zA-Z]+")
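  2. All words ending in ing.
str_extract_all(sentences, "\\b[A-Za-z]+ing\\b")
  3. All plurals.
# rough heuristic: words of at least four letters ending in s (also catches non-plurals such as "this")
str_extract_all(sentences, "\\b[A-Za-z]{3,}s\\b")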
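
Word boundaries (\\b) matter for patterns like these, because without them a match can start or stop in the middle of a word. The following is a minimal sketch on made-up strings (sample_sentence and candidates are hypothetical, not part of the Harvard sentences):

```r
library(stringr)

# Hypothetical input used only for illustration.
sample_sentence <- "She was singing a single song."

# Without boundaries, "sing" is pulled out of "single".
unbounded <- str_extract_all(sample_sentence, "[A-Za-z]+ing")[[1]]

# With boundaries, only whole words ending in "ing" are kept.
bounded <- str_extract_all(sample_sentence, "\\b[A-Za-z]+ing\\b")[[1]]

# The same idea keeps "red" from matching inside "flickered".
candidates   <- c("flickered", "a red door")
loose_match  <- str_detect(candidates, "red|blue")
strict_match <- str_detect(candidates, "\\b(red|blue)\\b")
```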

14.4.3.1 Exercises

1. Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.
# \\b and an alternation without padding spaces avoid false positives such as " tender", " tenth" and " tent"
number_word <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) +(\\w+)"
# str_match() returns the full match, the number, and the following word as separate columns
sentences %>%
  str_subset(number_word) %>%
  str_match(number_word)
2. Find all contractions. Separate out the pieces before and after the apostrophe.
sentences %>%
  str_subset("([A-Za-z]+)'([A-Za-z]+)") %>%
  str_extract("([A-Za-z]+)'([A-Za-z]+)")
##  [1] "It's"       "man's"      "don't"      "store's"    "workmen's" 
##  [6] "Let's"      "sun's"      "child's"    "king's"     "It's"      
## [11] "don't"      "queen's"    "don't"      "pirate's"   "neighbor's"
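
To separate the pieces before and after the apostrophe, the capture groups can be read back with str_match_all(). This is a sketch under the assumption that a contraction is simply letters, an apostrophe, then letters; sample_sentences is a hypothetical stand-in for sentences.

```r
library(stringr)

# Hypothetical sentences used only for illustration.
sample_sentences <- c(
  "It's a fine day.",
  "The king's men don't rest.",
  "No apostrophe here."
)

# Each list element is a matrix: full match, piece before, piece after.
pieces <- str_match_all(sample_sentences, "([A-Za-z]+)'([A-Za-z]+)")

# Stack the per-sentence matrices into one matrix of all contractions.
contractions <- do.call(rbind, pieces)
before <- unname(contractions[, 2])
after  <- unname(contractions[, 3])
```

Sentences without a contraction contribute a zero-row matrix, so they drop out when the matrices are stacked.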

14.4.4.1 Exercises

1. Replace all forward slashes in a string with backslashes.
x <- c("a/b/c/d/e")
str_replace_all(x, "\\/", "\\\\") %>% writeLines()
## a\b\c\d\e
2. Implement a simple version of str_to_lower() using str_replace_all().
x <- c("AA", "BB", "CC")
str_replace_all(x, c("A" = "a", "B" = "b", "C" = "c")) %>% writeLines()
## aa
## bb
## cc
3. Switch the first and last letters in words. Which of those strings are still words?
swapped <- str_replace(words, "^([A-Za-z])(.*)([A-Za-z])$", "\\3\\2\\1")
# strings that are still words after the swap
intersect(swapped, words)
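
The lower-casing answer in exercise 2 can be extended to the whole ASCII alphabet by building the named replacement vector from R's built-in LETTERS and letters constants. The helper name to_lower_simple is hypothetical, and the sketch deliberately ignores non-ASCII letters.

```r
library(stringr)

# Hypothetical helper: map each upper-case ASCII letter to its lower-case form.
to_lower_simple <- function(x) {
  # names are the patterns ("A", "B", ...), values are the replacements
  str_replace_all(x, setNames(letters, LETTERS))
}

lowered <- to_lower_simple(c("Hello World", "R4DS"))
```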

14.4.5.1 Exercises

1. Split up a string like “apples, pears, and bananas” into individual components.
x <- "apples, pears, and bananas"
str_split(x, ", and |, ")
## [[1]]
## [1] "apples"  "pears"   "bananas"
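2. Why is it better to split up by boundary("word") than " "?
Splitting on " " leaves the period attached to the preceding word, whereas boundary("word") ignores the period.
3. What does splitting with an empty string ("") do? Experiment, and then read the documentation.
# an empty pattern splits the string into individual characters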
x <- "This is a sentence. This is another sentence."
str_split(x, "")[[1]]
##  [1] "T" "h" "i" "s" " " "i" "s" " " "a" " " "s" "e" "n" "t" "e" "n" "c"
## [18] "e" "." " " "T" "h" "i" "s" " " "i" "s" " " "a" "n" "o" "t" "h" "e"
## [35] "r" " " "s" "e" "n" "t" "e" "n" "c" "e" "."
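
The difference between splitting on " " and on boundary("word") from exercise 2 can be seen directly on the same string. The sketch also checks the documented behaviour that an empty pattern acts like boundary("character").

```r
library(stringr)

x <- "This is a sentence. This is another sentence."

# Splitting on a space keeps the periods attached to the words.
by_space <- str_split(x, " ")[[1]]

# Splitting on word boundaries drops punctuation and whitespace.
by_word <- str_split(x, boundary("word"))[[1]]

# An empty pattern splits into characters, like boundary("character").
by_empty     <- str_split(x, "")[[1]]
by_character <- str_split(x, boundary("character"))[[1]]
```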

14.5.1 Exercises

1. How would you find all strings containing \ with regex() vs. with fixed()?
x <- c("a\\b", "ab")
# regex()
str_subset(x, regex("\\\\"))
## [1] "a\\b"
# fixed()
str_subset(x, fixed("\\"))
## [1] "a\\b"
2. What are the five most common words in sentences?
# build the tibble directly to avoid the as_tibble()-on-a-vector warning
tibble(words = sentences %>% str_extract_all(boundary("word")) %>% unlist() %>% str_to_lower()) %>%
  count(words, sort = TRUE) %>%
  head(5)
## # A tibble: 5 x 2
##   words     n
##   <chr> <int>
## 1 the     751
## 2 a       202
## 3 of      132
## 4 to      123
## 5 and     118
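
The word-count pipeline can be tried on a small input where the counts are easy to verify by hand. This is a sketch assuming dplyr and tibble are installed; sample_sentences is a hypothetical stand-in for sentences.

```r
library(stringr)
library(dplyr)
library(tibble)

# Hypothetical sentences standing in for stringr::sentences.
sample_sentences <- c("The cat sat.", "The dog sat on the mat.")

word_counts <- tibble(
  word = sample_sentences %>%
    str_extract_all(boundary("word")) %>%  # split each sentence into words
    unlist() %>%                           # flatten the list into one vector
    str_to_lower()                         # count "The" and "the" together
) %>%
  count(word, sort = TRUE)                 # one row per word, most frequent first

head(word_counts, 2)
```

Lower-casing before counting matters here: without it, "The" and "the" would be tallied as different words.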