Data 101 - Stringr_Exercises

pacman::p_load(tidyverse)

14.2.5 Exercises

In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

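 paste() concatenates with a space between elements by default and converts NA to the string "NA".

 paste0() concatenates with no separator and also converts NA to the string "NA".

 str_c() is the stringr equivalent of paste0(): it concatenates with no separator by default, but it propagates NA, so any element combined with a missing value is NA.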

t1 <- c(1, 2, NA, 3)            # create a couple of vectors
t2 <- c("a","b","c", "d")

paste(t1, t2)                   # concatenate the vectors with various functions

## [1] "1 a"  "2 b"  "NA c" "3 d"

paste0(t1,t2)

## [1] "1a"  "2b"  "NAc" "3d"

str_c(t1, t2)

## [1] "1a" "2b" NA   "3d"
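If you want str_c() to treat missing values the way paste0() does, one option is to convert them first with str_replace_na(). A small sketch, using made-up character vectors shaped like the ones above:

```r
library(stringr)

nums <- c("1", "2", NA, "3")    # illustrative vectors, not from the exercise
lets <- c("a", "b", "c", "d")

# str_c() propagates NA: the third element of the result is missing
propagated <- str_c(nums, lets)

# str_replace_na() turns NA into the string "NA" before concatenating
explicit <- str_c(str_replace_na(nums), lets)

propagated
explicit
```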

In your own words, describe the difference between the sep and collapse arguments to str_c().

 collapse combines all elements of the resulting character vector into a single string, placing the collapse string between them.

 sep is the separator placed between the arguments as they are concatenated element by element; the result is still a vector of strings.
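A short sketch of the difference, with two made-up vectors (x and y are just illustrative names):

```r
library(stringr)

x <- c("a", "b", "c")
y <- c("1", "2", "3")

by_sep      <- str_c(x, y, sep = "-")                   # element-wise, still length 3
by_collapse <- str_c(x, collapse = "+")                 # one vector squeezed into one string
both        <- str_c(x, y, sep = "-", collapse = "+")   # sep is applied first, then collapse

by_sep
by_collapse
both
```

Here sep works across the arguments and collapse works along the result, which is why they can be combined in one call.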

Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

s <- "12345"                                                    # a string with an odd number of characters
mid <- 0                                                        # initialize midpoint variable
ev <- "even # chars returns null"                               # set a message string
if (str_length(s) %% 2 == 1) mid <- (str_length(s) + 1) / 2     # if string length is odd, set midpoint
if (mid == 0) ev                                                # if it's even, show the message
str_sub(s, mid, mid)                                            # get the middle character

## [1] "3"

s <- "123456"                                                   # a string with an even number of characters
mid <- 0                                                        # initialize midpoint variable
if (str_length(s) %% 2 == 1) mid <- (str_length(s) + 1) / 2     # if string length is odd, set midpoint
if (mid == 0) ev                                                # if it's even, show the message

## [1] "even # chars returns null"

str_sub(s, mid, mid)                                            # str_sub() at (0, 0) returns an empty string, not NULL

## [1] ""
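Another way to handle even-length strings is to return the two middle characters instead of a message. This is only a sketch; middle_chars is a hypothetical helper name, not a stringr function:

```r
library(stringr)

# Hypothetical helper: returns the middle character of each string,
# or the two middle characters when the length is even.
middle_chars <- function(s) {
  n <- str_length(s)
  start <- (n + 1) %/% 2    # same as ceiling(n / 2)
  end   <- n %/% 2 + 1      # same as floor(n / 2) + 1
  str_sub(s, start, end)
}

mids <- middle_chars(c("12345", "123456", "a"))
mids
```

Because str_length() and str_sub() are vectorized, the helper works on a whole vector without any if statements.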

What does str_wrap() do? When might you want to use it?

 str_wrap() reflows a long string into lines of a specified width (80 characters by default) by inserting line breaks. Use it when output has to fit a fixed width, such as a console or a printed page.
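A minimal sketch with a made-up sentence and an arbitrary width of 30:

```r
library(stringr)

txt <- "The quick brown fox jumps over the lazy dog and keeps on running through the field."

wrapped <- str_wrap(txt, width = 30)   # inserts "\n" so lines fit the width
writeLines(wrapped)                    # print with the line breaks applied

line_lengths <- str_length(str_split(wrapped, "\n")[[1]])
line_lengths
```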

What does str_trim() do? What’s the opposite of str_trim()?

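 str_trim() removes whitespace from the start and/or end of a string.
 str_squish() does the same and also reduces repeated whitespace inside the string to single spaces.
 str_pad() is the opposite of str_trim(): it pads a string to a given width with spaces or another specified character.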
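The three functions side by side, on a made-up string:

```r
library(stringr)

messy <- "  too   many spaces  "

trimmed  <- str_trim(messy)                                       # drops leading and trailing whitespace only
squished <- str_squish(messy)                                     # also collapses the inner runs of spaces
padded   <- str_pad("abc", width = 7, side = "both", pad = "*")   # pads out to 7 characters

trimmed
squished
padded
```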

Write a function that turns (e.g.) a vector c(“a”, “b”, “c”) into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.

new_str <- function(s) {                              # define a new function with one input
  len <- length(s)                                    # get the length of the input
  if (len > 1) {
    s[len] <- str_c("and ", s[len])                   # add "and " to the last element
    if (len > 2) {
      s <- str_c(s, collapse = ", ")                  # join with commas when there are three or more elements
    } else {
      s <- str_c(s, collapse = " ")                   # join with a space when there are exactly two
    }
  }
  s                                                   # length 0 or 1: return the input unchanged
}

new_str(c("a", "b", "c", "z"))

## [1] "a, b, c, and z"

new_str(c("And a 1", "a 2"))

## [1] "And a 1 and a 2"

new_str("1")

## [1] "1"

new_str("")

## [1] ""

14.3.1.1 Exercises

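Explain why each of these strings don’t match a \: "\", "\\", "\\\".

 Backslash is the escape character in both R strings and regular expressions, so matching one literal backslash takes four backslashes in the string ("\\\\"). "\" escapes the closing quote, so the string is never terminated. "\\" is a string holding one backslash, which the regex reads as an escape with nothing after it. "\\\" is an escaped backslash followed by an escaped closing quote, so again the string is never terminated.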
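A quick check of the two layers of escaping, using a made-up string that contains one backslash:

```r
library(stringr)

one_backslash <- "a\\b"          # R literal for the three characters a, \, b
writeLines(one_backslash)        # shows what the string really contains

pattern <- "\\\\"                # the string \\ which the regex engine reads as one literal backslash
has_backslash <- str_detect(one_backslash, pattern)
n_chars <- str_length(one_backslash)

has_backslash
n_chars
```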

# 2.  How would you match the sequence "'\?
 x <- " \"'\\ b"
str_view(x, "\"'\\\\")                 # an escape char before the " and four backslashes after the single quote

# 3. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

# It matches a literal dot followed by any character, three times in a row, e.g. ".c.d.e".

x <- "ab.c.d.e.fgh.i.j...k"

#str_view(x,"\..\..\..")                        # This is an error: "\." is not a valid escape in an R string

str_view(x, "\\..\\..\\..")                    # As a string, each regex backslash must be doubled

14.3.2.1 Exercises

# How would you match the literal string "$^$"?

x <- "123 $^$  abc"
x

## [1] "123 $^$  abc"

str_view(x, "\\$\\^\\$")                # escape $, ^ and $ so each is matched literally
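Two ways to match the literal characters, sketched with the same test string: escape each metacharacter, or skip regex interpretation altogether with fixed().

```r
library(stringr)

x <- "123 $^$  abc"

escaped <- str_extract(x, "\\$\\^\\$")    # regex with every metacharacter escaped
literal <- str_extract(x, fixed("$^$"))   # fixed() compares the characters as-is

escaped
literal
```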

Given the corpus of common words in stringr::words, create regular expressions that find all words that:
    1. Start with “y”

str_view(words,"^y", match = TRUE)      # words that start with y

    2. End with “x”

str_view(words, "x$", match = TRUE)     # words that end with x

    3. Are exactly three letters long. (Don’t cheat by using str_length()!)

str_subset(words, '^.{3}$') %>%     # words with three letters
        head(18)

##  [1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask" "bad"
## [13] "bag" "bar" "bed" "bet" "big" "bit"

    4. Have seven letters or more.

str_subset(words, ".{7}") %>%   # words of seven or more letters
        head(15)

##  [1] "absolute"  "account"   "achieve"   "address"   "advertise" "afternoon"
##  [7] "against"   "already"   "alright"   "although"  "america"   "another"  
## [13] "apparent"  "appoint"   "approach"

14.3.3.1 Exercises

Create regular expressions to find all words that:
    1. Start with a vowel.

str_subset(words, "^[aeiouy]")

##   [1] "a"           "able"        "about"       "absolute"    "accept"     
##   [6] "account"     "achieve"     "across"      "act"         "active"     
##  [11] "actual"      "add"         "address"     "admit"       "advertise"  
##  [16] "affect"      "afford"      "after"       "afternoon"   "again"      
##  [21] "against"     "age"         "agent"       "ago"         "agree"      
##  [26] "air"         "all"         "allow"       "almost"      "along"      
##  [31] "already"     "alright"     "also"        "although"    "always"     
##  [36] "america"     "amount"      "and"         "another"     "answer"     
##  [41] "any"         "apart"       "apparent"    "appear"      "apply"      
##  [46] "appoint"     "approach"    "appropriate" "area"        "argue"      
##  [51] "arm"         "around"      "arrange"     "art"         "as"         
##  [56] "ask"         "associate"   "assume"      "at"          "attend"     
##  [61] "authority"   "available"   "aware"       "away"        "awful"      
##  [66] "each"        "early"       "east"        "easy"        "eat"        
##  [71] "economy"     "educate"     "effect"      "egg"         "eight"      
##  [76] "either"      "elect"       "electric"    "eleven"      "else"       
##  [81] "employ"      "encourage"   "end"         "engine"      "english"    
##  [86] "enjoy"       "enough"      "enter"       "environment" "equal"      
##  [91] "especial"    "europe"      "even"        "evening"     "ever"       
##  [96] "every"       "evidence"    "exact"       "example"     "except"     
## [101] "excuse"      "exercise"    "exist"       "expect"      "expense"    
## [106] "experience"  "explain"     "express"     "extra"       "eye"        
## [111] "idea"        "identify"    "if"          "imagine"     "important"  
## [116] "improve"     "in"          "include"     "income"      "increase"   
## [121] "indeed"      "individual"  "industry"    "inform"      "inside"     
## [126] "instead"     "insure"      "interest"    "into"        "introduce"  
## [131] "invest"      "involve"     "issue"       "it"          "item"       
## [136] "obvious"     "occasion"    "odd"         "of"          "off"        
## [141] "offer"       "office"      "often"       "okay"        "old"        
## [146] "on"          "once"        "one"         "only"        "open"       
## [151] "operate"     "opportunity" "oppose"      "or"          "order"      
## [156] "organize"    "original"    "other"       "otherwise"   "ought"      
## [161] "out"         "over"        "own"         "under"       "understand" 
## [166] "union"       "unit"        "unite"       "university"  "unless"     
## [171] "until"       "up"          "upon"        "use"         "usual"      
## [176] "year"        "yes"         "yesterday"   "yet"         "you"        
## [181] "young"

    2. That only contain consonants. (Hint: thinking about matching “not”-vowels.)

str_view(words, "^[^aeyiuo]+$", match = TRUE)

    3. End with ed, but not with eed.

str_view(words, "[^e]ed$", match = TRUE)

    4.  End with ing or ise.

str_subset(words, "ing$|ise$")

##  [1] "advertise" "bring"     "during"    "evening"   "exercise"  "king"     
##  [7] "meaning"   "morning"   "otherwise" "practise"  "raise"     "realise"  
## [13] "ring"      "rise"      "sing"      "surprise"  "thing"

Empirically verify the rule “i before e except after c”.

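 The rule mostly holds in this data: 14 words follow it and 3 break it (science, society and weigh, although weigh is covered by the exception "or when sounding like a, as in neighbor or weigh"). The patterns below need a character before "ei", so "eight" and "either" are not counted in either group.

str_subset(words, "ie|ei")                             # all words containing ie or ei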

##  [1] "achieve"    "believe"    "brief"      "client"     "die"       
##  [6] "eight"      "either"     "experience" "field"      "friend"    
## [11] "lie"        "piece"      "quiet"      "receive"    "science"   
## [16] "society"    "tie"        "view"       "weigh"

length(str_subset(stringr::words, "(cei|[^c]ie)"))      # words that follow the rule: cei, or ie not after c

## [1] 14

str_subset(stringr::words, "(cei|[^c]ie)")              # there are 14

##  [1] "achieve"    "believe"    "brief"      "client"     "die"       
##  [6] "experience" "field"      "friend"     "lie"        "piece"     
## [11] "quiet"      "receive"    "tie"        "view"

length(str_subset(stringr::words, "(cie|[^c]ei)"))      # words that break the rule: cie, or ei not after c

## [1] 3

str_subset(stringr::words, "(cie|[^c]ei)")              # there are 3

## [1] "science" "society" "weigh"

Is “q” always followed by a “u”?
```
 In this data, yes. 
```

str_subset(words, "q")           # see words with q

##  [1] "equal"    "quality"  "quarter"  "question" "quick"    "quid"    
##  [7] "quiet"    "quite"    "require"  "square"

str_subset(words, "q[^u]")       # find words where q is followed by something other than u

## character(0)

Write a regular expression that matches a word if it’s probably written in British English, not American English.
```
 This is not exact: ise$ also matches words that are spelled the same in American and British English, such as rise and raise.
```

str_subset(words, ".+[^aeiou]our$|yse$|ae|ise$")

##  [1] "advertise" "colour"    "exercise"  "favour"    "labour"    "otherwise"
##  [7] "practise"  "raise"     "realise"   "rise"      "surprise"

Create a regular expression that will match telephone numbers as commonly written in your country.
```
  ^\(?\d{3}\)?-?\d{3}-?\d{4}$
```
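A rough way to try out a pattern like this. It assumes US-style numbers (3-3-4 digits, optional parentheses around the area code, optional space or hyphen separators); the candidate numbers are made up.

```r
library(stringr)

# Assumed format: 3-3-4 digits, optional parentheses, optional space or hyphen
phone <- "^\\(?\\d{3}\\)?[ -]?\\d{3}[ -]?\\d{4}$"

candidates <- c("555-123-4567", "(555) 123-4567", "5551234567", "55-1234-567")
is_phone <- str_detect(candidates, phone)
is_phone
```

The last candidate fails because its digits are grouped 2-4-3 rather than 3-3-4.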

14.3.4.1 Exercises

Describe the equivalents of ?, +, * in {m,n} form.
```
 ? = {0,1}

 + = {1,}

 * = {0,}
```
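One way to check these equivalences empirically is to run both forms over a few test strings and compare the results. The strings here are made up:

```r
library(stringr)

x <- c("", "a", "aa", "aaa")   # made-up test strings

# Each pair of patterns should extract exactly the same matches
optional_same     <- identical(str_extract(x, "^a?$"), str_extract(x, "^a{0,1}$"))
one_or_more_same  <- identical(str_extract(x, "^a+$"), str_extract(x, "^a{1,}$"))
zero_or_more_same <- identical(str_extract(x, "^a*$"), str_extract(x, "^a{0,}$"))

c(optional_same, one_or_more_same, zero_or_more_same)
```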

Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

 1. ^.*$  - matches any string, including the empty string: zero or more characters from start to end (. does not match a newline)

 2. "\\{.+\\}" - a string that defines the regex \{.+\}, which matches an opening curly brace, one or more characters of any kind, and a closing curly brace, e.g. {a} or {a b}

 3. \d{4}-\d{2}-\d{2} - matches 4 digits, hyphen, 2 digits, hyphen, 2 digits, so it matches dates formatted as yyyy-mm-dd (without checking that the month and day are valid)

 4. "\\\\{4}" - a string that defines the regex \\{4}, which matches four backslashes in a row
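These readings can be checked against a few made-up strings. Patterns 1 and 3 are given as bare regexes in the question, so they are written as R strings here:

```r
library(stringr)

# Made-up test strings; the last one holds four backslashes between a and b
x <- c("", "{abc}", "{}", "2021-03-16", "a\\\\\\\\b")

any_string   <- str_detect(x, "^.*$")                  # pattern 1
in_braces    <- str_detect(x, "\\{.+\\}")              # pattern 2
date_like    <- str_detect(x, "\\d{4}-\\d{2}-\\d{2}")  # pattern 3
four_slashes <- str_detect(x, "\\\\{4}")               # pattern 4

rbind(any_string, in_braces, date_like, four_slashes)
```

Note that "{}" does not match pattern 2, because .+ needs at least one character between the braces.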

Create regular expressions to find all words that:
    1. Start with three consonants = ^[^aeiouy]{3} (y is treated as a vowel here)

str_subset(words, "^[^aeiouy]{3}")

##  [1] "Christ"    "Christmas" "mrs"       "scheme"    "school"    "straight" 
##  [7] "strategy"  "street"    "strike"    "strong"    "structure" "three"    
## [13] "through"   "throw"

    2. Have three or more vowels in a row = [aeiou]{3,}

str_subset(words, "[aeiou]{3,}")

## [1] "beauty"   "obvious"  "previous" "quiet"    "serious"  "various"

    3. Have two or more vowel-consonant pairs in a row = ([aeiou][^aeiou]){2,}

str_subset(words, "([aeiou][^aeiou]){2,}")

##   [1] "absolute"    "agent"       "along"       "america"     "another"    
##   [6] "apart"       "apparent"    "authority"   "available"   "aware"      
##  [11] "away"        "balance"     "basis"       "become"      "before"     
##  [16] "begin"       "behind"      "benefit"     "business"    "character"  
##  [21] "closes"      "community"   "consider"    "cover"       "debate"     
##  [26] "decide"      "decision"    "definite"    "department"  "depend"     
##  [31] "design"      "develop"     "difference"  "difficult"   "direct"     
##  [36] "divide"      "document"    "during"      "economy"     "educate"    
##  [41] "elect"       "electric"    "eleven"      "encourage"   "environment"
##  [46] "europe"      "even"        "evening"     "ever"        "every"      
##  [51] "evidence"    "exact"       "example"     "exercise"    "exist"      
##  [56] "family"      "figure"      "final"       "finance"     "finish"     
##  [61] "friday"      "future"      "general"     "govern"      "holiday"    
##  [66] "honest"      "hospital"    "however"     "identify"    "imagine"    
##  [71] "individual"  "interest"    "introduce"   "item"        "jesus"      
##  [76] "level"       "likely"      "limit"       "local"       "major"      
##  [81] "manage"      "meaning"     "measure"     "minister"    "minus"      
##  [86] "minute"      "moment"      "money"       "music"       "nature"     
##  [91] "necessary"   "never"       "notice"      "okay"        "open"       
##  [96] "operate"     "opportunity" "organize"    "original"    "over"       
## [101] "paper"       "paragraph"   "parent"      "particular"  "photograph" 
## [106] "police"      "policy"      "politic"     "position"    "positive"   
## [111] "power"       "prepare"     "present"     "presume"     "private"    
## [116] "probable"    "process"     "produce"     "product"     "project"    
## [121] "proper"      "propose"     "protect"     "provide"     "quality"    
## [126] "realise"     "reason"      "recent"      "recognize"   "recommend"  
## [131] "record"      "reduce"      "refer"       "regard"      "relation"   
## [136] "remember"    "report"      "represent"   "result"      "return"     
## [141] "saturday"    "second"      "secretary"   "secure"      "separate"   
## [146] "seven"       "similar"     "specific"    "strategy"    "student"    
## [151] "stupid"      "telephone"   "television"  "therefore"   "thousand"   
## [156] "today"       "together"    "tomorrow"    "tonight"     "total"      
## [161] "toward"      "travel"      "unit"        "unite"       "university" 
## [166] "upon"        "visit"       "water"       "woman"

14.3.5.1 Exercises

Describe, in words, what these expressions will match:

 1. (.)\1\1 - the same character three times in a row, e.g. "aaa"

 2. "(.)(.)\\2\\1" - a string that defines the regex (.)(.)\2\1: any two characters followed by the same two characters in reverse order, e.g. "abba" or "deed"

 3. (..)\1 - any two characters followed by the same two characters again, e.g. "abab" or "ffff"

 4. "(.).\\1.\\1" - a character, any character, the original character, any character, the original character again, e.g. "abaca" or "b8b.b"

 5. "(.)(.)(.).*\\3\\2\\1" - three characters, then zero or more characters of any kind, then the same three characters in reverse order, e.g. "abcsgasgddsadgsdgcba", "abccba" or "abc1cba"
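The five patterns can be tried against a small made-up vector in which each string is meant to trigger exactly one of them:

```r
library(stringr)

x <- c("aaa", "abba", "abab", "abaca", "abc1cba", "abcd")   # made-up test strings

m1 <- str_subset(x, "(.)\\1\\1")               # same character three times
m2 <- str_subset(x, "(.)(.)\\2\\1")            # pair followed by its reverse
m3 <- str_subset(x, "(..)\\1")                 # pair repeated
m4 <- str_subset(x, "(.).\\1.\\1")             # same character at positions 1, 3 and 5
m5 <- str_subset(x, "(.)(.)(.).*\\3\\2\\1")    # three characters, anything, then reversed

list(m1, m2, m3, m4, m5)
```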

Construct regular expressions to match words that:
    1. Start and end with the same character

str_subset(words, "^(.)((.*\\1$)|\\1?$)")   # start & end with same character

##  [1] "a"          "america"    "area"       "dad"        "dead"      
##  [6] "depend"     "educate"    "else"       "encourage"  "engine"    
## [11] "europe"     "evidence"   "example"    "excuse"     "exercise"  
## [16] "expense"    "experience" "eye"        "health"     "high"      
## [21] "knock"      "level"      "local"      "nation"     "non"       
## [26] "rather"     "refer"      "remember"   "serious"    "stairs"    
## [31] "test"       "tonight"    "transport"  "treat"      "trust"     
## [36] "window"     "yesterday"

    2. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

str_subset(words,"([A-Za-z][A-Za-z]).*\\1")

##  [1] "appropriate" "church"      "condition"   "decide"      "environment"
##  [6] "london"      "paragraph"   "particular"  "photograph"  "prepare"    
## [11] "pressure"    "remember"    "represent"   "require"     "sense"      
## [16] "therefore"   "understand"  "whether"

    3. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_subset(words, "([a-z]).*\\1.*\\1")

##  [1] "appropriate" "available"   "believe"     "between"     "business"   
##  [6] "degree"      "difference"  "discuss"     "eleven"      "environment"
## [11] "evidence"    "exercise"    "expense"     "experience"  "individual" 
## [16] "paragraph"   "receive"     "remember"    "represent"   "telephone"  
## [21] "therefore"   "tomorrow"

14.4.1.1 Exercises

For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.
    1. Find all words that start or end with x.

words[str_detect(words, "^x|x$")]      # with one regex

## [1] "box" "sex" "six" "tax"

                                       # now split into multiple statements
strtx <- str_detect(words, "^x")       # start w/x
endx <- str_detect(words, "x$")        # end w/x
words[strtx | endx]                    # join 'em up

## [1] "box" "sex" "six" "tax"

    2. Find all words that start with a vowel and end with a consonant.

words[str_detect(words, "^[aeiou].*[^aeiou]$")]   # with single regex

##   [1] "about"       "accept"      "account"     "across"      "act"        
##   [6] "actual"      "add"         "address"     "admit"       "affect"     
##  [11] "afford"      "after"       "afternoon"   "again"       "against"    
##  [16] "agent"       "air"         "all"         "allow"       "almost"     
##  [21] "along"       "already"     "alright"     "although"    "always"     
##  [26] "amount"      "and"         "another"     "answer"      "any"        
##  [31] "apart"       "apparent"    "appear"      "apply"       "appoint"    
##  [36] "approach"    "arm"         "around"      "art"         "as"         
##  [41] "ask"         "at"          "attend"      "authority"   "away"       
##  [46] "awful"       "each"        "early"       "east"        "easy"       
##  [51] "eat"         "economy"     "effect"      "egg"         "eight"      
##  [56] "either"      "elect"       "electric"    "eleven"      "employ"     
##  [61] "end"         "english"     "enjoy"       "enough"      "enter"      
##  [66] "environment" "equal"       "especial"    "even"        "evening"    
##  [71] "ever"        "every"       "exact"       "except"      "exist"      
##  [76] "expect"      "explain"     "express"     "identify"    "if"         
##  [81] "important"   "in"          "indeed"      "individual"  "industry"   
##  [86] "inform"      "instead"     "interest"    "invest"      "it"         
##  [91] "item"        "obvious"     "occasion"    "odd"         "of"         
##  [96] "off"         "offer"       "often"       "okay"        "old"        
## [101] "on"          "only"        "open"        "opportunity" "or"         
## [106] "order"       "original"    "other"       "ought"       "out"        
## [111] "over"        "own"         "under"       "understand"  "union"      
## [116] "unit"        "university"  "unless"      "until"       "up"         
## [121] "upon"        "usual"

strtv <- str_detect(words, "^[aeiou]")            # start w/ vowel
endc  <- str_detect(words, "[^aeiou]$")           # end w/ consonant
words[strtv & endc]                               # intersection

##   [1] "about"       "accept"      "account"     "across"      "act"        
##   [6] "actual"      "add"         "address"     "admit"       "affect"     
##  [11] "afford"      "after"       "afternoon"   "again"       "against"    
##  [16] "agent"       "air"         "all"         "allow"       "almost"     
##  [21] "along"       "already"     "alright"     "although"    "always"     
##  [26] "amount"      "and"         "another"     "answer"      "any"        
##  [31] "apart"       "apparent"    "appear"      "apply"       "appoint"    
##  [36] "approach"    "arm"         "around"      "art"         "as"         
##  [41] "ask"         "at"          "attend"      "authority"   "away"       
##  [46] "awful"       "each"        "early"       "east"        "easy"       
##  [51] "eat"         "economy"     "effect"      "egg"         "eight"      
##  [56] "either"      "elect"       "electric"    "eleven"      "employ"     
##  [61] "end"         "english"     "enjoy"       "enough"      "enter"      
##  [66] "environment" "equal"       "especial"    "even"        "evening"    
##  [71] "ever"        "every"       "exact"       "except"      "exist"      
##  [76] "expect"      "explain"     "express"     "identify"    "if"         
##  [81] "important"   "in"          "indeed"      "individual"  "industry"   
##  [86] "inform"      "instead"     "interest"    "invest"      "it"         
##  [91] "item"        "obvious"     "occasion"    "odd"         "of"         
##  [96] "off"         "offer"       "often"       "okay"        "old"        
## [101] "on"          "only"        "open"        "opportunity" "or"         
## [106] "order"       "original"    "other"       "ought"       "out"        
## [111] "over"        "own"         "under"       "understand"  "union"      
## [116] "unit"        "university"  "unless"      "until"       "up"         
## [121] "upon"        "usual"

    3. Are there any words that contain at least one of each different vowel?
            
            No, not in this data. Combining str_detect() calls is the simplest approach here; a single regex built from plain alternation would have to account for every ordering of the vowels.

words[str_detect(words, "a") &
        str_detect(words, "e") &
        str_detect(words, "i") &
        str_detect(words, "o") &
        str_detect(words, "u")
  ]

## character(0)
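For comparison, a single regex can avoid listing every ordering by using lookaheads, which the ICU engine behind stringr supports. This is a sketch of that alternative; the three test words are made up.

```r
library(stringr)

# One lookahead per vowel: each checks that the vowel occurs somewhere in the string
all_vowels <- "^(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)"

in_words <- str_subset(words, all_vowels)                                  # search the stringr::words corpus
checks   <- str_detect(c("education", "sequoia", "vowel"), all_vowels)     # sanity check on known cases

in_words
checks
```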

What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
```
 The maximum number of vowels is 5, and eight words have 5 vowels.
 The proportion uses the length of the whole word as the denominator; the highest proportion is 1, for the word "a".
```

vowels <- str_count(words, "[aeiou]")               # count the vowels in each word
max(vowels)                                         # find the maximum

## [1] 5

words[which(vowels == max(vowels))]                 # find words with max vowels

## [1] "appropriate" "associate"   "available"   "colleague"   "encourage"  
## [6] "experience"  "individual"  "television"

prop_vowels <- str_count(words, "[aeiou]") / str_length(words)   # find the proportion for each word
max(prop_vowels)                                                 # show the max. proportion

## [1] 1

words[which(prop_vowels == max(prop_vowels))]                    # find words w/ max. proportion, there's only one @ 100%

## [1] "a"

14.4.2.1 Exercises

    1. In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.

colors <- c("red", "orange", "yellow", "green", "blue", "purple", "pink")  # colors vector
color_match <- str_c("\\b(", str_c(colors, collapse = "|"), ")\\b")    # add word boundaries to regex     

has_color <- str_subset(sentences, color_match)           # find all the sentences with color matches
str_extract(has_color, color_match)                       # extract the colors from the sentences

##  [1] "blue"   "pink"   "blue"   "blue"   "yellow" "green"  "red"    "blue"  
##  [9] "blue"   "blue"   "green"  "red"    "red"    "red"    "green"  "green" 
## [17] "purple" "green"  "red"    "red"    "blue"   "blue"   "red"    "green" 
## [25] "green"  "green"  "yellow" "orange" "red"    "red"    "pink"

more <- sentences[str_count(sentences, color_match) > 1]  # find sentences with more than one color in them
str_view_all(more, color_match)                           # flickered is not included

    2. From the Harvard sentences data, extract:

            1. The first word from each sentence.

str_extract(sentences, "[A-Za-z]+") %>% head(10)         # this cuts "It's" down to "It" in sentence 3

##  [1] "The"   "Glue"  "It"    "These" "Rice"  "The"   "The"   "The"   "Four" 
## [10] "Large"

str_extract(sentences, "[A-Za-z][A-Za-z']*") %>%         # allow for apostrophe
        head(10)

##  [1] "The"   "Glue"  "It's"  "These" "Rice"  "The"   "The"   "The"   "Four" 
## [10] "Large"

            2. All words ending in ing.

s_ing <- str_detect(sentences, "\\b[A-Za-z]+ing\\b")
unique(unlist(str_extract_all(sentences[s_ing], "\\b[A-Za-z]+ing\\b")))

##  [1] "spring"    "evening"   "morning"   "winding"   "living"    "king"     
##  [7] "Adding"    "making"    "raging"    "playing"   "sleeping"  "ring"     
## [13] "glaring"   "sinking"   "dying"     "Bring"     "lodging"   "filing"   
## [19] "wearing"   "wading"    "swing"     "nothing"   "sing"      "painting" 
## [25] "walking"   "bring"     "shipping"  "puzzling"  "landing"   "thing"    
## [31] "waiting"   "whistling" "timing"    "changing"  "drenching" "moving"   
## [37] "working"

            3. All plurals.

s1 <- unique(unlist(str_extract_all(sentences, "\\b[A-Za-z]{3,}s\\b")))   # find all words ending in s
s1 <- s1[!(str_detect(s1, "\\b[A-Za-z]{3,}ss\\b"))]                       # drop words ending in ss (needs 3+ letters before the ss, so "pass" remains)
s2 <- unique(unlist(str_extract_all(sentences, "\\b[A-Za-z]{3,}es\\b")))  # find all words ending in es
head(s1, 10)

##  [1] "planks"    "days"      "bowls"     "lemons"    "makes"     "hogs"     
##  [7] "hours"     "stockings" "helps"     "pass"

head(s2, 10)

##  [1] "makes"    "fires"    "lives"    "busses"   "hikes"    "strokes" 
##  [7] "slices"   "gives"    "Thieves"  "improves"

14.4.3.1 Exercises

Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.

numword <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) +(\\w+)"
sentences[str_detect(sentences, numword)] %>%
  str_extract(numword)

##  [1] "seven books"   "two met"       "two factors"   "three lists"  
##  [5] "seven is"      "two when"      "ten inches"    "one war"      
##  [9] "one button"    "six minutes"   "ten years"     "two shares"   
## [13] "two distinct"  "five cents"    "two pins"      "five robins"  
## [17] "four kinds"    "three story"   "three inches"  "six comes"    
## [21] "three batches" "two leaves"

Find all contractions. Separate out the pieces before and after the apostrophe.

contraction <- "([A-Za-z]+)'([A-Za-z]+)"             # any number of letters before & after apostrophe
sentences[str_detect(sentences, contraction)] %>%    # find sentences with contractions
  str_match(contraction)                             # extract and separate based on groups

##       [,1]         [,2]       [,3]
##  [1,] "It's"       "It"       "s" 
##  [2,] "man's"      "man"      "s" 
##  [3,] "don't"      "don"      "t" 
##  [4,] "store's"    "store"    "s" 
##  [5,] "workmen's"  "workmen"  "s" 
##  [6,] "Let's"      "Let"      "s" 
##  [7,] "sun's"      "sun"      "s" 
##  [8,] "child's"    "child"    "s" 
##  [9,] "king's"     "king"     "s" 
## [10,] "It's"       "It"       "s" 
## [11,] "don't"      "don"      "t" 
## [12,] "queen's"    "queen"    "s" 
## [13,] "don't"      "don"      "t" 
## [14,] "pirate's"   "pirate"   "s" 
## [15,] "neighbor's" "neighbor" "s"

14.4.4.1 Exercises

Replace all forward slashes in a string with backslashes.

str_replace_all("Once/upon/a/time", "/", "\\\\")

## [1] "Once\\upon\\a\\time"
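The doubled backslashes in the printed result are only R's escaped display. A quick check with writeLines() and str_count() shows each slash became a single backslash:

```r
library(stringr)

swapped_slashes <- str_replace_all("Once/upon/a/time", "/", "\\\\")

writeLines(swapped_slashes)                               # prints the raw characters
n_backslashes <- str_count(swapped_slashes, fixed("\\"))  # count literal backslashes
n_chars       <- str_length(swapped_slashes)              # length is unchanged by the swap

n_backslashes
n_chars
```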

Implement a simple version of str_to_lower() using str_replace_all().

cap_words <- str_subset(words,"[A-Z]")    # Look for Caps
cap_words

## [1] "Christ"    "Christmas"

lowcase <- c("A" = "a", "B" = "b", "C" = "c", "D" = "d", "E" = "e",
                  "F" = "f", "G" = "g", "H" = "h", "I" = "i", "J" = "j", 
                  "K" = "k", "L" = "l", "M" = "m", "N" = "n", "O" = "o", 
                  "P" = "p", "Q" = "q", "R" = "r", "S" = "s", "T" = "t", 
                  "U" = "u", "V" = "v", "W" = "w", "X" = "x", "Y" = "y", 
                  "Z" = "z")
str_replace_all(cap_words, pattern = lowcase)

## [1] "christ"    "christmas"
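The lookup vector does not have to be typed out by hand. A shorter sketch builds it from base R's LETTERS and letters constants; lower_map and simple_lower are hypothetical names.

```r
library(stringr)

lower_map <- setNames(letters, LETTERS)   # named vector: c(A = "a", B = "b", ...)

# Hypothetical helper: lower-cases ASCII letters only
simple_lower <- function(x) str_replace_all(x, lower_map)

lowered <- simple_lower(c("Christ", "Christmas", "ABC def"))
lowered
```

Unlike str_to_lower(), this handles only the 26 unaccented capital letters.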

Switch the first and last letters in words. Which of those strings are still words?

swapped <- str_replace_all(words, "^([A-Za-z])(.*)([A-Za-z])$", "\\3\\2\\1")  # swap 1st & last letters
head(swapped)

## [1] "a"        "ebla"     "tboua"    "ebsoluta" "tccepa"   "tccouna"

intersect(swapped, words)                                                     # find intersection with words

##  [1] "a"          "america"    "area"       "dad"        "dead"      
##  [6] "lead"       "read"       "depend"     "god"        "educate"   
## [11] "else"       "encourage"  "engine"     "europe"     "evidence"  
## [16] "example"    "excuse"     "exercise"   "expense"    "experience"
## [21] "eye"        "dog"        "health"     "high"       "knock"     
## [26] "deal"       "level"      "local"      "nation"     "on"        
## [31] "non"        "no"         "rather"     "dear"       "refer"     
## [36] "remember"   "serious"    "stairs"     "test"       "tonight"   
## [41] "transport"  "treat"      "trust"      "window"     "yesterday"

14.4.5.1 Exercises

Split up a string like “apples, pears, and bananas” into individual components.

("avocados, oranges, and grapes") %>% 
        str_split(", +(and +)?") %>% 
        .[[1]]

## [1] "avocados" "oranges"  "grapes"

Why is it better to split up by boundary(“word”) than " "?

 Splitting on a space (" ") leaves punctuation attached to the words and produces empty strings wherever there are consecutive spaces;
 boundary("word") handles repeated whitespace and drops the punctuation.

x <- "This is a sentence.  This is another sentence.  Another sentence with   extra spaces?"
str_view_all(x, boundary("word"))

str_split(x, " ")[[1]]

##  [1] "This"      "is"        "a"         "sentence." ""          "This"     
##  [7] "is"        "another"   "sentence." ""          "Another"   "sentence" 
## [13] "with"      ""          ""          "extra"     "spaces?"

str_split(x, boundary("word"))[[1]]

##  [1] "This"     "is"       "a"        "sentence" "This"     "is"      
##  [7] "another"  "sentence" "Another"  "sentence" "with"     "extra"   
## [13] "spaces"

What does splitting with an empty string ("") do? Experiment, and then read the documentation.
```
 It splits the string into individual characters; it is equivalent to boundary("character").
```

str_split("In the year 2525, if we are still alive....", "")[[1]]

##  [1] "I" "n" " " "t" "h" "e" " " "y" "e" "a" "r" " " "2" "5" "2" "5" "," " " "i"
## [20] "f" " " "w" "e" " " "a" "r" "e" " " "s" "t" "i" "l" "l" " " "a" "l" "i" "v"
## [39] "e" "." "." "." "."

14.5.1 Exercises

How would you find all strings containing \ with regex() vs. with fixed()?

str_subset(c("1\\2", "123", "once\\upon\\a time"), "\\\\")

## [1] "1\\2"               "once\\upon\\a time"

str_subset(c("1\\2", "123", "once\\upon\\a time"), fixed("\\"))

## [1] "1\\2"               "once\\upon\\a time"

What are the five most common words in sentences?

tibble(word = unlist(str_extract_all(sentences, boundary("word")))) %>%
  mutate(word = str_to_lower(word)) %>%
  count(word, sort = TRUE) %>%
  head(5)

## # A tibble: 5 x 2
##   word      n
##   <chr> <int>
## 1 the     751
## 2 a       202
## 3 of      132
## 4 to      123
## 5 and     118

14.7.1 Exercises

Find the stringi functions that:

 1. Count the number of words.

         Use stri_count_words.

pacman::p_load(stringi)                           # load the stringi package
stri_count_words(head(sentences,10))              # count the words in the first ten sentences

##  [1] 8 8 9 9 7 7 8 8 7 8

    2. Find duplicated strings.
            
            Use stri_duplicated.

str_split("I am what I am, what a yam", boundary("word"))[[1]] %>% 
        stri_duplicated()

## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE

    3. Generate random text.
    
            Use stri_rand_strings, stri_rand_lipsum, or stri_rand_shuffle.

stri_rand_strings(5, 7)              # generate five random strings of length 7

## [1] "86kplQ2" "HNrAkNM" "oDRiIp9" "gO2y3ut" "NK8qfMs"

stri_rand_lipsum(1)                  # generate a paragraph of random text

## [1] "Lorem ipsum dolor sit amet, natoque sociis eget venenatis platea ornare, cursus rhoncus maximus sed lacinia ac lacus. Eget tortor nisl, tincidunt vel eleifend varius. Phasellus consectetur vel metus dapibus ut auctor. Sit maecenas luctus turpis mattis nec a a porta. Massa ac luctus at nibh ante. Cursus elementum est litora vitae felis consequat sed lectus malesuada. Quam maecenas ligula maecenas neque diam amet volutpat mauris mauris tortor maecenas amet gravida vivamus. Sed aliquam. Ridiculus ex eros finibus. Sed torquent amet sem nascetur vestibulum nulla ac id mauris. Amet dictum, aliquet dis dis ante. Ex neque et mauris ligula himenaeos sed justo, vestibulum senectus."

stri_rand_shuffle("It was a dark and stormy night in Transylvania.")

## [1] "ads  lyad  tsnvr tgritaraTmao.inIh nn a syiakwn"

How do you control the language that stri_sort() uses for sorting?
```
 Use the locale argument.
```

stri_sort(c('hladny', 'chladny'), locale='pl_PL')

## [1] "chladny" "hladny"

stri_sort(c('hladny', 'chladny'), locale='sk_SK')

## [1] "hladny"  "chladny"

Resources

I used the following resources when I got stuck or to look at alternatives.

Regular Expressions 101
‘R for Data Science’ Exercise Solutions - Jeffrey B. Arnold

Data 101 - Stringr_Exercises_Macy

Marilyn Macy

3/16/2021