Chapter 11 Strings with stringr

This chapter introduces you to string manipulation in R. But the focus will be on regular expressions, or regexps for short. Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings. When you first look at a regexp, you’ll think a cat walked across your keyboard, but as your understanding improves, they will soon start to make sense.

Prerequisites

library(tidyverse)
library(stringr)

String Basics

You can create strings with either single or double quotes; there is no difference in behavior. Double quotes are the better default, unless you want to create a string that itself contains multiple double quotes.

string1 <-  "This is a string"
string2 <- 'To put a "quote" inside a string, use single quotes'
string1
[1] "This is a string"
string2
[1] "To put a \"quote\" inside a string, use single quotes"

You can also include a literal single or double quote in a string by escaping it with \:

double_quote <-  "\"" # or '"'
single_quote <-  '\'' #  or "'"
double_quote
[1] "\""
single_quote
[1] "'"

To see the raw contents of the string, use writeLines():

writeLines(double_quote)
"
writeLines(single_quote)
'

Other special characters include "\n" for newline and "\t" for tab. You will also sometimes see strings like "\u00b5", which is a way of writing non-English characters that works on all platforms.

x <-  "\u00b5"
x
[1] "µ"
writeLines(x)
µ

Multiple strings can be stored in a character vector, which you create with c():

c("one","two","three")
[1] "one"   "two"   "three"

Functions and packages covered

  • stringr package
  • str_length
  • str_c
  • str_replace_na
  • str_sub
  • str_to_upper, str_sort, str_to_lower, str_order
  • str_length, str_pad, str_trim, str_sub
  • For regex: str_view, str_view_all
  • regex syntax
  • str_detect
  • str_subset
  • str_count
  • str_extract
  • str_match
  • tidyr::extract
  • str_split
  • str_locate
  • str_sub
  • the stringi package

Ideas

  • mention rex. A package with friendly regular expressions.
  • Use it to match country names? Extract numbers from text?
  • Discuss fuzzy joining and string distance, approximate matching.

Also see

String Length

str_length() tells you the number of characters in a string

library(tidyverse)
library(stringr)
str_length( c("a","R for data science", NA))
[1]  1 18 NA

Combining Strings

Use str_c() to combine two or more strings, and use the sep argument to control how they are separated. By default it behaves much like paste0(): the strings are joined with nothing in between.

str_c("x","y")
[1] "xy"
str_c("x","y","z")
[1] "xyz"
str_c("x","y", sep = ", ")
[1] "x, y"

Like most other functions in R, missing values are contagious. If you want them to print as “NA”, use str_replace_na()

x <-  c("abc", NA)
str_c("|-",x,"-|")
[1] "|-abc-|" NA       
str_c("|-",str_replace_na(x),"-|")
[1] "|-abc-|" "|-NA-|" 

str_c() is vectorized, and it automatically recycles shorter vectors to the same length as the longest:

str_c("prefix-",c("a","b","c"), "-suffix", sep=":")
[1] "prefix-:a:-suffix" "prefix-:b:-suffix" "prefix-:c:-suffix"

Objects of length 0 are silently dropped. This is particularly useful in conjunction with if:

name <- "Hadley"
time_of_day <-  "morning"
birthday <-  TRUE
str_c("Good",time_of_day,name, if(birthday) " and Happy Birthday", sep = " ")
[1] "Good morning Hadley  and Happy Birthday"

To collapse a vector of strings into a single string, use the collapse argument:

str_c(c("x","y","z"), collapse = ", ")
[1] "x, y, z"
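As a small sketch of how sep and collapse interact (the input vectors are made up for illustration): sep is applied first, element by element, and collapse then joins the resulting vector into a single string.

```r
library(stringr)

# sep joins the arguments element-wise, giving a vector of length 2
pairs <- str_c(c("a", "b"), c("1", "2"), sep = "-")
pairs

# collapse then joins that vector into one string
joined <- str_c(c("a", "b"), c("1", "2"), sep = "-", collapse = ", ")
joined
```

Here pairs is c("a-1", "b-2") and joined is "a-1, b-2".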

Subsetting Strings

You can extract parts of a string using str_sub(). It takes start and end arguments that give the (inclusive) position of the substring. Negative numbers count backwards from the end. Note that str_sub() won’t fail if the string is too short: it will just return as much of the string as possible.

x <-  c("Apple", "Banana", "CaRrots")
str_sub(x, 1, 3)
[1] "App" "Ban" "CaR"
str_sub(x, -3, -1)
[1] "ple" "ana" "ots"
str_sub(x, 1, 7)
[1] "Apple"   "Banana"  "CaRrots"

Using the assignment form of str_sub(), we can change the capitalization of the first character of each string in x:

x
[1] "Apple"   "Banana"  "CaRrots"
str_sub(x,1,1) <- str_to_lower(str_sub(x,1,1))
x
[1] "apple"   "banana"  "caRrots"
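The same assignment form can be wrapped in a small helper. This is only a sketch; cap_first is a hypothetical name, not a stringr function, and it touches nothing but the first character of each string.

```r
library(stringr)

# Hypothetical helper: upper-case only the first character of each string
cap_first <- function(x) {
  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
  x
}

capitalized <- cap_first(c("apple", "banana", "caRrots"))
capitalized
```

Unlike str_to_title(), this leaves the remaining characters alone, so "caRrots" becomes "CaRrots" rather than "Carrots".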

Locales

You can also use str_to_upper() and str_to_title(). However, changing case is more complicated than it might at first appear, because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale:

# Turkish has two i's: with and without a dot
# and it has a different rule for capitalizing them
str_to_upper(c("i","I"), locale="tr")
[1] "İ" "I"

Locales also affect sorting. The base R order() and sort() functions sort strings using the current locale. If you want robust behavior across different computers, you may want to use str_sort() and str_order(), which take an additional locale argument:

x <- c("apple","eggplant","banana")
str_sort(x, locale = "en")
[1] "apple"    "banana"   "eggplant"
str_sort(x, locale = "haw")
[1] "apple"    "eggplant" "banana"  

Exercises: In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

The function paste() separates strings by spaces by default, while paste0() does not. Since str_c() does not separate strings with spaces by default, it is closer in behavior to paste0().

paste("foo", "bar")
[1] "foo bar"
paste0("foo", "bar")
[1] "foobar"

However, str_c() and the paste functions handle NA differently. The function str_c() propagates NA: if any argument is a missing value, it returns a missing value. This is in line with how numeric R functions, e.g. sum() and mean(), handle missing values. The paste functions, by contrast, convert NA to the string “NA” and then treat it like any other character vector.

str_c("foo", NA)
[1] NA
paste("foo", NA)
[1] "foo NA"
paste0("foo", NA)
[1] "fooNA"

In your own words, describe the difference between the sep and collapse arguments to str_c().
The sep argument is the string inserted between the arguments to str_c(), while collapse is the string used to separate the elements of the resulting character vector when they are combined into a character vector of length one.

Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?
The following code extracts the middle character. If the string has an even number of characters the choice is arbitrary. We choose ceiling(n / 2), because that also works if the string is only of length one. A more general method would allow the user to select either the floor or ceiling for the middle character of an even-length string.

x <- c("a", "abc", "abcd", "abcde", "abcdef")
L <- str_length(x)
m <- ceiling(L / 2)
str_sub(x, m, m)
[1] "a" "b" "b" "c" "c"

What does str_wrap() do? When might you want to use it?
The function str_wrap() wraps text so that it fits within a certain width. This is useful for wrapping long strings of text to be typeset.

What does str_trim() do? What’s the opposite of str_trim()?
The function str_trim() trims the whitespace from the start and end of a string; the side argument restricts trimming to one side.

str_trim(" abc ")
[1] "abc"
str_trim(" abc ", side = "left")
[1] "abc "
str_trim(" abc ", side = "right")
[1] " abc"

The opposite of str_trim() is str_pad(), which adds characters to either side of a string until it reaches a given width.

str_pad("abc", 5, side = "both")
[1] " abc "
str_pad("abc", 4, side = "right")
[1] "abc "
str_pad("abc", 4, side = "left")
[1] " abc"

Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.

str_commasep <- function(x, sep = ", ", last = ", and ") {
  if (length(x) > 1) {
    str_c(str_c(x[-length(x)], collapse = sep),
                x[length(x)],
                sep = last)
  } else {
    x
  }
}
str_commasep("")
[1] ""
str_commasep("a")
[1] "a"
str_commasep(c("a", "b"))
[1] "a, and b"
str_commasep(c("a", "b", "c"))
[1] "a, b, and c"
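The solution above returns "a, and b" for a vector of length two. One possible variant, sketched here under the assumption that two items should read "a and b" and that an empty vector should give an empty string, treats each short length explicitly. The name str_list_and is hypothetical.

```r
library(stringr)

# Hypothetical variant that handles lengths 0, 1 and 2 separately
str_list_and <- function(x) {
  n <- length(x)
  if (n == 0) return("")                                  # nothing to list
  if (n == 1) return(x)                                   # single item, no separator
  if (n == 2) return(str_c(x[1], " and ", x[2]))          # no comma for two items
  str_c(str_c(x[-n], collapse = ", "), ", and ", x[n])    # serial comma for three or more
}

str_list_and(character())
str_list_and("a")
str_list_and(c("a", "b"))
str_list_and(c("a", "b", "c"))
```

Whether length two should use a comma is a style decision; the point is only that each short case is decided deliberately rather than falling out of the general branch.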

Matching Patterns with regular expressions

Regexps are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around. To learn regular expressions, we’ll use str_view() and str_view_all(). These functions take a character vector and a regular expression, and show you how they match. We’ll start with very simple regular expressions and then gradually get more and more complicated.

Basic Matches

The simplest patterns match exact strings:

# need to install the htmlwidgets package first
library(tidyverse)
library(stringr)
x <-  c("apple", "banana", "pear")
str_view(x,"an")

The next step up in complexity is ., which matches any character (except a newline):

str_view(x, ".a.")

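To match a special character such as . literally, escape it with a backslash. Because \ is also the escape character in R strings, the regexp \. is written as the string "\\.":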

# dot
dot <- "\\."
writeLines(dot)
\.

This tells R to look for an explicit .:

str_view(c("abc", "a.c","bef"),"a\\.c")
str_view(c("abc", "a.c","bef"),".\\..")

To match a literal \, you need to write four backslashes in the string, "\\\\":

x <- "a\\b"
str_view(x, "\\\\")
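To see why four backslashes are needed, it can help to print the pattern. The sketch below uses str_detect() instead of str_view() so that the result shows up as plain console output; the test strings are made up.

```r
library(stringr)

pattern <- "\\\\"          # four backslashes in the R source
writeLines(pattern)        # the regexp itself is two characters: \\
pattern_length <- str_length(pattern)

# The regexp \\ matches a single literal backslash
has_backslash <- str_detect(c("a\\b", "ab"), pattern)
has_backslash
```

The string has length 2, and only the first test string, which contains one real backslash, is matched.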

Explain why each of these strings don’t match a \: "\", "\\", "\\\".
"\": This will escape the next character in the R string.
"\\": This will resolve to \ in the regular expression, which will escape the next character in the regular expression.
"\\\": The first two backslashes will resolve to a literal backslash in the regular expression, and the third will escape the next character. So in the regular expression, this will escape some escaped character.

How would you match the sequence "'\ ?

x <- c("\"'\\","a","b")
writeLines(x)
"'\
a
b
str_view(x, "\"'\\\\")

What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
It will match any pattern that is a dot followed by any character, repeated three times, for example ".a.b.c". As a string, it is written "\\..\\..\\..".

Anchors

It’s often useful to anchor the regular expression so that it matches from the start or end of the string. You can use:
^ to match the start of the string
$ to match the end of the string

Mnemonic: if you begin with power (^), you end up with money ($).

x <-  c("apple","banana","pear")
str_view(x,"^a")

str_view(x,"a$")

To force a regular expression to only match a complete string, enclose it with ^ and $

x <- c("apple pie", "apple", "apple cake")
# matches all three strings
str_view(x,"apple")

# matches only "apple"
str_view(x, "^apple$")

You can use \b to match the boundary between words:

x <- c("applecrust pie", "apple crust pie", "apple crumble cake")
str_view(x, "\\bcrust\\b")
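The same check can be sketched with str_detect(), which returns a logical vector instead of an HTML widget. The vector is the one used above.

```r
library(stringr)

x <- c("applecrust pie", "apple crust pie", "apple crumble cake")

# \b requires a word boundary on both sides of "crust"
whole_word <- str_detect(x, "\\bcrust\\b")
whole_word

# Without the boundaries, "applecrust" matches as well
anywhere <- str_detect(x, "crust")
anywhere
```

Only "apple crust pie" contains crust as a whole word; "applecrust pie" matches only when the boundaries are dropped.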

How would you match the literal string "$^$"?

str_view(c("$^$", "ab$^$sfas"), "^\\$\\^\\$$")

Given the corpus of common words in stringr::words, create regular expressions that find all words that meet each condition below. Since this list is long, you might want to use the match = TRUE argument to str_view() to show only the matching words.

Start with “y”.

str_view(words,"^y", match=TRUE)

End with “x”

str_view(words,"x$", match = TRUE)

Are exactly three letters long. (Don’t cheat by using str_length()!)

str_view(words,"^...$", match = TRUE)

Have seven letters or more.

str_view(words,"^.......", match = TRUE)

Character Classes and Alternatives

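\d matches any digit
\s matches any whitespace (e.g. space, tab, newline)
[abc] matches a, b, or c
[^abc] matches anything except a, b, or c

Remember that to create a regular expression containing \d or \s, you need to escape the \ in the string, so you type "\\d" or "\\s".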

You can use alternation to pick between one or more alternative patterns. For example, abc|d..f will match either “abc” or “deaf”. Note that the precedence for | is low, so that abc|xyz matches abc or xyz, not abcyz or abxyz. Use parentheses to make it clear what you want:

str_view(c("grey","gray"),"gr(e|a)y")
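The low precedence of | also interacts with anchors. In the sketch below (test strings made up), the unparenthesised pattern is read as "starts with abc" or "ends with xyz", which is rarely what was intended.

```r
library(stringr)

x <- c("abc", "xyz", "abcxyz")

# Parsed as (^abc) | (xyz$): each anchor belongs to one alternative only
loose <- str_detect(x, "^abc|xyz$")
loose

# Parentheses make both anchors apply to the whole alternation
strict <- str_detect(x, "^(abc|xyz)$")
strict
```

All three strings satisfy the loose pattern, while the strict one rejects "abcxyz".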

Exercises:
Create regular expressions to find all words that:

Start with a vowel.

str_view(words,"^[aeiou]", match = TRUE)

That only contain consonants. (Hint: thinking about matching “not”-vowels.)

str_view(stringr::words, "^[^aeiou]+$", match = TRUE)

End with ed, but not with eed.

str_view(stringr::words, "^ed$|[^e]ed$", match = TRUE)

End with ing or ise.

str_view(stringr::words, "ing$|ise$", match = TRUE)

Empirically verify the rule “i before e except after c”.

str_view(stringr::words, "(cei|[^c]ie)", match = TRUE)
str_view(stringr::words, "(cie|[^c]ei)", match = TRUE)

Is “q” always followed by a “u”?

str_view(stringr::words, "qu", match = TRUE)
str_view(stringr::words, "q[^u]", match = TRUE)

Create a regular expression that will match telephone numbers as commonly written in your country, using only what has been covered in R4DS thus far.

x <- c("123-456-7890", "1235-2351")
str_view(x, "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d")

Repetition

Repetition operators control how many times a pattern matches:
? 0 or 1
+ 1 or more
* 0 or more

x <-  "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, "CC[LX]+")

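You can also specify the number of matches precisely:
{n} exactly n times
{n,} n or more
{0,m} at most m (some regexp engines also accept {,m}, but the ICU engine used by stringr rejects it, as the error at the end of the following example shows)
{n,m} between n and m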

str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
str_view(x, "C{,2}")
Error in stri_locate_first_regex(string, pattern, opts_regex = opts(pattern)) : 
  Error in {min,max} interval. (U_REGEX_BAD_INTERVAL)

By default, these matches are greedy: they will match the longest string possible. You can make them lazy, matching the shortest string possible, by putting a ? after them.

str_view(x, 'C{2,3}?')
str_view(x, "C[LX]+?")
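The difference between greedy and lazy matching, and the {0,m} form of "at most m", can be sketched with str_extract(), which returns the matched text directly. The string is the one used above.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

greedy  <- str_extract(x, "C{2,3}")    # takes as many C's as allowed
lazy    <- str_extract(x, "C{2,3}?")   # stops as soon as the minimum is met
at_most <- str_extract(x, "C{0,2}L")   # {0,2} is the form to use for "at most 2"

c(greedy = greedy, lazy = lazy, at_most = at_most)
```

The greedy version returns "CCC", the lazy one "CC", and the bounded pattern returns "CCL".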

Describe the equivalents of ?, +, * in {m,n} form.
The equivalent of ? is {0,1}, matching at most once. The equivalent of + is {1,}, matching one or more times. The equivalent of * is {0,}, matching zero or more times.

Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

^.*$: Any string (that does not contain a newline, since . does not match \n by default).
"\\{.+\\}": Any string with curly braces surrounding at least one character.
\d{4}-\d{2}-\d{2}: A date in “%Y-%m-%d” format: four digits followed by a dash, followed by two digits, followed by a dash, followed by another two digits.
"\\\\{4}": This resolves to the regex \\{4}, which is four backslashes.

Create regular expressions to find all words that:

Start with three consonants.

str_view(words, "^[^aeiou]{3}", match = TRUE)

Have three or more vowels in a row.

str_view(words, "[aeiou]{3,}", match = TRUE)

Have two or more vowel-consonant pairs in a row.

str_view(words, "([aeiou][^aeiou]){2,}", match = TRUE)

Grouping and Backreferences

Parentheses are a way to disambiguate complex expressions. They also define groups that you can refer to with backreferences, like \1, \2, etc. For example, the following regular expression finds all fruits that have a repeated pair of letters:

str_view(fruit, "(..)\\1", match = TRUE)
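As a sketch of what the backreference does, here is the same pattern applied to three hand-picked strings, using str_extract() so that the matched text is visible.

```r
library(stringr)

x <- c("banana", "coconut", "apple")

# (..) captures any two characters; \1 requires the same two again
repeated_pair <- str_extract(x, "(..)\\1")
repeated_pair
```

"banana" yields "anan" and "coconut" yields "coco". "apple" gives NA: "pp" is a repeated single letter, not a repeated pair.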

Exercises

Describe, in words, what these expressions will match:

(.)\1\1: The same character appearing three times in a row. E.g. “aaa”.
"(.)(.)\\2\\1": A pair of characters followed by the same pair of characters in reversed order. E.g. “abba”.
(..)\1: Any two characters repeated. E.g. “a1a1”.
"(.).\\1.\\1": A character followed by any character, the original character, any other character, the original character again. E.g. “abaca”, “b8b.b”.
"(.)(.)(.).*\\3\\2\\1": Three characters followed by zero or more characters of any kind followed by the same three characters but in reverse order. E.g. “abcsgasgddsadgsdgcba” or “abccba” or “abc1cba”.

Construct regular expressions to match words that:

Start and end with the same character. Assuming the word is more than one character and all strings are considered words, the regular expression is ^(.).*\1$:

str_view(words, "^(.).*\\1$", match = TRUE)

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_view(words, "(.).*\\1.*\\1", match = TRUE)

Tools

Learn about stringr functions that let you:

  • Determine which strings match a pattern: str_detect() and str_count(), with str_view() and str_view_all() to inspect the matches.
  • Find the positions of matches: str_locate() and str_locate_all().
  • Extract the content of matches: str_extract() and str_extract_all().
  • Replace matches with new values: str_replace() and str_replace_all().
  • Split a string based on a match: str_split().

Detect Matches str_detect()

x <- c("apple", "banana", "pear")
str_detect(x,"e")
[1]  TRUE FALSE  TRUE

Since FALSE becomes 0 and TRUE becomes 1 in a numeric context, you can use sum() and mean() on the result:

# how many common words start with t?
sum(str_detect(words, "^t"))
[1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
[1] 0.2765306

A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient str_subset() wrapper:

# logical subsetting method
words[str_detect(words, "x$")]
[1] "box" "sex" "six" "tax"
# using str_subset method
str_subset(words, "x$")
[1] "box" "sex" "six" "tax"

Typically your strings will be one column of a data frame, and you’ll want to use filter() instead:

df <-  tibble(word=words, i=seq_along(word))
df %>%
  filter(str_detect(word, "x$"))

A variation on str_detect() is str_count(): rather than a simple yes or no, it tells you how many matches there are in a string.

x <- c("apple","banana","pear")
str_count(x, "a")
[1] 1 3 1
# on the average, how many vowels per word?
mean(str_count(words,"[aeiou]"))
[1] 1.991837

You can also use str_count() with mutate

df %>%
  mutate(
    vowels = str_count(word,"[aeiou]"), 
    consonants = str_count(word, "[^aeiou]")
  )

Note: matches never overlap. For example, in “abababa” how many times will the pattern “aba” match?

str_count("abababa","aba")
[1] 2
str_view_all("abababa","aba")

For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

Find all words that start or end with x.

str_view(words, "^x|x$", match = TRUE)

words[str_detect(words, "^x|x$")]
[1] "box" "sex" "six" "tax"
start_with_x <- str_detect(words, "^x")
end_with_x <- str_detect(words, "x$")
words[start_with_x | end_with_x]
[1] "box" "sex" "six" "tax"

Find all words that start with a vowel and end with a consonant.

words[str_detect(words,"^[aieou].*[^aeiou]$")]
  [1] "about"       "accept"      "account"     "across"      "act"         "actual"      "add"        
  [8] "address"     "admit"       "affect"      "afford"      "after"       "afternoon"   "again"      
 [15] "against"     "agent"       "air"         "all"         "allow"       "almost"      "along"      
 [22] "already"     "alright"     "although"    "always"      "amount"      "and"         "another"    
 [29] "answer"      "any"         "apart"       "apparent"    "appear"      "apply"       "appoint"    
 [36] "approach"    "arm"         "around"      "art"         "as"          "ask"         "at"         
 [43] "attend"      "authority"   "away"        "awful"       "each"        "early"       "east"       
 [50] "easy"        "eat"         "economy"     "effect"      "egg"         "eight"       "either"     
 [57] "elect"       "electric"    "eleven"      "employ"      "end"         "english"     "enjoy"      
 [64] "enough"      "enter"       "environment" "equal"       "especial"    "even"        "evening"    
 [71] "ever"        "every"       "exact"       "except"      "exist"       "expect"      "explain"    
 [78] "express"     "identify"    "if"          "important"   "in"          "indeed"      "individual" 
 [85] "industry"    "inform"      "instead"     "interest"    "invest"      "it"          "item"       
 [92] "obvious"     "occasion"    "odd"         "of"          "off"         "offer"       "often"      
 [99] "okay"        "old"         "on"          "only"        "open"        "opportunity" "or"         
[106] "order"       "original"    "other"       "ought"       "out"         "over"        "own"        
[113] "under"       "understand"  "union"       "unit"        "university"  "unless"      "until"      
[120] "up"          "upon"        "usual"      

Are there any words that contain at least one of each different vowel?

 
words[str_detect(words, "a") &
        str_detect(words, "e") &
        str_detect(words, "i") &
        str_detect(words, "o") &
        str_detect(words, "u")]
character(0)

What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)

prop_vowels <- str_count(words, "[aeiou]") / str_length(words)
words[which(prop_vowels == max(prop_vowels))]
[1] "a"
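The code above answers only the second half of the question. A sketch for the first half, the highest number of vowels, follows the same idea but drops the denominator. The variable names are illustrative, and no output is shown because it depends on the words vector shipped with your stringr version.

```r
library(stringr)

# Count vowels per word, then keep every word that reaches the maximum
n_vowels <- str_count(words, "[aeiou]")
most_vowels <- words[n_vowels == max(n_vowels)]
most_vowels
```

Comparing against max() rather than using which.max() keeps all ties instead of only the first word found.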

Extract Matches

To extract the actual text of a match, use str_extract():

# the Harvard sentences, which were designed to test VoIP systems
length(sentences)
[1] 720
head(sentences)
[1] "The birch canoe slid on the smooth planks."  "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."      "These days a chicken leg is a rare dish."   
[5] "Rice is often served in round bowls."        "The juice of lemons makes fine punch."      

Imagine we want to find all sentences that contain a color. We first create a vector of color names, and then turn it into a single regular expression.

library(stringr)
colors <- c("red","orange","yellow","green","blue","purple")
color_match <-  str_c(colors, collapse = "|")
color_match
[1] "red|orange|yellow|green|blue|purple"
colour_match2 <- str_c("\\b(", str_c(colors, collapse = "|"), ")\\b")
colour_match2
[1] "\\b(red|orange|yellow|green|blue|purple)\\b"

Now we can select the sentences that contain a color, and then extract the color to figure out which one it is:

has_color <- str_subset(sentences, color_match)
matches <- str_extract(has_color, color_match)
head(matches)
[1] "blue" "blue" "red"  "red"  "red"  "blue"
# str_extract() only extracts the first match; find the sentences with more than one
more <-  sentences[str_count(sentences, color_match)>1]
str_view_all(more, color_match)
more2 <- sentences[str_count(sentences, colour_match2) > 1]
str_view_all(more2, colour_match2, match = TRUE)
str_extract(more, color_match)
[1] "blue"   "green"  "orange"

To get ALL matches, use str_extract_all

str_extract_all(more, color_match)
[[1]]
[1] "blue" "red" 

[[2]]
[1] "green" "red"  

[[3]]
[1] "orange" "red"   

If you use simplify = TRUE, str_extract_all() will return a matrix with short matches expanded to the same length as the longest:

str_extract_all(more, color_match, simplify=TRUE)
     [,1]     [,2] 
[1,] "blue"   "red"
[2,] "green"  "red"
[3,] "orange" "red"
x <-  c("a","a b","a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
     [,1] [,2] [,3]
[1,] "a"  ""   ""  
[2,] "a"  "b"  ""  
[3,] "a"  "b"  "c" 

Grouped Matches

You can also use parentheses to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we’ll look for any word that comes after “a” or “the”. Defining a “word” in a regular expression is a little tricky, so here it is approximated as a sequence of at least one character that isn’t a space.

noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>%
  str_extract(noun)
 [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked" "the sun"    "the huge"   "the ball"  
 [9] "the woman"  "a helps"   

str_extract() gives us the complete match; str_match() gives us each individual component

has_noun %>%
  str_match(noun)
      [,1]         [,2]  [,3]     
 [1,] "the smooth" "the" "smooth" 
 [2,] "the sheet"  "the" "sheet"  
 [3,] "the depth"  "the" "depth"  
 [4,] "a chicken"  "a"   "chicken"
 [5,] "the parked" "the" "parked" 
 [6,] "the sun"    "the" "sun"    
 [7,] "the huge"   "the" "huge"   
 [8,] "the ball"   "the" "ball"   
 [9,] "the woman"  "the" "woman"  
[10,] "a helps"    "a"   "helps"  

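You can also use tidyr::extract(). It works like str_match(), but requires you to name the matches, which are then placed in new columns: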

tibble(sentence = sentences) %>%
  tidyr::extract(
    sentence, c("article", "noun"),"(a|the) ([^ ]+)", remove = FALSE
  )

Like str_extract, if you want ALL matches for each string, you’ll need str_match_all()

Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.

numword <- "(one|two|three|four|five|six|seven|eight|nine|ten) +(\\S+)"
sentences[str_detect(sentences, numword)] %>%
  str_extract(numword)
 [1] "ten served"    "one over"      "seven books"   "two met"       "two factors"   "one and"      
 [7] "three lists"   "seven is"      "two when"      "one floor."    "ten inches."   "one with"     
[13] "one war"       "one button"    "six minutes."  "ten years"     "one in"        "ten chased"   
[19] "one like"      "two shares"    "two distinct"  "one costs"     "ten two"       "five robins." 
[25] "four kinds"    "one rang"      "ten him."      "three story"   "ten by"        "one wall."    
[31] "three inches"  "ten your"      "six comes"     "one before"    "three batches" "two leaves."  
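The output above contains "ten served", which comes from the word "often", and str_extract() returns only the whole match rather than the two pieces. A possible refinement, sketched on two short examples (the second is made up), adds a word boundary before the number and uses str_match() to pull the number and the word apart. Using \w+ instead of \S+ is an extra assumption that drops trailing punctuation.

```r
library(stringr)

# \b stops "ten" from matching inside "often"; \w+ excludes the final "."
numword_b <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) +(\\w+)"

examples <- c("Rice is often served in round bowls.", "Take seven books home.")
pieces <- str_match(examples, numword_b)
pieces
```

The first row is all NA because no whole-word number is present. The second row holds "seven books", "seven" and "books".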

Find all contractions. Separate out the pieces before and after the apostrophe.

contraction <- "([A-Za-z]+)'([A-Za-z]+)"
sentences %>%
  `[`(str_detect(sentences, contraction)) %>%
  str_extract(contraction)
 [1] "It's"       "man's"      "don't"      "store's"    "workmen's"  "Let's"      "sun's"      "child's"   
 [9] "king's"     "It's"       "don't"      "queen's"    "don't"      "pirate's"   "neighbor's"
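To actually separate the pieces before and after the apostrophe, the same pattern can be passed to str_match(). This is a sketch on made-up phrases rather than the full sentences vector.

```r
library(stringr)

contraction <- "([A-Za-z]+)'([A-Za-z]+)"
phrases <- c("It's easy", "They don't know", "No apostrophe here")

# Column 1 is the full match, columns 2 and 3 are the two groups
parts <- str_match(phrases, contraction)
parts
```

The rows are "It's"/"It"/"s", "don't"/"don"/"t", and a row of NA for the phrase without a contraction.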

Replacing Matches

str_replace() and str_replace_all() allow you to replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

x <- c("apple","pear", "banana")
str_replace(x, "[aeiou]", "-")
[1] "-pple"  "p-ar"   "b-nana"
x <- c("apple","pear", "banana")
str_replace_all(x, "[aeiou]", "-")
[1] "-ppl-"  "p--r"   "b-n-n-"

Instead of replacing with a fixed string, you can use backreferences (groups defined with parentheses) to insert components of the match. In the following code, I flip the order of the second and third words:

sentences %>%
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
  head(5)
[1] "The canoe birch slid on the smooth planks."  "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well."      "These a days chicken leg is a rare dish."   
[5] "Rice often is served in round bowls."       
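str_replace_all() can also perform several replacements in one call if you give it a named vector, where the names are patterns and the values are replacements. A minimal sketch with made-up strings:

```r
library(stringr)

x <- c("1 house", "2 cars", "3 people")

# Names are the patterns to find, values are what to put in their place
spelled <- str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
spelled
```

This gives "one house", "two cars" and "three people".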

Splitting

Use str_split() to split a string up into pieces. For example, we could split sentences into words:

sentences %>%
  head(5) %>%
  str_split(" ")
[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth"  "planks."

[[2]]
[1] "Glue"        "the"         "sheet"       "to"          "the"         "dark"        "blue"       
[8] "background."

[[3]]
[1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."

[[4]]
[1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"       "rare"    "dish."  

[[5]]
[1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."
sentences %>%
  head(5) %>%
  str_split(" ", simplify =TRUE)
     [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]          [,9]   
[1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."     ""     
[2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background." ""     
[3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"           "well."
[4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"        "dish."
[5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""            ""     

Split up a string like “apples, pears, and bananas” into individual components.

x <- c("apples, pears, and bananas")
str_split(x, ", +(and +)?")[[1]]
[1] "apples"  "pears"   "bananas"

Why is it better to split up by boundary("word") than " "?
Answer: Splitting by boundary("word") splits on punctuation and not just whitespace.
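A short sketch of that difference, using the string from the previous exercise:

```r
library(stringr)

x <- "apples, pears, and bananas"

by_space <- str_split(x, " ")[[1]]               # commas stay attached
by_word  <- str_split(x, boundary("word"))[[1]]  # punctuation is dropped

by_space
by_word
```

Splitting on " " leaves "apples," and "pears," with their commas, while boundary("word") returns the four bare words.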

What does splitting with an empty string ("") do?

str_split("ab. cd|agt", "")[[1]]
 [1] "a" "b" "." " " "c" "d" "|" "a" "g" "t"

Answer: It splits the string into individual characters.

Find Matches

str_locate() and str_locate_all() give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use str_locate() to find the matching pattern, and str_sub() to extract and/or modify it.

# the regular call
str_locate(fruit, "nana") %>%
  head(10)
      start end
 [1,]    NA  NA
 [2,]    NA  NA
 [3,]    NA  NA
 [4,]     3   6
 [5,]    NA  NA
 [6,]    NA  NA
 [7,]    NA  NA
 [8,]    NA  NA
 [9,]    NA  NA
[10,]    NA  NA
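As a sketch of combining the two functions, the positions returned by str_locate() can be fed straight into str_sub(). The input vector is made up for illustration.

```r
library(stringr)

x <- c("banana", "apple")

# One row per string, with columns "start" and "end" (NA when there is no match)
loc <- str_locate(x, "nana")

# Use the located positions to pull out the matched text
found <- str_sub(x, loc[, "start"], loc[, "end"])
found
```

Strings without a match have NA positions, so the corresponding result is NA.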

Other Types of Pattern

When you use a pattern that’s a string, it’s automatically wrapped into a call to regex()

# the regular call
str_view(fruit, "nana", match = TRUE)

# is shorthand for
str_view(fruit, regex("nana"), match = TRUE)

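You can use other arguments of regex() to control details of the match:

  • ignore_case = TRUE allows characters to match either their uppercase or lowercase forms.
  • multiline = TRUE allows ^ and $ to match the start and end of each line rather than the start and end of the complete string.
  • comments = TRUE allows you to use comments and white space to make complex regular expressions more understandable.
  • dotall = TRUE allows . to match everything, including \n.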

bananas <-  c("banana","Banana", "BANANA")
str_view(bananas,"banana")
str_view(bananas,regex("banana", ignore_case = TRUE))
# multiline example
x <-  "Line 1\nLine 2 \nLine 3"
str_extract_all(x, "^Line") [[1]]
[1] "Line"
str_extract_all(x, regex("^Line", multiline = TRUE)) [[1]]
[1] "Line" "Line" "Line"
# comments = TRUE example
phone <-  regex("
                \\(?      # optional opening parens
                (\\d{3})  # area code
                [)\\ -]?  # optional closing parens, space, or dash
                (\\d{3})  # another three numbers
                [\\ -]?   # optional space or dash
                (\\d{4})  # four more numbers
                ", comments = TRUE)
str_match("514-791-8141", phone)
     [,1]           [,2]  [,3]  [,4]  
[1,] "514-791-8141" "514" "791" "8141"
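The ignore_case and dotall options can be checked in the same way with str_detect(). The strings below are made up for illustration.

```r
library(stringr)

# By default . does not match a newline
dot_default <- str_detect("a\nb", "a.b")
dot_all     <- str_detect("a\nb", regex("a.b", dotall = TRUE))

# ignore_case lets one pattern cover every capitalization
bananas  <- c("banana", "Banana", "BANANA")
any_case <- str_detect(bananas, regex("banana", ignore_case = TRUE))

c(dot_default = dot_default, dot_all = dot_all)
any_case
```

The default pattern fails across the newline, the dotall version succeeds, and all three spellings of banana are detected.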

There are three other functions you can use instead of regex():
fixed() matches exactly the specified sequence of bytes. It ignores all special regular expressions and operates at a very low level. This allows you to avoid complex escaping and can be much faster than regular expressions.

# microbenchmark shows that it's about 3x faster for a simple example
# you need to install microbenchmark package
microbenchmark::microbenchmark(
  fixed = str_detect(sentences, fixed("the")),
  regex = str_detect(sentences, "the"),
  times =20
  )
Unit: microseconds
  expr     min     lq     mean   median       uq     max neval
 fixed 323.410 340.05 365.8062 348.0145 364.7970 629.469    20
 regex 849.057 858.16 874.1598 862.9955 870.2485 990.425    20

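fixed() is faster, but be careful when using it with non-English data: there are often multiple ways of representing the same character. For example, “á” can be defined either as a single character or as an “a” plus a combining accent. In such cases, use coll() instead.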

a1 <-  "\u00e1"
a2 <- "a\u0301"
c(a1,a2)
[1] "á" "a´"
# they render the same, but a1 is not identical to a2
a1==a2
[1] FALSE

coll() compares strings using standard collation rules. This is useful for doing case-insensitive matching. Note that coll() takes a locale parameter that controls which rules are used for comparing characters.

str_detect(a1, fixed(a2))
[1] FALSE
str_detect(a1, coll(a2))
[1] TRUE

Both fixed() and regex() have an ignore_case argument, but neither allows you to pick the locale: they always use the default locale. Only coll() lets you pick the locale; its downside is speed. Finally, you can use boundary() to match boundaries:

x <-  "this is a sentence."
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
[[1]]
[1] "this"     "is"       "a"        "sentence"

How would you find all strings containing \ with regex() vs. with fixed()?

str_subset(c("a\\b", "ab"), "\\\\")
[1] "a\\b"
str_subset(c("a\\b", "ab"), fixed("\\"))
[1] "a\\b"

What are the five most common words in sentences?

library(tidyverse)
str_extract_all(sentences, boundary("word")) %>%
  unlist() %>%
  str_to_lower() %>%
  tibble() %>%
  set_names("word") %>%
  group_by(word) %>%
  count(sort = TRUE) %>%
  head(5)

Other uses of regular expressions

Two other useful functions in base R that also use regular expressions:
apropos() searches all objects available from the global environment. This is useful if you can’t quite remember the name of the function:

apropos("replace")
 [1] "%+replace%"              ".rs.registerReplaceHook" ".rs.replaceBinding"     
 [4] "replace"                 "replace_na"              "setReplaceMethod"       
 [7] "str_replace"             "str_replace_all"         "str_replace_na"         
[10] "theme_replace"          

dir() lists all the files in a directory. The pattern argument takes a regular expression and only returns filenames that match the pattern:

head(dir(pattern = "\\.Rmd$"))
[1] "chapter1.Rmd"  "chapter10.Rmd" "chapter11.Rmd" "Chapter2.Rmd"  "Chapter3.Rmd"  "Chapter4.Rmd" 

Stringi

stringr is built on top of the stringi package. stringr is useful when you’re learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks.

stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need: stringi has 234 functions to stringr’s 42.

Find the stringi functions that:

Count the number of words: stri_count_words()
Find duplicated strings: stri_duplicated()
Generate random text: there are several functions beginning with stri_rand_.
stri_rand_lipsum() generates lorem ipsum text,
stri_rand_strings() generates random strings,
stri_rand_shuffle() randomly shuffles the code points in the text.

How do you control the language that stri_sort() uses for sorting?
Use the locale argument of stri_opts_collator(), passed through the opts_collator argument of stri_sort().

---
title: "R For Data Science"
output: html_notebook
---

<h1> Chapter 11 Strings with stringr </h1>
This chapter introduces you to string manipulation in R. But the focus will be on regular expressions, or regexps for short. Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings. When you first look at a regexp, you'll think a cat walked across your keyboard, but as your understanding improves, they will soon start to make sense. 

<h2> Prerequisites </h2>
library(tidyverse)
library(stringr)

<h2>String Basics </h2>
You can create strings with either single or double quotes; there is no difference in behavior. Double quotes are the recommended default, unless the string itself contains multiple double quotes.
```{r}
string1 <-  "This is a string"
string2 <- 'To put a "quote" inside a string, use single quotes'

string1
string2
```

You can also include a literal single or double quote in a string by escaping it with a backslash, `\` </br>


```{r}
double_quote <-  "\"" # or '"'
single_quote <-  '\'' #  or "'"
double_quote
single_quote

```
To see the raw contents of the string, use writeLines():

```{r}
writeLines(double_quote)
writeLines(single_quote)
```
Other special characters include `\n` for newline and `\t` for tab. You will also sometimes see strings like `"\u00b5"`, which is a way of writing non-English characters that works on all platforms.

```{r}
x <-  "\u00b5"
x
writeLines(x)
```

Multiple strings can be stored in a character vector, which you can create with c():

```{r}
c("one","two","three")
c

```




<div id="strings" class="section level1">
<div id="introduction-4" class="section level2">
<p>Functions and packages covered</p>
<ul>
<li><strong>stringr</strong> package</li>
<li><code>str_length</code></li>
<li><code>str_c</code></li>
<li><code>str_replace_na</code></li>
<li><code>str_sub</code></li>
<li><code>str_to_upper</code>, <code>str_sort</code>, <code>str_to_lower</code>, <code>str_order</code></li>
<li><code>str_length</code>, <code>str_pad</code>, <code>str_trim</code>, <code>str_sub</code></li>
<li>For regex = <code>str_view</code>, <code>str_view_all</code></li>
<li>regex syntax</li>
<li><code>str_detect</code></li>
<li><code>str_subset</code></li>
<li><code>str_count</code></li>
<li><code>str_extract</code></li>
<li><code>str_match</code></li>
<li><code>tidyr::extract</code></li>
<li><code>str_split</code></li>
<li><code>str_locate</code></li>
<li><code>str_sub</code></li>
<li>the <strong>stringi</strong> package</li>
</ul>
<p>Ideas</p>
<ul>
<li>mention <a href="https://github.com/kevinushey/rex"><code>rex</code></a>. A package with friendly regular expressions.</li>
<li>Use it to match country names? Extract numbers from text?</li>
<li>Discuss fuzzy joining and string distance, approximate matching.</li>
</ul>
<p>Also see</p>
<ul>
<li><a href="http://stat545.com/block032_character-encoding.html">Character encoding</a> Stat 545. Jenny Bryan.</li>
<li><a href="http://stat545.com/block028_character-data.html">Character data</a>. Stat 545. Jenny Bryan.</li>
<li><a href="http://stat545.com/block022_regular-expression.html">Regular expression in R</a>. Stat 545. Jenny Bryan.</li>
</ul>


<h2> String Length </h2>
str_length() tells you the number of characters in a string 

```{r}
library(tidyverse)
library(stringr)
str_length( c("a","R for data science", NA))

```


<h2> Combining Strings </h2>
Use str_c() to combine two or more strings. Use the sep argument to control how they're separated.</br>
By default str_c() behaves much like paste0(): it combines strings with no separator between them.

```{r}
str_c("x","y")
str_c("x","y","z")
```

```{r}
str_c("x","y", sep = ", ")
```

Like most other functions in R, missing values are contagious. If you want them to print as "NA", use str_replace_na()

```{r}
x <-  c("abc", NA)
str_c("|-",x,"-|")
str_c("|-",str_replace_na(x),"-|")

```

str_c() is vectorized, and it automatically recycles shorter vectors to the same length as the longest:

```{r}
str_c("prefix-",c("a","b","c"), "-suffix")
```

Objects of length 0 are silently dropped. This is particularly useful in conjunction with if:

```{r}
name <- "Hadley"
time_of_day <-  "morning"
birthday <-  TRUE
# birthday <-  FALSE
str_c("Good",time_of_day,name, if(birthday) " and Happy Birthday", sep = " ")


```

To collapse a vector of strings into a single string, use the collapse argument:

```{r}
str_c(c("x","y","z"), collapse = ", ")
```

<h2> Subsetting Strings </h2>
You can extract parts of a string using str_sub(). It takes start and end arguments that give the (inclusive) position of the substring. Negative numbers count backwards from the end. str_sub() won't fail if the string is too short: it just returns as much of the string as possible.

```{r}
x <-  c("Apple", "Banana", "CaRrots")
str_sub(x, 1, 3)
str_sub(x, -3, -1)
str_sub(x, 1, 7)
```

Using the assignment form of str_sub(), we can modify strings in place. Here we lower-case the first character of each string in x:

```{r}
x
str_sub(x,1,1) <- str_to_lower(str_sub(x,1,1))
x

```


<h2> Locales </h2>
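Above we used str_to_lower() to change text to lower case. You can also use str_to_upper() and str_to_title(). However, changing case is more complicated than it first appears, because different languages have different rules for changing case. You can pick which set of rules to apply by specifying a locale: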

```{r}
# turkish has two i's: with and without a dot
# it has a different rule for capitalizing them

str_to_upper(c("i","I"), locale = "tr")
```

Locales also affect sorting. The base R order() and sort() functions sort strings using the current locale. If you want robust behavior across different computers, use str_sort() and str_order(), which take an additional locale argument.

```{r}
x <- c("apple","eggplant","banana")
str_sort(x, locale = "en")
str_sort(x, locale = "haw")
```

Exercises:
In code that doesn't use stringr, you'll often see paste() and paste0(). What's the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA? </br>

paste() separates strings with a space by default, while paste0() uses no separator. Since str_c() also uses no separator by default, it is closer in behavior to paste0().<br>

```{r}
paste("foo", "bar")
#> [1] "foo bar"
paste0("foo", "bar")
#> [1] "foobar"
```


However, str_c() and the paste functions handle NA differently. str_c() propagates NA: if any argument is a missing value, it returns a missing value. This is in line with how numeric R functions such as sum() and mean() handle missing values. The paste functions, by contrast, convert NA to the string "NA" and then treat it like any other character vector.</br>

```{r}
str_c("foo", NA)

paste("foo", NA)

paste0("foo", NA)

```
In your own words, describe the difference between the <b>sep</b> and <b>collapse</b> arguments to str_c(). </br>
The sep argument is the string inserted between the arguments to str_c(), while collapse is the string used to join the elements of the resulting character vector into a character vector of length one. </p>
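
As a small illustration of the difference (a sketch with made-up input vectors, not taken from the book), the same call is shown with sep alone and then with both sep and collapse:

```{r}
library(stringr)

# sep is placed between the arguments, element by element: the result has length 3
str_c(c("a", "b", "c"), c("1", "2", "3"), sep = "-")

# collapse then joins the elements of that result into one string of length 1
str_c(c("a", "b", "c"), c("1", "2", "3"), sep = "-", collapse = ", ")
```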


Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters? </br>
The following code extracts the middle character. If the string has an even number of characters, the choice is arbitrary. We choose ceiling(n / 2), the left of the two central characters, because it also works when the string has length one. A more general method would allow the user to select either the left or the right of the two central characters of an even-length string. </p>

```{r}
x <- c("a", "abc", "abcd", "abcde", "abcdef")
L <- str_length(x)
m <- ceiling(L / 2)
str_sub(x, m, m)

```
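
One way the more general method could look is sketched below. The helper name str_middle and its pick argument are hypothetical, introduced only for this illustration.

```{r}
library(stringr)

# Hypothetical helper: `pick` chooses which of the two central characters
# to return when the string has an even number of characters.
str_middle <- function(x, pick = c("left", "right")) {
  pick <- match.arg(pick)
  n <- str_length(x)
  # for odd n both formulas give the same position
  m <- if (pick == "left") ceiling(n / 2) else floor(n / 2) + 1
  str_sub(x, m, m)
}

str_middle(c("a", "abc", "abcd"))
str_middle(c("a", "abc", "abcd"), pick = "right")
```

The two calls differ only for "abcd", where the left choice returns "b" and the right choice returns "c".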

What does str_wrap() do? When might you want to use it? </br>
The function str_wrap wraps text so that it fits within a certain width. This is useful for wrapping long strings of text to be typeset.</p>
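
A short sketch of str_wrap() in action, using a made-up sentence and an arbitrary width of 30 characters:

```{r}
library(stringr)

text <- "The function str_wrap() reflows a long string so that no line is wider than the requested width."
wrapped <- str_wrap(text, width = 30)

# str_wrap() returns a single string with embedded newlines;
# writeLines() shows how it looks when printed
writeLines(wrapped)
```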

What does str_trim() do? What's the opposite of str_trim()? </br>
The function str_trim trims the whitespace from a string.

```{r}
str_trim(" abc ")
#> [1] "abc"
str_trim(" abc ", side = "left")
#> [1] "abc "
str_trim(" abc ", side = "right")
#> [1] " abc"
```

The opposite of str_trim is str_pad which adds characters to each side.</p>
```{r}
str_pad("abc", 5, side = "both")
#> [1] " abc "
str_pad("abc", 4, side = "right")
#> [1] "abc "
str_pad("abc", 4, side = "left")
#> [1] " abc"
```

Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2. </p>

```{r}
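str_commasep <- function(x, sep = ", ", last = ", and ") {
  n <- length(x)
  if (n == 0) {
    ""                                       # nothing to join
  } else if (n == 1) {
    x                                        # a single element is returned as is
  } else if (n == 2) {
    str_c(x[[1]], "and", x[[2]], sep = " ")  # no comma between two items
  } else {
    str_c(str_c(x[-n], collapse = sep), x[[n]], sep = last)
  }
}
str_commasep(character(0))
str_commasep("a")
str_commasep(c("a", "b"))
str_commasep(c("a", "b", "c"))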

```

<h2> Matching Patterns with regular expressions </h2>
Regexps are a very terse language that lets you describe patterns in strings. They take a little while to get your head around. To learn regular expressions, we'll use str_view() and str_view_all(). These functions take a character vector and a regular expression and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated.


<h3> Basic Matches </h3>
The simplest patterns match exact strings:

```{r}
# need to install the htmlwidgets package first
library(tidyverse)
library(stringr)
x <-  c("apple", "banana", "pear")
str_view(x,"an")
```

. matches any character

```{r}
str_view(x, ".a.")
```

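To match a special character such as a literal `.`, escape it with a backslash. Because `\` is also the escape character in R strings, the regular expression `\.` has to be written as the string `"\\."`: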

```{r}
# dot
dot <- "\\."
writeLines(dot)
```

This tells R to look for an explicit `.` character:

```{r}
str_view(c("abc", "a.c","bef"),"a\\.c")


```
```{r}
str_view(c("abc", "a.c","bef"),".\\..")

```

To match a literal `\`, you need four backslashes in the string: `"\\\\"`

```{r}
x <- "a\\b"
str_view(x, "\\\\")

```


Explain why each of these strings doesn't match a `\`: `"\"`, `"\\"`, `"\\\"`. </br>
`"\"`: This will escape the next character in the R string. </br>
`"\\"`: This will resolve to `\` in the regular expression, which will escape the next character in the regular expression.</br>
`"\\\"`: The first two backslashes will resolve to a literal backslash in the regular expression, and the third will escape the next character. So in the regular expression, this will escape some escaped character.</br>


How would you match the sequence "'\ ?

```{r}
# the sequence is a double quote, a single quote, and a backslash
x <- c("\"'\\", "a", "b")
writeLines(x)
str_view(x, "\"'\\\\")
```


What patterns will the regular expression `\..\..\..` match? How would you represent it as a string? </br>
It will match any pattern that is a dot followed by any character, repeated three times, for example ".a.b.c". As an R string it is written `"\\..\\..\\.."`.
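
To check this answer, the sketch below writes the pattern as an R string and tries it on a few made-up inputs:

```{r}
library(stringr)

pattern <- "\\..\\..\\.."   # the R string for the regex \..\..\..
writeLines(pattern)          # shows the regex that the engine receives

# only strings containing dot-char-dot-char-dot-char match
str_detect(c(".a.b.c", "a.b.c", "...x.y"), pattern)
```

The second string fails because it contains only two dots; the third matches because a dot can itself be the "any character".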


<h3> Anchors </h3>
It's often useful to anchor the regular expression so that it matches from the start or end of the string. You can use: </br>
^ to match the start of the string </br>
$ to match the end of the string </br>

Mnemonic: if you begin with power (`^`), you end up with money (`$`).

```{r}
x <-  c("apple","banana","pear")
str_view(x,"^a")
str_view(x,"a$")
```

To force a regular expression to only match a complete string, enclose it with ^ and $

```{r}
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
# matches all three strings
str_view(x, "^apple$")
# matches only "apple"

```
You can use `\b` to match the boundary between words:

```{r}
x <- c("applecrust pie", "apple crust pie", "apple crumble cake")
str_view(x, "\\bcrust\\b")

```


How would you match the literal string `"$^$"`?
```{r}
str_view(c("$^$", "ab$^$sfas"), "^\\$\\^\\$$")
```

Given the corpus of common words in stringr::words, create regular expressions that find all words that meet each of the conditions below. </br>
Since this list is long, you might want to use the <b>match</b> argument to str_view() to show only the matching (match = TRUE) or non-matching (match = FALSE) words.</br>

Start with "y".</br>
```{r}
str_view(words, "^y", match = TRUE)

```

End with "x"</br>

```{r}
str_view(words,"x$", match = TRUE)

```

Are exactly three letters long. (Don't cheat by using str_length()!)</br>

```{r}
str_view(words,"^...$", match = TRUE)
```

Have seven letters or more.</br>
```{r}
str_view(words,"^.......", match = TRUE)
```

<h3> Character Classes and Alternatives </h3>
\d matches any digit </br>
\s matches any whitespace </br>
[abc] matches a,b or c </br>
[^abc] matches anything except a, b or c </br>

remember to use \\d and \\s for those special characters </br>

You can use alternation to pick between one or more alternative patterns. For example, abc|d..f will match either "abc" or "deaf". Note that the precedence for | is low, so that abc|xyz matches abc or xyz, not abcyz or abxyz. Use parentheses to make it clear what you want:

```{r}
str_view(c("grey","gray"),"gr(e|a)y")
```
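
The precedence rule can be checked directly. The following sketch uses four made-up test strings and anchors both patterns so that only complete matches count:

```{r}
library(stringr)

x <- c("abc", "xyz", "abcyz", "abxyz")

# alternation between the two whole alternatives: abc or xyz
str_detect(x, "^(abc|xyz)$")

# parentheses restrict the choice to the middle character: abcyz or abxyz
str_detect(x, "^ab(c|x)yz$")
```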


Exercises: </br>
Create regular expressions to find all words that:

Start with a vowel.

```{r}
str_view(words, "^[aeiou]", match = TRUE)
```

That only contain consonants. (Hint: thinking about matching "not"-vowels.)
```{r}
str_view(stringr::words, "^[^aeiou]+$", match = TRUE)
```

End with ed, but not with eed.
```{r}
str_view(stringr::words, "^ed$|[^e]ed$", match = TRUE)
```

End with ing or ise.
```{r}
str_view(stringr::words, "ing$|ise$", match = TRUE)
# str_view(stringr::words, "i(ng|se)$", match = TRUE)
```


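Empirically verify the rule "i before e except after c". </br>
The first pattern below finds words that follow the rule (cei, or ie not preceded by c); the second finds words that break it (cie, or ei not preceded by c).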

```{r}
str_view(stringr::words, "(cei|[^c]ie)", match = TRUE)
```


```{r}
str_view(stringr::words, "(cie|[^c]ei)", match = TRUE)
```
Is "q" always followed by a "u"?

```{r}
str_view(stringr::words, "qu", match = TRUE)
```

```{r}
str_view(stringr::words, "q[^u]", match = TRUE)
```

Create a regular expression that will match telephone numbers as commonly written in your country.
Using what has been covered in R4DS thus far


```{r}
x <- c("123-456-7890", "1235-2351")
str_view(x, "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d")
```


<h3> Repetition </h3>
Repetition operators control how many times a pattern matches:</br>
`?`: 0 or 1 </br>
`+`: 1 or more</br>
`*`: 0 or more </br>


```{r}
x <-  "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
```

```{r}
str_view(x, "CC+")
```

```{r}
str_view(x, "CC[LX]+")
```

You can also specify the number of matches precisely: </br>
`{n}`: exactly n times </br>
`{n,}`: n or more </br>
`{0,m}`: at most m (the `{,m}` shorthand is not part of the ICU syntax that stringr uses) </br>
`{n,m}`: between n and m </br>

```{r}
str_view(x, "C{2}")
```

```{r}
str_view(x, "C{2,}")
```
```{r}
str_view(x, "C{2,3}")
```

```{r}
# at most two; written as {0,2} because the ICU engine may reject {,2}
str_view(x, "C{0,2}")
```

By default, these matches are greedy: they will match the longest string possible. You can make them lazy, matching the shortest string possible, by putting a `?` after them.

```{r}
str_view(x, 'C{2,3}?')
```
```{r}
str_view(x, "C[LX]+?")
```
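
The difference between greedy and lazy matching is easier to see with str_extract(). The HTML-like string below is a made-up example:

```{r}
library(stringr)

html <- "<b>bold</b> and <i>italic</i>"

# greedy: .+ grabs as much as possible, so the match runs to the last ">"
str_extract(html, "<.+>")

# lazy: .+? stops at the first ">" that completes a match
str_extract(html, "<.+?>")
```
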
Describe the equivalents of ?, +, * in {m,n} form. </br>
The equivalent of ? is {0,1}, matching at most once. The equivalent of + is {1,}, matching one or more times. The equivalent of * is {0,}, matching zero or more times. </p>
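
The equivalences can be checked empirically. This sketch compares what each shorthand and its brace form extract from a few made-up strings:

```{r}
library(stringr)

x <- c("", "a", "aa", "aaa")

# each pair of patterns should extract exactly the same text
identical(str_extract(x, "^a?"), str_extract(x, "^a{0,1}"))
identical(str_extract(x, "^a+"), str_extract(x, "^a{1,}"))
identical(str_extract(x, "^a*"), str_extract(x, "^a{0,}"))
```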


Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.) </br>

`^.*$`: Any string that does not contain a newline, since . does not match a newline by default. </br>
`"\\{.+\\}"`: Any string with curly braces surrounding at least one character. </br>
`\d{4}-\d{2}-\d{2}`: A date in "%Y-%m-%d" format: four digits followed by a dash, followed by two digits, followed by a dash, followed by another two digits. </p>

`"\\\\{4}"`: This resolves to the regex `\\{4}`, which matches four backslashes. </p>

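As a quick check of the date pattern, the sketch below tries it on a few made-up strings. The variable name date_like is only illustrative.

```{r}
library(stringr)

date_like <- "\\d{4}-\\d{2}-\\d{2}"

# only the first string has the four-two-two digit shape with dashes
str_detect(c("2017-05-31", "17-5-31", "2017/05/31"), date_like)

# the pattern checks the shape only, so an impossible date still matches
str_detect("2017-99-99", date_like)
```
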


Create regular expressions to find all words that: </br>
find all words starting with three consonants
```{r}
str_view(words, "^[^aeiou]{3}", match = TRUE)
```

find three or more vowels in a row:
```{r}
str_view(words, "[aeiou]{3,}", match = TRUE)
```

Find Two or more vowel-consonant pairs in a row.

```{r}
str_view(words, "([aeiou][^aeiou]){2,}", match = TRUE)
```

<h3> Grouping and Backreferences </h3>
Parentheses are a way to disambiguate complex expressions. They also define groups that you can refer to with backreferences, like `\1`, `\2`, etc. For example, the following regular expression finds all fruits that have a repeated pair of letters:

```{r}
str_view(fruit, "(..)\\1", match = TRUE)
```

Exercises </br>

Describe, in words, what these expressions will match: </br>

`(.)\1\1`: The same character appearing three times in a row. E.g. "aaa" </br>
`"(.)(.)\\2\\1"`: A pair of characters followed by the same pair of characters in reversed order. E.g. "abba". </br>
`(..)\1`: Any two characters repeated. E.g. "a1a1". </br>
`"(.).\\1.\\1"`: A character followed by any character, the original character, any character, and the original character again. E.g. "abaca", "b8b.b". </br>
`"(.)(.)(.).*\\3\\2\\1"`: Three characters followed by zero or more characters of any kind, followed by the same three characters in reverse order. E.g. "abcsgasgddsadgsdgcba" or "abccba" or "abc1cba". </br>

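These descriptions can be spot-checked with str_detect(). In the sketch below, each pattern is paired with one made-up string that should match and one that should not:

```{r}
library(stringr)

str_detect(c("aaa", "aab"), "(.)\\1\\1")         # same character three times in a row
str_detect(c("abba", "abab"), "(.)(.)\\2\\1")    # a pair followed by its reverse
str_detect(c("abaca", "abcda"), "(.).\\1.\\1")   # same character at positions 1, 3 and 5
```
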

Construct regular expressions to match words that:</br>

Start and end with the same character. Assuming the word is more than one character and all strings are considered words, ^(.).*\1$ </br>

```{r}
str_view(words, "^(.).*\\1$", match = TRUE)
```


Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.) 

```{r}
str_view(words, "(.).*\\1.*\\1", match = TRUE)

```



<h2> Tools </h2>

Learn about stringr functions that let you: </br>
Determine which strings match a pattern: str_detect() and str_count(), with str_view() and str_view_all() to inspect matches visually </br>
Find the position of matches: str_locate() and str_locate_all() </br>
Extract the content of matches: str_extract() and str_extract_all() </br>
Replace matches with new values: str_replace() and str_replace_all()</br>
Split a string based on a match: str_split()</br>

<h3> Detect Matches str_detect() </h3>

```{r}
x <- c("apple", "banana", "pear")
str_detect(x,"e")
```

Since FALSE becomes 0 and TRUE becomes 1 in a numeric context, you can pass the result to sum() and mean():

```{r}
# how many common words start with t?
sum(str_detect(words, "^t"))
```

```{r}
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
```

A common use of str_detect is to select the elements that match a pattern. You can do this with logical subsetting , or the convenient str_subset() wrapper.

```{r}
# logical subsetting method
words[str_detect(words, "x$")]
```


```{r}
# using str_subset method
str_subset(words, "x$")
```

Typically your strings will be one column of a data frame, and you'll want to use filter() instead:

```{r}
df <- tibble(word = words, i = seq_along(word))
df %>%
  filter(str_detect(word, "x$"))  # refer to the column word, not the global vector words
```

A variation on str_detect() is str_count(). Rather than a simple yes or no, it tells you how many matches there are in a string.

```{r}
x <- c("apple","banana","pear")
str_count(x, "a")
```

```{r}
# on the average, how many vowels per word?
mean(str_count(words,"[aeiou]"))
```

You can also use str_count() with mutate

```{r}
df %>%
  mutate(
    vowels = str_count(word,"[aeiou]"), 
    consonants = str_count(word, "[^aeiou]")
  )
```


Note: matches never overlap. For example, in "abababa" how many times will the pattern "aba" match? 

```{r}
str_count("abababa","aba")
str_view_all("abababa","aba")
```




For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls. </br>

Find all words that start or end with x.
```{r}
str_view(words, "^x|x$", match = TRUE)
words[str_detect(words, "^x|x$")]
start_with_x <- str_detect(words, "^x")
end_with_x <- str_detect(words, "x$")
words[start_with_x | end_with_x]

```

Find all words that start with a vowel and end with a consonant.
```{r}
words[str_detect(words, "^[aeiou].*[^aeiou]$")]
```

Are there any words that contain at least one of each different vowel?

```{r}
 
words[str_detect(words, "a") &
        str_detect(words, "e") &
        str_detect(words, "i") &
        str_detect(words, "o") &
        str_detect(words, "u")]


```
What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)

```{r}
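vowels <- str_count(words, "[aeiou]")
words[vowels == max(vowels)]               # highest number of vowels
prop_vowels <- vowels / str_length(words)  # the denominator is the word length
words[prop_vowels == max(prop_vowels)]     # highest proportion of vowels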
```



<h2> Extract Matches </h2>
To extract the actual text of a match, use str_extract():

```{r}
# using harvard sentences which were designed to test voip systems
length(sentences)
head(sentences)
```


Imagine we want to find all sentences that contain a color. We first create a vector of color names, and then turn it into a single regular expression. 

```{r}
library(stringr)
colors <- c("red","orange","yellow","green","blue","purple")
color_match <-  str_c(colors, collapse = "|")
color_match

colour_match2 <- str_c("\\b(", str_c(colors, collapse = "|"), ")\\b")
colour_match2


```

Now we can select the sentences that contain a color, and then extract the color to figure out which one it is:

```{r}
has_color <- str_subset(sentences, color_match)
matches <- str_extract(has_color, color_match)
head(matches)
```

```{r}
# but str_extract only extracts the first match.
more <-  sentences[str_count(sentences, color_match) >  1]
str_view_all(more, color_match)
```

```{r}
# a better example using \\b for colour_match to take out flickeRED

more2 <- sentences[str_count(sentences, colour_match2) > 1]
str_view_all(more2, colour_match2, match = TRUE)

```


```{r}
str_extract(more, color_match)
```

To get ALL matches, use str_extract_all

```{r}
str_extract_all(more, color_match)
```

If you use simplify = TRUE, str_extract_all() will return a matrix with short matches expanded to the same length as the longest.

```{r}
str_extract_all(more, color_match, simplify=TRUE)
```

```{r}
x <-  c("a","a b","a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)

```


<h2> Grouped Matches </h2>

You can also use parentheses to extract parts of a complex match.  For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the" 
Defining a word in a regular expression is a little tricky.

```{r}
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
  str_subset(noun) %>%
head(10)

has_noun %>%
  str_extract(noun)

```


str_extract() gives us the complete match; 
str_match() gives us each individual component

```{r}
has_noun %>%
  str_match(noun)
```


You can also use tidyr::extract():

```{r}
tibble(sentence = sentences) %>%
  tidyr::extract(
    sentence, c("article", "noun"),"(a|the) ([^ ]+)", remove = FALSE
  )
```


Like str_extract(), if you want all matches for each string, you'll need str_match_all().
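
As a brief sketch of the difference, the made-up sentence below contains two article-noun pairs. str_match() would return only the first, while str_match_all() returns one matrix per input string with a row for every match:

```{r}
library(stringr)

noun <- "(a|the) ([^ ]+)"

# a list with one matrix per input string; each row is a match,
# with the full match first and the two groups after it
str_match_all("a cat chased the dog", noun)
```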

Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.



```{r}
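# \\b stops the number from matching inside a longer word such as "stone"
numword <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) +(\\S+)"
sentences[str_detect(sentences, numword)] %>%
  str_match(numword)  # the columns hold the full match, the number, and the word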
```


Find all contractions. Separate out the pieces before and after the apostrophe.
```{r}
contraction <- "([A-Za-z]+)'([A-Za-z]+)"
sentences %>%
  str_subset(contraction) %>%
  str_match(contraction)  # the columns hold the contraction and the pieces before and after the apostrophe

```

<h2> Replacing Matches </h2>

str_replace() and str_replace_all() allow you to replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

```{r}

x <- c("apple","pear", "banana")
str_replace(x, "[aeiou]", "-")
```


```{r}
x <- c("apple","pear", "banana")
str_replace_all(x, "[aeiou]", "-")
```

You can use backreferences to insert components of the match (the groups defined by parentheses) into the replacement. In the following code, I flip the order of the second and third words:

```{r}
sentences %>%
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
  head(5)

```
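
str_replace_all() can also perform several replacements in one call when given a named vector, where the names are patterns and the values are replacements. A minimal sketch with made-up input:

```{r}
library(stringr)

x <- c("1 house", "2 cars", "3 people")

# names are the patterns, values are the replacements
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```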


<h2> Splitting </h2>
Use str_split() to split a string up into pieces. For example, we could split sentences into words:

```{r}
sentences %>%
  head(5) %>%
  str_split(" ")
```

```{r}
sentences %>%
  head(5) %>%
  str_split(" ", simplify =TRUE)

```

Split up a string like "apples, pears, and bananas" into individual components.

```{r}
x <- c("apples, pears, and bananas")
str_split(x, ", +(and +)?")[[1]]
```
Why is it better to split up by boundary("word") than " "? </br>
Answer: Splitting by boundary("word") splits on punctuation and not just whitespace.
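
The difference is visible on the string from the previous exercise. Splitting on a space leaves the commas attached to the words, while boundary("word") returns the words alone:

```{r}
library(stringr)

x <- "apples, pears, and bananas"

str_split(x, " ")[[1]]               # punctuation stays attached
str_split(x, boundary("word"))[[1]]  # punctuation and whitespace are dropped
```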


What does splitting with an empty string ("") do?

```{r}
str_split("ab. cd|agt", "")[[1]]

```

Answer: It splits the string into individual characters. 

<h3> Find Matches </h3>

str_locate() and str_locate_all() give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use str_locate() to find the matching pattern, and str_sub() to extract and/or modify it.

```{r}
# start and end position of the first match in each string (NA if there is no match)
str_locate(fruit, "nana") %>%
  head(10)
```
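
A minimal sketch of combining the two functions on a made-up string: str_locate() finds the position, and str_sub() then extracts or overwrites the text at that position.

```{r}
library(stringr)

x <- "banana split"
loc <- str_locate(x, "nana")   # a matrix with columns "start" and "end"
loc

# extract the located text
str_sub(x, loc[, "start"], loc[, "end"])

# replace it in place using the assignment form of str_sub()
str_sub(x, loc[, "start"], loc[, "end"]) <- "NANA"
x
```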


<h2> Other Types of Pattern </h2>
When you use a pattern that's a string, it's automatically wrapped into a call to regex()

```{r}
# the regular call
str_view(fruit, "nana", match = TRUE)
# is shorthand for
str_view(fruit, regex("nana"), match = TRUE)
```


You can use other arguments of regex() to control details of the match: </br>
ignore_case = TRUE allows characters to match either their uppercase or lowercase forms </br>
multiline = TRUE allows ^ and $ to match the start and end of each line rather than the start and end of the complete string </br>
comments = TRUE allows you to use comments and white space to make complex regular expressions more understandable </br>
dotall = TRUE allows . to match everything, including `\n`</br>


```{r}
bananas <-  c("banana","Banana", "BANANA")
str_view(bananas,"banana")
```

```{r}
str_view(bananas, regex("banana", ignore_case = TRUE))
```
```{r}
# multiline example
x <-  "Line 1\nLine 2 \nLine 3"
str_extract_all(x, "^Line") [[1]]
str_extract_all(x, regex("^Line", multiline = TRUE)) [[1]]

```
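
The dotall option can be illustrated in the same way. In this sketch, with a made-up two-line string, the pattern matches only once . is allowed to match the newline:

```{r}
library(stringr)

x <- "a\nb"

str_detect(x, "a.b")                        # . does not match the newline by default
str_detect(x, regex("a.b", dotall = TRUE))  # with dotall = TRUE it does
```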


```{r}
# comments = TRUE example

phone <-  regex("
                \\(?    # optional opening parens
                (\\d{3}) # area code
                [)- ]?  # optional closing parens, dash, or space
                (\\d{3}) # another three numbers
                [ -]?    # optional space or dash
                (\\d{4}) # four more numbers
                ", comments = TRUE)

str_match("514-791-8141", phone)

```

There are three other functions you can use instead of regex(): </br>
fixed() matches exactly the specified sequence of bytes. It ignores all special regular expressions and operates at a very low level. This allows you to avoid complex escaping and can be much faster than regular expressions. 

```{r}
# microbenchmark shows that it's about 3x faster for a simple example
# you need to install microbenchmark package

microbenchmark::microbenchmark(
  fixed = str_detect(sentences, fixed("the")),
  regex = str_detect(sentences, "the"),
  times = 20
  )
```

fixed() is faster but problematic when there are multiple ways of representing the same character.
Use coll() instead.

```{r}
a1 <-  "\u00e1"
a2 <- "a\u0301"
c(a1,a2)
```

```{r}
# but a1 is not the same as a2

a1 == a2

```


coll() compares strings using standard collation rules. This is useful for case-insensitive matching. Note that coll() takes a locale parameter that controls which rules are used for comparing characters.

```{r}
str_detect(a1, fixed(a2))
str_detect(a1, coll(a2))
```
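
The locale parameter of coll() matters for case-insensitive matching, because languages differ in how they pair upper and lower case letters. The sketch below uses the Turkish dotted and dotless i, written with Unicode escapes so that the code does not depend on the file encoding:

```{r}
library(stringr)

# "I", dotted capital I (U+0130), "i", dotless small i (U+0131)
i <- c("I", "\u0130", "i", "\u0131")

# collation rules of the default locale
str_subset(i, coll("i", ignore_case = TRUE))

# Turkish rules pair "i" with the dotted capital instead
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
```

The first result assumes the default locale is not Turkish.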

Both fixed() and regex() have an ignore_case argument, but they do not let you pick the locale: they always use the default locale. Only coll() lets you choose the locale; its downside is speed. You can also use boundary() to match boundaries between characters, words, lines, or sentences.

```{r}
x <-  "this is a sentence."
str_view_all(x, boundary("word"))

```
```{r}
str_extract_all(x, boundary("word"))

```


How would you find all strings containing \ with regex() vs. with fixed()?

```{r}

str_subset(c("a\\b", "ab"), "\\\\")
#> [1] "a\\b"
str_subset(c("a\\b", "ab"), fixed("\\"))
#> [1] "a\\b"

```


What are the five most common words in sentences?

```{r}
library(tidyverse)
str_extract_all(sentences, boundary("word")) %>%
  unlist() %>%
  str_to_lower() %>%
  tibble() %>%
  set_names("word") %>%
  group_by(word) %>%
  count(sort = TRUE) %>%
  head(5)

```

<h3> Other uses of regular expressions </h3>

Two other useful functions in base R that also use regular expressions: </br>
apropos() searches all objects available from the global environment. This is useful if you can't quite remember the name of the function:

```{r}
apropos("replace")

```
dir() lists all the files in a directory. The pattern argument takes a regular expression and only returns filenames that match the pattern:

```{r}
head(dir(pattern = "\\.Rmd$"))
```

<h2> Stringi </h2>

stringr is built on top of the stringi package. stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks. </p>

stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need: stringi has 234 functions to stringr's 42.
</p>

Find the stringi functions that:

Count the number of words: stri_count_words() </br>
Find duplicated strings: stri_duplicated() </br>
Generate random text: there are several functions beginning with stri_rand_. </br>
stri_rand_lipsum() generates lorem ipsum text, </br>
stri_rand_strings() generates random strings, </br>
stri_rand_shuffle() randomly shuffles the code points in the text.</br>

How do you control the language that stri_sort() uses for sorting?</br>
Use the locale argument of stri_opts_collator(), passed through the opts_collator argument of stri_sort().
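
A short sketch of this, assuming the stringi package is installed. The two words and locales are chosen because Slovak treats "ch" as a separate letter that sorts after "h", while Polish does not:

```{r}
library(stringi)

x <- c("hladny", "chladny")

# Polish collation: "c" sorts before "h"
stri_sort(x, opts_collator = stri_opts_collator(locale = "pl_PL"))

# Slovak collation: "ch" is a separate letter that sorts after "h"
stri_sort(x, opts_collator = stri_opts_collator(locale = "sk_SK"))
```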