Chapter 14 Strings - R for Data Science

Thi

9/20/2019

knitr::opts_chunk$set(echo = TRUE, results = "asis", cache = TRUE)

library(stringr)
library(tidyverse)

## -- Attaching packages ---------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v readr   1.3.1
## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   0.8.3     v dplyr   0.8.3
## v ggplot2 3.2.1     v forcats 0.4.0

## -- Conflicts ------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(stringi)

14.2.5 Q1 In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

paste('today','is')

[1] “today is”

paste0('today','is')

[1] “todayis”

str_c('today','is')

[1] “todayis”

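paste0() is similar to str_c(): both join their arguments with no separator by default, whereas paste() puts a space between them.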

paste('today','is',NA)

[1] “today is NA”

paste0('today','is',NA)

[1] “todayisNA”

str_c('today','is', NA)

[1] NA

paste() and paste0() both convert NA to the string "NA" and concatenate it like any other value.

str_c() propagates NA: if any input is missing, the result is NA.
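When the paste()-style behaviour is wanted from str_c(), one option is to convert the missing values first with str_replace_na(). A minimal sketch; the vector x and the result names are made up for illustration:

```r
library(stringr)

x <- c("today", NA, "is")

# str_c() propagates the missing value into the result ...
strict <- str_c("<", x, ">")

# ... while str_replace_na() first turns NA into the literal string "NA"
lenient <- str_c("<", str_replace_na(x), ">")

strict
lenient
```

Here strict keeps an NA in the second position, while lenient contains "<NA>" as ordinary text.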

14.2.5 Q2 In your own words, describe the difference between the sep and collapse arguments to str_c().

str_c("today", "is", sep = ", ")

[1] “today, is”

str_c("today", "is", collapse = ",")

[1] “todayis”

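sep is the string inserted between the arguments when they are pasted together element by element. In the second call collapse has nothing to do, because the result is already a single string.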

today <- c ('today','is', "a")
str_c(today, collapse = "|")

[1] “today|is|a”

collapse does not paste separate arguments together. It joins the elements of a single character vector into one string, inserting the given string between them.
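The two arguments can also be combined, which makes the difference easier to see. A small sketch with made-up vectors first and second:

```r
library(stringr)

first <- c("a", "b")
second <- c("x", "y")

# sep works across arguments: one result per element
pairs <- str_c(first, second, sep = "-")

# collapse then joins those results into a single string
joined <- str_c(first, second, sep = "-", collapse = "|")

pairs   # "a-x" "b-y"
joined  # "a-x|b-y"
```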

14.2.5 Q3 Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

round(str_length('today')/2,0)+1

[1] 3

str_sub('today',round(str_length('today')/2,0)+1,round(str_length('today')/2,0)+1)

[1] “d”

str_sub('dean',round(str_length('dean')/2,0)+1,round(str_length('dean')/2,0)+1)

[1] “a”

If the string has an even number of characters, take the right-hand one of the two middle characters. For example, “dean” gives “a”.

round(12.5, digits = 0)

[1] 12

round(13.5 ,digits=0)

[1] 14

round(14.5, digits = 0)

[1] 14

On a side note, round() rounds a value ending in .5 to the nearest even number. It is therefore not a good choice for calculating grades or business transactions. See floor(), ceiling(), trunc() and signif() for other options.
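The half-to-even rounding also affects the middle-character index: for a three-letter string, round(3/2, 0) + 1 is 3, which is the last character rather than the middle one. A possible way around this is integer division. The helper name middle_char() is hypothetical:

```r
library(stringr)

# Hypothetical helper: position n %/% 2 + 1 is the exact middle for odd
# lengths and the right-hand middle character for even lengths.
middle_char <- function(x) {
  mid <- str_length(x) %/% 2 + 1
  str_sub(x, mid, mid)
}

middle_char(c("today", "dean", "abc"))  # "d" "a" "b"
```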

14.2.5 Q4 What does str_wrap() do? When might you want to use it?

str_wrap("Cincoro, a new high-end Tequila helmed by a team of NBA owners—including Michael Jordan—and finance professionals, has launched in 12 markets across the U.S. The new entry is taking aim at the luxury market in the on- and off-premise with packaging designed by Nike’s Mark Smith and liquid selected by the brand’s founders in consultation with industry experts.", width = 50, indent = 20, exdent = 20)

[1] " Cincoro, a new high-end Tequila helmed by a teamof NBA owners—including Michael Jordan—and financeprofessionals, has launched in 12 markets acrossthe U.S. The new entry is taking aim at the luxurymarket in the on- and off-premise with packagingdesigned by Nike’s Mark Smith and liquid selectedby the brand’s founders in consultation withindustry experts."

cat(str_wrap("Cincoro, a new high-end Tequila helmed by a team of NBA owners—including Michael Jordan—and finance professionals, has launched in 12 markets across the U.S. The new entry is taking aim at the luxury market in the on- and off-premise with packaging designed by Nike’s Mark Smith and liquid selected by the brand’s founders in consultation with industry experts.", width = 50, indent = 20, exdent = 20), "\n")

                Cincoro, a new high-end Tequila helmed by a team
                of NBA owners—including Michael Jordan—and finance
                professionals, has launched in 12 markets across
                the U.S. The new entry is taking aim at the luxury
                market in the on- and off-premise with packaging
                designed by Nike’s Mark Smith and liquid selected
                by the brand’s founders in consultation with
                industry experts.

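str_wrap() reflows a long piece of text into lines of roughly width characters by inserting line breaks; indent and exdent set the indentation of the first line and of the following lines. It is useful when printing long text to the console or fitting it into a label.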

14.2.5 Q5 What does str_trim() do? What’s the opposite of str_trim()?

str_trim(" today is a  ", side = c("both"))

[1] “today is a”

str_trim(" today is a  ", side = c("left"))

[1] "today is a  "

str_trim(" today is a  ", side = c("right"))

[1] " today is a"

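str_trim() removes whitespace from the start and/or end of a string. The opposite is str_pad(), which adds whitespace (or another padding character) until the string reaches a given width.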

str_pad("today is", width =50, side = c("both"))

[1] "                     today is                     "

14.2.5 Q6 Write a function that turns (e.g.) a vector c(“a”, “b”, “c”) into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.

# Join a character vector into a phrase such as "a, b, and c"
vec_to_str <- function(x) {
    n <- length(x)
    if (n == 0) {
        ""
    } else if (n == 1) {
        x
    } else if (n == 2) {
        str_c(x, collapse = " and ")
    } else {
        # Comma-separate all but the last element, then append ", and <last>"
        str_c(str_c(x[-n], collapse = ", "), ", and ", x[n])
    }
}
a <- c("a","g", "d")
a

[1] “a” “g” “d”

vec_to_str(a)

[1] “a, g, and d”

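14.3.1 Q2 How would you match the sequence "'\ ?

x <- c("abc", "a.c", "\"\'\\\\")
x

[1] "abc" "a.c" "\"'\\\\"

str_view(x,"\"\'\\\\")

The regular expression is "'\\ because the backslash has to be escaped. Written as an R string, the double quote and each backslash need escaping once more, which gives "\"'\\\\". Note that the test string above contains two backslashes, so the pattern matches the first three of its four characters.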

14.3.1 Q3 What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

"\\..\\..\\.."

[1] “\..\..\..”

str_view(".a.b.c", "\\..\\..\\..")

It matches a literal dot followed by any single character, three times in a row. For example .a.b.c or .Q.W.E
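A quick check with str_detect() shows which candidates the pattern accepts. The candidate strings are made up for illustration:

```r
library(stringr)

pattern <- "\\..\\..\\.."

# Each "\\." is a literal dot and each bare "." is any single character,
# so a match needs dot-char-dot-char-dot-char somewhere in the string.
candidates <- c(".a.b.c", ".ah.bjk.", "x.a.b.cy")
matched <- str_detect(candidates, pattern)

matched  # TRUE FALSE TRUE
```

The second string fails because two characters sit between its dots.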

14.3.2 Q1 How would you match the literal string “$^$”

x <- c("abc", "a.c", "$^$")
x

[1] “abc” “a.c” “$^$”

str_view(x,"\\$\\^\\$")

14.3.2 Q2 Given the corpus of common words in stringr::words , create regular expressions that find all words that:

Start with “y”.

str_view(words,"^y",match = T)

End with “x”

str_view(words,"x$", match = T)

Are exactly three letters long.

str_view(words,"^...$", match = T)

Have seven letters or more.

str_view(words, ".......", match = T)

14.3.3.1 Q1 Create regular expressions to find all words that:

Start with a vowel.

str_view(words,"^[aeiou]", match = T)

That only contain consonants. (Hint: thinking about matching “not”-vowels.)

str_view(words,"^[^aeiou]+$", match = T)

End with ed, but not with eed.

str_view(words,"(^ed$|[^e]ed$)", match = T)

End with ing or ise.

str_view(words,"(ing$|ise$)",match = T)
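Inside square brackets a comma is not a separator; it is simply one more character in the class. A small sketch on a made-up vector shows the difference between a class written with commas and one written without:

```r
library(stringr)

x <- c("apple", ",comma", "tree")

# With commas in the class, a leading "," counts as a match
with_commas <- str_detect(x, "^[e,u,i,o,a,]")

# Listing only the vowels avoids that
vowels_only <- str_detect(x, "^[aeiou]")

with_commas  # TRUE TRUE FALSE
vowels_only  # TRUE FALSE FALSE
```

On stringr::words both versions return the same words, since none of them contains a comma, but the second form says what is meant.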

14.3.3.1 Q2 Empirically verify the rule “i before e except after c”.

str_view(words, "(ie|cei)", match = T)
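To test the rule rather than just list candidates, one possible approach is to count words that follow it and words that break it. The two patterns and the toy vector below are assumptions made for this sketch:

```r
library(stringr)

follows_rule <- "cei|[^c]ie"  # "ei" after c, or "ie" after anything else
breaks_rule  <- "cie|[^c]ei"  # "ie" after c, or "ei" after anything else

toy <- c("believe", "receive", "science", "weird")
toy_follows <- str_detect(toy, follows_rule)
toy_breaks  <- str_detect(toy, breaks_rule)

toy_follows  # TRUE TRUE FALSE FALSE
toy_breaks   # FALSE FALSE TRUE TRUE

# The same two counts on the stringr::words corpus
c(follows = sum(str_detect(words, follows_rule)),
  breaks  = sum(str_detect(words, breaks_rule)))
```

If the first count on the corpus is clearly larger than the second, the rule holds for most, but not necessarily all, common words.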

14.3.3.1 Q3 Is “q” always followed by a “u”?

str_view(words,"q[^u]", match = T)

14.3.3.1 Q4 Write a regular expression that matches a word if it’s probably written in British English, not American English.

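str_view(words,"our$|ise$|yse$", match = T)

This is only a heuristic. It picks up British spellings such as colour, but it also flags any ordinary word that happens to end in our, ise or yse.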

14.3.3 Q5 Create a regular expression that will match telephone numbers as commonly written in your country.

phone <- c("1 219 733 8965", "1-329-293-8753 ", "banana", "595 794 7569",
  "387 287 6718", "apple", "1.233.398.9187  ", "482 952 3315",
  "239 923 8115 and 842 566 4692", "Work: 579-499-7527", "$1000",
  "Home: 543.355.3679")
str_view(phone,"1.[0-9][0-9][0-9].[0-9][0-9][0-9].[0-9][0-9][0-9][0-9]", match = T)
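With quantifiers the same idea can be written more compactly, and the leading 1 can be made optional. This is a sketch under assumptions: phone_pattern and samples are made-up names, and the separators are restricted to a space, dot or hyphen.

```r
library(stringr)

# Optional "1" plus separator, then groups of 3, 3 and 4 digits
phone_pattern <- "(1[ .-])?\\d{3}[ .-]\\d{3}[ .-]\\d{4}"

samples <- c("1 219 733 8965", "Work: 579-499-7527",
             "Home: 543.355.3679", "banana", "$1000")

found <- str_detect(samples, phone_pattern)
numbers_only <- str_extract(samples, phone_pattern)

found         # TRUE TRUE TRUE FALSE FALSE
numbers_only  # "1 219 733 8965" "579-499-7527" "543.355.3679" NA NA
```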

14.3.4 Q2 Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

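^.*$ : matches any string (any sequence of characters up to a line break).

"\\{.+\\}" : matches any string that contains at least one character inside curly braces, such as {a}.

\d{4}-\d{2}-\d{2} : matches four digits, a hyphen, two digits, a hyphen and two digits, such as 2019-09-20.

"\\\\{4}" : matches four backslashes in a row.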

14.3.4 Q3 Create regular expressions to find all words that:

Start with three consonants.

str_view(words,"^[^aeiou]{3}", match = T)

Have three or more vowels in a row.

str_view(words,"[aeiou]{3,}", match = T)

Have two or more vowel-consonant pairs in a row.

str_view(words,"([aeiou][^aeiou]){2,}", match = T)
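It is easy to swap the vowel class and its negation by accident, so a check on a few known words helps. The toy vector and result names are made up for this sketch:

```r
library(stringr)

toy <- c("street", "beautiful", "banana")

# [aeiou] is a vowel, [^aeiou] is anything that is not a vowel
three_consonants <- str_detect(toy, "^[^aeiou]{3}")
three_vowels     <- str_detect(toy, "[aeiou]{3,}")
vc_pairs         <- str_detect(toy, "([aeiou][^aeiou]){2,}")

three_consonants  # TRUE FALSE FALSE  ("str")
three_vowels      # FALSE TRUE FALSE  ("eau")
vc_pairs          # FALSE TRUE TRUE   ("utiful", "anan")
```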

14.3.5 Q1 Describe, in words, what these expressions will match:

(.)\1\1 : the same character three times in a row: “aaa”, “bbb”, “ppp”, “rrr”

"(.)(.)\\2\\1" : two characters followed by the same two characters in reverse order: “noon”, “appa”, “lool”, “tyyt”

(..)\1 : a pair of characters repeated: “emem”, “anan”, “popo”

"(.).\\1.\\1" : a character, any character, the first character again, any character, the first character again: “ataya”, “erehe”, “dedad”

"(.)(.)(.).*\\3\\2\\1" : three characters, then any number of characters, then the same three characters in reverse order. For example:

str_view(c("qweeeeeeewq","rtyuuytr"),"(.)(.)(.).*\\3\\2\\1", match = T)

14.3.5 Q2 Construct regular expressions to match words that:

Start and end with the same character.

str_view(words,"^(.).*\\1$", match=T)

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

str_view(words,"(.)(.).*\\1\\2", match=T)

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_view(words,"(.).*\\1.*\\1", match=T)

14.4.1 Q1 For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

Find all words that start or end with x.

str_view(words,"(^x|x$)", match = T)

start_x <- str_detect(words,"^x")
end_x <- str_detect(words, "x$")
words[end_x|start_x]

[1] “box” “sex” “six” “tax”

Find all words that start with a vowel and end with a consonant.

start_vowel <- str_detect(words,"^[aeiou]")
end_consonant <- str_detect(words,"[^aeiou]$")
words[start_vowel & end_consonant]

[1] “about” “accept” “account” “across” “act”
[6] “actual” “add” “address” “admit” “affect”
[11] “afford” “after” “afternoon” “again” “against”
[16] “agent” “air” “all” “allow” “almost”
[21] “along” “already” “alright” “although” “always”
[26] “amount” “and” “another” “answer” “any”
[31] “apart” “apparent” “appear” “apply” “appoint”
[36] “approach” “arm” “around” “art” “as”
[41] “ask” “at” “attend” “authority” “away”
[46] “awful” “each” “early” “east” “easy”
[51] “eat” “economy” “effect” “egg” “eight”
[56] “either” “elect” “electric” “eleven” “employ”
[61] “end” “english” “enjoy” “enough” “enter”
[66] “environment” “equal” “especial” “even” “evening”
[71] “ever” “every” “exact” “except” “exist”
[76] “expect” “explain” “express” “identify” “if”
[81] “important” “in” “indeed” “individual” “industry”
[86] “inform” “instead” “interest” “invest” “it”
[91] “item” “obvious” “occasion” “odd” “of”
[96] “off” “offer” “often” “okay” “old”
[101] “on” “only” “open” “opportunity” “or”
[106] “order” “original” “other” “ought” “out”
[111] “over” “own” “under” “understand” “union”
[116] “unit” “university” “unless” “until” “up”
[121] “upon” “usual”

Are there any words that contain at least one of each different vowel?

A = str_detect(words, "a")
E = str_detect(words, "e")
O = str_detect(words, "o")
U = str_detect(words, "u")
I = str_detect(words, "i")
words[A&E&O&U&I]

character(0)

14.4.1 Q2 What word has the highest number of vowels?

vowels <- str_count(words, "[aeiou]")
max(vowels)

[1] 5

# Show every word that reaches the maximum vowel count
words[vowels == max(vowels)]

What word has the highest proportion of vowels?

count <- str_count(words, "[aeiou]")
length <- str_length(words)
df <- tibble(words = words, count = count, length = length)
df %>%
  mutate(proportion = count/length) %>%
  filter(proportion == max(proportion))

# A tibble: 1 x 4
  words count length proportion
1 a         1      1          1

14.4.2 Q1 In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
# Wrap the alternation in word boundaries so that "red" inside "flickered" no longer matches
colour_match <- str_c("\\b(", str_c(colours, collapse = "|"), ")\\b")
colour_match

has_colour <- str_subset(sentences, colour_match)
has_colour[1:10]

[1] “Glue the sheet to the dark blue background.”
[2] “Two blue fish swam in the tank.”
[3] “A wisp of cloud hung in the blue air.”
[4] “Leaves turn brown and yellow in the fall.”
[5] “The spot on the blotter was made by green ink.”
[6] “The sofa cushion is red and of light weight.”
[7] “The sky that morning was clear and bright blue.”
[8] “A blue crane is a tall wading bird.”
[9] “It is hard to erase blue or red ink.”
[10] “The lamp shone with a steady green flame.”

matches <- str_extract(has_colour, colour_match)
matches

[1] “blue” “blue” “blue” “yellow” “green” “red” “blue”
[8] “blue” “blue” “green” “red” “red” “red” “green”
[15] “green” “purple” “green” “red” “red” “blue” “blue”
[22] “red” “green” “green” “green” “yellow” “orange” “red”
[29] “red”

more <- sentences[str_count(sentences, colour_match) > 1]
more

[1] “It is hard to erase blue or red ink.”
[2] “The sky in the west is tinged with orange red.”

14.4.2 Q2 From the Harvard sentences data, extract:

The first word from each sentence.

str_extract(sentences, "[A-Za-z']+")[1:10]

[1] “The” “Glue” “It’s” “These” “Rice” “The” “The” “The”
[9] “Four” “Large”

All words ending in ing.

end_w_ing <- str_subset(sentences,"([^ ]+ing[ .])|([^ ]+ing[ ])")
str_extract(end_w_ing, "([^ ]+ing[ .])|([^ ]+ing)")

[1] “spring.” “evening.” “morning.” “winding” “living.”
[6] “king” “Adding” “making” “raging” “playing”
[11] “sleeping” “ring.” “glaring” “sinking.” “dying”
[16] “Bring” “lodging” “filing” “making” “morning”
[21] “wearing” “Bring” “wading” “swing” “nothing.”
[26] “ring” “morning” “sing” “sleeping” “painting.”
[31] “walking” “bring” “bring” “shipping.” “spring”
[36] “ring” “winding” “puzzling” “spring” “landing.”
[41] “thing” “waiting” “whistling” “nothing.” “timing”
[46] “thing” “spring” “changing.” “drenching” “moving”
[51] “working” “ring”

All plurals.

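# Heuristic: any word of four or more letters that ends in "s", so non-plurals such as "across" also match
end_w_s <- str_subset(sentences,"\\b([A-Za-z]{3,}s\\b)")
str_extract(end_w_s,"\\b([A-Za-z]{3,}s\\b)")[1:20]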

[1] “planks” “days” “bowls” “lemons” “hogs”
[6] “hours” “stockings” “helps” “fires” “across”
[11] “bonds” “Press” “useless” “kittens” “days”
[16] “Sickness” “grass” “books” “keeps” “leads”

14.4.3 Q1 Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.

numbers <- c("one","two", "three", "four", "five", "six", "seven", "eight", "nine", "ten")

# Build the alternation from the vector instead of typing the words a second time
numbers_and_word <- str_c("\\b(", str_c(numbers, collapse = "|"), ")\\b ([^ ]+)")
numbers_and_word

sentences_w_number <- str_subset(sentences,numbers_and_word)
str_extract(sentences_w_number, numbers_and_word)

[1] “seven books” “two met” “two factors” “three lists”
[5] “seven is” “two when” “ten inches.” “one war”
[9] “one button” “six minutes.” “ten years” “two shares”
[13] “two distinct” “five cents” “two pins” “five robins.”
[17] “four kinds” “three story” “three inches” “six comes”
[21] “three batches” “two leaves.”
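To pull out the number and the following word as separate pieces, str_match() returns one column per capture group. A sketch on made-up sentences; the pattern is shortened to three numbers and excludes a trailing full stop from the word, which is a small change from the pattern used above:

```r
library(stringr)

number_word <- "\\b(one|two|three)\\b ([^ .]+)"

toy <- c("He bought two apples.", "Nothing here.", "The stone wall")

# Column 1 is the whole match, columns 2 and 3 are the capture groups
pieces <- str_match(toy, number_word)
pieces
```

The first row holds "two apples", "two" and "apples". The other rows are NA, including "The stone wall", because \\b stops "one" from matching inside "stone".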

14.4.3 Q2 Find all contractions. Separate out the pieces before and after the apostrophe.

contraction <- str_extract(str_subset(sentences,"([^ ]+[\\'].)"),"([^ ]+[\\'].)")
str_split(contraction,"\\'")

[[1]] [1] “It” “s”

[[2]] [1] “man” “s”

[[3]] [1] “don” “t”

[[4]] [1] “store” “s”

[[5]] [1] “workmen” “s”

[[6]] [1] “Let” “s”

[[7]] [1] “sun” “s”

[[8]] [1] “child” “s”

[[9]] [1] “king” “s”

[[10]] [1] “It” “s”

[[11]] [1] “don” “t”

[[12]] [1] “queen” “s”

[[13]] [1] “don” “t”

[[14]] [1] “pirate” “s”

[[15]] [1] “neighbor” “s”

14.4.4 Q1 Replace all forward slashes in a string with backslashes.

sentence <- "Hello/Hello/Hello"
writeLines(sentence)

Hello/Hello/Hello

writeLines(str_replace_all(sentence,"/","\\\\"))

Hello\Hello\Hello

14.4.4 Q2 Implement a simple version of str_to_lower() using replace_all().

test <- "THIS IS A TEST"
str_to_lower(test)

[1] “this is a test”

str_replace_all(test,c("T" = "t", "H" = "h", "I" = "i", "S" = "s", "A" = "a", "E" = "e"))

[1] “this is a test”
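The replacement vector can be generated for the whole alphabet instead of being typed out letter by letter. A minimal sketch; lower_map and simple_to_lower() are hypothetical names, and only the 26 ASCII letters are handled:

```r
library(stringr)

# Named vector: names are the patterns (A to Z), values the replacements (a to z)
lower_map <- setNames(letters, LETTERS)

# Hypothetical helper built on str_replace_all()
simple_to_lower <- function(x) {
  str_replace_all(x, lower_map)
}

simple_to_lower("THIS IS A TEST")  # "this is a test"
```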

14.4.4 Q3 Switch the first and last letters in words.

str_replace(words[1:20], "^(.)(.*)(.)$", "\\3\\2\\1")

[1] “a” “ebla” “tboua” “ebsoluta” “tccepa”
[6] “tccouna” “echieva” “scrosa” “tca” “ectiva”
[11] “lctuaa” “dda” “sddresa” “tdmia” “edvertisa”
[16] “tffeca” “dffora” “rftea” “nfternooa” “ngaia”

Which of those strings are still words? Words whose first and last letters are the same are unchanged by the swap, so they are certainly still words. For example:

first_is_last <- str_extract(words,"^.") == str_extract(words,".$")
words[first_is_last]

[1] “a” “america” “area” “dad” “dead”
[6] “depend” “educate” “else” “encourage” “engine”
[11] “europe” “evidence” “example” “excuse” “exercise”
[16] “expense” “experience” “eye” “health” “high”
[21] “knock” “level” “local” “nation” “non”
[26] “rather” “refer” “remember” “serious” “stairs”
[31] “test” “tonight” “transport” “treat” “trust”
[36] “window” “yesterday”
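Comparing first and last letters only finds words that stay the same. A swapped word can also turn into a different word, and one possible way to catch those is to intersect the swapped strings with the original vector. Sketched here on a made-up vector rather than the full corpus:

```r
library(stringr)

toy <- c("a", "dad", "ten", "net", "able")

# Swap first and last letters with three capture groups and backreferences
swapped <- str_replace(toy, "^(.)(.*)(.)$", "\\3\\2\\1")

# Keep only the swapped strings that also occur in the original vector
still_words <- intersect(swapped, toy)

still_words  # "a" "dad" "net" "ten"
```

Replacing toy with words applies the same check to the whole corpus.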

14.4.5 Q1 Split up a string like “apples, pears, and bananas” into individual components.

str_split("apples, pears, and bananas", boundary("word"))

[[1]] [1] “apples” “pears” “and” “bananas”

Why is it better to split up by boundary(“word”) than " "?

str_split("apples, pears, and bananas", " ")

[[1]] [1] “apples,” “pears,” “and” “bananas”

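Splitting on a space leaves the punctuation attached to the words (“apples,” instead of “apples”), whereas boundary("word") returns only the words themselves.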

14.4.5 Q3 What does splitting with an empty string ("") do? Experiment, and then read the documentation.

str_split("apples, pears, and bananas", "")

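[[1]] [1] “a” “p” “p” “l” “e” “s” “,” " " “p” “e” “a” “r” “s” “,” " " “a” “n”
[18] “d” " " “b” “a” “n” “a” “n” “a” “s”

Splitting with an empty string breaks the string into individual characters. The documentation notes that "" is equivalent to boundary("character").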

14.5.1 Q1 How would you find all strings containing \ with regex() vs. with fixed()?

string <- c("contains \\")
str_subset(string, regex("\\\\"))

[1] “contains \”

str_subset(string, fixed("\\"))

[1] “contains \”

14.5.1 Q2 What are the five most common words in sentences?

split_sentences <- unlist(str_split(sentences, boundary("word")))
tibble(words = str_to_lower(split_sentences)) %>%
  count(words, sort = TRUE) %>%
  head(n = 5)

# A tibble: 5 x 2
  words     n
1 the     751
2 a       202
3 of      132
4 to      123
5 and     118

14.7.1 Q1 Find the stringi functions that:

Count the number of words.

countthis <- "Count the number of words"
stri_count_words(countthis)

[1] 5

Find duplicated strings.

a <- c("abs", "acs","abs", "bgf")
stri_duplicated(a)

[1] FALSE FALSE TRUE FALSE

Generate random text.

stri_rand_strings(10,length = 5)

[1] “gNvyF” “O7FZw” “V1CzG” “stoXi” “nvbuo” “Qp4M6” “Cf7B6” “uEOSl”
[9] “Up4hf” “nDVaU”

14.7.1 Q2 How do you control the language that stri_sort() uses for sorting?

stri_sort(c("A","R","B"))

[1] “A” “B” “R”

stri_sort(c("A","R","B"),decreasing = T)

[1] “R” “B” “A”

stri_sort(sample(LETTERS))

[1] “A” “B” “C” “D” “E” “F” “G” “H” “I” “J” “K” “L” “M” “N” “O” “P” “Q”
[18] “R” “S” “T” “U” “V” “W” “X” “Y” “Z”

stri_sort(sample(LETTERS), decreasing = T)

[1] “Z” “Y” “X” “W” “V” “U” “T” “S” “R” “Q” “P” “O” “N” “M” “L” “K” “J”
[18] “I” “H” “G” “F” “E” “D” “C” “B” “A”
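The calls above only change the direction of the sort. The language is set through the collator: stri_sort() passes extra arguments on to stri_opts_collator(), so a locale can be supplied directly. A short sketch with a made-up vector; the ordering for a given locale depends on the ICU data installed with stringi, so only the English result is stated:

```r
library(stringi)

x <- c("apple", "eggplant", "banana")

# English collation
english <- stri_sort(x, locale = "en")

# Hawaiian collation; the alphabet order differs, so the result may too
hawaiian <- stri_sort(x, locale = "haw")

english  # "apple" "banana" "eggplant"
hawaiian
```

The same setting can be given explicitly as opts_collator = stri_opts_collator(locale = "haw").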