In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?
paste("rabbit", "rabbit")
## [1] "rabbit rabbit"
paste0("rabbit", "rabbit")
## [1] "rabbitrabbit"
paste() joins strings with a space between them by default, while paste0() joins them with no separator. paste0() is therefore equivalent to str_c(), and paste() is equivalent to str_c() with sep = " ".
str_c("rabbit", NA)
## [1] NA
paste("rabbit", NA)
## [1] "rabbit NA"
paste0("rabbit", NA)
## [1] "rabbitNA"
The functions differ in their handling of NA, as shown above. str_c() treats NA as a missing value and propagates it, so combining NA with “rabbit” returns NA. paste() and paste0() coerce NA to the literal string "NA" and combine it like any other value.
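The equivalences and the NA behaviour can be checked directly. This is a small sketch: the vector x is made up for illustration, and str_replace_na() is the stringr helper that turns NA into the string "NA".

```r
library(stringr)

# paste0() corresponds to str_c(); paste() corresponds to str_c(sep = " ")
same_as_paste0 <- identical(paste0("rabbit", "rabbit"), str_c("rabbit", "rabbit"))
same_as_paste  <- identical(paste("rabbit", "rabbit"), str_c("rabbit", "rabbit", sep = " "))

# Illustrative input with a missing value
x <- c("rabbit", NA)

# str_c() propagates the missing value
with_na <- str_c("a ", x)

# Converting NA to "NA" first reproduces the paste0() result
without_na <- str_c("a ", str_replace_na(x))
```

Here with_na keeps an NA in its second position, while without_na contains "a NA", the same as paste0("a ", x).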
In your own words, describe the difference between the sep and collapse arguments to str_c().
sep is the string inserted between the arguments when they are combined element by element, so the result is still a vector. collapse is the string used to join the elements of that resulting vector into one single string.
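A short example may make the distinction concrete. The vector letters3 is a made-up input.

```r
library(stringr)

letters3 <- c("a", "b", "c")

# sep goes between the arguments; the result is still a vector of length 3
by_sep <- str_c("x", letters3, sep = "-")
# "x-a" "x-b" "x-c"

# collapse joins the elements of one vector into a single string
by_collapse <- str_c(letters3, collapse = ", ")
# "a, b, c"

# Both together: combine element-wise first, then collapse the result
both <- str_c("x", letters3, sep = "-", collapse = ", ")
# "x-a, x-b, x-c"
```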
Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?
x <- c("abc")
y <- c("abcd")
str_sub(x, ((str_length(x)+1) / 2), ((str_length(x)+1) / 2))
## [1] "b"
# For an even length, take the two characters on either side of the midpoint
str_sub(y, str_length(y) / 2, str_length(y) / 2 + 1)
## [1] "bc"
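The two cases can be folded into one helper. This is only a sketch; middle() is a hypothetical name, and returning two characters for even lengths is one possible choice, not the only reasonable one.

```r
library(stringr)

# Hypothetical helper: the middle character of an odd-length string,
# or the two middle characters of an even-length string
middle <- function(s) {
  n <- str_length(s)
  # ceiling() and floor() give the same index for odd n
  # and two adjacent indices for even n
  str_sub(s, ceiling(n / 2), floor(n / 2) + 1)
}

middle(c("abc", "abcd", "a"))
# "b" "bc" "a"
```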
What does str_wrap() do? When might you want to use it?
x <- "We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America."
str_wrap(x, width = 30, indent = 5, exdent = 0)
## [1] " We the People of the\nUnited States, in Order to\nform a more perfect Union,\nestablish Justice, insure\ndomestic Tranquility, provide\nfor the common defence,\npromote the general Welfare,\nand secure the Blessings\nof Liberty to ourselves and\nour Posterity, do ordain and\nestablish this Constitution\nfor the United States of\nAmerica."
I had to look up the arguments to str_wrap(). It reformats text into lines of a target width, breaking between words, with optional indentation of the first line (indent) and of the following lines (exdent). You might want to use it when long text has to fit a fixed width, such as a plot label or console output, without words being split across lines.
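The embedded \n characters are easier to read once printed. A minimal sketch with a made-up sentence:

```r
library(stringr)

# Made-up input that is longer than the target width
sentence <- "one two three four five six"

# Break into lines of roughly 10 characters, only between words
wrapped <- str_wrap(sentence, width = 10)

# writeLines() renders each "\n" as an actual line break
writeLines(wrapped)
```

Replacing the line breaks with spaces gives back the original sentence, which shows that no word was split.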
What does str_trim() do? What’s the opposite of str_trim()?
x <- " Too many cooks."
str_trim(x)
## [1] "Too many cooks."
str_trim() removes whitespace from the start and end of a string (the side argument restricts it to one end). The opposite of str_trim() is str_pad(), which adds whitespace until the string reaches a given width.
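A few variations, as a sketch with a made-up string. str_squish() is a related stringr function that also collapses repeated internal whitespace.

```r
library(stringr)

# Made-up input with extra whitespace at both ends and in the middle
messy <- "  Too   many cooks.  "

# Trim one side only
left_trimmed <- str_trim(messy, side = "left")
# "Too   many cooks.  "

# Trim both ends and collapse internal runs of whitespace
squished <- str_squish(messy)
# "Too many cooks."

# Pad a short string on both sides up to a width of 9
padded <- str_pad("cooks", width = 9, side = "both")
# "  cooks  "
```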
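Explain why each of these strings don’t match a \: "\", "\\", "\\\".
"\" escapes the closing quote, so R never sees the end of the string. "\\" is a string holding a single backslash, which the regex engine reads as an escape for the next character rather than as a literal backslash. "\\\" is a literal backslash followed by an escaped closing quote, so the string is again unterminated. Matching one literal \ takes four backslashes in the string: "\\\\".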
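The two layers of escaping can be checked in the console. A minimal sketch; the test strings are made up.

```r
library(stringr)

# Four backslashes in the source code become two in the string,
# which the regex engine reads as one literal backslash
pattern <- "\\\\"

writeLines(pattern)       # prints the two characters the regex engine receives
n_chars <- str_length(pattern)

# "a\\b" is the three-character string a, backslash, b
has_backslash <- str_detect("a\\b", pattern)
no_backslash  <- str_detect("ab", pattern)
```

n_chars is 2, has_backslash is TRUE and no_backslash is FALSE.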
How would you match the sequence "'\?
x <- "a\"'\\b"
writeLines(x)
## a"'\b
str_view(x, "\\\"'\\\\")
What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
It matches a literal dot followed by any character, three times in a row (for example .a.b.c). As a string it is written as follows:
x_string <- "\\..\\..\\.."
writeLines(x_string)
## \..\..\..
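To confirm what the pattern accepts, it can be run against a few made-up strings (candidates is an illustrative vector):

```r
library(stringr)

x_string <- "\\..\\..\\.."

# Made-up test inputs
candidates <- c(".a.b.c", "abcdef", ".a.b")

# Only the first has three "dot then any character" pairs in a row
matches <- str_detect(candidates, x_string)
# TRUE FALSE FALSE
```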
How would you match the literal string “$^$”?
x <- "match the string $^$ please"
str_view(x, "\\$\\^\\$")
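An alternative that avoids escaping altogether is stringr's fixed(), which compares the pattern as literal text instead of as a regular expression. A short sketch using the same example sentence:

```r
library(stringr)

x <- "match the string $^$ please"

# Escaped regular expression
via_regex <- str_detect(x, "\\$\\^\\$")

# Literal match: no characters are treated as special
via_fixed <- str_detect(x, fixed("$^$"))
```

Both calls return TRUE for this input.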
Given the corpus of common words in stringr::words, create regular expressions that find all words…
Start with “y”.
str_view(stringr::words, "^y", match = TRUE)
End with “x”.
str_view(stringr::words, "x$", match = TRUE)
Are exactly three letters long.
str_view(stringr::words, "^...$", match = TRUE)
Have seven letters or more.
str_view(stringr::words, ".......", match = TRUE)
Create regular expressions to find all words that:
Start with a vowel.
str_view(stringr::words, "^[aeiou]", match = TRUE)
That only contain consonants. (Hint: thinking about matching “not”-vowels.)
str_view(stringr::words, "^[^aeiou]+$", match = TRUE)
End with ed, but not with eed.
str_view(stringr::words, "(^|[^e])ed$", match = TRUE)
End with ing or ise.
str_view(stringr::words, "(ing|ise)$", match = TRUE)
Empirically verify the rule “i before e except after c”.
str_view(stringr::words, "(cei|[^c]ie)", match=TRUE)
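The pattern above lists words that follow the rule. To verify it empirically, it helps to also look for words that break it, meaning "ie" after c, or "ei" after anything other than c. A sketch; the variable names are illustrative.

```r
library(stringr)

# Words consistent with "i before e except after c"
rule_followers <- str_subset(stringr::words, "cei|[^c]ie")

# Words that contradict the rule
rule_breakers <- str_subset(stringr::words, "cie|[^c]ei")

length(rule_followers)
length(rule_breakers)
rule_breakers
```

If rule_breakers is non-empty, the rule is a tendency in this corpus, not a strict law.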
Is “q” always followed by a “u”?
str_view(stringr::words, "q[^u]", match=TRUE)
Write a regular expression that matches a word if it’s probably written in British English, not American English.
str_view(stringr::words, "..our|ise$|yse$", match=TRUE)
Create a regular expression that will match telephone numbers as commonly written in your country.
x <- c("phone number (301) 555-8888")
str_view(x, "\\(\\d\\d\\d\\)\\s\\d\\d\\d-\\d\\d\\d\\d")
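The same pattern can be written more compactly with {n} quantifiers. A sketch using the same example number; phone_pattern is an illustrative name.

```r
library(stringr)

x <- c("phone number (301) 555-8888")

# \\d{3} stands for exactly three digits, \\d{4} for exactly four
phone_pattern <- "\\(\\d{3}\\)\\s\\d{3}-\\d{4}"

extracted <- str_extract(x, phone_pattern)
# "(301) 555-8888"
```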
Describe the equivalents of ?, +, * in {m,n} form.
? = {0,1}
+ = {1,}
* = {0,}
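A quick way to check these equivalences is to compare what each pair extracts from the same made-up inputs:

```r
library(stringr)

# Made-up inputs: no "a", one "a", several "a"s
inputs <- c("", "a", "aaa")

# Each shorthand should extract exactly what its {m,n} form extracts
optional_same    <- identical(str_extract(inputs, "a?"), str_extract(inputs, "a{0,1}"))
one_or_more_same <- identical(str_extract(inputs, "a+"), str_extract(inputs, "a{1,}"))
zero_or_more_same <- identical(str_extract(inputs, "a*"), str_extract(inputs, "a{0,}"))
```

All three comparisons return TRUE.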
Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
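^.*$ matches any string that contains no line breaks.
"\\{.+\\}" matches any string containing a pair of curly braces with at least one character between them (it does not match empty braces).
\d{4}-\d{2}-\d{2} matches four digits, a hyphen, two digits, a hyphen, and two digits, such as a date written like 2020-01-31.
"\\\\{4}" matches four backslashes in a row.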
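Each description can be tested against a matching and a non-matching input. The test vectors below are made up for illustration.

```r
library(stringr)

# Four digits, hyphen, two digits, hyphen, two digits
dates <- c("2020-01-31", "20-1-31")
date_hits <- str_detect(dates, "\\d{4}-\\d{2}-\\d{2}")
# TRUE FALSE

# Curly braces around at least one character
braces <- c("{abc}", "{}")
brace_hits <- str_detect(braces, "\\{.+\\}")
# TRUE FALSE

# Four literal backslashes: the first input contains four, the second two
slashes <- c("a\\\\\\\\b", "a\\\\b")
slash_hits <- str_detect(slashes, "\\\\{4}")
# TRUE FALSE
```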
Create regular expressions to find all words that:
Start with three consonants.
str_view(stringr::words, "^[^aeoiuy]{3}", match = TRUE)
Have three or more vowels in a row.
str_view(stringr::words, "[aeiou]{3}", match = TRUE)
Have two or more vowel-consonant pairs in a row.
str_view(stringr::words, "([aeiou][^aeiou]){2,}", match = TRUE)
Describe, in words, what these expressions will match:
(.)\1\1 matches the same character three times in a row (e.g. aaa).
"(.)(.)\\2\\1" matches a pair of characters followed by the same pair in reverse order (e.g. abba).
(..)\1 matches any two characters repeated (e.g. abab).
"(.).\\1.\\1" matches a character, any character, the first character again, any character, and the first character once more (e.g. abaca).
"(.)(.)(.).*\\3\\2\\1" matches three characters, then any number of characters, then the same three characters in reverse order (e.g. abcfffffcba).
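These readings can be confirmed by running each pattern, written as an R string, against the example given for it. The inputs are made up.

```r
library(stringr)

# Same character three times in a row
triple <- str_detect(c("aaa", "aab"), "(.)\\1\\1")
# TRUE FALSE

# A pair followed by its mirror image
mirrored <- str_extract("xabbay", "(.)(.)\\2\\1")
# "abba"

# Two characters repeated
repeated_pair <- str_extract("abab!", "(..)\\1")
# "abab"

# First character recurring at positions 3 and 5
alternating <- str_detect("abaca", "(.).\\1.\\1")

# Three characters, anything, then the same three reversed
reversed <- str_detect("abcfffffcba", "(.)(.)(.).*\\3\\2\\1")
```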
Construct regular expressions to match words that:
Start and end with the same character.
str_view(stringr::words, "^(.).*\\1$", match = TRUE)
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
str_view(stringr::words, "(..).*\\1", match = TRUE)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
str_view(stringr::words, "(.).*\\1.*\\1", match = TRUE)
For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.
Find all words that start or end with x.
str_subset(stringr::words, "^x|x$")
## [1] "box" "sex" "six" "tax"
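The exercise also asks for a version built from several str_detect() calls. One possible sketch combines the two conditions with the logical OR operator:

```r
library(stringr)

words <- stringr::words

# One str_detect() call per condition, combined with |
starts_with_x <- str_detect(words, "^x")
ends_with_x   <- str_detect(words, "x$")

x_words <- words[starts_with_x | ends_with_x]
```

This selects the same words as the single regular expression "^x|x$".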
Find all words that start with a vowel and end with a consonant.
str_subset(stringr::words, "^[aeiou].*[^aeiou]$")
The final consonant needs the negated character class [^aeiou]. Inside a group such as (^a|e|i|o|u), the ^ is a start-of-string anchor and not a negation, so that form returns words ending in e, i, o, or u instead.
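With several str_detect() calls, the same question can be split into two simpler tests, one per end of the word. A sketch:

```r
library(stringr)

words <- stringr::words

# Condition 1: first letter is a vowel
starts_with_vowel <- str_detect(words, "^[aeiou]")

# Condition 2: last letter is not a vowel
ends_with_consonant <- str_detect(words, "[^aeiou]$")

vowel_consonant_words <- words[starts_with_vowel & ends_with_consonant]
```

For words of two or more letters this agrees with the single pattern "^[aeiou].*[^aeiou]$", and a one-letter word cannot satisfy both conditions.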
Are there any words that contain at least one of each different vowel?
words <- stringr::words
words[str_detect(words, "a") & str_detect(words, "e") & str_detect(words, "i") & str_detect(words, "o") & str_detect(words, "u")]
## character(0)
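A single regular expression is possible too, using one lookahead per vowel. This is a sketch of one way to do it; all_vowels is an illustrative name, and lookaheads are supported by the ICU regex engine that stringr uses.

```r
library(stringr)

# Each (?=.*v) asserts that vowel v appears somewhere in the word
all_vowels <- "^(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)"

# Sanity check on made-up inputs: "education" has all five vowels, "about" does not
toy_check <- str_detect(c("education", "about"), all_vowels)
# TRUE FALSE

# Applied to the corpus, this gives the same empty result as the str_detect() chain
corpus_hits <- str_subset(stringr::words, all_vowels)
```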
What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
words <- stringr::words
numvowels <- tibble(word = words,
vowels = str_count(words, "[aeiou]")) %>%
arrange(desc(vowels))
numvowels
## # A tibble: 980 x 2
## word vowels
## <chr> <int>
## 1 appropriate 5
## 2 associate 5
## 3 available 5
## 4 colleague 5
## 5 encourage 5
## 6 experience 5
## 7 individual 5
## 8 television 5
## 9 absolute 4
## 10 achieve 4
## # … with 970 more rows
words <- stringr::words
propvowels <- tibble(word = words,
vowel = str_count(words, "[aeiou]")/str_length(words))
propvowels
## # A tibble: 980 x 2
## word vowel
## <chr> <dbl>
## 1 a 1
## 2 able 0.5
## 3 about 0.6
## 4 absolute 0.5
## 5 accept 0.333
## 6 account 0.429
## 7 achieve 0.571
## 8 across 0.333
## 9 act 0.333
## 10 active 0.5
## # … with 970 more rows
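The first table shows eight words tied at five vowels. The second table is still in alphabetical order, so it does not rank the proportions, although its first row already shows “a” at 1, the largest value a proportion can take. A sketch that ranks the words with base R's order() (prop and ranked are illustrative names):

```r
library(stringr)

words <- stringr::words

# Proportion of vowels: vowel count divided by word length
prop <- str_count(words, "[aeiou]") / str_length(words)

# Words ranked from the highest proportion down
ranked <- words[order(prop, decreasing = TRUE)]
head(ranked)

# Every word that reaches the maximum proportion
top_words <- words[prop == max(prop)]
```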