I am practicing working with strings using the online version of the book “R for Data Science.”
library(stringr)
## Warning: package 'stringr' was built under R version 4.0.4
library(stringi)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.4
## v tidyr 1.1.2 v forcats 0.5.1
## v readr 1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?
paste() joins its arguments into a single string with a space between them by default, while paste0() joins them with no separator. paste0() is equivalent to str_c() with its default sep = "", and paste() is equivalent to str_c() with sep = " ". paste() and paste0() convert NA to the string "NA" and concatenate it like any other string, whereas str_c() returns NA whenever any of its arguments is NA.
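The comparison can be checked directly in the console. This is a minimal sketch using only base R and stringr; the input strings are arbitrary.

```r
library(stringr)

# paste() joins with a space by default; paste0() joins with nothing.
paste("a", "b")    # "a b"
paste0("a", "b")   # "ab"

# str_c() defaults to sep = "", like paste0().
str_c("a", "b")              # "ab"
str_c("a", "b", sep = " ")   # "a b"

# Missing values: paste() and paste0() turn NA into the text "NA",
# whereas str_c() propagates the missing value.
paste("a", NA)     # "a NA"
paste0("a", NA)    # "aNA"
str_c("a", NA)     # NA
```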
In your own words, describe the difference between the sep and collapse arguments to str_c().
Both arguments control what is placed between strings when str_c() combines them. sep is inserted between the separate arguments passed to str_c(), element by element, so the result is still a vector. collapse is inserted between the elements of that resulting vector to combine them into a single string.
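A small sketch of the difference, using two made-up vectors x and y:

```r
library(stringr)

x <- c("a", "b", "c")
y <- c("1", "2", "3")

# sep goes between the arguments, element by element: the result has length 3.
str_c(x, y, sep = "-")                   # "a-1" "b-2" "c-3"

# collapse joins the elements of the result into one string.
str_c(x, collapse = "+")                 # "a+b+c"

# Both together: pair up first, then collapse.
str_c(x, y, sep = "-", collapse = "+")   # "a-1+b-2+c-3"
```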
Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?
d <- ceiling(str_length("character") / 2)
str_sub("character", d, d)
## [1] "a"
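If the string has an even number of characters, there are two middle characters. In that case str_length(x)/2 is already a whole number, so ceiling() and floor() give the same position and the code above returns the first of the two. The second one sits at position str_length(x)/2 + 1, so I would either pick one of them or return both with str_sub(x, n/2, n/2 + 1), where n is the length of the string.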
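One way to cover both cases is a small helper that returns one character for odd lengths and two for even lengths. This is only a sketch; middle_chars() is a name I made up, not a stringr function.

```r
library(stringr)

# middle_chars() is a hypothetical helper name, not part of stringr.
middle_chars <- function(x) {
  n <- str_length(x)
  # Odd n: both bounds equal (n + 1) / 2.
  # Even n: the bounds are n / 2 and n / 2 + 1.
  str_sub(x, ceiling(n / 2), floor(n / 2) + 1)
}

middle_chars("character")  # odd length: "a"
middle_chars("string")     # even length: "ri"
```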
What does str_wrap() do? When might you want to use it?
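str_wrap() inserts line breaks into a string so that each line fits within a given width, which formats the text as a paragraph. You would use str_wrap() when a long string has to fit in a limited width, such as a plot label or console output.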
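A quick illustration, assuming an arbitrary sentence and a width of 30 characters:

```r
library(stringr)

text <- "R for Data Science is a book about doing data science with R."

# str_wrap() returns one string with "\n" inserted at the chosen break points.
wrapped <- str_wrap(text, width = 30)

# writeLines() prints the inserted line breaks as real new lines.
writeLines(wrapped)
```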
What does str_trim() do? What’s the opposite of str_trim()?
str_trim() removes whitespace from the start and end of a string. str_pad() is the opposite of str_trim().
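The two functions side by side, as a minimal sketch with a made-up input:

```r
library(stringr)

# str_trim() strips whitespace from both ends by default, or from one side.
str_trim("  abc  ")                  # "abc"
str_trim("  abc  ", side = "left")   # "abc  "

# str_pad() adds whitespace until the string reaches the requested width.
str_pad("abc", width = 5, side = "both")   # " abc "
str_pad("abc", width = 5, side = "left")   # "  abc"
```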
Write a function that turns (e.g.) a vector c(“a”, “b”, “c”) into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.
b <- str_c(c("a", "b", "c"), collapse = ", ")
# Insert "and" before the last element (works for three or more elements).
str_replace(b, ", ([^,]+)$", ", and \\1")
## [1] "a, b, and c"
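The exercise asks for a function that also behaves sensibly for vectors of length 0, 1, or 2. Here is one possible sketch; str_commasep() is a name I chose, and returning an empty string for a zero-length vector is my own assumption about what "sensible" means.

```r
library(stringr)

# str_commasep() is a hypothetical helper name, not part of stringr.
str_commasep <- function(x) {
  n <- length(x)
  if (n == 0) {
    ""                                   # assumption: nothing to join
  } else if (n == 1) {
    x                                    # a single item is returned as is
  } else if (n == 2) {
    str_c(x[1], " and ", x[2])           # no comma for two items
  } else {
    # Join all but the last item with ", ", then append ", and <last>".
    str_c(str_c(x[-n], collapse = ", "), ", and ", x[n])
  }
}

str_commasep(character(0))        # ""
str_commasep("a")                 # "a"
str_commasep(c("a", "b"))         # "a and b"
str_commasep(c("a", "b", "c"))    # "a, b, and c"
```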
Explain why each of these strings don’t match a \: "\", "\\", "\\\".
"\": A single backslash escapes the next character in the R string itself, so no backslash ever reaches the regular expression.
"\\": The two backslashes resolve to a single backslash in the regular expression, where it acts as an escape for the next character instead of matching a literal backslash.
"\\\": The first two backslashes resolve to a single backslash in the regular expression, and the third escapes the next character in the string. The result is still an escape sequence, not a literal backslash. Matching a literal backslash takes four backslashes: "\\\\".
How would you match the sequence "'\ ?
str_view("\"'\\", "\"'\\\\", match = TRUE)
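The two levels of escaping (R string first, regular expression second) are easier to see by printing the strings with writeLines(). A minimal sketch:

```r
library(stringr)

# In R source, "\\" is a string holding ONE backslash character.
writeLines("\\")        # prints: \
str_length("\\")        # 1

# To match that one backslash, the regular expression itself needs \\,
# which has to be typed as "\\\\" in an R string.
writeLines("\\\\")      # prints: \\
str_detect("a\\b", "\\\\")        # TRUE
str_detect("ab", "\\\\")          # FALSE

# fixed() skips regex interpretation, so one string-level escape is enough.
str_detect("a\\b", fixed("\\"))   # TRUE
```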
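What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
The pattern matches a literal period followed by any character, three times in a row, as in “.a.b.c”. As a string it is written "\\..\\..\\..", because each backslash has to be escaped in the string.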
How would you match the literal string "$^$"?
str_view("$^$", "^\\$\\^\\$$")
Given the corpus of common words in stringr::words, create regular expressions that find all words that:
1) Start with “y”.
str_view_all(words, "^y", match = TRUE)
2) End with “x”
str_view_all(words, "x$", match = TRUE)
3) Are exactly three letters long. (Don’t cheat by using str_length()!)
str_view_all(words, "^...$", match = TRUE)
4) Have seven letters or more.
Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.
str_view_all(words, ".......+", match = TRUE)
Create regular expressions to find all words that:
1) Start with a vowel.
str_view_all(words,"^[aeiou]", match = TRUE )
2) That only contain consonants. (Hint: thinking about matching “not”-vowels.)
str_view_all(words,"^[^aeiou]*$", match = TRUE )
3) End with ed, but not with eed.
str_view_all(words,"[^e]ed$", match = TRUE )
4) End with ing or ise.
str_view_all(words,"ing$|ise$", match = TRUE )
Empirically verify the rule “i before e except after c”.
str_view_all(words, "cie", match = TRUE)
Is “q” always followed by a “u”?
In the words dataset the letter q is always followed by the letter u.
str_view_all(words, "q[^u]", match = TRUE)
Write a regular expression that matches a word if it’s probably written in British English, not American English.
“ou|yse$”
The word color in British English is written as colour. The word analyze in British English is written as analyse.
Create a regular expression that will match telephone numbers as commonly written in your country.
str_view("301-123-4567","\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d")
Describe the equivalents of ?, +, * in {m,n} form.
? is equivalent to {0,1}: it matches zero or one occurrence.
+ is equivalent to {1,}: it matches one or more occurrences.
* is equivalent to {0,}: it matches zero or more occurrences.
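The equivalences can be checked by running both forms over the same inputs. A small sketch with made-up test strings:

```r
library(stringr)

x <- c("clr", "color", "colour", "colouur")

# ? versus {0,1}: zero or one "u"
q1 <- str_detect(x, "colou?r")       # FALSE TRUE TRUE FALSE
q2 <- str_detect(x, "colou{0,1}r")

# + versus {1,}: one or more "u"
p1 <- str_detect(x, "colou+r")       # FALSE FALSE TRUE TRUE
p2 <- str_detect(x, "colou{1,}r")

# * versus {0,}: zero or more "u"
s1 <- str_detect(x, "colou*r")       # FALSE TRUE TRUE TRUE
s2 <- str_detect(x, "colou{0,}r")

c(identical(q1, q2), identical(p1, p2), identical(s1, s2))  # TRUE TRUE TRUE
```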
Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
^.*$ matches any single-line string, including the empty string: the start of the string, zero or more characters, then the end.
"\\{.+\\}" is a string defining a regular expression that matches one or more characters surrounded by curly braces.
\d{4}-\d{2}-\d{2} matches four digits, a hyphen, two digits, another hyphen, and two more digits, such as a date written as 2021-04-14.
"\\\\{4}" is a string defining a regular expression that matches four backslashes in a row.
Create regular expressions to find all words that:
1) Start with three consonants.
str_view_all(words, "^[^aeiou]{3}", match = TRUE)
2) Have three or more vowels in a row.
str_view_all(words, "[aeiou]{3,}", match = TRUE)
3) Have two or more vowel-consonant pairs in a row.
str_view_all(words, "([aeiou][^aeiou]){2,}", match = TRUE)
Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.
Describe, in words, what these expressions will match:
(.)\1\1: The same character three times in a row, as in “aaa”.
"(.)(.)\\2\\1": A pair of characters followed by the same pair in reverse order, i.e. a four-character palindrome such as “abba”.
(..)\1: Any two characters repeated immediately, as in “abab”.
"(.).\\1.\\1": A character that appears three times, with each occurrence separated from the next by exactly one other character, as in “abaca”.
"(.)(.)(.).*\\3\\2\\1": Three characters, followed by zero or more characters of any kind, followed by the same three characters in reverse order, as in “abcxcba”.
2) Construct regular expressions to match words that:
1) Start and end with the same character.
str_view_all(words,"^(.)((.*\\1$)|\\1?$)", match = TRUE )
2) Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
str_view_all(words, "(..).*\\1", match = TRUE)
3) Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
str_view_all(words, "(.).*\\1.*\\1", match = TRUE)
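The five backreference patterns from the "describe, in words" exercise in this section can be tried against a few strings. This is a sketch: the patterns are the ones from the book's exercise as I understand it, and the test strings are made up so that each pattern matches exactly one of them.

```r
library(stringr)

# Made-up test strings, one per pattern.
x <- c("aaa", "abba", "abab", "abaca", "abcxcba")

str_subset(x, "(.)\\1\\1")               # "aaa"
str_subset(x, "(.)(.)\\2\\1")            # "abba"
str_subset(x, "(..)\\1")                 # "abab"
str_subset(x, "(.).\\1.\\1")             # "abaca"
str_subset(x, "(.)(.)(.).*\\3\\2\\1")    # "abcxcba"
```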
For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.
1) Find all words that start or end with x.
str_view_all(words, "^x|x$", match = TRUE)
xstart <- str_detect(words, "^x")
xend <- str_detect(words, "x$")
words[xstart|xend]
## [1] "box" "sex" "six" "tax"
2) Find all words that start with a vowel and end with a consonant.
str_view_all(words, "^[aeiou].*[^aeiou]$", match = TRUE)
vow <- str_detect(words, "^[aeiou]")
con <- str_detect(words, "[^aeiou]$")
words[vow & con] %>%
head(10)
## [1] "about" "accept" "account" "across" "act" "actual" "add"
## [8] "address" "admit" "affect"
3) Are there any words that contain at least one of each different vowel?
a <- str_detect(words, "[a]")
e <- str_detect(words, "[e]")
i <- str_detect(words, "[i]")
o <- str_detect(words, "[o]")
u <- str_detect(words, "[u]")
words[a & e & i & o & u]
## character(0)
max(str_count(words, "[aeiou]"))
## [1] 5
words[str_count(words, "[aeiou]") == 5]
## [1] "appropriate" "associate" "available" "colleague" "encourage"
## [6] "experience" "individual" "television"
prop <- str_count(words, "[aeiou]")/str_length(words)
words[prop == 1]
## [1] "a"
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c("\\b(", str_c(colours, collapse = "|"), ")\\b")
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
From the Harvard sentences data, extract:
1) The first word from each sentence.
str_extract(sentences, "[A-Za-z]+") %>%
head(10)
## [1] "The" "Glue" "It" "These" "Rice" "The" "The" "The" "Four"
## [10] "Large"
2) All words ending in ing.
str_extract(sentences,"\\b[A-Za-z]+ing\\b") %>%
head(10)
## [1] NA NA NA NA NA NA NA NA NA NA
3) All plurals.
str_extract(sentences,"\\b[A-Za-z]{3,}s\\b") %>%
head(10)
## [1] "planks" NA NA "days" "bowls" "lemons"
## [7] NA "hogs" "hours" "stockings"
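For the "ing" exercise above, str_extract() returns only the first match per sentence and NA where there is none, which is why the first ten results are all NA. To collect every match, str_extract_all() can be used instead. A minimal sketch on three made-up sentences (not the Harvard sentences):

```r
library(stringr)

# Made-up sample sentences, used here instead of stringr::sentences.
s <- c("The king was singing.", "No match here.", "Running and jumping.")
ing <- "\\b[A-Za-z]+ing\\b"

# First match per element, NA when nothing matches.
first_ing <- str_extract(s, ing)
first_ing   # "king" NA "Running"

# All matches per element, flattened into one character vector.
all_ing <- unlist(str_extract_all(s, ing))
all_ing     # "king" "singing" "Running" "jumping"
```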
num <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) ([^ ]+)"
has_num <- sentences %>%
str_subset(num) %>%
head(10)
has_num %>%
str_match(num)
## [,1] [,2] [,3]
## [1,] "seven books" "seven" "books"
## [2,] "two met" "two" "met"
## [3,] "two factors" "two" "factors"
## [4,] "three lists" "three" "lists"
## [5,] "seven is" "seven" "is"
## [6,] "two when" "two" "when"
## [7,] "ten inches." "ten" "inches."
## [8,] "one war" "one" "war"
## [9,] "one button" "one" "button"
## [10,] "six minutes." "six" "minutes."
str_match(sentences, "([A-Za-z]+)'([A-Za-z]+)") %>%
head(10)
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
## [3,] "It's" "It" "s"
## [4,] NA NA NA
## [5,] NA NA NA
## [6,] NA NA NA
## [7,] NA NA NA
## [8,] NA NA NA
## [9,] NA NA NA
## [10,] NA NA NA
str_replace_all("a/b/c", "/", "\\\\")
## [1] "a\\b\\c"
str_replace_all("The Apple", c("A" = "a", "T" = "t"))
## [1] "the apple"
Switch the first and last letters in words. Which of those strings are still words?
Some examples of strings which are still words are sub, tub, read, and lead.
words %>%
str_replace("^([A-Za-z])(.*)([A-Za-z])$", "\\3\\2\\1") %>%
head(10)
## [1] "a" "ebla" "tboua" "ebsoluta" "tccepa" "tccouna"
## [7] "echieva" "scrosa" "tca" "ectiva"
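To find which of the swapped strings are still words, the result can be intersected with the original vector. This is a sketch of one way to do it; note that it also keeps words whose first and last letters are identical, since swapping leaves them unchanged.

```r
library(stringr)

# Swap the first and last letter of every word in stringr::words.
swapped <- str_replace(words, "^([A-Za-z])(.*)([A-Za-z])$", "\\3\\2\\1")

# Keep only the swapped strings that also appear in the original corpus.
still_words <- intersect(swapped, words)
head(still_words)
```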
x <- "apples, pears, and bananas"
str_split(x, ", +(and +)?")
## [[1]]
## [1] "apples" "pears" "bananas"
Why is it better to split up by boundary(“word”) than " "?
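It is better to split on boundary("word") than on " " because boundary("word") uses proper word-boundary rules. It returns the words themselves and drops punctuation that is not part of a word, whereas splitting on " " leaves punctuation attached to the neighbouring word and breaks on every single space.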
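The difference shows up as soon as the text contains punctuation. A small sketch with a made-up sentence pair:

```r
library(stringr)

x <- "This is a sentence. This is another sentence."

# Splitting on a space keeps the full stops attached to "sentence.".
by_space <- str_split(x, " ")[[1]]
by_space
# "This" "is" "a" "sentence." "This" "is" "another" "sentence."

# Splitting on word boundaries returns only the words.
by_word <- str_split(x, boundary("word"))[[1]]
by_word
# "This" "is" "a" "sentence" "This" "is" "another" "sentence"
```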
What does splitting with an empty string ("") do? Experiment, and then read the documentation.
It splits every part of the string into individual characters.
x <- "apples, pears, and bananas"
str_split(x, "")
## [[1]]
## [1] "a" "p" "p" "l" "e" "s" "," " " "p" "e" "a" "r" "s" "," " " "a" "n" "d" " "
## [20] "b" "a" "n" "a" "n" "a" "s"
str_view_all(c("go\\od", "good"), "\\\\")
str_view_all(c("go\\od", "good"), fixed("\\"))
tibble(list = unlist(str_extract_all(sentences, boundary("word"))))%>%
mutate(list = str_to_lower(list)) %>%
count(list, sort = TRUE) %>%
head(5)
## # A tibble: 5 x 2
## list n
## <chr> <int>
## 1 the 751
## 2 a 202
## 3 of 132
## 4 to 123
## 5 and 118
Find the stringi functions that:
Count the number of words.
stri_count_words()
Find duplicated strings.
stri_duplicated()
Generate random text.
stri_rand_strings()
stri_rand_shuffle()
stri_rand_lipsum()
How do you control the language that stri_sort() uses for sorting?
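The language is controlled through the locale. A locale can be passed directly to stri_sort() with the locale argument, or supplied through the opts_collator argument as stri_opts_collator(locale = ...).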
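A short sketch of locale-dependent sorting, using the Polish and Slovak locales as examples (in Slovak, "ch" is treated as a letter that sorts after "h"). It assumes the ICU build behind stringi ships data for these locales.

```r
library(stringi)

x <- c("hladny", "chladny")

# Polish collation: plain letter-by-letter order.
pl <- stri_sort(x, locale = "pl_PL")
pl   # "chladny" "hladny"

# Slovak collation: the digraph "ch" sorts after "h".
sk <- stri_sort(x, locale = "sk_SK")
sk   # "hladny" "chladny"

# The same setting expressed through opts_collator.
sk2 <- stri_sort(x, opts_collator = stri_opts_collator(locale = "sk_SK"))
```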
Wickham, H., & Grolemund, G. (2017). R for data science. Retrieved April 14, 2021, from https://r4ds.had.co.nz/strings.html