These problems come from the book Automated Data Collection with R (1st edition), Chapter 8, problems 4-8, pages 217-218.
Before working, I’ll load the “stringr” package to use with the regular expressions.
library(stringr)
Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
Answer: The expression [0-9]+\\$ matches one or more digits followed by a literal dollar sign. The double backslash escapes the dollar sign, which would otherwise act as a metacharacter anchoring the match to the end of the string.
The string “398$hello” returns “398$”:
str_extract("398$hello", "[0-9]+\\$")
## [1] "398$"
The string “398” returns NA, showing again that the $ is literal:
str_extract("398", "[0-9]+\\$")
## [1] NA
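To make the effect of the escape concrete, here is a small sketch that runs the escaped and unescaped versions side by side. The second test string, "hello398", and the variable names are my own additions for illustration.

```r
library(stringr)

# Unescaped: $ is an anchor, so the digits must sit at the very end of the string.
unescaped <- "[0-9]+$"
# Escaped: \\$ is a literal dollar sign that must follow the digits.
escaped <- "[0-9]+\\$"

anchor_a  <- str_extract("398$hello", unescaped)  # digits are not at the end
anchor_b  <- str_extract("hello398", unescaped)   # digits are at the end
literal_a <- str_extract("398$hello", escaped)    # digits followed by "$"
literal_b <- str_extract("hello398", escaped)     # no "$" in the string

print(c(anchor_a, anchor_b, literal_a, literal_b))
```

Only the strings where the dollar sign plays the role the pattern expects should produce a match.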
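Answer: The expression \\b[a-z]{1,4}\\b is bookended by word boundaries (\\b), so it matches a single whole word. The word must consist only of lowercase letters and be between 1 and 4 characters long. In the example below, “I” is skipped because it is uppercase and “american” is skipped because it is too long.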
str_extract("I am a real american", "\\b[a-z]{1,4}\\b")
## [1] "am"
unlist(str_extract_all("I am a real american", "\\b[a-z]{1,4}\\b"))
## [1] "am"   "a"    "real"
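A quick way to see what the word boundaries contribute is to drop them. This is only a sketch with a test word of my choosing; without \\b the pattern is free to bite off the first four letters of a longer word.

```r
library(stringr)

# Without boundaries, the quantifier grabs up to four letters from inside a word.
no_boundary <- str_extract("american", "[a-z]{1,4}")

# With boundaries, the whole word must be 1 to 4 lowercase letters long.
with_boundary <- str_extract("american", "\\b[a-z]{1,4}\\b")

# Uppercase letters are outside [a-z], so a lone "I" does not match either.
uppercase_word <- str_extract("I", "\\b[a-z]{1,4}\\b")

print(c(no_boundary, with_boundary, uppercase_word))
```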
Answer: The expression .*?\\.txt$ matches any series of characters (possibly empty) followed by “.txt” at the end of the string. The question mark makes the quantifier lazy (“ungreedy”), but because the $ anchors the match to the end of the string, the result here is the same as it would be with the greedy .* form.
str_extract("txt", ".*?\\.txt$")
## [1] NA
str_extract(".txt", ".*?\\.txt$")
## [1] ".txt"
str_extract("jabroni.txt", ".*?\\.txt$")
## [1] "jabroni.txt"
str_extract("know your role blvd jabroni.txt", ".*?\\.txt$")
## [1] "know your role blvd jabroni.txt"
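To check how much the question mark matters, the sketch below compares lazy and greedy versions with and without the $ anchor. The test string "a.txt.txt" is my own; it was picked because it contains “.txt” twice.

```r
library(stringr)

x <- "a.txt.txt"

# With the anchor, both versions are forced to run to the end of the string.
lazy_anchored   <- str_extract(x, ".*?\\.txt$")
greedy_anchored <- str_extract(x, ".*\\.txt$")

# Without the anchor, the lazy version stops at the first ".txt".
lazy_free   <- str_extract(x, ".*?\\.txt")
greedy_free <- str_extract(x, ".*\\.txt")

print(c(lazy_anchored, greedy_anchored, lazy_free, greedy_free))
```

So the laziness only changes the result once the end-of-string anchor is removed.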
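Answer: The expression \\d{2}/\\d{2}/\\d{4} matches a date-like pattern: two digits, a slash, two digits, a slash, and four digits. It has the form of a date, but the digits between the forward slashes are not restricted to valid day or month ranges.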
str_extract("12/12/1999", "\\d{2}/\\d{2}/\\d{4}")
## [1] "12/12/1999"
str_extract("30/30/1999", "\\d{2}/\\d{2}/\\d{4}")
## [1] "30/30/1999"
str_extract("dd/dd/dddd", "\\d{2}/\\d{2}/\\d{4}")
## [1] NA
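If valid ranges actually mattered, one option would be a stricter pattern. The sketch below assumes a dd/mm/yyyy layout, which the exercise does not state, and the pattern and variable names are mine. Even the stricter pattern lets 31/02 through, so I also check against as.Date() from base R.

```r
library(stringr)

# Assumption: the intended layout is dd/mm/yyyy.
# Day is 01-31, month is 01-12, year is any four digits.
strict_date <- "^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/\\d{4}$"

candidates <- c("12/12/1999", "30/30/1999", "31/02/1999")

# Range check done by the regular expression alone.
regex_ok <- str_detect(candidates, strict_date)

# as.Date() returns NA when the string is not a real calendar date.
calendar_ok <- !is.na(as.Date(candidates, format = "%d/%m/%Y"))

print(data.frame(candidates, regex_ok, calendar_ok))
```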
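Answer: The expression <(.+?)>.+?</\\1> matches an opening tag, at least one character of content, and a closing tag with the same name. The parentheses capture the tag name, and the backreference \\1 requires the closing tag to repeat it.
str_extract("<head> </head>", "<(.+?)>.+?</\\1>")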
## [1] "<head> </head>"
str_extract("<title>ol' chunk of coal</title>", "<(.+?)>.+?</\\1>")
## [1] "<title>ol' chunk of coal</title>"
str_extract("<head>won't work<head>", "<(.+?)>.+?</\\1>")
## [1] NA
str_extract("<head>this neither</title>", "<(.+?)>.+?</\\1>")
## [1] NA
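To see the captured group that the backreference reuses, str_match() returns the full match together with each group. In this sketch I added a second pair of parentheses around the content, which is not part of the original expression.

```r
library(stringr)

# Group 1 is the tag name (reused by \\1); group 2 (my addition) is the content.
tag_pattern <- "<(.+?)>(.+?)</\\1>"

matched <- str_match("<title>ol' chunk of coal</title>", tag_pattern)
tag_name <- matched[1, 2]
content  <- matched[1, 3]

# When the closing tag differs, nothing matches and every column is NA.
mismatched <- str_match("<head>this neither</title>", tag_pattern)

print(matched)
print(mismatched)
```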
Rewrite the expression [0-9]+\\$ in a way that all elements are altered but the expression performs the same task.
Answer: \\d{1,}[$]
str_extract("398$hello", "\\d{1,}[$]")
## [1] "398$"
str_extract("398", "\\d{1,}[$]")
## [1] NA
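As a slightly broader check, the two expressions can be compared over a handful of strings at once. The extra test strings are my own, and the comparison only covers plain ASCII input.

```r
library(stringr)

# A few hypothetical test strings, with and without a digits-then-dollar run.
tests <- c("398$hello", "398", "cost: 12$", "$5")

original  <- str_extract(tests, "[0-9]+\\$")
rewritten <- str_extract(tests, "\\d{1,}[$]")

same_result <- identical(original, rewritten)
print(rewritten)
print(same_result)
```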
Consider the mail address chunkylover53[at]aol[dot]com
email <- "chunkylover53[at]aol[dot]com"
email <- str_replace(email, "\\[at\\]", "@" )
email <- str_replace(email, "\\[dot\\]", ".")
email
## [1] "chunkylover53@aol.com"
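The same cleanup can probably be written more compactly. The sketch below shows two alternatives I would consider: a named vector passed to str_replace_all(), and fixed() patterns that avoid escaping the brackets. The variable names are mine.

```r
library(stringr)

raw <- "chunkylover53[at]aol[dot]com"

# Option 1: one call, with a named vector of pattern = replacement pairs.
via_named <- str_replace_all(raw, c("\\[at\\]" = "@", "\\[dot\\]" = "."))

# Option 2: fixed() treats the pattern as plain text, so no escaping is needed.
via_fixed <- str_replace(str_replace(raw, fixed("[at]"), "@"), fixed("[dot]"), ".")

print(c(via_named, via_fixed))
```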
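Answer: The expression [:digit:] fails because a predefined character class must itself be enclosed in brackets, as in [[:digit:]]. Under POSIX bracket-expression rules, which base R’s default regex engine follows, [:digit:] on its own is read as an ordinary character class containing the characters “:”, “d”, “i”, “g”, and “t”.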
str_extract_all(email, "[[:digit:]]")
## [[1]]
## [1] "5" "3"
Answer: The expression \\D fails because the uppercase form matches every character that is not a digit. The lowercase \\d matches digits.
str_extract_all(email, "\\d")
## [[1]]
## [1] "5" "3"
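Running both symbols with a + quantifier makes the contrast easy to read, since each one returns whole runs of characters. This is just a sketch; the address is typed in directly so the block stands on its own.

```r
library(stringr)

address <- "chunkylover53@aol.com"

# \\D+ returns the runs of non-digits, i.e. everything except "53".
non_digits <- str_extract_all(address, "\\D+")[[1]]

# \\d+ returns the run of digits.
digits <- str_extract_all(address, "\\d+")[[1]]

print(non_digits)
print(digits)
```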
Consider the string “<title>+++BREAKING NEWS+++</title>”. We would like to extract the first HTML tag. To do so we write the regular expression <.+>. Explain why this fails and correct the expression.
Answer: This expression fails because it watched the movie Wall Street one too many times: it’s too greedy. The .+ consumes as much as it can, so the match runs all the way to the last “>” in the string. Adding the ? character after the plus sign makes the quantifier take the shortest possible string that satisfies the conditions.
html_string <- "<title>+++BREAKING NEWS+++</title>"
str_extract(html_string, "<.+>")
## [1] "<title>+++BREAKING NEWS+++</title>"
str_extract(html_string, "<.+?>")
## [1] "<title>"
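Another common fix, not the one the exercise asks for, is to replace the dot with a negated character class so the match cannot run past the first “>”. A minimal sketch, with the string repeated so the block runs on its own:

```r
library(stringr)

html <- "<title>+++BREAKING NEWS+++</title>"

# [^>]+ matches anything except ">", so greediness can no longer overshoot.
first_tag <- str_extract(html, "<[^>]+>")

# The same pattern with str_extract_all() picks up every tag in the string.
all_tags <- str_extract_all(html, "<[^>]+>")[[1]]

print(first_tag)
print(all_tags)
```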
Consider the string “(5-3)^2 = 5^2-2*5*3+ 3^2 conforms to the binomial theorem”. We would like to extract the formula in the string. To do so we write the regular expression [^0-9=+*()]+. Explain why this fails and correct the expression.
Answer: Putting the character “^” at the beginning of a character class inverts the rest of the class’s contents, so the original expression matches everything except the formula’s characters. In addition, the “-” in 0-9 is not a member of the class, since there it acts as the range metacharacter. If we move the caret away from the first position and add a dash at the end, the class is no longer inverted and both the caret and the dash are included as literal characters.
binomial_str <- "(5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem."
str_extract(binomial_str, "[^0-9=+*()]+")
## [1] "-"
str_extract(binomial_str, "[0-9=+*()^-]+")
## [1] "(5-3)^2=5^2-2*5*3+3^2"
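The position rules for “^” and “-” are easier to see on a tiny input. In this sketch the test string "3-4^2" is my own, and the last pattern shows an alternative I would expect to behave the same way: escaping both characters instead of relying on where they sit.

```r
library(stringr)

tiny <- "3-4^2"

# Caret first: the class is negated, so only the non-digit "-" is matched first.
negated <- str_extract(tiny, "[^0-9]+")

# Caret not first and dash last: both are literal members of the class.
by_position <- str_extract(tiny, "[0-9^-]+")

# Escaping both characters makes them literal regardless of position.
by_escape <- str_extract(tiny, "[\\^\\-0-9]+")

print(c(negated, by_position, by_escape))
```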