library(tidyverse)
Source files: https://github.com/djlofland/DATA607_F2019/tree/master/Assignment3
Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to problems 3 and 4 from chapter 8 of Automated Data Collection in R. Problem 9 is extra credit. You may work in a small group, but please submit separately with names of all group participants in your submission.
Here is the referenced code for the introductory example in #3:
Copy the introductory example. The vector name stores the extracted names.
R> name
[1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
# Example code from Chapter:
names <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
count <- length(names)
# Loop through each name (this could also be wrapped in a function and applied with lapply())
for (i in 1:count) {
  # if there is a comma, assume the name is already in "last, first" order
  if (!str_detect(names[i], ',')) {
    # str_match() returns the full match followed by each capture group
    name_parts <- str_match(names[i], '^(.*)[[:space:]]+(\\w+)$')
    names[i] <- paste(str_trim(name_parts[3]), ', ', str_trim(name_parts[2]), sep="")
  }
}
print(names)
## [1] "Szyslak, Moe" "Burns, C. Montgomery" "Lovejoy, Rev. Timothy"
## [4] "Flanders, Ned" "Simpson, Homer" "Hibbert, Dr. Julius"
If a name contains a comma, I assume it is already in “last, first” order. This would be a poor assumption for a larger, unknown data set, where a name might carry trailing credentials, e.g. “Jane Doe, MD”. In that case we would extend the regex to look for titles and credentials after the comma.
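The loop above can also be written as a vectorized helper, as the code comment suggests. The following is a minimal sketch: the function name `to_last_first` and the vector `extracted` are hypothetical, and it keeps the same assumption that a comma means the name is already in “last, first” order.

```r
library(stringr)

# Names as extracted in the introductory example
extracted <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
               "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Hypothetical helper: reorder "first last" entries to "last, first"
to_last_first <- function(x) {
  # Assumption: a comma means the entry is already "last, first"
  has_comma <- str_detect(x, fixed(","))
  # Group 1 = everything before the final word, group 2 = the final word
  x[!has_comma] <- str_replace(x[!has_comma], "^(.*)\\s+(\\w+)$", "\\2, \\1")
  x
}

reordered <- to_last_first(extracted)
print(reordered)
```

Because `str_replace()` is vectorized, no explicit loop or `lapply()` call is needed.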
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
# Example code from Chapter:
names <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
# use str_detect() to return TRUE/FALSE per name
title <- str_detect(names, '[[:alpha:]]{2,}\\.[[:space:]]?')
title
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
I look for a simple “aa.” pattern (two or more letters followed by a period) anywhere in the string. Requiring at least two letters prevents catching initials, e.g. “C. Mont…”. I didn’t restrict case, as we might see “DR. Julius” or “Dr. Julius”. This solution handles the cases “Dr. Julius Hibbert”, “Hibbert, Dr. Julius”, “Dr.Julius Hibbert” and “Julius Hibbert, Dr.”.
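To check the cases listed above, the same pattern can be run against each variant. This is a small sketch; the `variants` vector is made up for illustration and is not part of the assignment data.

```r
library(stringr)

# Same title pattern as above: 2+ letters, a period, then an optional space
title_pattern <- "[[:alpha:]]{2,}\\.[[:space:]]?"

# Hypothetical variants of one name, plus an entry with only an initial
variants <- c("Dr. Julius Hibbert", "Hibbert, Dr. Julius", "Dr.Julius Hibbert",
              "Julius Hibbert, Dr.", "DR. Julius Hibbert", "Burns, C. Montgomery")

has_title <- str_detect(variants, title_pattern)
print(has_title)
```

One limitation of this approach: anything with two or more letters before a period, such as the suffix “Jr.”, is also flagged as a title.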
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
# Example code from Chapter:
names <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
# Find all occurrences where there are 3 "names". Note that a "name" might also be a title; we filter those out in a second step
three_names <- str_detect(names, '(\\w+)[ ,.]+(\\w+)[ .,]+(\\w+)')
# filter out titles (we built the title vector in the above problem)
second_name <- three_names & !title
second_name
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
I’m assuming by “second name” the question is trying to trap the “C. Montgomery” entry and that “second name” isn’t implying “last name”.
Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
## [1] TRUE TRUE TRUE FALSE
Matches one or more digits immediately followed by a literal dollar sign, e.g. “100$”.
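As an illustration, here is a minimal sketch of such a check. The pattern `[0-9]+\\$` is assumed from the description above, and the example strings are hypothetical.

```r
library(stringr)

# Assumed pattern: one or more digits followed by an escaped (literal) $
dollar_pattern <- "[0-9]+\\$"

# Hypothetical test strings
dollar_examples <- c("100$", "costs 5$ each", "$100", "abc")

dollar_hits <- str_detect(dollar_examples, dollar_pattern)
print(dollar_hits)
```

Note that “$100” does not match, because the dollar sign has to come after the digits.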
## [1] TRUE TRUE TRUE
Looks for the presence of a word that is 1, 2, 3 or 4 characters long.
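A small sketch of this check follows. The pattern `\\b[a-z]{1,4}\\b` is assumed from the description and only covers lowercase letters; the example strings are hypothetical.

```r
library(stringr)

# Assumed pattern: a run of 1 to 4 lowercase letters bounded by word boundaries
short_word_pattern <- "\\b[a-z]{1,4}\\b"

# Hypothetical test strings
word_examples <- c("a", "four", "fives", "HELLO", "a longer sentence")

short_word_hits <- str_detect(word_examples, short_word_pattern)
print(short_word_hits)
```

The word boundaries are what keep “fives” from matching on its first four letters.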
## [1] TRUE TRUE TRUE TRUE
Checks whether the string ends with “.txt”, which is useful when pattern matching file names. I would have used ‘.+\.txt$’, since a bare ‘.txt’ is typically a hidden file rather than a text file. Note that the ‘?’ after ‘.*’ makes the quantifier lazy; it does not change which strings match here, so it is unnecessary.
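The difference between the two variants can be sketched as follows. The pattern `.*?\\.txt$` is assumed from the description above, and the file names are hypothetical.

```r
library(stringr)

# Assumed original pattern: anything (possibly nothing), then ".txt" at the end
any_txt <- ".*?\\.txt$"
# Stricter variant: at least one character must precede ".txt"
named_txt <- ".+\\.txt$"

# Hypothetical file names
files <- c("notes.txt", ".txt", "notes.txt.bak", "report.TXT")

any_hits <- str_detect(files, any_txt)
named_hits <- str_detect(files, named_txt)
print(any_hits)
print(named_hits)
```

Only the stricter variant rejects the bare “.txt” entry; both are case sensitive, so “report.TXT” is rejected either way.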
example <- c('12/24/1969', 'My birthday is 12/24/1969, yay!')
str_detect(example, '\\d{2}/\\d{2}/\\d{4}')
## [1] TRUE TRUE
Matches dates in MM/DD/YYYY format found anywhere in the string. The pattern only checks the digit layout, so DD/MM/YYYY or an impossible date such as 99/99/9999 also matches. It requires 2 digits for month and day and 4 digits for year, so it wouldn’t match ‘1/1/19’. To be more flexible, we might use ‘\d{1,2}/\d{1,2}/(\d{4}|\d{2})\b’.
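To compare the strict pattern with a more flexible one, here is a short sketch. The `flexible` pattern and the last two test strings are illustrative assumptions, not part of the exercise.

```r
library(stringr)

dates <- c("12/24/1969", "My birthday is 12/24/1969, yay!", "1/1/19", "12/24/196")

# Pattern from the exercise: exactly 2/2/4 digits
strict <- "\\d{2}/\\d{2}/\\d{4}"
# Illustrative alternative: 1-2 digit month and day, 4- or 2-digit year,
# with a word boundary so a 3-digit year is not accepted by accident
flexible <- "\\d{1,2}/\\d{1,2}/(\\d{4}|\\d{2})\\b"

strict_hits <- str_detect(dates, strict)
flexible_hits <- str_detect(dates, flexible)
print(strict_hits)
print(flexible_hits)
```

Neither pattern validates that the month and day are in range; that would need a date parser rather than a regex.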
example <- c('<p>I work</p>', '<DIV>I work</DIV>', '<p style="a">I break</p>', '<DIV>I break</div>')
str_detect(example, '<(.+?)>.+?</\\1>')
## [1] TRUE TRUE FALSE FALSE
This matches basic HTML and XML tag pairs: an opening tag, some content, and a closing tag with the same name. As written, it fails if the opening tag carries any extra information such as attributes, which is pretty standard in HTML. It is also case sensitive. A better implementation might be: str_detect(example, regex('<(.+) ?(.*)?>.+?</\\1>', ignore_case = T)). Also, note that our book references an older helper, ignore.case(), which doesn’t exist anymore. We now need to use regex('', ignore_case = T) to accomplish the same thing.
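One way to tolerate attributes is to capture only the tag name and let the rest of the opening tag be anything up to the closing bracket. The pattern below is a sketch under that assumption and is not the book's solution; the last test string is hypothetical. Regular expressions remain a rough tool for HTML, so a real parser is preferable for anything beyond simple checks.

```r
library(stringr)

example <- c('<p>I work</p>', '<DIV>I work</DIV>', '<p style="a">I break</p>',
             '<DIV>I break</div>', '<p>mismatched</div>')

# Capture just the tag name, allow any attributes before '>',
# and require the same name in the closing tag
tag_pattern <- regex("<([a-z][a-z0-9]*)\\b[^>]*>.+?</\\1>", ignore_case = TRUE)

tag_hits <- str_detect(example, tag_pattern)
print(tag_hits)
```

The `ignore_case = TRUE` flag is intended to let the opening and closing tag differ in case, as in the fourth string.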
Consider the string (5-3)^2=5^2-2*5*3+3^2, which conforms to the binomial theorem. We would like to extract the formula in the string. To do so we write the regular expression [^0-9=+*()]+. Explain why this fails and correct the expression. Example: \[(x+y)^4=x^4+4x^3y+6x^2y^2+4xy^3+y^4\]
## [1] "(5"
## [1] "(5-3)ˆ2=5ˆ2-2*5*3+3ˆ2"
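In the current regex, the leading ^ inside the square brackets does not stand for the exponent sign: as the first character of a character class it negates the class, so [^0-9=+*()]+ matches runs of characters that are not digits or the listed operators. The ‘-’ in 0-9 only defines a range, so the literal minus sign is not in the class either. (The “(5” result above is consistent with the string having been pasted with the typographic ˆ from the book, which is an ordinary literal character, so the match simply stopped at the first ‘-’.) We need to escape the ^, or move it away from the first position, and add an escaped -. The correct regex would be ‘[0-9=+*()\\-\\^]+’.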
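The effect of the leading caret can be seen in a short sketch, assuming the formula is typed with the plain ASCII caret. The variable names are illustrative.

```r
library(stringr)

# Formula typed with the plain ASCII caret
formula <- "(5-3)^2=5^2-2*5*3+3^2"

# Original pattern: the leading ^ negates the class, so only characters
# outside the set (here the minus sign) can match
negated <- str_extract(formula, "[^0-9=+*()]+")

# Corrected pattern: caret and minus are escaped, so both are literal members
corrected <- str_extract(formula, "[0-9=+*()\\^\\-]+")

print(negated)
print(corrected)
```

With the ASCII caret, the original pattern returns just the first minus sign, while the corrected one returns the whole formula.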
The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others!
clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5 dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3cocObt7yczjat0aootj55t3Nj3ne 6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.S qoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr
code <- 'clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3cocObt7yczjat0aootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr'
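# Extract every uppercase letter; str_extract_all() returns a list with one character vector per input string
caps <- str_extract_all(code, '[A-Z]')
# Collapse the letters into a single string
message <- str_c(caps[[1]], collapse = '')
message
## [1] "CONGRATULATIONSYOUAREASUPERNERD"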
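The hint says some characters are more revealing than others, and the code also contains a few periods and an exclamation mark. As a sketch of one possible refinement (an assumption, not part of the original solution), keeping that punctuation alongside the uppercase letters recovers the word breaks.

```r
library(stringr)

code <- 'clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3cocObt7yczjat0aootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr'

# Keep uppercase letters plus '.' and '!' (both literal inside a character class)
revealed <- str_extract_all(code, "[[:upper:].!]")[[1]]

# Collapse to one string, then treat each period as a word separator
with_breaks <- str_c(revealed, collapse = "")
readable <- str_replace_all(with_breaks, fixed("."), " ")

print(readable)
```

Using `fixed(".")` avoids having to escape the period, which would otherwise match any character.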