## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Source files: https://github.com/djlofland/DATA607_F2019/tree/master/Assignment3

Chapter 3 Questions

Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to problems 3 and 4 from chapter 8 of Automated Data Collection with R. Problem 9 is extra credit. You may work in a small group, but please submit separately with names of all group participants in your submission.

Here is the referenced code for the introductory example in #3:
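A minimal sketch of that example, assuming a raw string of phone numbers and names like the one below. The string `raw.data` and the extraction pattern are assumptions for illustration and are not taken from this document; only the six resulting names appear in the question text.

```r
library(stringr)

# Assumed raw input: phone numbers and names run together in one string.
raw.data <- paste0(
  "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery",
  "555-6542Rev. Timothy Lovejoy555 8904Ned Flanders",
  "636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
)

# Keep runs of at least two letters, periods, commas or spaces,
# which drops the digits, hyphens and parentheses of the phone numbers.
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name
```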

Question 3

Copy the introductory example. The vector name stores the extracted names.

R> name
[1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"

  1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
## [1] "Szyslak, Moe"          "Burns, C. Montgomery"  "Lovejoy, Rev. Timothy"
## [4] "Flanders, Ned"         "Simpson, Homer"        "Hibbert, Dr. Julius"

If a name contains a comma, I assume it is already in “last, first” order and leave it unchanged; the other names are rearranged to match. Note that this would be a poor assumption with a larger, unknown data set, where a name might have credentials at the end, e.g. “Jane Doe, MD”. If so, we’d change our regex to look for titles.
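One way this rearrangement could be written is sketched below. The helper names `has_comma`, `last_first` and `first_last` are hypothetical, and the approach assumes the final word of a comma-free name is the last name. The second variant goes the other way and produces the first_name last_name order that the question asks for.

```r
library(stringr)

# The six names from the introductory example, typed in by hand.
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Assumption: a comma means the element is already "last, first".
has_comma <- str_detect(name, ",")

# Otherwise move the final word (taken to be the last name) to the front.
last_first <- ifelse(has_comma, name,
                     str_replace(name, "^(.+) (\\S+)$", "\\2, \\1"))

# Variant: turn the comma entries into "first last" and leave the rest alone.
first_last <- ifelse(has_comma,
                     str_replace(name, "^(.+), (.+)$", "\\2 \\1"),
                     name)

last_first
first_last
```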

  2. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

I look for a simple “aa.” pattern (two letters followed by a period) anywhere in the string. This avoids catching initials, e.g. “C. Mont…”. I didn’t restrict case, as we might see “DR. Julius” or “Dr. Julius”. This solution handles the cases “Dr. Julius Hibbert”, “Hibbert, Dr. Julius”, “Dr.Julius Hibbert” and “Julius Hibbert, Dr.”.
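A minimal sketch of such a check, assuming the pattern is "at least two letters followed by a literal period"; the exact pattern used for the output above is not shown, so `title_pattern` and the extra test strings are illustrative only.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Two or more letters (any case) followed by a period; a single-letter
# initial such as "C." is too short to match.
title_pattern <- "[[:alpha:]]{2,}\\."

has_title <- str_detect(name, title_pattern)
has_title

# The placement variants mentioned in the text (hypothetical inputs).
variants <- c("Hibbert, Dr. Julius", "Dr.Julius Hibbert",
              "Julius Hibbert, Dr.", "DR. Julius Hibbert")
str_detect(variants, title_pattern)
```

Note that this sketch would also flag other abbreviations ending in a period, so it is only as good as the assumption that such abbreviations are titles.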

  3. Construct a logical vector indicating whether a character has a second name.
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

I’m assuming that by “second name” the question means a middle name, as in the “C. Montgomery” entry, and that “second name” doesn’t imply “last name”.
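Under that reading, one possible check is "a single capital initial with a period, followed by another name". This is a sketch under that assumption only; `second_name_pattern` is a hypothetical name and other definitions of "second name" would need a different pattern.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# A one-letter initial at a word boundary ("C."), a space, then more letters.
# "Rev." and "Dr." do not match because the period does not directly
# follow a single capital letter at the start of a word.
second_name_pattern <- "\\b[A-Z]\\. [[:alpha:]]+"

has_second_name <- str_detect(name, second_name_pattern)
has_second_name
```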

Question 4

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

  1. [0-9]+\$
## [1]  TRUE  TRUE  TRUE FALSE

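Looks for one or more digits immediately followed by a literal dollar sign, anywhere in the string. The backslash escapes the $, so it is not treated as the end-of-string anchor.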
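A short illustration, using hypothetical example strings (the strings behind the output above are not shown):

```r
library(stringr)

# Hypothetical inputs: three with digits directly before "$", one without.
examples <- c("100$", "cost: 5$ each", "0$", "$100")

# In an R string the backslash itself must be escaped, hence "\\$".
dollar_hits <- str_detect(examples, "[0-9]+\\$")
dollar_hits
```

The last string fails because the dollar sign comes before the digits rather than after them.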

  2. \b[a-z]{1,4}\b
## [1] TRUE TRUE TRUE

Looks for the presence of a word made up of one to four lowercase letters, delimited by word boundaries.
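A small sketch with hypothetical inputs, including two that should not match:

```r
library(stringr)

# Hypothetical inputs; only the first three contain a short lowercase word.
examples <- c("a", "word", "the quick fox", "LOUD", "lengthy")

# \b marks a word boundary, so the 1 to 4 lowercase letters must form
# a whole word rather than part of a longer one.
short_word_hits <- str_detect(examples, "\\b[a-z]{1,4}\\b")
short_word_hits
```

"LOUD" fails because the class is lowercase only, and "lengthy" fails because no run of four or fewer letters is bounded on both sides.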

  3. .*?\.txt$
## [1] TRUE TRUE TRUE TRUE

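Checks whether the string ends with “.txt”, which is useful when pattern matching file names. I would have used ‘.+\.txt$’, since a bare ‘.txt’ is typically a hidden file and not a text file. Note that the ? does not make anything optional here: it makes ‘.*’ lazy (match as few characters as possible). Because the pattern is anchored at the end with $, this does not change which strings are detected, so the ? is unnecessary.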
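The difference between the original pattern and the stricter alternative can be seen on a few hypothetical file names:

```r
library(stringr)

# Hypothetical file names.
files <- c("notes.txt", ".txt", "a.txt.bak", "report.TXT")

# Original pattern: anything (possibly nothing) followed by ".txt" at the end.
original_hits <- str_detect(files, ".*?\\.txt$")

# Stricter variant: require at least one character before ".txt".
stricter_hits <- str_detect(files, ".+\\.txt$")

original_hits
stricter_hits
```

Only the bare ".txt" changes between the two results; both patterns are case sensitive, so "report.TXT" is rejected either way.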

  4. \d{2}/\d{2}/\d{4}
## [1] TRUE TRUE

Matches dates in MM/DD/YYYY format found anywhere in the string. Note that this requires two digits for the month and day and four digits for the year, so it wouldn’t match ‘1/1/19’. It also does not check that the numbers form a valid date. To be more flexible, we might use ‘\d{1,2}/\d{1,2}/(\d{4}|\d{2})\b’.
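A brief comparison of the strict pattern and a more flexible one, using hypothetical inputs. The flexible pattern allows one or two digits for month and day and a two- or four-digit year; it is a sketch, not a date validator.

```r
library(stringr)

# Hypothetical inputs, including a short date and an impossible date.
dates <- c("01/01/2019", "due 12/31/1999", "1/1/19", "99/99/9999")

strict_hits <- str_detect(dates, "\\d{2}/\\d{2}/\\d{4}")

# The trailing \b stops a three-digit year from matching on its first two digits.
flexible_pattern <- "\\d{1,2}/\\d{1,2}/(\\d{4}|\\d{2})\\b"
flexible_hits <- str_detect(dates, flexible_pattern)

strict_hits
flexible_hits
```

Both patterns accept "99/99/9999", which shows that they test the shape of the string only.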

  5. <(.+?)>.+?</\1>
## [1]  TRUE  TRUE FALSE FALSE

This matches basic HTML and XML elements: an opening tag, some content, and a closing tag with the same name. As written, it’ll fail if we add any extra info in the opening tag (such as attributes), which is pretty standard in HTML. As written, it’s also case sensitive. A better implementation might be: str_detect(example, regex('<(.+) ?(.*)?>.+?</\\1>', ignore_case = TRUE)). Also, note that our book references an older function, ignore.case(), which doesn’t exist anymore. We now need to use regex('', ignore_case = TRUE) to accomplish the same thing.
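The two patterns can be compared on a few hypothetical snippets. The vector `tags` is made up for illustration, and the relaxed pattern is the one suggested in the text, not a general HTML parser.

```r
library(stringr)

# Hypothetical snippets: two plain elements, one with mismatched tags,
# and one whose opening tag carries an attribute.
tags <- c("<b>bold</b>", "<title>Page</title>",
          "<b>bold</i>", "<a href=\"x\">link</a>")

# Original pattern: \\1 must repeat everything between "<" and ">".
strict_hits <- str_detect(tags, "<(.+?)>.+?</\\1>")

# Relaxed pattern: the tag name is captured separately from any attributes,
# and regex(..., ignore_case = TRUE) replaces the old ignore.case() helper.
relaxed <- regex("<(.+) ?(.*)?>.+?</\\1>", ignore_case = TRUE)
relaxed_hits <- str_detect(tags, relaxed)

strict_hits
relaxed_hits
```

The attribute example is the only one that changes: the strict pattern would need the closing tag to repeat the attribute text, while the relaxed one backtracks until the captured name is just "a".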

Question 9

Consider the string (5-3)ˆ2=5ˆ2-2*5*3+3ˆ2, which conforms to the binomial theorem. We would like to extract the formula in the string. To do so we write the regular expression [ˆ0-9=+*()]+. Explain why this fails and correct the expression.

Example: \[(x+y)^4=x^4+4x^3y+6x^2y^2+4xy^3+y^4\]

## [1] "(5"
## [1] "(5-3)ˆ2=5ˆ2-2*5*3+3ˆ2"

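In the current regex, a ^ placed first inside the brackets negates the character class rather than matching the exponent sign, and the class contains no literal minus sign (an unescaped ‘-’ between two characters would be read as a range). We need to escape the ^ and add an escaped -. A corrected regex would be ‘[0-9=+*()\\-\\ˆ]+’.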
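The behaviour can be reproduced with a short sketch. It assumes the string is typed with the plain ASCII caret ^; a look-alike character such as ˆ would be treated as an ordinary literal and give different results. The variable names are hypothetical.

```r
library(stringr)

# Assumption: ASCII "^" is used for the exponent sign.
formula_string <- "(5-3)^2=5^2-2*5*3+3^2"

# Leading ^ negates the class, so only the characters NOT listed are matched.
negated_pieces <- str_extract_all(formula_string, "[^0-9=+*()]+")[[1]]

# Escaping both the caret and the minus keeps them as literal class members.
fixed_match <- str_extract(formula_string, "[0-9=+*()\\-\\^]+")

negated_pieces
fixed_match
```

With the negated class, the only pieces returned are the minus signs and carets, which is the opposite of what was intended; the fixed class returns the whole formula.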

Bonus Question

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others!

clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5
dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne
6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.S
qoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr

## [1] "CONGRATULATIONSYOUAREASUPERNERD"
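One way to arrive at this message is to keep only the uppercase letters. The sketch below assumes that the characters after `kig7` and `jat` are the capital letter O rather than zeros, as the decoded message implies; `secret` and `message` are hypothetical names.

```r
library(stringr)

# The coded text, joined into one string without the line breaks.
secret <- paste0(
  "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5",
  "dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne",
  "6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.S",
  "qoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
)

# Pull out every uppercase letter and glue the pieces back together.
capitals <- str_extract_all(secret, "[[:upper:]]")[[1]]
message <- paste(capitals, collapse = "")
message
```

Adding the period and exclamation mark to the character class would also recover the punctuation between the words.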

References