DATA 607 Assignment 3

library(tidyverse)

## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(stringr)

Source files: https://github.com/djlofland/DATA607_F2019/tree/master/Assignment3

Chapter 8 Questions

Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to problems 3 and 4 from chapter 8 of Automated Data Collection with R. Problem 9 is extra credit. You may work in a small group, but please submit separately with names of all group participants in your submission.

Here is the referenced code for the introductory example in #3:

raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

Question 3

Copy the introductory example. The vector name stores the extracted names.

R> name
[1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Example code from the chapter:
names <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

# Loop through each name (this could also be written as a function and applied with lapply())
for (i in seq_along(names)) {
  # a comma means the name is assumed to already be in "last, first" order
  if (!str_detect(names[i], ',')) {
    # str_match() returns the full match plus one column per capture group:
    # column 2 = everything before the final word, column 3 = the final word (last name)
    name_parts <- str_match(names[i], '^(.*)[[:space:]]+(\\w+)$')
    names[i] <- paste0(str_trim(name_parts[3]), ', ', str_trim(name_parts[2]))
  }
}

print(names)

## [1] "Szyslak, Moe"          "Burns, C. Montgomery"  "Lovejoy, Rev. Timothy"
## [4] "Flanders, Ned"         "Simpson, Homer"        "Hibbert, Dr. Julius"

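If a name contains a comma, I'm assuming it's already in "last, first" order, and I rewrite the remaining names to match, so every element ends up as last_name, first_name (the reverse of the order named in the question, but one consistent format). Note that this would be a poor assumption with a larger, unknown data set, where a name might end with credentials, e.g. "Jane Doe, MD". In that case, we'd change our regex to look for titles and credentials explicitly.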
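Since the question literally asks for first_name last_name, the opposite rearrangement can be sketched without a loop. This is only an illustration, not the chapter's solution: it assumes at most one comma per name and that everything after the comma is the given name. The variable first_last is a name introduced here.

```r
library(stringr)

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
names <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

# Swap "last, first" into "first last". Names without a comma do not match
# the pattern, so str_replace() returns them unchanged.
first_last <- str_replace(names, "^(.+),\\s*(.+)$", "\\2 \\1")
first_last
## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
```

Titles are left where they are; removing them would be a separate step.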

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Example code from Chapter:
names <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

# use str_detect() to return TRUE/FALSE per name
title <- str_detect(names, '[[:alpha:]]{2,}\\.[[:space:]]?')
title

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

I look for a simple "aa." pattern (two or more letters followed by a period) anywhere in the string. Requiring at least two letters prevents catching initials, e.g. "C. Mont…". I didn't restrict case, as we might see "DR. Julius" or "Dr. Julius". This solution handles the cases "Dr. Julius Hibbert", "Hibbert, Dr. Julius", "Dr.Julius Hibbert" and "Julius Hibbert, Dr.".
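A quick way to check the cases listed above is to run the same pattern over a small test vector. The vector cases and the name has_title are made up for this check. The last entry is an extra I added to show a limitation: a suffix such as "Jr." also fits the "aa." shape, so it is flagged even though it is not a title.

```r
library(stringr)

title_pattern <- "[[:alpha:]]{2,}\\.[[:space:]]?"

# Hypothetical test names: title placements, an initial, and a suffix
cases <- c("Dr. Julius Hibbert", "Hibbert, Dr. Julius", "Dr.Julius Hibbert",
           "Julius Hibbert, Dr.", "DR. Julius Hibbert", "Burns, C. Montgomery",
           "Homer Simpson Jr.")

has_title <- str_detect(cases, title_pattern)
setNames(has_title, cases)
```

The first five entries come back TRUE, the initial "C." comes back FALSE, and "Jr." comes back TRUE, which is a false positive for this simple pattern.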

Construct a logical vector indicating whether a character has a second name.

raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Example code from Chapter:
names <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

# Find all occurrences where there are 3 "names". Note that a "name" might also be a title; we filter those out in a second step
three_names <- str_detect(names, '(\\w+)[ ,.]+(\\w+)[ .,]+(\\w+)')

# filter out titles (we built the title vector in the above problem)
second_name <- three_names & !title
second_name

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

I’m assuming by “second name” the question is trying to trap the “C. Montgomery” entry and that “second name” isn’t implying “last name”.
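Under the same reading of "second name", the two steps can also be folded into one: strip the title first, then count the remaining name parts. This is a sketch of an alternative, not the book's approach, and it reuses the simple title pattern from the previous problem. The name no_title is introduced here.

```r
library(stringr)

names <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
           "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Remove a title such as "Rev. " or "Dr. " (first occurrence only) ...
no_title <- str_remove(names, "[[:alpha:]]{2,}\\.[[:space:]]?")

# ... then a second name is present when three or more name parts remain.
second_name <- str_count(no_title, "\\w+") >= 3
second_name
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
```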

Question 4

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

[0-9]+\$

example <- c('abc 123$', '123$', '99.99$', '99.99 $')
str_detect(example, '[0-9]+\\$')

## [1]  TRUE  TRUE  TRUE FALSE

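Looks for one or more digits immediately followed by a literal dollar sign, anywhere in the string. The backslash escapes the $, which would otherwise mean "end of string". The last example fails because a space separates the digits from the $.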
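To see which part of each string is actually matched, str_extract() can be used in place of str_detect(). This is just an illustrative check on some of the same examples; matched is a name introduced here.

```r
library(stringr)

example <- c('abc 123$', '99.99$', '99.99 $')

# Returns the matched substring, or NA when nothing matches
matched <- str_extract(example, '[0-9]+\\$')
matched
## [1] "123$" "99$"  NA
```

Note that '99.99$' matches only on the trailing '99$', because the period is not a digit.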

\b[a-z]{1,4}\b

example <- c('a b ', 'abcd', ' abcde abcd')
str_detect(example, '\\b[a-z]{1,4}\\b')

## [1] TRUE TRUE TRUE

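Looks for the presence of a standalone word made of 1 to 4 lowercase letters. The \b word boundaries on both sides mean that a longer word such as 'abcde' does not match on its own; the third example matches only because it also contains 'abcd'.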
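A few counterexamples make the role of the word boundaries and the lowercase class clearer. The test strings below are made up for illustration.

```r
library(stringr)

pattern <- '\\b[a-z]{1,4}\\b'

# Too long, starts with an uppercase letter, or runs straight into digits:
# in each case no 1-4 letter lowercase run sits between two word boundaries.
non_matches <- str_detect(c('abcde', 'Abcd', 'ab12'), pattern)
non_matches
## [1] FALSE FALSE FALSE

# Only the short words are pulled out of a sentence
short_words <- str_extract_all('the quick brown fox', pattern)[[1]]
short_words
## [1] "the" "fox"
```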

.*?\.txt$

example <- c('filename.txt', 'file name.txt', 'f.txt', '.txt')
str_detect(example, '.*?\\.txt$')

## [1] TRUE TRUE TRUE TRUE

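Checks whether the string ends with ".txt", which is useful when pattern matching file names. I would have used '.+\.txt$', since a bare '.txt' is typically a hidden file and not a text file. Note that the ? does not make anything optional here: placed after *, it makes the quantifier lazy (match as few characters as possible). Because the pattern is anchored at the end with $, laziness does not change which strings match, so the ? is unnecessary for detection.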
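The difference between the original pattern and the stricter '.+' variant only shows up for a bare '.txt'. A small comparison, with file names invented for the example:

```r
library(stringr)

files <- c('filename.txt', '.txt', 'notes.txt.bak', 'readme.TXT')

# Original pattern: anything (including nothing) before ".txt" at the end
original <- str_detect(files, '.*?\\.txt$')
original
## [1]  TRUE  TRUE FALSE FALSE

# Stricter variant: at least one character must precede ".txt"
stricter <- str_detect(files, '.+\\.txt$')
stricter
## [1]  TRUE FALSE FALSE FALSE
```

Both patterns are case sensitive, so 'readme.TXT' is rejected either way.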

\d{2}/\d{2}/\d{4}

example <- c('12/24/1969', 'My birthday is 12/24/1969, yay!')
str_detect(example, '\\d{2}/\\d{2}/\\d{4}')

## [1] TRUE TRUE

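Matches a date-like pattern of two digits, a slash, two digits, a slash and four digits (e.g. MM/DD/YYYY) found anywhere in the string. It only checks the shape, not validity, so '99/99/9999' also matches. Note that this requires 2 digits in the month and day positions and 4 digits in the year, so it wouldn't match '1/1/19'. To be more flexible, we might use '\b\d{1,2}/\d{1,2}/(\d{4}|\d{2})\b'.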
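A more permissive variant can be sketched as follows: one or two digits for month and day, and either four or two digits for the year, with word boundaries so that partial numbers are not picked up. The pattern and test strings are my own illustration, and it still checks only the shape of the date, not whether the date is valid.

```r
library(stringr)

flexible <- '\\b\\d{1,2}/\\d{1,2}/(\\d{4}|\\d{2})\\b'

dates <- c('12/24/1969', '1/1/19', 'My birthday is 12/24/1969, yay!',
           '12-24-1969', '12/24/196')

date_match <- str_detect(dates, flexible)
date_match
## [1]  TRUE  TRUE  TRUE FALSE FALSE
```

The last entry fails because a three-digit year is neither four digits nor two digits followed by a word boundary.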

<(.+?)>.+?</\1>

example <- c('<p>I work</p>', '<DIV>I work</DIV>', '<p style="a">I break</p>', '<DIV>I break</div>')
str_detect(example, '<(.+?)>.+?</\\1>')

## [1]  TRUE  TRUE FALSE FALSE

This matches basic HTML and XML elements: an opening tag, some content, and a closing tag with the same name (captured by the group and reused through the backreference \1). As written, it fails if the opening tag carries any extra info such as attributes, which is pretty standard in HTML. As written, it's also case sensitive, so '<DIV>I break</div>' does not match. A better implementation might be: str_detect(example, regex('<(.+) ?(.*)?>.+?</\\1>', ignore_case = TRUE)). Also note that our book references an older helper, ignore.case(), which doesn't exist anymore in current stringr. We now need to use regex(pattern, ignore_case = TRUE) to accomplish the same thing.
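One possible tightening, sketched under my own assumptions about what a tag name looks like (a letter followed by letters or digits): capture only the tag name, skip any attributes with [^>]*, and turn on case-insensitive matching via regex(). The names tag_pattern and tag_match are introduced here.

```r
library(stringr)

example <- c('<p>I work</p>', '<DIV>I work</DIV>',
             '<p style="a">I break</p>', '<DIV>I break</div>')

# Group 1 captures only the tag name; [^>]* then skips any attributes
# up to the closing > of the opening tag.
tag_pattern <- regex('<([a-z][a-z0-9]*)\\b[^>]*>.+?</\\1>', ignore_case = TRUE)

tag_match <- str_detect(example, tag_pattern)
tag_match
## [1] TRUE TRUE TRUE TRUE
```

This is still far from a real HTML parser (nested or self-closing tags are not handled), so for actual scraping a dedicated parser is the safer choice.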

Question 9

Consider the string (5-3)ˆ2=5ˆ2-2*5*3+3ˆ2, which conforms to the binomial theorem. We would like to extract the formula in the string. To do so we write the regular expression [ˆ0-9=+*()]+. Explain why this fails and correct the expression.

Example: \[(x+y)^4 = x^4 + 4x^3y + 6x^2y^2 + 4xy^3 + y^4\]

example <- c('(5-3)ˆ2=5ˆ2-2*5*3+3ˆ2')
str_extract(example, '[ˆ0-9=+*()]+')

## [1] "(5"

example <- c('(5-3)ˆ2=5ˆ2-2*5*3+3ˆ2')
str_extract(example, '[0-9=+*()\\-\\ˆ]+')

## [1] "(5-3)ˆ2=5ˆ2-2*5*3+3ˆ2"

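The expression fails for two reasons. First, the class contains no literal minus sign: the - in 0-9 only defines the digit range, so the match stops at the first -, which is why the call above returns "(5". Second, the caret needs care. With an ASCII ^ as the first character inside [ ], the caret negates the class (it does not mean "starts with" there, and it does not stand for the exponent sign), so the expression would match only characters that are not digits or = + * ( ). The code above uses the modifier-letter circumflex ˆ (U+02C6) rather than the ASCII ^, which the regex engine treats as an ordinary literal character. The fix is to add an escaped - and to escape the caret (or move it away from the first position) so that both are matched literally. The corrected regex is '[0-9=+*()\\-\\ˆ]+'.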
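For comparison, here is the same exercise typed with the plain ASCII caret. This is my own re-typing of the string, not the code above; with ^ in first position the class is negated, so the first match is the first character that is not in the set.

```r
library(stringr)

ascii_example <- '(5-3)^2=5^2-2*5*3+3^2'

# Leading ^ negates the class: matches the first run of characters that are
# NOT digits or = + * ( ), which is the minus sign.
broken <- str_extract(ascii_example, '[^0-9=+*()]+')
broken
## [1] "-"

# Escaping both - and ^ makes them literal members of the class
fixed_match <- str_extract(ascii_example, '[0-9=+*()\\-\\^]+')
fixed_match
## [1] "(5-3)^2=5^2-2*5*3+3^2"
```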

Bonus Question

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others!

clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3cocObt7yczjat0aootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr

code <- 'clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3cocObt7yczjat0aootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr'

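# Pull out every uppercase letter, then collapse them into a single string
# (using a new name avoids masking R's built-in `letters` vector)
capitals <- str_extract_all(code, '[A-Z]')[[1]]
message <- str_c(capitals, collapse = '')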

message

## [1] "CONGRATULATIONSYOUAREASUPERNERD"
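The message runs together because only the capitals were kept. As a small extension (my own variation, with the names kept and decoded introduced here), keeping the periods and the exclamation mark as well recovers the word breaks, since the periods in the code fall between the words.

```r
library(stringr)

code <- 'clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3cocObt7yczjat0aootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr'

# Keep uppercase letters plus the punctuation marks . and !
kept <- str_c(str_extract_all(code, '[[:upper:].!]')[[1]], collapse = '')

# Treat each period as a word separator
decoded <- str_replace_all(kept, fixed('.'), ' ')
decoded
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD!"
```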

References

Binomial Theorem: https://en.wikipedia.org/wiki/Binomial_theorem