Question 3

Data

library(stringr)
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

The example below comes from the text, downloaded from the Baruch library.

R> library(stringr)

R> name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
R> name
[1] "Moe Szyslak"      "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders"     "Simpson, Homer"       "Dr. Julius Hibbert"

R> phone <- unlist(str_extract_all(raw.data, "\\(?(\\d{3})?\\)?(-| )?\\d{3}(-| )?\\d{4}"))
R> phone
[1] "555-1239"       "(636) 555-0113" "555-6542"       "555 8904"
[5] "636-555-3226"   "5553642"

We can input the results into a data frame:


R> data.frame(name = name, phone = phone)
                  name          phone
1          Moe Szyslak       555-1239
2 Burns, C. Montgomery (636) 555-0113
3 Rev. Timothy Lovejoy       555-6542
4         Ned Flanders       555 8904
5       Simpson, Homer   636-555-3226
6   Dr. Julius Hibbert        5553642

Question

Copy the introductory example. The vector name stores the extracted names.

R> name
[1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
  1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
  2. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
  3. Construct a logical vector indicating whether a character has a second name.

Answer

Part (a)

The answer below works for this question because we have access to the raw data. It will not necessarily work for strings in general. There are three cases in the name vector:

  1. first_name last_name
  2. last_name, first_name
  3. title first_name last_name

There is no case of last_name, title first_name. Therefore, the steps to be taken are:

  1. Identify strings with at least two words separated by a space and containing no punctuation
    • The additional [[:alpha:] ]* allows recovering names like “John Jacob Jingleheimer Schmidt”.
    • This handles the title examples as well: after the title, the names behave the same, so as long as the portion with the period isn’t returned, we are safe.
  2. Identify strings with commas and extract the first and last names
  3. Concatenate the two cases
    • The time taken to restore the original name order will prove valuable in part (c)
# Handle the simple case: names already in first_name last_name order.
# A leading title is dropped because the match cannot include its period.
C1 <- unlist(str_extract_all(name, "\\b[[:alpha:]]+ [[:alpha:] ]*[[:alpha:]]+\\b"))

# More complicated; broken down for clarity
# Where are the comma entries?
id_a <- which(str_detect(name, ","))
id_rest <- seq_along(name)[-id_a]
# Split each comma entry into c(last_name, first_name) pairs
C2a <- unlist(str_split(name[id_a], ", "))
C2 <- character(length(C2a) / 2)
for (i in seq_along(C2)) {
  # Even positions hold the first name(s), odd positions the last name
  C2[i] <- paste0(C2a[2 * i], " ", C2a[2 * i - 1])
}
# Resurrect old order
# See https://stackoverflow.com/questions/1493969/how-to-insert-elements-into-a-vector
id <- c(id_rest, id_a)
C <- c(C1, C2)
CC <- C[order(id)]
CC
## [1] "Moe Szyslak"         "C. Montgomery Burns" "Timothy Lovejoy"    
## [4] "Ned Flanders"        "Homer Simpson"       "Julius Hibbert"
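
As a cross-check, the same reordering can be sketched without the loop by using backreferences in str_replace. This is only an alternative sketch under the same assumptions as above (at most one comma per name, titles of two or more letters ending in a period). The title pattern and the names swapped and CC_alt are my own choices for illustration, not something taken from the text.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Swap "last_name, first_name" into "first_name last_name";
# names without a comma do not match and are left untouched
swapped <- str_replace(name, "^(.+), (.+)$", "\\2 \\1")

# Drop a leading title: two or more letters followed by a period and a space
# (assumed pattern; a one-letter initial such as "C." is not treated as a title)
CC_alt <- str_replace(swapped, "^[[:alpha:]]{2,}\\. ", "")
CC_alt
```

Because every element is transformed in place, the original order is preserved and no reindexing step is needed.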

Part (b)

In this example, all titles have periods and no names with titles have commas. Therefore, the presence of a period together with the absence of a comma is a necessary and sufficient condition for a name to have a title.

# Vector
str_detect(name, "[.]") & !(str_detect(name, "[,]"))
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
# Test
name[str_detect(name, "[.]") & !(str_detect(name, "[,]"))]
## [1] "Rev. Timothy Lovejoy" "Dr. Julius Hibbert"
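
The period-and-no-comma rule depends on this particular data set. A slightly less data-dependent sketch is to look for a title at the start of the string. The pattern below assumes a title is two or more letters followed by a period, which is my own assumption and would miss titles written without a period.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# TRUE when the string starts with two or more letters followed by a period;
# "Burns, C. Montgomery" fails because "Burns" is followed by a comma
has_title <- str_detect(name, "^[[:alpha:]]{2,}\\.")
has_title
```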

Part (c)

It’s easier to answer this question using the result from part (a). Since we no longer have to worry about titles, all that is needed is to look for names made up of more than two space-separated words. Since we restored the original order in part (a), I don’t see a problem using its results. If this had been the initial ask, there may be a way to code for it directly; in the real world, though, it is highly inefficient to code every problem from scratch rather than reuse good code. Since this is a data science program, and not a computer science program, I’d hope this is acceptable.

If the work needed to be done from scratch, please consider it as if I posted the code from (a) here. Thank you.

# Vector
unlist(lapply(str_split(CC, " "), length)) > 2
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
# Test
CC[unlist(lapply(str_split(CC, " "), length)) > 2]
## [1] "C. Montgomery Burns"
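
For comparison, here is a rough sketch of answering part (c) straight from the name vector. It strips an assumed title pattern (two or more letters plus a period) and then counts whitespace-separated words. The names no_title and has_second are hypothetical, chosen only for this illustration.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Remove a leading title so it is not counted as a name
no_title <- str_replace(name, "^[[:alpha:]]{2,}\\. ", "")

# Count runs of non-whitespace; more than two words implies a second name
has_second <- str_count(no_title, "\\S+") > 2
has_second
```

Word order does not matter for the count, so the comma entries do not need to be rearranged first.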

Question 4

Question

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

  1. [0-9]+\\$
  2. \\b[a-z]{1,4}\\b
  3. .*?\\.txt$
  4. \\d{2}/\\d{2}/\\d{4}
  5. <(.+?)>.+?</\\1>

Answer

Part (a)

One or more digits (0 through 9) immediately followed by a literal $ sign: “4657$”, “123$”

Test <- c("apple", "yoyo", "R2D2", "", "465762", "4657$", "123$", "fig")
unlist(str_extract_all(Test, "[0-9]+\\$"))
## [1] "4657$" "123$"
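
The backslashes matter here. A small sketch, with a made-up vector x, contrasting the escaped \\$ (a literal dollar sign) with an unescaped $ (the end-of-string anchor):

```r
library(stringr)

x <- c("465762", "4657$", "Total: 123$ due")

# Escaped: digits followed by a literal dollar sign, anywhere in the string
escaped <- str_extract(x, "[0-9]+\\$")

# Unescaped: digits that run up to the end of the string
anchored <- str_extract(x, "[0-9]+$")

escaped
anchored
```

Only the first string ends in digits, so it is the only one the anchored version matches, while the escaped version matches the other two.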

Part (b)

A whole word made up of 1 to 4 lowercase letters, with a word boundary on each side: “yoyo”, “fig”

unlist(str_extract_all(Test, "\\b[a-z]{1,4}\\b"))
## [1] "yoyo" "fig"
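
A short sketch of how \\b behaves, using a made-up vector x. Letters and digits are word characters, while punctuation such as "." and "-" is not, so a boundary sits wherever the two kinds meet.

```r
library(stringr)

x <- c("apple", "DT607.txt", "a-b")

# "apple" is five letters, so no 1-4 letter run has a boundary on both sides;
# "txt" is bounded by "." and the end of the string;
# "a" and "b" are each bounded by the hyphen
res <- str_extract_all(x, "\\b[a-z]{1,4}\\b")
res
```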

Part (c)

Any characters (.), zero or more of them (*), matched lazily (?), followed by a literal period, then “txt”, and then the end of the string. In other words, a “.txt” file where anything could be the filename, including empty: DT607.txt, .txt

# Part (b) picks up "txt" and "pdf" from these file names because "." is a non-word character, so \\b matches on either side of the extension.
Test <- c("DT607.txt", ".txt", "DT607.pdf", "08/17/1978", "19/02/2018")
unlist(str_extract_all(Test, ".*?\\.txt$"))
## [1] "DT607.txt" ".txt"
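
To see what the lazy quantifier does and does not change, here is a small sketch on a made-up string with two ".txt" occurrences. With the $ anchor the match has to reach the end of the string regardless; without it, lazy and greedy matching differ.

```r
library(stringr)

x <- "a.txt.txt"

# Lazy but anchored: forced to extend to the end of the string
lazy_anchored <- str_extract(x, ".*?\\.txt$")

# Lazy without the anchor: stops at the first ".txt"
lazy_free <- str_extract(x, ".*?\\.txt")

# Greedy without the anchor: runs to the last ".txt"
greedy_free <- str_extract(x, ".*\\.txt")

c(lazy_anchored, lazy_free, greedy_free)
```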

Part (d)

Two digits, a slash, two digits, a slash, then four digits. In other words, a MM/DD/YYYY or DD/MM/YYYY date string: 08/17/1978, 19/02/2018

unlist(str_extract_all(Test, "\\d{2}/\\d{2}/\\d{4}"))
## [1] "08/17/1978" "19/02/2018"
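
Note that the pattern checks only the shape of the string, not whether it is a real date. A brief sketch with made-up inputs:

```r
library(stringr)

x <- c("99/99/9999", "123/45/67890", "8/17/1978")

# The first is not a valid date but still matches;
# the second matches inside longer digit runs;
# the third fails because the month has only one digit
shape <- str_extract(x, "\\d{2}/\\d{2}/\\d{4}")
shape
```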

Part (e)

This one threw me for a loop. What is being described is any string of characters, of at least length one, surrounded by angle brackets, then any string of characters of at least length one, followed by the ORIGINAL string in angle brackets, prefaced by a slash. It finally hit me that this is looking for paired HTML tags, such as <strong>This was difficult.</strong>.

Test <- c(Test, "<strong>This was difficult.</strong>")
unlist(str_extract_all(Test, "<(.+?)>.+?</\\1>"))
## [1] "<strong>This was difficult.</strong>"
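
A small sketch of the backreference at work, using str_match to pull out the captured tag name. The mismatched second string is a made-up example.

```r
library(stringr)

pattern <- "<(.+?)>.+?</\\1>"
x <- c("<strong>This was difficult.</strong>", "<b>bold</i>")

# \\1 must repeat whatever the first group captured,
# so the mismatched closing tag fails
matched <- str_detect(x, pattern)

# Column 2 of the str_match result holds the first capture group
tag <- str_match(x, pattern)[, 2]

matched
tag
```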

Question 9

Question

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr

Answer

I should just load up my instance of hashcat and throw this at it :).

After thinking for a bit, I decided to pull out the capital letters—EUREKA! The code below makes the message a little cleaner.

Secret <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# The magic words are squeamish ossifrage
# Yes, I'm dating myself

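# Split on the periods, which act as word separators
pieces <- unlist(str_split(Secret, "\\."))
# Keep only capital letters and punctuation from each piece
kept <- str_extract_all(pieces, "[[:upper:][:punct:]]")
# Glue each piece's characters into a word, then join the words with spaces
words <- unlist(lapply(kept, str_c, collapse = ""))
str_c(words, collapse = " ")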
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD!"
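
An alternative sketch that skips the split: extract the capitals and punctuation in one pass, then turn the periods into spaces. The names kept_all and message_alt are hypothetical, and the string is rebuilt from the four lines in the question.

```r
library(stringr)

Secret <- paste0(
  "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo",
  "Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO",
  "d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5",
  "fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
)

# Pull every capital letter and punctuation mark, in order, into one string
kept_all <- str_c(unlist(str_extract_all(Secret, "[[:upper:][:punct:]]")), collapse = "")

# The periods separate the words, so replace them with spaces
message_alt <- str_replace_all(kept_all, fixed("."), " ")
message_alt
```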