Automated Data Collection with R, Chapter 8 Question 3

raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
titles <- unlist(str_extract_all(name, "^\\w+\\."))
titles <- titles[titles != "character(0)"]
name
## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

1. Use the tools of this chapter to rearrange the vector so that all the elements conform to the standard last_name first_name

newname <- trimws(unlist(str_replace_all(name, titles, "")))
last_name <- unlist(str_extract_all(newname, "((\\w+(?=,))|((?<=[\\w] ).*$))"))
first_name <- unlist(str_extract_all(newname, "((^\\w+(?= \\w))|((?<=[,] ).*$))"))
newname <- paste(last_name, first_name)
newname
## [1] "Szyslak Moe"         "Burns C. Montgomery" "Lovejoy Timothy"    
## [4] "Flanders Ned"        "Simpson Homer"       "Hibbert Julius"

2. Construct a logical vector indicating whether a character has a title.

greplall <- function(list_pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE, value=FALSE, logic=FALSE){
  # Exceptions
  if (typeof(list_pattern) != "character" || typeof(x) != "character"){
    warning("Both variables are required to be character vectors!")
    return(NULL)
  }
  if (length(list_pattern) == 0 || length(x) == 0){
    warning("Both lists must have at least one item in them.")
    return(NULL)
  }
  # Running the code
  loglist <- c()
  for (i in 1:length(list_pattern)){
    tmplist <- grepl(list_pattern[i], x)
    if (length(loglist) == 0){
      loglist <- c(loglist, tmplist)
    }
    else{
      for (j in 1:length(loglist)){
        if (tmplist[j] == TRUE && loglist[j] == FALSE){
          loglist[j] <- TRUE
        }
      }
    }
  }
  return(loglist)
}

greplall(titles, name)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

3. Construct a logical vector indicating whether a character has a second name.

grepl("^([[:alpha:].]+ ){2,}", newname)
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

4. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

1. [0-9]+\$

A string of numbers of any length followed by the dollar sign. Example: 071$

2. \b[a-z]{1,4}\b

A string of lowercase letters between one and four characters in length, specifically a word. It would not match whitespace. Example: Octopus

3. .*?\.txt$

This is most likely meant to identify text files given that it ends in .txt, which is an extension for text files. It is looking for any file name, or even no file name… which leads me to believe it might need to be modified to be useful. Example: .txt

4. \d{2}/\d{2}/\d{4}

This is looking for a string of numbers of specific lengths seperated by forward slashes. Two digits, two digits, four digits. Likely, mm/dd/yyyy. An example of this would be the due date of this assignment. Example: 02/12/2018

5. <(.+?)>.+?</\1>

A regex pattern for capturing HTML or XML tags, it looks for the first group and the last group to match, with any text in between. This is not ideal in the case of HTML or XML with parameters, i.e. name, type, style, what have you. Example:

Now you’re thinking with R!

9. The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com (Extra Credit)

encrypted <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

Well, it’s clear that it’s all of the capitals. CONGRATULATIONSYOUAREASUPERNERD!