Automated Data Collection in R, Chapter 8

3. Copy the introductory example. The vector name stores the extracted names.
library(stringr)


raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name
## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
data.frame(name)
##                   name
## 1          Moe Szyslak
## 2 Burns, C. Montgomery
## 3 Rev. Timothy Lovejoy
## 4         Ned Flanders
## 5       Simpson, Homer
## 6   Dr. Julius Hibbert
  1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
name_clean <- str_replace_all(name, c("(.+, )(.+)$" = "\\2 \\1", ", " = ""))
name_clean
## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
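The named vector passed to str_replace_all applies its replacements in order. The sketch below splits the call into two separate steps to show the intermediate result; the shortened input vector and the names step1 and step2 are illustrative and not part of the original solution.

```r
library(stringr)

# Illustrative subset of the extracted names
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Simpson, Homer")

# Step 1: swap the two captured groups ("last_name, " and "first_name").
# The comma and space travel with the first group, so they end up trailing.
step1 <- str_replace(name, "(.+, )(.+)$", "\\2 \\1")
# step1: "Moe Szyslak", "C. Montgomery Burns, ", "Homer Simpson, "

# Step 2: drop the leftover ", "
step2 <- str_replace(step1, ", ", "")
# step2: "Moe Szyslak", "C. Montgomery Burns", "Homer Simpson"
```

Names without a comma do not match the first pattern and pass through both steps unchanged.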
  1. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
title <- str_detect(name_clean, "\\b[[:alpha:]]{2,3}\\.")
title
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
  1. Construct a logical vector indicating whether a character has a second name.
name_2nd <- str_detect(name_clean, "\\b[[:upper:]]{1}\\.")
name_2nd
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
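As a possible way to keep the three results together, the cleaned names and both logical vectors can be collected in one data frame. This is only a sketch: the object name simpsons and the column names are made up here, and the cleaned names are typed in directly so the block runs on its own.

```r
library(stringr)

# Cleaned names, typed in so the example is self-contained
name_clean <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
                "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# One row per character, one column per check (hypothetical column names)
simpsons <- data.frame(
  name            = name_clean,
  has_title       = str_detect(name_clean, "\\b[[:alpha:]]{2,3}\\."),
  has_second_name = str_detect(name_clean, "\\b[[:upper:]]{1}\\."),
  stringsAsFactors = FALSE
)
simpsons
```

Note that the second-name pattern only recognizes an abbreviated initial such as "C."; a second name written out in full would not be detected.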

4. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
  1. [0-9]+\\$ This pattern matches one or more digits immediately followed by a literal dollar sign. Note: As the example below demonstrates, it does not capture whole currency amounts with decimals, because the pattern contains no decimal point; only the digits after the decimal point are extracted.
example_4a <- c("4524$", "423.31", "65.48$", "5$", "800.00$")
str_extract(example_4a, "[0-9]+\\$")
## [1] "4524$" NA      "48$"   "5$"    "00$"
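If decimal amounts should be captured in full, one option is to add an optional decimal part to the pattern. This is a sketch of one possible extension, not part of the exercise; the names amounts and with_decimals are illustrative.

```r
library(stringr)

amounts <- c("4524$", "423.31", "65.48$", "5$", "800.00$")

# (\\.[0-9]+)? allows an optional decimal point followed by digits
with_decimals <- str_extract(amounts, "[0-9]+(\\.[0-9]+)?\\$")
with_decimals
## [1] "4524$"   NA        "65.48$"  "5$"      "800.00$"
```

The second element still returns NA because it has no dollar sign.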
  1. \\b[a-z]{1,4}\\b This pattern matches a whole word made up of one to four lowercase letters. The \\b at the beginning and end of the expression anchors the match to a word boundary on both sides, so longer words and words containing uppercase letters are not matched.
example_4b <- c("HELLO", "test", "YeStErDaY", "Tall", "small", "ok")
str_extract(example_4b, "\\b[a-z]{1,4}\\b")
## [1] NA     "test" NA     NA     NA     "ok"
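The effect of the two word boundaries is easier to see in a full sentence. The sentence below is made up for illustration, and str_extract_all is used so that every match is returned.

```r
library(stringr)

sentence <- "The small cat sat on a mat"

# Returns every word of 1 to 4 lowercase letters
short_words <- str_extract_all(sentence, "\\b[a-z]{1,4}\\b")[[1]]
short_words
## [1] "cat" "sat" "on"  "a"   "mat"
```

"The" is skipped because of its uppercase letter, and "small" is skipped because no word boundary falls within its first four letters.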
  1. .*\\.txt$ This pattern matches any string that ends in “.txt”, preceded by zero or more arbitrary characters. While an unescaped . stands for any character, the backslashes \\ before the . make it match a literal period, and the $ anchors the match to the end of the string. This pattern is useful for identifying .txt files.
example_4c <- c("file.txt", "file.pdf", "txt.jpeg", "example.Rmd")
str_extract(example_4c, ".*\\.txt$")
## [1] "file.txt" NA         NA         NA
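To illustrate why the escaped period and the $ anchor matter, the sketch below compares the pattern with a variant in which the period is left unescaped. The file names are made up for this comparison.

```r
library(stringr)

files <- c("notes.txt", "notes.txt.bak", "mytxt")

# Escaped period: only a real ".txt" ending matches
escaped <- str_detect(files, ".*\\.txt$")
escaped
## [1]  TRUE FALSE FALSE

# Unescaped period: any character before "txt" is accepted
unescaped <- str_detect(files, ".*.txt$")
unescaped
## [1]  TRUE FALSE  TRUE
```

"notes.txt.bak" fails in both cases because $ requires "txt" to be the end of the string.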
  1. \\d{2}/\\d{2}/\\d{4} This pattern matches two digits, a forward slash, two digits, a forward slash, and four digits. It is useful for identifying dates, but it would need additional criteria to avoid extracting numbers that are not actual dates (e.g., months > 12, days > 31).
example_4d <- c("500/660", "12/31/2018", "99/99/9999", "05\\13\\1923")
str_extract(example_4d, "\\d{2}/\\d{2}/\\d{4}")
## [1] NA           "12/31/2018" "99/99/9999" NA
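One possible way to add such criteria is to restrict the month and day ranges inside the pattern. The sketch below assumes a month/day/year order; the names dates, date_pattern, and plausible are illustrative.

```r
library(stringr)

dates <- c("12/31/2018", "99/99/9999", "13/01/2020", "02/30/2020")

# Month 01-12, day 01-31, four-digit year (assumes MM/DD/YYYY)
date_pattern <- "\\b(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/\\d{4}\\b"

plausible <- str_detect(dates, date_pattern)
plausible
## [1]  TRUE FALSE FALSE  TRUE
```

This still accepts impossible calendar dates such as 02/30/2020, so a regular expression alone only narrows the candidates.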
  1. <(.+?)>.+?</\\1> This pattern matches an opening tag in angle brackets, at least one character of content, and a closing tag whose name repeats the text captured by the first group, via the backreference \\1 (e.g., HTML code: <html> text </html>). The lazy quantifiers +? keep each part as short as possible.
example_4e <- c("<title>TEST TITLE</title>", "<strong>BOLD TEXT</strong>", "<700", "</p>")
str_extract(example_4e, "<(.+?)>.+?</\\1>")
## [1] "<title>TEST TITLE</title>"  "<strong>BOLD TEXT</strong>"
## [3] NA                           NA
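The role of the backreference can be checked with str_match, which returns the captured group next to the full match. The second input, with mismatched tags, is made up for illustration.

```r
library(stringr)

html <- c("<title>TEST TITLE</title>", "<b>text</i>")

# Column 1 holds the full match, column 2 the first capture group
m <- str_match(html, "<(.+?)>.+?</\\1>")

m[, 2]
## [1] "title" NA
```

The second string is not matched because the closing tag name does not repeat the captured opening tag name.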

9. The following code hides a secret message. Crack it with R and regular expressions.
secret_message <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
cracked_code <- unlist(str_extract_all(secret_message, "[[:upper:]]"))
cracked_code
##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"
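The message also contains periods and an exclamation mark that appear to mark the word breaks. As a sketch of one way to recover a readable sentence, the punctuation can be extracted together with the uppercase letters, collapsed into one string, and the periods replaced by spaces. The names pieces and readable are illustrative.

```r
library(stringr)

secret_message <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# Keep uppercase letters plus "." and "!"
pieces <- unlist(str_extract_all(secret_message, "[[:upper:].!]"))

# Collapse into one string and turn the periods into spaces
readable <- str_replace_all(paste(pieces, collapse = ""), fixed("."), " ")
readable
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD!"
```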