Description

The following problems come from Chapter 8, “Regular Expressions and Essential String Functions,” of “Automated Data Collection with R”.

Problem 3

library(stringr)
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

name
## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
  1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
get_first_last <- function(n){
  # Strip out title and middle initial (periods escaped so they match literally)
  n <- str_replace(n, "(Rev\\. )|(C\\. )|(Dr\\. )", "")
  # Reorder if comma is present
  split_n <- unlist(str_split(n, ", "))
  if (length(split_n) == 2){
    n <- paste(split_n[2], split_n[1])
  }
  return(n)
}

first_name_last_name <- as.vector(sapply(name, get_first_last))

first_name_last_name
## [1] "Moe Szyslak"      "Montgomery Burns" "Timothy Lovejoy" 
## [4] "Ned Flanders"     "Homer Simpson"    "Julius Hibbert"
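The reordering step can also be written as one vectorised substitution with a backreference. The following is a minimal sketch, not the book's solution. It uses base R's `gsub` and `sub` (counterparts of `str_replace_all` and `str_replace`) so that it runs without extra packages, and the helper name `to_first_last` is hypothetical. It assumes that any word ending in a period is a title or an initial, which holds for this data but not for names in general.

```r
# Names as extracted from raw.data in the problem statement
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Hypothetical helper: vectorised, so no sapply() is needed
to_first_last <- function(x) {
  # Drop any word that ends in a literal period (titles and initials)
  x <- gsub("\\b[[:alpha:]]+\\. ", "", x, perl = TRUE)
  # Swap "Last, First" into "First Last" using two capture groups
  sub("^(.+), (.+)$", "\\2 \\1", x, perl = TRUE)
}

to_first_last(name)
```

Names without a comma pass through the second substitution unchanged, because the pattern requires `", "` to match.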
  2. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
# Look for two or more letters followed by a literal period and a space
has_title <- str_detect(name, "[[:alpha:]]{2,}\\. ")
has_title
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
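An unescaped `.` in a regular expression matches any character, so a title pattern needs `\\.` to match a literal period. The sketch below illustrates the difference on the names from this problem. It uses base R's `grepl` with `perl = TRUE` as a stand-in for `str_detect`, and the variable names `loose` and `strict` are illustrative only.

```r
# Names as extracted from raw.data in the problem statement
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Unescaped: "." matches any character, so ordinary words such as "Moe " match
loose <- grepl("\\w{2}. ", name, perl = TRUE)

# Escaped: "\\." matches only a literal period, so only "Rev. " and "Dr. " match
strict <- grepl("\\w{2}\\. ", name, perl = TRUE)

loose
strict
```

With the unescaped pattern every name contains at least one match, so a simple presence test cannot separate titled from untitled names; the escaped pattern flags only the third and sixth elements.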
  3. Construct a logical vector indicating whether a character has a second name.
# Find a space followed by one letter, then a literal period, then a space
has_second_name <- str_detect(name, " [[:alpha:]]\\. ")
has_second_name
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

Problem 4

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by each regular expression.

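| Question | Regular Expression | Type of String | Examples |
|---|---|---|---|
| (a) | `[0-9]+\$` | One or more digits followed by a literal dollar sign | `100$`, `25$` |
| (b) | `\b[a-z]{1,4}\b` | A standalone word of one to four lower-case letters | `help`, `bat` |
| (c) | `.*?\.txt$` | Any string that ends in `.txt` (the part before the extension may contain any characters) | `readme.txt`, `1.txt` |
| (d) | `\d{2}/\d{2}/\d{4}` | Two digits, a slash, two digits, a slash, and four digits, such as a date in DD/MM/YYYY or MM/DD/YYYY format | `12/31/2018` |
| (e) | `<(.+?)>.+?</\1>` | Text enclosed between an opening HTML tag and its matching closing tag | `<b>This is bold</b>` |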
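The examples above can be checked mechanically. The following is a minimal sketch in base R (`grepl` with `perl = TRUE` stands in for `str_detect`); the vector names `patterns`, `examples`, and `matches` are illustrative. Each pattern is written as an R string, so every backslash is doubled.

```r
# Patterns (a) to (e), written as R strings (backslashes doubled)
patterns <- c(
  a = "[0-9]+\\$",
  b = "\\b[a-z]{1,4}\\b",
  c = ".*?\\.txt$",
  d = "\\d{2}/\\d{2}/\\d{4}",
  e = "<(.+?)>.+?</\\1>"
)

# One example string per pattern
examples <- c(
  a = "100$",
  b = "help",
  c = "readme.txt",
  d = "12/31/2018",
  e = "<b>This is bold</b>"
)

# mapply pairs each pattern with its example; TRUE means the example matches
matches <- mapply(function(p, x) grepl(p, x, perl = TRUE), patterns, examples)
matches

# A mismatched closing tag fails (e): \\1 must repeat the opening tag's name
grepl(patterns[["e"]], "<b>This is bold</i>", perl = TRUE)
```

The last call shows the role of the backreference in (e): the closing tag must carry the same name that the first group captured.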