Load stringr and store the raw data in an R character string
library(stringr)
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
raw.data
## [1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
Create name vector (used code from book)
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
Translate name list to first_name last_name
Pseudocode
For any item in "name" that contains a "," between two words, swap the text before the comma with the text after it.
Strip out any titles. Be careful with Burns: the "C." is an initial, not a title, so only abbreviations of two or more letters followed by a period are removed.
namefl <- str_replace_all(name, "([[:alpha:]]+)(, )([[:alpha:]].+)", "\\3 \\1")
namefl2 <- str_trim(str_replace_all(namefl, "[[:alpha:]]{2,}\\.", ""))
namefl2
## [1] "Moe Szyslak" "C. Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Julius Hibbert"
Construct a logical vector indicating whether a character has a title.
str_detect(name, "[[:alpha:]]{2,}\\.")
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
Construct a logical vector indicating whether a character has a second name.
I am assuming this means a middle name. Note that the names as given leave out Homer's and Dr. Hibbert's middle names, and do not show that Rev. Lovejoy and Ned Flanders are both Jr.s, so only names with a visible middle part are detected.
str_detect(namefl2, "(\\.\\ )\\w*\\ \\w*")
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
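The second-name check above depends on an initial followed by a period. As a rough alternative sketch in base R, the same question can be asked by counting words. The vector `clean_names` is retyped here from the cleaned output so the block runs on its own, and the "three or more words" rule is an assumption about what counts as a second name, not something the exercise specifies.

```r
# Cleaned names (first name first, titles removed), retyped for a standalone run
clean_names <- c("Moe Szyslak", "C. Montgomery Burns", "Timothy Lovejoy",
                 "Ned Flanders", "Homer Simpson", "Julius Hibbert")

# Trim stray whitespace, split on runs of whitespace, and count the words
word_counts <- lengths(strsplit(trimws(clean_names), "\\s+"))

# Assumption: three or more words means a second (middle) name or initial
has_second_name <- word_counts >= 3
has_second_name
```

On these six names the word count flags only the Burns entry, which agrees with the str_detect result. Unlike the period-based pattern, it would also flag a spelled-out middle name.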
1. [0-9]+\$
One or more digits followed by a literal dollar sign. The backslash escapes the $, so it is not the end-of-string anchor (e.g., "4324$" matches, while "asdf.4324" does not).
2. \b[az]{1,4}\b
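A standalone word of 1 to 4 characters made up only of the letters a and z (e.g., 'a', 'az', 'zaaz'). A word such as 'aaax' does not match, because x is not in the character class and the word boundaries rule out a partial match.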
3. .*?\.txt$
Any string that ends in '.txt' (e.g., 'bobsyouruncle.txt')
4. \d{2}/\d{2}/\d{4}
2 digits followed by a slash followed by 2 digits followed by a slash followed by 4 digits (e.g., 10/23/1973)
5. <(.+?)>.+?</\1>
An opening tag made of one or more non-line-terminating characters enclosed in <>, followed by one or more non-line-terminating characters, followed by a closing tag: "</", the same text that the first group captured, and ">" (e.g., "<hard>thingsare</hard>"). The backreference \1 forces the closing tag to repeat the opening tag's contents.
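The five descriptions can be spot-checked with a short base R sketch. It uses `grepl()` with `perl = TRUE` rather than stringr so that it runs without extra packages. The test strings in `should_match` and `should_fail` are made-up examples, and pattern 1 is read with `\$` as an escaped literal dollar sign.

```r
# Patterns written as R strings, so every regex backslash is doubled
patterns <- c(
  digits_dollar = "[0-9]+\\$",
  az_word       = "\\b[az]{1,4}\\b",
  txt_file      = ".*?\\.txt$",
  date_like     = "\\d{2}/\\d{2}/\\d{4}",
  paired_tag    = "<(.+?)>.+?</\\1>"
)

# Hypothetical inputs: one expected match and one expected non-match per pattern
should_match <- c("costs 4324$", "zaaz", "bobsyouruncle.txt",
                  "10/23/1973", "<hard>thingsare</hard>")
should_fail  <- c("asdf.4324", "aaax", "notes.txt.bak",
                  "1/2/1973", "<hard>thingsare</soft>")

# Apply each pattern to its paired test string
hits   <- mapply(grepl, patterns, should_match, MoreArgs = list(perl = TRUE))
misses <- mapply(grepl, patterns, should_fail,  MoreArgs = list(perl = TRUE))

# Every entry of hits should be TRUE and every entry of misses FALSE
print(hits)
print(misses)
```

Pairing each pattern with a near-miss string makes it easier to see what each pattern actually requires. The mismatched closing tag in the last pair, for example, shows the effect of the `\1` backreference.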