I started by loading stringr to facilitate analysis.
library(stringr)
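I extracted the names from the character string by matching runs of at least two characters drawn from letters, periods, commas, apostrophes, and spaces. Because the phone numbers consist of digits, hyphens, and parentheses, they break the string into one run per name.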
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
names <- unlist(str_extract_all(raw.data, "[A-Za-z.,' ]{2,}"))
names
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
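A caution on letter ranges, sketched in base R so that it runs without extra packages: when ranges follow character-code order, the range A-z covers more than the 52 letters, because six punctuation characters sit between Z and a in ASCII. The test vector below is made up for illustration, and perl = TRUE is an assumption chosen so that the range is interpreted by code point.

```r
# Hypothetical test characters: the ASCII symbols that sit between "Z" and "a".
between <- c("[", "\\", "]", "^", "_", "`")

# With PCRE (perl = TRUE) ranges follow code-point order,
# so A-z also covers these six symbols.
wide_class <- grepl("[A-z]", between, perl = TRUE)

# Listing both letter ranges explicitly matches letters only.
letter_class <- grepl("[A-Za-z]", between, perl = TRUE)

data.frame(between, wide_class, letter_class)
```

None of these symbols occur in raw.data, so either class extracts the same six names here; the explicit ranges are simply the safer habit.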
I copied the vector, then found the entries written in “Last, First” order and rewrote them in “First Last” order, using my knowledge of The Simpsons to expand “C.” to “Charles”. After checking for periods and stripping the titles, I used the word function in stringr to extract the first and last names, which I displayed as columns in a data frame.
names1 <- names
grep(",", names1, value = TRUE)
## [1] "Burns, C. Montgomery" "Simpson, Homer"
names1 <- str_replace(names1, "Burns, C. Montgomery", "Charles Montgomery Burns")
names1 <- str_replace(names1, "Simpson, Homer", "Homer Simpson")
grep("\\.", names1, value = TRUE)
## [1] "Rev. Timothy Lovejoy" "Dr. Julius Hibbert"
names1 <- str_replace(names1, "[A-Za-z]{2,3}\\.+\\s", "")
names1
## [1] "Moe Szyslak" "Charles Montgomery Burns"
## [3] "Timothy Lovejoy" "Ned Flanders"
## [5] "Homer Simpson" "Julius Hibbert"
first_name <- word(names1, 1)
last_name <- word(names1, -1)
data.frame(first_name, last_name)
## first_name last_name
## 1 Moe Szyslak
## 2 Charles Burns
## 3 Timothy Lovejoy
## 4 Ned Flanders
## 5 Homer Simpson
## 6 Julius Hibbert
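The reordering step could also be done with capture groups rather than by spelling out each name. This is only a sketch in base R, not the approach used above: the vector x is a small copy made for illustration, and the pattern assumes a single comma separating the last name from the rest. It does not expand “C.” to “Charles”, which still needs outside knowledge.

```r
# Two entries in "Last, First" order and one already in "First Last" order.
x <- c("Burns, C. Montgomery", "Simpson, Homer", "Ned Flanders")

# Group 1 is the text before the comma, group 2 the text after it.
# The replacement swaps them; entries without a comma are left unchanged.
reordered <- sub("^([^,]+),\\s*(.+)$", "\\2 \\1", x, perl = TRUE)

reordered
```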
I flagged the people with titles by matching two or three letters immediately followed by a period. Requiring at least two letters keeps the single-letter initial “C.” in Mr. Burns's name from being counted as a title.
titles <- grepl("[A-Za-z]{2,3}\\.", names)
titles
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
data.frame(names, titles)
## names titles
## 1 Moe Szyslak FALSE
## 2 Burns, C. Montgomery FALSE
## 3 Rev. Timothy Lovejoy TRUE
## 4 Ned Flanders FALSE
## 5 Simpson, Homer FALSE
## 6 Dr. Julius Hibbert TRUE
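One limitation worth noting: without a word boundary, the pattern is also satisfied by the last two or three letters of any longer word that happens to precede a period. The sketch below uses made-up inputs (the sentence in the last element is hypothetical) to compare the pattern with and without a leading boundary.

```r
# Hypothetical inputs: two titled names, one initial, and a surname ending a sentence.
x <- c("Rev. Timothy Lovejoy", "Burns, C. Montgomery",
       "Dr. Julius Hibbert", "I met Homer Simpson.")

# Without a boundary, the final letters of "Simpson." satisfy the pattern.
loose <- grepl("[A-Za-z]{2,3}\\.", x, perl = TRUE)

# \\b forces the two or three letters to form a whole word.
strict <- grepl("\\b[A-Za-z]{2,3}\\.", x, perl = TRUE)

data.frame(x, loose, strict)
```

For the six names in this exercise both versions give the same answer, so the boundary only matters for messier input.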
Using the cleaned vector names1, I counted the words of three or more characters in each person's name. I then created a logical vector that is TRUE for each person with more than two such words, that is, for anyone with a second name.
second_name_check <- str_count(names1, "\\w{3,}")
second_name <- second_name_check > 2
second_name
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
data.frame(names1, second_name_check)
## names1 second_name_check
## 1 Moe Szyslak 2
## 2 Charles Montgomery Burns 3
## 3 Timothy Lovejoy 2
## 4 Ned Flanders 2
## 5 Homer Simpson 2
## 6 Julius Hibbert 2
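An alternative way to count the words, sketched in base R, is to split each name on whitespace and count the pieces. The vector below simply repeats the cleaned names so that the block runs on its own; n_words and has_second_name are names introduced for this illustration. Unlike the count above, this version also counts words shorter than three characters.

```r
# The cleaned names, repeated here so the sketch is self-contained.
names1 <- c("Moe Szyslak", "Charles Montgomery Burns", "Timothy Lovejoy",
            "Ned Flanders", "Homer Simpson", "Julius Hibbert")

# Split on runs of whitespace and count the resulting words per name.
n_words <- lengths(strsplit(names1, "\\s+", perl = TRUE))

# More than two words means the person has a second name.
has_second_name <- n_words > 2

data.frame(names1, n_words, has_second_name)
```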
This regular expression matches one or more digits followed by a literal dollar sign. The backslashes escape the dollar sign, which would otherwise anchor the match to the end of the string.
one <- c("abcdef$", "123456", "abcdef", "123456$")
grepl("[0-9]+\\$", one)
## [1] FALSE FALSE FALSE TRUE
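To see what the escaping changes, the following sketch compares the escaped dollar sign with an unescaped one on a few made-up strings (the third element is hypothetical).

```r
# Hypothetical inputs: digits only, digits plus "$", and a "$" mid-string.
x <- c("123456", "123456$", "cost: 40$ each")

# Escaped: the "$" is a literal character that must follow the digits.
escaped <- grepl("[0-9]+\\$", x)

# Unescaped: the "$" is an anchor, so the digits must end the string.
anchored <- grepl("[0-9]+$", x)

data.frame(x, escaped, anchored)
```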
This regular expression matches a run of one to four lowercase letters with a word boundary on each side, that is, a standalone lowercase word of at most four letters.
two <- c("Charlie", "Name", "charlie", "one", "four")
grepl("\\b[a-z]{1,4}\\b", two)
## [1] FALSE FALSE FALSE TRUE TRUE
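Because grepl reports whether the pattern occurs anywhere in a string, a longer string passes as soon as it contains one short lowercase word. The sketch below, on made-up inputs, contrasts that behaviour with a version anchored to the whole string; the anchored pattern is my own variation, not part of the exercise.

```r
# Hypothetical inputs: a short word, a sentence containing one, and all caps.
x <- c("four", "Hello to you", "HELLO")

# Matches if any standalone lowercase word of 1 to 4 letters is present.
contains_word <- grepl("\\b[a-z]{1,4}\\b", x, perl = TRUE)

# Matches only if the entire string is such a word.
whole_string <- grepl("^[a-z]{1,4}$", x, perl = TRUE)

data.frame(x, contains_word, whole_string)
```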
This regular expression would match a string ending in “.txt” (e.g. a .txt file).
three <- c("some.txt", "file.csv", "another.txt", "other.xls")
grepl(".*?\\.txt$", three)
## [1] TRUE FALSE TRUE FALSE
This regular expression would match a string consisting of, in order, two digits, a forward slash, two digits, a forward slash, and four digits (e.g. a date).
four <- c("9/12/19", "09/12/19", "09/12/2019")
grepl("\\d{2}/\\d{2}/\\d{4}", four)
## [1] FALSE FALSE TRUE
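Two caveats can be shown with a short sketch on made-up strings: the pattern is satisfied by any string that contains the date shape, and it checks only the shape, not whether the date is valid. The anchored variant is an assumption of mine about how one might require the whole string to be a date.

```r
# Hypothetical inputs: a bare date, a date inside text, extra digits, an impossible date.
x <- c("09/12/2019", "Due 09/12/2019 at noon", "109/12/20199", "99/99/9999")

# Unanchored: true whenever the shape appears somewhere in the string.
contains_date <- grepl("\\d{2}/\\d{2}/\\d{4}", x, perl = TRUE)

# Anchored: the whole string must have the shape, though the values are not validated.
exact_date <- grepl("^\\d{2}/\\d{2}/\\d{4}$", x, perl = TRUE)

data.frame(x, contains_date, exact_date)
```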
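This regular expression matches an opening tag, at least one character of content, and a closing tag whose name repeats the text captured from the opening tag through the backreference to the first group (e.g. a properly matched pair of HTML tags and whatever they contain).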
five <- c("<html><body>Some text</body></html>", "</html>something</body>")
grepl("<(.+?)>.+?</\\1>", five)
## [1] TRUE FALSE
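The role of the backreference is easier to see when the matched text is pulled out. In this sketch the first two inputs are made up, and regmatches with regexpr (both base R) return the first match from each string that has one.

```r
# Hypothetical inputs: a matched pair, a mismatched pair, and nested tags.
x <- c("<b>bold</b>", "<b>bold</i>", "<html><body>Some text</body></html>")
pattern <- "<(.+?)>.+?</\\1>"

# The closing tag must repeat whatever group 1 captured in the opening tag.
hits <- grepl(pattern, x, perl = TRUE)

# Extract the matched text; strings without a match are dropped.
matched <- regmatches(x, regexpr(pattern, x, perl = TRUE))

hits
matched
```

For the nested example the match runs from the opening html tag to the closing html tag, so the inner body pair is part of the content rather than a separate match.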
I had difficulty finding the snippet from the book online, so I typed it out by hand. In doing so, I guessed that the secret message was hidden in either the digits or the uppercase letters (I wasn’t sure what to make of “clcop”). Extracting the uppercase letters spelled out a phrase, and on closer inspection the words in the phrase were separated by periods in the snippet. I therefore extracted the uppercase letters together with the periods and the exclamation mark, then replaced the periods with spaces, giving “CONGRATULATIONS YOU ARE A SUPERNERD!”
snippet <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
str_extract_all(snippet, "[0-9]")
## [[1]]
## [1] "1" "0" "8" "7" "7" "9" "2" "8" "5" "5" "0" "7" "8" "0" "3" "5" "3"
## [18] "0" "7" "5" "5" "3" "3" "6" "4" "1" "1" "6" "2" "2" "4" "9" "0" "5"
## [35] "6" "5" "1" "7" "2" "4" "6" "3" "9" "5" "8" "9" "6" "5" "9" "4" "9"
## [52] "0" "5" "4" "5"
str_extract_all(snippet, "[A-Z]")
## [[1]]
## [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"
uppercase <- unlist(str_extract_all(snippet, "[A-Z.!]"))
secret_message <- str_replace_all(paste(uppercase, collapse = ""), "\\.", " ")
secret_message
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD!"
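The same decoding can be sketched in base R by deleting everything that is not wanted instead of extracting what is. The string toy is a short made-up stand-in for the snippet so that the block stays readable.

```r
# Hypothetical miniature of the snippet: noise around capitals, a period, and "!".
toy <- "xxHq3Iz.7tYoOyU!k"

# Drop every character that is not an uppercase letter, a period, or "!".
kept <- gsub("[^A-Z.!]", "", toy, perl = TRUE)

# Turn the periods into spaces; fixed = TRUE treats "." as a literal character.
message_text <- gsub(".", " ", kept, fixed = TRUE)

message_text
```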
“Basic Regular Expressions in R, Cheat Sheet”. RStudio. Accessed 091219 from https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
“Extract first word from a column and insert into new column [duplicate]”. StackOverflow. Accessed 091219 from https://stackoverflow.com/questions/31925811/extract-first-word-from-a-column-and-insert-into-new-column
“Chapter 8”. Automated Data Collection with R. Accessed 091219 from http://kek.ksu.ru/eos/WM/AutDataCollectR.pdf