This week’s assignment applies the text processing capabilities of regular expressions and the string manipulation functionality offered by Hadley Wickham’s stringr package to problems 3, 4, and 9 from Chapter 8 in Automated Data Collection with R (2015).
library(stringr)
3. Copy the introductory example. The vector name stores the extracted names.
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
phone <- unlist(str_extract_all(raw.data, "\\(?(\\d{3})?\\)?(-| )?\\d{3}(-| )?\\d{4}"))
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
3.1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
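# Split "last, first" entries at the comma; names without a comma get an empty second column
splitname <- str_split(name, ", ", simplify = TRUE)
# Put the second part first, then trim the leading space left where the second part is empty
first_last <- str_trim(str_c(splitname[, 2], splitname[, 1], sep = " "))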
first_last
## [1] "Moe Szyslak" "C. Montgomery Burns" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Dr. Julius Hibbert"
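As an alternative sketch (not the approach used above), the same reordering can be done in one step with capture groups and backreferences. The names `name_demo` and `reordered` are introduced here for illustration only, and the pattern assumes at most one comma per name.

```r
library(stringr)

# Illustrative copy of the extracted names (same values as `name` above)
name_demo <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
               "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Group 1 captures everything before the comma (the surname) and
# group 2 everything after it; the replacement swaps the two groups.
# Elements without a comma do not match and are returned unchanged.
reordered <- str_replace(name_demo, "^([^,]+), (.+)$", "\\2 \\1")
reordered
```

Because unmatched elements pass through untouched, no separate handling is needed for names that are already in first_name last_name order.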
# Remove titles such as "Rev. " and "Dr. " (two or more letters followed by a period and a space)
no_title <- str_replace(first_last, "[[:alpha:]]{2,}\\. ", "")
no_title
## [1] "Moe Szyslak" "C. Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Julius Hibbert"
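# First token (optionally ending in a period) and last token of each name
first_name <- str_extract(no_title, "^[[:alpha:]]+\\.?")
last_name <- str_extract(first_last, "[[:alpha:]]+$")
# Drop any middle name by joining only the first and last names
no_mid_name <- str_c(first_name, last_name, sep = " ")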
no_mid_name
## [1] "Moe Szyslak" "C. Burns" "Timothy Lovejoy" "Ned Flanders"
## [5] "Homer Simpson" "Julius Hibbert"
3.2. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
title_test <- str_detect(first_last, "[[:alpha:]]{2,}\\.")
title_test
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
title <- str_extract(first_last, "[[:alpha:]]{2,}\\.")
title
## [1] NA NA "Rev." NA NA "Dr."
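The pattern above treats any run of two or more letters followed by a period as a title. The sketch below compares it with a stricter variant that accepts only a known title at the start of the string. The vector `people` is hypothetical, "Sammy Davis Jr." is a made-up test case that is not part of the exercise data, and the list of titles (Rev, Dr) is an assumption.

```r
library(stringr)

# Hypothetical test vector; the third entry ends in a suffix, not a title
people <- c("Rev. Timothy Lovejoy", "C. Montgomery Burns", "Sammy Davis Jr.")

# General pattern: two or more letters followed by a period, anywhere
general <- str_detect(people, "[[:alpha:]]{2,}\\.")
general
# TRUE FALSE TRUE

# Stricter assumption: only "Rev." or "Dr." at the start counts as a title
strict <- str_detect(people, "^(Rev|Dr)\\.")
strict
# TRUE FALSE FALSE
```

For the six names in this exercise both patterns give the same answer. The stricter form only matters if suffixes or other abbreviations can appear.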
3.3. Construct a logical vector indicating whether a character has a second name.
Assuming “second name” refers to surname:
last_test <- str_detect(first_last, "[[:alpha:]]+$")
last_test
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
last_name
## [1] "Szyslak" "Burns" "Lovejoy" "Flanders" "Simpson" "Hibbert"
Assuming “second name” refers to middle name:
second_test <- str_detect(first_last, "(?<!([[:alpha:]]{2,3}\\.)) [[:alpha:]]+ ")
second_test
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
second_name <- str_trim(str_extract(first_last, "(?<!([[:alpha:]]{2,3}\\.)) [[:alpha:]]+ "))
second_name
## [1] NA "Montgomery" NA NA NA
## [6] NA
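A possible alternative to the lookbehind, sketched under the assumption that a name with more than two tokens after its title is removed has a middle name: strip a leading title, then count the whitespace-separated tokens. The names `full_names`, `stripped`, and `has_middle` are illustrative.

```r
library(stringr)

# Illustrative copy of the reordered names
full_names <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
                "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Remove a leading title (two or more letters, a period, and a space)
stripped <- str_remove(full_names, "^[[:alpha:]]{2,}\\. ")

# More than two remaining tokens is taken to mean a middle name is present
has_middle <- str_count(stripped, "\\S+") > 2
has_middle
# FALSE TRUE FALSE FALSE FALSE FALSE
```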
Building a data frame of the Simpsons characters’ names and phone numbers:
pnum <- "\\(?([0-9]{3})?\\)?[- ]?([0-9]{3})[- ]?([0-9]{4})"
# Match once; columns are: full match, area code, first three digits, last four digits
phone_parts <- str_match(phone, pnum)
names_df <- data.frame(last_name, first_name, middle_name = second_name, title, area_code = phone_parts[, 2], phone = str_c(phone_parts[, 3], "-", phone_parts[, 4]))
knitr::kable(names_df)
| last_name | first_name | middle_name | title | area_code | phone |
|---|---|---|---|---|---|
| Szyslak | Moe | NA | NA | NA | 555-1239 |
| Burns | C. | Montgomery | NA | 636 | 555-0113 |
| Lovejoy | Timothy | NA | Rev. | NA | 555-6542 |
| Flanders | Ned | NA | NA | NA | 555-8904 |
| Simpson | Homer | NA | NA | 636 | 555-3226 |
| Hibbert | Julius | NA | Dr. | NA | 555-3642 |
4. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
4.1. [0-9]+\\$
This regular expression describes a string of one or more digits followed by a dollar sign, for example, 958$.
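A short sketch of why the backslashes matter here: escaped, the dollar sign is a literal character; unescaped, it anchors the match to the end of the string. The vector `prices` is an illustrative name.

```r
library(stringr)

prices <- c("958$", "958", "$958")

# Escaped: digits followed by a literal dollar sign
escaped <- str_detect(prices, "[0-9]+\\$")
escaped
# TRUE FALSE FALSE

# Unescaped: digits at the very end of the string
unescaped <- str_detect(prices, "[0-9]+$")
unescaped
# FALSE TRUE TRUE
```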
4.2. \\b[a-z]{1,4}\\b
This regular expression describes a word of one to four lowercase letters, that is, a run of one to four lowercase letters with a word boundary on each side. An example is pear.
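To illustrate the effect of the word boundaries, the sketch below extracts every match from a made-up sentence (`sentence` and `short_words` are illustrative names). Longer words are skipped entirely rather than truncated to four letters.

```r
library(stringr)

sentence <- "a pear is sweeter than an apple"

# "sweeter" and "apple" are longer than four letters, so neither matches
short_words <- str_extract_all(sentence, "\\b[a-z]{1,4}\\b")[[1]]
short_words
# "a" "pear" "is" "than" "an"
```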
4.3. .*?\\.txt$
This regular expression describes a string of any number of characters ending in “.txt”, i.e., a file or path name with a .txt extension. For example, folder/file.txt.
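The lazy quantifier `.*?` has little visible effect here because the `$` anchor forces the match to run to the end of the string. The sketch below contrasts the anchored and unanchored forms on two made-up strings (`files` is an illustrative name).

```r
library(stringr)

files <- c("a.txt b.txt", "notes.txt.bak")

# Anchored: the match must end at the end of the string
anchored <- str_extract(files, ".*?\\.txt$")
anchored
# "a.txt b.txt" NA

# Unanchored: the lazy quantifier stops at the first ".txt"
unanchored <- str_extract(files, ".*?\\.txt")
unanchored
# "a.txt" "notes.txt"
```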
4.4. \\d{2}/\\d{2}/\\d{4}
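This regular expression describes two digits followed by a forward slash, two more digits, another forward slash, and four more digits, in other words, a date written as mm/dd/yyyy or dd/mm/yyyy, such as 12/21/1994 or 18/09/2016. The pattern checks only the shape of the string, not whether the date is valid.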
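The pattern constrains only the shape of the string, which the following sketch illustrates with made-up inputs (`dates_demo`, `shape_ok`, and `parsed` are illustrative names). Checking validity with `as.Date()` is an assumption about how one might proceed, here reading the strings as month/day/year.

```r
library(stringr)

dates_demo <- c("12/21/1994", "99/99/9999", "1/2/2016")

# Shape check: two digits, slash, two digits, slash, four digits
shape_ok <- str_detect(dates_demo, "\\d{2}/\\d{2}/\\d{4}")
shape_ok
# TRUE TRUE FALSE

# Validity check, assuming month/day/year; impossible dates become NA
parsed <- as.Date(dates_demo, format = "%m/%d/%Y")
is.na(parsed[1:2])
# FALSE TRUE
```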
4.5. <(.+?)>.+?</\\1>
This regular expression describes an XML or HTML element with matching start and end tags, no attributes, and content at least one character long. The backreference \\1 requires the end tag to repeat exactly what the start tag captured. An example is <data> 123 </data>.
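A brief sketch of what the backreference enforces, using made-up inputs (`tags` and `tag_match` are illustrative names): a mismatched end tag fails, and so does a start tag with attributes, because the end tag would have to repeat them.

```r
library(stringr)

tags <- c("<data> 123 </data>",      # matching tags
          "<data> 123 </info>",      # end tag differs from start tag
          "<p class=\"x\">text</p>") # start tag carries an attribute

tag_match <- str_detect(tags, "<(.+?)>.+?</\\1>")
tag_match
# TRUE FALSE FALSE
```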
In the following code, the examples are checked to verify that they conform to the provided regular expressions:
examples <- c("958$", "pear", "folder/file.txt", "12/21/1994", "18/09/2016", "<data> 123 </data>")
#4.1
str_extract(examples, "[0-9]+\\$")
## [1] "958$" NA NA NA NA NA
#4.2
str_extract(examples, "\\b[a-z]{1,4}\\b")
## [1] NA "pear" "file" NA NA "data"
#4.3
str_extract(examples, ".*?\\.txt$")
## [1] NA NA "folder/file.txt" NA
## [5] NA NA
#4.4
str_extract(examples, "\\d{2}/\\d{2}/\\d{4}")
## [1] NA NA NA "12/21/1994" "18/09/2016"
## [6] NA
#4.5
str_extract(examples, "<(.+?)>.+?</\\1>")
## [1] NA NA NA
## [4] NA NA "<data> 123 </data>"
9. The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.
hidden <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
unlist(str_extract_all(hidden, "[A-Z]"))
## [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"
unlist(str_extract_all(hidden, "[a-z]"))
## [1] "c" "l" "c" "o" "p" "o" "w" "z" "m" "s" "t" "c" "d" "w" "n" "k" "i"
## [18] "g" "v" "d" "i" "c" "p" "u" "g" "g" "v" "h" "r" "y" "n" "j" "u" "w"
## [35] "c" "z" "i" "h" "q" "r" "f" "p" "x" "s" "j" "d" "w" "p" "n" "a" "n"
## [52] "w" "o" "w" "i" "s" "d" "i" "j" "j" "k" "p" "f" "d" "r" "c" "o" "c"
## [69] "b" "t" "y" "c" "z" "j" "a" "t" "a" "o" "o" "t" "j" "t" "j" "n" "e"
## [86] "c" "f" "e" "k" "r" "w" "w" "w" "o" "j" "i" "g" "d" "v" "r" "f" "r"
## [103] "b" "z" "b" "k" "n" "b" "h" "z" "g" "v" "i" "z" "c" "r" "o" "p" "w"
## [120] "g" "n" "b" "q" "o" "f" "a" "o" "t" "f" "b" "w" "m" "k" "t" "s" "z"
## [137] "q" "e" "f" "y" "n" "d" "t" "k" "c" "f" "g" "m" "c" "g" "x" "o" "n"
## [154] "h" "k" "g" "r"
unlist(str_extract_all(hidden, "[0-9]"))
## [1] "1" "0" "8" "7" "7" "9" "2" "8" "5" "5" "0" "7" "8" "0" "3" "5" "3"
## [18] "0" "7" "5" "5" "3" "3" "6" "4" "1" "1" "6" "2" "2" "4" "9" "0" "5"
## [35] "6" "5" "1" "7" "2" "4" "6" "3" "9" "5" "8" "9" "6" "5" "9" "4" "9"
## [52] "0" "5" "4" "5"
The lowercase letters and digits are noise; the hidden message is revealed by extracting only the uppercase letters from the string:
code <- str_c(unlist(str_extract_all(hidden, "[A-Z]")), collapse = "")
code
## [1] "CONGRATULATIONSYOUAREASUPERNERD"
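The string also contains a few periods and an exclamation mark. As a possible refinement (an assumption about the intended message, not part of the exercise text), keeping those characters along with the uppercase letters and turning the periods into spaces gives a more readable result. The names `hidden_demo`, `with_punct`, and `readable` are illustrative.

```r
library(stringr)

# Same string as `hidden` above, repeated so the example is self-contained
hidden_demo <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# Keep uppercase letters plus the literal "." and "!" characters
with_punct <- str_c(unlist(str_extract_all(hidden_demo, "[[:upper:].!]")), collapse = "")
with_punct
# "CONGRATULATIONS.YOU.ARE.A.SUPERNERD!"

# Treat each period as a word separator
readable <- str_replace_all(with_punct, fixed("."), " ")
readable
# "CONGRATULATIONS YOU ARE A SUPERNERD!"
```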