Automated Data Collection with R, Ch. 8: Regular expressions and essential string functions

This week’s assignment applies the text processing capabilities of regular expressions and the string manipulation functionality offered by Hadley Wickham’s stringr package to problems 3, 4, and 9 from Chapter 8 in Automated Data Collection with R (2015).


Load required package

library(stringr)


Problem 3

3. Copy the introductory example. The vector name stores the extracted names.

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"   

name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
phone <- unlist(str_extract_all(raw.data, "\\(?(\\d{3})?\\)?(-| )?\\d{3}(-| )?\\d{4}"))

name
## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

3.1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

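# Split "last, first" entries at the comma; names without a comma get an empty second column
splitname <- str_split(name, ", ", simplify = TRUE)
# Put the first-name part in front of the last-name part (no separator yet)
first_last <- str_c(splitname[, 2], splitname[, 1])
# Restore the missing space where a lowercase letter now directly precedes an uppercase one
first_last <- str_replace(first_last, "([a-z])([A-Z])", "\\1 \\2")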
first_last
## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
no_title <- str_replace(first_last, "[[:alpha:]]{2,}\\. ", "")
no_title 
## [1] "Moe Szyslak"         "C. Montgomery Burns" "Timothy Lovejoy"    
## [4] "Ned Flanders"        "Homer Simpson"       "Julius Hibbert"
first_name <- str_extract(no_title, "^[[:alpha:]]+\\.?")
last_name <- str_extract(first_last, "[[:alpha:]]+$")

no_mid_name <- str_c(first_name, last_name, sep = " ")
no_mid_name
## [1] "Moe Szyslak"     "C. Burns"        "Timothy Lovejoy" "Ned Flanders"   
## [5] "Homer Simpson"   "Julius Hibbert"
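
The inverted entries can also be reordered in a single replacement with backreferences. This is a minimal sketch of an alternative, not the approach used above. It assumes every inverted name has the form "last, first"; first_last_alt is an illustrative variable name, and the name vector is re-entered by hand so the block runs on its own.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Capture the text before and after ", " and swap the two groups;
# names without a comma do not match and are returned unchanged
first_last_alt <- str_replace(name, "^(.+), (.+)$", "\\2 \\1")
first_last_alt
## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
```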

3.2. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

title_test <- str_detect(first_last, "[[:alpha:]]{2,}\\.")
title_test
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
title <- str_extract(first_last, "[[:alpha:]]{2,}\\.")
title
## [1] NA     NA     "Rev." NA     NA     "Dr."
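
The pattern above flags any run of two or more letters followed by a period, so an abbreviated first name (a hypothetical "Wm.", say) would also count as a title. A stricter sketch, assuming Rev. and Dr. are the only titles of interest, lists them explicitly. The variable title_strict is an illustrative name, and first_last is re-entered so the block runs on its own.

```r
library(stringr)

first_last <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
                "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Accept only the listed titles, and only at the start of the name
title_strict <- str_detect(first_last, "^(Rev|Dr)\\.")
title_strict
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
```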

3.3. Construct a logical vector indicating whether a character has a second name.

Assuming “second name” refers to surname:

last_test <- str_detect(first_last, "[[:alpha:]]+$")
last_test
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
last_name
## [1] "Szyslak"  "Burns"    "Lovejoy"  "Flanders" "Simpson"  "Hibbert"

Assuming “second name” refers to middle name:

second_test <- str_detect(first_last, "(?<!([[:alpha:]]{2,3}\\.)) [[:alpha:]]+ ")
second_test
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
second_name <- str_trim(str_extract(first_last, "(?<!([[:alpha:]]{2,3}\\.)) [[:alpha:]]+ "))
second_name
## [1] NA           "Montgomery" NA           NA           NA          
## [6] NA
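
The same test can be sketched without a lookbehind: once titles are stripped, a name with a middle component has more than two whitespace-separated parts. This assumes that each name consists of a first name, an optional middle name, and a surname; has_middle is an illustrative name, and no_title is re-entered so the block runs on its own.

```r
library(stringr)

no_title <- c("Moe Szyslak", "C. Montgomery Burns", "Timothy Lovejoy",
              "Ned Flanders", "Homer Simpson", "Julius Hibbert")

# Count runs of non-whitespace characters; more than two parts implies a middle name
has_middle <- str_count(no_title, "\\S+") > 2
has_middle
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
```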

Creating a data frame of the Simpsons characters’ names and phone numbers:

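pnum <- "\\(?([0-9]{3})?\\)?[- ]?([0-9]{3})[- ]?([0-9]{4})"
phone_parts <- str_match(phone, pnum)  # columns 2 to 4: area code, prefix, line number

# data.frame() rather than cbind(), which would return a character matrix
names_df <- data.frame(last_name, first_name, middle_name = second_name, title,
                       area_code = phone_parts[, 2],
                       phone = str_c(phone_parts[, 3], "-", phone_parts[, 4]))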
                  
knitr::kable(names_df)
|last_name |first_name |middle_name |title |area_code |phone    |
|:---------|:----------|:-----------|:-----|:---------|:--------|
|Szyslak   |Moe        |NA          |NA    |NA        |555-1239 |
|Burns     |C.         |Montgomery  |NA    |636       |555-0113 |
|Lovejoy   |Timothy    |NA          |Rev.  |NA        |555-6542 |
|Flanders  |Ned        |NA          |NA    |NA        |555-8904 |
|Simpson   |Homer      |NA          |NA    |636       |555-3226 |
|Hibbert   |Julius     |NA          |Dr.   |NA        |555-3642 |


Problem 4

4. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

4.1. [0-9]+\\$

This regular expression describes a string of one or more digits followed by a dollar sign, for example, 958$.
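
The backslashes matter here: an unescaped $ is the end-of-string anchor, not a literal dollar sign. A small sketch with made-up test strings shows the difference (x, escaped, and unescaped are illustrative names):

```r
library(stringr)

x <- c("958$", "958", "$958")

# Escaped: "\\$" matches a literal dollar sign right after the digits
escaped <- str_detect(x, "[0-9]+\\$")
escaped    # TRUE FALSE FALSE

# Unescaped: "$" anchors the match, so the digits must end the string
unescaped <- str_detect(x, "[0-9]+$")
unescaped  # FALSE TRUE TRUE
```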

4.2. \\b[a-z]{1,4}\\b

This regular expression describes a standalone word of one to four lowercase letters (a to z), that is, a run of one to four lowercase letters with a word boundary on each side. An example is pear.
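
The word boundaries are what exclude longer or mixed-case words. A short sketch with made-up strings (x and short_words are illustrative names) shows which inputs match:

```r
library(stringr)

x <- c("pear", "apple", "Pear", "a b", "pear2")

# "apple" is too long, "Pear" starts with an uppercase letter, and the digit
# in "pear2" is a word character, so no boundary follows the "r"
short_words <- str_extract(x, "\\b[a-z]{1,4}\\b")
short_words  # "pear" NA NA "a" NA
```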

4.3. .*?\\.txt$

This regular expression describes a string that ends in “.txt”, preceded by any number of characters (including none), i.e., a file or path name with a .txt extension. For example, folder/file.txt.
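
Two details of this pattern can be checked directly: the lazy .*? may match nothing at all, and the $ anchor requires .txt to be the very end of the string. The file names below are made up for illustration, and x and is_txt are illustrative names.

```r
library(stringr)

x <- c("folder/file.txt", ".txt", "file.txt.bak", "notes_txt")

# ".txt" alone matches; a trailing ".bak" or a missing literal dot does not
is_txt <- str_detect(x, ".*?\\.txt$")
is_txt  # TRUE TRUE FALSE FALSE
```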

4.4. \\d{2}/\\d{2}/\\d{4}

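This regular expression describes two digits followed by a forward slash, two more digits, another forward slash, and four more digits, in other words, a date written as mm/dd/yyyy or dd/mm/yyyy, such as 12/21/1994 or 18/09/2016. The pattern checks only the layout of digits and slashes, not whether the values form a valid date.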
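
Because the pattern looks only at the arrangement of digits and slashes, it cannot tell a real date from an impossible one, and it will find a date embedded in longer text. A quick sketch with made-up strings (x and dates are illustrative names):

```r
library(stringr)

x <- c("12/21/1994", "99/99/9999", "1/2/1994", "on 18/09/2016 at noon")

# "99/99/9999" matches despite not being a date; "1/2/1994" lacks the two-digit fields
dates <- str_extract(x, "\\d{2}/\\d{2}/\\d{4}")
dates  # "12/21/1994" "99/99/9999" NA "18/09/2016"
```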

4.5. <(.+?)>.+?</\\1>

This regular expression describes an XML or HTML element with matching start and end tags, no attributes, and content at least one character long. The backreference \\1 requires the end tag to repeat exactly what the start tag captured. An example is <data> 123 </data>.
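
The effect of the backreference can be sketched with a few made-up elements: mismatched tags, a start tag with attributes, and an empty element all fail to match. The names x and is_element are illustrative.

```r
library(stringr)

x <- c("<data> 123 </data>", "<b>bold</i>", "<a href='x'>link</a>", "<p></p>")

# Only the first string has an end tag that repeats the captured start tag
# and at least one character of content in between
is_element <- str_detect(x, "<(.+?)>.+?</\\1>")
is_element  # TRUE FALSE FALSE FALSE
```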

In the following code, the examples are checked to verify that they conform to the provided regular expressions:

examples <- c("958$", "pear", "folder/file.txt", "12/21/1994", "18/09/2016", "<data> 123 </data>")

#4.1
str_extract(examples, "[0-9]+\\$")
## [1] "958$" NA     NA     NA     NA     NA
#4.2
str_extract(examples, "\\b[a-z]{1,4}\\b")
## [1] NA     "pear" "file" NA     NA     "data"
#4.3
str_extract(examples, ".*?\\.txt$")
## [1] NA                NA                "folder/file.txt" NA               
## [5] NA                NA
#4.4
str_extract(examples, "\\d{2}/\\d{2}/\\d{4}")
## [1] NA           NA           NA           "12/21/1994" "18/09/2016"
## [6] NA
#4.5
str_extract(examples, "<(.+?)>.+?</\\1>")
## [1] NA                   NA                   NA                  
## [4] NA                   NA                   "<data> 123 </data>"


Problem 9

9. The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

hidden <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

unlist(str_extract_all(hidden, "[A-Z]"))
##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"
unlist(str_extract_all(hidden, "[a-z]"))
##   [1] "c" "l" "c" "o" "p" "o" "w" "z" "m" "s" "t" "c" "d" "w" "n" "k" "i"
##  [18] "g" "v" "d" "i" "c" "p" "u" "g" "g" "v" "h" "r" "y" "n" "j" "u" "w"
##  [35] "c" "z" "i" "h" "q" "r" "f" "p" "x" "s" "j" "d" "w" "p" "n" "a" "n"
##  [52] "w" "o" "w" "i" "s" "d" "i" "j" "j" "k" "p" "f" "d" "r" "c" "o" "c"
##  [69] "b" "t" "y" "c" "z" "j" "a" "t" "a" "o" "o" "t" "j" "t" "j" "n" "e"
##  [86] "c" "f" "e" "k" "r" "w" "w" "w" "o" "j" "i" "g" "d" "v" "r" "f" "r"
## [103] "b" "z" "b" "k" "n" "b" "h" "z" "g" "v" "i" "z" "c" "r" "o" "p" "w"
## [120] "g" "n" "b" "q" "o" "f" "a" "o" "t" "f" "b" "w" "m" "k" "t" "s" "z"
## [137] "q" "e" "f" "y" "n" "d" "t" "k" "c" "f" "g" "m" "c" "g" "x" "o" "n"
## [154] "h" "k" "g" "r"
unlist(str_extract_all(hidden, "[0-9]"))
##  [1] "1" "0" "8" "7" "7" "9" "2" "8" "5" "5" "0" "7" "8" "0" "3" "5" "3"
## [18] "0" "7" "5" "5" "3" "3" "6" "4" "1" "1" "6" "2" "2" "4" "9" "0" "5"
## [35] "6" "5" "1" "7" "2" "4" "6" "3" "9" "5" "8" "9" "6" "5" "9" "4" "9"
## [52] "0" "5" "4" "5"

The hidden message is revealed by extracting just the uppercase characters from the given string as below:

code <- str_c(unlist(str_extract_all(hidden, "[A-Z]")), collapse = "")
code
## [1] "CONGRATULATIONSYOUAREASUPERNERD"
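
The hint that some characters are more revealing than others may also cover punctuation. As a sketch that goes one step beyond the solution above, keeping punctuation alongside the uppercase letters recovers the word breaks. The names code_punct and message_text are illustrative, and the hidden string is repeated so the block runs on its own.

```r
library(stringr)

hidden <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# Keep uppercase letters and punctuation, then join them into one string
code_punct <- str_c(unlist(str_extract_all(hidden, "[[:upper:][:punct:]]")), collapse = "")

# The periods act as word separators, so swap them for spaces
message_text <- str_replace_all(code_punct, "\\.", " ")
message_text
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD!"
```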