Data 607 Week 3 Assignment

Automated Data Collection with R, Ch. 8: Regular expressions and essential string functions

This week’s assignment applies regular expressions and the string manipulation functions in Hadley Wickham’s stringr package to problems 3, 4, and 9 from Chapter 8 of Automated Data Collection with R (2015).

Load required package

library(stringr)

Problem 3

3. Copy the introductory example. The vector name stores the extracted names.

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"   

name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
phone <- unlist(str_extract_all(raw.data, "\\(?(\\d{3})?\\)?(-| )?\\d{3}(-| )?\\d{4}"))

name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
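The phone vector is created above but never displayed. As a quick check, the sketch below reruns the same extraction on the same raw.data string and prints the result; nothing here is new apart from showing the output.

```r
library(stringr)

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Optional area code (with or without parentheses), then a 3-digit and a
# 4-digit group, each optionally preceded by a hyphen or a space.
phone <- unlist(str_extract_all(raw.data, "\\(?(\\d{3})?\\)?(-| )?\\d{3}(-| )?\\d{4}"))
phone
```

The six numbers come back in the same order as the six names, which is what later allows the two vectors to be combined row by row.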

3.1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

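# Split "Last, First" entries on the comma; names without a comma stay whole in column 1
splitname <- str_split(name, ", ", simplify = TRUE)
# Put the second piece in front of the first (the pieces are joined without a separator)
first_last <- str_c(splitname[, 2], splitname[, 1])
# Restore the missing space where a lowercase letter meets an uppercase one
first_last <- str_replace(first_last, "([a-z])([A-Z])", "\\1 \\2")
first_last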

## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
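The same reordering can probably be done in one step with capture groups and backreferences. This is only a sketch of an alternative, not the approach used above; the name first_last_alt is made up for illustration, and the pattern assumes each element contains at most one comma.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Capture the text before and after ", " and write the two groups back in
# reverse order. Elements without a comma do not match and stay unchanged.
first_last_alt <- str_replace(name, "^(.+), (.+)$", "\\2 \\1")
first_last_alt
```

Because the replacement inserts the space itself, no follow-up step is needed to separate the first and last names.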

no_title <- str_replace(first_last, "[[:alpha:]]{2,}\\. ", "")
no_title

## [1] "Moe Szyslak"         "C. Montgomery Burns" "Timothy Lovejoy"    
## [4] "Ned Flanders"        "Homer Simpson"       "Julius Hibbert"

first_name <- str_extract(no_title, "^[[:alpha:]]+\\.?")
last_name <- str_extract(first_last, "[[:alpha:]]+$")

no_mid_name <- str_c(first_name, last_name, sep = " ")
no_mid_name

## [1] "Moe Szyslak"     "C. Burns"        "Timothy Lovejoy" "Ned Flanders"   
## [5] "Homer Simpson"   "Julius Hibbert"

3.2. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

title_test <- str_detect(first_last, "[[:alpha:]]{2,}\\.")
title_test

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

title <- str_extract(first_last, "[[:alpha:]]{2,}\\.")
title

## [1] NA     NA     "Rev." NA     NA     "Dr."
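The pattern above treats any run of two or more letters followed by a period as a title, which is why the initial "C." is not flagged. If the set of titles is known in advance, an explicit list may be more robust. The sketch below assumes a hypothetical whitelist, known_titles, containing only the two titles that occur in this data.

```r
library(stringr)

first_last <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
                "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Hypothetical whitelist of titles (an assumption; extend for other data)
known_titles <- c("Rev", "Dr")

# Builds the pattern ^(Rev|Dr)\. so only listed titles at the start match
title_pattern <- str_c("^(", str_c(known_titles, collapse = "|"), ")\\.")
has_title <- str_detect(first_last, title_pattern)
has_title
```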

3.3. Construct a logical vector indicating whether a character has a second name.

Assuming “second name” refers to surname:

last_test <- str_detect(first_last, "[[:alpha:]]+$")
last_test

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

last_name

## [1] "Szyslak"  "Burns"    "Lovejoy"  "Flanders" "Simpson"  "Hibbert"

Assuming “second name” refers to middle name:

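# Match a word with a space on each side, unless the leading space directly follows a two- or three-letter title
second_test <- str_detect(first_last, "(?<!([[:alpha:]]{2,3}\\.)) [[:alpha:]]+ ")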
second_test

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

second_name <- str_trim(str_extract(first_last, "(?<!([[:alpha:]]{2,3}\\.)) [[:alpha:]]+ "))
second_name

## [1] NA           "Montgomery" NA           NA           NA          
## [6] NA
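A simpler cross-check for the middle-name interpretation is to count the words that remain once titles are removed. The sketch below assumes the title-free names from earlier (retyped here so the block runs on its own) and, like the lookbehind version, treats "C." as a first name and "Montgomery" as a middle name.

```r
library(stringr)

no_title <- c("Moe Szyslak", "C. Montgomery Burns", "Timothy Lovejoy",
              "Ned Flanders", "Homer Simpson", "Julius Hibbert")

# Count runs of non-whitespace characters, i.e. words
n_words <- str_count(no_title, "\\S+")

# More than two words means something sits between first and last name
has_middle <- n_words > 2
has_middle
```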

Combining the characters’ names and phone numbers into a data frame:

pnum <- "\\(?([0-9]{3})?\\)?[- ]?([0-9]{3})[- ]?([0-9]{4})"

# Split each phone number once into area code, exchange, and line number
phone_parts <- str_match(phone, pnum)
names_df <- data.frame(last_name, first_name, middle_name = second_name, title, area_code = phone_parts[, 2], phone = str_c(phone_parts[, 3], "-", phone_parts[, 4]))
                  
knitr::kable(names_df)

last_name	first_name	middle_name	title	area_code	phone
Szyslak	Moe	NA	NA	NA	555-1239
Burns	C.	Montgomery	NA	636	555-0113
Lovejoy	Timothy	NA	Rev.	NA	555-6542
Flanders	Ned	NA	NA	NA	555-8904
Simpson	Homer	NA	NA	636	555-3226
Hibbert	Julius	NA	Dr.	NA	555-3642
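The area_code and phone columns come from the capture groups of the pnum pattern. As an illustration, the sketch below applies str_match to two of the phone numbers, one with an area code and one without; the variable name parts is made up for this example.

```r
library(stringr)

pnum <- "\\(?([0-9]{3})?\\)?[- ]?([0-9]{3})[- ]?([0-9]{4})"

# Column 1 is the full match; columns 2 to 4 are the three capture groups.
# An optional group that does not take part in the match is returned as NA.
parts <- str_match(c("(636) 555-0113", "555-1239"), pnum)
parts
```

The NA in the second row is what produces the missing area codes in the table.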

Problem 4

4. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

4.1. [0-9]+\\$

This regular expression describes a string of one or more digits followed by a dollar sign, for example, 958$.
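The backslash matters here: an unescaped $ is the end-of-string anchor rather than a literal dollar sign. A small sketch contrasting the two, using made-up test strings:

```r
library(stringr)

amounts <- c("958$", "958", "$958")

# Escaped: digits must be followed by a literal dollar sign
literal_dollar <- str_detect(amounts, "[0-9]+\\$")

# Unescaped: digits must sit at the very end of the string
anchor_dollar <- str_detect(amounts, "[0-9]+$")

literal_dollar
anchor_dollar
```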

4.2. \\b[a-z]{1,4}\\b

This regular expression describes a standalone word of one to four lowercase letters, that is, a run of one to four lowercase letters with a word boundary on each side. An example is pear.
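To see the effect of the word boundaries, the sketch below runs the pattern over a made-up sentence. Words that are too long or that contain an uppercase letter are skipped entirely rather than partially matched.

```r
library(stringr)

sentence <- "a pear tree, Pears and plums"

# "Pears" fails on the uppercase P and "plums" on its length; neither can
# contribute a partial match because no boundary exists inside a word.
short_words <- str_extract_all(sentence, "\\b[a-z]{1,4}\\b")[[1]]
short_words
```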

4.3. .*?\\.txt$

This regular expression describes a string ending in “.txt”, preceded by any number of characters (including none), i.e., a file or path name with a .txt extension. An example is folder/file.txt.
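A few made-up file names illustrate the roles of the escaped period and the $ anchor:

```r
library(stringr)

files <- c("folder/file.txt", ".txt", "notes.txt.bak", "filetxt")

# ".txt" alone matches because .*? may match nothing; "notes.txt.bak" fails
# the end anchor; "filetxt" lacks the literal period.
is_txt <- str_detect(files, ".*?\\.txt$")
is_txt
```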

4.4. \\d{2}/\\d{2}/\\d{4}

This regular expression describes two digits, a forward slash, two more digits, another forward slash, and four digits, in other words, a date written as mm/dd/yyyy or dd/mm/yyyy. Examples are 12/21/1994 and 18/09/2016. The pattern checks only the layout, not whether the date is valid.
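Since the pattern is unanchored and only counts digits, it also accepts impossible dates and matches inside longer digit runs. A short sketch with made-up inputs:

```r
library(stringr)

date_pattern <- "\\d{2}/\\d{2}/\\d{4}"

# Layout is checked, validity is not; single-digit fields do not match
looks_like_date <- str_detect(c("12/21/1994", "99/99/9999", "1/2/1994"), date_pattern)

# Without anchors, the match can start and stop inside a longer string
embedded <- str_extract("112/21/19945", date_pattern)

looks_like_date
embedded
```

Wrapping the pattern in ^ and $ would be one way to require that the whole string is a date-shaped value.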

4.5. <(.+?)>.+?</\\1>

This regular expression describes an XML or HTML element with matching start and end tags, no attributes, and content at least one character long. The backreference \\1 requires the name in the end tag to repeat the name captured from the start tag. An example is <data> 123 </data>.
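The sketch below probes the pattern with a few made-up elements to show what the backreference and the .+? quantifiers rule out:

```r
library(stringr)

tag_pattern <- "<(.+?)>.+?</\\1>"

elements <- c("<data> 123 </data>",   # matching tags with content
              "<a>x</b>",             # end tag differs from start tag
              "<a href=\"u\">x</a>",  # attribute is not repeated in end tag
              "<p></p>")              # no content between the tags

tag_ok <- str_detect(elements, tag_pattern)

# The first capture group holds the tag name
tag_name <- str_match("<data> 123 </data>", tag_pattern)[, 2]

tag_ok
tag_name
```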

In the following code, the examples are checked to verify that they conform to the provided regular expressions:

examples <- c("958$", "pear", "folder/file.txt", "12/21/1994", "18/09/2016", "<data> 123 </data>")

#4.1
str_extract(examples, "[0-9]+\\$")

## [1] "958$" NA     NA     NA     NA     NA

#4.2
str_extract(examples, "\\b[a-z]{1,4}\\b")

## [1] NA     "pear" "file" NA     NA     "data"

#4.3
str_extract(examples, ".*?\\.txt$")

## [1] NA                NA                "folder/file.txt" NA               
## [5] NA                NA

#4.4
str_extract(examples, "\\d{2}/\\d{2}/\\d{4}")

## [1] NA           NA           NA           "12/21/1994" "18/09/2016"
## [6] NA

#4.5
str_extract(examples, "<(.+?)>.+?</\\1>")

## [1] NA                   NA                   NA                  
## [4] NA                   NA                   "<data> 123 </data>"

Problem 9

9. The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

hidden <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

unlist(str_extract_all(hidden, "[A-Z]"))

##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"

unlist(str_extract_all(hidden, "[a-z]"))

##   [1] "c" "l" "c" "o" "p" "o" "w" "z" "m" "s" "t" "c" "d" "w" "n" "k" "i"
##  [18] "g" "v" "d" "i" "c" "p" "u" "g" "g" "v" "h" "r" "y" "n" "j" "u" "w"
##  [35] "c" "z" "i" "h" "q" "r" "f" "p" "x" "s" "j" "d" "w" "p" "n" "a" "n"
##  [52] "w" "o" "w" "i" "s" "d" "i" "j" "j" "k" "p" "f" "d" "r" "c" "o" "c"
##  [69] "b" "t" "y" "c" "z" "j" "a" "t" "a" "o" "o" "t" "j" "t" "j" "n" "e"
##  [86] "c" "f" "e" "k" "r" "w" "w" "w" "o" "j" "i" "g" "d" "v" "r" "f" "r"
## [103] "b" "z" "b" "k" "n" "b" "h" "z" "g" "v" "i" "z" "c" "r" "o" "p" "w"
## [120] "g" "n" "b" "q" "o" "f" "a" "o" "t" "f" "b" "w" "m" "k" "t" "s" "z"
## [137] "q" "e" "f" "y" "n" "d" "t" "k" "c" "f" "g" "m" "c" "g" "x" "o" "n"
## [154] "h" "k" "g" "r"

unlist(str_extract_all(hidden, "[0-9]"))

##  [1] "1" "0" "8" "7" "7" "9" "2" "8" "5" "5" "0" "7" "8" "0" "3" "5" "3"
## [18] "0" "7" "5" "5" "3" "3" "6" "4" "1" "1" "6" "2" "2" "4" "9" "0" "5"
## [35] "6" "5" "1" "7" "2" "4" "6" "3" "9" "5" "8" "9" "6" "5" "9" "4" "9"
## [52] "0" "5" "4" "5"

The hidden message is revealed by extracting only the uppercase letters from the given string and collapsing them into a single string:

code <- str_c(unlist(str_extract_all(hidden, "[A-Z]")), collapse = "")
code

## [1] "CONGRATULATIONSYOUAREASUPERNERD"
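The string also contains a few periods and an exclamation mark, which appear to act as word separators and closing punctuation. As a possible refinement, the sketch below keeps those characters and turns the periods into spaces; the variable names decoded and readable are illustrative only.

```r
library(stringr)

hidden <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# Keep uppercase letters plus the literal "." and "!" characters
decoded <- str_c(unlist(str_extract_all(hidden, "[A-Z.!]")), collapse = "")

# Treat each period as a word break
readable <- str_replace_all(decoded, fixed("."), " ")
readable
```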