DATA 607 - Week 3 Assignment

Problem 3: Copy the introductory example. The vector `name` stores the extracted names.

# load stringr package
library(stringr)

# copy example from the book
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5543642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

(a) Use the tools of this chapter to rearrange the vector so that all elements conform to the standard `first_name last_name`.

Reviewing the list of names, it appears there are two (or three) issues that we need to fix:
(i) names given in the form last_name, first_name
(ii) names with titles (Rev. and Dr.)
(iii) names with three parts (a possible issue, depending on the goal of the exercise).

First let’s fix issue (i) by reversing the first and last names.

( name2 <- str_replace(name, "([:alpha:]+), ([:graph:]+)", "\\2 \\1") )

## [1] "Moe Szyslak"          "C. Burns Montgomery"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
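
The pattern above moves only the first token after the comma, which is why “Burns, C. Montgomery” becomes “C. Burns Montgomery”. As a minimal sketch of one alternative (the name `name2_alt` is ours, not from the book), we could capture everything after the comma so that the surname lands at the end:

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Hypothetical variant: group 1 = surname before the comma,
# group 2 = everything after the comma (not just the first token)
name2_alt <- str_replace(name, "^([[:alpha:]]+), (.+)$", "\\2 \\1")
name2_alt
```

With this variant the second element reads “C. Montgomery Burns”; names without a comma are left unchanged.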

Next let’s fix issue (ii). Let’s assume that any title consists of two or more non-space characters and ends with a period; further, a title at the beginning of the name is followed by a space (e.g., Ms., Prof.), and a title at the end of the name is preceded by a comma and a space (e.g., Esq., Ph.D.).

# define the regular expression for titles; use the "graph" character class since there may be punctuation in the middle of the title
title <- "([:graph:]{2,}\\. )|(, [:graph:]{2,}\\.)"
# replace any instances of titles with ""
( name3 <- str_replace_all(name2, title, "") )

## [1] "Moe Szyslak"         "C. Burns Montgomery" "Timothy Lovejoy"    
## [4] "Ned Flanders"        "Homer Simpson"       "Julius Hibbert"
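
To see how these assumptions play out for both title positions, here is a small sketch on made-up names (the vector `test_names` is hypothetical and not part of the exercise). The pattern is the same one as above, written with double-bracket POSIX classes:

```r
library(stringr)

# Same title pattern as above: leading "Xx. " or trailing ", Xx."
title <- "([[:graph:]]{2,}\\. )|(, [[:graph:]]{2,}\\.)"

# Hypothetical names (not from the book) covering both title positions
test_names <- c("Prof. Jane Doe", "John Smith, Esq.", "Jane Roe, Ph.D.",
                "A. B. Carter", "J.R. Ewing")

# Remove every match of the title pattern
cleaned <- str_replace_all(test_names, title, "")
cleaned
```

Single initials such as “A.” survive because the pattern needs at least two characters before the period. The last case shows a limitation of the assumption: initials written without a space (“J.R.”) also satisfy the pattern and are stripped, leaving only “Ewing”.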

Finally, there’s the issue of “C. Burns Montgomery”, which has three names. (The swap in step (i) moved only “C.” ahead of the surname, so “Burns” now sits in the middle position.) If keeping all three names is acceptable, then we’re done; but if the goal is to list only the first and last tokens, then we need to remove the middle one (“Burns”).

# use the "graph" character class to allow for initial letters with periods ("C.")
( name4 <- str_replace(name3, "([:graph:]+) ([:graph:]+) ([:graph:]+)", "\\1 \\3") )

## [1] "Moe Szyslak"     "C. Montgomery"   "Timothy Lovejoy" "Ned Flanders"   
## [5] "Homer Simpson"   "Julius Hibbert"
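
The replacement above assumes exactly three tokens. As a hedged sketch, a variant that keeps the first and last tokens however many lie between them could look like this (the input vector and the name `first_last` are hypothetical):

```r
library(stringr)

# Hypothetical input: cleaned names with two, three, and four tokens
cleaned <- c("Moe Szyslak", "C. Burns Montgomery", "Mary Ann Lee Smith")

# Group 1 = first token, group 2 = last token;
# ".+" absorbs whatever lies between them
first_last <- str_replace(cleaned, "^([[:graph:]]+) .+ ([[:graph:]]+)$", "\\1 \\2")
first_last
```

Two-token names do not match the pattern (it requires two spaces) and pass through unchanged.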

(b) Construct a logical vector indicating whether a character has a title (i.e., `Rev.` and `Dr.`).

As before, define a title as a run of two or more non-space characters ending with a period. This time we don’t have to worry about the position of the title (whether at the beginning of the name or at the end), since we’re only detecting it and not replacing any characters in the name string.

title2 <- "[:graph:]{2,}\\."
( title_logical <- str_detect(name, title2) )

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

name[title_logical]

## [1] "Rev. Timothy Lovejoy" "Dr. Julius Hibbert"

str_view(name, title2)
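
A stricter alternative, sketched here under the assumption that the set of possible titles is known in advance, is to match against an explicit list. The vector `known_titles` is our own illustrative choice, not something given in the book:

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Assumed (hypothetical) list of honorifics; extend as needed
known_titles <- c("Rev", "Dr", "Mr", "Mrs", "Ms", "Prof")

# Build "^(Rev|Dr|...)\\. ": a listed title at the start, then period and space
title_strict <- str_c("^(", str_c(known_titles, collapse = "|"), ")\\. ")

has_title <- str_detect(name, title_strict)
has_title
```

On this data the result agrees with `title_logical`; the trade-off is that titles missing from the list go undetected.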

(c) Construct a logical vector indicating whether a character has a second name.

In this case, assume that we’ve converted the name vector to first_name last_name format and have removed all titles, so we can work with the name3 vector above. Now to find whether a character has a second name, we match on any middle names.

# define a regular expression for middle names = space followed by "graph" characters (may include period) followed by space
mid_name <- " [:graph:]+ "
( second_name_logical <- str_detect(name3, mid_name) )

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

name3[second_name_logical]

## [1] "C. Burns Montgomery"

str_view(name3, mid_name)
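
As a cross-check on the pattern-based approach, we could also count the tokens in each name; anything with more than two tokens has a second name. A minimal sketch, with `name3` re-entered by hand so that the block runs on its own:

```r
library(stringr)

# name3 as produced in part (a)
name3 <- c("Moe Szyslak", "C. Burns Montgomery", "Timothy Lovejoy",
           "Ned Flanders", "Homer Simpson", "Julius Hibbert")

# Count runs of non-space characters, i.e. whitespace-delimited tokens
n_tokens <- str_count(name3, "[[:graph:]]+")

# More than two tokens implies a second name
has_second <- n_tokens > 2
has_second
```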

Problem 4: Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

(a) [0-9]+\$

Answer: one or more digits immediately followed by a literal dollar sign (“$”)
Example:

str_view_all("ab$c123., 456$jkl :sdf7$89\\", "[0-9]+\\$")
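
Since `str_view_all()` only renders the matches in a viewer, a small sketch that extracts them as a character vector may make the result easier to inspect (the names `x` and `hits` are ours):

```r
library(stringr)

x <- "ab$c123., 456$jkl :sdf7$89\\"

# Pull out every run of digits that is immediately followed by "$"
hits <- str_extract_all(x, "[0-9]+\\$")[[1]]
hits
```

Only “456$” and “7$” qualify; “123” and “89” are digit runs, but neither is followed by a dollar sign.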

(b) \b[a-z]{1,4}\b

Answer: a whole word of one to four lowercase letters (a-z), delimited on both sides by word boundaries
Example:

str_view_all("This is the premise; that is the conclusion.", "\\b[a-z]{1,4}\\b")
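
To list the matching words explicitly, here is a short sketch on the same sentence (the variable names are ours):

```r
library(stringr)

sentence <- "This is the premise; that is the conclusion."

# Whole words of 1-4 lowercase letters; "\\b" marks the word boundaries
short_words <- str_extract_all(sentence, "\\b[a-z]{1,4}\\b")[[1]]
short_words
```

“This” is skipped because of its capital letter, and “premise” and “conclusion” are skipped because they are longer than four letters.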

(c) .*?\.txt$

Answer: any character sequence (0 or more characters) followed by “.txt”, occurring at the end of the string
Example:

str_view_all(c("txt", ".txt", "name.txt", "name.txt csv", "first,last.txt", "random : name.txt"), ".*?\\.txt$")
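
The same test strings can be checked with `str_detect()`, which returns one logical value per string. A sketch, with `files` as our own name for the vector:

```r
library(stringr)

files <- c("txt", ".txt", "name.txt", "name.txt csv",
           "first,last.txt", "random : name.txt")

# TRUE when the string ends in ".txt" (with anything, or nothing, before it)
ends_in_txt <- str_detect(files, ".*?\\.txt$")
ends_in_txt
```

“txt” fails because the literal period is missing, and “name.txt csv” fails because “.txt” is not at the end of the string.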

(d) \d{2}/\d{2}/\d{4}

Answer: two digits, then “/”, then two digits, then “/”, then four digits (e.g., a date in “mm/dd/yyyy” format); the pattern checks only the shape of the string, not whether the date is valid
Example:

str_view_all(c("on 12/15/2014, ", "2014/12/15", " 6/8/2018", "06/8/2018", "06/08/2018"), "\\d{2}/\\d{2}/\\d{4}")
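
A quick sketch with `str_detect()` shows which of the test strings contain a match. The last element, “99/99/9999”, is an extra case we added to illustrate that the pattern constrains only the digit counts:

```r
library(stringr)

dates <- c("on 12/15/2014, ", "2014/12/15", " 6/8/2018",
           "06/8/2018", "06/08/2018", "99/99/9999")

# TRUE when the string contains dd/dd/dddd somewhere
looks_like_date <- str_detect(dates, "\\d{2}/\\d{2}/\\d{4}")
looks_like_date
```

Single-digit months or days fail, as does the year-first layout, while the impossible date “99/99/9999” still matches.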

(e) <(.+?)>.+?</\1>

Answer: a “<”, a sequence of one or more characters, and a “>” (an opening tag); then one or more further characters; then “</”, the same sequence that was captured in the opening tag, and “>” (the matching closing tag), as in an HTML or XML element
Example:

v <- c("<abc>123</abc>", "abc>123</abc>", "<></>", "<>x</>", "< > a</ >", "<ab ,12>xyz<ab ,12>", "<abc > x</abc >")
str_view_all(v, "<(.+?)>.+?</\\1>")
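
To make the role of the backreference visible, the sketch below checks each test string with `str_detect()` and then uses `str_match()` to pull out the captured tag name (the variable names `is_element` and `tag` are ours):

```r
library(stringr)

v <- c("<abc>123</abc>", "abc>123</abc>", "<></>", "<>x</>",
       "< > a</ >", "<ab ,12>xyz<ab ,12>", "<abc > x</abc >")
pattern <- "<(.+?)>.+?</\\1>"

# TRUE when an opening tag, some content, and the same closing tag are present
is_element <- str_detect(v, pattern)
is_element

# Column 2 of str_match() holds the first capture group, i.e. the tag name
tag <- str_match(v[1], pattern)[1, 2]
tag
```

Empty tags (“<>”) fail because `.+?` needs at least one character, and “<ab ,12>xyz<ab ,12>” fails because the closing tag lacks the “/”.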

Problem 9: The following code hides a secret message. Crack it with R and regular expressions.

Let’s extract the capital letters and punctuation from the code.

code <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

code %>% str_extract_all("[A-Z[:punct:]]") %>% unlist() %>% str_c(collapse = "")

## [1] "CONGRATULATIONS.YOU.ARE.A.SUPERNERD!"
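
The periods in the extracted string appear to act as word separators. As an optional cosmetic step (our assumption about the intended reading, not part of the exercise), they can be swapped for spaces:

```r
library(stringr)

# The message extracted above
msg <- "CONGRATULATIONS.YOU.ARE.A.SUPERNERD!"

# fixed() treats "." as a literal period rather than a regex wildcard
readable <- str_replace_all(msg, fixed("."), " ")
readable
```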

Kevin Benson

September 13, 2018
