DATA 607: Week 3 Assignment

Introductory Example

library(stringr)
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
names <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
names

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

Question 3

3.a Rearrange the vector so that all elements conform to the standard first_name last_name

first_name_extract <- unlist(str_extract_all(names, "\\w+\\s|[, ]\\s\\w+"))
first_name_split <- unlist(str_split(first_name_extract, ",[[:blank:]]{1}"))
first_names <- unlist(str_extract_all(first_name_split, "\\w+"))
first_names

## [1] "Moe"     "C"       "Timothy" "Ned"     "Homer"   "Julius"

last_name_extract <- unlist(str_extract_all(names, "[^[:punct:]]\\s\\w+$|\\w+[,]"))
last_name_split <- unlist(str_split(last_name_extract, "[[:blank:]]{1}"))
last_names <- unlist(str_extract_all(last_name_split, "[[:alpha:]][[:alpha:]]+"))
last_names

## [1] "Szyslak"  "Burns"    "Lovejoy"  "Flanders" "Simpson"  "Hibbert"

Simpsons_Characters <- data.frame(first_names, last_names)
Simpsons_Characters

##   first_names last_names
## 1         Moe    Szyslak
## 2           C      Burns
## 3     Timothy    Lovejoy
## 4         Ned   Flanders
## 5       Homer    Simpson
## 6      Julius    Hibbert
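The data frame above keeps first and last names in separate columns. If a single character vector in first_name last_name order is wanted instead, one possible approach is to swap the parts around the comma and then drop any leading title. This is only a sketch: char_names is a hypothetical copy of names (renamed so it does not mask base R's names()), and unlike the extraction above it keeps the full "C. Montgomery" rather than reducing it to "C".

```r
library(stringr)

# Hypothetical copy of the extracted names vector
char_names <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
                "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Swap "Last, First" into "First Last"; names without a comma are left as they are
reordered <- str_replace(char_names, "^(.+), (.+)$", "\\2 \\1")

# Drop a leading title: two or more letters followed by a period and whitespace
standardized <- str_replace(reordered, "^[[:alpha:]]{2,}\\.\\s+", "")
standardized
```

The title pattern requires at least two letters before the period, so the initial "C." is not mistaken for a title.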

3.b Construct a logical vector indicating whether a character has a title

str_detect(names, "[A-Za-z][A-Za-z][.]")

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

titled_name <- unlist(str_extract_all(names, "\\w+[.]\\s\\w+\\s\\w+"))
titled_name

## [1] "Rev. Timothy Lovejoy" "Dr. Julius Hibbert"

3.c Construct a logical vector indicating whether a character has a second name.

str_detect(names, "\\s\\w[.]\\s\\w+")

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
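The two checks in 3.b and 3.c can also be written with slightly stricter patterns. The sketch below assumes that a title always appears at the start of the name and that a second name is written as a single uppercase initial followed by a period; char_names is a hypothetical copy of names.

```r
library(stringr)

# Hypothetical copy of the extracted names vector
char_names <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
                "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Title: two or more letters followed by a period, anchored at the start of the name
has_title <- str_detect(char_names, "^[[:alpha:]]{2,}\\.")

# Second name: one uppercase letter standing alone, followed by a period
has_second_name <- str_detect(char_names, "\\b[[:upper:]]\\.")

data.frame(char_names, has_title, has_second_name)
```

Anchoring the title pattern with ^ means an abbreviation later in the name would not be counted as a title.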

Question 4

Describe the types of strings that conform to the following regular expressions, and construct an example that is matched by each regular expression.

(a) [0-9]+\$

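Answer: One or more digits followed by a literal dollar sign. The pattern is not anchored, so it can match anywhere in a string (for example, the 67$ in 67$2).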

sample4a <- c("456$", "87", "7802$", "hello", "telephone$", "45%", "0190$", "$67", "67$2")
str_detect(sample4a, "[0-9]+\\$")

## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE
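To see which part of each string is matched, str_extract() can be used in place of str_detect(). The anchored pattern at the end is an added variant for comparison, not part of the original question.

```r
library(stringr)

examples_4a <- c("456$", "87", "$67", "67$2")

# Show the matched portion; NA means no match
matched_4a <- str_extract(examples_4a, "[0-9]+\\$")
matched_4a

# Anchored variant: the whole string must be digits followed by a dollar sign
whole_4a <- str_detect(examples_4a, "^[0-9]+\\$$")
whole_4a
```

With the anchors in place, "67$2" no longer qualifies because of the trailing digit.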

(b) \b[a-z]{1,4}\b

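Answer: A standalone word of 1 to 4 lowercase letters, bounded on both sides by word boundaries. Digits and uppercase letters count as word characters, so "vba3" and "May" do not match.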

sample4b <- c("far", "May","vba3", "b", "hello")
str_detect(sample4b, "\\b[a-z]{1,4}\\b")

## [1]  TRUE FALSE FALSE  TRUE FALSE
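Because the pattern is delimited by word boundaries rather than string anchors, it can pick short words out of a longer sentence. A small sketch with a made-up sentence:

```r
library(stringr)

sentence <- "the quick brown fox jumps over a lazy dog"

# Every standalone lowercase word of 1 to 4 letters
short_words <- str_extract_all(sentence, "\\b[a-z]{1,4}\\b")[[1]]
short_words

# str_detect() is TRUE as soon as one such word occurs anywhere in the string
any_short <- str_detect("hello far", "\\b[a-z]{1,4}\\b")
any_short
```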

(c) .*?\.txt$

Answer: Any string that ends in .txt. The lazy .*? matches zero or more of any character, as few as possible, so the extension may be preceded by any characters or by none at all.

sample4c <- c("hello", "dsfjh", "hello.txt", "dsfjh.txt", "123.txt", "%^*(&%^.txt", ".txt")
str_detect(sample4c, ".*?\\.txt$")

## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
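The effect of the lazy quantifier is easier to see without the $ anchor. The sketch below uses made-up file names and compares the lazy .*? with the greedy .* on the same input.

```r
library(stringr)

files <- c("notes.txt", ".txt", "notes.txt.bak", "notestxt")

# Only strings that end in a literal ".txt" match the original pattern
ends_txt <- str_detect(files, ".*?\\.txt$")
ends_txt

# Without the anchor, lazy stops at the first ".txt" and greedy runs to the last
lazy_match <- str_extract("a.txt.txt", ".*?\\.txt")
greedy_match <- str_extract("a.txt.txt", ".*\\.txt")
c(lazy = lazy_match, greedy = greedy_match)
```

With the $ anchor present, lazy and greedy versions accept the same strings; laziness only changes how the engine gets there.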

(d) \d{2}/\d{2}/\d{4}

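Answer: Two digits, a slash, two digits, a slash, and four digits, which is the shape of a date such as 04/24/1995. The pattern checks only the format, not whether the date is valid.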

sample4d <- c("67/89/0984", "445/2346/982357", "53/75/1893", "04.24.1995", "04/24/1995")
str_detect(sample4d, "\\d{2}/\\d{2}/\\d{4}")

## [1]  TRUE FALSE  TRUE FALSE  TRUE
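Since the pattern has no anchors, it also matches a date-shaped substring inside a longer string. The sketch below shows this on made-up inputs; the anchored pattern is an added variant for comparison.

```r
library(stringr)

date_pattern <- "\\d{2}/\\d{2}/\\d{4}"

# Pull the date-shaped substring out of surrounding text
found_date <- str_extract("Due 04/24/1995 at noon", date_pattern)
found_date

# Extra digits on either side do not prevent a match
loose <- str_detect("104/24/19955", date_pattern)

# Anchored variant: the whole string must have the dd/dd/dddd shape
strict <- str_detect("104/24/19955", "^\\d{2}/\\d{2}/\\d{4}$")
c(loose = loose, strict = strict)
```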

(e) <(.+?)>.+?</\1>

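Answer: An opening tag, at least one character of content, and a closing tag with the same name. The backreference \1 requires the closing tag to repeat whatever the first group captured.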

sample4e <- c("<header>Hello World</header>", "<header>Hello World<header>")
str_detect(sample4e, "<(.+?)>.+?</\\1>")

## [1]  TRUE FALSE
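The captured tag name can be inspected with str_match(). In the sketch below a second capture group is added around the content; that group is an addition for illustration and is not part of the original pattern. The input strings are made up.

```r
library(stringr)

html <- c("<header>Hello World</header>", "<b>bold</i>", "<p></p>")

# Column 1 is the full match, column 2 the tag name, column 3 the content
tag_parts <- str_match(html, "<(.+?)>(.+?)</\\1>")
tag_parts
```

Only the first string matches. The second fails because the closing tag differs from the opening one, and the third fails because .+? needs at least one character between the tags.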

Question 9

The following code hides a secret message. Crack it with R and regular expressions.

secret_message <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

break_message <- unlist(str_extract_all(secret_message, "\\w+[[:punct:]]"))
message <- unlist(str_extract_all(break_message, "[[:upper:]]|[[:punct:]]"))
message_readable <- paste(message, collapse = "")
message_readable

## [1] "CONGRATULATIONS.YOU.ARE.A.SUPERNERD!"
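The intermediate split into punctuation-terminated chunks is not strictly required: extracting uppercase letters and punctuation in a single pass gives the same kind of result. The sketch below demonstrates this on demo_message, a short made-up string built in the same style (it is not the assignment's message), and then turns the periods into spaces for readability.

```r
library(stringr)

# Hypothetical stand-in for secret_message, same hiding scheme
demo_message <- "xh3Hq7eIw.kT0Hp4Ez2nRrE!gr"

# Keep only uppercase letters and punctuation, in order
kept <- str_extract_all(demo_message, "[[:upper:][:punct:]]")[[1]]
decoded <- paste(kept, collapse = "")
decoded

# Periods act as word separators in the hidden text
readable <- str_replace_all(decoded, fixed("."), " ")
readable
```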

Michele Bradley

9/14/2017