CUNY 607 Homework Assignment 3

Load libraries.

library(stringr)

Copy the introductory example. The vector name stores the extracted names.

raw.data <- "555-1239Moe Szyslak( 636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

# Several adjustments are needed to the name vector.
# First, remove the titles and initials (any run of letters followed by a period).
name.2 <- str_trim(sub("[A-Za-z]{1,}\\.","",name))
name.2

## [1] "Moe Szyslak"        "Burns,  Montgomery" "Timothy Lovejoy"   
## [4] "Ned Flanders"       "Simpson, Homer"     "Julius Hibbert"

# Next, swap "last name, first name" into "first name last name".
# The sub function is used again, this time with capture groups.
# Reference: http://stackoverflow.com/questions/33826650/last-name-first-name-to-first-name-last-name

name.3 <- sub("(\\w+),\\s+(\\w+)","\\2 \\1", name.2 )
name.3

## [1] "Moe Szyslak"      "Montgomery Burns" "Timothy Lovejoy" 
## [4] "Ned Flanders"     "Homer Simpson"    "Julius Hibbert"
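The two steps above can also be sketched with stringr functions only. This is a minimal variant, not a replacement for the code above; the names no.title and first.last are made up for illustration, and the name vector is retyped so that the snippet runs on its own.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Step 1: drop a title or initial (letters followed by a period) and any spaces after it.
no.title <- str_trim(str_replace(name, "[[:alpha:]]+\\.\\s*", ""))

# Step 2: swap "Last, First" into "First Last"; names without a comma are left unchanged.
first.last <- str_replace(no.title, "^(\\w+),\\s*(\\w+)$", "\\2 \\1")
first.last
```

Removing the spaces together with the initial avoids the double space seen in "Burns,  Montgomery" above.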

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

title <- str_detect(name, "[A-Za-z]{2,}\\.")
title.df <- data.frame(Name = name, Has_Title = title)
knitr::kable(title.df, caption = "Person has title?")

Person has title?
Name	Has_Title
Moe Szyslak	FALSE
Burns, C. Montgomery	FALSE
Rev. Timothy Lovejoy	TRUE
Ned Flanders	FALSE
Simpson, Homer	FALSE
Dr. Julius Hibbert	TRUE
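The {2,} quantifier is what separates titles from initials here. The sketch below compares it with a looser {1,} quantifier on the same names; the column names Loose and Strict are made up for illustration, and the rule that a title has at least two letters is an assumption that happens to hold for this data.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# {1,} also matches the one-letter initial "C.", so Burns is flagged as having a title.
loose <- str_detect(name, "[A-Za-z]{1,}\\.")

# {2,} needs at least two letters before the period, which excludes initials.
strict <- str_detect(name, "[A-Za-z]{2,}\\.")

data.frame(Name = name, Loose = loose, Strict = strict)
```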

Construct a logical vector indicating whether a character has a second name.

# Count the spaces, since spaces separate names: a person with a second name has 2 spaces rather than 1. The titles are removed first (into a new vector) so that they are not counted as names.
name.4 <- str_trim(sub("[A-Za-z]{2,}\\.","",name))
space.count <- str_count(name.4, "\\s")
space.count

## [1] 1 2 1 1 1 1

# Subtract 1 so that the counts become 1's and 0's, which ifelse() can turn into a logical vector.
space.count <- space.count - 1

# Now to create a logical vector.
Two.Name <- ifelse(space.count,TRUE,FALSE)

# And now to create a data frame and display it as a table.
Two.Name.df <- data.frame(Name = name, Two_Names = Two.Name)
knitr::kable(Two.Name.df, caption = "Does this person have more than 1 name?")

Does this person have more than 1 name?
Name	Two_Names
Moe Szyslak	FALSE
Burns, C. Montgomery	TRUE
Rev. Timothy Lovejoy	FALSE
Ned Flanders	FALSE
Simpson, Homer	FALSE
Dr. Julius Hibbert	FALSE
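A shorter route to the same logical vector is to count the words once the titles are gone and compare the count directly. This is only a sketch: name.no.title and has.second.name are illustrative names, and it assumes that every person has exactly one first name and one last name besides any second name.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Remove titles (two or more letters followed by a period), as above.
name.no.title <- str_trim(str_replace(name, "[A-Za-z]{2,}\\.", ""))

# More than two words left means there is a second name; the comparison is already logical.
has.second.name <- str_count(name.no.title, "\\w+") > 2
has.second.name
```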

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

[0-9]+\$

This matches one or more digits (given the [0-9]+) immediately followed by a literal $ sign (given the \$). The match does not have to span the whole string, so other characters may come before the digits. An example would be 534221097$.

x <- c("43234$", "adbd3222$", "$33432", "1$")
str_extract_all(x, "[0-9]+\\$")

## [[1]]
## [1] "43234$"
## 
## [[2]]
## [1] "3222$"
## 
## [[3]]
## character(0)
## 
## [[4]]
## [1] "1$"
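The backslash matters in this pattern: without it, the $ is an end-of-string anchor rather than a literal character. A small sketch of the difference, with made-up test strings and the illustrative names anchored and literal:

```r
library(stringr)

x <- c("100", "100$")

# Unescaped $ is an end-of-string anchor, so this asks for digits at the very end.
anchored <- str_detect(x, "[0-9]+$")

# Escaped \\$ is a literal dollar sign, so this asks for digits followed by "$".
literal <- str_detect(x, "[0-9]+\\$")

data.frame(x, anchored, literal)
```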

\b[a-z]{1,4}\b

This regex matches a lowercase word of 1 to 4 letters with a word boundary (\b) on each side. An example would be “ usa ”.

x <- c(" usa ", "usa1", "United States")
str_extract_all(x, "\\b[a-z]{1,4}\\b")

## [[1]]
## [1] "usa"
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
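To see the word boundaries at work on more than one word, the pattern can be run over a short sentence. The sentence is made up for illustration.

```r
library(stringr)

sentence <- "the quick brown fox is a cat"

# Only whole lowercase words of 1 to 4 letters are returned; "quick" and "brown"
# have 5 letters, and no shorter piece of them sits between two word boundaries.
short.words <- str_extract_all(sentence, "\\b[a-z]{1,4}\\b")[[1]]
short.words
```

This is also why "usa1" fails above: a digit counts as a word character, so there is no boundary between the "a" and the "1".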

.*?\.txt$

This matches a string that ends in “.txt”, such as the name of a text file. The .*? lazily matches any characters (possibly none) before the “.txt”, and the $ sign anchors the match to the end of the string. An example would be “hello.txt”.

x <- c("hello.txt", "hello.txt!", "datascience.pdf")
str_extract_all(x, ".*?\\.txt$")

## [[1]]
## [1] "hello.txt"
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
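With the $ anchor in place, the lazy .*? and the greedy .* end up matching the same text. The difference only shows once the anchor is dropped, as in this sketch with a made-up string:

```r
library(stringr)

x <- "a.txt b.txt"

# Lazy: stops at the first ".txt" it can reach.
lazy <- str_extract(x, ".*?\\.txt")

# Greedy: runs on to the last ".txt" in the string.
greedy <- str_extract(x, ".*\\.txt")

c(lazy = lazy, greedy = greedy)
```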

\d{2}/\d{2}/\d{4}

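This matches two digits, a slash, two digits, a slash, and four digits. It is most likely meant for a date with a two digit month, two digit day, and four digit year, although the pattern itself does not check that the digits form a valid date. An example would be 01/01/2000.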

x <- c("06/30/00", "01/01/2000", "June 1st, 2017")
str_extract_all(x, "\\d{2}/\\d{2}/\\d{4}")

## [[1]]
## character(0)
## 
## [[2]]
## [1] "01/01/2000"
## 
## [[3]]
## character(0)
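Because \d{2} accepts any two digits, impossible dates also match. The sketch below compares the pattern with a stricter one that limits the month to 01-12 and the day to 01-31. The stricter pattern is my own assumption of what a month-first date check could look like; it still does not know how many days each month has.

```r
library(stringr)

x <- c("01/01/2000", "12/31/1999", "13/01/2000", "99/99/9999")

# The pattern from the exercise: any digits in the right positions will do.
loose <- str_detect(x, "\\d{2}/\\d{2}/\\d{4}")

# Assumed stricter variant: month 01-12, day 01-31, anchored to the whole string.
strict <- str_detect(x, "^(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/\\d{4}$")

data.frame(x, loose, strict)
```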

<(.+?)>.+?</\1>

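This matches a pair of HTML-style tags with at least one character between them. The (.+?) captures the tag name, and the \1 is a backreference to that first group, so the closing tag must repeat the same name. An example would be <tag>Hello</tag>.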

x <- c("<tag>Hello</tag>", "<fire>house</fire>", "<oreo cream oreo>", "<fire>house<fire>", "<a>a</a>")
str_extract_all(x, "<(.+?)>.+?</\\1>")

## [[1]]
## [1] "<tag>Hello</tag>"
## 
## [[2]]
## [1] "<fire>house</fire>"
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)
## 
## [[5]]
## [1] "<a>a</a>"
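To look at what the first group actually captures, str_match() returns the full match together with each capture group. A small sketch with made-up strings:

```r
library(stringr)

x <- c("<b>bold</b>", "<fire>house<fire>")

# Column 1 is the full match, column 2 is the text captured by (.+?).
m <- str_match(x, "<(.+?)>.+?</\\1>")
m
```

The second string gives NA in both columns because it has no closing tag with a slash, so the backreference never gets a chance to match.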

9. The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk! gr

data.code <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk! gr"
secret <- unlist(str_extract_all(data.code, "[[:upper:].]+"))
secret

##  [1] "C"  "O"  "N"  "G"  "R"  "A"  "T"  "U"  "L"  "AT" "I"  "O"  "N"  "S" 
## [15] "."  "Y"  "O"  "U"  "."  "A"  "R"  "E"  "."  "A"  ".S" "U"  "P"  "E" 
## [29] "R"  "N"  "E"  "R"  "D"

message <- str_replace_all(paste0(secret, collapse = ''), "[.]", " ")
message

## [1] "CONGRATULATIONS YOU ARE A SUPERNERD"
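The same message can also be reached by deleting everything that is not wanted instead of extracting what is. The sketch below additionally keeps the exclamation mark, which is my own choice and not part of the solution above; message.2 is an illustrative name.

```r
library(stringr)

data.code <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk! gr"

# Delete every character that is not an uppercase letter, a period, or "!".
kept <- str_remove_all(data.code, "[^[:upper:].!]")

# The periods stand in for the spaces between words.
message.2 <- str_replace_all(kept, fixed("."), " ")
message.2
```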

Joel Park

2/14/2017