DATA607

library(stringr)
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

3.1 Use the tools of this chapter to rearrange the vector so that all elements conform to the standard

first_name last_name

# Split each name on ", " and bind the pieces into a data frame with one column per name. Names without a comma yield a single piece, which data.frame() recycles into both rows of that column
split_names <- data.frame(str_split(name, ", "))
final_names <- list()

# For each name, compare the two rows of its column. If they hold the same value, the name had no comma, so copy it into the final list as is. Otherwise the name was stored as "last, first", so copy the second row followed by the first row into the final list.

for (i in seq_along(name)){
  if (split_names[1,i] == split_names[2,i]){
    final_names[i] <- paste(split_names[1,i])
  } else{
    final_names[i] <- paste(split_names[2,i],split_names[1,i])
  }
}
unlist(final_names)

## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
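
As a cross-check on the loop above, the same reordering can be sketched in one vectorized call with a backreference replacement. This is an alternative, not the approach used above. The vector `names_raw` is a hand-typed copy of `name`, and `names_fixed` is a name I made up for the result, so that the snippet runs by itself.

```r
library(stringr)

# Hand-typed copy of the extracted names (same values as `name` above)
names_raw <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
               "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Capture the text before and after ", " and swap the two groups.
# Names without a comma do not match the pattern and come back unchanged.
names_fixed <- str_replace(names_raw, "^(.+), (.+)$", "\\2 \\1")
names_fixed
```

This avoids the intermediate data frame, at the cost of assuming each name contains at most one ", " separator.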

3.2 Construct a logical vector indicating whether a character has a title (i.e., Rev and Dr.).

# To do this we look for a word break (start of a word) followed by 2 or 3 alphabetical characters, a period and a space. A single-letter initial such as "C." is too short to match
str_detect(unlist(final_names),"\\b[[:alpha:]]{2,3}\\. ")

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
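
To see what the pattern actually latches onto, a small sketch with `str_extract` can return the matched title instead of just TRUE/FALSE. The vector `names_fixed` is a hand-typed copy of the reordered names, and `titles` is a name chosen only for this illustration.

```r
library(stringr)

# Hand-typed copy of the reordered names from 3.1
names_fixed <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
                 "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Extract the matched title and trim its trailing space.
# Names without a title give NA.
titles <- str_trim(str_extract(names_fixed, "\\b[[:alpha:]]{2,3}\\. "))
titles
```

The initial "C." is not picked up because the pattern requires at least two letters before the period.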

3.3 Construct a vector indicating whether a character has a second name.

Here we look at two possibilities:

(a) The full name contains a single-letter initial followed by a period, standing in for either the first or the middle name.
(b) The full name contains three full names (not counting titles, which are followed by periods).

str_detect(unlist(final_names),"\\b[[:alpha:]]\\. [[:alpha:]]|\\b[[:alpha:]]+[[:space:]][[:alpha:]]+[[:space:]][[:alpha:]]+\\b")

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
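
Since none of the six names exercises the second alternative, here is a rough sketch that tests the two halves of the pattern separately. The name "Homer Jay Simpson" is a hypothetical entry added only for illustration, and the variable names are my own.

```r
library(stringr)

# The last entry is a hypothetical name added only to exercise the second pattern
test_names <- c("C. Montgomery Burns", "Rev. Timothy Lovejoy", "Homer Jay Simpson")

# First alternative: a single-letter initial followed by a period and another name
initial_pattern <- "\\b[[:alpha:]]\\. [[:alpha:]]"
# Second alternative: three full words separated by single whitespace characters
three_words_pattern <- "\\b[[:alpha:]]+[[:space:]][[:alpha:]]+[[:space:]][[:alpha:]]+\\b"

has_initial <- str_detect(test_names, initial_pattern)
has_three_words <- str_detect(test_names, three_words_pattern)

# Combining the two gives the same answer as the single pattern with "|"
has_second_name <- has_initial | has_three_words
has_second_name
```

The title "Rev." does not count as a full word here because the period, not a space, follows it.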

4. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

[0-9]+\\$

This expression matches any string that contains one or more digits followed immediately by a literal dollar symbol ($), with no space allowed between them. What comes before the digits or after the dollar symbol is irrelevant.

string4_1 <- c("He placed a 500$ bet on SpaceX being the first to land a rocket on Mars", "but not 500 $", "or $500")
str_detect(string4_1,"[0-9]+\\$")

## [1]  TRUE FALSE FALSE

\\b[a-z]{1,4}\\b

This expression identifies whole words consisting of one to four lower-case letters.

string4_2 <- "This sentence contains four words which meet that slim criteria, 1234. 123"
str_extract_all(string4_2,"\\b[a-z]{1,4}\\b")

## [[1]]
## [1] "four" "meet" "that" "slim"

.*?\\.txt$

This requires the string to end with “.txt” (specifically lower case), preceded by any number of characters (including none).

string4_3 <- c("final_project_56_&^%%$.txt", "not final_project.pdf", "or even final.TXT")
str_detect(string4_3,".*?\\.txt$")

## [1]  TRUE FALSE FALSE
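
Because "any number of characters (including none)" can always be satisfied, the leading `.*?` should not change what `str_detect` reports. The sketch below checks that on the same three strings. The case-insensitive variant at the end is my own hypothetical extension, not part of the exercise.

```r
library(stringr)

files <- c("final_project_56_&^%%$.txt", "not final_project.pdf", "or even final.TXT")

# Detection with and without the lazy ".*?" prefix
with_prefix <- str_detect(files, ".*?\\.txt$")
without_prefix <- str_detect(files, "\\.txt$")
identical(with_prefix, without_prefix)

# Hypothetical variation: accept upper-case extensions as well
any_case <- str_detect(files, regex("\\.txt$", ignore_case = TRUE))
any_case
```

The prefix does matter for `str_extract`, where it decides how much of the string is returned along with the extension.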

\\d{2}/\\d{2}/\\d{4}

This expression matches any string that contains a sequence of 10 characters of a specific shape: 2 digits followed by a forward slash “/”, followed by 2 digits, followed by another forward slash and finally 4 more digits. Whatever immediately precedes or follows this sequence is irrelevant. This seems to be a test for the presence of a date in either DD/MM/YYYY or MM/DD/YYYY format.

string4_4 <- c("The scheduled delivery date is 23/11/2020", "Not February 5th, 2020", "but also things like 12/23/11/2020/43/4563456" )
str_detect(string4_4,"\\d{2}/\\d{2}/\\d{4}")

## [1]  TRUE FALSE  TRUE
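
The pattern only checks the shape of the text, not whether it is a real date or whether it stands alone. A minimal sketch of the difference, using an anchored variant that is my own addition and made-up test strings:

```r
library(stringr)

candidates <- c("23/11/2020", "99/99/9999", "12/23/11/2020/43/4563456")

# Original pattern: matches anywhere inside the string
loose <- str_detect(candidates, "\\d{2}/\\d{2}/\\d{4}")
loose

# Anchored variant: the whole string must be exactly two digits, slash,
# two digits, slash, four digits
strict <- str_detect(candidates, "^\\d{2}/\\d{2}/\\d{4}$")
strict
```

Even the anchored version accepts "99/99/9999", so a real date check would need something beyond this regular expression.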

<(.+?)>.+?</\\1>

This expression looks for a standard XML-style pair of “open” and “close” tags with the same name (the backreference \\1 repeats whatever the first group captured) surrounding any content of at least length 1.

string4_5 <- c("Here is <head>test</head>", " not <head></head>")
str_detect(string4_5,"<(.+?)>.+?</\\1>")

## [1]  TRUE FALSE
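
To make the role of the backreference visible, here is a small sketch with `str_match`, which returns the full match plus each captured group. The second capture group around the content and the mismatched-tag string are my own additions for illustration.

```r
library(stringr)

tags <- c("Here is <head>test</head>", "mismatched <b>text</i>")

# Column 1 is the full match, column 2 the tag name, column 3 the content.
# The second string gives NA because the closing tag does not repeat "b".
m <- str_match(tags, "<(.+?)>(.+?)</\\1>")
m
```

The closing tag has to repeat the captured name exactly, which is why `</i>` cannot close `<b>`.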

9. The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

Since the hint indicated the importance of certain types of characters, let’s start by pulling out characters by type.

string9 <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"


str_extract_all(string9,"[:digit:]")

## [[1]]
##  [1] "1" "0" "8" "7" "7" "9" "2" "8" "5" "5" "0" "7" "8" "0" "3" "5" "3"
## [18] "0" "7" "5" "5" "3" "3" "6" "4" "1" "1" "6" "2" "2" "4" "9" "0" "5"
## [35] "6" "5" "1" "7" "2" "4" "6" "3" "9" "5" "8" "9" "6" "5" "9" "4" "9"
## [52] "0" "5" "4" "5"

str_extract_all(string9,"[:lower:]")

## [[1]]
##   [1] "c" "l" "c" "o" "p" "o" "w" "z" "m" "s" "t" "c" "d" "w" "n" "k" "i"
##  [18] "g" "v" "d" "i" "c" "p" "u" "g" "g" "v" "h" "r" "y" "n" "j" "u" "w"
##  [35] "c" "z" "i" "h" "q" "r" "f" "p" "x" "s" "j" "d" "w" "p" "n" "a" "n"
##  [52] "w" "o" "w" "i" "s" "d" "i" "j" "j" "k" "p" "f" "d" "r" "c" "o" "c"
##  [69] "b" "t" "y" "c" "z" "j" "a" "t" "a" "o" "o" "t" "j" "t" "j" "n" "e"
##  [86] "c" "f" "e" "k" "r" "w" "w" "w" "o" "j" "i" "g" "d" "v" "r" "f" "r"
## [103] "b" "z" "b" "k" "n" "b" "h" "z" "g" "v" "i" "z" "c" "r" "o" "p" "w"
## [120] "g" "n" "b" "q" "o" "f" "a" "o" "t" "f" "b" "w" "m" "k" "t" "s" "z"
## [137] "q" "e" "f" "y" "n" "d" "t" "k" "c" "f" "g" "m" "c" "g" "x" "o" "n"
## [154] "h" "k" "g" "r"

str_extract_all(string9,"[:upper:]")

## [[1]]
##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"

# Here's the secret message!
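
To read the message as text rather than as a vector of single letters, the pieces can be collapsed into one string. The sketch below also keeps the "." and "!" characters, on my assumption that the periods mark the word breaks. The variable names are made up for this illustration.

```r
library(stringr)

string9 <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# Collapse the upper-case letters into a single string
letters_only <- str_c(unlist(str_extract_all(string9, "[[:upper:]]")), collapse = "")
letters_only

# Keep "." and "!" too, then turn each period into a space (assumed word break)
with_breaks <- str_c(unlist(str_extract_all(string9, "[[:upper:].!]")), collapse = "")
message_text <- str_replace_all(with_breaks, fixed("."), " ")
message_text
```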

DATA607_HW3

Misha Kollontai

9/10/2019
