Author: Romerl Elizes

Question III

Copy the introductory example. The vector name stores the extracted names.

library(stringr)
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
raw.data

## [1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

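# First names: either a word followed by a space, or everything after ", "
# in the "last, first" form. The period is escaped so it matches a literal ".".
first_name_raw <- unlist(str_extract_all(name, "\\w* |, \\w*(\\. \\w*|$)"))
# Titles such as "Rev." leave behind a lone space; drop those entries.
first_name_raw <- first_name_raw[first_name_raw != " "]
first_name_cand <- gsub(",", "", first_name_raw)
first_name <- str_trim(first_name_cand)

# Last names: either the final word, or the word directly before a comma.
last_name_raw <- str_extract(name, " \\w*$|\\w*,")
last_name_cand <- gsub(",", "", last_name_raw)
last_name <- str_trim(last_name_cand)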

data.frame(first_name = first_name, last_name = last_name)

##      first_name last_name
## 1           Moe   Szyslak
## 2 C. Montgomery     Burns
## 3       Timothy   Lovejoy
## 4           Ned  Flanders
## 5         Homer   Simpson
## 6        Julius   Hibbert

Definitely not the best solution, but it does the job.

For first_name, I extracted every substring matching either a word followed by a space OR a comma and a space followed by the rest of the name (a word, optionally followed by a period, a space, and another word). The intermediate result is a vector of first names that still contains commas and surrounding spaces, plus a lone " " entry for each title. I dropped the lone spaces, used gsub to remove the commas, and used str_trim to strip the leading and trailing whitespace from each first_name string.

For last_name, I extracted the first substring matching either a space followed by the final word of the string or a word followed by a comma. The intermediate result is a vector of last names that still contains commas and leading spaces. I used gsub to remove the commas and str_trim to strip the leading and trailing whitespace from each last_name string.

The final result is a display of the data frame with first_name and last_name columns and their values.
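
To make the intermediate results described above visible, here is a minimal sketch that prints the raw extractions before any cleaning. It types out name by hand so that it runs on its own, and it escapes the period in the first-name pattern so that it matches only a literal period; both are choices made for this illustration.

```r
library(stringr)

# The six names from the introductory example, typed out so the sketch is self-contained.
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Raw first-name candidates: trailing spaces, a leading ", ",
# and a lone " " left behind after each title are all still present.
first_name_raw <- unlist(str_extract_all(name, "\\w* |, \\w*(\\. \\w*|$)"))
print(first_name_raw)

# Raw last-name candidates: a leading space or a trailing comma remains.
last_name_raw <- str_extract(name, " \\w*$|\\w*,")
print(last_name_raw)
```

Printing the raw vectors this way shows why the filtering, gsub, and str_trim steps are all needed before the names can go into the data frame.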

2. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

title_vector <- grepl("\\w{2,}\\. ",name)
data.frame(first_name = first_name, last_name = last_name, title = title_vector)

##      first_name last_name title
## 1           Moe   Szyslak FALSE
## 2 C. Montgomery     Burns FALSE
## 3       Timothy   Lovejoy  TRUE
## 4           Ned  Flanders FALSE
## 5         Homer   Simpson FALSE
## 6        Julius   Hibbert  TRUE

For the logical vector title, I used grepl to find any run of at least 2 word characters that is followed by a period and a space. If the pattern is found, the result is TRUE; otherwise, it is FALSE.

The final result is a display of the data frame with first_name, last_name, and title columns and their values.
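
As a quick check on the title pattern, the sketch below shows which substring triggers each match. The variable names title_pattern, has_title, and matched are introduced here for illustration only, and name is typed out by hand so the block is self-contained.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Two or more word characters, then a period, then a space.
title_pattern <- "\\w{2,}\\. "

has_title <- grepl(title_pattern, name)

# str_extract returns the matching substring, or NA where there is no match.
matched <- str_extract(name, title_pattern)

print(data.frame(name = name, has_title = has_title, matched = matched))
```

The initial in "Burns, C. Montgomery" is not flagged because only one word character precedes its period.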

3. Construct a logical vector indicating whether a character has a second name.

second_name_vector <- grepl("\\w{1}\\. ",first_name)
second_name_vector

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

data.frame(first_name = first_name, last_name = last_name, title = title_vector, secondname = second_name_vector)

##      first_name last_name title secondname
## 1           Moe   Szyslak FALSE      FALSE
## 2 C. Montgomery     Burns FALSE       TRUE
## 3       Timothy   Lovejoy  TRUE      FALSE
## 4           Ned  Flanders FALSE      FALSE
## 5         Homer   Simpson FALSE      FALSE
## 6        Julius   Hibbert  TRUE      FALSE

For the logical vector secondname, I used grepl to find a single word character that is followed by a period and a space. Because the search runs on first_name, where the titles have already been removed, only an initial such as "C. " can match. If the pattern is found, the result is TRUE; otherwise, it is FALSE.

The final result is a display of the data frame with first_name, last_name, title, and secondname columns and their values.
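
The choice of input vector matters for this pattern. The following sketch, with name and first_name typed out by hand, compares the result on first_name with the result on the full names. The word-boundary variant at the end is an addition for illustration, not part of the solution above.

```r
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")
first_name <- c("Moe", "C. Montgomery", "Timothy", "Ned", "Homer", "Julius")

# On first_name, only the initial "C. " matches.
on_first <- grepl("\\w{1}\\. ", first_name)

# On the full names, the last letter of each title also matches ("v. " and "r. ").
on_full <- grepl("\\w{1}\\. ", name)

# A leading word boundary restricts the match to one-letter words,
# so the pattern can be applied to the full names directly.
on_full_bounded <- grepl("\\b\\w\\. ", name)

print(rbind(on_first, on_full, on_full_bounded))
```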

Question IV

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

All answers were derived from experimentation and liberal use of the grepl function, which was pretty useful.

1. [0-9]+\\$

a <- c("Frank","3Reginald","Robert4","Lisa36","8","345","8$","A6S12$","85$")
Four_1_vector <- grepl("[0-9]+\\$",a)
Four_1_vector

## [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

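The pattern matches any string that contains one or more digits followed immediately by a literal $. In all three TRUE examples the $ happens to be the last character, but the pattern does not require that.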
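
A few extra test strings make the behavior concrete. This is a small sketch; the vector examples and the anchored variant with a trailing $ are introduced here for comparison only.

```r
examples <- c("85$", "85$ each", "$85", "abc$")

# Original pattern: digits followed directly by a literal dollar sign, anywhere in the string.
contains_match <- grepl("[0-9]+\\$", examples)

# Anchored variant: the literal dollar sign must also be the last character.
ends_with_match <- grepl("[0-9]+\\$$", examples)

print(data.frame(examples, contains_match, ends_with_match))
```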

2. \\b[a-z]{1,4}\\b

b <- c("Frank","3Reginald","Robert4","Lisa","Reginald abc Rupert", "Reginald abc","Reginald abcde","Reginald abc3")
Four_2_vector <- grepl("\\b[a-z]{1,4}\\b",b)
Four_2_vector

## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE

The pattern matches any string that contains a standalone word of 1 to 4 lowercase letters, with a word boundary on each side. No preceding word is required; "Lisa" returns FALSE only because of its uppercase L. In the example above, I also demonstrated two FALSE returns: a second word with 5 lowercase letters, and a second word whose letters are followed directly by a digit.
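
The role of case and of the word boundaries can be checked with a handful of short strings. The vector examples below is made up for this sketch.

```r
examples <- c("abc", "lisa", "Lisa", "abcde", "abc3", "Hello to you")

# A run of 1 to 4 lowercase letters with a word boundary on each side.
short_word <- grepl("\\b[a-z]{1,4}\\b", examples)

print(data.frame(examples, short_word))
```

"Hello to you" matches through the word "to", even though its first word does not qualify.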

3. .*?\\.txt$

c <- c("Frank.txt","3Reginald.txt","Robert4","Lisa.txtx")
Four_3_vector <- grepl(".*?\\.txt$",c)
Four_3_vector

## [1]  TRUE  TRUE FALSE FALSE

The pattern matches any string that ends with .txt; the lazy .*? in front allows any characters, or none at all, to precede it. I added one that ended in .txtx and it returned FALSE.
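
Since .*? can match nothing, it is reasonable to expect that the pattern behaves like the shorter \\.txt$ when used with grepl. The sketch below checks that on a few made-up strings; examples and the shorter pattern are illustrations, not part of the exercise.

```r
examples <- c(".txt", "notes.txt", "my notes (v2).txt", "notes.txt.bak", "notes txt")

# Pattern from the exercise.
with_prefix <- grepl(".*?\\.txt$", examples)

# Shorter pattern that only checks the ending.
ending_only <- grepl("\\.txt$", examples)

print(data.frame(examples, with_prefix, ending_only))
```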

4. \\d{2}/\\d{2}/\\d{4}

d <- c("Frank","February 22, 1965","3/4/18","03/04/2019","12/31/2018")
Four_4_vector <- grepl("\\d{2}/\\d{2}/\\d{4}",d)
Four_4_vector

## [1] FALSE FALSE FALSE  TRUE  TRUE

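The pattern matches any string that contains a date-like substring of the form nn/nn/nnnn (two digits, a slash, two digits, a slash, four digits). I used long date format and short date format to demonstrate that the logical values returned are FALSE. Only strings containing the prescribed pattern will return TRUE; the digits themselves are not checked for being a valid date.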
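
Because the pattern is not anchored, it also matches inside longer strings. The sketch below compares it with an anchored variant; the vector examples and the ^...$ version are additions for illustration.

```r
examples <- c("Due 03/04/2019.", "99/99/9999", "3/4/2019", "03/04/2019")

# Pattern from the exercise: matches anywhere in the string.
anywhere <- grepl("\\d{2}/\\d{2}/\\d{4}", examples)

# Anchored variant: the whole string must be nn/nn/nnnn.
whole_string <- grepl("^\\d{2}/\\d{2}/\\d{4}$", examples)

print(data.frame(examples, anywhere, whole_string))
```

Neither version rejects "99/99/9999", which has the right shape but is not a real date.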

5. <(.+?)>.+?</\\1>

e <- c("Frank","<html>something</html>", "<docNumber>herewego</nothis>", "<docNumber></docNumber>","<docNumber> </docNumber>")
Four_5_vector <- grepl("<(.+?)>.+?</\\1>",e)
Four_5_vector

## [1] FALSE  TRUE FALSE FALSE  TRUE

The pattern matches any string that contains an XML-style pair of tags, <tag>someword</tag>, where the closing tag repeats the name captured from the opening tag. someword must not be empty. In the example above, <docNumber></docNumber> returns FALSE, but <docNumber> </docNumber> returns TRUE because there is a space between the opening and closing tags.
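
To see what the backreference \\1 refers to, str_match from stringr can return the captured tag name alongside the full match. This is a minimal sketch on made-up strings; examples and m are names introduced here.

```r
library(stringr)

examples <- c("<b>bold</b>", "<b>bold</i>", "<a><b>x</b></a>")

# Column 1 holds the full match, column 2 the tag name captured by (.+?).
# Rows without a match are filled with NA.
m <- str_match(examples, "<(.+?)>.+?</\\1>")

print(m)
```

The second string has no match because the closing tag name differs from the captured opening tag name.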

Question IX (OPTIONAL)

As of 9/10/2018 - I have tried everything I could in the time available. No luck.

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others!

clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr

rawstring <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

enc2native(rawstring)

## [1] "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

DATA607 - Assignment 3