Assignment 2

Scott Ogden

Raw Data:

raw.data <- raw.data<-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

Problem 3:

Copy the introductory example. The vector “name”" stores the extracted names.

Name

library("stringr")
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name

#Note: Used documentation and examples from http://www.endmemo.com/program/R/sub.php as a guide for replacement

#Remove titles
name_notitle<-sub("[A-z]{2,3}\\. ","",name)

#Swap names seperated by commas
name_swap <- sub("(\\w+),\\s(\\w+)","\\2 \\1", name_notitle)
#Now swap names seperated by period
name_swap2<-sub("(\\w+)\\s(\\w+).\\s(\\w+)","\\1. \\3 \\2", name_swap)

name_swap2

## [1] "Moe Szyslak"         "C. Montgomery Burns" "Timothy Lovejoy"    
## [4] "Ned Flanders"        "Homer Simpson"       "Julius Hibbert"

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.)

#Use the remove title logic and "Detect String" Function
title<-str_detect(name,"[A-z]{2,3}\\. ")
#Join
data.frame(name_swap2,title)

##            name_swap2 title
## 1         Moe Szyslak FALSE
## 2 C. Montgomery Burns FALSE
## 3     Timothy Lovejoy  TRUE
## 4        Ned Flanders FALSE
## 5       Homer Simpson FALSE
## 6      Julius Hibbert  TRUE

Construct a logical vector indicating whether a character has a second name

count <- str_count( name_swap2, "\\S+" )
data.frame(name_swap2,count>2)

##            name_swap2 count...2
## 1         Moe Szyslak     FALSE
## 2 C. Montgomery Burns      TRUE
## 3     Timothy Lovejoy     FALSE
## 4        Ned Flanders     FALSE
## 5       Homer Simpson     FALSE
## 6      Julius Hibbert     FALSE

Problem 4

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

[0-9]+\$

This will find numeric strings that have a dollar sign attached to the end.

a <- c("1234asd","12345$","$123as","sadsf$")
str_detect(a, "[0-9]+\\$")

## [1] FALSE  TRUE FALSE FALSE

\b[a-z]{1,4}\b

This will find the strings that have from exactly four lower case letters in them

b <- c("abcde","123aabc22$","Abcd","abcd")
str_detect(b, "\\b[a-z]{1,4}\\b")

## [1] FALSE FALSE FALSE  TRUE

.*?\.txt$

This will find strings that end in .txt. Because of the ? the preceeding txt is optional so it will find anything ending .txt.

c <- c("1234asd.txt",".txt","sa.txtdk$$","$")
str_detect(c, ".*?\\.txt$")

## [1]  TRUE  TRUE FALSE FALSE

\d{2}/\d{2}/\d{4}

This will find strings that are formatted like numbers seperated by “/”. Specifically, strings that have at least two digits followed by / then exactly two digits then / then exactly 4 digits.

d <- c("aa/bb/cccc","09/18/2016","9/18/2016","9999999/99/9999")
str_detect(d, "\\d{2}/\\d{2}/\\d{4}")

## [1] FALSE  TRUE FALSE  TRUE

<(.+?)>.+?</\1>

This will find strings that are enclosed by the following: ANYTHING ELSE . Basically carrots containing a some text contain some other string, much like html tags, with the appropriate “/” in the closing tag.

e <- c("<b> Hi </b>","asdasfa","<i>Italic</i>","<i>Failed<i>")
str_detect(e, "<(.+?)>.+?</\\1>")

## [1]  TRUE FALSE  TRUE FALSE

Problem9

The following code hides a secret message. Crack it with R and regular expressions.

code<-"clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# First find all capital letters
caps<-str_extract_all(code, "[[:upper:].]{1,}")
#unlist
u<-unlist(caps)
#concatenate all
pst<-paste(u, collapse ='')
#replace periods
result<-str_replace_all(pst,"[.]"," ")
result

## [1] "CONGRATULATIONS YOU ARE A SUPERNERD"