Assignment 3

Question 3

Copy the introductory example. The vector name stores the extracted names.

library(stringr)
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name # This was taken from the textbook.

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

a)

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

I am assuming that the instructions require honorifics to be removed as well. (For example, J.K. Rowling stays the same, but Ms. J.K. Rowling drops the Ms.)

name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

name2 <- str_replace_all(name, "(.+)(, .+)$", "\\2 \\1") # Change order
name2

## [1] "Moe Szyslak"           ", C. Montgomery Burns" "Rev. Timothy Lovejoy" 
## [4] "Ned Flanders"          ", Homer Simpson"       "Dr. Julius Hibbert"

name3 <- str_replace_all(name2, ", ", "") # Remove commas
name3

## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"

name4 <- str_replace_all(name3, "[A-Z][a-z]([a-z]?)\\.", "") # Remove honorifics.
name4

## [1] "Moe Szyslak"         "C. Montgomery Burns" " Timothy Lovejoy"   
## [4] "Ned Flanders"        "Homer Simpson"       " Julius Hibbert"
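
The output above still carries a leading space where "Rev." and "Dr." used to be. A minimal sketch of one way to avoid that, assuming titles are always one capital letter, one or two lowercase letters, and a period (`name_clean` is a name introduced here for illustration):

```r
library(stringr)

# Names as they stand after reordering and comma removal (name3 above).
name3 <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
           "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# [A-Z][a-z]{1,2} is equivalent to [A-Z][a-z]([a-z]?) used above.
# The added \\s* removes the whitespace that follows the title as well.
name_clean <- str_replace_all(name3, "[A-Z][a-z]{1,2}\\.\\s*", "")
name_clean
```

The initial "C." is left alone because the pattern requires at least one lowercase letter before the period.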

b)

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

Well, I regret not reading this question before putting all that effort into removing them in the first place.

library(knitr)
df <- data.frame(name3)
df$title <- str_detect(string = name3, pattern = "\\w{2,3}\\.")
kable(df)

name3	title
Moe Szyslak	FALSE
C. Montgomery Burns	FALSE
Rev. Timothy Lovejoy	TRUE
Ned Flanders	FALSE
Homer Simpson	FALSE
Dr. Julius Hibbert	TRUE

c)

Construct a logical vector indicating whether a character has a second name.

df$secname <- str_detect(string = name3, pattern = "[A-Z]{1}\\.")
kable(df)

name3	title	secname
Moe Szyslak	FALSE	FALSE
C. Montgomery Burns	FALSE	TRUE
Rev. Timothy Lovejoy	TRUE	FALSE
Ned Flanders	FALSE	FALSE
Homer Simpson	FALSE	FALSE
Dr. Julius Hibbert	TRUE	FALSE

It should be noted that this method is not very robust. If the data were more complex, an additional column could count the spaces that are not preceded by a period (to ignore the honorifics). That new column and the one I just created would then be combined, and if either one were TRUE, we would know that the character has a middle name.
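
A rough sketch of a sturdier check, as a variant of the idea above rather than a literal implementation of it: instead of counting spaces, it strips a leading title and counts the name parts that remain. The title pattern and the names `no_title`, `n_parts`, and `has_second_name` are assumptions made for this example.

```r
library(stringr)

name3 <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
           "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Assumption: a title sits at the start of the string and consists of a
# capital letter, one or two lowercase letters, and a period.
no_title <- str_replace(name3, "^[A-Z][a-z]{1,2}\\.\\s*", "")

# Count the whitespace-separated parts that are left. More than two parts
# means there is something besides a first name and a last name.
n_parts <- str_count(no_title, "\\S+")
has_second_name <- n_parts > 2
has_second_name
```

Unlike the single-initial pattern, this would also flag a second name that is written out in full.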

Question 4

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

(a) `[0-9]+\\$`

test <- c("gjkaef123892389$fsafsdlkj", "asfdlk$afsdlk", "234$123", "1234", "$23sf")
test1 <- unlist(str_extract_all(test, pattern = "[0-9]+\\$" ))
test1

## [1] "123892389$" "234$"

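These examples show that the expression matches one or more digits immediately followed by a literal dollar sign, in exactly that order. Digits that come after the $, or a $ with no digits in front of it, do not match.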
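
The backslashes matter here. A small sketch of the difference between the escaped `\\$` and a bare `$`, using one of the test strings from above (`escaped` and `anchored` are names made up for this example):

```r
library(stringr)

x <- "234$123"

# Escaped: \\$ is a literal dollar sign, so the digits before it match.
escaped <- str_extract(x, "[0-9]+\\$")

# Unescaped: $ is the end-of-string anchor, so the trailing digits match.
anchored <- str_extract(x, "[0-9]+$")

c(escaped = escaped, anchored = anchored)
```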

(b) `\\b[a-z]{1,4}\\b`

This matches a standalone word of one to four lowercase letters: a word boundary, one to four characters from a to z, and another word boundary. Longer words, digits, and uppercase letters do not match:

test <- c("abcdefg", "abe def hgi", "abdefkl asl 234", "1234", "WTF", "abc")
test1 <- str_extract_all(test, pattern = "\\b[a-z]{1,4}\\b" )
test1

## [[1]]
## character(0)
## 
## [[2]]
## [1] "abe" "def" "hgi"
## 
## [[3]]
## [1] "asl"
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## [1] "abc"

(c) `.*?\\.txt$`

This matches any string that ends in .txt, including a string that consists of nothing but .txt, though not txt without the dot. Because $ anchors the match to the end of the string, anything after .txt prevents a match. (Useful for finding files.)

test <- c("abcdefg.txt", "abe.txt def hgi.txt", "abdefkl.txt asl 234", "1234.txt", "WTF.txt", ".txt")
test1 <- str_extract_all(test, pattern = ".*?\\.txt$" )
test1

## [[1]]
## [1] "abcdefg.txt"
## 
## [[2]]
## [1] "abe.txt def hgi.txt"
## 
## [[3]]
## character(0)
## 
## [[4]]
## [1] "1234.txt"
## 
## [[5]]
## [1] "WTF.txt"
## 
## [[6]]
## [1] ".txt"
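
A short sketch of how strict the `$` anchor is, using made-up file names (the vector `files` is purely illustrative). Trimming whitespace first, here with `str_trim()`, is one possible way to tolerate a stray trailing space.

```r
library(stringr)

# Hypothetical file names: clean, trailing space, extra extension, no dot.
files <- c("notes.txt", "notes.txt ", "notes.txt.bak", "txt")

# Only strings whose very last characters are ".txt" match.
strict <- str_detect(files, ".*?\\.txt$")

# Removing surrounding whitespace first rescues the second entry.
trimmed <- str_detect(str_trim(files), ".*?\\.txt$")

data.frame(files, strict, trimmed)
```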

(d) `\\d{2}/\\d{2}/\\d{4}`

This matches two two-digit numbers and a four-digit number separated by forward slashes, which is useful for finding dates. The pattern only checks the shape, so an impossible date such as 23/42/2345 still matches. (I mean, you could at least tell her you were good at programming. It would also work for calendar dates, although some ambiguity remains because America and the rest of the world write the day and month in a different order.)

test <- c("23/42/2345", "abe def hgi", "12/13/234", "1/2/34", "WTF", "abc")
test1 <- str_extract_all(test, pattern = "\\d{2}/\\d{2}/\\d{4}")
test1

## [[1]]
## [1] "23/42/2345"
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)
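
Since the pattern accepts "23/42/2345", a follow-up check is needed if real dates are wanted. A minimal sketch, assuming a day/month/year order (that choice is an assumption, and swapping `%d` and `%m` gives the American reading); `candidates`, `shaped`, `parsed`, and `valid` are names introduced for this example:

```r
library(stringr)

candidates <- c("23/42/2345", "12/03/2017")

# Step 1: the regular expression checks the shape only.
shaped <- str_detect(candidates, "^\\d{2}/\\d{2}/\\d{4}$")

# Step 2: as.Date() returns NA when the pieces do not form a real date
# under the assumed day/month/year format.
parsed <- as.Date(candidates, format = "%d/%m/%Y")

valid <- shaped & !is.na(parsed)
valid
```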

(e) `<(.+?)>.+?</\\1>`

The pattern looks for “<”, then captures whatever sits inside the <>. After that there can be anything (at least one character), and finally the captured text must appear again inside <>, this time with a preceding /. This is meant to find paired HTML tags (or a really weird coincidence).

test <- c("<Probably not valid HTML> what </Probably not valid HTML>", "<12 not HTML> did not captialize the same </12 not html>", "12/13/234", "<12 not HTML> captialized the same </12 not HTML>", "WTF", "abc")
test1 <- str_extract_all(test, pattern = "<(.+?)>.+?</\\1>")
test1

## [[1]]
## [1] "<Probably not valid HTML> what </Probably not valid HTML>"
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
## 
## [[4]]
## [1] "<12 not HTML> captialized the same </12 not HTML>"
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)
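
One consequence of the backreference is worth a quick sketch: `\\1` must repeat everything captured between the angle brackets, so a tag with attributes is not recognised as a pair. The HTML snippets below are invented for illustration.

```r
library(stringr)

pattern <- "<(.+?)>.+?</\\1>"

snippets <- c("<b>bold</b>",            # simple pair: matches
              "<b>bold</i>",            # closing tag differs: no match
              '<a href="x">link</a>')   # capture includes the attribute: no match

matched <- str_detect(snippets, pattern)
matched
```

For the last snippet the captured group is `a href="x"`, which never reappears after `</`, so the pattern fails even though the HTML itself is fine.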

Question 9

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

code = "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8pf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

I had to look online; the message is spelled out by the capital letters, with periods standing in for spaces:

plain_txt <- unlist(str_extract_all(code, "[[:upper:].]{1,}"))

plain_txt <-  str_replace_all(string =  plain_txt, pattern =  "\\.", replacement = " ")
paste(plain_txt, collapse = "")

## [1] "CONGRATULATIONS YOU ARE A SUPERNERD"

Personally, I prefer this second secret message. (Pretend I didn’t write this code to find the “hidden” message.)

prefer <- c("y", "o","u", "\\.", "a","r","e", "\\.","t","h","e",
            "\\.","c","o","o","l","e","s","t")
y=c()
x=0
for(i in prefer){
  x = unlist(str_locate(code, i))
  y = c(y,x[1])
}
y

##  [1]  38   4  32 129  68  37 121 129  13  36 121 129   1   4   4   2 121
## [18]  12  13

#coolfunct <- 
#apply(prefer, 2, unlist(str_locate(code, )))

Here is the actual code :)

z=c()
zz=c()
for( i in y){
  z = unlist(str_sub(code, start = i, end = i))
  zz =c(zz,z)
}
zz

##  [1] "y" "o" "u" "." "a" "r" "e" "." "t" "h" "e" "." "c" "o" "o" "l" "e"
## [18] "s" "t"

better_message <-  str_replace_all(string =  zz, pattern =  "\\.", replacement = " ")
paste(better_message, collapse = "")

## [1] "you are the coolest"
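
The two loops above could probably be collapsed, since `str_locate()` and `str_sub()` are vectorised over their arguments. A sketch of that alternative (the names `pos`, `found`, and `msg` are introduced here; this is not the code that produced the output above):

```r
library(stringr)

code <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8pf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

prefer <- c("y", "o", "u", "\\.", "a", "r", "e", "\\.", "t", "h", "e",
            "\\.", "c", "o", "o", "l", "e", "s", "t")

# One call locates the first occurrence of every pattern in `code`.
pos <- str_locate(code, prefer)[, "start"]

# One call pulls out the single character at each of those positions.
found <- str_sub(code, pos, pos)

# Join the characters and turn the literal periods into spaces.
msg <- str_replace_all(paste(found, collapse = ""), fixed("."), " ")
msg
```

This avoids growing a vector inside a `for` loop, which matters little at this size but scales better.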

Week 3 Assignment Data 607

Kai Lukowiak

2017-09-11
