DATA607 - HW 3

Homework Chapter 8

Homework 3, manipulate the Simpsons data.

library(stringr)
snames <- c("Moe Szyslak","Burns, C. Montgomery","Rev. Timothy Lovejoy","Ned Flanders","Simpson, Homer","Dr. Julius Hibbert")

## 3.1 - Generate names First Last
# split vector using strsplit and then apply reversal 
strsplit(snames, split=", ")

## [[1]]
## [1] "Moe Szyslak"
## 
## [[2]]
## [1] "Burns"         "C. Montgomery"
## 
## [[3]]
## [1] "Rev. Timothy Lovejoy"
## 
## [[4]]
## [1] "Ned Flanders"
## 
## [[5]]
## [1] "Simpson" "Homer"  
## 
## [[6]]
## [1] "Dr. Julius Hibbert"

# apply over the split list: reverse the pieces of each name and paste them back together with a space
first_last <- sapply(strsplit(snames, split=", "),function(y) {paste(rev(y),collapse=" ")})
print(first_last)

## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
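The same reordering can also be sketched with a single regular expression instead of a split and reverse. This is only an alternative, assuming every reversed name contains exactly one ", " separator; the name `first_last_alt` is made up for illustration.

```r
library(stringr)

snames <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
            "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# capture the part before ", " as group 1 and the part after it as group 2,
# then write them back in swapped order; names without a comma are left as-is
first_last_alt <- str_replace(snames, "^(.+), (.+)$", "\\2 \\1")
print(first_last_alt)
```

The backreferences `\\1` and `\\2` in the replacement refer to the two captured groups, so no list handling is needed.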

## 3.2
# use str_detect to test whether each name contains a title
# specific test: the periods are escaped so they only match a literal "."
title <- cbind(snames, str_detect(snames, "Dr\\.|Rev\\."))
print(title)

##      snames                        
## [1,] "Moe Szyslak"          "FALSE"
## [2,] "Burns, C. Montgomery" "FALSE"
## [3,] "Rev. Timothy Lovejoy" "TRUE" 
## [4,] "Ned Flanders"         "FALSE"
## [5,] "Simpson, Homer"       "FALSE"
## [6,] "Dr. Julius Hibbert"   "TRUE"

# more generic test for unknown titles; requires every title to follow the convention of two or more letters plus a period
second <- cbind(snames, str_detect(snames, "[[:alpha:]]{2,}[.]"))
print(second)

##      snames                        
## [1,] "Moe Szyslak"          "FALSE"
## [2,] "Burns, C. Montgomery" "FALSE"
## [3,] "Rev. Timothy Lovejoy" "TRUE" 
## [4,] "Ned Flanders"         "FALSE"
## [5,] "Simpson, Homer"       "FALSE"
## [6,] "Dr. Julius Hibbert"   "TRUE"
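In a title pattern the period needs care, because an unescaped "." matches any character. A small sketch of the difference follows; "Drake Revson" is a made-up name with no title, used only as a counterexample, and `loose` and `strict` are illustrative variable names.

```r
library(stringr)

# the third entry is hypothetical and has no title
test_names <- c("Dr. Julius Hibbert", "Rev. Timothy Lovejoy", "Drake Revson")

# unescaped: "." matches any character, so "Dra" in "Drake" counts as a hit
loose <- str_detect(test_names, "Dr.|Rev.")

# escaped: "\\." matches only a literal period
strict <- str_detect(test_names, "Dr\\.|Rev\\.")

print(cbind(test_names, loose, strict))
```

On the six Simpsons names both versions give the same answer, so the difference only shows up with names like the hypothetical one.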

## 3.3
# use str_detect to test for an initial:
# a space, one capital letter, then a literal period
cbind(snames, str_detect(snames," [A-Z]\\."))

##      snames                        
## [1,] "Moe Szyslak"          "FALSE"
## [2,] "Burns, C. Montgomery" "TRUE" 
## [3,] "Rev. Timothy Lovejoy" "FALSE"
## [4,] "Ned Flanders"         "FALSE"
## [5,] "Simpson, Homer"       "FALSE"
## [6,] "Dr. Julius Hibbert"   "FALSE"

Homework 4, describe the expressions and give examples.

[0-9]+\$ Here [0-9] matches any digit 0 through 9, “+” says the preceding item must be matched one or more times (as many as possible), and \$ is an escaped, literal dollar sign that must come right after the digits.

An example of this would be 10000$ but not $100

\b[a-z]{1,4}\b Here \b is a word boundary, so the match must start and end at the edge of a word. [a-z] matches lowercase letters, and {1,4} allows 1 through 4 of them in a row.

Examples: data, car, at, a. On their own, I and DOG would not match, because they contain no lowercase letters.

.*?\.txt$ Here “.” matches any single character (letter, digit, or symbol), “*” says the preceding item can be matched zero or more times, and the “?” after it makes that repetition lazy, so it takes as few characters as possible. \.txt is a literal .txt, and the unescaped $ anchors it to the end of the string. This is different from \$, where the backslash turns $ into a literal dollar sign.

An example would be cesar.txt. The string .txt by itself also matches, because .*? may match zero characters.

\d{2}/\d{2}/\d{4} Here \d matches a digit, like [[:digit:]]; {n} gives the number of repetitions expected; and / is an actual forward slash.

So this expects a date-like string such as mm/dd/yyyy or dd/mm/yyyy (month-first or day-first dates). It only checks the shape of the string, not whether the month or day is valid.
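A short sketch of what the date pattern does and does not check. The inputs are hypothetical, and the anchored variant with ^ and $ is an alternative added here for comparison, not the assigned expression.

```r
library(stringr)

# hypothetical inputs: a real date, an impossible date,
# and a date buried inside extra digits
dates <- c("02/15/2017", "99/99/9999", "102/15/20179")

# the assigned pattern only needs the digit/slash shape somewhere in the string
shape_only <- str_detect(dates, "\\d{2}/\\d{2}/\\d{4}")

# anchoring with ^ and $ requires the whole string to have that shape
anchored <- str_detect(dates, "^\\d{2}/\\d{2}/\\d{4}$")

print(cbind(dates, shape_only, anchored))
```

Neither version rejects 99/99/9999; checking real calendar dates would need something beyond a regular expression.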

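<(.+?)>.+?</\1> Here the parentheses () form a capture group that remembers whatever matched inside them, and \1 is a backreference that must repeat that captured text exactly. “.+?” matches one or more of any character, lazily (as few as possible). Together the pattern matches an opening tag, some content, and a closing tag with the same name.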

“<html>cesar</html>” would work but “<html>cesar</HTML>” would not, because the closing tag has to repeat the opening tag name exactly.

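To make the capture group and the lazy quantifier in the tag expression concrete, here is a minimal sketch. The string with the a and b tags is made up for illustration, and `tag_pattern`, `greedy`, and `lazy` are illustrative names.

```r
library(stringr)

tag_pattern <- "<(.+?)>.+?</\\1>"

# str_match returns the full match in column 1 and the captured group in column 2
m <- str_match("<html>cesar</html>", tag_pattern)
print(m)

# greedy versus lazy on a made-up string with two tags:
# ".+" runs to the last ">", while ".+?" stops at the first one
greedy <- str_extract("<a>x</a><b>y</b>", "<.+>")
lazy <- str_extract("<a>x</a><b>y</b>", "<.+?>")
print(c(greedy, lazy))
```

The second column of `m` holds the tag name that \1 must repeat in the closing tag.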
##
# column 1 labels the expression (a-e) each sample was written for; column 2 holds the sample strings
samples <- cbind(c("a","a","b","b","c","c","d","d","e","e"),
                 c("10000$","$100","car","DOG","cesar.txt","-.txt",
                   "02/15/2017","2/5/2017","<html>cesar</html>","<html>cesar</HTML>"))
# test every sample string against each of the five expressions
df <- cbind(samples,
            str_detect(samples[,2], "[0-9]+\\$"),
            str_detect(samples[,2], "\\b[a-z]{1,4}\\b"),
            str_detect(samples[,2], ".*?\\.txt$"),
            str_detect(samples[,2], "\\d{2}/\\d{2}/\\d{4}"),
            str_detect(samples[,2], "<(.+?)>.+?</\\1>"))
colnames(df) <- c("3.4","Sample String","a","b","c","d","e")
print(df)

##       3.4 Sample String        a       b       c       d       e      
##  [1,] "a" "10000$"             "TRUE"  "FALSE" "FALSE" "FALSE" "FALSE"
##  [2,] "a" "$100"               "FALSE" "FALSE" "FALSE" "FALSE" "FALSE"
##  [3,] "b" "car"                "FALSE" "TRUE"  "FALSE" "FALSE" "FALSE"
##  [4,] "b" "DOG"                "FALSE" "FALSE" "FALSE" "FALSE" "FALSE"
##  [5,] "c" "cesar.txt"          "FALSE" "TRUE"  "TRUE"  "FALSE" "FALSE"
##  [6,] "c" "-.txt"              "FALSE" "TRUE"  "TRUE"  "FALSE" "FALSE"
##  [7,] "d" "02/15/2017"         "FALSE" "FALSE" "FALSE" "TRUE"  "FALSE"
##  [8,] "d" "2/5/2017"           "FALSE" "FALSE" "FALSE" "FALSE" "FALSE"
##  [9,] "e" "<html>cesar</html>" "FALSE" "TRUE"  "FALSE" "FALSE" "TRUE" 
## [10,] "e" "<html>cesar</HTML>" "FALSE" "TRUE"  "FALSE" "FALSE" "FALSE"

Please note that the sample strings written for c also return TRUE under the test from b. This is because str_detect only needs a match somewhere in the string, and \b treats the period as a word boundary. In cesar.txt the extension txt is a lowercase word of 1 to 4 letters, so b matches it even though cesar itself has five letters. The same thing happens with html in the sample strings for e.
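One way to see why test b fires on these strings is to extract the piece it actually matched. The sketch below reuses three sample strings from the table; the whole-string version anchored with ^ and $ is an alternative shown for contrast, not the assigned expression, and the variable names are illustrative.

```r
library(stringr)

strings <- c("cesar.txt", "-.txt", "<html>cesar</html>")
pattern_b <- "\\b[a-z]{1,4}\\b"

# str_extract returns the first piece of each string that the pattern matched
matched <- str_extract(strings, pattern_b)
print(cbind(strings, matched))

# anchoring to the whole string rejects cesar.txt but still accepts car
whole <- str_detect(c("car", "cesar.txt"), "^[a-z]{1,4}$")
print(whole)
```

The matched pieces are the short words next to punctuation (txt and html), not cesar.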

Homework 9, manipulate the code.

This asks me to crack the code, so I will use str_extract_all.

code <- c("clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr")

# tried all letters: too many letters come back
str_extract_all(code,"[[:alpha:]]")

## [[1]]
##   [1] "c" "l" "c" "o" "p" "C" "o" "w" "z" "m" "s" "t" "c" "d" "w" "n" "k"
##  [18] "i" "g" "O" "v" "d" "i" "c" "p" "N" "u" "g" "g" "v" "h" "r" "y" "n"
##  [35] "G" "j" "u" "w" "c" "z" "i" "h" "q" "r" "f" "p" "R" "x" "s" "A" "j"
##  [52] "d" "w" "p" "n" "T" "a" "n" "w" "o" "U" "w" "i" "s" "d" "i" "j" "L"
##  [69] "j" "k" "p" "f" "A" "T" "I" "d" "r" "c" "o" "c" "b" "t" "y" "c" "z"
##  [86] "j" "a" "t" "O" "a" "o" "o" "t" "j" "t" "N" "j" "n" "e" "c" "S" "f"
## [103] "e" "k" "r" "w" "Y" "w" "w" "o" "j" "i" "g" "O" "d" "v" "r" "f" "U"
## [120] "r" "b" "z" "b" "k" "A" "n" "b" "h" "z" "g" "v" "R" "i" "z" "E" "c"
## [137] "r" "o" "p" "w" "A" "g" "n" "b" "S" "q" "o" "U" "f" "P" "a" "o" "t"
## [154] "f" "b" "w" "E" "m" "k" "t" "s" "R" "z" "q" "e" "f" "y" "n" "N" "d"
## [171] "t" "k" "c" "f" "E" "g" "m" "c" "R" "g" "x" "o" "n" "h" "D" "k" "g"
## [188] "r"

# tried lowercase letters only: garbage
str_extract_all(code,"[a-z]")

## [[1]]
##   [1] "c" "l" "c" "o" "p" "o" "w" "z" "m" "s" "t" "c" "d" "w" "n" "k" "i"
##  [18] "g" "v" "d" "i" "c" "p" "u" "g" "g" "v" "h" "r" "y" "n" "j" "u" "w"
##  [35] "c" "z" "i" "h" "q" "r" "f" "p" "x" "s" "j" "d" "w" "p" "n" "a" "n"
##  [52] "w" "o" "w" "i" "s" "d" "i" "j" "j" "k" "p" "f" "d" "r" "c" "o" "c"
##  [69] "b" "t" "y" "c" "z" "j" "a" "t" "a" "o" "o" "t" "j" "t" "j" "n" "e"
##  [86] "c" "f" "e" "k" "r" "w" "w" "w" "o" "j" "i" "g" "d" "v" "r" "f" "r"
## [103] "b" "z" "b" "k" "n" "b" "h" "z" "g" "v" "i" "z" "c" "r" "o" "p" "w"
## [120] "g" "n" "b" "q" "o" "f" "a" "o" "t" "f" "b" "w" "m" "k" "t" "s" "z"
## [137] "q" "e" "f" "y" "n" "d" "t" "k" "c" "f" "g" "m" "c" "g" "x" "o" "n"
## [154] "h" "k" "g" "r"

# tried capital letters only and the message shows
str_extract_all(code,"[A-Z]")

## [[1]]
##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"

# reading it back, I noticed that the periods separate the words of the message
str_extract_all(code,"[A-Z]|\\.")

## [[1]]
##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "." "Y"
## [18] "O" "U" "." "A" "R" "E" "." "A" "." "S" "U" "P" "E" "R" "N" "E" "R"
## [35] "D"
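The extracted characters can be collapsed back into a readable sentence. A minimal sketch, assuming the periods stand in for spaces between words; `pieces` and `decoded` are illustrative names.

```r
library(stringr)

code <- c("clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr")

# keep capital letters and periods, in the order they appear
pieces <- str_extract_all(code, "[A-Z]|\\.")[[1]]

# glue the pieces into one string, then turn each period into a space
decoded <- str_replace_all(paste(pieces, collapse = ""), "\\.", " ")
print(decoded)
```

`str_extract_all` returns a list with one element per input string, which is why the sketch takes `[[1]]` before pasting.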

I am glad to be congratulated for being a super nerd: “CONGRATULATIONS YOU ARE A SUPERNERD” #NERD

Cesar L. Espitia

February 16, 2017
