607 Week 3 Assignment

library(stringr)

3. Copy the introductory example. The vector name stores the extracted names.

1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

To make the vector conform to the standard, we first extract all the first names and place them in a vector. We then do the same with the last names and place them in a separate vector. Once we have these two vectors, we concatenate them element-wise to get the desired vector complying with the standard (when concatenating, we add a space between the first and last name of each character).

raw.data<-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

name<-unlist(str_extract_all(raw.data,"[[:alpha:]., ]{2,}"))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

firstName<-str_extract_all(unlist(str_extract_all(name,"[:alpha:]{1,25} |, [:print:]{1,25}")),"[A-Z](.+?)+[a-z]")
firstName

## [[1]]
## [1] "Moe"
## 
## [[2]]
## [1] "C. Montgomery"
## 
## [[3]]
## [1] "Timothy"
## 
## [[4]]
## [1] "Ned"
## 
## [[5]]
## [1] "Homer"
## 
## [[6]]
## [1] "Julius"

lastName<-str_extract_all(unlist(str_extract_all(name,"[a-z] [:alpha:]{1,25}[a-z]$|[:print:]{1,25},")),"[A-Z][a-z]+")
lastName

## [[1]]
## [1] "Szyslak"
## 
## [[2]]
## [1] "Burns"
## 
## [[3]]
## [1] "Lovejoy"
## 
## [[4]]
## [1] "Flanders"
## 
## [[5]]
## [1] "Simpson"
## 
## [[6]]
## [1] "Hibbert"

paste0(firstName,' ',lastName)

## [1] "Moe Szyslak"         "C. Montgomery Burns" "Timothy Lovejoy"    
## [4] "Ned Flanders"        "Homer Simpson"       "Julius Hibbert"
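
As a cross-check on the result above, here is a minimal alternative sketch that reaches the same vector with two replacements instead of separate first-name and last-name extractions. It assumes that titles are two or three letters followed by a period, and that a comma only ever appears in the "last, first" form; the names noTitle and standardName are made up for this illustration.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Drop a leading title: two or three letters, a period, and a space.
noTitle <- str_replace(name, "^[[:alpha:]]{2,3}\\. ", "")

# Swap "last, first" into "first last" using two capture groups.
# Names without a comma do not match and are returned unchanged.
standardName <- str_replace(noTitle, "^(.+), (.+)$", "\\2 \\1")
standardName
```

The backreferences \\1 and \\2 in the replacement string refer to the two parenthesised groups, so the swap needs no intermediate vectors.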

2. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.)

We use str_extract_all to find all characters with a title.

title<-str_extract_all(name,"[:alpha:]{3}[.]|[:alpha:]{2}[.]")
title

## [[1]]
## character(0)
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "Rev."
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## [1] "Dr."

Now that we know which characters have titles, we can swap str_extract_all for str_detect to get a logical vector.

logicalTitle<-str_detect(name,"[:alpha:]{3}[.]|[:alpha:]{2}[.]")
logicalTitle

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
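
The letter count in the title pattern matters for this data. The following sketch, with the hypothetical names looseTitle and strictTitle, compares a looser pattern against one that requires two or three letters before the period.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Looser pattern: any run of letters followed by a period.
# This also flags the initial "C." in "Burns, C. Montgomery".
looseTitle <- str_detect(name, "[[:alpha:]]+[.]")

# Requiring two or three letters before the period excludes one-letter initials.
strictTitle <- str_detect(name, "[[:alpha:]]{2,3}[.]")

data.frame(name, looseTitle, strictTitle)
```

Only the stricter version leaves Mr. Burns out, which is why the initial has to be handled separately from the titles.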

3. Construct a logical vector indicating whether a character has a second name.

We use str_extract_all to find all characters with a second name. The pattern looks for a single-letter initial preceded by a space, either followed by a period or standing alone between spaces. Only the first case is necessary for this dataset, since looking for an initial with a period already finds Mr. Burns, but the second was added for a more general solution. Note that this pattern does not detect second names that are written out in full.

secondName<-str_extract_all(name," [:alpha:]{1}[.]| [:alpha:] ")
secondName

## [[1]]
## character(0)
## 
## [[2]]
## [1] " C."
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)

Now that we have found the second names, we can replace str_extract_all with str_detect to generate the logical vector.

logicalSecondName<-str_detect(name," [:alpha:]{1}[.]| [:alpha:] ")
logicalSecondName

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
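
A different way to approach the same question, sketched here under stated assumptions, is to count name parts instead of looking for initials. It assumes a title is two or three letters plus a period at the start of the string, and that any name with more than two remaining parts has a second name, whether abbreviated or written in full. The names noTitle and hasSecondName are hypothetical.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Remove a leading title so it is not counted as a name part.
noTitle <- str_replace(name, "^[[:alpha:]]{2,3}\\. ", "")

# Count whitespace-separated parts; more than two implies a second name.
hasSecondName <- str_count(noTitle, "\\S+") > 2
hasSecondName
```

On this dataset the result agrees with the logical vector above, with only Mr. Burns flagged.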

4. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

1. [0-9]+\$

Matches one or more digits immediately followed by a literal dollar sign.
[0-9]+ matches a run of one or more digits from 0 to 9
\$ matches the literal $; the \ escapes the $ so that it is not read as the end-of-string anchor

Four matching and three non-matching expressions shown

text<-c('1234$','123 1234$','67789$566789','2345345 2345345$','2345345','$123','234234sdfsdf$')
str_extract_all(text,"[0-9]+\\$")

## [[1]]
## [1] "1234$"
## 
## [[2]]
## [1] "1234$"
## 
## [[3]]
## [1] "67789$"
## 
## [[4]]
## [1] "2345345$"
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)
## 
## [[7]]
## character(0)

str_detect(text,"[0-9]+\\$")

## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
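
To make the role of the backslash concrete, this small sketch runs the escaped and unescaped forms on the same two strings. The vector samples and the result names are made up for the illustration.

```r
library(stringr)

samples <- c("1234$", "1234")

# Escaped: digits followed by a literal dollar sign.
escaped <- str_detect(samples, "[0-9]+\\$")

# Unescaped: $ is an anchor, so this asks for digits at the end of the string.
anchored <- str_detect(samples, "[0-9]+$")

rbind(escaped, anchored)
```

The two patterns give opposite answers here: the escaped form accepts only "1234$", while the anchored form accepts only "1234".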

2. \b[a-z]{1,4}\b

We are searching for whole words made up of 1 to 4 lowercase letters from a to z. Each match must sit between word boundaries, such as a space, punctuation, or the start or end of the string, so a string can contain several matches.

Four matching and two non-matching expressions shown

text<-c('good','test','good more than one word each is less than four dig','good@each@word&seen','wordtoolong','verybad morethan oneword eachis morethan fourdigits')
str_extract_all(text,"\\b[a-z]{1,4}\\b")

## [[1]]
## [1] "good"
## 
## [[2]]
## [1] "test"
## 
## [[3]]
##  [1] "good" "more" "than" "one"  "word" "each" "is"   "less" "than" "four"
## [11] "dig" 
## 
## [[4]]
## [1] "good" "each" "word" "seen"
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)

str_detect(text,"\\b[a-z]{1,4}\\b")

## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
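
The word boundaries also interact with capital letters and digits, because both count as word characters. A short sketch with made-up test strings (edgeCases and shortWords are hypothetical names):

```r
library(stringr)

edgeCases <- c("Good day", "abc123 xy")

# "Good" is skipped: "ood" is not preceded by a word boundary.
# "abc" is skipped: it is followed by a digit, not a boundary.
shortWords <- str_extract_all(edgeCases, "\\b[a-z]{1,4}\\b")
shortWords
```

Only "day" and "xy" are extracted, so the expression does not pick short lowercase fragments out of longer tokens.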

3. .*?\.txt$

Looks for strings ending in .txt (probably looking for text file names in a text).
txt$ requires the match to finish with txt at the end of the string
\. escapes the ., changing it from a wildcard into the literal period of the .txt extension
.*? takes the wildcard . and lazily matches any number of characters (including none) before .txt

Five matching and two non-matching expressions shown

text<-c('testfile.txt','testfile.txt','.txt','\\testfile.txt','something and a file file.txt','sdfdsftxt','txt.sdfg')
str_extract_all(text,".*?\\.txt$")

## [[1]]
## [1] "testfile.txt"
## 
## [[2]]
## [1] "testfile.txt"
## 
## [[3]]
## [1] ".txt"
## 
## [[4]]
## [1] "\\testfile.txt"
## 
## [[5]]
## [1] "something and a file file.txt"
## 
## [[6]]
## character(0)
## 
## [[7]]
## character(0)

str_detect(text,".*?\\.txt$")

## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
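
The effect of escaping the period can be seen by running the pattern with and without the backslash. This is a minimal sketch; the vector files and the result names are invented for the comparison.

```r
library(stringr)

files <- c("notes.txt", "sdfdsftxt")

# Escaped period: the string must end in a literal ".txt".
escapedDot <- str_detect(files, ".*?\\.txt$")

# Unescaped period: any character may stand before "txt",
# so "sdfdsftxt" is accepted as well.
wildcardDot <- str_detect(files, ".*?.txt$")

rbind(escapedDot, wildcardDot)
```

Without the escape, the second string passes even though it contains no period at all.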

4. \d{2}/\d{2}/\d{4}

Looks for date-like strings: 2 digits, then a /, then 2 digits, then another /, then 4 digits. Days and months must be written with two digits and the year with four. The pattern checks only the layout, not whether the date is valid.

Two matching and two non-matching expressions shown

text<-c('01/01/1971','12/12/2012','1/1/71','12/12/12')
str_extract_all(text,"\\d{2}/\\d{2}/\\d{4}")

## [[1]]
## [1] "01/01/1971"
## 
## [[2]]
## [1] "12/12/2012"
## 
## [[3]]
## character(0)
## 
## [[4]]
## character(0)

str_detect(text,"\\d{2}/\\d{2}/\\d{4}")

## [1]  TRUE  TRUE FALSE FALSE
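
Two limits of this expression are worth seeing in action: it accepts impossible dates, and it matches inside longer strings. The sketch below uses hypothetical test values and adds the anchors ^ and $ as one possible way to require that the whole string be a date.

```r
library(stringr)

candidates <- c("99/99/9999", "x01/01/1971y")

# The original pattern accepts both: it checks layout only,
# and it may match anywhere inside the string.
looseDate <- str_detect(candidates, "\\d{2}/\\d{2}/\\d{4}")

# Anchoring both ends requires the whole string to have the date layout.
strictDate <- str_detect(candidates, "^\\d{2}/\\d{2}/\\d{4}$")

rbind(looseDate, strictDate)
```

Even the anchored version still accepts "99/99/9999", so checking that a date is real would need something beyond this regular expression.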

5. <(.+?)>.+?</\1>

Here a backreference is used to find whatever appears inside the first <> and to require that it repeat after a </
This can be used to find tags in an HTML document. The match starts with an opening tag < someTag >; the () captures the tag name, and the .+? inside it accepts any tag, not one in particular. Then .+? goes through whatever the body of the tag might be, after which </ starts the closing tag and \1 refers back to the captured group, that is, to whatever was inside the opening <>

Note that the pattern in the code below leaves off the final > of the expression, so each extracted match stops just before the closing >, and the sixth string matches even though its closing tag is </tagend>.

Four matching and two non-matching strings shown

text<-c('<tag>body</tag>','<anothertag>body</anothertag>','<tag>any body with anything in it</tag>','<tag>not the same tag being clossed</anothertag>','<tag>again not the same tag</endtag>','<tag>good did find the same tag</tagend>')
str_extract_all(text,"<(.+?)>.+?</\\1")

## [[1]]
## [1] "<tag>body</tag"
## 
## [[2]]
## [1] "<anothertag>body</anothertag"
## 
## [[3]]
## [1] "<tag>any body with anything in it</tag"
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## [1] "<tag>good did find the same tag</tag"

str_detect(text,"<(.+?)>.+?</\\1")

## [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE
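
For comparison, here is a small sketch that runs the expression exactly as given in the exercise, with the final > included, next to the version without it. The test strings and the names withoutClose and withClose are made up for this illustration.

```r
library(stringr)

tags <- c("<tag>body</tag>", "<tag>body</tagend>", "<a>x</b>")

# Without the final ">": "</tag" is enough, so "</tagend>" also passes.
withoutClose <- str_detect(tags, "<(.+?)>.+?</\\1")

# With the final ">": the closing tag name must equal the opening one.
withClose <- str_detect(tags, "<(.+?)>.+?</\\1>")

rbind(withoutClose, withClose)

# The full pattern also returns the complete element, closing ">" included.
str_extract(tags[1], "<(.+?)>.+?</\\1>")
```

With the closing > in place, a closing tag that merely begins with the opening tag's name is rejected.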

9. The following code hides a secret message. Crack it with R and regular expressions.

code<-'clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hprfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03At5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPalotfb7wEm24k6t3sR9zqe5fy89n6N5t9kc4fE905gmc4Rgxo5nhDk!gr'

Using the hint, I first looked at all the lowercase letters, then at runs of lowercase letters, then at runs of letters in either case. No luck on any of these.

str_extract_all(code,"[a-z]")

## [[1]]
##   [1] "c" "l" "c" "o" "p" "o" "w" "z" "m" "s" "t" "c" "d" "w" "n" "k" "i"
##  [18] "g" "v" "d" "i" "c" "p" "u" "g" "g" "v" "h" "r" "y" "n" "j" "u" "w"
##  [35] "c" "z" "i" "h" "p" "r" "f" "p" "x" "s" "j" "d" "w" "p" "n" "a" "n"
##  [52] "w" "o" "w" "i" "s" "d" "i" "j" "j" "k" "p" "f" "t" "d" "r" "c" "o"
##  [69] "c" "b" "t" "y" "c" "z" "j" "a" "t" "a" "o" "o" "t" "j" "t" "j" "n"
##  [86] "e" "c" "f" "e" "k" "r" "w" "w" "w" "o" "j" "i" "g" "d" "v" "r" "f"
## [103] "r" "b" "z" "b" "k" "n" "b" "h" "z" "g" "v" "i" "z" "c" "r" "o" "p"
## [120] "w" "g" "n" "b" "q" "o" "f" "a" "l" "o" "t" "f" "b" "w" "m" "k" "t"
## [137] "s" "z" "q" "e" "f" "y" "n" "t" "k" "c" "f" "g" "m" "c" "g" "x" "o"
## [154] "n" "h" "k" "g" "r"

str_extract_all(code,"[a-z]+")

## [[1]]
##  [1] "clcop"    "ow"       "zmstc"    "d"        "wnkig"    "vdicp"   
##  [7] "uggvhryn" "juwczi"   "hprfp"    "xs"       "j"        "dwpn"    
## [13] "anwo"     "wisdij"   "j"        "kpf"      "t"        "dr"      
## [19] "coc"      "bt"       "yczjat"   "aootj"    "t"        "j"       
## [25] "ne"       "c"        "fek"      "r"        "w"        "wwojig"  
## [31] "d"        "vrf"      "rbz"      "bk"       "nbhzgv"   "i"       
## [37] "z"        "crop"     "w"        "gnb"      "qo"       "f"       
## [43] "alotfb"   "w"        "m"        "k"        "t"        "s"       
## [49] "zqe"      "fy"       "n"        "t"        "kc"       "f"       
## [55] "gmc"      "gxo"      "nh"       "k"        "gr"

str_extract_all(code,"[A-Za-z]+")

## [[1]]
##  [1] "clcopCow"        "zmstc"           "d"              
##  [4] "wnkig"           "OvdicpNuggvhryn" "Gjuwczi"        
##  [7] "hprfpRxs"        "Aj"              "dwpn"           
## [10] "TanwoUwisdij"    "Lj"              "kpf"            
## [13] "At"              "Idr"             "coc"            
## [16] "bt"              "yczjatOaootj"    "t"              
## [19] "Nj"              "ne"              "c"              
## [22] "Sfek"            "r"               "w"              
## [25] "YwwojigOd"       "vrfUrbz"         "bkAnbhzgv"      
## [28] "R"               "i"               "zEcrop"         
## [31] "wAgnb"           "SqoU"            "fPalotfb"       
## [34] "wEm"             "k"               "t"              
## [37] "sR"              "zqe"             "fy"             
## [40] "n"               "N"               "t"              
## [43] "kc"              "fE"              "gmc"            
## [46] "Rgxo"            "nhDk"            "gr"

Finally, I looked at only the capital letters and found the message in the code.

codeMessage<-str_extract_all(code,"[A-Z]+")
codeMessage

## [[1]]
##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "I" "O" "N" "S" "Y" "O" "U"
## [18] "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"

codeMessageString<-paste(unlist(codeMessage),collapse='')
codeMessageString

## [1] "CONGRATULAIONSYOUAREASUPERNERD"

codeMessageString<-str_replace_all(codeMessageString,"SY","S Y")
codeMessageString<-str_replace_all(codeMessageString,"UA","U A")
codeMessageString<-str_replace_all(codeMessageString,"EA","E A")
codeMessageString<-str_replace_all(codeMessageString,"AS","A S")
codeMessageString<-str_replace_all(codeMessageString,"RN","R N")
codeMessageString

## [1] "CONGRATULAIONS YOU ARE A SUPER NERD"
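
The hand-written replacements above could possibly be avoided by reading the periods in the string as word separators. That reading is an assumption about how the puzzle was built, not something confirmed here, and the names pieces and decoded are hypothetical. A minimal sketch:

```r
library(stringr)

code <- 'clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hprfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03At5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPalotfb7wEm24k6t3sR9zqe5fy89n6N5t9kc4fE905gmc4Rgxo5nhDk!gr'

# Keep capital letters plus the period and exclamation mark.
pieces <- unlist(str_extract_all(code, "[[:upper:].!]"))

# Collapse to one string, then turn each literal period into a space.
decoded <- str_replace_all(paste(pieces, collapse = ""), fixed("."), " ")
decoded
```

Under this reading the spacing comes from the string itself, so no replacement rule has to be guessed for each word boundary. One difference from the result above is that SUPERNERD stays a single word, because no period separates its two halves.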