library(stringr)
To build a vector that conforms to the standard, we first extract all the first names and place them into a vector. We then do the same with the last names and place them into a separate vector. Once we have these two vectors, we concatenate them element-wise to get the desired vector, adding a space between the first and last name of each character.
raw.data<-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name<-unlist(str_extract_all(raw.data,"[[:alpha:]., ]{2,}"))
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
firstName<-str_extract_all(unlist(str_extract_all(name,"[:alpha:]{1,25} |, [:print:]{1,25}")),"[A-Z](.+?)+[a-z]")
firstName
## [[1]]
## [1] "Moe"
##
## [[2]]
## [1] "C. Montgomery"
##
## [[3]]
## [1] "Timothy"
##
## [[4]]
## [1] "Ned"
##
## [[5]]
## [1] "Homer"
##
## [[6]]
## [1] "Julius"
lastName<-str_extract_all(unlist(str_extract_all(name,"[a-z] [:alpha:]{1,25}[a-z]$|[:print:]{1,25},")),"[A-Z][a-z]+")
lastName
## [[1]]
## [1] "Szyslak"
##
## [[2]]
## [1] "Burns"
##
## [[3]]
## [1] "Lovejoy"
##
## [[4]]
## [1] "Flanders"
##
## [[5]]
## [1] "Simpson"
##
## [[6]]
## [1] "Hibbert"
paste0(firstName,' ',lastName)
## [1] "Moe Szyslak" "C. Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Julius Hibbert"
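The same result can also be reached without splitting first and last names into separate vectors. The following is only a sketch of an alternative, not the approach used above: it assumes that titles are two or three letters followed by a period, and that a comma always means "Last, First" order. The names `noTitle` and `firstLast` are made up for this illustration.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Drop a leading title: 2-3 letters, a period, and a space at the start
noTitle <- str_replace(name, "^[[:alpha:]]{2,3}\\. ", "")

# Swap "Last, First" into "First Last"; names without a comma are left alone
firstLast <- str_replace(noTitle, "^(.+), (.+)$", "\\2 \\1")
firstLast
```

On this dataset the result is the same six names as the paste0 output above.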
We use str_extract_all to find every character whose name includes a title.
title<-str_extract_all(name,"[:alpha:]{3}[.]|[:alpha:]{2}[.]")
title
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## [1] "Rev."
##
## [[4]]
## character(0)
##
## [[5]]
## character(0)
##
## [[6]]
## [1] "Dr."
Now that we know which characters have titles, we can swap str_extract_all for str_detect to get a logical vector.
logicalTittle<-str_detect(name,"[:alpha:]{3}[.]|[:alpha:]{2}[.]")
logicalTittle
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
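A logical vector like this is mostly useful for subsetting and counting. A small sketch, assuming the same `name` vector as above; `hasTitle` is a name made up for this illustration, and the two alternatives of the title pattern are folded into a single `{2,3}` quantifier.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Two or three letters followed by a literal period
hasTitle <- str_detect(name, "[[:alpha:]]{2,3}[.]")

name[hasTitle]  # keeps "Rev. Timothy Lovejoy" and "Dr. Julius Hibbert"
sum(hasTitle)   # 2, because TRUE counts as 1
```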
We use str_extract_all to find all characters with a second name. The pattern looks for a single-letter initial, either followed by a period or standing alone between spaces. The second case is not necessary for this dataset, since looking for an initial with a period is enough to find Mr. Burns, but it was added for a more general solution.
secondName<-str_extract_all(name," [:alpha:]{1}[.]| [:alpha:] ")
secondName
## [[1]]
## character(0)
##
## [[2]]
## [1] " C."
##
## [[3]]
## character(0)
##
## [[4]]
## character(0)
##
## [[5]]
## character(0)
##
## [[6]]
## character(0)
Now that we have found the second names, we can swap str_extract_all for str_detect to generate the logical vector.
logicalSecondName<-str_detect(name," [:alpha:]{1}[.]| [:alpha:] ")
logicalSecondName
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
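The pattern matches one or more digits followed by a literal dollar sign.
[0-9]+ matches a run of one or more consecutive digits from 0 to 9
\$ matches the literal $ symbol; the backslash escapes the $ so that it is read as a literal and not as the end-of-string anchor (inside an R string the backslash itself must be doubled, hence \\$ in the code)
Four matching and three non-matching expressions are shown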
text<-c('1234$','123 1234$','67789$566789','2345345 2345345$','2345345','$123','234234sdfsdf$')
str_extract_all(text,"[0-9]+\\$")
## [[1]]
## [1] "1234$"
##
## [[2]]
## [1] "1234$"
##
## [[3]]
## [1] "67789$"
##
## [[4]]
## [1] "2345345$"
##
## [[5]]
## character(0)
##
## [[6]]
## character(0)
##
## [[7]]
## character(0)
str_detect(text,"[0-9]+\\$")
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE
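The effect of the escape is easiest to see by running the pattern with and without it. A minimal sketch; the vector `prices` and the result names are made up for this illustration.

```r
library(stringr)

prices <- c("1234$", "$123", "2345345")

# Escaped: digits followed by a literal dollar sign
escaped <- str_detect(prices, "[0-9]+\\$")
escaped    # TRUE FALSE FALSE

# Unescaped: $ is the end-of-string anchor, so this means "digits at the end"
unescaped <- str_detect(prices, "[0-9]+$")
unescaped  # FALSE TRUE TRUE
```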
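We are searching for whole words made up of one to four lowercase letters from a to z. The \b word boundaries require each match to be a complete word, so words longer than four letters do not match. A string that contains several short words, separated by blank spaces or other non-word characters, yields one match per word.
Four matching and two non-matching expressions are shown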
text<-c('good','test','good more than one word each is less than four dig','good@each@word&seen','wordtoolong','verybad morethan oneword eachis morethan fourdigits')
str_extract_all(text,"\\b[a-z]{1,4}\\b")
## [[1]]
## [1] "good"
##
## [[2]]
## [1] "test"
##
## [[3]]
## [1] "good" "more" "than" "one" "word" "each" "is" "less" "than" "four"
## [11] "dig"
##
## [[4]]
## [1] "good" "each" "word" "seen"
##
## [[5]]
## character(0)
##
## [[6]]
## character(0)
str_detect(text,"\\b[a-z]{1,4}\\b")
## [1] TRUE TRUE TRUE TRUE FALSE FALSE
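If the goal were instead to accept only strings that consist of a single short word, the word boundaries could be swapped for string anchors. This is a sketch of that variation, not part of the original pattern; the vector `words` and the result names are made up for this illustration.

```r
library(stringr)

words <- c("good", "good more", "wordtoolong", "good@each")

# Word boundaries: any whole word of 1-4 lowercase letters inside the string
anyShortWord <- str_detect(words, "\\b[a-z]{1,4}\\b")
anyShortWord   # TRUE TRUE FALSE TRUE

# String anchors: the whole string must be one word of 1-4 lowercase letters
onlyShortWord <- str_detect(words, "^[a-z]{1,4}$")
onlyShortWord  # TRUE FALSE FALSE FALSE
```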
The pattern looks for expressions ending in .txt (probably to find text file names within a text).
txt$ anchors the match so that txt must come at the very end of the string
\. escapes the dot, changing its meaning from a wildcard to a literal period that is part of the .txt being looked for
.*? uses the wildcard . to match zero or more of any character in front of .txt; the ? makes the quantifier lazy, so it takes as few characters as possible
Five matching and two non-matching expressions are shown
text<-c('testfile.txt','testfile.txt','.txt','\\testfile.txt','something and a file file.txt','sdfdsftxt','txt.sdfg')
str_extract_all(text,".*?\\.txt$")
## [[1]]
## [1] "testfile.txt"
##
## [[2]]
## [1] "testfile.txt"
##
## [[3]]
## [1] ".txt"
##
## [[4]]
## [1] "\\testfile.txt"
##
## [[5]]
## [1] "something and a file file.txt"
##
## [[6]]
## character(0)
##
## [[7]]
## character(0)
str_detect(text,".*?\\.txt$")
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE
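To see why the dot has to be escaped, compare the result with an unescaped dot. A minimal sketch; the vector `files` and the result names are made up for this illustration.

```r
library(stringr)

files <- c("testfile.txt", "sdfdsftxt", "txt.sdfg")

# Escaped dot: a literal period must precede txt at the end of the string
literalDot <- str_detect(files, "\\.txt$")
literalDot   # TRUE FALSE FALSE

# Unescaped dot: any character may precede txt, so "sdfdsftxt" slips through
wildcardDot <- str_detect(files, ".txt$")
wildcardDot  # TRUE TRUE FALSE
```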
The pattern looks for dates written as 2 digits, then a /, then 2 digits, then another /, then 4 digits. Days and months must therefore be expressed as two digits and the year as four. The pattern only checks the shape of the string, not whether the date is a valid one.
Two matching and two non-matching expressions are shown
text<-c('01/01/1971','12/12/2012','1/1/71','12/12/12')
str_extract_all(text,"\\d{2}/\\d{2}/\\d{4}")
## [[1]]
## [1] "01/01/1971"
##
## [[2]]
## [1] "12/12/2012"
##
## [[3]]
## character(0)
##
## [[4]]
## character(0)
str_detect(text,"\\d{2}/\\d{2}/\\d{4}")
## [1] TRUE TRUE FALSE FALSE
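If the individual fields of the date are needed, the same pattern can be given capture groups and passed to str_match. This is a sketch of one way to do it; the names `dates` and `parts` are made up for this illustration.

```r
library(stringr)

dates <- c("01/01/1971", "12/12/2012", "1/1/71")

# Column 1 holds the full match; columns 2-4 hold the three captured fields.
# Strings that do not match produce a row of NA values.
parts <- str_match(dates, "(\\d{2})/(\\d{2})/(\\d{4})")
parts

parts[2, 4]  # the four-digit field of the second date: "2012"
```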
Here backreferencing is used to find an expression inside <> that is then repeated after a </
This can be used to find tags in an HTML document. The pattern starts with an opening tag, < someTag >; the () captures the tag name, and the .+? inside it matches any tag name, not one in particular. A second .+? then goes through whatever the body of the tag might be, after which </ marks the start of the closing tag. Finally, \\1 refers back to the captured group, that is, to whatever was inside the opening <>, not to the body. Because the pattern as written has no > after \\1, each match stops just before the closing bracket, and a closing tag that merely begins with the opening tag's name (such as </tagend> for <tag>) also matches.
Four matching and two non-matching tags are shown
text<-c('<tag>body</tag>','<anothertag>body</anothertag>','<tag>any body with anything in it</tag>','<tag>not the same tag being clossed</anothertag>','<tag>again not the same tag</endtag>','<tag>good did find the same tag</tagend>')
str_extract_all(text,"<(.+?)>.+?</\\1")
## [[1]]
## [1] "<tag>body</tag"
##
## [[2]]
## [1] "<anothertag>body</anothertag"
##
## [[3]]
## [1] "<tag>any body with anything in it</tag"
##
## [[4]]
## character(0)
##
## [[5]]
## character(0)
##
## [[6]]
## [1] "<tag>good did find the same tag</tag"
str_detect(text,"<(.+?)>.+?</\\1")
## [1] TRUE TRUE TRUE FALSE FALSE TRUE
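The last string matches only because the pattern stops at the tag name. Adding a closing > after the backreference is one way to require an exact closing tag. A minimal sketch of that variation; the vector `tags` and the result names are made up for this illustration.

```r
library(stringr)

tags <- c("<tag>body</tag>", "<tag>good did find the same tag</tagend>")

# Pattern as used above: </tag is enough, so </tagend> also matches
looseClose <- str_detect(tags, "<(.+?)>.+?</\\1")
looseClose   # TRUE TRUE

# With the closing bracket: the closing tag must be exactly </tag>
strictClose <- str_detect(tags, "<(.+?)>.+?</\\1>")
strictClose  # TRUE FALSE
```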
code<-'clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hprfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03At5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPalotfb7wEm24k6t3sR9zqe5fy89n6N5t9kc4fE905gmc4Rgxo5nhDk!gr'
Using the hint, I first looked at all the lowercase letters one by one, then at runs of lowercase letters, then at runs of letters in either case. No luck with any of these.
str_extract_all(code,"[a-z]")
## [[1]]
## [1] "c" "l" "c" "o" "p" "o" "w" "z" "m" "s" "t" "c" "d" "w" "n" "k" "i"
## [18] "g" "v" "d" "i" "c" "p" "u" "g" "g" "v" "h" "r" "y" "n" "j" "u" "w"
## [35] "c" "z" "i" "h" "p" "r" "f" "p" "x" "s" "j" "d" "w" "p" "n" "a" "n"
## [52] "w" "o" "w" "i" "s" "d" "i" "j" "j" "k" "p" "f" "t" "d" "r" "c" "o"
## [69] "c" "b" "t" "y" "c" "z" "j" "a" "t" "a" "o" "o" "t" "j" "t" "j" "n"
## [86] "e" "c" "f" "e" "k" "r" "w" "w" "w" "o" "j" "i" "g" "d" "v" "r" "f"
## [103] "r" "b" "z" "b" "k" "n" "b" "h" "z" "g" "v" "i" "z" "c" "r" "o" "p"
## [120] "w" "g" "n" "b" "q" "o" "f" "a" "l" "o" "t" "f" "b" "w" "m" "k" "t"
## [137] "s" "z" "q" "e" "f" "y" "n" "t" "k" "c" "f" "g" "m" "c" "g" "x" "o"
## [154] "n" "h" "k" "g" "r"
str_extract_all(code,"[a-z]+")
## [[1]]
## [1] "clcop" "ow" "zmstc" "d" "wnkig" "vdicp"
## [7] "uggvhryn" "juwczi" "hprfp" "xs" "j" "dwpn"
## [13] "anwo" "wisdij" "j" "kpf" "t" "dr"
## [19] "coc" "bt" "yczjat" "aootj" "t" "j"
## [25] "ne" "c" "fek" "r" "w" "wwojig"
## [31] "d" "vrf" "rbz" "bk" "nbhzgv" "i"
## [37] "z" "crop" "w" "gnb" "qo" "f"
## [43] "alotfb" "w" "m" "k" "t" "s"
## [49] "zqe" "fy" "n" "t" "kc" "f"
## [55] "gmc" "gxo" "nh" "k" "gr"
str_extract_all(code,"[A-Za-z]+")
## [[1]]
## [1] "clcopCow" "zmstc" "d"
## [4] "wnkig" "OvdicpNuggvhryn" "Gjuwczi"
## [7] "hprfpRxs" "Aj" "dwpn"
## [10] "TanwoUwisdij" "Lj" "kpf"
## [13] "At" "Idr" "coc"
## [16] "bt" "yczjatOaootj" "t"
## [19] "Nj" "ne" "c"
## [22] "Sfek" "r" "w"
## [25] "YwwojigOd" "vrfUrbz" "bkAnbhzgv"
## [28] "R" "i" "zEcrop"
## [31] "wAgnb" "SqoU" "fPalotfb"
## [34] "wEm" "k" "t"
## [37] "sR" "zqe" "fy"
## [40] "n" "N" "t"
## [43] "kc" "fE" "gmc"
## [46] "Rgxo" "nhDk" "gr"
Finally, I looked at only the capital letters and found the message hidden in the code.
codeMessage<-str_extract_all(code,"[A-Z]+")
codeMessage
## [[1]]
## [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "I" "O" "N" "S" "Y" "O" "U"
## [18] "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"
codeMessageString<-paste(unlist(codeMessage),collapse='')
codeMessageString
## [1] "CONGRATULAIONSYOUAREASUPERNERD"
codeMessageString<-str_replace_all(codeMessageString,"SY","S Y")
codeMessageString<-str_replace_all(codeMessageString,"UA","U A")
codeMessageString<-str_replace_all(codeMessageString,"EA","E A")
codeMessageString<-str_replace_all(codeMessageString,"AS","A S")
codeMessageString<-str_replace_all(codeMessageString,"RN","R N")
codeMessageString
## [1] "CONGRATULAIONS YOU ARE A SUPER NERD"
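The five replacement calls could probably be collapsed into one. This sketch relies on str_replace_all accepting a named character vector of pattern = replacement pairs, which are applied one after another in the order given; the names `spacing` and `spaced` are made up for this illustration.

```r
library(stringr)

codeMessageString <- "CONGRATULAIONSYOUAREASUPERNERD"

# Each name is a pattern, each value its replacement; order matters,
# because "EA" must be split before "AS" can be found
spacing <- c("SY" = "S Y", "UA" = "U A", "EA" = "E A",
             "AS" = "A S", "RN" = "R N")

spaced <- str_replace_all(codeMessageString, spacing)
spaced
```

The word breaks are still chosen by hand, so this only shortens the code; it does not discover the spacing automatically.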