For these problems, we need to load the stringr package.
library(stringr)
First, we need to load the raw data provided by the assignment:
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
Next, we need to write R code that will do the following:
The R-code and output can be seen below:
raw.names <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
final.names <- character(length(raw.names))
for (i in seq_along(raw.names)) {
  # Split "Last, First" entries on the comma and trim stray whitespace
  name_vec <- str_trim(unlist(str_split(raw.names[i], ",")))
  if (length(name_vec) > 1) {
    # Reorder to "First Last"
    final.names[i] <- str_c(name_vec[2], name_vec[1], sep = " ")
  } else {
    final.names[i] <- name_vec
  }
}
final.names
[1] "Moe Szyslak" "C. Montgomery Burns" "Rev. Timothy Lovejoy"
[4] "Ned Flanders" "Homer Simpson" "Dr. Julius Hibbert"
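As an aside, the same reordering can be sketched without an explicit loop by using a regular expression with backreferences. This is only an illustrative alternative, not the approach used above; the vector `extracted.names` is a hand-typed stand-in for the extracted names, and `final.names.alt` is a name introduced here for illustration.

```r
library(stringr)

# Assumption: a hand-typed copy of the names extracted above
extracted.names <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
                     "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# "^(.+),\\s*(.+)$" captures the text before and after the comma;
# "\\2 \\1" writes the two captured pieces back in swapped order.
# Names without a comma do not match and are returned unchanged.
final.names.alt <- str_replace(str_trim(extracted.names), "^(.+),\\s*(.+)$", "\\2 \\1")
final.names.alt
```

Because `str_replace()` is vectorized, it handles every element in one call, which avoids growing a vector inside a loop.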
The logical vector should be ‘TRUE’ for the third and sixth elements, since these are the only two people with titles. The simple way is to use the OR operator (“|”) to return ‘TRUE’ for all names that contain “Dr.” or “Rev.”:
title.bool <- str_detect(final.names, "Dr.|Rev.")
title.bool
[1] FALSE FALSE TRUE FALSE FALSE TRUE
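One detail worth checking with a pattern like this: in a regular expression an unescaped period matches any character, so “Dr.” also matches “Dre”. The sketch below uses made-up names (“Drew Smith” and “Reva Jones” are hypothetical test values, not part of the assignment data) to show the difference between the unescaped and escaped forms.

```r
library(stringr)

# Hypothetical test names; only the first and third carry a title
test.names <- c("Dr. Julius Hibbert", "Drew Smith", "Rev. Timothy Lovejoy", "Reva Jones")

# Unescaped "." matches any character, so "Dre" and "Reva" also match
loose.bool <- str_detect(test.names, "Dr.|Rev.")

# Escaping the period ("\\.") requires a literal "."
strict.bool <- str_detect(test.names, "Dr\\.|Rev\\.")

loose.bool
strict.bool
```

On the assignment data both forms give the same answer, so the escaped version only matters if other names could appear.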
Another way is to look for alphabetic characters (“[[:alpha:]]”) followed by a period (“[.]”). We need the quantifier “{2,}” because “C.” would otherwise return ‘TRUE’: requiring at least two letters before the period excludes single-letter initials.
title.bool <- str_detect(final.names, "[[:alpha:]]{2,}[.]")
title.bool
[1] FALSE FALSE TRUE FALSE FALSE TRUE
From looking at the vector, it appears the only person with a second name is “C. Montgomery Burns”. Therefore, we can use a pattern similar to the previous one, but with the quantifier changed to 1 so that it matches a single initial followed by a period. In addition, we need to restrict it to upper case letters so that it does not match “Dr.” or “Rev.”
second.bool <- str_detect(final.names, "[[:upper:]]{1}[.]")
second.bool
[1] FALSE TRUE FALSE FALSE FALSE FALSE
The “[0-9]” matches any single digit, the “+” means the preceding element (“[0-9]”) must occur one or more times, and the “\\$” matches a literal dollar sign. The pattern therefore matches one or more digits immediately followed by a “$”, anywhere in the string.
The following tests four strings against the regular expression:
test.ex <- "[0-9]+\\$"
string.vec <- c("913$","$913","hmm","9a13$1")
bool <- str_detect(string.vec,test.ex)
bool
[1] TRUE FALSE FALSE TRUE
The only non-obvious ‘TRUE’ result was “9a13$1”. This returned ‘TRUE’ because the portion “13$” follows the pattern, even though the entire string does not follow this pattern:
str_extract("9a13$1",test.ex)
[1] "13$"
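If the intent were to accept only strings that consist entirely of digits followed by a dollar sign, the pattern could be anchored with “^” and “$”. The following is a small sketch of that variation; the name `anchored.ex` is introduced here for illustration only.

```r
library(stringr)

string.vec <- c("913$", "$913", "hmm", "9a13$1")

# "^" anchors the match to the start of the string; the final unescaped "$"
# anchors it to the end, while "\\$" is still the literal dollar sign.
anchored.ex <- "^[0-9]+\\$$"

anchored.bool <- str_detect(string.vec, anchored.ex)
anchored.bool
```

With the anchors in place, “9a13$1” no longer matches, because the whole string has to fit the pattern rather than just a portion of it.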
This regular expression matches a run of one to four lower case letters. The “\\b” on each side is a word boundary, so the run must be a complete word of no more than four characters.
test.ex <- "\\b[a-z]{1,4}\\b"
string.vec <- c("913$","$913","ryan","Ryan","gordon")
bool <- str_detect(string.vec,test.ex)
bool
[1] FALSE FALSE TRUE FALSE FALSE
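To see the word boundaries at work on a longer input, the pattern can be applied to a sentence with `str_extract_all()`. The sentence below is a made-up example, and `short.words` is a name introduced for illustration.

```r
library(stringr)

test.ex <- "\\b[a-z]{1,4}\\b"

# Hypothetical sentence: "sleeps" has six letters, so it is skipped entirely
# rather than being cut down to its first four letters.
short.words <- str_extract_all("my cat sleeps all day", test.ex)[[1]]
short.words
```

The result keeps “my”, “cat”, “all”, and “day”, which shows that the boundaries force whole-word matches.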
The “.*?” portion lazily matches any sequence of characters (including none), and the “\\.txt$” says that the string must end in “.txt”. Therefore, it will return ‘TRUE’ for anything following the format “[ANY COMBINATION].txt”:
test.ex <- ".*?\\.txt$"
string.vec <- c("file.txt","file.txtt","file.html",".file.txt")
bool <- str_detect(string.vec,test.ex)
bool
[1] TRUE FALSE FALSE TRUE
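Since “.*?” can match an empty string, for detection purposes the pattern behaves the same as the shorter “\\.txt$”. The sketch below checks this on the same test vector; `short.ex` and the two result names are introduced here for illustration.

```r
library(stringr)

string.vec <- c("file.txt", "file.txtt", "file.html", ".file.txt")

full.ex  <- ".*?\\.txt$"   # the pattern from the exercise
short.ex <- "\\.txt$"      # same ending requirement, without the leading ".*?"

full.bool  <- str_detect(string.vec, full.ex)
short.bool <- str_detect(string.vec, short.ex)

# Both patterns flag exactly the same strings
identical(full.bool, short.bool)
```

The leading “.*?” only matters when extracting, where it pulls the characters before “.txt” into the match.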
This regular expression is looking for two digits followed by a forward slash, and then two digits followed by a forward slash, and then four digits. This appears to be used to describe a date:
test.ex <- "\\d{2}/\\d{2}/\\d{4}"
string.vec <- c("09/13/1990","09/13/90","September 13, 1990","9/13/90")
bool <- str_detect(string.vec,test.ex)
bool
[1] TRUE FALSE FALSE FALSE
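If the individual parts of the date are needed, one option is to wrap each part in parentheses and use `str_match()`, which returns the full match and each captured group. This is a sketch with a made-up input string; the grouped pattern `date.ex` is a variation on the pattern above, not part of the exercise.

```r
library(stringr)

# Same pattern as above, with a capture group around each numeric part
date.ex <- "(\\d{2})/(\\d{2})/(\\d{4})"

# Hypothetical input sentence containing one date
date.parts <- str_match("Born on 09/13/1990.", date.ex)

# Column 1 is the whole match; columns 2 to 4 are the captured groups
date.parts

# The pattern is not anchored, so it also matches inside longer digit runs
embedded.bool <- str_detect("109/13/19905", date.ex)
embedded.bool
```

The pattern only checks the shape of the text, so it cannot tell which of the first two fields is the month or whether the values form a valid date.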
The “<(.+?)>” portion matches one or more characters between “<” and “>” and captures them as a group. The middle portion (“.+?”) matches one or more of any character. The “</\\1>” is a backreference: it must match the same text captured by the first group, preceded by a forward slash. This is the shape of a matching pair of opening and closing HTML-style tags.
test.ex <- "<(.+?)>.+?</\\1>"
string.vec <- c("< >hi</ >","< >hi< >","<123r>ryan</123r>","<123r>ryan</123>")
bool <- str_detect(string.vec,test.ex)
bool
[1] TRUE FALSE TRUE FALSE
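The backreference is easier to see on HTML-like input. The snippet below is a sketch using a made-up string (`html.text` and the result names are hypothetical); it extracts each matching tag pair and then pulls out the captured tag name.

```r
library(stringr)

test.ex <- "<(.+?)>.+?</\\1>"

# Hypothetical input with two properly closed tags
html.text <- "<b>bold</b> and <i>italic</i>"

# Every complete open/close pair in the string
tag.pairs <- str_extract_all(html.text, test.ex)[[1]]
tag.pairs

# Column 2 of str_match() holds the first captured group, here the tag name
first.tag <- str_match(html.text, test.ex)[, 2]
first.tag

# A closing tag that differs from the opening tag does not match
mismatch.bool <- str_detect("<b>bold</i>", test.ex)
mismatch.bool
```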
Below is the secret message placed into a variable:
secret.message <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
The first thing I noticed was that there were far fewer upper case letters than lower case letters. I decided to extract only the upper case letters using the “str_extract_all()” function:
str_extract_all(secret.message,"[[:upper:]]")
[[1]]
[1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
[18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"
When looking at the capital letters, it appears to spell “CONGRATULATIONS YOU ARE A SUPER NERD”. However, I don’t know how the spaces are created. I looked through the original message and it appears that all of the periods represent a space. I tested this theory out below:
str_extract_all(secret.message,"[[:upper:].]")
[[1]]
[1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "." "Y"
[18] "O" "U" "." "A" "R" "E" "." "A" "." "S" "U" "P" "E" "R" "N" "E" "R"
[35] "D"
The next step would be to replace all of the periods with spaces using the “str_replace_all()” function:
str_replace_all(unlist(str_extract_all(secret.message,"[[:upper:].]")),pattern="[.]",replacement=" ")
[1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" " " "Y"
[18] "O" "U" " " "A" "R" "E" " " "A" " " "S" "U" "P" "E" "R" "N" "E" "R"
[35] "D"
It is important to note that the “unlist()” function is necessary to create a vector that the “str_replace_all()” function can act on. Finally, we can use the “paste()” function to get the final answer:
paste(str_replace_all(unlist(str_extract_all(secret.message,"[[:upper:].]")),pattern="[.]",replacement=" "),collapse="")
[1] "CONGRATULATIONS YOU ARE A SUPERNERD"
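The nested call can be hard to read, so here is the same sequence of steps broken into named intermediate values. This is a sketch on a short made-up string (`toy.message` is hypothetical and much shorter than the real message), with the period replaced after collapsing rather than before; either order gives the same text.

```r
library(stringr)

# Hypothetical toy message built the same way: capitals and periods carry the text
toy.message <- "h3Hqi7I.t2Tk8He4Ew1Rz5E"

# Step 1: keep only upper case letters and periods
kept.chars <- str_extract_all(toy.message, "[[:upper:].]")[[1]]

# Step 2: glue the characters back into one string
collapsed <- str_c(kept.chars, collapse = "")

# Step 3: turn each literal period into a space; fixed() avoids regex escaping
decoded <- str_replace_all(collapsed, fixed("."), " ")
decoded
```

Indexing with `[[1]]` plays the same role as “unlist()” here, since `str_extract_all()` returns a list with one element per input string.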