References:
http://www.r-datacollection.com/
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
https://rstudio-pubs-static.s3.amazonaws.com/74603_76cd14d5983f47408fdf0b323550b846.html
https://stackoverflow.com/questions/33826650/last-name-first-name-to-first-name-last-name
Problem 3: Copy the introductory example. The vector name stores the extracted names. Solution : (a) Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
library(stringr)
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
extractedNames <- unlist(str_extract_all(raw.data, "[[A-z]., ]{2,}"))
#For debugging: extractedNames
#Now, take the extracted names and convert the names to first and last names, but only to the ones where a comma is present.
#(\\w+) matches the names, ,\\s matches ", "(comma and space), \\2 \\1 returns the names in opposite order.
reverseNames <- unlist(sub("(\\w+),\\s(\\w+)","\\2 \\1", extractedNames))
#reverseNames
#Now, take care of anomalies as a result of the name reversal. NOTE: If there was more than > 1 anomaly put this into a loop.
position <- which(str_detect(reverseNames, "[:upper:]\\s[[:alpha:]]{1}.") == TRUE)[1]
if(position > 0){
reverseNames[position] <- sub("(\\w+)\\s(\\w+).\\s(\\w+)","\\3 \\1. \\2", reverseNames[position])
}
reverseNames
## [1] "Moe Szyslak" "Montgomery C. Burns" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Dr. Julius Hibbert"
titles <- str_detect(extractedNames, pattern="Rev|Dr[.]")
titles
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
middle <- str_detect(extractedNames, "[[:alpha:]]{1}\\, [:upper:]. ")
middle
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
Problem 4: Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression. Solution: (a) [0-9]+$
[0-9] ==> Match any character in set; matches a character in the range of “0” to “9”. Case sensitive
‘+’ ==> Quantfier. Match 1 or more of the preceding token
$ ==> Matches the dollar sign at the end of a number set.
str <- "sadjfalskdfj$9348349$0934830adskjkallkjafklj"
str_extract(str,"[0-9]+\\$")
## [1] "9348349$"
==> Matches a word boundary position between a word character and non-word character or position (start / end of string).
[a-z] ==> Matches a character having a character code between the two specified characters inclusive.
{1,4} ==> Matches the specified quantity of the previous token. {1,3} will match 1 to 3. {3} will match exactly 3. {3,} will match 3 or more.
str <- "ajay $ajay6+elephant dog5959"
str_extract(str,"\\b[a-z]{1,4}\\b")
## [1] "ajay"
. ===> Matches any character except linebreaks. Equivalent to [^\n\r].
? ===> Makes the preceding quantifier lazy, causing it to match as few characters as possible. By default, quantifiers are greedy, and will match as many characters as possible.
.txt===> Matches anything ends in ‘.txt’ characters.
str <- "394839sadksdafkj948490#$#$##^&^&^& file.txt"
unlist(str_extract_all(str, ".*?\\.txt$"))
## [1] "394839sadksdafkj948490#$#$##^&^&^& file.txt"
{2} ===> Matches the specified quantity of the previous token. {1,3} will match 1 to 3. {3} will match exactly 3. {3,} will match 3 or more.
/ ===> Matches a ‘/’ character.
str <- "afkljasdfkljsd09/04/2020jasdfkljasdf"
unlist(str_extract_all(str, "\\d{2}/\\d{2}/\\d{4}"))
## [1] "09/04/2020"
(.+?) ===> Groups multiple tokens together and creates a capture group for extracting a substring or using a backreference.
.*? ===> Discussed in previous example.
/ ===> Matches the ‘/’ character.
/ ===> Matches the results of a capture group. For example matches the results of the first capture group & matches the third.
===> Matches the '>' character.
str_extract("afkljasd<tag>fkljsd09/04/2020ja</tag>sdfkljasdf","<(.+?)>.+?</\\1>")
## [1] "<tag>fkljsd09/04/2020ja</tag>"
Problem 9. The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com. Solution:
Observation: Investigate characters that are capitals.
encodedmessage <- c("clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr")
message <- str_extract_all(encodedmessage,"[[:upper:]]")
message
## [[1]]
## [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"