First, load the required library.
library(stringr)
Problem #3
Copy the introductory example. The vector name stores the extracted names.
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
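To see what the character class in the extraction pattern does, here is a minimal sketch on a shortened input. It uses base R's `regmatches()` and `gregexpr()` in place of `str_extract_all()`, purely so it runs without extra packages; the names `short_raw` and `short_names` are made up for illustration.

```r
# Shortened, illustrative input (not the full raw.data string).
short_raw <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery"

# [[:alpha:]., ]{2,} matches runs of at least two characters drawn from
# letters, periods, commas, and spaces. Digits, hyphens, and parentheses
# end a run.
short_names <- regmatches(short_raw, gregexpr("[[:alpha:]., ]{2,}", short_raw))[[1]]
short_names
```

The `{2,}` quantifier is what keeps the lone space after `(636)` from being returned as a match of its own.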
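# Step 1: remove titles such as "Rev." or "Dr." (two or more letters, a period, a space).
# [A-Za-z] is used instead of [A-z], which also spans the symbols between "Z" and "a".
q1Name <- str_replace(name, "[A-Za-z]{2,}\\. ", "")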
q1Name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Julius Hibbert"
Next, remove second names. A second name appears as an initial: a single letter followed by a period. However, an initial can also stand for a first or last name, so we only remove it from names that have more than two components. Once the title is removed, a name with a second name has three components: first name, second name, and last name.
# lengths() returns the component count per name as a plain integer vector.
q1Name <- ifelse(lengths(str_split(q1Name, " ")) > 2, str_replace(q1Name, " [A-Za-z]\\. ", " "), q1Name)
q1Name
## [1] "Moe Szyslak" "Burns, Montgomery" "Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Julius Hibbert"
Finally, we rearrange names from the last_name, first_name format into the standard first_name last_name format.
q1Name <- str_replace(q1Name, "(\\w+),\\s(\\w+)","\\2 \\1")
q1Name
## [1] "Moe Szyslak" "Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Julius Hibbert"
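The three steps above can be chained into one function. The following is a minimal sketch, not the original solution: `standardize_name` and `example_names` are hypothetical names, and base R's `sub()` stands in for `str_replace()` so the sketch runs without extra packages.

```r
# Hypothetical helper that chains the three clean-up steps.
standardize_name <- function(x) {
  # 1. Drop a title: two or more letters followed by a period and a space.
  x <- sub("[[:alpha:]]{2,}\\. ", "", x)
  # 2. Drop a middle initial, but only when the name has more than two parts.
  has_three <- lengths(strsplit(x, " ", fixed = TRUE)) > 2
  x <- ifelse(has_three, sub(" [[:alpha:]]\\. ", " ", x), x)
  # 3. Swap "last, first" into "first last".
  sub("(\\w+),\\s(\\w+)", "\\2 \\1", x)
}

example_names <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
                   "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")
clean_names <- standardize_name(example_names)
clean_names
```

The order of the steps matters: titles must go first, because otherwise "Rev. Timothy Lovejoy" would count as a three-part name in step 2.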
# Flag names that carry a title (two or more letters followed by a period and a space).
q2Name <- str_detect(name, "[A-Za-z]{2,}\\. ")
q2Name
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
# Remove titles first so they are not counted as name components.
q3Name <- str_replace(name, "[A-Za-z]{2,}\\. ", "")
q3Name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Julius Hibbert"
Next, find all names with more than 2 components:
q3Name <- lengths(str_split(q3Name, " ")) > 2
q3Name
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
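As a cross-check, both logical vectors can be placed side by side with the names. This is a sketch using base R (`grepl()`, `sub()`, `strsplit()`) instead of stringr; the names `example_names`, `has_title`, `has_second`, and `flags` are illustrative assumptions.

```r
example_names <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
                   "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# A title is two or more letters followed by a period and a space.
title_pattern <- "[[:alpha:]]{2,}\\. "
has_title <- grepl(title_pattern, example_names)

# Count components only after the title has been stripped.
without_title <- sub(title_pattern, "", example_names)
has_second <- lengths(strsplit(without_title, " ", fixed = TRUE)) > 2

flags <- data.frame(name = example_names, has_title, has_second)
flags
```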
Problem #4
Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
# Matches one or more digits immediately followed by a literal dollar sign.
pat <- "[0-9]+\\$"
exampleStrs <- c("23 46$", "456s df 4r5$", "gte arf0 01$&#", "$46 %46$", "0$")
str_extract(exampleStrs, pat)
## [1] "46$" "5$" "01$" "46$" "0$"
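The backslashes matter here: an unescaped `$` is an end-of-string anchor, not a dollar sign. A small sketch of the difference, using base R's `grepl()` and a made-up vector `prices`:

```r
prices <- c("46$", "46")

# Escaped: digits followed by a literal "$" character.
escaped <- grepl("[0-9]+\\$", prices)

# Unescaped: digits located at the very end of the string.
anchored <- grepl("[0-9]+$", prices)

rbind(escaped, anchored)
```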
# Matches a whole word of one to four lowercase letters, bounded by word boundaries.
pat <- "\\b[a-z]{1,4}\\b"
exampleStrs <- c("hts tht", "456s df 4r5$", "gte arf0 01$&#", "g s hs f", "rshg")
str_extract(exampleStrs, pat)
## [1] "hts" "df" "gte" "g" "rshg"
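`str_extract()` returns only the first match per string. To see every match, and to see why `456s` and `4r5` are skipped, here is a sketch with base R's `gregexpr()` in Perl mode; the vector `x` and the name `all_words` are illustrative.

```r
x <- c("456s df 4r5$", "g s hs f")

# Digits count as word characters, so there is no word boundary between
# "6" and "s" in "456s", or around "r" in "4r5".
all_words <- regmatches(x, gregexpr("\\b[a-z]{1,4}\\b", x, perl = TRUE))
all_words
```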
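# Matches any run of characters (possibly empty) followed by a literal ".txt" at the end of the string.
pat <- ".*?\\.txt$"
exampleStrs <- c("thisfile.txt", "whateverfile.txt", "...txt", "gra$#%txt%$.txt", ".txt")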
str_extract(exampleStrs, pat)
## [1] "thisfile.txt" "whateverfile.txt" "...txt"
## [4] "gra$#%txt%$.txt" ".txt"
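All five examples above match, so it may help to see strings that do not. A short sketch with base R's `grepl()`; the file names are made up.

```r
files <- c("notes.txt", "notes.txt.bak", "notestxt")

# The "$" anchor requires ".txt" to be the final four characters, and the
# escaped "\\." requires a real period before "txt".
ends_in_txt <- grepl(".*?\\.txt$", files, perl = TRUE)
ends_in_txt
```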
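# Matches two digits, a slash, two digits, a slash, and four digits (a date-like shape; values are not validated).
pat <- "\\d{2}/\\d{2}/\\d{4}"
exampleStrs <- c("today is 09/08/2019", "09/09/2019 will be tomorrow", "Jim's birthday is 02/05/2015", "13/32/9999 is not a date", "h545rwajh88/23/7355/234/hgw")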
str_extract(exampleStrs, pat)
## [1] "09/08/2019" "09/09/2019" "02/05/2015" "13/32/9999" "88/23/7355"
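The pattern checks only the shape, which is why `13/32/9999` is extracted. One way to separate shape from validity is to parse the candidates with `as.Date()`. This is a sketch that assumes a month/day/year order, which the exercise does not specify.

```r
candidates <- c("02/05/2015", "13/32/9999")

# Shape check: the same pattern, anchored to the whole string.
looks_like_date <- grepl("^\\d{2}/\\d{2}/\\d{4}$", candidates)

# Validity check: as.Date() yields NA when the values are out of range.
# The "%m/%d/%Y" format is an assumption about the intended field order.
parsed <- as.Date(candidates, format = "%m/%d/%Y")
is_real_date <- !is.na(parsed)

data.frame(candidates, looks_like_date, is_real_date)
```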
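# Matches an opening tag <...>, at least one character of content, and a closing tag </...>
# whose name repeats the text captured from the opening tag (backreference \\1).
pat <- "<(.+?)>.+?</\\1>"
exampleStrs <- c("dhj <a>abc</a>bgfhf", "<a><b>fgerger</a></b>", "<a><b>fgerger</b></b>", "<a><b>fgerger</b></a>", "<div 1-2> RR</aaa>RRR </div 1-2>")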
str_extract(exampleStrs, pat)
## [1] "<a>abc</a>" "<a><b>fgerger</a>"
## [3] "<b>fgerger</b>" "<a><b>fgerger</b></a>"
## [5] "<div 1-2> RR</aaa>RRR </div 1-2>"
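To make the backreference visible, the captured tag name can be pulled out alongside the match. A minimal sketch using base R's `regexec()` in Perl mode; `tags`, `m`, and `tag_name` are illustrative names.

```r
tags <- c("<a>abc</a>", "<a>abc</b>")

# regexec() returns the full match followed by each capture group.
m <- regmatches(tags, regexec("<(.+?)>.+?</\\1>", tags, perl = TRUE))

# Element 2 is the text captured by (.+?); it is NA when nothing matched.
tag_name <- vapply(m, function(v) v[2], character(1))
tag_name
```

The second string fails because the closing tag `</b>` does not repeat the captured name `a`.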
Problem #9
The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.
enCodedMessage <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
enCodedMessage
## [1] "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo\nUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO\nd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5\nfy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
The hint says that some characters are more revealing than others, so we first split the characters by type: uppercase letters, lowercase letters, punctuation, and digits.
paste(unlist(str_extract_all(enCodedMessage, "[[:upper:]]")), collapse = "")
## [1] "CONGRATULATIONSYOUAREASUPERNERD"
paste(unlist(str_extract_all(enCodedMessage, "[[:lower:]]")), collapse = "")
## [1] "clcopowzmstcdwnkigvdicpuggvhrynjuwczihqrfpxsjdwpnanwowisdijjkpfdrcocbtyczjataootjtjnecfekrwwwojigdvrfrbzbknbhzgvizcropwgnbqofaotfbwmktszqefyndtkcfgmcgxonhkgr"
paste(unlist(str_extract_all(enCodedMessage, "[[:punct:]]")), collapse = "")
## [1] "....!"
paste(unlist(str_extract_all(enCodedMessage, "[[:digit:]]")), collapse = "")
## [1] "1087792855078035307553364116224905651724639589659490545"
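The four extractions differ only in the character class, so they can be written as one loop over a named vector of classes. The sketch below uses base R and a short made-up string, `toy`, rather than the real message; `classes` and `parts` are illustrative names.

```r
# Made-up stand-in for the encoded message.
toy <- "a1Hb2Ic3.d4Oe5Kf!"

classes <- c(upper = "[[:upper:]]", lower = "[[:lower:]]",
             digit = "[[:digit:]]", punct = "[[:punct:]]")

# For each class, collect every matching character and glue them together.
parts <- vapply(classes, function(p) {
  paste(regmatches(toy, gregexpr(p, toy))[[1]], collapse = "")
}, character(1))
parts
```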
The uppercase letters clearly spell out a message. Adding the punctuation back in separates the words.
# A single bracket expression keeps uppercase letters and punctuation in their original order.
deCodedMessage <- paste(unlist(str_extract_all(enCodedMessage, "[[:upper:][:punct:]]")), collapse = "")
deCodedMessage
## [1] "CONGRATULATIONS.YOU.ARE.A.SUPERNERD!"
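An alternative to extracting and pasting is to delete everything that is not wanted. This is a sketch of that idea with base R's `gsub()` and a negated bracket expression, applied to a made-up string rather than the real message.

```r
# Made-up stand-in for the encoded message, including a line break.
toy <- "a1Hb2Ic3.\nd4Oe5Kf!"

# Remove every character that is neither an uppercase letter nor punctuation.
# The embedded newline is removed as well.
decoded <- gsub("[^[:upper:][:punct:]]", "", toy)
decoded
```

This version returns a single string directly, so no `unlist()` or `paste()` step is needed.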