Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to problems 3 and 4 from chapter 8 of Automated Data Collection in R. Problem 9 is extra credit. You may work in a small group, but please submit separately with names of all group participants in your submission.
Here is the referenced code for the introductory example in #3:
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
library(stringr)
Q3. Copy the introductory example. The vector name stores the extracted names.
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
(a) Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
To move the last name behind the first name in “LastName, FirstName” entries and then strip any leading title (the initial in “C. Montgomery” is kept):
flname <- str_replace(str_replace(name, "(\\w+)(, )(.+)", "\\3 \\1"), "^([A-Z][a-z]+\\.)( )(.+)", "\\3")
flname
## [1] "Moe Szyslak" "C. Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Julius Hibbert"
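The nested call above can be unpacked into two steps, which may be easier to read. This is a minimal sketch: the names swapped and no_title are hypothetical, and the name vector is retyped from the output shown earlier so the block runs on its own.

```r
library(stringr)

# The extracted names, retyped from the output above.
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Step 1: turn "Last, First" into "First Last".
# Entries without a comma do not match and are returned unchanged.
swapped <- str_replace(name, "(\\w+), (.+)", "\\2 \\1")

# Step 2: drop a leading title such as "Rev. " or "Dr. ".
# "C." is not removed because the pattern needs at least one lowercase letter.
no_title <- str_replace(swapped, "^[A-Z][a-z]+\\. ", "")

no_title
```

Splitting the steps makes it possible to inspect swapped before the titles are removed.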
(b) Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
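Two approaches are shown: str_detect returns the logical vector the question asks for, while pmatch returns the positions of the titled names rather than a logical vector: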
title1 <- str_detect(name, "[A-Z][a-z]+\\.")
title1
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
name[title1]
## [1] "Rev. Timothy Lovejoy" "Dr. Julius Hibbert"
title2 <- pmatch(c("Rev.","Dr."), name)
title2
## [1] 3 6
name[title2]
## [1] "Rev. Timothy Lovejoy" "Dr. Julius Hibbert"
Both approaches show that the third and sixth names have a title: “Rev. Timothy Lovejoy” and “Dr. Julius Hibbert”.
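Since the question asks for a logical vector, the positions returned by pmatch can be converted into one. This is a sketch using base R only; has_title and title_idx are hypothetical names.

```r
# The extracted names, retyped from the output above.
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# pmatch() gives the position of the single element that starts with each title.
title_idx <- pmatch(c("Rev.", "Dr."), name)

# TRUE where an element's position is among the matched positions.
has_title <- seq_along(name) %in% title_idx

has_title
```

Note that pmatch only matches at the start of an element and needs the match to be unique, so it would not scale to several names sharing the same title.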
(c) Construct a logical vector indicating whether a character has a second name.
Detect a second name in the original vector name (a capital letter immediately followed by a period) and return the result as a logical vector:
Sname1 <- str_detect(name, "[A-Z]+?\\.")
Sname1
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
flname[Sname1]
## [1] "C. Montgomery Burns"
Detect a second name in the rearranged vector flname, where the titles have already been removed, so any word followed by a period must be an initial:
Sname2 <- str_detect(flname, "\\w+\\.")
Sname2
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
flname[Sname2]
## [1] "C. Montgomery Burns"
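One possible variant, offered only as a sketch, is a pattern that looks for a single capital letter standing alone before a period. It assumes second names always appear as one-letter initials, which holds for this data but not in general; has_initial is a hypothetical name.

```r
library(stringr)

# The extracted names, retyped from the output above.
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# \\b[A-Z]\\. : a word boundary, exactly one capital letter, then a period.
# "Rev." and "Dr." fail because a lowercase letter sits between the capital and the period.
has_initial <- str_detect(name, "\\b[A-Z]\\.")

has_initial
```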
Q4. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
(a) [0-9]+\$ (Note: There are two backslashes in front of the dollar sign)
Pattern(a): One or more digits immediately followed by a literal dollar sign, anywhere in the string.
## (a) [0-9]+\\$
unlist(str_extract_all(c("a123b234c78$d0986", "6789$", "000ab$", "000$abc"), "[0-9]+\\$"))
## [1] "78$" "6789$" "000$"
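The escaping matters because an unescaped dollar sign is an end-of-string anchor rather than a character. A small sketch of the difference (escaped and anchored are hypothetical names):

```r
library(stringr)

x <- c("6789$", "6789")

# Escaped: digits followed by a literal "$".
escaped <- str_detect(x, "[0-9]+\\$")

# Unescaped: digits that sit at the very end of the string.
anchored <- str_detect(x, "[0-9]+$")

escaped
anchored
```

The first string matches only the escaped pattern, and the second matches only the anchored one.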
(b) \b[a-z]{1,4}\b (Note: There are two backslashes in front of both b’s)
Pattern(b): Any whole word made up of one to four lowercase letters. The word boundaries (\b) mean that no letter, digit, or underscore may sit directly before or after the match.
## (b) \\b[a-z]{1,4}\\b
str_extract_all(c("Michael's a baby boy.", "When Kay has $789,", "This is Data607 assignment 3 question four"), "\\b[a-z]{1,4}\\b")
## [[1]]
## [1] "s" "a" "baby" "boy"
##
## [[2]]
## [1] "has"
##
## [[3]]
## [1] "is" "four"
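To see what the word boundaries contribute, the same character class can be run with and without them. This is an illustrative sketch; with_b and without_b are hypothetical names.

```r
library(stringr)

x <- "question four"

# With boundaries: only whole words of one to four lowercase letters.
with_b <- str_extract_all(x, "\\b[a-z]{1,4}\\b")[[1]]

# Without boundaries: longer words are chopped into pieces of up to four letters.
without_b <- str_extract_all(x, "[a-z]{1,4}")[[1]]

with_b
without_b
```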
(c) .*?\.txt$ (Note: There is a pair of backslashes)
Pattern(c): “.*?” matches any characters, zero or more times, taking as few as possible (a lazy quantifier); “\.txt” is a literal “.txt”; and “$” anchors the match at the end of the string. In short, it matches any string that ends with “.txt”, including “.txt” itself.
## (c) .*?\\.txt$
unlist(str_extract_all(c("assignment3.txt", "Q4.text", "607.txt.txt", "Data607.txtbook", "textbook.text"), ".*?\\.txt$"))
## [1] "assignment3.txt" "607.txt.txt"
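The lazy quantifier only changes the result when the end anchor is missing. A short sketch of that interaction (lazy_only and lazy_anchored are hypothetical names):

```r
library(stringr)

x <- "607.txt.txt"

# Without "$", the lazy quantifier stops at the first ".txt".
lazy_only <- str_extract(x, ".*?\\.txt")

# With "$", the match has to reach the end of the string, so it grows.
lazy_anchored <- str_extract(x, ".*?\\.txt$")

lazy_only
lazy_anchored
```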
(d) \d{2}/\d{2}/\d{4} (Note: There are 3 pairs of backslashes)
Pattern(d): Two digits, a slash, two digits, a slash, and four digits, i.e. the shape 00/00/0000 where each 0 can be any digit.
## (d) \\d{2}/\\d{2}/\\d{4}
unlist(str_extract_all(c("today is 9/14/2019", "Christmas is 12/25", "01/01/2020 is new year.", "01/01/20 returns false"), "\\d{2}/\\d{2}/\\d{4}"))
## [1] "01/01/2020"
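The pattern checks only the shape of the string, not whether it is a real date. A brief sketch to illustrate (looks_like_date is a hypothetical name):

```r
library(stringr)

x <- c("99/99/9999", "12/25/2019", "1/1/2020")

# TRUE for anything shaped like two digits / two digits / four digits.
looks_like_date <- str_detect(x, "\\d{2}/\\d{2}/\\d{4}")

looks_like_date
```

The impossible date “99/99/9999” passes, while “1/1/2020” fails because it lacks the leading zeros.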
(e) <(.+?)>.+?</\1> (Note: There are two backslashes in front of “1”)
Pattern(e):
A backreference matches the same text that an earlier parenthesized group captured; here “\1” must repeat exactly what the first group matched.
Pattern (e) matches “<”, then one or more characters taken lazily (as few as possible; this is the captured group), then “>”, then one or more characters taken lazily, then “</”, then the captured group's text again, and finally “>”.
In short: <tag>content</tag>, where the opening and closing tag names are identical and the content is at least one character long.
## (e) <(.+?)>.+?</\\1>
unlist(str_extract_all(c("<n>test</n>", "<group>123123</group>", "<it>is</false>", "</1234>", "<abcd></abcd>"), "<(.+?)>.+?</\\1>"))
## [1] "<n>test</n>" "<group>123123</group>"
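To make the captured group visible, str_match can be used in place of str_extract_all; it returns the full match in the first column and each group in the following columns. A minimal sketch, with m as a hypothetical name:

```r
library(stringr)

x <- c("<n>test</n>", "<it>is</false>")

# Column 1 holds the whole match, column 2 the text captured by group 1.
m <- str_match(x, "<(.+?)>.+?</\\1>")

m
```

The second string yields NA in both columns because its closing tag does not repeat the opening tag name.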
Q9. The following code hides a secret message. Crack it with R and regular expressions.
Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.
clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr
code <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
secretmsg <- unlist(str_extract_all(code, "[[:upper:][:punct:]]"))
secretmsg
## [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "." "Y"
## [18] "O" "U" "." "A" "R" "E" "." "A" "." "S" "U" "P" "E" "R" "N" "E" "R"
## [35] "D" "!"
secretmsg <- str_replace_all(str_c(secretmsg, sep="", collapse=""), "\\.", " ")
secretmsg
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD!"
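The same message can also be recovered by deleting everything that is not an uppercase letter or punctuation, instead of extracting what is. This is a sketch in base R; the code string is retyped here and joined with paste0 (without line breaks) so the block runs on its own, and decoded is a hypothetical name.

```r
# The coded text, retyped from the problem statement.
code <- paste0(
  "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo",
  "Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO",
  "d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5",
  "fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
)

# Remove every character that is neither uppercase nor punctuation.
kept <- gsub("[^[:upper:][:punct:]]", "", code)

# Turn the periods into spaces.
decoded <- chartr(".", " ", kept)

decoded
```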