Question 1
Copy the introductory example. The vector name stores the extracted names.
R> name
[1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
Construct a logical vector indicating whether a character has a second name.
Answer 1
# loading stringr, which provides the str_* functions used below
library(stringr)
# creating the example raw data
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
raw.data
## [1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
# creating the example name list
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
Answer (a)
Looking at the name list, we need to pay attention to the following:
Title of a person
Spaces
Order of the first name and the last name
Middle Initial of an individual.
Commas
# removing the comma
names_without_comma <- str_replace_all(name, pattern = ",", replacement = "")
names_without_comma
## [1] "Moe Szyslak" "Burns C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson Homer" "Dr. Julius Hibbert"
# removing the title
names_without_titles <- str_replace_all(names_without_comma, pattern = "[[:alpha:]]{2,}\\. ", replacement = "")
names_without_titles
## [1] "Moe Szyslak" "Burns C. Montgomery" "Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson Homer" "Julius Hibbert"
# removing the middle initials (one or more letters followed by a period and a space)
names_without_middle_names <- str_replace_all(names_without_titles, pattern = "[[:alpha:]]{1,}\\. ", replacement = "")
names_without_middle_names
## [1] "Moe Szyslak" "Burns Montgomery" "Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson Homer" "Julius Hibbert"
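# displaying the names as first_name last_name
# only names originally written as "last_name, first_name" (those containing a comma) are swapped;
# swapping every element would reverse names that were already in the right order
has_comma <- str_detect(name, ",")
first_name_last_name <- ifelse(has_comma, str_replace(names_without_middle_names, pattern = "(\\w+)\\s+(\\w+)", replacement = "\\2 \\1"), names_without_middle_names)
first_name_last_name
## [1] "Moe Szyslak" "Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Julius Hibbert"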
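The same cleanup can also be sketched in a different order: swap the comma-separated names first, so that names already written as first_name last_name are never touched. This is only a minimal sketch; the intermediate names (swapped, no_titles, clean_names) are illustrative, and the title pattern assumes a title is two or more letters followed by a period at the start of the string.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# 1. turn "last_name, first_name" into "first_name last_name";
#    elements without a comma do not match and are returned unchanged
swapped <- str_replace(name, "^(.+), (.+)$", "\\2 \\1")

# 2. drop a leading title (assumption: two or more letters plus a period)
no_titles <- str_replace(swapped, "^[[:alpha:]]{2,}\\. ", "")

# 3. drop a single-letter initial such as "C. "
clean_names <- str_replace(no_titles, "\\b[[:alpha:]]\\. ", "")
clean_names
```

Because the swap is driven by the comma itself, no separate bookkeeping of which elements need reordering is required.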
Answer (b) Logical Vector indicating whether a character has a title
# using str_detect to return TRUE or FALSE for the existence of a title
str_detect(names_without_comma, "[[:alpha:]]{2,}\\.")
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
Answer (c) Logical Vector indicating whether a character has a middle name
# using str_detect to return TRUE or FALSE for the existence of a middle name
str_detect(names_without_titles, "[[:alpha:]]{1,}\\.")
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
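Both logical vectors can also be computed directly from the original name vector, without relying on the intermediate cleanup steps. The sketch below assumes that the only titles are Rev. and Dr. (as stated in the question) and that a second name appears as a single capital letter followed by a period; the column names has_title and has_second_name are illustrative.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# title: the string starts with "Rev." or "Dr."
has_title <- str_detect(name, "^(Rev|Dr)\\.")

# second name: a single capital letter at a word boundary, followed by a period
has_second_name <- str_detect(name, "\\b[[:upper:]]\\.")

# side-by-side view of the names and both indicators
data.frame(name, has_title, has_second_name)
```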
Question 2
Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
[0-9]+\$
\b[a-z]{1,4}\b
.*?\.txt$
\d{2}/\d{2}/\d{4}
<(.+?)>.+?</\1>
Answer 2
Answer (a)
This matches one or more digits (0-9) immediately followed by a literal dollar sign. The backslash escapes $, so it is not treated as the end-of-string anchor. Example: 589$
## [1] "In Europe they use $ sign at the end of the numbers, for example 435$"
## [[1]]
## [1] "435$"
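A call along the following lines is consistent with the printed result; the object name example_a is an assumption, and the sentence is copied from the output shown.

```r
library(stringr)

# example sentence taken from the output shown; the object name is illustrative
example_a <- "In Europe they use $ sign at the end of the numbers, for example 435$"

# in an R string the backslash itself must be escaped, hence "\\$"
dollar_matches <- unlist(str_extract_all(example_a, "[0-9]+\\$"))
dollar_matches
```

The lone $ earlier in the sentence is not matched because no digit precedes it.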
Answer (b)
This matches whole words made of 1 to 4 lowercase letters. The \b on each side marks a word boundary, so longer words and words containing uppercase letters or digits are skipped. Example: the
## [[1]]
## [1] "they" "use" "sign" "at" "the" "end" "of" "the" "for"
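The word-boundary pattern can be tried on the same example sentence; this is a sketch in which the object name example_b is an assumption.

```r
library(stringr)

# same example sentence as in part (a); the object name is illustrative
example_b <- "In Europe they use $ sign at the end of the numbers, for example 435$"

# \\b anchors both ends of the match at a word boundary
short_words <- unlist(str_extract_all(example_b, "\\b[a-z]{1,4}\\b"))
short_words
```

"In" is skipped because of its capital letter, and "Europe", "numbers" and "example" are skipped because they are longer than four letters.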
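Answer (c)
This matches any string that ends in .txt: .*? lazily matches any characters (possibly none), \. is a literal period, and $ anchors the match at the end of the string. Example: info.txt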
## [1] "We have a lot of files named info.txt"
## [[1]]
## [1] "We have a lot of files named info.txt"
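To illustrate the effect of the $ anchor, the pattern can be run against one string that ends in .txt and one that does not. This is a sketch; the second string and the object names are made up for illustration.

```r
library(stringr)

# first string is from the output shown; the second is a hypothetical counterexample
file_strings <- c("We have a lot of files named info.txt", "info.txt.bak")

# TRUE only when ".txt" sits at the very end of the string
ends_with_txt <- str_detect(file_strings, ".*?\\.txt$")
ends_with_txt
```

Although .*? is lazy, the match starts at the leftmost possible position, which is why the whole sentence is returned for the first string rather than just info.txt.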
Answer (d)
The string consists of a 2-digit number followed by "/", another 2-digit number followed by "/", and a 4-digit number. This looks like a date in day/month/year or month/day/year format.
web_date <- "The last day for enrollment to the class was 08/27/2019"
str_extract_all(web_date, "\\d{2}/\\d{2}/\\d{4}")
## [[1]]
## [1] "08/27/2019"
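Note that the pattern only checks the shape of the string, not whether it is a valid calendar date. A small sketch with made-up candidate strings shows this:

```r
library(stringr)

# hypothetical test strings: a real date, an impossible date, and a one-digit month
candidates <- c("08/27/2019", "99/99/9999", "8/27/2019")

# the pattern requires exactly two digits, two digits and four digits
looks_like_date <- str_detect(candidates, "\\d{2}/\\d{2}/\\d{4}")
looks_like_date
```

The impossible date 99/99/9999 is accepted, while 8/27/2019 is rejected because the month has only one digit.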
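Answer (e)
This matches an opening tag such as <p>, followed by at least one character, followed by a closing tag with the same name, such as </p>. The backreference \1 repeats whatever the group (.+?) captured. This looks like an HTML element.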
web_code <- "The way you construct the html body is that you start with <p>These pretzels are making me thirsty!</p>"
str_extract_all(web_code, "<(.+?)>.+?</\\1>")
## [[1]]
## [1] "<p>These pretzels are making me thirsty!</p>"
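The role of the backreference becomes clearer when the captured group is inspected and a mismatched tag pair is tried. The following is a sketch; the second string and the object names are made up for illustration.

```r
library(stringr)

# first element is from the example; the second has mismatched tags (hypothetical)
html_examples <- c("<p>These pretzels are making me thirsty!</p>", "<b>bold</i>")

# str_match returns the full match in column 1 and the captured group in column 2
tag_match <- str_match(html_examples, "<(.+?)>.+?</\\1>")
tag_match
```

For the first string the captured group is p; the second string yields NA because the closing tag does not repeat the opening tag's name.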
Question 9
The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.
clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr
Answer 9
# creating the message object
scripted_message <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
scripted_message
## [1] "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
## [[1]]
## [1] "1" "0" "8" "7" "7" "9" "2" "8" "5" "5" "0" "7" "8" "0" "3" "5" "3"
## [18] "0" "7" "5" "5" "3" "3" "6" "4" "1" "1" "6" "2" "2" "4" "9" "0" "5"
## [35] "6" "5" "1" "7" "2" "4" "6" "3" "9" "5" "8" "9" "6" "5" "9" "4" "9"
## [52] "0" "5" "4" "5"
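The listing of digits above is shown without the call that produced it. A call along these lines would extract every digit from the message; the exact pattern used originally is an assumption, as is the object name digits.

```r
library(stringr)

scripted_message <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# assumption: the digits were pulled out one character at a time
digits <- unlist(str_extract_all(scripted_message, "[[:digit:]]"))
length(digits)
```

The digits do not form an obvious pattern, which is why the search moves on to other character classes.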
# let's see if we can find anything in the capital letters
str_extract_all(scripted_message, "[[:upper:]]")
## [[1]]
## [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"
It says "Congratulations you are a super nerd", but each character is shown individually.
# combining the separate characters, keeping the periods
cat(unlist(str_extract_all(scripted_message, "[[:upper:].]")))
## C O N G R A T U L A T I O N S . Y O U . A R E . A . S U P E R N E R D
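To get a single readable string, the extracted characters can be pasted together and the periods turned into spaces. This is a sketch that treats the periods as word separators and also keeps the trailing exclamation mark; both choices are assumptions, and the object names are illustrative.

```r
library(stringr)

scripted_message <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# keep capital letters, periods and the exclamation mark
revealing_chars <- unlist(str_extract_all(scripted_message, "[[:upper:].!]"))

# glue the characters together, then treat each period as a space
message_text <- str_replace_all(paste(revealing_chars, collapse = ""), "\\.", " ")
message_text
```

The message contains no period between SUPER and NERD, so the last two words stay joined.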