- Copy the introductory example. The vector name stores the extracted names.
- Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
- Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
- Construct a logical vector indicating whether a character has a second name.
library(stringr)
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
#a Rearrange the vector so that all elements conform to the standard first_name last_name
# Entries written as "last, first" are swapped in one step; the comma and space are dropped by the pattern itself.
name <- str_replace(name, "(.+), (.+)$", "\\2 \\1")
#b A logical vector indicating whether a character has a title (i.e., Rev. and Dr.)
title <- str_detect(name, "[[:alpha:]]{2,}\\.")
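#c A logical vector indicating whether a character has a second name.
# In this data the only second name is the abbreviated initial "C.", so detect a single uppercase letter followed by a period.
middle_name <- str_detect(name, "[[:upper:]]\\.")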
print(data.frame(name, title, middle_name))
##                   name title middle_name
## 1          Moe Szyslak FALSE       FALSE
## 2  C. Montgomery Burns FALSE        TRUE
## 3 Rev. Timothy Lovejoy  TRUE       FALSE
## 4         Ned Flanders FALSE       FALSE
## 5        Homer Simpson FALSE       FALSE
## 6   Dr. Julius Hibbert  TRUE       FALSE
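The patterns in (b) and (c) differ only in what must precede the period: two or more letters for a title, a single uppercase letter for an initial. The sketch below re-checks that distinction in base R, with `grepl` standing in for `str_detect`. The vector `people` is a hypothetical hand-typed copy of the cleaned names, and the added `\\b` word boundary is an assumption about how one might guard against matching the last letter of a longer word.

```r
# Hypothetical copy of the cleaned names from part (a)
people <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
            "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Title: two or more letters directly before a period (Rev., Dr.)
has_title <- grepl("[[:alpha:]]{2,}\\.", people, perl = TRUE)

# Initial: exactly one uppercase letter, starting at a word boundary, before a period (C.)
has_initial <- grepl("\\b[[:upper:]]\\.", people, perl = TRUE)

print(data.frame(people, has_title, has_initial))
```

Because the initial pattern requires an uppercase letter right before the period, "Rev." and "Dr." are not mistaken for initials.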
- Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
#a [0-9]+\\$
# This regular expression matches one or more digits immediately followed by a literal dollar sign.
q4a <- "We write dollar amount as $1000 not 1000$"
unlist(str_extract_all(q4a, "[0-9]+\\$"))
## [1] "1000$"
#b \\b[a-z]{1,4}\\b
# This regular expression matches whole words made up of one to four lowercase letters.
q4b <- "The Best preparation for tomorrow is doing your best today"
unlist(str_extract_all(q4b, "\\b[a-z]{1,4}\\b"))
## [1] "for" "is" "your" "best"
#c .*?\\.txt$
# This regular expression matches strings that end in .txt; the lazy .*? takes whatever precedes the extension.
q4c <- c("This is a text", "This is my file.txt")
unlist(str_extract_all(q4c, ".*?\\.txt$"))
## [1] "This is my file.txt"
#d \\d{2}/\\d{2}/\\d{4}
# This regular expression matches two digits, a forward slash, two digits, a forward slash, and four digits, in that order (for example, a date such as 09/05/2018).
q4d <- c("This is a date 09/05/2018", "This is a division 2/4/8")
unlist(str_extract_all(q4d, "\\d{2}/\\d{2}/\\d{4}"))
## [1] "09/05/2018"
#e <(.+?)>.+?</\\1>
# This regular expression matches an opening tag of one or more characters in angle brackets, then one or more characters of content, then a closing tag whose name repeats the opening tag exactly (the backreference \\1).
q4e <- c("<tag> </tag>", "<tag>Hello World!</tag>", "<taG>Hello World!</tag>")
unlist(str_extract_all(q4e, "<(.+?)>.+?</\\1>"))
## [1] "<tag> </tag>" "<tag>Hello World!</tag>"
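To see what the backreference `\\1` is compared against, it can help to pull out the captured tag name itself. This is a minimal base R sketch rather than part of the original solution: the vector `tags`, the second capture group around the content, and the helper name `tag_names` are all assumptions made for illustration.

```r
# Hypothetical test strings; the last one has mismatched case in the closing tag
tags <- c("<tag> </tag>", "<b>bold</b>", "<taG>Hello World!</tag>")

# Same pattern as in (e), with the content wrapped in a second capture group
pattern <- "<(.+?)>(.+?)</\\1>"

# regexec() returns the full match followed by each captured group;
# strings that do not match yield an empty character vector
m <- regmatches(tags, regexec(pattern, tags, perl = TRUE))

# Keep the first captured group (the tag name), or NA when nothing matched
tag_names <- vapply(m, function(x) if (length(x)) x[2] else NA_character_,
                    character(1))
print(tag_names)
```

The third string fails because the backreference is case-sensitive: `taG` in the opening tag is not repeated exactly in the closing tag.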
- The following code hides a secret message. Crack it with R and regular expressions.
q9 <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
# At a glance, the code is a mixture of digits, lowercase letters, uppercase letters, and some punctuation.
# unlist(str_extract_all(q9, "\\d"))
# unlist(str_extract_all(q9, "[:lower:]"))
# unlist(str_extract_all(q9, "[:punct:]"))
# unlist(str_extract_all(q9, "[:upper:]"))
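secret <- unlist(str_extract_all(q9, "[[:upper:].!]")) # keep only uppercase letters, periods, and exclamation marks; this could be the message, so let's clean it
paste(str_replace_all(secret, "[.]", " "), collapse = "") # the periods act as word separators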
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD!"
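The same idea can be turned around: instead of extracting the characters to keep, delete everything else. The sketch below does this in base R on a short made-up string (`toy` is hypothetical, not the exercise's code), assuming as above that uppercase letters carry the message and periods separate the words.

```r
# Hypothetical scrambled string, much shorter than the exercise's q9
toy <- "xx1Hk2Iq.9Tz0Hw4Ep5Rm6E!u"

# Delete every character that is not an uppercase letter, a period, or an exclamation mark
kept <- gsub("[^[:upper:].!]", "", toy)

# Replace the literal periods with spaces; fixed = TRUE turns off regex interpretation
decoded <- gsub(".", " ", kept, fixed = TRUE)
print(decoded)
```

A negated character class keeps the result as a single string, so no `paste(..., collapse = "")` step is needed.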