3. Copy the introductory example. The vector name stores the extracted names.
library(stringr)
#input the raw data
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
#print the names extracted from the raw data
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
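The quantifier {2,} in the pattern above matters because the character class also contains a space. As a small sketch (the variable names loose and strict are illustrative, not part of the exercise), comparing it with a plain + shows what would otherwise slip through:

```r
library(stringr)

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# "+" means "one or more", so the single spaces inside the phone numbers
# "(636) 555-0113" and "555 8904" are matched as well
loose <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]+"))

# "{2,}" means "two or more", which drops those one-character matches
strict <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

length(loose)           # 8
length(strict)          # 6
setdiff(loose, strict)  # " "
```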
(a) Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
#split the name list by the "," delimiter
#only names with last name, first name format will be split
#names with first name - last name format will be untouched
sp_name<-str_split(name,",")
sp_name
## [[1]]
## [1] "Moe Szyslak"
##
## [[2]]
## [1] "Burns" " C. Montgomery"
##
## [[3]]
## [1] "Rev. Timothy Lovejoy"
##
## [[4]]
## [1] "Ned Flanders"
##
## [[5]]
## [1] "Simpson" " Homer"
##
## [[6]]
## [1] "Dr. Julius Hibbert"
#swap the two pieces of every name that was split at the comma
for (i in seq_along(sp_name)) {
  if (length(sp_name[[i]]) > 1) {
    temp <- sp_name[[i]][1]
    sp_name[[i]][1] <- sp_name[[i]][2]
    sp_name[[i]][2] <- temp
  }
}
sp_name
## [[1]]
## [1] "Moe Szyslak"
##
## [[2]]
## [1] " C. Montgomery" "Burns"
##
## [[3]]
## [1] "Rev. Timothy Lovejoy"
##
## [[4]]
## [1] "Ned Flanders"
##
## [[5]]
## [1] " Homer" "Simpson"
##
## [[6]]
## [1] "Dr. Julius Hibbert"
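The swapped list still holds two separate pieces (with a leading space) for the names that contained a comma. One possible way to finish the rearrangement into a single character vector is sketched below; the names pieces, std_name and std_name2 are illustrative, and titles such as Rev. and Dr. are deliberately left in place.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Option 1: split on the comma, reverse the pieces, glue them back together
# with a space, and trim the leftover leading whitespace
pieces <- str_split(name, ",")
std_name <- vapply(pieces,
                   function(p) str_trim(str_c(rev(p), collapse = " ")),
                   character(1))

# Option 2: a single replacement with backreferences;
# \\1 is the part before ", " and \\2 the part after it
std_name2 <- str_replace(name, "^(.+), (.+)$", "\\2 \\1")

std_name
```

Both options should give "C. Montgomery Burns" and "Homer Simpson" for the two comma-separated entries and leave the other four names unchanged.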
(b) Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
#escape the dots so they match a literal period rather than any character
title_name <- str_detect(name, "Rev\\.|Dr\\.|Mr\\.|Ms\\.|Mrs\\.")
comb_list <- data.frame (name, title_name)
comb_list
## name title_name
## 1 Moe Szyslak FALSE
## 2 Burns, C. Montgomery FALSE
## 3 Rev. Timothy Lovejoy TRUE
## 4 Ned Flanders FALSE
## 5 Simpson, Homer FALSE
## 6 Dr. Julius Hibbert TRUE
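In a regular expression an unescaped dot matches any character, so a pattern such as Dr. also matches the first three letters of a name like Drew. The sketch below uses made-up test names (not from the exercise data) to show the difference between an unescaped and an escaped pattern; the word boundary \\b is an extra safeguard added here as an assumption about what a stricter check might look like.

```r
library(stringr)

# Hypothetical names, chosen only to expose the difference
test_names <- c("Drew Carter", "Dr. Julius Hibbert", "Mrs. Krabappel")

# Unescaped dots: "Dr." matches "Dre" in "Drew", "Mr." matches "Mrs"
loose <- str_detect(test_names, "Rev.|Dr.|Mr.|Ms.|Mrs.")

# Escaped dot plus a word boundary: only real titles are detected
strict <- str_detect(test_names, "\\b(Rev|Dr|Mr|Ms|Mrs)\\.")

loose   # TRUE TRUE TRUE
strict  # FALSE TRUE TRUE
```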
(c) Construct a logical vector indicating whether a character has a second name.
mid_name <- str_detect(name," [A-Z]\\.")
comb_list2 <- data.frame (name, mid_name)
comb_list2
## name mid_name
## 1 Moe Szyslak FALSE
## 2 Burns, C. Montgomery TRUE
## 3 Rev. Timothy Lovejoy FALSE
## 4 Ned Flanders FALSE
## 5 Simpson, Homer FALSE
## 6 Dr. Julius Hibbert FALSE
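The pattern above only finds second names that are abbreviated to an initial followed by a period. As an alternative sketch, assuming that a second name simply means more than two name words once a title is removed, the words can be counted instead (no_title, n_words and has_second_name are illustrative names):

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Remove a leading title so it is not counted as a name
no_title <- str_replace(name, "^(Rev|Dr)\\. ", "")

# Count runs of letters; each run is one name word
n_words <- str_count(no_title, "[[:alpha:]]+")

# More than two words means there is a second name
has_second_name <- n_words > 2
has_second_name  # FALSE TRUE FALSE FALSE FALSE FALSE
```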
7. Consider the string <title>+++BREAKING NEWS+++</title>. We would like to extract the first HTML tag. To do so we write the regular expression <.+>. Explain why this fails and correct the expression.
tag1 <- "<title>+++BREAKING NEWS+++</title>"
#using the wrong regex <.+>
# the quantifier + is greedy: .+ matches as many characters as possible,
# so the match runs up to the last ">" and returns the entire string
wrong_tag <- str_extract(tag1,"<.+>")
wrong_tag
## [1] "<title>+++BREAKING NEWS+++</title>"
#to prevent this, allow only letters between "<" and ">",
#so the match has to stop at the first ">"
right_tag <- str_extract(tag1,"<[[:alpha:]]+>")
right_tag
## [1] "<title>"
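Restricting the tag content to letters works for this string, but it would miss tags that contain digits or attributes. Two other common ways to stop at the first ">" are sketched below: a lazy quantifier and a negated character class. The variable names lazy_tag and negated_tag are illustrative.

```r
library(stringr)

tag1 <- "<title>+++BREAKING NEWS+++</title>"

# Lazy quantifier: "+?" matches as few characters as possible
lazy_tag <- str_extract(tag1, "<.+?>")

# Negated class: one or more characters that are not ">"
negated_tag <- str_extract(tag1, "<[^>]+>")

lazy_tag     # "<title>"
negated_tag  # "<title>"
```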
8. Consider the string (5-3)^2=5^2-2*5*3+3^2, which conforms to the binomial theorem. We would like to extract the formula in the string. To do so we write the regular expression [^0-9=+*()]+. Explain why this fails and correct the expression.
#assign formula to string variable
bi_for <- "(5-3)^2=5^2-2*5*3+3^2"
#wrong pattern used
#a ^ at the start of a character class negates the class, so [^0-9=+*()]
#matches any character that is NOT a digit, =, +, *, ( or )
#only the first run of such characters, "-", is extracted
wrong_formula <- str_extract(bi_for, "[^0-9=+*()]+")
wrong_formula
## [1] "-"
#the corrected pattern escapes - and ^ so they are treated as literal characters
#digits are matched with \\d instead of 0-9
right_formula <- str_extract(bi_for, "[\\-\\^\\d=+*()]+")
right_formula
## [1] "(5-3)^2=5^2-2*5*3+3^2"
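To make the role of ^ more concrete, the sketch below lists everything the original pattern matches and then checks the corrected pattern against the whole string. The names negated_hits and whole_match are illustrative. Note the three different meanings of ^: negation at the start of a class, a literal when escaped, and a start-of-string anchor outside a class.

```r
library(stringr)

bi_for <- "(5-3)^2=5^2-2*5*3+3^2"

# Leading ^ negates the class: every match is a character that is
# NOT a digit, =, +, *, ( or ), i.e. exactly the symbols we wanted to keep out
negated_hits <- unlist(str_extract_all(bi_for, "[^0-9=+*()]+"))
negated_hits  # "-" "^" "^" "-" "^"

# Escaped \\^ and \\- are literals inside the class; the outer ^ and $
# anchor the pattern so it must cover the string from start to end
whole_match <- str_detect(bi_for, "^[\\-\\^\\d=+*()]+$")
whole_match   # TRUE
```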