Chapter_8: Automated Data Collection in R

Problem_3

Copy the introductory example. The vector name stores the extracted names.

a. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
## [1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
## [1] FALSE
names
Moe Szyslak
Burns, C. Montgomery
Rev. Timothy Lovejoy
Ned Flanders
Simpson, Homer
Dr. Julius Hibbert
##      [,1]                   [,2]            
## [1,] "Moe Szyslak"          ""              
## [2,] "Burns"                " C. Montgomery"
## [3,] "Rev. Timothy Lovejoy" ""              
## [4,] "Ned Flanders"         ""              
## [5,] "Simpson"              " Homer"        
## [6,] "Dr. Julius Hibbert"   ""
x
Moe Szyslak
C. Montgomery Burns
Rev. Timothy Lovejoy
Ned Flanders
Homer Simpson
Dr. Julius Hibbert
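
The code that produced the rearranged vector is not shown above. As a minimal sketch, assuming the `name` vector printed earlier, the swap can be done in one step with base R's `sub()`. This is a substitute for the comma-split approach suggested by the matrix output, not a reconstruction of it. Titles such as "Rev." and "Dr." are left in place, matching the table above.

```r
# The extracted names, as printed in the output above.
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Capture the text before the comma (last name) and the text after it
# (first name), then write them back in swapped order.
# Entries without a comma do not match and are returned unchanged.
x <- sub("^([^,]+), *(.+)$", "\\2 \\1", name)
print(x)
```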
b. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
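
One way to obtain this logical vector, sketched here with base R's `grepl()` and the assumption that a title is either "Rev." or "Dr." at the start of the entry:

```r
# The extracted names, as printed in part a.
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# TRUE where the entry starts with "Rev." or "Dr."; the escaped dot keeps
# the initial in "C. Montgomery" from being mistaken for a title.
has_title <- grepl("^(Rev|Dr)\\.", name)
print(has_title)
```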

Problem_4

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

a. "[0-9]+\$"
  • type: a run of one or more digits immediately followed by a literal dollar sign
  • "[0-9]" matches any single digit from 0 to 9
  • "+" is a quantifier: match one or more of the preceding token
  • "\$" is an escaped dollar sign: the backslash makes it match a literal $ instead of the end of the string
  • example: "500$"
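
A quick check of this pattern in R, as a sketch. The test strings other than "500$" are made up for illustration. Inside an R string literal the backslash has to be doubled.

```r
pattern_a <- "[0-9]+\\$"

# Hypothetical test strings; only "500$" comes from the answer above.
examples_a <- c("500$", "cost: 12$ each", "$500", "five$")
matches_a <- grepl(pattern_a, examples_a)
print(matches_a)

# Pull out the matching part of a longer string.
extracted_a <- regmatches("cost: 12$ each", regexpr(pattern_a, "cost: 12$ each"))
print(extracted_a)
```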
b. "\b[a-z]{1,4}\b"
  • type: a standalone word of one to four lowercase letters
  • "\b" matches a word boundary (the position between a word character and a non-word character)
  • "[a-z]" matches any lowercase letter from a to z
  • "{1,4}" is a quantifier: match the preceding token one to four times
  • example: "bees" and "sara"
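
The same pattern can be tried in R, sketched here with `perl = TRUE` so that `\b` is interpreted by the PCRE engine. "house", "Sara", and "a cat sat" are illustrative strings that are not part of the answer above.

```r
pattern_b <- "\\b[a-z]{1,4}\\b"

# "house" is too long and "Sara" starts with an uppercase letter,
# so neither contains a standalone word of 1 to 4 lowercase letters.
examples_b <- c("bees", "sara", "house", "Sara", "a cat sat")
matches_b <- grepl(pattern_b, examples_b, perl = TRUE)
print(matches_b)

# Extract every short lowercase word from a sentence.
short_words <- regmatches("a cat sat",
                          gregexpr(pattern_b, "a cat sat", perl = TRUE))[[1]]
print(short_words)
```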
c. ".*?\.txt$"
  • type: any string that ends in ".txt"; letters, digits, and special characters may all precede the extension
  • "." matches any character except a line break
  • "*" is a quantifier: match zero or more of the preceding token
  • "?" makes the preceding quantifier lazy, so it matches as few characters as possible
  • "\." is an escaped dot: it matches a literal period
  • "txt" matches exactly the characters t, x, and t in that order
  • "$" anchors the match at the end of the string
  • example: " c.txt", "a.txt", " a.txt"
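
A short R sketch of this pattern, using `perl = TRUE` for the lazy quantifier. The non-matching strings are invented for contrast.

```r
pattern_c <- ".*?\\.txt$"

# "notes.txt.bak" does not end in ".txt" and "atxt" has no literal dot;
# ".txt" matches because ".*?" is allowed to match zero characters.
examples_c <- c("a.txt", " c.txt", "notes.txt.bak", "atxt", ".txt")
matches_c <- grepl(pattern_c, examples_c, perl = TRUE)
print(matches_c)
```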
d. "\d{2}/\d{2}/\d{4}"
  • type: a date-like string of two digits, a slash, two digits, a slash, and four digits
  • "\d" matches any digit
  • "{2}" is a quantifier: match the preceding token exactly two times
  • "/" matches a literal forward slash
  • "{4}" is a quantifier: match the preceding token exactly four times
  • example: "09/11/2001" and "12/25/2020"
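
The slashes are part of this pattern, so a plain run of eight digits does not match. A brief R sketch with illustrative strings shows the difference, and also that the pattern checks only the format, not whether the date is valid.

```r
pattern_d <- "\\d{2}/\\d{2}/\\d{4}"

# Illustrative strings: two well-formed dates, a run of eight digits,
# and a date whose day and month have only one digit each.
examples_d <- c("09/11/2001", "12/25/2020", "23984560", "1/2/2020")
matches_d <- grepl(pattern_d, examples_d, perl = TRUE)
print(matches_d)

# The pattern does not validate the date itself.
format_only <- grepl(pattern_d, "99/99/9999", perl = TRUE)
print(format_only)
```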
e. "<(.+?)>.+?</\1>"
  • type: an opening tag, some content, and a closing tag with the same name; most likely used for selecting HTML or XML elements
  • "<" matches a literal <
  • "(.+?)" is a capturing group; here it captures the tag name
  • "." matches any character except a line break
  • "+" is a quantifier: match one or more of the preceding token
  • "?" makes the preceding quantifier lazy, so it matches as few characters as possible
  • ">" matches a literal >
  • "</" matches the literal characters < and / that open the closing tag
  • "\1" is a backreference: it matches the same text that the first group captured
  • example: "<a>hi</a>"
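
The backreference is what forces the closing tag to repeat the opening one. A small R sketch with made-up strings illustrates this, using `perl = TRUE`:

```r
pattern_e <- "<(.+?)>.+?</\\1>"

# "<a>hi</b>" fails because the closing tag differs from the opening tag;
# "<a></a>" fails because ".+?" needs at least one character of content.
examples_e <- c("<a>hi</a>", "<b>bold</b> text", "<a>hi</b>", "<a></a>")
matches_e <- grepl(pattern_e, examples_e, perl = TRUE)
print(matches_e)

# Replace the whole match with the captured group to recover the tag name.
tag_name <- sub(pattern_e, "\\1", "<a>hi</a>", perl = TRUE)
print(tag_name)
```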