The following problems come from Chapter 8, “Regular Expressions and Essential String Functions,” of “Automated Data Collection with R.”
library(stringr)
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
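The `{2,}` quantifier in the name pattern matters because the character class also allows a space, and a lone space occurs inside some phone numbers. A minimal sketch, using a hypothetical shortened string `sample.data` laid out like `raw.data`, compares the two quantifiers:

```r
library(stringr)

# Hypothetical shortened sample in the same layout as raw.data
sample.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery"

# With "+", the single space inside "(636) 555-0113" is returned as a match
with_plus <- unlist(str_extract_all(sample.data, "[[:alpha:]., ]+"))
with_plus

# With "{2,}", one-character matches are dropped and only the names remain
with_min_two <- unlist(str_extract_all(sample.data, "[[:alpha:]., ]{2,}"))
with_min_two
```

The first call returns three elements, one of which is just `" "`; the second returns only the two names.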
get_first_last <- function(n){
# Strip out title and middle name
n <- str_replace(n, "(Rev\\. )|(C\\. )|(Dr\\. )", "")
# Reorder if comma is present
split_n <- unlist(str_split(n, ", "))
if (length(split_n) == 2){
n <- paste(split_n[2], split_n[1])
}
return(n)
}
first_name_last_name <- as.vector(sapply(name, get_first_last))
first_name_last_name
## [1] "Moe Szyslak" "Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Julius Hibbert"
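The same reordering can also be written without a helper function, since the stringr functions are vectorized and support backreferences in the replacement string. The sketch below is one possible alternative, not the book's solution. It assumes that titles are limited to `Rev.` and `Dr.` and that a middle initial is a single capital letter; the names `no_title` and `first_last` are introduced here for illustration.

```r
library(stringr)

# Same six names as extracted above
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Drop a title or a single-letter initial; the period is escaped so it is literal
no_title <- str_replace_all(name, "\\b(Rev|Dr|[A-Z])\\. ", "")

# Swap "Last, First" into "First Last"; names without a comma pass through unchanged
first_last <- str_replace(no_title, "^(.+), (.+)$", "\\2 \\1")
first_last
```

This yields the same six names as `first_name_last_name`, without the `sapply()` and `as.vector()` round trip.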
# Look for two or more letters followed by a literal period and a space
has_title <- str_detect(name, "[[:alpha:]]{2,}\\. ")
has_title
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
# Find a space followed by one letter, then a literal period, then a space
has_second_name <- str_detect(name, " [[:alpha:]]\\. ")
has_second_name
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
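An unescaped `.` in a regular expression matches any character, so a pattern meant as "two letters, a period, a space" also matches ordinary words such as `Moe `. The short check below illustrates the difference on two of the names:

```r
library(stringr)

# Unescaped ".": "Mo" + "e" + " " satisfies the pattern, although there is no period
unescaped <- str_detect("Moe Szyslak", "\\w{2}. ")
unescaped

# Escaped "\\.": a literal period is required, so the plain name no longer matches
escaped_plain <- str_detect("Moe Szyslak", "\\w{2}\\. ")
escaped_plain

# A name with a real title still matches the escaped pattern
escaped_title <- str_detect("Dr. Julius Hibbert", "\\w{2}\\. ")
escaped_title
```

Only the escaped pattern separates the titled name from the plain one.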
Describe the types of strings that conform to the following regular expressions, and construct an example that each expression matches.
Question | Regular Expression | Type of String | Examples |
---|---|---|---|
(a) | [0-9]+\$ | One or more digits followed by a literal dollar sign | 100$, 25$ |
(b) | \b[a-z]{1,4}\b | A whole word of one to four lowercase letters | help, bat |
(c) | .*?\.txt$ | Any string that ends in .txt (any characters, not only alphanumeric ones, may precede it) | readme.txt, 1.txt |
(d) | \d{2}/\d{2}/\d{4} | A date in a DD/MM/YYYY or MM/DD/YYYY format | 12/31/2018 |
(e) | <(.+?)>.+?</\1> | Text enclosed by a matching pair of opening and closing HTML tags | <b>This is bold</b> |
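
The descriptions in the table can be checked with stringr. The sketch below runs each pattern against an example from the table plus a counterexample chosen here for illustration; the counterexamples and the variable names `res_a` through `res_e_bad` are not from the book.

```r
library(stringr)

# (a) digits followed by a literal dollar sign; "$100" has the sign in front
res_a <- str_detect(c("100$", "25$", "$100"), "[0-9]+\\$")

# (b) whole words of one to four lowercase letters; "escape" is too long
res_b <- str_extract_all("help the bat escape", "\\b[a-z]{1,4}\\b")[[1]]

# (c) strings ending in .txt; "notes.txt.bak" ends in .bak instead
res_c <- str_detect(c("readme.txt", "1.txt", "notes.txt.bak"), ".*?\\.txt$")

# (d) two digits, slash, two digits, slash, four digits; "1/1/2018" lacks padding
res_d <- str_detect(c("12/31/2018", "1/1/2018"), "\\d{2}/\\d{2}/\\d{4}")

# (e) the backreference \\1 forces the closing tag to repeat the opening tag's name
res_e <- str_extract("<b>This is bold</b>", "<(.+?)>.+?</\\1>")
res_e_bad <- str_detect("<b>mismatched</i>", "<(.+?)>.+?</\\1>")
```

Pattern (d) only checks the digit layout, so a string such as `99/99/9999` also matches even though it is not a valid date.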