This assignment is focused on using the stringr package and base R tools to extract data from messy text strings:
- Problem 3 is about extracting names
- Problem 4 is about more complex regular expressions
- Problem 9 is a riddle
This example introduces the stringr package and extracts the names of characters from the raw data.
The assignment consists of three parts with the final goal to provide a string of character names conforming to the order first_name last_name
# Code from the introductory example
# Source: Munzert, S. et. al. (2015). Automated Data Collection with R - A Practical Guide to Web Scraping and Text Mining. Wiley.
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
library(stringr)
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
The resulting string looks like this
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
names_w_titles = str_detect(name, "Rev.|Dr.")
Here are the names again and the correspondent logical values for titles
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
names_w_titles
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
The goal is to rearrange the following string so that all names follow the order first_name last_name.
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
First we remove the titles from the names using the output of 3(b)
name1 = str_replace_all(name, pattern = "Rev. |Dr. ", replacement = "")
For strings that do have a comma, the order of the words should be reversed
last_names_comma = str_extract_all(name1, "\\w+,")
first_names_comma = str_extract_all(name1, ", \\w.+$")
correct_order = paste(first_names_comma, last_names_comma)
Now insert the correctly ordered names into the main string
comma_positions = grep(",", name1)
name1[comma_positions] <- correct_order[comma_positions]
Finally, remove all commas and empty spaces
name1 = str_trim(str_replace_all(name1, ",", ""))
The original string and the rearranged version
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
name1
## [1] "Moe Szyslak" "C. Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Julius Hibbert"
Task: Construct a logical string indicating whether a character has a second name
Assuming one of the names (first or second) is abbreviated, use the “.” as an indicator for a second name
names_second_name_1 = str_detect(name1, "\\.")
An alternative approach would be to say that if a person has more than two words in their name, one of them should be a second name. Then we can use the count of space characters > 1 as and indicator.
names_second_name_2 = str_detect(name1, "(\\s.+){2,}")
In our case, the output is the same
name1
## [1] "Moe Szyslak" "C. Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Julius Hibbert"
names_second_name_1
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
names_second_name_2
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
[0-9]+\\$\\b[a-z]{1,4}\\b.*?\\.txt$\\d{2}/\\d{2}/\\d{4}<( .+?)>.+?<``/ \\1>[0-9]+\\$ matches a literal character \ at the end of the string preceded by any number of digits. Example: from a string “drive123\” this expression would match “123\”.
The R code example does not work yet due to the issues of passing the literal character \ into the input string:
str_match("drive123\\", "[0-9]+\\$")
## [,1]
## [1,] NA
\\b[a-z]{1,4}\\b matches any word that is lowercase and between 1 and 4 characters long
str_extract_all("a to few euro pound Euro","\\b[a-z]{1,4}\\b")
## [[1]]
## [1] "a" "to" "few" "euro"
.*?\\.txt$ optionally matches any number of any characters preceding a literal case-sensitive “.txt”. So either any characters followed by “.txt” at the end of the string, or just “.txt”. The line break character \n prevents backward search beyond it.
str_extract_all(c("match123 ! 321.txt", "nomatch txt", "line break \n .txt"),".*?\\.txt$")
## [[1]]
## [1] "match123 ! 321.txt"
##
## [[2]]
## character(0)
##
## [[3]]
## [1] " .txt"
\\d{2}/\\d{2}/\\d{4} matches two digits followed by “/”, then another two digits followed by “/”, then four digits. This looks like a date format:
str_extract_all(c("10/10/1980", "10.10.1980"),"\\d{2}/\\d{2}/\\d{4}")
## [[1]]
## [1] "10/10/1980"
##
## [[2]]
## character(0)
<( .+?)>.+?<``/ \\1> This expression finds the shortest possible character or group of characters inside HTML tag marks < >, and then returns anything between this found matching group and its value enclosed in </ >. Thus it can extract parts of text that are inside html code:
str_extract_all(c("<p> This paragraph is matched </p>", "<This is not matched>"),"<(.+?)>.+?</\\1>")
## [[1]]
## [1] "<p> This paragraph is matched </p>"
##
## [[2]]
## character(0)