DATA 607 Week 4: Regular Expressions

Problem 3

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson,Homer"        "Dr. Julius Hibbert"

Put first names first

# separate last_name, first_name by comma
last_first <- str_split(name, ",")
# reorder to first_name last_name
for (i in seq_along(name)) {
  name[i] <- paste(str_trim(rev(last_first[[i]])), collapse = " ")
}
name

## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
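The same reordering can also be done without a loop. The following is a minimal sketch, not the approach used above; it assumes the stringr package is loaded and uses a hypothetical vector `raw_names` that stands in for the original, unordered names.

```r
library(stringr)

# Hypothetical copy of the original, unordered names (illustration only)
raw_names <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
               "Ned Flanders", "Simpson,Homer", "Dr. Julius Hibbert")

# Capture the text before and after the comma, then swap the two groups.
# Names without a comma do not match the pattern and are returned unchanged.
reordered <- str_replace(raw_names, "^([^,]+),\\s*(.+)$", "\\2 \\1")
reordered
```

Because `\\s*` absorbs any whitespace after the comma, both "Burns, C. Montgomery" and "Simpson,Homer" are handled without a separate trimming step.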

Find titles

The titles given end with periods. However, one first name (C.) is abbreviated with a period as well, so length has to be considered: the titles here have two or three letters before the period, while the initial has only one. Requiring at least two letters before the period is enough to tell them apart:

(has_title <- str_detect(name, "[:alpha:]{2,}\\."))

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
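To see what the pattern actually matches, rather than only whether it matches, the titles can be pulled out with `str_extract`. This is a small sketch that assumes stringr is loaded and redefines `name` as the reordered vector shown earlier so that it runs on its own.

```r
library(stringr)

# Reordered names, repeated here so the example is self-contained
name <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Same pattern as above: two or more letters followed by a literal period
title <- str_extract(name, "[:alpha:]{2,}\\.")
title
# NA wherever no title was found; "Rev." and "Dr." otherwise
```

The single-letter initial "C." is not extracted, which is the case the length requirement is meant to exclude.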

Find second names

Characters with a second name have an extra space in their full name. Since a title also adds a space, the threshold is two spaces for names with a title and one space for names without:

(second_name <- str_count(name, " ") > ifelse(has_title, 2, 1))

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
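An alternative way to express the same idea is to strip a leading title first and then count the remaining words. The sketch below assumes stringr is loaded and that a title, when present, is the first word of the name; `name` is redefined so the example runs on its own, and `has_second` is a name introduced only for this illustration.

```r
library(stringr)

name <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Remove a leading title (two or more letters plus a period) and any
# whitespace that follows it
no_title <- str_remove(name, "^[:alpha:]{2,}\\.\\s*")

# A name with more than two remaining words has a second name
has_second <- str_count(no_title, "\\S+") > 2
has_second
```

This avoids the `ifelse` adjustment, at the cost of assuming that titles only ever appear at the start of the string.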

Problem 7

The expression, as written, is greedy and returns the entire string:

str_extract("<title>+++BREAKING NEWS+++</title>", "<.+>")

## [1] "<title>+++BREAKING NEWS+++</title>"

This can be corrected by adding a question mark after the plus sign, which makes the quantifier lazy: it matches the shortest possible sequence of characters, so the match ends at the first closing angle bracket and only the opening HTML tag is returned:

str_extract("<title>+++BREAKING NEWS+++</title>", "<.+?>")

## [1] "<title>"
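Two related variations may be useful here. The sketch below assumes stringr is loaded; the variable names are introduced only for illustration. It shows that a negated character class gives the same result as the lazy quantifier, and that `str_extract_all` returns both the opening and the closing tag.

```r
library(stringr)

html <- "<title>+++BREAKING NEWS+++</title>"

# Alternative to the lazy quantifier: match anything except ">" inside the tag
first_tag <- str_extract(html, "<[^>]+>")

# Extract every tag in the string, not just the first one
all_tags <- str_extract_all(html, "<.+?>")[[1]]

first_tag
all_tags
```

The negated class states the intent directly (a tag cannot contain a closing bracket), so it does not depend on the quantifier being lazy.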

Problem 8

The expression, as written, returns only a dash:

str_extract("(5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem", "[^0-9=+*()]+")

## [1] "-"

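This fails because a caret at the start of a character class negates the class, so the expression matches runs of characters that are not digits or the listed symbols; the first such character is the dash. To be matched literally, the caret needs to be escaped. The literal dash also needs to be added to the class, escaped as well (the one in 0-9 needs to remain, since it defines the digit range).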

str_extract("(5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem", "[\\^0-9=+*()\\-]+")

## [1] "(5-3)^2=5^2-2*5*3+3^2"
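The effect of the leading caret is easier to see when every match is listed instead of only the first. This is a minimal sketch assuming stringr is loaded; `eq`, `negated` and `formula` are names chosen for this example.

```r
library(stringr)

eq <- "(5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem"

# Leading caret negates the class: every run of characters that is NOT a
# digit, "=", "+", "*", "(" or ")" is returned
negated <- str_extract_all(eq, "[^0-9=+*()]+")[[1]]

# Escaped caret and escaped dash are literal members of the class
formula <- str_extract(eq, "[\\^0-9=+*()\\-]+")

negated
formula
```

The negated version returns the individual dashes and carets plus the trailing sentence, which is exactly the complement of what the corrected class matches.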