library(stringr)
library(Hmisc)
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5543642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
first_name last_namelast_first <- name[!is.na((str_extract(name, ".+, .+")))] #finds all names in "ln, fn"" format
first_name <- (str_extract(last_first,", .+")) #gets just the first name from "ln, fn" format
last_name <- (str_extract(last_first,".+, ")) #gets just the last name from "ln, fn" format
names <- as.list(paste(first_name,last_name)) #puts the first name and last name together
last_first <- as.list(gsub(", ", "", names)) # removes left over "," in the names
first_last <- as.list(name[is.na((str_extract(name, ".+, .+")))]) # gets all names in "fn ln" format
name <- append(last_first, first_last) #combines the two lists
list.tree(name)
## name = list 6 (760 bytes)
## . [[1]] = character 1= C. Montgomery Burns
## . [[2]] = character 1= Homer Simpson
## . [[3]] = character 1= Moe Szyslak
## . [[4]] = character 1= Rev. Timothy Lovejoy
## . [[5]] = character 1= Ned Flanders
## . [[6]] = character 1= Dr. Julius Hibbert
str_detect(name, "[:alpha:]{2,}\\..")
## [1] FALSE FALSE FALSE TRUE FALSE TRUE
title <- name[!is.na((str_extract(name, "[:alpha:]{2,}\\..")))] #gets all names that have more than 2 letters before period
list.tree(title)
## title = list 2 (296 bytes)
## . [[1]] = character 1= Rev. Timothy Lovejoy
## . [[2]] = character 1= Dr. Julius Hibbert
str_detect(name, ("[:upper:]\\.+"))
## [1] TRUE FALSE FALSE FALSE FALSE FALSE
has_middle_name <- name[!is.na(str_extract(name, ("[:upper:]\\.+")))] # gets names with one capital letter before next name part
list.tree(has_middle_name)
## has_middle_name = list 1 (168 bytes)
## . [[1]] = character 1= C. Montgomery Burns
string <- "<title>+++BREAKING NEWS+++</title>"
attempt_1 <- str_extract(string, "<.+>")
attempt_1
## [1] "<title>+++BREAKING NEWS+++</title>"
attempt_2 <- str_extract(string, "<\\w+>")
attempt_2
## [1] "<title>"
The reason that attempt_1 fails is that it is grabbing everything between < and > and since there is a second tag and the end with > we get the whole string. By getting only a whole word we get the first tag since the second tag has \.
string <- "(5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem"
attempt_1 <- str_extract(string, "[^0-9=+*()]+")
attempt_1
## [1] "-"
attempt_2 <- str_extract(string, "[0-9=+*()].+[0-9]")
attempt_2
## [1] "(5-3)^2=5^2-2*5*3+3^2"
The reason that attempt_1 fails is because the carat ^ symbol inverses the matching. So it is looking for anything but 0-9. Also the period . is any characters after the matching of one time using the + symbol. By adding the [0-9] we make sure to end at the last numeric value in the string.