R> name
[1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
The code below extracts this vector of names from the raw data:
library(stringr)
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
To bring the names into a standard "first_name last_name" format, follow the steps below:
# convert the character vector to a data frame with a single column called names
namesdf <- data.frame(names = name, stringsAsFactors = FALSE)
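# for "Last, First" entries, put the last word first, followed by the first word (the surname)
namesdf$stdFormatNames <- ifelse(grepl(",", namesdf$names),
                                 paste(word(namesdf$names, -1), word(namesdf$names, 1)),
                                 namesdf$names)
# strip titles and commas; the dots are escaped so they match a literal "."
namesdf$stdFormatNames <- gsub("Rev\\.|Dr\\.|,", "", namesdf$stdFormatNames)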
namesdf$stdFormatNames
## [1] "Moe Szyslak" "Montgomery Burns" " Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" " Julius Hibbert"
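The same standardization can be sketched as one small helper that also trims the leftover whitespace. This is only an illustration, not the solution used above: the function name standardize_name is hypothetical, and it assumes that the only titles are "Rev." and "Dr." and that a comma always separates "last name, first names". Unlike the code above, it keeps the middle initial.

```r
library(stringr)

# Hypothetical helper, not part of the original solution.
standardize_name <- function(x) {
  # drop the titles (dots escaped so they match literally), then trim
  x <- str_trim(str_replace_all(x, "Rev\\.|Dr\\.", ""))
  # for "Last, First" entries, swap the parts around the comma
  has_comma <- str_detect(x, ",")
  parts <- str_split_fixed(x, ",\\s*", 2)
  x[has_comma] <- paste(parts[has_comma, 2], parts[has_comma, 1])
  x
}

standardize_name(c("Burns, C. Montgomery", "Rev. Timothy Lovejoy", "Ned Flanders"))
## [1] "C. Montgomery Burns" "Timothy Lovejoy"     "Ned Flanders"
```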
Use str_detect to check whether each name contains a title:
namesdf$names
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
# the dots are escaped so that "." matches a literal period rather than any character
namesdf$hasTitle <- str_detect(namesdf$names, "Rev\\.|Dr\\.")
namesdf[,c("names","hasTitle")]
##                  names hasTitle
## 1          Moe Szyslak    FALSE
## 2 Burns, C. Montgomery    FALSE
## 3 Rev. Timothy Lovejoy     TRUE
## 4         Ned Flanders    FALSE
## 5       Simpson, Homer    FALSE
## 6   Dr. Julius Hibbert     TRUE
Splitting by space: check whether each trimmed name contains a space
namesdf$stdFormatNames
## [1] "Moe Szyslak" "Montgomery Burns" " Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" " Julius Hibbert"
grepl( " ",str_trim(namesdf$stdFormatNames))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
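Because every full name contains a space, this check returns TRUE for all six characters. If "second name" is read as a middle name, which is an assumption about the intent here, one possible sketch counts the words that remain after removing titles and commas. The variable names cleaned and has_second_name are introduced only for this illustration.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# remove titles and commas, then trim the surrounding whitespace
cleaned <- str_trim(str_replace_all(name, "Rev\\.|Dr\\.|,", ""))

# more than two words suggests a middle (second) name
has_second_name <- str_count(cleaned, "\\S+") > 2
has_second_name
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
```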
R applies greedy quantification by default. This means that a quantifier tries to match the longest possible sequence of the preceding element. Because . matches any character, <.+> matches everything from the first < to the last > in the string, so the whole string is returned instead of only the first tag.
stringToSearch<-"<title>+++BREAKING NEWS+++</title>"
str_extract(stringToSearch,"<.+>")
## [1] "<title>+++BREAKING NEWS+++</title>"
We can change this behavior by adding a ? after the quantifier. This makes it lazy, so it matches the shortest possible sequence of characters before the next >.
str_extract(stringToSearch, "<.+?>")
## [1] "<title>"
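As a further illustration of the same point, str_extract_all shows that the lazy pattern finds both tags, and a negated character class is another common way to stop at the first closing bracket. This is a sketch that reuses the string from above.

```r
library(stringr)

stringToSearch <- "<title>+++BREAKING NEWS+++</title>"

# lazy quantifier: each match ends at the nearest ">"
tags <- str_extract_all(stringToSearch, "<.+?>")[[1]]
tags
## [1] "<title>"  "</title>"

# alternative: forbid ">" inside the tag instead of using a lazy quantifier
first_tag <- str_extract(stringToSearch, "<[^>]+>")
first_tag
## [1] "<title>"
```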
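A caret at the beginning of a character class negates the class, so [^0-9=+*()]+ matches only characters that are not in the listed set. Moving the caret away from the first position makes it a literal caret, which resolves the problem. The hyphen also has to be added, and it is placed last so that it is read literally rather than as a range operator.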
binomial_str <- "(5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem."
str_extract(binomial_str, "[^0-9=+*()]+")
## [1] "-"
str_extract(binomial_str, "[0-9=+*()^-]+")
## [1] "(5-3)^2=5^2-2*5*3+3^2"
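The position rules can also be sidestepped by escaping the caret and the hyphen inside the class. The following sketch assumes stringr's default ICU regex engine, where a backslash escape is honored inside a character class. The variable name formula is introduced only for this example.

```r
library(stringr)

binomial_str <- "(5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem."

# an escaped caret and hyphen are literal regardless of their position in the class
formula <- str_extract(binomial_str, "[\\^\\-0-9=+*()]+")
formula
## [1] "(5-3)^2=5^2-2*5*3+3^2"
```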