Raw Data:
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
Copy the introductory example. The vector “name” stores the extracted names.
Name
library("stringr")
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
#Note: Used documentation and examples from http://www.endmemo.com/program/R/sub.php as a guide for replacement
#Remove titles
name_notitle <- sub("[A-Za-z]{2,3}\\. ", "", name)
#Swap names separated by commas
name_swap <- sub("(\\w+),\\s(\\w+)", "\\2 \\1", name_notitle)
#Now fix the name with an initial: put the period back after the initial and move the surname to the end
name_swap2 <- sub("(\\w+)\\s(\\w+)\\.\\s(\\w+)", "\\1. \\3 \\2", name_swap)
name_swap2
## [1] "Moe Szyslak" "C. Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Julius Hibbert"
#Use the remove title logic and "Detect String" Function
title <- str_detect(name, "[A-Za-z]{2,3}\\. ")
#Join
data.frame(name_swap2,title)
## name_swap2 title
## 1 Moe Szyslak FALSE
## 2 C. Montgomery Burns FALSE
## 3 Timothy Lovejoy TRUE
## 4 Ned Flanders FALSE
## 5 Homer Simpson FALSE
## 6 Julius Hibbert TRUE
count <- str_count( name_swap2, "\\S+" )
data.frame(name_swap2,count>2)
## name_swap2 count...2
## 1 Moe Szyslak FALSE
## 2 C. Montgomery Burns TRUE
## 3 Timothy Lovejoy FALSE
## 4 Ned Flanders FALSE
## 5 Homer Simpson FALSE
## 6 Julius Hibbert FALSE
Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
This will find strings that contain one or more digits immediately followed by a literal dollar sign.
a <- c("1234asd","12345$","$123as","sadsf$")
str_detect(a, "[0-9]+\\$")
## [1] FALSE TRUE FALSE FALSE
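To see why the dollar sign is escaped, here is a small sketch comparing the escaped and unescaped patterns. The vector `amounts` is made up for illustration and is not part of the exercise. Unescaped, `$` acts as an end-of-string anchor rather than a literal character.

```r
library(stringr)

# Hypothetical test strings, not from the exercise
amounts <- c("12345$", "12345")

# Escaped: digits followed by a literal "$"
literal_dollar <- str_detect(amounts, "[0-9]+\\$")
literal_dollar  # TRUE FALSE

# Unescaped: "$" anchors the digits to the end of the string
end_anchor <- str_detect(amounts, "[0-9]+$")
end_anchor      # FALSE TRUE
```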
This will find strings that contain a standalone word of one to four lowercase letters, that is, a run of lowercase letters with a word boundary on each side.
b <- c("abcde","123aabc22$","Abcd","abcd")
str_detect(b, "\\b[a-z]{1,4}\\b")
## [1] FALSE FALSE FALSE TRUE
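The quantifier {1,4} is a range, not an exact count. As a rough check, the sketch below compares it with {4} on a made-up vector called `words` (an assumption for illustration only).

```r
library(stringr)

# Hypothetical test strings of increasing length
words <- c("a", "abc", "abcd", "abcde")

# One to four lowercase letters between word boundaries
one_to_four <- str_detect(words, "\\b[a-z]{1,4}\\b")
one_to_four   # TRUE TRUE TRUE FALSE

# Exactly four lowercase letters between word boundaries
exactly_four <- str_detect(words, "\\b[a-z]{4}\\b")
exactly_four  # FALSE FALSE TRUE FALSE
```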
This will find strings that end in .txt. The .*? lazily matches any characters (possibly none) before the extension, so any string ending in .txt matches, including .txt on its own.
c <- c("1234asd.txt",".txt","sa.txtdk$$","$")
str_detect(c, ".*?\\.txt$")
## [1] TRUE TRUE FALSE FALSE
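For detection alone, the leading .*? should not change the result, since the pattern is not anchored at the start. A quick sketch, reusing the same strings under the hypothetical name `files`:

```r
library(stringr)

# Same strings as above, stored under a different (hypothetical) name
files <- c("1234asd.txt", ".txt", "sa.txtdk$$", "$")

# Original pattern: lazy "anything", then a literal ".txt" at the end
with_prefix <- str_detect(files, ".*?\\.txt$")

# Shorter pattern: just the literal ".txt" at the end
suffix_only <- str_detect(files, "\\.txt$")

identical(with_prefix, suffix_only)  # TRUE
```

The prefix only matters when the matched text is extracted, for example with str_extract, where it pulls in the part of the string before the extension.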
This will find strings that contain numbers separated by “/”, as in a date. Specifically, two digits followed by /, then two digits, then /, then four digits. Because the pattern is not anchored, extra digits before or after the match are allowed.
d <- c("aa/bb/cccc","09/18/2016","9/18/2016","9999999/99/9999")
str_detect(d, "\\d{2}/\\d{2}/\\d{4}")
## [1] FALSE TRUE FALSE TRUE
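If only strings that are entirely in this format should match, one option is to anchor the pattern with ^ and $. This is a sketch of that variation, not part of the original exercise; the vector `dates` is made up.

```r
library(stringr)

# Hypothetical test strings
dates <- c("09/18/2016", "9999999/99/9999", "09/18/20161")

# Unanchored: matches anywhere inside the string
unanchored <- str_detect(dates, "\\d{2}/\\d{2}/\\d{4}")
unanchored  # TRUE TRUE TRUE

# Anchored: the whole string must be dd/dd/dddd
anchored <- str_detect(dates, "^\\d{2}/\\d{2}/\\d{4}$")
anchored    # TRUE FALSE FALSE
```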
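This will find strings that contain text enclosed in a matching pair of HTML-style tags: an opening tag such as <b>, at least one character of content, and a closing tag with the same name. The backreference \\1 requires the closing tag name to repeat whatever the first group captured.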
e <- c("<b> Hi </b>","asdasfa","<i>Italic</i>","<i>Failed<i>")
str_detect(e, "<(.+?)>.+?</\\1>")
## [1] TRUE FALSE TRUE FALSE
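To see what the backreference is compared against, str_match can be used to pull out the captured group. A minimal sketch, assuming a made-up vector `tags`:

```r
library(stringr)

# Hypothetical test strings
tags <- c("<b> Hi </b>", "<i>Italic</i>", "<i>Failed<i>")

# Column 1 is the full match, column 2 is the first capture group
m <- str_match(tags, "<(.+?)>.+?</\\1>")

# Tag name captured by (.+?); NA where the pattern did not match
tag_name <- m[, 2]
tag_name  # "b" "i" NA
```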
The following code hides a secret message. Crack it with R and regular expressions.
code<-"clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
# First find all capital letters and periods
caps<-str_extract_all(code, "[[:upper:].]{1,}")
#unlist
u<-unlist(caps)
#concatenate all
pst<-paste(u, collapse ='')
#replace periods
result<-str_replace_all(pst,"[.]"," ")
result
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD"