We will start by loading our libraries and creating our dataset per the assignment specifications.
library(stringr)
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
First we extract the full names:
fullname<- unlist(str_extract_all(raw.data,"[[:alpha:]., ]{2,}"))
Area codes come next: we strip every non-digit character, then take the first three digits of any number longer than seven digits.
Areacodes<-unlist(str_extract_all(raw.data,"[[:digit:]-() ]{2,}")) %>%
str_replace_all("\\D","") %>%
str_extract("[:digit:]{8,}") %>%
substr(1,3)
Areacodes
## [1] NA "636" NA NA "636" NA
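Next we extract the phone numbers from the raw data. Because the pattern keeps the first seven digits of each cleaned string, entries that include an area code come back as the area code plus the first four digits (for example "6365550") rather than as the seven-digit local number.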
Phone<-unlist(str_extract_all(raw.data,"[[:digit:]-() ]{2,}")) %>%
str_replace_all("\\D","") %>%
str_extract("[:digit:]{1,7}")
Phone
## [1] "5551239" "6365550" "5556542" "5558904" "6365553" "5553642"
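For entries that carry an area code, the seven-digit local number sits at the end of the cleaned digit string. A minimal sketch of one alternative, assuming every entry is either seven or ten digits long once punctuation is stripped: capture groups split each string into an optional area code and the trailing local number. The names digits_only, parts, area_code, and local_number are illustrative and not part of the solution above.

```r
library(stringr)

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Pull out each run of digits and phone punctuation, then drop the non-digits
digits_only <- unlist(str_extract_all(raw.data, "[0-9()\\- ]{2,}")) %>%
  str_replace_all("\\D", "")

# Group 1: optional three-digit area code; group 2: the final seven digits
parts <- str_match(digits_only, "^(\\d{3})?(\\d{7})$")
area_code    <- parts[, 2]  # NA where no area code is present
local_number <- parts[, 3]
```

With this split, the second and fifth entries would keep their full local numbers instead of losing the last three digits.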
Now we can remove initials and extract titles such as “Dr.”:
#remove initials
fullname<-unlist(str_extract_all(raw.data,"[[:alpha:]., ]{2,}")) %>%
str_replace_all(" [:alpha:]{1}\\.","")
#extract titles
Titles<-unlist(str_extract_all(fullname,"[[:alpha:]., ]{2,}")) %>%
str_extract_all("[:alpha:]{1,}\\.") %>%
lapply(function(x) if(identical(x, character(0))) NA_character_ else x) %>% unlist
Titles
## [1] NA NA "Rev." NA NA "Dr."
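Since each name carries at most one title, str_extract can return the title or NA in a single step, which avoids the list handling. A minimal sketch, assuming a title is two or more letters followed by a period, so that a single-letter initial such as "C." is not picked up; name_strings is a hypothetical stand-in for the names extracted from raw.data.

```r
library(stringr)

# Hypothetical input mirroring the names extracted from raw.data
name_strings <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
                  "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# str_extract returns the first match per element, or NA when nothing matches
titles <- str_extract(name_strings, "[[:alpha:]]{2,}\\.")

# Logical flag: TRUE where a title was found
has_title <- !is.na(titles)
```

Because the two-letter minimum already skips initials, this version does not depend on the initials having been removed first.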
We can get everything into first name and last name vectors with the following stringr package code:
#drop titles, then swap "Last, First" so that the first name always comes first
FnameLname<-str_replace_all(fullname,"\\w+\\.\\s","")%>%
str_replace_all("(\\w+),\\s(\\w+)","\\2 \\1")
#extract first names (str_extract returns a character vector, not a list)
first_name<-str_extract(FnameLname,"^\\w+")
#extract last names
last_name<-str_extract(FnameLname,"\\w+$")
Finally, to build a data frame summarizing everything we extracted from the raw.data text, we combine all of our vectors:
#create table
extracted.info<-cbind.data.frame(Titles, first_name,last_name,Areacodes,Phone)
extracted.info
##   Titles first_name last_name Areacodes   Phone
## 1   <NA>        Moe   Szyslak      <NA> 5551239
## 2   <NA> Montgomery     Burns       636 6365550
## 3   Rev.    Timothy   Lovejoy      <NA> 5556542
## 4   <NA>        Ned  Flanders      <NA> 5558904
## 5   <NA>      Homer   Simpson       636 6365553
## 6    Dr.     Julius   Hibbert      <NA> 5553642
3-1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name. The first_name and last_name columns of our data frame already hold this ordering:
extracted.info[c(2,3)]
##   first_name last_name
## 1        Moe   Szyslak
## 2 Montgomery     Burns
## 3    Timothy   Lovejoy
## 4        Ned  Flanders
## 5      Homer   Simpson
## 6     Julius   Hibbert
3-2. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.). Since the titles were already extracted into our data frame with the stringr package, we can build the vector simply by checking which values in the Titles column are not NA:
!is.na(extracted.info$Titles)
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
If we want to do this directly on the strings, we can use the str_detect function as shown below:
unlist(str_extract_all(fullname,"[[:alpha:]., ]{2,}")) %>%
str_detect("[:alpha:]{1,}\\.")
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
3-3. Again, since the data frame already summarizes the relevant information, we can create a logical vector indicating who has a second name by checking the last_name column. Every individual has one:
!is.na(extracted.info$last_name)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
4-1. [0-9]+\$
This expression matches one or more consecutive digits (0 through 9) followed by a literal dollar sign. The following vector contains both matching and non-matching examples:
str_detect(c("123$","sdf544$","werh","32432"),"[0-9]+\\$")
## [1] TRUE TRUE FALSE FALSE
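To see exactly what the pattern consumes, str_extract returns the matched substring instead of a logical value. The sketch below also contrasts the escaped \$ with an unescaped $, which acts as an end-of-string anchor rather than a literal character. The variable names are illustrative.

```r
library(stringr)

samples <- c("123$", "sdf544$", "werh", "32432")

# Escaped dollar sign: the digits must be followed by a literal "$"
matched <- str_extract(samples, "[0-9]+\\$")
# "123$" "544$" NA NA

# Unescaped dollar sign: the digits must run to the end of the string
ends_in_digits <- str_detect(samples, "[0-9]+$")
# FALSE FALSE FALSE TRUE
```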
4-2. \b[a-z]{1,4}\b
This expression matches a run of one to four lowercase letters (a-z) with a word boundary on each side, i.e. a standalone lowercase word of at most four letters. The following vector contains both matching and non-matching examples:
str_detect(c("four","hi","Hi","hello"),"\\b[a-z]{1,4}\\b")
## [1] TRUE TRUE FALSE FALSE
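Applied to running text, the same pattern picks out each short lowercase word. A small sketch using a made-up sentence (the sentence and variable names are assumptions for illustration):

```r
library(stringr)

sentence <- "The quick brown fox ate a ripe fig"

# Every lowercase word of one to four letters, bounded on both sides
short_words <- str_extract_all(sentence, "\\b[a-z]{1,4}\\b")[[1]]
# "fox" "ate" "a" "ripe" "fig"
```

"The" is skipped because of its capital letter, and "quick" and "brown" are skipped because no word boundary falls within their first four letters.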
4-3. .*?\.txt$
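This expression matches any string that ends in .txt: the lazy .*? consumes whatever precedes the extension, \. is a literal period, and $ anchors the match to the end of the string. The following vector contains both matching and non-matching examples: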
str_detect(c("@#%f.txt","a d 4.txt","Hi.tx","hell"),".*?\\.txt$")
## [1] TRUE TRUE FALSE FALSE
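A few edge cases help show what the end anchor and the literal period each contribute. This is a minimal sketch with hypothetical file names:

```r
library(stringr)

files <- c("report.txt", "notes.txt.bak", "mytxt", ".txt")

# TRUE only when the literal ".txt" forms the final four characters
is_txt <- str_detect(files, ".*?\\.txt$")
# TRUE FALSE FALSE TRUE

# Despite the lazy quantifier, the match still spans the whole string
whole <- str_extract("a d 4.txt", ".*?\\.txt$")
# "a d 4.txt"
```

The lazy quantifier makes no difference to str_detect here, since the match must reach the end of the string either way.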
4-4. \d{2}/\d{2}/\d{4}
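This expression matches two digits, a slash, two digits, a slash, and four digits, which is the shape of a date such as 09/14/2019. It checks the format only, not whether the date is valid. The following vector contains both matching and non-matching examples: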
str_detect(c("09/14/2019","05/02/1992","9/14/2019","09/14/19"),"\\d{2}/\\d{2}/\\d{4}")
## [1] TRUE TRUE FALSE FALSE
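Because the pattern is unanchored and purely positional, it also accepts impossible dates and date-shaped substrings. A short sketch of that behavior, with ^ and $ added as one possible way to require that the whole string has the date shape (the sample strings are made up):

```r
library(stringr)

date_pattern <- "\\d{2}/\\d{2}/\\d{4}"
samples <- c("09/14/2019", "99/99/9999", "id 109/14/20195")

# Unanchored: any substring with the right shape counts as a match
loose <- str_detect(samples, date_pattern)
# TRUE TRUE TRUE

# Anchored: the whole string must have the shape, though it is still
# not checked against the calendar
strict <- str_detect(samples, str_c("^", date_pattern, "$"))
# TRUE TRUE FALSE
```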
4-5. <(.+?)>.+?</\1>
This expression matches an opening tag (< followed by one or more characters and >), then one or more characters of content, then a closing tag whose name must equal the opening one; the backreference \1 repeats whatever the first group captured. This is presumably meant for HTML (or similarly formatted) documents. The following vector contains a matching and a non-matching example:
str_detect(c("<html>This Should Return True</html>","<html>thisShouldReturnFalse"),"<(.+?)>.+?</\\1>")
## [1] TRUE FALSE
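The role of the backreference is easier to see with str_match, which exposes the captured groups. In this sketch a second capture group is added around the content purely for illustration; it is not part of the exercise pattern, and the sample strings are made up.

```r
library(stringr)

# Same pattern as the exercise, plus a second group to capture the content
tag_pattern <- "<(.+?)>(.+?)</\\1>"

parts <- str_match("<b>bold</b>", tag_pattern)
tag_name <- parts[, 2]   # "b"
content  <- parts[, 3]   # "bold"

# A closing tag that differs from the opening tag fails the backreference
mismatched <- str_detect("<b>bold</i>", tag_pattern)   # FALSE
```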
Defining challenge problem vector:
challenge_problem<- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
Let's start off with some exploratory string analysis. We will go down the list of character classes to see if any patterns arise.
Testing for letters only:
unlist(str_extract_all(challenge_problem,"[:alpha:]+")) %>%
str_c(collapse= " ")
## [1] "clcopCow zmstc d wnkig OvdicpNuggvhryn Gjuwczi hqrfpRxs Aj dwpn Tanwo Uwisdij Lj kpf AT Idr coc bt yczjatOaootj t Nj ne c Sfek r w YwwojigO d vrfUrbz bkAnbhzgv R i zEcrop wAgnb SqoU fPa otfb wEm k t sR zqe fy n Nd t kc fE gmc Rgxo nhDk gr"
Testing for numeric only:
unlist(str_extract_all(challenge_problem,"[:digit:]+")) %>%
str_c(collapse= " ")
## [1] "1 0 87 7 92 8 5 5 0 7 8 03 5 3 0 7 55 3 3 6 4 1 1 6 2 2 4 9 05 65 1 7 24 6 3 9 5 89 6 5 9 4 905 4 5"
Testing for lowercase only:
unlist(str_extract_all(challenge_problem,"[:lower:]+")) %>%
str_c(collapse= " ")
## [1] "clcop ow zmstc d wnkig vdicp uggvhryn juwczi hqrfp xs j dwpn anwo wisdij j kpf dr coc bt yczjat aootj t j ne c fek r w wwojig d vrf rbz bk nbhzgv i z crop w gnb qo f a otfb w m k t s zqe fy n d t kc f gmc gxo nh k gr"
Testing for uppercase only:
unlist(str_extract_all(challenge_problem,"[:upper:]+")) %>%
str_c(collapse= " ")
## [1] "C O N G R A T U L AT I O N S Y O U A R E A S U P E R N E R D"
Wow! This seems to spell out an actual message. Let's try cleaning it up by keeping the periods as well and turning them into spaces:
unlist(str_extract_all(challenge_problem,"[[:upper:].]")) %>%
str_c(collapse= "")%>%str_replace_all("\\."," ")
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD"
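The same cleanup can be approached from the other direction by deleting everything that is not wanted. A minimal sketch on a short made-up string (toy is hypothetical, not the challenge text), assuming the hidden message uses only uppercase ASCII letters and periods:

```r
library(stringr)

# Hypothetical toy string with "HI.YOU" hidden among lowercase letters and digits
toy <- "aHb1Ic.d2Ye3OfU"

# Delete every character that is not an uppercase letter or a period
kept <- str_remove_all(toy, "[^A-Z.]")

# fixed() treats the period literally, so no escaping is needed
message_text <- str_replace_all(kept, fixed("."), " ")
# "HI YOU"
```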
The puzzle is solved!
We did it!