Stringr Data Extraction

We will start by loading our libraries and creating our dataset per the assignment specifications.

library(stringr) # also re-exports the %>% pipe used below

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

Names and Area Codes

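We can extract the names by matching runs of two or more letters, periods, commas, and spaces:

fullname <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

Area codes appear only in the ten-digit numbers. We pull out each run of digits and separators, strip the non-digit characters, keep only the strings that still have at least eight digits, and take the first three digits:

Areacodes <- unlist(str_extract_all(raw.data, "[[:digit:]-() ]{2,}")) %>%
  str_replace_all("\\D", "") %>%
  str_extract("[:digit:]{8,}") %>%
  substr(1, 3)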

Areacodes
## [1] NA    "636" NA    NA    "636" NA
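As a cross-check, the area code can also be captured directly with a single pattern instead of counting digits. This is a minimal sketch, not the approach used above: the `phones` vector is copied by hand from raw.data, and the names `phones`, `pattern`, and `area_codes` are introduced only for illustration.

```r
library(stringr)

# Phone strings as they appear in raw.data (copied by hand for this sketch)
phones <- c("555-1239", "(636) 555-0113", "555-6542",
            "555 8904", "636-555-3226", "5553642")

# Optional "(", three captured digits, optional ")", optional separator,
# then a seven-digit local number that must run to the end of the string
pattern <- "^\\(?(\\d{3})\\)?[ -]?\\d{3}[ -]?\\d{4}$"

# Column 2 of the str_match() result holds the first capture group;
# strings without an area code do not match and yield NA
area_codes <- str_match(phones, pattern)[, 2]
area_codes
```

Because the whole string has to match, seven-digit numbers return NA rather than a misleading three-digit prefix.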

Extracting Phone Number

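Next we extract the phone numbers from the raw data. After stripping the non-digit characters, the local number is the last seven digits of each string, so the pattern is anchored to the end with $ (an unanchored "[:digit:]{1,7}" would return the first seven digits, which for the ten-digit numbers begin with the area code):

Phone <- unlist(str_extract_all(raw.data, "[[:digit:]-() ]{2,}")) %>%
  str_replace_all("\\D", "") %>%
  str_extract("[:digit:]{7}$")

Phone
## [1] "5551239" "5550113" "5556542" "5558904" "5553226" "5553642"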
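For display, the seven-digit local numbers can be put into one consistent format. The sketch below assumes the digits-only strings have already been produced; the `digits` vector is typed in by hand, and `local_number` and `formatted` are hypothetical names.

```r
library(stringr)

# Digits-only phone strings (hand-copied; ten-digit entries include the area code)
digits <- c("5551239", "6365550113", "5556542",
            "5558904", "6365553226", "5553642")

# Keep the last seven characters, i.e. the local number without the area code
local_number <- str_sub(digits, -7)

# Insert a hyphen between the three-digit exchange and the four-digit line number
formatted <- str_replace(local_number, "^(\\d{3})(\\d{4})$", "\\1-\\2")
formatted
```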

Extracting Titles

Now we can remove the initials and extract titles such as “Dr.”

#remove initials
fullname<-unlist(str_extract_all(raw.data,"[[:alpha:]., ]{2,}")) %>%
  str_replace_all(" [:alpha:]{1}\\.","")


#extract titles; str_extract() returns NA where there is no match
Titles <- str_extract(fullname, "[:alpha:]{1,}\\.")

Titles
## [1] NA     NA     "Rev." NA     NA     "Dr."

Extracting First and Last Name

We can get everything into first name and last name vectors with the following stringr package code:

#remove titles, then swap "Last, First" so that the first name always comes first
FnameLname <- str_replace_all(fullname, "\\w+\\.\\s", "") %>%
  str_replace_all("(\\w+),\\s(\\w+)", "\\2 \\1")

#extract first names (the word at the start of each string)
first_name <- str_extract(FnameLname, "^\\w+")

#extract last names (the word at the end of each string)
last_name <- str_extract(FnameLname, "\\w+$")
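To make the intermediate steps visible, the same idea can be sketched on a hand-typed copy of the names (initials already removed). The variable names `no_titles`, `reordered`, and `parts` are illustrative only, and `str_split_fixed()` is swapped in here as an alternative way to separate the two words.

```r
library(stringr)

# Names as they look after the initials have been removed (typed in by hand)
fullname <- c("Moe Szyslak", "Burns, Montgomery", "Rev. Timothy Lovejoy",
              "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Step 1: drop a leading title such as "Rev. " or "Dr. "
no_titles <- str_replace_all(fullname, "\\w+\\.\\s", "")

# Step 2: turn "Last, First" into "First Last" with two capture groups
reordered <- str_replace_all(no_titles, "(\\w+),\\s(\\w+)", "\\2 \\1")

# Step 3: split each name at the space into exactly two columns
parts <- str_split_fixed(reordered, " ", 2)
parts[, 1]  # first names
parts[, 2]  # last names
```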

Data Extraction Summary

Finally, to create a nice dataframe summarizing all the information we’ve extracted from the raw.data text, we can employ the following script to combine all of our vectors.

#create table
extracted.info<-cbind.data.frame(Titles, first_name,last_name,Areacodes,Phone)

extracted.info
##   Titles first_name last_name Areacodes   Phone
## 1   <NA>        Moe   Szyslak      <NA> 5551239
## 2   <NA> Montgomery     Burns       636 5550113
## 3   Rev.    Timothy   Lovejoy      <NA> 5556542
## 4   <NA>        Ned  Flanders      <NA> 5558904
## 5   <NA>      Homer   Simpson       636 5553226
## 6    Dr.     Julius   Hibbert      <NA> 5553642

Problem 3

Problem 3-1

3-1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name. The first_name and last_name columns of our dataframe already hold the names in this order:

extracted.info[c(2,3)]
##   first_name last_name
## 1        Moe   Szyslak
## 2 Montgomery     Burns
## 3    Timothy   Lovejoy
## 4        Ned  Flanders
## 5      Homer   Simpson
## 6     Julius   Hibbert

Problem 3-2

3-2. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.). Since the titles were already extracted into our dataframe with the stringr package, we can construct the vector simply by checking which values in the Titles column are not NA:

!is.na(extracted.info$Titles)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

If we want to do this directly on the strings, we can use the str_detect function instead:

str_detect(fullname, "[:alpha:]{1,}\\.")
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

Problem 3-3

3-3. Construct a logical vector indicating whether a character has a second name. Here we read “second name” as the last name. Since our dataframe already summarizes the relevant information, we can check the last_name column, which shows that every individual has one:

!is.na(extracted.info$last_name)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
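If “second name” is instead read as a middle name or initial, which is an assumption about the exercise rather than something the text above settles, one could test the original names for a single capital letter followed by a period. The `names_raw` vector below is typed in by hand from raw.data, and `has_initial` is a hypothetical name.

```r
library(stringr)

# Names as they appear in raw.data, before any initials are removed (hand-copied)
names_raw <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
               "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# A lone capital letter followed by a period marks an initial such as "C.";
# titles like "Rev." and "Dr." have further letters before the period
has_initial <- str_detect(names_raw, "\\b[A-Z]\\.")
has_initial
```

Under that reading only the second entry would be flagged.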

Problem 4

Problem 4-1

  1. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression

4-1. [0-9]+\$

This expression matches one or more consecutive digits (0 through 9) immediately followed by a literal dollar sign. The backslash escapes $, which would otherwise anchor the match to the end of the string. The following shows a vector with true and false examples:

str_detect(c("123$","sdf544$","werh","32432"),"[0-9]+\\$")
## [1]  TRUE  TRUE FALSE FALSE
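To see which part of each string is actually matched, str_extract() can be run with the same pattern. This is a small illustrative sketch; `x` and `matched` are names introduced here.

```r
library(stringr)

x <- c("123$", "sdf544$", "werh", "32432")

# str_extract() returns the matched substring, or NA when nothing matches
matched <- str_extract(x, "[0-9]+\\$")
matched
# "123$" "544$" NA NA
```

In "sdf544$" only the digits and the dollar sign are matched; the leading letters are skipped.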

Problem 4-2

4-2. \b[a-z]{1,4}\b

This expression matches a standalone word of one to four lowercase letters (a-z): a word boundary, then 1-4 lowercase letters, then another word boundary. The following shows a vector with true and false examples:

str_detect(c("four","hi","Hi","hello"),"\\b[a-z]{1,4}\\b")
## [1]  TRUE  TRUE FALSE FALSE
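The effect of the word boundaries is easier to see on a whole sentence. The sketch below uses a made-up sentence; `sentence` and `short_words` are illustrative names.

```r
library(stringr)

sentence <- "the quick brown fox jumps over a lazy dog"

# Collect every standalone lowercase word of one to four letters;
# five-letter words such as "quick" are skipped entirely, not truncated
short_words <- str_extract_all(sentence, "\\b[a-z]{1,4}\\b")[[1]]
short_words
# "the" "fox" "over" "a" "lazy" "dog"
```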

Problem 4-3

4-3. .*?\.txt$

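This expression matches any string that ends in .txt: .*? lazily matches any run of characters (possibly none), \. is a literal period, and $ anchors the match to the end of the string. The following shows a vector with true and false examples: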

str_detect(c("@#%f.txt","a d 4.txt","Hi.tx","hell"),".*?\\.txt$")
## [1]  TRUE  TRUE FALSE FALSE
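A few edge cases help pin down what the pattern requires. The file names below are made up for illustration, and `files` and `is_txt` are hypothetical names.

```r
library(stringr)

files <- c(".txt", "notes.txt.bak", "atxt", "report.TXT")

# ".txt" matches because .*? may match nothing at all;
# "notes.txt.bak" fails because .txt is not at the end;
# "atxt" fails because the period must be literal;
# "report.TXT" fails because the match is case-sensitive
is_txt <- str_detect(files, ".*?\\.txt$")
is_txt
# TRUE FALSE FALSE FALSE
```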

Problem 4-4

4-4. \d{2}/\d{2}/\d{4}

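This expression matches two digits, a slash, two digits, a slash, and four digits, i.e. a date format like 09/14/2019. It checks only the shape of the string, not whether the date is valid. The following shows a vector with true and false examples: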

str_detect(c("09/14/2019","05/02/1992","9/14/2019","09/14/19"),"\\d{2}/\\d{2}/\\d{4}")
## [1]  TRUE  TRUE FALSE FALSE
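Because the pattern is unanchored and only checks the shape, it also accepts impossible dates and date-like substrings of longer strings. A short sketch with made-up inputs, where the anchored variant is an addition for comparison and not part of the exercise:

```r
library(stringr)

x <- c("99/99/9999", "109/14/20199")

# Unanchored: an impossible date and an embedded substring both match
loose <- str_detect(x, "\\d{2}/\\d{2}/\\d{4}")

# Anchored with ^ and $: the whole string must have the dd/dd/dddd shape
strict <- str_detect(x, "^\\d{2}/\\d{2}/\\d{4}$")

loose   # TRUE TRUE
strict  # TRUE FALSE
```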

Problem 4-5

4-5. <(.+?)>.+?</\1>

This expression matches < followed by one or more characters (captured as group 1) and >, then one or more characters, then a closing tag </...> whose name must repeat the captured group through the backreference \1. This is presumably used for HTML (or similarly formatted) documents. The following shows a vector with true and false examples:

str_detect(c("<html>This Should Return True</html>","<html>thisShouldReturnFalse"),"<(.+?)>.+?</\\1>")
## [1]  TRUE FALSE
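The role of the backreference can be illustrated with str_match(), which returns the capture group next to the full match. The two tag strings are made up for this sketch, and `tags` and `m` are illustrative names.

```r
library(stringr)

tags <- c("<b>bold</b>", "<b>bold</i>")

# Column 1 is the full match, column 2 is capture group 1 (the tag name);
# the second string fails because the closing tag does not repeat "b"
m <- str_match(tags, "<(.+?)>.+?</\\1>")
m
```

The first row holds "<b>bold</b>" and "b", while the second row is all NA.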

Problem 9

  1. The following code hides a secret message. Crack it with R and regular expressions. Hint: some of the characters are more revealing than others!

Defining challenge problem vector:

challenge_problem<- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

Let's start off with some exploratory string analysis. We will go down the list of character classes to see if any patterns arise.

Testing for letters only:

unlist(str_extract_all(challenge_problem,"[:alpha:]+")) %>%
  str_c(collapse= " ")
## [1] "clcopCow zmstc d wnkig OvdicpNuggvhryn Gjuwczi hqrfpRxs Aj dwpn Tanwo Uwisdij Lj kpf AT Idr coc bt yczjatOaootj t Nj ne c Sfek r w YwwojigO d vrfUrbz bkAnbhzgv R i zEcrop wAgnb SqoU fPa otfb wEm k t sR zqe fy n Nd t kc fE gmc Rgxo nhDk gr"

Testing for numeric only:

unlist(str_extract_all(challenge_problem,"[:digit:]+")) %>%
  str_c(collapse= " ")
## [1] "1 0 87 7 92 8 5 5 0 7 8 03 5 3 0 7 55 3 3 6 4 1 1 6 2 2 4 9 05 65 1 7 24 6 3 9 5 89 6 5 9 4 905 4 5"

Testing for lowercase only:

unlist(str_extract_all(challenge_problem,"[:lower:]+")) %>%
  str_c(collapse= " ")
## [1] "clcop ow zmstc d wnkig vdicp uggvhryn juwczi hqrfp xs j dwpn anwo wisdij j kpf dr coc bt yczjat aootj t j ne c fek r w wwojig d vrf rbz bk nbhzgv i z crop w gnb qo f a otfb w m k t s zqe fy n d t kc f gmc gxo nh k gr"

Testing for uppercase only:

unlist(str_extract_all(challenge_problem,"[:upper:]+")) %>%
  str_c(collapse= " ")
## [1] "C O N G R A T U L AT I O N S Y O U A R E A S U P E R N E R D"

Wow! This seems to provide an actual message. Let's clean it up by also keeping the periods and turning them into spaces:

unlist(str_extract_all(challenge_problem,"[[:upper:].]")) %>%
  str_c(collapse= "")%>%str_replace_all("\\."," ")
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD"
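The same clean-up can also be phrased the other way around: delete everything that is not an uppercase letter or sentence punctuation. The sketch below runs on a short made-up string rather than the real puzzle, so `toy` and `message_text` are hypothetical, and keeping the exclamation mark is an extra choice not made above.

```r
library(stringr)

# Hypothetical scrambled string, much shorter than challenge_problem
toy <- "a1Hb2Ic3.d4Y5Ou6!z"

message_text <- toy %>%
  # Remove every character that is not uppercase, a period, or "!"
  str_replace_all("[^[:upper:].!]", "") %>%
  # Periods act as word separators in the hidden message
  str_replace_all("\\.", " ")

message_text
# "HI YO!"
```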

The puzzle is solved!

We did it!
