
This document is available on my GitHub and on rpubs.com. All the data used for this week’s assignment is contained within this document.

Load Packages

library(stringr)
library(tidyr)

Chapter 8 - Problem 3

First we’ll format the original data as in the example:

raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}" ))

phone <- unlist(str_extract_all(raw.data, "\\(?(\\d{3})?\\)?(-| )?\\d{3}(-| )?\\d{4}"))

phonebook <- data.frame(name = name, phone = phone)

name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

3.1)

Examining the data, we can see that most of the names already follow the first_name last_name format. The ones that don’t are written as “Last, First” with a comma separator. So we’ll loop over the names, and when we find one with a comma, we’ll split it on the comma and swap the two parts:

for (i in seq_along(name)) {
  if (grepl(",", name[i], fixed = TRUE)) {
    # split "Last, First" on the comma and trim the leftover whitespace
    name_split <- str_trim(unlist(str_split(name[i], ",")))
    first <- name_split[2]
    last <- name_split[1]
    name[i] <- paste(first, last)
  }
}

name

## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
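As a vectorised alternative to the loop, the sketch below swaps the two parts with a single `str_replace` call and two capture groups. The variable name `reordered` is only for illustration. Because the pattern consumes the comma and any whitespace after it, no stray space is left in front of the first name.

```r
library(stringr)

# Names as extracted from raw.data earlier in this document
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Group 1 is everything before the comma (last name), group 2 is
# everything after it (first name). Names without a comma do not match
# the pattern and are returned unchanged.
reordered <- str_replace(name, "^([^,]+),\\s*(.+)$", "\\2 \\1")

reordered
```

Names that contain more than one comma would need extra handling; none occur in this data set.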

First and last names are now ordered correctly. Next we’ll update our phonebook with the first and last names separated. We’ll also remove titles, but we don’t want to upset Mr. Burns by removing the “C.” from in front of his name!

And while we’re at it, we’ll clean up the phone numbers a bit too…

#separate first and last names
phonebook$name <- name
phonebook <- separate(phonebook,name, sep = " (?=[^ ]+$)",
         into=c("first_name","last_name"))

#remove the titles (the pattern has no capture group, so replace with an empty string)
phonebook$first_name <- gsub("[[:alpha:]]{2,}\\.\\s*", "", phonebook$first_name)


#clean up the phone numbers 
#drop brackets
phonebook$phone <- gsub("[()]", "", phonebook$phone)

#replace spaces with dashes
phonebook$phone <- gsub("\\s", "-", phonebook$phone)

#add dashes where missing
phonebook$phone <- gsub("(\\d{3})(\\d{4})$","\\1-\\2",phonebook$phone)

#create a little function to add an area code where it's missing
add_area_code <- function(num,chars){
  result <- num
  #only prepend when the number does not already start with the area code
  if(!startsWith(num, chars)){
    result <- paste0(chars,num)
  }

  return(result)
}

#i assume that springfield is all area code 636
phonebook$phone <- sapply(phonebook$phone, function(x) add_area_code(x,"636-") )

#remove any extra spaces
phonebook$phone <- gsub("\\s", "", phonebook$phone)

phonebook

##       first_name last_name        phone
## 1            Moe   Szyslak 636-555-1239
## 2  C. Montgomery     Burns 636-555-0113
## 3        Timothy   Lovejoy 636-555-6542
## 4            Ned  Flanders 636-555-8904
## 5          Homer   Simpson 636-555-3226
## 6         Julius   Hibbert 636-555-3642
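The phone clean-up above can also be written as one step: strip everything that is not a digit, add the area code where only seven digits remain, then re-insert the dashes. The sketch below uses base R only. The helper name `normalise_phone` and its `default_area` argument are hypothetical, and the default of 636 is the same assumption made above.

```r
# Hypothetical helper: normalise phone numbers to the form "AAA-EEE-NNNN".
# The default area code "636" is an assumption, as in the text above.
normalise_phone <- function(phone, default_area = "636") {
  digits <- gsub("[^0-9]", "", phone)                    # keep digits only
  short  <- nchar(digits) == 7                           # no area code present
  digits[short] <- paste0(default_area, digits[short])   # prepend the default
  # re-insert the dashes between the three groups of digits
  sub("^([0-9]{3})([0-9]{3})([0-9]{4})$", "\\1-\\2-\\3", digits)
}

# Phone numbers in the raw formats found in raw.data
phones <- c("555-1239", "(636) 555-0113", "555-6542",
            "555 8904", "636-555-3226", "5553642")

normalise_phone(phones)
```

Working on the digits alone means the function does not need a separate rule for each punctuation style. Inputs that do not end up with exactly ten digits are returned as bare digits, without dashes.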

3.2)

Here we’re checking whether each individual has a title. We’ll use a regular expression that looks for prefixes of two or more letters ending in a “.”. If the pattern matches, the individual has a title.

has_title <- str_detect(name,"[[:alpha:]]{2,}\\.\\s*")

title <- data.frame(name,has_title)

title

##                   name has_title
## 1          Moe Szyslak     FALSE
## 2  C. Montgomery Burns     FALSE
## 3 Rev. Timothy Lovejoy      TRUE
## 4         Ned Flanders     FALSE
## 5        Homer Simpson     FALSE
## 6   Dr. Julius Hibbert      TRUE
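If the title itself is wanted rather than just a TRUE/FALSE flag, `str_extract` can return it. This is a minimal sketch. Anchoring the pattern with `^` is an added assumption (titles only appear at the start of a name), and the column names are illustrative.

```r
library(stringr)

# Names in first_name last_name order, titles still attached
name <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Two or more letters followed by a period, at the start of the string.
# str_extract returns NA where the pattern does not match.
title <- str_extract(name, "^[[:alpha:]]{2,}\\.")

data.frame(name, title, has_title = !is.na(title))
```

The single-letter initial “C.” is not picked up because the pattern requires at least two letters before the period.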

3.3)

Now we’re checking whether a character has a second name. We’ll use almost the same regular expression as for the last question, applied to the first_name column, from which the titles have already been removed. Here, we’re looking for a single occurrence of an initial followed by a “.”, as in the case of Mr. Burns.

has_second_name <- str_detect(phonebook$first_name,"\\b[[:alpha:]]\\.\\s*")

second_name <- data.frame(name,has_second_name)

second_name

##                   name has_second_name
## 1          Moe Szyslak           FALSE
## 2  C. Montgomery Burns            TRUE
## 3 Rev. Timothy Lovejoy           FALSE
## 4         Ned Flanders           FALSE
## 5        Homer Simpson           FALSE
## 6   Dr. Julius Hibbert           FALSE
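To illustrate how much the quantifier on the initial matters, the sketch below compares two patterns on names that still carry their titles: one where the letter is optional, and one that requires exactly one letter directly after a word boundary. The variable names are illustrative only.

```r
library(stringr)

# Full names, titles still attached
full_name <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
               "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Optional letter: this effectively matches any period, so titles match too
any_period <- str_detect(full_name, "[[:alpha:]]?\\.")

# Exactly one letter directly after a word boundary, then a period
initial_only <- str_detect(full_name, "\\b[[:alpha:]]\\.")

data.frame(full_name, any_period, initial_only)
```

With the word boundary in place, only “C. Montgomery Burns” is flagged, even though “Rev.” and “Dr.” are still present in the input.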

Chapter 8 - Problem 4

Each expression is described in turn; code examples for all five follow the descriptions.

4.1)

Matches one or more digits immediately followed by a literal “$”.

4.2)

Matches whole words made up of one to four lowercase letters in the range a-z.

4.3)

Matches any string that ends in “.txt”, such as a file name. The part preceding “.txt” may be empty.

4.4)

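Matches two digits, a slash, two digits, a slash and four digits, as in the common date formats “dd/mm/yyyy” and “mm/dd/yyyy”. The pattern does not check that the day and month values are valid.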

4.5)

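Matches an opening markup tag, the text inside it, and the corresponding closing tag. The backreference “\1” requires the name in the closing tag to be the same as in the opening tag.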

Examples below:

#the regex
one <- "[0-9]+\\$"
two <- "\\b[a-z]{1,4}\\b"
three <- ".*?\\.txt$"
four <- "\\d{2}/\\d{2}/\\d{4}"
five <- "<(.+?)>.+?</\\1>"

#an example match, and no-match case.
t1 <- c("123456789$","money$")
t2 <- "Match this But Not THIS"
t3 <- c("match_me.txt","dont_match_me.txt ")
t4 <- c("valentines day is 14/02/2018", "not 2018-02-15")
t5 <- c("example html tag = <h>words</h>", "example garbage <h>garbage</q>")

unlist(str_extract_all(t1,one))

## [1] "123456789$"

unlist(str_extract_all(t2,two))

## [1] "this"

unlist(str_extract_all(t3,three))

## [1] "match_me.txt"

unlist(str_extract_all(t4,four))

## [1] "14/02/2018"
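The date pattern only checks the shape of the string, not the values. As a rough sketch, the comparison below puts it next to a stricter pattern for dd/mm/yyyy. The `strict` pattern and the test strings are my own illustration, not part of the exercise.

```r
library(stringr)

# The pattern from the exercise: two digits, two digits, four digits
loose <- "\\d{2}/\\d{2}/\\d{4}"

# Hypothetical stricter pattern for dd/mm/yyyy: day 01-31, month 01-12
strict <- "\\b(0[1-9]|[12]\\d|3[01])/(0[1-9]|1[0-2])/\\d{4}\\b"

dates <- c("14/02/2018", "99/99/9999", "02/14/2018")

loose_hits  <- str_detect(dates, loose)
strict_hits <- str_detect(dates, strict)

data.frame(dates, loose_hits, strict_hits)
```

Even the stricter version accepts impossible dates such as 31/02/2018. Checking the number of days in each month is better left to a date parser than to a regular expression.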

unlist(str_extract_all(t5,five))

## [1] "<h>words</h>"
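To see what the capture group and the backreference are doing in the tag pattern, `str_match_all` can return the groups separately. In this sketch a second capture group is added around the tag content, which is a small change to the exercise's pattern, and the `html` string is made up.

```r
library(stringr)

# Exercise pattern with an extra capture group around the content
tag_pattern <- "<(.+?)>(.+?)</\\1>"

# Made-up input with two well-formed tag pairs
html <- "<b>bold</b> and <i>italic</i>"

# One row per match: column 1 is the full match, then one column per group
m <- str_match_all(html, tag_pattern)[[1]]

tag_names    <- m[, 2]
tag_contents <- m[, 3]

m
```

The lazy quantifier `+?` keeps each match as short as possible. With a greedy `.+`, the first group could stretch across several tags before the backreference is tried.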

Chapter 8 - Problem 9

secret_code <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr" 

answer <- unlist(str_extract_all(secret_code,"[[:upper:]]"))

answer

##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"

“Congratulations you are a super nerd”… yeah… that sounds about right. :)
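The secret string also contains a few “.” characters and a “!” between the capital letters. As a sketch, extracting those along with the capitals and collapsing the result gives the message as one string with word breaks. The variable names below are illustrative, and the string is pasted together from its three original lines.

```r
library(stringr)

# The secret string from the exercise, one piece per original line
secret_code <- paste0(
  "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO",
  "d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5",
  "fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
)

# Keep capital letters plus the "." and "!" characters
pieces <- unlist(str_extract_all(secret_code, "[[:upper:].!]"))

# Collapse to a single string, then turn the periods into spaces
message_raw   <- paste(pieces, collapse = "")
message_clean <- str_replace_all(message_raw, fixed("."), " ")

message_clean
```

Using `fixed(".")` treats the period as a literal character, so it does not need to be escaped.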

DATA 607 - WEEK 3 - Assignment

Paul Britton