This document is available on my GitHub and on rpubs.com. All of the data used in this week’s assignment is contained within the document.
library(stringr)
library(tidyr)
First we’ll format the original data as in the example:
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}" ))
phone <- unlist(str_extract_all(raw.data, "\\(?(\\d{3})?\\)?(-| )?\\d{3}(-| )?\\d{4}"))
phonebook <- data.frame(name = name, phone = phone)
name
## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
Examining the data, we can see that most of the names already adhere to the first_name last_name format. The ones that don’t are written as “Last, First”, with a comma separator. So we’ll loop over the names, and when we find one with a comma, we’ll split it on the comma and re-order the two parts:
for (i in seq_along(name)) {
  if (grepl(",", name[i])) {
    #split "Last, First" on the comma and trim the stray whitespace
    name_split <- str_trim(unlist(str_split(name[i], ",")))
    #re-order as "First Last"
    name[i] <- paste(name_split[2], name_split[1])
  }
}
name
## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
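The same re-ordering can also be done without a loop. The sketch below is one possible vectorized version, not the approach used above: it captures the text on either side of the comma and swaps the two groups with a backreference. The variable name `reordered` is made up for this illustration, and the input vector is copied by hand from the extracted names.

```r
library(stringr)

# Names as extracted from the raw data (copied from the output above)
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Group 1 is everything before the comma, group 2 everything after it
# (minus the whitespace that follows the comma). Names without a comma
# do not match the pattern, so str_replace() returns them unchanged.
reordered <- str_replace(name, "^([^,]+),\\s*(.+)$", "\\2 \\1")
reordered
```

Because `\\s*` swallows the whitespace after the comma, no separate trimming step is needed in this version.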
First and last names are now ordered correctly. Next we’ll update our phonebook with the first and last names separated. We’ll also remove titles, but we don’t want to upset Mr. Burns by removing the “C.” from in front of his name!
And while we’re at it, we’ll clean up the phone numbers a bit too…
#separate first and last names
phonebook$name <- name
phonebook <- separate(phonebook,name, sep = " (?=[^ ]+$)",
into=c("first_name","last_name"))
#remove the titles (the pattern has no capture group, so we replace with an empty string)
phonebook$first_name <- gsub("[[:alpha:]]{2,}\\.\\s*", "", phonebook$first_name)
#clean up the phone numbers
#drop brackets
phonebook$phone <- gsub("[()]", "", phonebook$phone)
#replace spaces with dashes
phonebook$phone <- gsub("\\s", "-", phonebook$phone)
#add dashes where missing
phonebook$phone <- gsub("(\\d{3})(\\d{4})$","\\1-\\2",phonebook$phone)
#create a little function to add an area code where it's missing
add_area_code <- function(num, chars) {
  #numbers that already start with the area code are returned unchanged
  if (startsWith(num, chars)) {
    return(num)
  }
  #paste0 joins without a separator, so no extra space is introduced
  paste0(chars, num)
}
#i assume that springfield is all area code 636
phonebook$phone <- sapply(phonebook$phone, add_area_code, chars = "636-", USE.NAMES = FALSE)
phonebook
##      first_name last_name        phone
## 1           Moe   Szyslak 636-555-1239
## 2 C. Montgomery     Burns 636-555-0113
## 3       Timothy   Lovejoy 636-555-6542
## 4           Ned  Flanders 636-555-8904
## 5         Homer   Simpson 636-555-3226
## 6        Julius   Hibbert 636-555-3642
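As a cross-check on the step-by-step cleanup, the phone numbers can also be normalized in one pass by stripping everything that isn't a digit and then re-inserting the dashes. This is only a sketch under the same assumption made above (every number without an area code belongs to 636); `normalize_phone` and its `area_code` argument are hypothetical names, not part of the original code.

```r
library(stringr)

# Phone numbers as extracted from the raw data
phone <- c("555-1239", "(636) 555-0113", "555-6542",
           "555 8904", "636-555-3226", "5553642")

# Hypothetical helper: reduce each number to its digits, prepend the
# assumed area code to 7-digit numbers, then format as xxx-xxx-xxxx.
normalize_phone <- function(x, area_code = "636") {
  digits <- str_replace_all(x, "\\D", "")
  digits <- ifelse(nchar(digits) == 7, paste0(area_code, digits), digits)
  str_replace(digits, "^(\\d{3})(\\d{3})(\\d{4})$", "\\1-\\2-\\3")
}

normalize_phone(phone)
```

Working from the bare digits means brackets, spaces, and missing dashes are all handled by the same two replacements, at the cost of silently passing through any number that isn't 7 or 10 digits long.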
Here we check whether each individual has a title. We’ll use a regular expression that looks for a prefix of at least 2 letters ending in a “.”. If it matches, the individual has a title.
has_title <- str_detect(name,"[[:alpha:]]{2,}\\.\\s*")
title <- data.frame(name,has_title)
title
##                   name has_title
## 1          Moe Szyslak     FALSE
## 2  C. Montgomery Burns     FALSE
## 3 Rev. Timothy Lovejoy      TRUE
## 4         Ned Flanders     FALSE
## 5        Homer Simpson     FALSE
## 6   Dr. Julius Hibbert      TRUE
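If we want the title itself rather than just a TRUE/FALSE flag, a small variation of the same pattern can pull it out. The sketch below anchors the pattern at the start of the name with `^`, on the assumption that a title only ever appears as the first token; `title_text` is a name invented for this example.

```r
library(stringr)

# Names in "First Last" order, copied from the table above
name <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# str_extract() returns the matched text, or NA where nothing matches.
# "C." is not picked up because the pattern needs at least two letters.
title_text <- str_extract(name, "^[[:alpha:]]{2,}\\.")
title_text
```

The `NA` entries line up with the `FALSE` rows in the table, so `!is.na(title_text)` gives the same flag as `has_title` for this data.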
Now we’re looking to see if a character has a second name. We’ll use almost the same regular expression as for the last question. Here, we’re looking for a single initial followed by a “.”, as in the case of Mr. Burns. Because the titles have already been removed from first_name, any remaining initial signals a second name.
has_second_name <- str_detect(phonebook$first_name,"\\b[[:alpha:]]\\.\\s*")
second_name <- data.frame(name,has_second_name)
second_name
##                   name has_second_name
## 1          Moe Szyslak           FALSE
## 2  C. Montgomery Burns            TRUE
## 3 Rev. Timothy Lovejoy           FALSE
## 4         Ned Flanders           FALSE
## 5        Homer Simpson           FALSE
## 6   Dr. Julius Hibbert           FALSE
Here is what each of the five regular expressions does:
matches one or more digits followed by a literal “$”
matches whole words made up of 1 to 4 lowercase letters in the range a-z
matches any string (possibly empty) preceding “.txt”, where the string ends with “.txt”
matches two digits, two digits, and four digits separated by “/”, as in the common date formats “dd/mm/yyyy” and “mm/dd/yyyy” (the digit ranges are not validated)
matches an opening markup tag, the text inside it, and the matching closing tag
Examples below:
#the regex
one <- "[0-9]+\\$"
two <- "\\b[a-z]{1,4}\\b"
three <- ".*?\\.txt$"
four <- "\\d{2}/\\d{2}/\\d{4}"
five <- "<(.+?)>.+?</\\1>"
#an example match, and no-match case.
t1 <- c("123456789$","money$")
t2 <- "Match this But Not THIS"
t3 <- c("match_me.txt","dont_match_me.txt ")
t4 <- c("valentines day is 14/02/2018", "not 2018-02-15")
t5 <- c("example html tag = <h>words</h>", "example garbage <h>garbage</q>")
unlist(str_extract_all(t1,one))
## [1] "123456789$"
unlist(str_extract_all(t2,two))
## [1] "this"
unlist(str_extract_all(t3,three))
## [1] "match_me.txt"
unlist(str_extract_all(t4,four))
## [1] "14/02/2018"
unlist(str_extract_all(t5,five))
## [1] "<h>words</h>"
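The fifth expression relies on a backreference: `\\1` in the pattern must repeat whatever the first group captured, which is why the mismatched `<h>…</q>` pair is rejected. To make that visible, the sketch below uses `str_match()`, which returns the full match alongside each capture group; the variable name `m` is just for this illustration.

```r
library(stringr)

five <- "<(.+?)>.+?</\\1>"
t5 <- c("example html tag = <h>words</h>", "example garbage <h>garbage</q>")

# str_match() returns a character matrix: column 1 holds the full match,
# column 2 holds capture group 1 (the tag name). Rows without a match are NA.
m <- str_match(t5, five)
m[, 1]
m[, 2]
```

For the first string the captured tag name is `h`, and the closing tag has to repeat it; the second string yields `NA` in both columns.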
secret_code <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
answer <- unlist(str_extract_all(secret_code,"[[:upper:]]"))
answer
##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"
“Congratulations you are a super nerd”… yeah… that sounds about right. :)
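The extracted letters come back as one unbroken run, so the word breaks have to be guessed. One possible refinement, sketched below, also keeps the “.” and “!” characters from the code and treats each period as a word separator. That reading of the periods is an assumption on my part, and `message_text` is a name made up for this example.

```r
library(stringr)

secret_code <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# Keep uppercase letters plus "." and "!", then glue the pieces together
pieces <- unlist(str_extract_all(secret_code, "[[:upper:].!]"))
message_text <- str_c(pieces, collapse = "")

# Assumption: each "." marks a word boundary, so swap it for a space
message_text <- str_replace_all(message_text, fixed("."), " ")
message_text
```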