Homework week 4

library(stringr)
library(Hmisc)
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5543642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

3 Name Regex

(a) rearrange vector so all names are `first_name last_name`

last_first <- name[!is.na(str_extract(name, ".+, .+"))] # names in "last, first" format
first_name <- str_extract(last_first, ", .+") # first-name part, still carrying the leading ", "
last_name <- str_extract(last_first, ".+, ") # last-name part, still carrying the trailing ", "
names <- as.list(paste(first_name, last_name)) # pastes first name and last name together
last_first <- as.list(gsub(", ", "", names)) # strips the leftover ", " separators
first_last <- as.list(name[is.na(str_extract(name, ".+, .+"))]) # names already in "first last" format
name <- append(last_first, first_last) # combines the two lists; the reordered names come first
list.tree(name)

##  name = list 6 (760 bytes)
## .  [[1]] = character 1= C. Montgomery Burns 
## .  [[2]] = character 1= Homer Simpson 
## .  [[3]] = character 1= Moe Szyslak 
## .  [[4]] = character 1= Rev. Timothy Lovejoy 
## .  [[5]] = character 1= Ned Flanders 
## .  [[6]] = character 1= Dr. Julius Hibbert
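
The approach above moves the two swapped names to the front of the list. As a hedged alternative, a single `str_replace` call with capture groups should swap `last, first` in place and keep the original order. The variable `name_swapped` is a name made up for this sketch, and the starting vector is retyped from the output at the top of the page.

```r
library(stringr)

# Names as extracted from raw.data at the top of the page
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# The first (.+) captures the last name, the second (.+) the first name;
# names without ", " do not match and are returned unchanged
name_swapped <- str_replace(name, "(.+), (.+)", "\\2 \\1")
name_swapped
```

Because names without a comma pass through untouched, no separate split into two lists is needed.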

(b) check for titles

str_detect(name, "[:alpha:]{2,}\\..")

## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE

title <- name[!is.na(str_extract(name, "[:alpha:]{2,}\\.."))] # keeps names with two or more letters directly before a period
list.tree(title)

##  title = list 2 (296 bytes)
## .  [[1]] = character 1= Rev. Timothy Lovejoy 
## .  [[2]] = character 1= Dr. Julius Hibbert

(c) check for second name

str_detect(name, ("[:upper:]\\.+"))

## [1]  TRUE FALSE FALSE FALSE FALSE FALSE

has_middle_name <- name[!is.na(str_extract(name, "[:upper:]\\.+"))] # keeps names containing a capital letter followed by a period (an initial)
list.tree(has_middle_name)

##  has_middle_name = list 1 (168 bytes)
## .  [[1]] = character 1= C. Montgomery Burns

7 HTML tag problem

string <- "<title>+++BREAKING NEWS+++</title>"
attempt_1 <- str_extract(string, "<.+>") 
attempt_1

## [1] "<title>+++BREAKING NEWS+++</title>"

attempt_2 <- str_extract(string, "<\\w+>")
attempt_2

## [1] "<title>"

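attempt_1 fails because `.+` is greedy: it matches as many characters as possible, so the match runs from the first `<` to the last `>` in the string. That last `>` closes the second tag, so we get the whole string. attempt_2 only allows word characters (`\w+`) between `<` and `>`, so it matches the first tag; the closing tag is skipped because it contains `/`, which is not a word character.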
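
Two other fixes seem worth sketching here, neither of which is part of the original exercise: a lazy quantifier (`.+?`) and a negated character class (`[^>]+`). The names `attempt_3`, `attempt_4` and `all_tags` are made up for this example.

```r
library(stringr)

string <- "<title>+++BREAKING NEWS+++</title>"

# Lazy quantifier: .+? stops at the first ">" it can reach
attempt_3 <- str_extract(string, "<.+?>")

# Negated class: [^>]+ can never run past a ">"
attempt_4 <- str_extract(string, "<[^>]+>")

# Unlike \\w+, the negated class also accepts the "/" in the closing tag
all_tags <- unlist(str_extract_all(string, "<[^>]+>"))
```

Both variants return only the first tag with `str_extract`, while `str_extract_all` picks up the closing tag as well.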

8 binomial formula from string

string <- "(5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem"
attempt_1 <- str_extract(string, "[^0-9=+*()]+")
attempt_1

## [1] "-"

attempt_2 <- str_extract(string, "[0-9=+*()].+[0-9]")
attempt_2

## [1] "(5-3)^2=5^2-2*5*3+3^2"

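attempt_1 fails because a caret `^` at the start of a character class negates it, so `[^0-9=+*()]+` looks for a run of characters that are not digits, `=`, `+`, `*`, or parentheses. The first such run is the `-` between 5 and 3. attempt_2 drops the caret: the match starts at the first character from the class, `.+` then matches any characters greedily, and the trailing `[0-9]` forces the match to end at the last digit in the string, which is the final 2 of `3^2`.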
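
attempt_2 only works because no digit appears in the text after the formula. A stricter sketch, assuming the formula uses only digits, parentheses, `=`, `+`, `*`, `^` and `-`, lists exactly those characters in the class. The name `attempt_3` is made up for this example.

```r
library(stringr)

string <- "(5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem"

# Allowed characters: digits, parentheses, =, +, *, a literal ^ (not in first
# position, so it does not negate the class) and an escaped hyphen
attempt_3 <- str_extract(string, "[0-9()=+*^\\-]+")
attempt_3
```

The match stops at the first space, so trailing text that contains digits would not be pulled into the result.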

Christophe

February 15, 2016
