607 Week 4 Assignment

Copy the introductory example. The vector name stores the extracted names.

R> name
[1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

Here is the starting point:

library(stringr)
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

To get the names into the standard format, follow the steps below.

First, store the character vector of names in a data frame with a column called names.

# store the character vector in a data frame with a single column called names
namesdf <- data.frame(names = name, stringsAsFactors = FALSE)

If a name contains a comma, the last name comes first. In that case, put the last word first, followed by the first word (any middle initial is dropped).

namesdf$stdFormatNames <- ifelse(grepl(",", namesdf$names),
                                 paste(word(namesdf$names, -1), word(namesdf$names, 1)),
                                 namesdf$names)

Remove the titles and commas. The dots are escaped so that they match literal periods rather than any character.

namesdf$stdFormatNames <- gsub("Rev\\.|Dr\\.|,", "", namesdf$stdFormatNames)
namesdf$stdFormatNames

## [1] "Moe Szyslak"      "Montgomery Burns" " Timothy Lovejoy"
## [4] "Ned Flanders"     "Homer Simpson"    " Julius Hibbert"
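The rearrangement can also be sketched with capture groups and backreferences in base R. This is an alternative under stated assumptions, not the approach used above: the names std_names and swapped are illustrative, and unlike the result above this version keeps the middle initial "C." and leaves no leading space.

```r
# The extracted names, repeated here so the sketch is self-contained
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Swap "Last, First" into "First Last"; names without a comma pass through unchanged
swapped <- sub("^([^,]+),[[:space:]]*(.+)$", "\\2 \\1", name)

# Drop a leading title (escaped dot = literal period) and the whitespace after it
std_names <- sub("^(Rev|Dr)\\.[[:space:]]*", "", swapped)
std_names
```

Anchoring the title pattern with ^ means only a title at the start of the name is removed.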

1. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

Use str_detect to check whether each name contains a title.

namesdf$names

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

namesdf$hasTitle <- str_detect(namesdf$names, "Rev\\.|Dr\\.")
namesdf$names

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

namesdf[,c("names","hasTitle")]

##                  names hasTitle
## 1          Moe Szyslak    FALSE
## 2 Burns, C. Montgomery    FALSE
## 3 Rev. Timothy Lovejoy     TRUE
## 4         Ned Flanders    FALSE
## 5       Simpson, Homer    FALSE
## 6   Dr. Julius Hibbert     TRUE
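When matching titles, note that an unescaped dot matches any single character, not only a period. A minimal sketch in base R shows the difference; the vector test_names is made up for illustration and is not part of the assignment data.

```r
# Illustrative names only: two carry a real title, two merely start like one
test_names <- c("Dr. Julius Hibbert", "Drew Barrymore",
                "Rev. Timothy Lovejoy", "Paul Revere")

# Unescaped dot: "Dr." also matches "Dre", and "Rev." also matches "Reve"
loose <- grepl("Rev.|Dr.", test_names)

# Escaped dot: only a literal period after the abbreviation counts
strict <- grepl("Rev\\.|Dr\\.", test_names)

data.frame(test_names, loose, strict)
```

For the six Simpsons names both patterns give the same answer, but the escaped version is the safer choice on other data.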

1. Construct a logical vector indicating whether a character has a second name.

Check whether the trimmed name contains a space, i.e. whether it consists of more than one part:

namesdf$stdFormatNames

## [1] "Moe Szyslak"      "Montgomery Burns" " Timothy Lovejoy"
## [4] "Ned Flanders"     "Homer Simpson"    " Julius Hibbert"

grepl( " ",str_trim(namesdf$stdFormatNames))

## [1] TRUE TRUE TRUE TRUE TRUE TRUE
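The check above returns TRUE for every element because each name has at least a first and a last name. If "second name" is read as a middle name or initial, which is an assumption about what the exercise intends, one possible sketch in base R counts the name parts once any title is removed. The names no_title, n_parts and has_second_name are illustrative.

```r
# The extracted names, repeated here so the sketch is self-contained
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Remove a leading title so that it is not counted as a name part
no_title <- sub("^(Rev|Dr)\\.[[:space:]]*", "", name)

# Count whitespace-separated parts; more than two suggests a second name
n_parts <- lengths(strsplit(trimws(no_title), "[[:space:]]+"))
has_second_name <- n_parts > 2
has_second_name
```

Under this reading only "Burns, C. Montgomery" has a second name.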

1. Consider the string <title>+++BREAKING NEWS+++</title>. We would like to extract the first HTML tag. To do so we write the regular expression <.+>. Explain why this fails and correct the expression.
- R applies greedy quantification by default. A quantifier such as + matches as many repetitions of the preceding element as possible. Because . matches any character, <.+> runs from the first < to the last > in the string, so it returns the whole string instead of the first tag.
```
stringToSearch<-"<title>+++BREAKING NEWS+++</title>"
str_extract(stringToSearch,"<.+>")
```
```
## [1] "<title>+++BREAKING NEWS+++</title>"
```
  We can change this behavior by adding a ? after the quantifier, which makes it lazy: the expression then matches the shortest possible sequence of characters between < and the next >.
```
str_extract(stringToSearch, "<.+?>")
```
```
## [1] "<title>"
```
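As a side-by-side comparison, the sketch below runs the greedy and lazy patterns in base R and adds a third option, a negated character class, which is a common alternative that is not part of the original answer. The variable names are illustrative.

```r
html <- "<title>+++BREAKING NEWS+++</title>"

# Greedy: .+ grabs as much as possible, up to the last ">"
greedy <- regmatches(html, regexpr("<.+>", html))

# Lazy: .+? stops at the first ">" that completes a match
lazy <- regmatches(html, regexpr("<.+?>", html, perl = TRUE))

# Negated class: one or more characters that are not ">", then ">"
negated <- regmatches(html, regexpr("<[^>]+>", html))

c(greedy = greedy, lazy = lazy, negated = negated)
```

The negated class does not depend on lazy-quantifier support, so it also works in regex flavors that lack it.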

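1. Consider the string (5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem. We would like to extract the formula in the string. To do so we write the regular expression [^0-9=+*()]+. Explain why this fails and correct the expression.
- A caret at the beginning of a character class negates it, so [^0-9=+*()]+ matches runs of characters that are not digits or the listed symbols; the first such run is the minus sign. Moving the caret away from the first position makes it a literal ^, and placing the hyphen last makes it a literal - as well, so the corrected class matches the whole formula.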
```
binomial_str <- "(5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem."
str_extract(binomial_str, "[^0-9=+*()]+")
```
```
## [1] "-"
```
```
str_extract(binomial_str, "[0-9=+*()^-]+")
```
```
## [1] "(5-3)^2=5^2-2*5*3+3^2"
```
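To see what the negated class actually matches, it can help to list every match rather than only the first. The following base R sketch does that and then applies the corrected class; the variable names negated_hits and formula are illustrative.

```r
binomial_str <- "(5-3)^2=5^2-2*5*3+3^2 conforms to the binomial theorem."

# Leading caret negates the class: every run of characters NOT in the set
negated_hits <- regmatches(binomial_str,
                           gregexpr("[^0-9=+*()]+", binomial_str))[[1]]
negated_hits

# Caret moved off the first position and hyphen placed last: both are literals
formula <- regmatches(binomial_str,
                      regexpr("[0-9=+*()^-]+", binomial_str))
formula
```

The negated pattern picks up the minus signs, the carets and the trailing sentence, which is exactly the part of the string that the formula pattern should leave out.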

Chirag Vithalani

February 16, 2016