---
title: "607 Assignment3- String"
author: "Hui (Gracie) Han"
date: "Sep16, 2018"
output: html_document
---

Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to problems 3 and 4 from chapter 8 of Automated Data Collection in R. Problem 9 is extra credit. You may work in a small group, but please submit separately with names of all group participants in your submission.

Problem 3. Copy the introductory example. The vector name stores the extracted names.

[1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"

(a) Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name. (I will work on 3a after completing 3b and 3c.)

(b) Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.). (I will work on 3b first and then do 3a, which makes more logical sense.)

(c) Construct a logical vector indicating whether a character has a second name.

First, load the libraries and the original data.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.3.3

## -- Attaching packages ---------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.0.0     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## Warning: package 'tibble' was built under R version 3.3.3

## Warning: package 'tidyr' was built under R version 3.3.3

## Warning: package 'readr' was built under R version 3.3.3

## Warning: package 'purrr' was built under R version 3.3.3

## Warning: package 'dplyr' was built under R version 3.3.3

## Warning: package 'forcats' was built under R version 3.3.3

## -- Conflicts ------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library (stringr)
library(stringi)

## Warning: package 'stringi' was built under R version 3.3.3

getwd()

## [1] "D:/607 Sabrina Khan Andy C HomePC"

NamesInput <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
NamesInput

## [1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

Question 3b is done first.

3b) Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

First, extract all the names as they appear in the input.

Names1<-unlist(str_extract_all(NamesInput, "[[:alpha:]., ]{2,}"))
Names1

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

# Escape the dots so that "." matches a literal period rather than any character
NamesWtitlesLogic <- str_detect(Names1, "Rev\\.|Dr\\.")
NamesWtitlesLogic

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
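In a regular expression, an unescaped `.` matches any character, so a title pattern with bare dots can also match ordinary names. The sketch below illustrates the difference; the names in `test_names` are hypothetical and are not part of the assignment data.

```r
library(stringr)

# Hypothetical names, not taken from the assignment data
test_names <- c("Dr. Julius Hibbert", "Drew Barrymore", "Paul Revere")

# Unescaped dot: "." matches any character, so "Dre" and "Reve" also match
loose <- str_detect(test_names, "Rev.|Dr.")

# Escaped dot: only a literal period after the title matches
strict <- str_detect(test_names, "Rev\\.|Dr\\.")

loose
strict
```

With the escaped pattern, only the first name is flagged as carrying a title.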

3c) Construct a logical vector indicating whether a character has a second name.

To solve this, detect a capital letter followed by a period, which marks an abbreviated second name such as "C.".

Names1

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

NamesSecond <- str_detect (Names1, "[A-Z]{1}\\.")
NamesSecond

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

3c) Alternative approach: construct a logical vector indicating whether a character has a second name.

To solve this, count the words in each name after removing the titles. A regular name has two words; a name with a second name has three.

## remove the titles first (dots escaped so they match literal periods)
NameWOtitle <- str_replace(Names1, "Rev\\.|Dr\\.", replacement = "")
NameWOtitle

## [1] "Moe Szyslak"          "Burns, C. Montgomery" " Timothy Lovejoy"    
## [4] "Ned Flanders"         "Simpson, Homer"       " Julius Hibbert"

# count the words in each name
NamesCount <-str_count(NameWOtitle, "\\w+")
NamesCount

## [1] 2 3 2 2 2 2

As the last run shows, only the second element has three words (that is, a second name). Construct a logical vector for that.

# Compare the counts numerically: more than two words means a second name
NamesCountGT2 <- NamesCount > 2
NamesCountGT2

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

Next, do 3a.

(a) Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

First, remove the titles by replacing each title with an empty string.

# str_replace(string, pattern, replacement)
# The dots are escaped so that they match literal periods only
NameWOtitle <- str_replace(Names1, "Rev\\.|Dr\\.", replacement = "")
NameWOtitle

## [1] "Moe Szyslak"          "Burns, C. Montgomery" " Timothy Lovejoy"    
## [4] "Ned Flanders"         "Simpson, Homer"       " Julius Hibbert"

Next, for "Burns, C. Montgomery", the middle name needs to be removed. For the two people listed as "LastName, FirstName", the name order needs to be rearranged. Remove the middle names first.

# Pattern: \\s matches a whitespace character, [A-Z] a single capital letter,
# \\. a literal period, and the final " " the space after the initial.
# The whole match (e.g. " C. ") is replaced with a single space.
namesWOMidName <- str_replace(NameWOtitle, "\\s[A-Z]\\. ", " ")
namesWOMidName

## [1] "Moe Szyslak"       "Burns, Montgomery" " Timothy Lovejoy" 
## [4] "Ned Flanders"      "Simpson, Homer"    " Julius Hibbert"
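When matching an initial, note that a character range written as `[A-z]` covers more than letters: in ASCII order the symbols `[`, `\`, `]`, `^`, `_` and the backtick sit between `Z` and `a`. The following sketch checks a few hypothetical test characters (`chars` is an illustrative name) against both ranges.

```r
library(stringr)

# Hypothetical test characters: two letters and three symbols
chars <- c("A", "z", "_", "^", "[")

# [A-z] spans every code point from "A" to "z", symbols included
in_A_to_z <- str_detect(chars, "^[A-z]$")

# [A-Za-z] lists the two letter ranges explicitly
in_letters <- str_detect(chars, "^[A-Za-z]$")

in_A_to_z
in_letters
```

Using `[A-Z]` for an initial, or `[A-Za-z]` for any letter, avoids the accidental matches.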

Split the names separated by a comma ("Burns, Montgomery" and "Simpson, Homer") into their two parts, trim the whitespace, and reverse the order.

# Split on the comma, then trim whitespace from every piece; the result is a list
splitnameTemp1 <- lapply(str_split(namesWOMidName, ","), str_trim)
splitnameTemp1

## [[1]]
## [1] "Moe Szyslak"
## 
## [[2]]
## [1] "Burns"      "Montgomery"
## 
## [[3]]
## [1] "Timothy Lovejoy"
## 
## [[4]]
## [1] "Ned Flanders"
## 
## [[5]]
## [1] "Simpson" "Homer"  
## 
## [[6]]
## [1] "Julius Hibbert"

# Reverse each element so that "Last", "First" becomes "First", "Last"
splitnameTemp2 <- lapply(splitnameTemp1, rev)
splitnameTemp2

## [[1]]
## [1] "Moe Szyslak"
## 
## [[2]]
## [1] "Montgomery" "Burns"     
## 
## [[3]]
## [1] "Timothy Lovejoy"
## 
## [[4]]
## [1] "Ned Flanders"
## 
## [[5]]
## [1] "Homer"   "Simpson"
## 
## [[6]]
## [1] "Julius Hibbert"

Then paste the vectors together with a space in between. Then turn the list back into a vector

# Collapse the pieces of each name into one string;
# sapply() returns the results as a character vector
splitname7 <- sapply(splitnameTemp2, paste, collapse = " ")
splitname7

## [1] "Moe Szyslak"      "Montgomery Burns" "Timothy Lovejoy" 
## [4] "Ned Flanders"     "Homer Simpson"    "Julius Hibbert"
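The same rearrangement can also be sketched without splitting into a list, assuming each cleanup step can be written as one `str_replace()` call with a backreference doing the swap. The names `raw_names` and `clean` are illustrative and not part of the solution above.

```r
library(stringr)

# The vector from the introductory example
raw_names <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
               "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Drop a leading title together with the whitespace after it
clean <- str_replace(raw_names, "^(Rev|Dr)\\.\\s+", "")

# Drop an initial such as " C." when it is followed by whitespace
clean <- str_replace(clean, "\\s[A-Z]\\.(?=\\s)", "")

# Swap "Last, First" into "First Last" using the two capture groups
clean <- str_replace(clean, "^(\\w+),\\s*(.+)$", "\\2 \\1")

clean
```

Names without a comma do not match the last pattern and are left unchanged.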

Problem 4. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

[0-9]+\$ matches one or more digits ([0-9]+) followed by a literal dollar sign (\$). Because the $ is escaped, it is not the end-of-string anchor.

String4a <- c("$243464","9090$09090")
string4aE <- str_extract(String4a,"[0-9]+\\$")
string4aE

## [1] NA      "9090$"
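To make the role of the backslash concrete, the following sketch compares the escaped `\\$` (a literal dollar sign) with an unescaped `$` (the end-of-string anchor). The strings in `prices` are hypothetical examples.

```r
library(stringr)

# Hypothetical example strings
prices <- c("100$", "cost: 25$ each", "$100")

# Escaped: digits immediately followed by a literal "$"
literal_dollar <- str_extract(prices, "[0-9]+\\$")

# Unescaped: digits at the very end of the string
end_anchor <- str_extract(prices, "[0-9]+$")

literal_dollar
end_anchor
```

The two patterns select different strings: the first needs a dollar sign after the digits, the second needs the string to end with digits.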

\b[a-z]{1,4}\b matches a whole word made of one to four lowercase letters ([a-z]{1,4}). The \b is a word boundary (the position between a word character and a non-word character, or the start or end of the string), not a blank.

String4b <- c("ty"," z8B98$"," c "," 8SDF ","asdfg "," ssas ")
String4bE <- str_extract(String4b,"\\b[a-z]{1,4}\\b")
String4bE

## [1] "ty"   NA     "c"    NA     NA     "ssas"
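Since `\b` marks a position rather than a character, no blanks are needed around the word. A short sketch with hypothetical strings in `samples`:

```r
library(stringr)

# Hypothetical example strings
samples <- c("(tiny)", "hello", "abc123", "x")

# Parentheses and string edges count as boundaries; a following digit does not,
# because digits are word characters
short_words <- str_extract(samples, "\\b[a-z]{1,4}\\b")

short_words
```

"hello" fails because it has five letters, and "abc123" fails because there is no boundary between "c" and "1".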

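.*?\.txt$ matches any string that ends in the literal text .txt: .*? lazily matches any characters (possibly none), \. is a literal period, and $ anchors the match to the end of the string.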

String4c <- c("Mydoc.doc", "Mydoc.txt", ".txt", ".xls")
String4cE <- str_extract (String4c, ".*?\\.txt$")
String4cE

## [1] NA          "Mydoc.txt" ".txt"      NA
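For filtering file names rather than extracting them, the leading `.*?` is not needed, because detection only has to find the `.txt` ending. The sketch below uses `str_subset()` on hypothetical file names; note that the match is case-sensitive.

```r
library(stringr)

# Hypothetical file names
files <- c("notes.txt", "notes.txt.bak", "readme.TXT", ".txt")

# Keep only the names that end in a literal ".txt"
txt_files <- str_subset(files, "\\.txt$")

# Extraction with the full pattern returns the whole name or NA
extracted <- str_extract(files, ".*?\\.txt$")

txt_files
extracted
```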

\d{2}/\d{2}/\d{4} matches exactly two digits, a forward slash, two digits, a forward slash, and four digits, such as a date in DD/MM/YYYY format. It checks only the shape, not whether the digits form a valid date.

# "\\" is a literal backslash, used here as a non-matching separator
String4d <- c("11/10/2008", "12/2/2008 ", "12/12\\1999", " asdf ", "2/12/2010 ", "01/01/99")
String4dE <- str_extract (String4d, '\\d{2}/\\d{2}/\\d{4}')
String4dE

## [1] "11/10/2008" NA           NA           NA           NA          
## [6] NA
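Because the pattern describes only the layout of the digits, it can be found inside longer text and it accepts impossible dates. A brief sketch with hypothetical strings in `dates`:

```r
library(stringr)

# Hypothetical example strings
dates <- c("Due 11/10/2008.", "99/99/9999", "1/2/2008")

# Matches the 2-2-4 digit layout anywhere in the string
found <- str_extract(dates, "\\d{2}/\\d{2}/\\d{4}")

found
```

The second string is matched even though it is not a real date, and the third is rejected because its day and month have only one digit each.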

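<(.+?)>.+?</\1> matches an opening tag, at least one character of content, and the matching closing tag. The group (.+?) captures the tag name, and the backreference \1 requires the closing tag to repeat that name.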

String4e <- c("<tag> myname/>","<tag>something</tag>")
String4eE <- str_extract(String4e,"<(.+?)>.+?</\\1>")
String4eE

## [1] NA                     "<tag>something</tag>"
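The effect of the backreference shows up when the closing tag differs from the opening tag or when there is no content between them. A small sketch with hypothetical strings in `html`:

```r
library(stringr)

# Hypothetical example strings
html <- c("<b>bold</b>", "<b>bold</i>", "<i></i>")

# \\1 must repeat whatever the first group captured, and .+? needs
# at least one character between the tags
matched <- str_extract(html, "<(.+?)>.+?</\\1>")

matched
```

Only the first string matches: the second has mismatched tags and the third has an empty body.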