---
title: "607 Assignment3- String"
author: "Hui (Gracie) Han"
date: "Sep 16, 2018"
output: html_document
---

Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to problems 3 and 4 from chapter 8 of Automated Data Collection in R. Problem 9 is extra credit. You may work in a small group, but please submit separately with names of all group participants in your submission.

Problem 3. Copy the introductory example. The vector name stores the extracted names.

[1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"

(a) Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name. (I will work on 3a after completing 3b and 3c.)
(b) Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.). (I will do b first and then a, which is the more logical order.)
(c) Construct a logical vector indicating whether a character has a second name.

First, load the libraries and the original data.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.3.3
## -- Attaching packages ---------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'tibble' was built under R version 3.3.3
## Warning: package 'tidyr' was built under R version 3.3.3
## Warning: package 'readr' was built under R version 3.3.3
## Warning: package 'purrr' was built under R version 3.3.3
## Warning: package 'dplyr' was built under R version 3.3.3
## Warning: package 'forcats' was built under R version 3.3.3
## -- Conflicts ------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stringr)
library(stringi)
## Warning: package 'stringi' was built under R version 3.3.3
getwd()
## [1] "D:/607 Sabrina Khan Andy C HomePC"
NamesInput <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
NamesInput
## [1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
Do question 3b first.
3b) Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
# Get all the names as they are first
Names1<-unlist(str_extract_all(NamesInput, "[[:alpha:]., ]{2,}"))
Names1
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
# Escape the dots so they match a literal period rather than any character
NamesWtitlesLogic <- str_detect(Names1, "Rev\\.|Dr\\.")
NamesWtitlesLogic
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
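A note on the dots in title patterns: in a regular expression an unescaped `.` matches any character, so `Dr.` also matches the start of `Drew`. The sketch below uses made-up names (`test_names` is hypothetical, not part of the assignment data) to compare the unescaped and escaped versions.

```r
library(stringr)

# Hypothetical names chosen to expose the difference; not from the assignment data
test_names <- c("Dr. Julius Hibbert", "Drew Barrymore",
                "Rev. Timothy Lovejoy", "Paul Revere")

# Unescaped dot: "Dr." matches "Dre" and "Rev." matches "Reve"
unescaped <- str_detect(test_names, "Rev.|Dr.")

# Escaped dot: only a literal period after the title counts
escaped <- str_detect(test_names, "Rev\\.|Dr\\.")

unescaped
escaped
```

On the six names in the assignment both versions give the same answer, so the difference only shows up with other inputs.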
3c) Construct a logical vector indicating whether a character has a second name.
First approach: in this data a second name appears as an initial, that is, a single capital letter followed by a period, so detect that pattern.
Names1
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
NamesSecond <- str_detect(Names1, "[A-Z]\\.")
NamesSecond
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
3c, second approach) Construct a logical vector indicating whether a character has a second name. To solve this, count the words in each name after removing the titles. A regular name has two words; a name with a second name has three, and that is the answer.
# remove the titles first (dots escaped so they match a literal period)
NameWOtitle <- str_replace(Names1, "Rev\\.|Dr\\.", replacement = "")
NameWOtitle
## [1] "Moe Szyslak" "Burns, C. Montgomery" " Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" " Julius Hibbert"
# count the words in each name
NamesCount <- str_count(NameWOtitle, "\\w+")
NamesCount
## [1] 2 3 2 2 2 2
As the last run shows, only the 2nd element has 3 words (so it contains a second name). Construct a logical vector for that.
# Compare the counts numerically instead of pattern-matching on a number
NamesCountGT2 <- NamesCount > 2
NamesCountGT2
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
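The word-count idea above can be wrapped into one small function. This is only a sketch: `has_second_name` is a hypothetical helper name, it assumes the only titles are Rev. and Dr., and "Homer Jay Simpson" is a made-up input added to show that a full middle name is caught as well as an initial.

```r
library(stringr)

# Hypothetical helper: TRUE when a name has more than two words once a title is removed
has_second_name <- function(x) {
  # Assumes titles are limited to "Rev." and "Dr." at the start of the name
  no_title <- str_trim(str_replace(x, "^(Rev|Dr)\\.", ""))
  str_count(no_title, "\\w+") > 2
}

second_name_check <- has_second_name(
  c("Burns, C. Montgomery", "Dr. Julius Hibbert", "Homer Jay Simpson")
)
second_name_check
```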
Next, do 3a.
(a) Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
# remove the titles by replacing them with an empty string
# str_replace(string, pattern, replacement)
NameWOtitle <- str_replace(Names1, "Rev\\.|Dr\\.", replacement = "")
NameWOtitle
## [1] "Moe Szyslak" "Burns, C. Montgomery" " Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" " Julius Hibbert"
Next, remove the middle initial from "Burns, C. Montgomery". The two names written as "Last, First" (Burns and Simpson) then need their order reversed. Remove the middle initial first:
namesWOMidName <- str_replace(NameWOtitle, "\\s[A-Z]\\. ", " ")
# pattern: \\s matches a whitespace character, [A-Z] one capital letter, \\. a literal period, followed by a space
# the whole match is replaced with a single space
namesWOMidName
## [1] "Moe Szyslak" "Burns, Montgomery" " Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" " Julius Hibbert"
Split the names that contain a comma ("Burns, Montgomery" and "Simpson, Homer") into two parts, trim the whitespace, and reverse the order.
# lapply always returns a list, which is what the uneven split lengths need
splitnameTemp1 <- lapply(str_split(namesWOMidName, ","), str_trim)
splitnameTemp1
## [[1]]
## [1] "Moe Szyslak"
##
## [[2]]
## [1] "Burns" "Montgomery"
##
## [[3]]
## [1] "Timothy Lovejoy"
##
## [[4]]
## [1] "Ned Flanders"
##
## [[5]]
## [1] "Simpson" "Homer"
##
## [[6]]
## [1] "Julius Hibbert"
splitnameTemp2 <- lapply(splitnameTemp1, rev)
splitnameTemp2
## [[1]]
## [1] "Moe Szyslak"
##
## [[2]]
## [1] "Montgomery" "Burns"
##
## [[3]]
## [1] "Timothy Lovejoy"
##
## [[4]]
## [1] "Ned Flanders"
##
## [[5]]
## [1] "Homer" "Simpson"
##
## [[6]]
## [1] "Julius Hibbert"
Then paste the parts of each name together with a space in between, and turn the list back into a vector.
for (i in seq_along(splitnameTemp2)) {
  splitnameTemp2[[i]] <- paste(splitnameTemp2[[i]], collapse = " ")
}
splitname7 <- unlist(splitnameTemp2)
splitname7
## [1] "Moe Szyslak" "Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Julius Hibbert"
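For comparison, the same rearrangement can probably be done without splitting and looping by using capture groups and backreferences in `str_replace`. This is a sketch of an alternative, not the approach used above; the names are repeated in `names_raw` (an assumed variable name) so the block runs on its own, and it assumes surnames and first names are single words.

```r
library(stringr)

# Same six names as in Names1, repeated so the sketch is self-contained
names_raw <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
               "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# 1. Drop a leading title together with the whitespace that follows it
no_title <- str_replace(names_raw, "^(Rev|Dr)\\.\\s*", "")

# 2. Drop a middle initial such as " C."
no_initial <- str_replace(no_title, "\\s[A-Z]\\.", "")

# 3. Turn "Last, First" into "First Last" by swapping the two capture groups
tidy_names <- str_replace(no_initial, "^(\\w+),\\s*(\\w+)$", "\\2 \\1")
tidy_names
```

Names without a comma do not match the pattern in step 3, so they pass through unchanged.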
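Problem 4. Each regular expression from the problem is tested below against a few sample strings.
# (a) one or more digits followed by a literal dollar sign
String4a <- c("$243464", "9090$09090")
string4aE <- str_extract(String4a, "[0-9]+\\$")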
string4aE
## [1] NA "9090$"
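The backslashes in front of `$` matter here: an unescaped `$` is the end-of-string anchor, not a dollar sign. A minimal sketch of the difference on one of the sample strings (`x`, `anchored`, and `literal` are assumed variable names):

```r
library(stringr)

x <- "9090$09090"

# Unescaped "$" anchors the match to the end of the string: digits at the end
anchored <- str_extract(x, "[0-9]+$")

# Escaped "\\$" matches a literal dollar sign: digits followed by "$"
literal <- str_extract(x, "[0-9]+\\$")

anchored
literal
```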
# (b) a standalone word made of one to four lowercase letters
String4b <- c("ty", " z8B98$", " c ", " 8SDF ", "asdfg ", " ssas ")
String4bE <- str_extract(String4b, "\\b[a-z]{1,4}\\b")
String4bE
## [1] "ty" NA "c" NA NA "ssas"
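To see the word boundaries at work on ordinary text, here is a small sketch with a made-up sentence (not from the assignment). Words longer than four letters are skipped entirely because there is no boundary in the middle of a word.

```r
library(stringr)

# Hypothetical sentence: "quick" and "brown" have five letters each
sentence <- "the quick brown fox"

# str_extract_all returns a list with one element per input string
short_words <- str_extract_all(sentence, "\\b[a-z]{1,4}\\b")[[1]]
short_words
```

The same reasoning explains the `NA` for " z8B98$" above: digits count as word characters, so there is no boundary between `z` and `8`.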
# (c) any string that ends in ".txt", including a bare ".txt"
String4c <- c("Mydoc.doc", "Mydoc.txt", ".txt", ".xls")
String4cE <- str_extract(String4c, ".*?\\.txt$")
String4cE
## [1] NA "Mydoc.txt" ".txt" NA
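The `$` at the end of this pattern is what restricts it to strings that end in `.txt`. A short sketch with made-up file names (`files` is a hypothetical vector) shows what changes when the anchor is dropped:

```r
library(stringr)

# Hypothetical file names, not from the assignment
files <- c("notes.txt", "notes.txt.bak")

# With the anchor, ".txt" must be the very end of the string
with_anchor <- str_extract(files, ".*?\\.txt$")

# Without it, ".txt" may sit anywhere, so the backup file matches too
without_anchor <- str_extract(files, ".*?\\.txt")

with_anchor
without_anchor
```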
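# (d) two digits, a slash, two digits, a slash, four digits (shape only, no date validation)
# "\\1999" keeps a literal backslash in the third string; a bare "\1" is read as an octal escape
String4d <- c("11/10/2008", "12/2/2008 ", "12/12\\1999", " asdf ", "2/12/2010 ", "01/01/99")
String4dE <- str_extract(String4d, "\\d{2}/\\d{2}/\\d{4}")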
String4dE
## [1] "11/10/2008" NA NA NA NA
## [6] NA
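Two things are worth noting about this date pattern: it requires exactly two digits for day and month, and it checks only the shape of the string, not whether the date exists. The sketch below uses sample strings that are partly made up ("99/99/9999" is hypothetical) and compares it with a looser variant, `\\d{1,2}`, which is an assumption of mine and not part of the problem.

```r
library(stringr)

dates <- c("11/10/2008", "12/2/2008", "2/12/2010", "01/01/99", "99/99/9999")

# Pattern from the problem: exactly 2 digits / 2 digits / 4 digits
strict <- str_extract(dates, "\\d{2}/\\d{2}/\\d{4}")

# Assumed variant: allow one or two digits for the first two parts
flexible <- str_extract(dates, "\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b")

strict
flexible
```

Both versions accept "99/99/9999", so real date validation would need something beyond a regular expression of this kind.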
# (e) an opening tag, at least one character of content, and a closing tag with the same name
String4e <- c("<tag> myname/>", "<tag>something</tag>")
String4eE <- str_extract(String4e, "<(.+?)>.+?</\\1>")
String4eE
## [1] NA "<tag>something</tag>"
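The `\\1` in this pattern is a backreference: it must repeat whatever the first group `(.+?)` captured, so the closing tag has to carry the same name as the opening tag. A minimal sketch with made-up tags (`tags` is a hypothetical vector):

```r
library(stringr)

# Hypothetical inputs: the second one closes with a different tag name
tags <- c("<b>bold</b>", "<b>bold</i>", "<tag>something</tag>")

matched_tags <- str_extract(tags, "<(.+?)>.+?</\\1>")
matched_tags
```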