The following document is a solution to exercises from the text Automated Data Collection with R. Exercise 3, page 217: in this exercise we use string manipulation techniques to reformat names and to create logical vectors that carry information about the content of the name strings.
# Load the stringr package
library(stringr)
# Set Name vector
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5543642Dr. Julius Hibbert"
# Extract name from raw.data
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
# Display name
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
## This function handles the six names above; it is not guaranteed to work properly in all cases.
## Assume that name will contain at least first and last name
format_name <- function(raw_string){
  # reset name variables
  first_name <- ""
  last_name <- ""
  title_name <- ""
  middle_name <- ""
  name_str <- unlist(str_split(str_trim(raw_string), "[[:blank:]]+"))
  n <- length(name_str)
  # a comma on the first token marks "Last, First" order
  last_first <- str_detect(name_str[1], ",")
  for (i in seq_len(n)) {
    name_tst <- str_trim(name_str[i])
    name_alpha <- str_extract(name_tst, "[[:alpha:]]+")
    if (str_detect(name_tst, ",")) {
      # token ending in ',' is the last name
      last_name <- name_alpha
    } else if (str_detect(name_tst, "\\.")) {
      # token ending in '.' is a title (first position) or an initial
      if (i == 1) {
        title_name <- name_tst
      } else {
        middle_name <- name_tst
      }
    } else if (first_name == "" && (last_first || i < n)) {
      # first plain token is the first name
      first_name <- name_alpha
    } else if (i < n) {
      middle_name <- name_alpha
    } else if (last_name == "") {
      # final plain token is the last name unless a comma already set it
      last_name <- name_alpha
    }
  } # end of for loop
  out_string <- str_c(first_name, last_name, sep = " ")
  return(out_string)
}
format_name(name[1])
format_name(name[2])
format_name(name[3])
format_name(name[4])
format_name(name[5])
format_name(name[6])
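As a cross-check on the loop-based function, the same reformatting can be sketched with two vectorized `str_replace` calls. This is only a sketch under stated assumptions: a comma always marks "Last, First" order, and a title is two or three letters followed by a period at the start of the name. The names `swapped` and `first_last` are introduced here for illustration. Unlike a strict first-name/last-name output, this version keeps the initial in "C. Montgomery Burns".

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# swap "Last, First" into "First Last"; names without a comma are unchanged
swapped <- str_replace(name, "^(.+), (.+)$", "\\2 \\1")

# drop a leading title
# (assumption: a title is 2-3 letters followed by a period and a space)
first_last <- str_replace(swapped, "^[[:alpha:]]{2,3}\\. ", "")

first_last
```

Because both calls are vectorized, no explicit loop over the names is needed.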
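# logical vector: does the name carry a title (Rev. or Dr.)?
# the period is escaped so that it matches a literal '.' rather than any character
v_rev <- str_detect(name, "Rev\\.")
v_dr <- str_detect(name, "Dr\\.")
v_title <- v_rev | v_dr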
# display logical vector
v_title
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
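# logical vector: does the name contain a second name, given as an upper-case initial followed by a period?
v_middle <- str_detect(name, "[[:upper:]]\\.")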
# Display logical vector
v_middle
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
Exercise # 7 page 218
Consider the string <title>+++BREAKING NEWS+++</title>. We would like to extract the first HTML tag, so we write the expression <.+>. Explain why it is not working and fix it. First let us see what we get:
test_str <- "<title>+++BREAKING NEWS+++</title>"
str_extract(test_str, "<.+>")
## [1] "<title>+++BREAKING NEWS+++</title>"
When we execute this command, we get the whole string. This is because quantifiers are greedy by default: .+ matches the longest possible sequence of characters, so the match runs from the first < to the last >. To change this behavior, we add ? after the quantifier, which makes it lazy so that it matches the shortest possible sequence:
test_str <- "<title>+++BREAKING NEWS+++</title>"
str_extract(test_str, "<.+?>")
## [1] "<title>"
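Two related variations may help here. A negated character class avoids the greedy problem without a lazy quantifier, and `str_extract_all` with the lazy pattern returns every tag rather than only the first. The names `first_tag` and `all_tags` are introduced here for illustration.

```r
library(stringr)

test_str <- "<title>+++BREAKING NEWS+++</title>"

# negated class: '<', then one or more characters that are not '>', then '>'
first_tag <- str_extract(test_str, "<[^>]+>")

# the lazy quantifier combined with str_extract_all returns every tag
all_tags <- unlist(str_extract_all(test_str, "<.+?>"))

first_tag
all_tags
```

The negated class states the intent directly: a tag cannot contain `>`, so the match has to stop at the first one.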
Exercise # 8 page 218
Consider the string (5-3)^2=5^2-2*5*3+3^2. We would like to extract the formula from the string by writing the following regular expression: “[^0-9=+*()]+”. This does not lead to the desired result. Explain why and fix it. Again, we will try this expression and consider the results:
test_str2 <- "(5-3)^2=5^2-2*5*3+3^2"
str_extract(test_str2, "[^0-9=+*()]+")
## [1] "-"
The result is ‘-’. This is due to how metacharacters behave inside a character class. Most metacharacters are interpreted literally when they appear within brackets, but there are two exceptions: ^ at the beginning of the class negates it, and - between two characters (as in 0-9) defines a range. The original expression therefore matches the first run of characters that are not digits, =, +, *, ( or ), which is the ‘-’ in (5-3). To fix it, we move the ^ away from the beginning of the class so that it is read literally, and add a - at the end of the class to match the minus sign. We still want 0-9 to be interpreted as a range of digits.
str_extract(test_str2, "[0-9=+*^()-]+")
## [1] "(5-3)^2=5^2-2*5*3+3^2"
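To see what the negated class actually matches, it can help to list every match instead of only the first. The sketch below also shows an alternative fix that escapes ^ and - inside the class instead of relying on their position. The names `negated_matches` and `escaped_match` are introduced here for illustration.

```r
library(stringr)

test_str2 <- "(5-3)^2=5^2-2*5*3+3^2"

# every run of characters NOT in the set: only '-' and '^' remain
negated_matches <- unlist(str_extract_all(test_str2, "[^0-9=+*()]+"))

# alternative fix: escape '^' and '-' so their position in the class does not matter
escaped_match <- str_extract(test_str2, "[0-9=+*\\^()\\-]+")

negated_matches
escaped_match
```

Escaping is more verbose, but it keeps working if the characters inside the brackets are later reordered.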
Exercise # 9 page 218
I have not cracked the code. I am still trying to figure it out…