In this assignment I provide solutions to selected questions from Chapter 8 of Automated Data Collection with R.
knitr::opts_chunk$set(echo = TRUE)
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Q.3. Copy the introductory example. The vector name stores the extracted names.
R> name
[1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
(a) Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
Solution
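# Remove titles and initials (one to three letters followed by a period) together with trailing spaces
nData <- str_replace(name, "[[:alpha:]]{1,3}[.]\\s*", "")
# Trim any remaining leading or trailing whitespace so the split below yields clean tokens
nData <- str_trim(nData)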
modData <- str_split_fixed(nData, " ", 2)
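# Build one data-frame row per name; a comma in the first token signals "last, first" order
listName <- list()
for (i in seq_along(name)) {
if (!grepl(",", modData[i, 1], fixed = TRUE)) {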
firstName <- modData[i, 1]
lastName <- modData[i,2]
}
else {
firstName <- modData[i, 2]
lastName <- modData[i,1]
}
lastName <- str_replace_all(lastName, ",","")
fd <- data.frame(firstName, lastName)
listName[[i]] <- fd
}
Name <- bind_rows(listName)
colnames(Name) <- c("First Name", "Last Name")
knitr::kable(Name)
First Name | Last Name |
---|---|
Moe | Szyslak |
Montgomery | Burns |
Timothy | Lovejoy |
Ned | Flanders |
Homer | Simpson |
Julius | Hibbert |
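The same rearrangement can also be written without an explicit loop. The following is only a minimal sketch, not the book's reference solution: the names `clean` and `tidy_names` are introduced here for illustration, and it assumes that a comma always signals the "last_name, first_name" order.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Drop titles and initials (1-3 letters followed by a period) plus trailing spaces
clean <- str_trim(str_replace_all(name, "[[:alpha:]]{1,3}\\.\\s*", ""))

# Swap "last, first" into "first last"; strings without a comma stay unchanged
tidy_names <- str_replace(clean, "^(\\w+),\\s*(\\w+)$", "\\2 \\1")
tidy_names
```

Because `str_replace()` leaves non-matching strings untouched, the names that are already in first_name last_name order pass through as they are.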
(b) Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
Solution
tName <- str_detect(name, "[[:alpha:]]{2,3}[.]")
# Print the names that carry a title; fixed = TRUE makes the "." a literal period
for (i in seq_along(name)) {
if (grepl("Rev.", name[i], fixed = TRUE) || grepl("Dr.", name[i], fixed = TRUE)) {
print(name[i])
}
}
## [1] "Rev. Timothy Lovejoy"
## [1] "Dr. Julius Hibbert"
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
tName
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
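The quantifier {2,3} is what keeps the initial "C." from being counted as a title, but it would also flag any other two- or three-letter abbreviation. A stricter sketch, assuming the only titles of interest are the two named in the question (the name `has_title` is introduced here for illustration):

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Match only "Rev" or "Dr" at a word boundary, followed by a literal period
has_title <- str_detect(name, "\\b(Rev|Dr)\\.")
has_title
# TRUE only for the third and sixth elements
```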
(c) Construct a logical vector indicating whether a character has a second name.
sName <- str_replace(name, "[[:alpha:]]{2,3}[.]", "")
get2Name <- function(sName){
lastName = list()
for (i in seq_along(sName)) {
if(length(grep('[.]',sName[i]))>0) {
SecondName <- TRUE
}
else {
SecondName <- FALSE
}
ld <- data.frame(SecondName)
lastName[[i]] <- ld
}
Name2 <- bind_rows(lastName)
return(Name2)
}
has2Name <-get2Name(sName)
sName
## [1] "Moe Szyslak" "Burns, C. Montgomery" " Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" " Julius Hibbert"
has2Name
## SecondName
## 1 FALSE
## 2 TRUE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
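Since the exercise asks for a logical vector rather than a data frame, the same check can be done in one vectorised step. This sketch assumes, as the solution above does, that a second name only ever appears as a single-letter initial followed by a period; `has_second` is an illustrative name.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# A lone capital letter followed by a period (e.g. "C.") marks a second name.
# "Rev." and "Dr." do not match because the capital is not directly followed by the period.
has_second <- str_detect(name, "\\b[A-Z]\\.")
has_second
# TRUE only for the second element
```

With this pattern there is no need to strip the titles first.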
Q.4 Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
(a) [0-9]+\$
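Solution The pattern matches one or more digits immediately followed by a literal dollar sign. The backslash escapes the $, so it is matched as a character rather than treated as the end-of-string anchor. In the example below, only the second string has a digit directly before the $.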
exData1 <- c("131dfsf$dv1", '3$12313', 'SomeRandomText')
str_detect(exData1, "[0-9]+\\$")
## [1] FALSE TRUE FALSE
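To see why the backslash matters, the sketch below compares the escaped pattern with an unescaped one, in which $ acts as the end-of-string anchor. The vector `prices` and the result names are made up for illustration.

```r
library(stringr)

prices <- c("costs 25$ today", "total 25", "no digits$")

# Escaped: digits must be followed by a literal dollar sign
escaped <- str_detect(prices, "[0-9]+\\$")
escaped
# TRUE only for "costs 25$ today"

# Unescaped: "$" anchors the match, so the digits must end the string
anchored <- str_detect(prices, "[0-9]+$")
anchored
# TRUE only for "total 25"
```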
(b) \b[a-z]{1,4}\b
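Solution The pattern matches whole words of one to four consecutive lowercase letters. The \b on each side is a word boundary, so the letters cannot be part of a longer word or sit next to digits, uppercase letters, or underscores.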
exData2 <-c("TRuE", "fals!", "true", "wILLNotw0rK")
unlist(str_extract_all(exData2, "\\b[a-z]{1,4}\\b"))
## [1] "fals" "true"
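The effect of the word boundaries is easier to see on a full sentence. In this small sketch (the sentence and the name `short_words` are chosen only for illustration), the five-letter words are skipped entirely rather than truncated to four letters:

```r
library(stringr)

sentence <- "the quick brown fox jumps over a lazy dog"

# Only complete words of one to four lowercase letters are returned
short_words <- str_extract_all(sentence, "\\b[a-z]{1,4}\\b")[[1]]
short_words
# "the" "fox" "over" "a" "lazy" "dog"
```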
(c) .*?\.txt$
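Solution The pattern matches any string that ends in “.txt”. The lazy .*? matches whatever precedes the extension (possibly nothing), \. is a literal period, and $ anchors the match at the end of the string.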
exData3 <- c("131dfsf$dv1", '3$12313', 'SomeRandomText.txt')
str_detect(exData3, ".*?\\.txt$")
## [1] FALSE FALSE TRUE
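A few edge cases illustrate the individual parts of this pattern. The file names below are hypothetical and `ends_txt` is an illustrative name:

```r
library(stringr)

files <- c("report.txt", "notes.txt.bak", "readme.TXT", ".txt")

ends_txt <- str_detect(files, ".*?\\.txt$")
ends_txt
# "report.txt"    -> TRUE
# "notes.txt.bak" -> FALSE: ".txt" is not at the end of the string
# "readme.TXT"    -> FALSE: the match is case-sensitive
# ".txt"          -> TRUE:  ".*?" may match zero characters
```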
(d) \d{2}/\d{2}/\d{4}
Solution The pattern matches date-like substrings: two digits, a forward slash, two more digits, another forward slash, and four digits. It does not validate dates, because it is not anchored and does not check value ranges, so it also matches inside longer strings. In the example below, "121/12/2222" is detected because it contains "21/12/2222".
exData4 <- c("mm/dd/yyyy", "09/15/2016", "121/12/2222")
str_detect(exData4, "\\d{2}/\\d{2}/\\d{4}")
## [1] FALSE TRUE TRUE
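If the whole string is supposed to have this shape, the pattern can be anchored with ^ and $. This is a sketch of that variation, not part of the exercise; the extra element "99/99/9999" is added to show that the value ranges are still unchecked.

```r
library(stringr)

dates <- c("mm/dd/yyyy", "09/15/2016", "121/12/2222", "99/99/9999")

# Anchors force the entire string to be exactly dd/dd/dddd
strict <- str_detect(dates, "^\\d{2}/\\d{2}/\\d{4}$")
strict
# "121/12/2222" is now rejected, but the impossible "99/99/9999" still passes
```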
(e) <(.+?)>.+?</\1>
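Solution The pattern matches an opening tag, at least one character of content, and a closing tag whose name is identical to the opening one. The backreference \1 repeats whatever the first group captured, which enforces that both tag names agree. It is useful for finding matching open/close tag pairs, but it does not validate HTML in general.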
exData5 <- c("<h3> This will work</h3>", "<h3>This will not<h3>", "SomeRandomText.html")
str_detect(exData5, "<(.+?)>.+?</\\1>")
## [1] TRUE FALSE FALSE
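To inspect what the backreference actually captured, `str_match()` returns the full match alongside the captured group. The input strings here are invented for illustration:

```r
library(stringr)

html <- c("<h3> This will work</h3>", "<b>bold</i>", "<p>one</p><p>two</p>")

# Column 1 holds the full match, column 2 the tag name captured by (.+?)
m <- str_match(html, "<(.+?)>.+?</\\1>")
m[, 2]
# "h3" NA "p": the mismatched <b>...</i> pair yields NA

m[3, 1]
# "<p>one</p>": only the first matching pair is returned
```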
Q.9 The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.
clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr
Solution Below is the solution, although I must admit that the following post on Stack Overflow helped me: http://stackoverflow.com/questions/35542346/r-using-regmatches-to-extract-certain-characters
cipher <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
decipher <- unlist(str_extract_all(cipher, "[[:upper:].]{1,}"))
decipher <- str_replace_all(paste(decipher, collapse = ''), "[.]", " ")
decipher
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD"
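An equivalent way to read the message is to delete everything that is not revealing, instead of extracting what is. This sketch is a variation on the solution above rather than the book's answer; it also keeps the exclamation mark that the extraction pattern drops, and the names `kept` and `secret` are illustrative.

```r
library(stringr)

cipher <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# Remove every character that is not an uppercase letter, a period, or "!"
kept <- str_remove_all(cipher, "[^[:upper:].!]")

# Periods act as word separators in the hidden message
secret <- str_replace_all(kept, "[.]", " ")
secret
# "CONGRATULATIONS YOU ARE A SUPERNERD!"
```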