Please create an R Markdown file that provides a solution for #4, #5 and #6 in Automated Data Collection in R, chapter 8. Publish the R Markdown file to rpubs.com, and include links to your R Markdown file (in GitHub) and your rpubs.com URL in your assignment solution.
To test the first expression, we are going to use test4a
library(stringr)
test4a<-"00$, aa$, 990$, 777, $, absd$, 88888$, 000000$, 765$, 3$009$"
results4a <- unlist(str_extract_all(test4a, "[0-9]+\\$"))
results4a
## [1] "00$" "990$" "88888$" "000000$" "765$" "3$" "009$"
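We are expecting 00$, 990$, 88888$, 000000$, 765$ as return. The last token, 3$009$, additionally yields 3$ and 009$.
The regular expression (b), \b[a-z]{1,4}\b, matches at least one and at most four lowercase letters. The word boundaries (\b) on both sides mean those letters must form a whole word.
To test this expression, we are going to use test4b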
test4b<-"s, b, c, ab, abc, abcd, abbbbd, abbcdf, AA, BBB, ZZZZZ"
results4b <- unlist(str_extract_all(test4b, "\\b[a-z]{1,4}\\b"))
results4b
## [1] "s" "b" "c" "ab" "abc" "abcd"
We are expecting: s, b, c, ab, abc, abcd
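The regular expression (c), .*?\.txt$, matches any sequence of characters (. is any character, * repeats it zero or more times) followed by the literal .txt at the end of the string ($). The ? after * does not make anything optional; it makes the quantifier lazy, so it matches as few characters as possible. Because $ anchors the match at the end of the string, the whole test string comes back as a single match.
To test this expression, we are going to use test4c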
test4c<-"0.txt, w.txt, bcd.txt, 12ct.txt, BBB.txt, ZZZZZ.txt, ,.txt, [[[.txt"
results4c <- unlist(str_extract_all(test4c, ".*?\\.txt$"))
results4c
## [1] "0.txt, w.txt, bcd.txt, 12ct.txt, BBB.txt, ZZZZZ.txt, ,.txt, [[[.txt"
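The effect of the $ anchor can be seen by splitting the test string into individual file names first. This is only a minimal sketch and not part of the original exercise; files4c and matches4c are names introduced here for illustration.

```r
library(stringr)

test4c <- "0.txt, w.txt, bcd.txt, 12ct.txt, BBB.txt, ZZZZZ.txt, ,.txt, [[[.txt"

# Split on the literal ", " separator so every file name is its own string.
files4c <- unlist(str_split(test4c, fixed(", ")))

# Apply the same pattern element by element; $ now anchors at the end
# of each file name instead of the end of the one long string.
matches4c <- str_extract(files4c, ".*?\\.txt$")
matches4c
```

Each of the eight file names is now matched on its own, which is probably closer to what the expression is meant for.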
The regular expression (d), \d{2}/\d{2}/\d{4}, matches two digits, a slash, two more digits, another slash and then four digits, like a date written as dd/mm/yyyy.
To test this expression, we are going to use test4d
test4d<-"34/89/9889, 01/01/0000, 657/980/90000, 9/9/8909, 4/5/66666, 78/09909987"
results4d <- unlist(str_extract_all(test4d, "\\d{2}/\\d{2}/\\d{4}"))
results4d
## [1] "34/89/9889" "01/01/0000"
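Expression (d) carries no boundaries, so it can also match inside a longer run of digits. The sketch below assumes we only want stand-alone dates; the input longer4d and the variant wrapped in \b are our own additions, not part of the exercise.

```r
library(stringr)

# Hypothetical input: a date-like pattern embedded in longer digit runs.
longer4d <- "134/89/98891"

# Original pattern: finds a match inside the longer digit runs.
unbounded4d <- str_extract(longer4d, "\\d{2}/\\d{2}/\\d{4}")

# Variant with word boundaries: the match must stand alone.
bounded4d <- str_extract(longer4d, "\\b\\d{2}/\\d{2}/\\d{4}\\b")

unbounded4d  # "34/89/9889"
bounded4d    # NA
```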
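The regular expression (e), <(.+?)>.+?</\1>, matches an opening tag, at least one character of content and a closing tag with the same name: the backreference \1 repeats whatever the group (.+?) captured. Other tags can be nested inside the outer tag. It looks like a parser for simple HTML.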
test4e <- "<footer><p>Posted by: John Smith</p> <p>Contact information</p></footer> other things"
unlist(str_extract_all(test4e, "<(.+?)>.+?</\\1>"))
## [1] "<footer><p>Posted by: John Smith</p> <p>Contact information</p></footer>"
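The role of the backreference \1 becomes clearer with a pair of tags that do not agree. The following is a small sketch on made-up strings; pattern4e, matched4e, mismatched4e and tag4e are illustrative names, not part of the exercise.

```r
library(stringr)

pattern4e <- "<(.+?)>.+?</\\1>"

# Opening and closing tags agree, so the backreference is satisfied.
matched4e <- str_detect("<b>bold</b>", pattern4e)

# The closing tag differs from the opening one, so there is no match.
mismatched4e <- str_detect("<b>bold</i>", pattern4e)

# str_match() returns the full match in column 1 and the captured
# tag name (the group referenced by \\1) in column 2.
tag4e <- str_match("<footer><p>Posted by: John Smith</p></footer>", pattern4e)[, 2]

matched4e     # TRUE
mismatched4e  # FALSE
tag4e         # "footer"
```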
I can think of two ways to proceed here, but I could not eliminate the dollar sign.
[0-9]+\$ = [[:digit:]][[:digit:]]*[$] = \d\d*[$]
To test these expressions, we are going to use test4a
We are expecting 00$, 990$, 88888$, 000000$, 765$ as return, plus 3$ and 009$ from the last token.
test4a<-"00$, aa$, 990$, 777, $, absd$, 88888$, 000000$, 765$, 3$009$"
results4a <- unlist(str_extract_all(test4a, "[0-9]+\\$"))
results51<- unlist(str_extract_all(test4a, "[[:digit:]][[:digit:]]*[$]")) #first way
results52<- unlist(str_extract_all(test4a, "\\d\\d*[$]")) #second way
results4a
## [1] "00$" "990$" "88888$" "000000$" "765$" "3$" "009$"
results51
## [1] "00$" "990$" "88888$" "000000$" "765$" "3$" "009$"
results52
## [1] "00$" "990$" "88888$" "000000$" "765$" "3$" "009$"
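Instead of comparing the printed output by eye, the equivalence of the rewritten expressions can be checked with identical(). This is a minimal sketch; the third variant using {1,} is our own suggestion, and like the other two it still keeps the dollar sign.

```r
library(stringr)

test4a <- "00$, aa$, 990$, 777, $, absd$, 88888$, 000000$, 765$, 3$009$"

original5 <- unlist(str_extract_all(test4a, "[0-9]+\\$"))
first5    <- unlist(str_extract_all(test4a, "[[:digit:]][[:digit:]]*[$]"))
second5   <- unlist(str_extract_all(test4a, "\\d\\d*[$]"))
# Hypothetical third variant: {1,} is another way of writing +.
third5    <- unlist(str_extract_all(test4a, "[[:digit:]]{1,}[$]"))

# TRUE only if every rewrite returns exactly the same matches.
same5 <- identical(original5, first5) &&
  identical(original5, second5) &&
  identical(original5, third5)
same5
```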
test6a <- "chunkylover53[at]aol[dot]com"
str_replace_all(test6a, c("\\[at\\]" = "@", "\\[dot\\]" = "\\."))
## [1] "chunkylover53@aol.com"
[:digit:] matches a single digit, that is 0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9. With str_extract it returns only one digit, and we have two in the email address. In order for it to work, we need a repetition, which we get by adding the sign + after [:digit:], as in [:digit:]+.
test6b <- "chunkylover53[at]aol[dot]com"
str_extract(test6b, "[:digit:]") #test with [:digit:] and str_extract
## [1] "5"
str_extract_all(test6b, "[:digit:]") #test with [:digit:] and str_extract_all
## [[1]]
## [1] "5" "3"
str_extract(test6b, "[:digit:]+") #test with [:digit:]+ and str_extract
## [1] "53"
str_extract_all(test6b, "[:digit:]+") #test with [:digit:]+ and str_extract_all
## [[1]]
## [1] "53"
\D matches any character that is not a digit. Searching with \D returns the non-digit characters one by one, and searching with \D+ returns two strings of characters: the text before 53 and the text after it. But we would like to extract the digits, and \d+ will do it.
test6c <- "chunkylover53[at]aol[dot]com"
str_extract_all(test6c, "\\d+")
## [[1]]
## [1] "53"
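For comparison, the behaviour of \D and \D+ can be checked on the same string. This is only a short sketch; single6c and runs6c are names we introduce here.

```r
library(stringr)

test6c <- "chunkylover53[at]aol[dot]com"

# \\D matches one non-digit character at a time.
single6c <- str_extract_all(test6c, "\\D")[[1]]

# \\D+ matches runs of non-digits: the text before and after "53".
runs6c <- str_extract_all(test6c, "\\D+")[[1]]

length(single6c)  # 26 single characters
runs6c            # "chunkylover" "[at]aol[dot]com"
```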