Week 4 Assignment

Please create an R Markdown file that provides a solution for #4, #5 and #6 in Automated Data Collection in R, chapter 8. Publish the R Markdown file to rpubs.com, and include links to your R Markdown file (in GitHub) and your rpubs.com URL in your assignment solution.

  4. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
  (a) [0-9]+\$

The regular expression (a) matches one or more digits from the set 0,1,2,3,4,5,6,7,8,9, followed by a literal dollar sign. The same digit can appear more than once, and there must be at least one digit before the dollar sign.

To test the first expression, we are going to use test4a

library(stringr)
test4a<-"00$, aa$, 990$, 777, $, absd$, 88888$, 000000$, 765$, 3$009$"
results4a <- unlist(str_extract_all(test4a, "[0-9]+\\$"))
results4a
## [1] "00$"     "990$"    "88888$"  "000000$" "765$"    "3$"      "009$"

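We are expecting 00$, 990$, 88888$, 000000$ and 765$ as return. The expression also returns 3$ and 009$, because it is not anchored and both substrings of 3$009$ are digits followed by a dollar sign.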
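
If only whole comma-separated entries should count, one possible approach is to split the string first and anchor the pattern at both ends. This is a minimal sketch; the names entries4a and whole4a are illustrative and not part of the original solution.

```r
library(stringr)

test4a <- "00$, aa$, 990$, 777, $, absd$, 88888$, 000000$, 765$, 3$009$"

# Split into comma-separated entries, then keep only the entries that
# consist solely of digits followed by a single dollar sign.
entries4a <- str_split(test4a, ", ")[[1]]
whole4a <- entries4a[str_detect(entries4a, "^[0-9]+\\$$")]
whole4a
```

With the anchors in place, the entry 3$009$ is rejected as a whole instead of contributing two partial matches.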

  (b) \b[a-z]{1,4}\b

The regular expression (b) matches between one and four lower case letters. Because of the word boundaries (\b) on both sides, those letters must form a complete word.

To test this expression, we are going to use test4b

test4b<-"s, b, c, ab, abc, abcd, abbbbd, abbcdf, AA, BBB, ZZZZZ"
results4b <- unlist(str_extract_all(test4b, "\\b[a-z]{1,4}\\b"))
results4b
## [1] "s"    "b"    "c"    "ab"   "abc"  "abcd"

We are expecting: s, b, c, ab, abc, abcd
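
To see what the word boundaries contribute, the following sketch compares the pattern with and without \b on the six-letter word abbbbd from test4b. The variable names are illustrative only.

```r
library(stringr)

# Without word boundaries, a long word is cut into pieces of up to four letters.
pieces4b <- str_extract_all("abbbbd", "[a-z]{1,4}")[[1]]

# With word boundaries, a six-letter word yields no match at all.
whole4b <- str_extract_all("abbbbd", "\\b[a-z]{1,4}\\b")[[1]]

pieces4b
whole4b
```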

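  (c) .*?\.txt$

The regular expression (c) matches any sequence of characters (.* means zero or more of any character) followed by a literal .txt at the end of the string ($). The ? after the * makes the quantifier lazy, so it consumes as few characters as possible; it does not make the characters optional, since * already allows zero of them.

To test this expression, we are going to use test4c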

test4c<-"0.txt, w.txt, bcd.txt, 12ct.txt, BBB.txt, ZZZZZ.txt, ,.txt, [[[.txt"
results4c <- unlist(str_extract_all(test4c, ".*?\\.txt$"))
results4c
## [1] "0.txt, w.txt, bcd.txt, 12ct.txt, BBB.txt, ZZZZZ.txt, ,.txt, [[[.txt"
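
Because $ ties the match to the end of the string, all of test4c comes back as one match. Applying the pattern to individual strings may show its behaviour more clearly. The file names below are made up for illustration.

```r
library(stringr)

# Hypothetical file names: only strings that end in ".txt" should be detected.
files4c <- c("notes.txt", "notes.txt.bak", ".txt", "txt")
ends_in_txt <- str_detect(files4c, ".*?\\.txt$")
ends_in_txt
```

The third entry is detected because .*? may match zero characters; the last one fails because the literal dot is missing.
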
  (d) \d{2}/\d{2}/\d{4}

The regular expression (d) matches two digits, a /, two more digits, another / and then four digits, for example a date written as dd/mm/yyyy. It does not check that the digits form a valid date.

To test this expression, we are going to use test4d

test4d<-"34/89/9889, 01/01/0000, 657/980/90000, 9/9/8909, 4/5/66666, 78/09909987"
results4d <- unlist(str_extract_all(test4d, "\\d{2}/\\d{2}/\\d{4}"))
results4d
## [1] "34/89/9889" "01/01/0000"
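
The pattern is not anchored, so it can also match inside a longer run of digits. The sketch below uses a made-up input, test4d_long, and adds word boundaries as one possible way to reject such partial matches.

```r
library(stringr)

# Hypothetical input with too many digits at both ends.
test4d_long <- "123/45/67890"

# The original pattern finds a match inside the longer string.
loose4d <- str_extract_all(test4d_long, "\\d{2}/\\d{2}/\\d{4}")[[1]]

# With word boundaries on both sides, the partial match is rejected.
strict4d <- str_extract_all(test4d_long, "\\b\\d{2}/\\d{2}/\\d{4}\\b")[[1]]

loose4d
strict4d
```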

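  (e) <(.+?)>.+?</\1>

The regular expression (e) matches an opening tag, at least one character of content, and a closing tag with the same name as the opening tag; the backreference \1 repeats whatever the first group captured. Other tags can appear inside the outer pair. It looks like a parser for simple HTML.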

test4e <- "<footer><p>Posted by: John Smith</p> <p>Contact information</p></footer> other things"
unlist(str_extract_all(test4e, "<(.+?)>.+?</\\1>"))
## [1] "<footer><p>Posted by: John Smith</p> <p>Contact information</p></footer>"
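
The role of the backreference can be checked with a pair of small inputs, one with matching tags and one without. These strings and the names pattern4e, detected4e and tag4e are illustrative assumptions.

```r
library(stringr)

pattern4e <- "<(.+?)>.+?</\\1>"

# The closing tag must repeat the name captured by the first group.
detected4e <- str_detect(c("<b>bold</b>", "<b>bold</i>"), pattern4e)

# str_match() returns the full match in column 1 and group 1 in column 2.
tag4e <- str_match("<p>Contact information</p>", pattern4e)[, 2]

detected4e
tag4e
```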
  5. Rewrite the expression [0-9]+\$ in a way that all elements are altered but the expression performs the same task.

I can think of two ways to proceed here, but I could not eliminate the dollar sign.

[0-9]+\$ = [[:digit:]][[:digit:]]*[$] = \d\d*[$]

To test both rewrites, we are going to reuse test4a

We are expecting the same matches as for the original expression: 00$, 990$, 88888$, 000000$, 765$, 3$ and 009$

test4a<-"00$, aa$, 990$, 777, $, absd$, 88888$, 000000$, 765$, 3$009$"
results4a <- unlist(str_extract_all(test4a, "[0-9]+\\$"))
results51<- unlist(str_extract_all(test4a, "[[:digit:]][[:digit:]]*[$]")) #first way
results52<- unlist(str_extract_all(test4a, "\\d\\d*[$]")) #second way
results4a
## [1] "00$"     "990$"    "88888$"  "000000$" "765$"    "3$"      "009$"
results51
## [1] "00$"     "990$"    "88888$"  "000000$" "765$"    "3$"      "009$"
results52
## [1] "00$"     "990$"    "88888$"  "000000$" "765$"    "3$"      "009$"
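
Instead of comparing the printed vectors by eye, the equivalence can be checked programmatically. The sketch below also tries a third spelling, [[:digit:]]{1,}[$], which is an addition for illustration and not part of the original answer.

```r
library(stringr)

test4a <- "00$, aa$, 990$, 777, $, absd$, 88888$, 000000$, 765$, 3$009$"

# Original pattern first, followed by the rewritten variants.
variants5 <- c("[0-9]+\\$",
               "[[:digit:]][[:digit:]]*[$]",
               "\\d\\d*[$]",
               "[[:digit:]]{1,}[$]")

# Extract with each variant and compare every result to the first one.
results5 <- lapply(variants5, function(p) str_extract_all(test4a, p)[[1]])
all_same5 <- all(vapply(results5, identical, logical(1), results5[[1]]))
all_same5
```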
  6. Consider the mail address chunkylover53[at]aol[dot]com.
  (a) Transform the mail address into a standard mail format using regular expressions.
test6a <- "chunkylover53[at]aol[dot]com"
str_replace_all(test6a, c("\\[at\\]" = "@", "\\[dot\\]" = "\\."))
## [1] "chunkylover53@aol.com"
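
Since [at] and [dot] are fixed pieces of text, an alternative that avoids escaping the brackets is to match them literally with fixed(). This is a sketch of that variant, applied in two steps; step6a and mail6a are illustrative names.

```r
library(stringr)

test6a <- "chunkylover53[at]aol[dot]com"

# fixed() compares the pattern as literal text, so no backslashes are needed.
step6a <- str_replace_all(test6a, fixed("[at]"), "@")
mail6a <- str_replace_all(step6a, fixed("[dot]"), ".")
mail6a
```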
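  (b) Imagine we are trying to extract the digits in the mail address. To do so we write the expression [:digit:]. Explain why this fails and correct the expression.

In stringr, [:digit:] matches a single digit, that is 0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9. On its own it extracts only one digit per match, and we have 2 in the email address. In order for it to work, we need to add a repetition by placing the sign + after [:digit:], like [:digit:]+. Note that the standard POSIX notation puts the class inside a bracket expression, so the portable spelling is [[:digit:]]+; outside stringr the bare form [:digit:] may not be read as a character class at all.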

test6b <- "chunkylover53[at]aol[dot]com"
str_extract(test6b, "[:digit:]") #test with [:digit:] and str_extract
## [1] "5"
str_extract_all(test6b, "[:digit:]") #test with [:digit:] and str_extract_all
## [[1]]
## [1] "5" "3"
str_extract(test6b, "[:digit:]+") #test with [:digit:]+ and str_extract
## [1] "53"
str_extract_all(test6b, "[:digit:]+") #test with [:digit:]+ and str_extract_all
## [[1]]
## [1] "53"
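
For comparison, here is a sketch of the bracketed form [[:digit:]]+ using only base R functions, so it does not depend on how stringr reads the bare class. The name digits6b is illustrative.

```r
test6b <- "chunkylover53[at]aol[dot]com"

# gregexpr() locates all matches; regmatches() extracts the matched text.
digits6b <- regmatches(test6b, gregexpr("[[:digit:]]+", test6b))[[1]]
digits6b
```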
  (c) Instead of using the predefined character classes, we would like to use the predefined symbols to extract the digits in the mail address. To do so we write the expression \D. Explain why this fails and correct the expression.

\D matches any character that is not a digit, so by using \D we will receive all the other characters one by one and no digits. By searching with \D+, we will receive the two runs of characters on either side of the number. But we would like to extract digits, so we need the lower case \d; \d+ returns the whole number.

test6c <- "chunkylover53[at]aol[dot]com"
str_extract_all(test6c, "\\d+")
## [[1]]
## [1] "53"
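
The difference between the two symbols can be seen side by side in the short sketch below; the result names are illustrative.

```r
library(stringr)

test6c <- "chunkylover53[at]aol[dot]com"

# \D+ returns the runs of non-digit characters around the number.
non_digits6c <- str_extract_all(test6c, "\\D+")[[1]]

# \d+ returns the run of digits itself.
digits6c <- str_extract_all(test6c, "\\d+")[[1]]

non_digits6c
digits6c
```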