Week 4 Assignment

Please create an R Markdown file that provides a solution for #4, #5 and #6 in Automated Data Collection with R, chapter 8. Publish the R Markdown file to rpubs.com, and include links to your R Markdown file (in GitHub) and your rpubs.com URL in your assignment solution.

Answer 4:
- Answer 4a
- Answer 4b
- Answer 4c
- Answer 4d
- Answer 4e
Answer 5
Answer 6:

Answer 4:

Answer 4a: [0-9]+\\$ is a regular expression that matches one or more digits immediately followed by a literal $ sign. The double backslash escapes the $, which would otherwise act as the end-of-string anchor.

Example:

library(stringr)
four.a <- c("Sue is 2$nd place", "Fred is 3", "550$", "5th is Jen", "Nile is number$ 9")
unlist(str_extract(four.a, "[0-9]+\\$"))

## [1] "2$"   NA     "550$" NA     NA
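
To see why the backslashes matter, here is a minimal sketch that compares the escaped pattern with an unescaped one. The vector demo.a is a made-up example, not part of the assignment. Without the escape, $ is read as the end-of-string anchor instead of a literal dollar sign.

```r
library(stringr)

# Hypothetical test vector (an assumption, not from the assignment)
demo.a <- c("550$", "Fred is 3")

# Escaped: digits followed by a literal dollar sign
escaped <- str_extract(demo.a, "[0-9]+\\$")

# Unescaped: digits that sit at the very end of the string
unescaped <- str_extract(demo.a, "[0-9]+$")

escaped    # "550$" NA
unescaped  # NA     "3"
```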

Answer 4b: \\b[a-z]{1,4}\\b is a regular expression that matches whole words made up of one to four lower case letters (a to z). The \\b on each side is a word boundary, so the match cannot be part of a longer word. Words that contain an upper case letter, such as “I” and “The” in the example below, are not matched.

Example:

four.b <- "I think sunsets are very beautiful. The best place to see it is near a body of water."
str_extract(four.b, "\\b[a-z]{1,4}\\b")

## [1] "are"

unlist(str_extract_all(four.b, "\\b[a-z]{1,4}\\b"))

##  [1] "are"  "very" "best" "to"   "see"  "it"   "is"   "near" "a"    "body"
## [11] "of"
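
The effect of the word boundaries can be sketched by running the same character class with and without \\b. The vector demo.b is a hypothetical example chosen for illustration. Without the boundaries, the pattern picks up the first four letters of a longer word.

```r
library(stringr)

# Hypothetical test vector (an assumption, not from the assignment)
demo.b <- c("sunsets", "sun")

# With word boundaries: only whole words of 1 to 4 lower case letters
with.boundary <- str_extract(demo.b, "\\b[a-z]{1,4}\\b")

# Without word boundaries: up to 4 letters from anywhere in a word
without.boundary <- str_extract(demo.b, "[a-z]{1,4}")

with.boundary     # NA     "sun"
without.boundary  # "suns" "sun"
```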

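Answer 4c: .*?\\.txt$ is a regular expression that matches any string ending in .txt. The first part, .*?, matches any characters (as few as needed) before the ending, so it does not matter what comes before .txt. The \\. is a literal dot and the $ anchors the match to the end of the string.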

Example:

four.c <- c("web: abd/nyc.org/assignment.txt", "nml/like.edu/4txt.cvs", "cdn/fyc.com/four.txt")
unlist(str_extract_all(four.c, ".*?\\.txt$"))

## [1] "web: abd/nyc.org/assignment.txt" "cdn/fyc.com/four.txt"
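
A typical use of this kind of pattern is filtering file names. The sketch below uses a made-up vector called files and the stringr functions str_detect and str_subset. When the goal is only to detect matches, the shorter pattern \\.txt$ should give the same answer as .*?\\.txt$, which the last line checks on this input.

```r
library(stringr)

# Hypothetical file names (an assumption, not from the assignment)
files <- c("report.txt", "notes.txt.bak", "txt", "data.csv")

# TRUE only where the name ends in a literal ".txt"
is.txt <- str_detect(files, "\\.txt$")

# Keep only the matching elements
txt.files <- str_subset(files, "\\.txt$")

# Compare detection with the longer pattern from the answer
same.result <- identical(is.txt, str_detect(files, ".*?\\.txt$"))

is.txt       # TRUE FALSE FALSE FALSE
txt.files    # "report.txt"
same.result  # TRUE
```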

Answer 4d: \\d{2}/\\d{2}/\\d{4} is a regular expression that matches two digits, two digits and four digits, separated by /. This format is typically used to extract dates with a two digit day and month and a four digit year, written as dd/mm/yyyy or mm/dd/yyyy.

Example:

four.d <- "He was born on February 7th 1940. His father died on 07/18/1953. At the Age of 13/14 he had to drop out to school and start work. He had his first child on 01/06/1966. He boarded the ship on 12/24/1968."
unlist(str_extract_all(four.d, "\\d{2}/\\d{2}/\\d{4}"))

## [1] "07/18/1953" "01/06/1966" "12/24/1968"
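
Note that the pattern only checks the shape of the text, not whether it is a real date. As a rough sketch, the extracted strings can be passed to base R's as.Date to see which ones parse. The string demo.d is a made-up example, and the mm/dd/yyyy format is an assumption.

```r
library(stringr)

# Hypothetical text (an assumption, not from the assignment)
demo.d <- "Valid: 07/18/1953. Not a real date: 99/99/9999."

# Both strings have the right shape, so both are extracted
found <- unlist(str_extract_all(demo.d, "\\d{2}/\\d{2}/\\d{4}"))

# Assuming mm/dd/yyyy; strings that are not real dates become NA
parsed <- as.Date(found, format = "%m/%d/%Y")

found          # "07/18/1953" "99/99/9999"
is.na(parsed)  # FALSE TRUE
```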

Answer 4e: <(.+?)>.+?</\\1> is a regular expression that matches an opening tag in <>, followed later in the string by a closing tag with the same content preceded by a /. I will break the expression down part by part, using the example below. The first part, <(.+?)>, matches one or more characters of any kind, text or numeric, enclosed in <>. In the last part, </\\1>, the \\1 is a backreference to whatever the first .+? in parentheses () captured, so it matches the same content as before, enclosed in <> with a / in front. The middle .+? matches everything that comes between the first and last part. As the example below shows, R matched <4ever21>, </4ever21> and everything in between.

Example:

four.e <- c("This is <4ever21> an  example.  Not a long example </4ever21>. Very small example.")
unlist(str_extract_all(four.e, "<(.+?)>.+?</\\1>"))

## [1] "<4ever21> an  example.  Not a long example </4ever21>"
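
To make the backreference visible, here is a small sketch using stringr's str_match, which returns the full match together with the captured group. The input strings and the name pattern.e are made up for illustration. A closing tag that differs from the opening tag does not match.

```r
library(stringr)

pattern.e <- "<(.+?)>.+?</\\1>"

# Hypothetical inputs (assumptions, not from the assignment)
m <- str_match("<b>bold</b>", pattern.e)
mismatch <- str_detect("<b>bold</i>", pattern.e)

m[1, 1]   # full match:     "<b>bold</b>"
m[1, 2]   # captured group: "b"
mismatch  # FALSE, because </i> does not repeat the captured "b"
```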

Answer 5:

[0-9]+\\$ is a regular expression that I have rewritten as [[:digit:]]{1,}[$]. I enclosed the predefined class [:digit:] in [] to match digits, which is the same as [0-9]. I added {1,} to match the class one or more times, which is the same as +. Then I used [$] to match a literal $, which is the same as \\$.

Let’s look at the vector from Answer 4a, with the second element changed to "Fred is 1$2":

four.a <- c("Sue is 2$nd place", "Fred is 1$2", "550$", "5th is Jen", "Nile is number$ 9")
unlist(str_extract(four.a, "[0-9]+\\$"))

## [1] "2$"   "1$"   "550$" NA     NA

unlist(str_extract(four.a, "[[:digit:]]{1,}[$]"))

## [1] "2$"   "1$"   "550$" NA     NA
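
Instead of comparing the printed output by eye, the two patterns can be compared programmatically. This is a minimal sketch using a made-up vector called demo.five and base R's identical.

```r
library(stringr)

# Hypothetical test vector (an assumption, not from the assignment)
demo.five <- c("10$", "a$", "$5", "7 $")

original  <- str_extract(demo.five, "[0-9]+\\$")
rewritten <- str_extract(demo.five, "[[:digit:]]{1,}[$]")

original                        # "10$" NA NA NA
identical(original, rewritten)  # TRUE
```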

Answer 6: Consider the mail address chunkylover53[at]aol[dot]com.

Answer 6a: I transformed the above string into a standard mail address using regular expressions. I used the stringr package’s str_replace_all function to replace [at] and [dot] with @ and . respectively.

six.a <- "Consider the mail address chunkylover53[at]aol[dot]com."
six.a <- str_replace_all(six.a, pattern = "\\[at\\]", replacement = "@")
six.a <- str_replace_all(six.a, pattern = "\\[dot\\]", replacement = ".")
six.a

## [1] "Consider the mail address chunkylover53@aol.com."
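
The two replacements can also be done in one call, because str_replace_all accepts a named character vector whose names are the patterns and whose values are the replacements. The sketch below is an alternative, not the approach used above, and the names address and replacements are made up for illustration.

```r
library(stringr)

address <- "chunkylover53[at]aol[dot]com"

# Names are regular expressions, values are their replacements
replacements <- c("\\[at\\]" = "@", "\\[dot\\]" = ".")

clean <- str_replace_all(address, replacements)
clean  # "chunkylover53@aol.com"
```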

Answer 6b: To extract the digits in the mail address I first tried [:digit:], but that does not work, as you can see below. Without a quantifier it returns one digit at a time: “5” and “3”, not the “53” that we want.

unlist(str_extract_all(six.a, "[:digit:]"))

## [1] "5" "3"

Then I used the expression [[:digit:]]{1,}, which gives us the correct number “53”. I enclosed [:digit:] in [] to indicate that we are using the predefined class [:digit:], and added {1,} to match the class one or more times, so consecutive digits are returned together.

unlist(str_extract_all(six.a, "[[:digit:]]{1,}"))

## [1] "53"

Answer 6c: We want to extract the digits from the mail address without using the predefined class [:digit:]. First I tried \\D, but it does not work, as we can see below.

unlist(str_extract_all(six.a, "\\D"))

##  [1] "C" "o" "n" "s" "i" "d" "e" "r" " " "t" "h" "e" " " "m" "a" "i" "l"
## [18] " " "a" "d" "d" "r" "e" "s" "s" " " "c" "h" "u" "n" "k" "y" "l" "o"
## [35] "v" "e" "r" "@" "a" "o" "l" "." "c" "o" "m" "."

\\D does not give us any digits because it is the same as [^[:digit:]], which matches any character except digits. The correct expression uses a lower case d, not an upper case D: \\d.

I used the expression \\d{1,}, which gives us the number “53”. \\d matches a digit, and {1,} matches it one or more times.

unlist(str_extract_all(six.a, "\\d{1,}"))

## [1] "53"
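
As a quick cross-check of Answers 6b and 6c, the sketch below runs three ways of writing "one or more digits" against the same address and collects the results. The names address, patterns and results are made up for illustration, and the comparison only covers this one input.

```r
library(stringr)

address <- "chunkylover53@aol.com"

# Three ways to write "one or more digits"
patterns <- c("[[:digit:]]+", "\\d+", "[0-9]+")

# Apply each pattern in turn and collect the first match
results <- sapply(patterns, function(p) str_extract(address, p),
                  USE.NAMES = FALSE)

results                 # "53" "53" "53"
as.numeric(results[1])  # 53, now a number rather than text
```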

Nabila Hossain

September 18, 2015