Please create an R Markdown file that provides a solution for #4, #5 and #6 in Automated Data Collection in R, chapter 8. Publish the R Markdown file to rpubs.com, and include links to your R Markdown file (in GitHub) and your rpubs.com URL in your assignment solution.
Answer 4:
[0-9]+\\$ is a regular expression that matches one or more digits immediately followed by a literal $ sign. Example:
library(stringr)
four.a <- c("Sue is 2$nd place", "Fred is 3", "550$", "5th is Jen", "Nile is number$ 9")
unlist(str_extract(four.a, "[0-9]+\\$"))
## [1] "2$" NA "550$" NA NA
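To see why the backslashes matter, here is a small sketch contrasting the escaped and unescaped $. The vector `prices` is made up for illustration and is not part of the exercise. Without the escape, $ is an anchor for the end of the string rather than a literal dollar sign.

```r
library(stringr)

# Hypothetical input, chosen only to contrast the two patterns
prices <- c("550$", "Nile is number$ 9")

# Escaped: one or more digits followed by a literal dollar sign
escaped <- str_extract(prices, "[0-9]+\\$")

# Unescaped: one or more digits followed by the end of the string
unescaped <- str_extract(prices, "[0-9]+$")

print(escaped)
print(unescaped)
```

The escaped pattern should find "550$" only in the first element, while the unescaped pattern should find "9" only in the second.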
\\b[a-z]{1,4}\\b is a regular expression that matches a whole word made of one to four lowercase letters (a to z). The \\b on each side is a word boundary, so the letters cannot be part of a longer word. In other words, it matches lowercase words that are one to four letters long. Example:
four.b <- "I think sunsets are very beautiful. The best place to see it is near a body of water."
str_extract(four.b, "\\b[a-z]{1,4}\\b")
## [1] "are"
unlist(str_extract_all(four.b, "\\b[a-z]{1,4}\\b"))
## [1] "are" "very" "best" "to" "see" "it" "is" "near" "a" "body"
## [11] "of"
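The role of the two \\b boundaries is easier to see by running the same pattern with and without them. This is only a sketch, and the string `phrase` is a shortened, made-up input rather than the one from the exercise.

```r
library(stringr)

# Hypothetical input with two long words and one short word
phrase <- "sunsets are beautiful"

# With word boundaries: only whole words of 1 to 4 lowercase letters
with_boundaries <- str_extract_all(phrase, "\\b[a-z]{1,4}\\b")[[1]]

# Without word boundaries: long words are chopped into chunks of up to 4 letters
without_boundaries <- str_extract_all(phrase, "[a-z]{1,4}")[[1]]

print(with_boundaries)
print(without_boundaries)
```

With the boundaries only "are" should come back. Without them the pattern should also return pieces such as "suns" and "ets" cut out of the longer words.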
.*?\\.txt$ is a regular expression that matches any string ending in .txt. The first part of the expression, .*?, matches whatever comes before .txt, and the $ anchors the match to the end of the string. Example:
four.c <- c("web: abd/nyc.org/assignment.txt", "nml/like.edu/4txt.cvs", "cdn/fyc.com/four.txt")
unlist(str_extract_all(four.c, ".*?\\.txt$"))
## [1] "web: abd/nyc.org/assignment.txt" "cdn/fyc.com/four.txt"
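A short sketch of how the $ anchor behaves, using made-up file names (`files` is my own example, not from the exercise). It also shows that the leading .*? only changes what is extracted, not whether a string matches.

```r
library(stringr)

# Hypothetical file names: .txt at the end, .txt in the middle, no .txt
files <- c("report.txt", "notes.txt.bak", "data.csv")

# Detection only needs the anchored suffix
ends_in_txt <- str_detect(files, "\\.txt$")

# Extraction with the leading .*? returns the whole string when it matches
full_match <- str_extract(files, ".*?\\.txt$")

print(ends_in_txt)
print(full_match)
```

Only "report.txt" should match, because in "notes.txt.bak" the .txt is not at the end of the string.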
\\d{2}/\\d{2}/\\d{4} is a regular expression that matches two digits, then two digits, then four digits, separated by /. This format is typically used for dates with a two-digit month and day and a four-digit year, written as dd/mm/yyyy or mm/dd/yyyy. Example:
four.d <- "He was born on February 7th 1940. His father died on 07/18/1953. At the Age of 13/14 he had to drop out to school and start work. He had his first child on 01/06/1966. He boarded the ship on 12/24/1968."
unlist(str_extract_all(four.d, "\\d{2}/\\d{2}/\\d{4}"))
## [1] "07/18/1953" "01/06/1966" "12/24/1968"
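One thing worth noting is that this pattern only checks the shape of the text, not whether it is a real date. The sketch below uses made-up values in `candidates` and assumes the mm/dd/yyyy reading (an assumption on my part, based on 07/18/1953 in the example) to compare the regular expression with base R's as.Date.

```r
library(stringr)

# Hypothetical inputs: a real date, an impossible date, and a fraction
candidates <- c("07/18/1953", "99/99/9999", "13/14")

# The regular expression only checks for the digit/slash layout
looks_like_date <- str_detect(candidates, "^\\d{2}/\\d{2}/\\d{4}$")

# as.Date, assuming mm/dd/yyyy, returns NA for anything it cannot parse
parsed <- as.Date(candidates, format = "%m/%d/%Y")

print(looks_like_date)
print(parsed)
```

The impossible date should pass the regular expression but come back as NA from as.Date, so the two checks are useful together.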
<(.+?)>.+?</\\1> is a regular expression that matches text enclosed in <> at the beginning and the same text, preceded by a /, enclosed in <> later in the string. It is hard to explain in one sentence, so I will break the expression down part by part, using the example below. First look at the beginning, <(.+?)>, which matches one or more characters of any kind, text or numeric, surrounded by <>. Next, look at the last part, </\\1>. The \\1 is a back reference to the text captured by the .+? enclosed in parentheses () at the beginning. In other words: take whatever was found inside <> in the first part and look for a second <> with the same content, except with a / in front. The middle .+? matches everything that comes between the first and last parts. As the example below shows, R matched 4ever21 and /4ever21 enclosed in <>, and everything in between. Example:
four.e <- c("This is <4ever21> an example. Not a long example </4ever21>. Very small example.")
unlist(str_extract_all(four.e, "<(.+?)>.+?</\\1>"))
## [1] "<4ever21> an example. Not a long example </4ever21>"
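To check the back reference on its own, here is a small sketch with two made-up strings in `tags`, one where the closing part repeats the opening text and one where it does not. str_match is used to show what the parentheses captured.

```r
library(stringr)

# Hypothetical inputs: matching open/close pair, then a mismatched pair
tags <- c("<b>bold</b>", "<b>broken</i>")
pattern <- "<(.+?)>.+?</\\1>"

# Only the first string closes with the same text it opened with
matched <- str_detect(tags, pattern)

# str_match returns the full match and the captured group as matrix columns
parts <- str_match(tags[1], pattern)

print(matched)
print(parts)
```

The second column of `parts` holds the text that \\1 refers back to, which here should be "b".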
Answer 5:
[0-9]+\\$ is a regular expression that I have rewritten as [[:digit:]]{1,}[$]. I enclose the [:digit:] class in [] to indicate that we are looking for digits, the same as writing [0-9]. I also added {1,} to match the class one or more times, the same as using +. Then I used [$] to match a literal $ in the vector, the same as writing \\$.
Let’s look at the vector from answer 4a, with the second element changed to "Fred is 1$2":
four.a <- c("Sue is 2$nd place", "Fred is 1$2", "550$", "5th is Jen", "Nile is number$ 9")
unlist(str_extract(four.a, "[0-9]+\\$"))
## [1] "2$" "1$" "550$" NA NA
unlist(str_extract(four.a, "[[:digit:]]{1,}[$]"))
## [1] "2$" "1$" "550$" NA NA
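Instead of comparing the two printed outputs by eye, the results can also be compared in code. This is just a sketch using identical() from base R; the vector is the one from this answer, stored under the made-up name `amounts`.

```r
library(stringr)

# Same vector as in this answer, under a hypothetical name
amounts <- c("Sue is 2$nd place", "Fred is 1$2", "550$",
             "5th is Jen", "Nile is number$ 9")

original  <- str_extract(amounts, "[0-9]+\\$")
rewritten <- str_extract(amounts, "[[:digit:]]{1,}[$]")

# TRUE when both expressions extract exactly the same values
same_result <- identical(original, rewritten)
print(same_result)
```

This only shows that the two expressions agree on this particular input, not that they are equivalent for every possible string.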
Answer 6: Consider the mail address chunkylover53[at]aol[dot]com.
I use the stringr package’s str_replace_all function to replace [at] and [dot] with @ and . respectively.
six.a <- "Consider the mail address chunkylover53[at]aol[dot]com."
six.a <- str_replace_all(six.a, pattern = "\\[at]", replacement = "@")
six.a <- str_replace_all(six.a, pattern = "\\[dot]", replacement = ".")
six.a
## [1] "Consider the mail address chunkylover53@aol.com."
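As an alternative sketch, str_replace_all also accepts a named character vector of pattern = replacement pairs, so both substitutions can be done in one call. The variable names `address` and `cleaned` are my own, and I escape the closing bracket as well, which is optional here but makes the intent clearer.

```r
library(stringr)

# Hypothetical variable holding just the obfuscated address
address <- "chunkylover53[at]aol[dot]com"

# Names are the patterns, values are the replacements
cleaned <- str_replace_all(address, c("\\[at\\]" = "@", "\\[dot\\]" = "."))

print(cleaned)
```

The result should be the address in standard format, chunkylover53@aol.com.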
First I try the expression [:digit:], but that does not work, as you can see below. It only gives us one digit at a time: it shows “5” and “3”, not the “53” we want.
unlist(str_extract_all(six.a, "[:digit:]"))
## [1] "5" "3"
Then I use the expression [[:digit:]]{1,}, which gives us the correct number “53”. I enclose [:digit:] in [] to indicate that we are using the predefined class [:digit:]. I also added {1,} to match the class one or more times.
unlist(str_extract_all(six.a, "[[:digit:]]{1,}"))
## [1] "53"
Now we want to use a predefined symbol instead of the class [:digit:]. First we try \\D to get the digits, but it does not work, as we can see below.
unlist(str_extract_all(six.a, "\\D"))
## [1] "C" "o" "n" "s" "i" "d" "e" "r" " " "t" "h" "e" " " "m" "a" "i" "l"
## [18] " " "a" "d" "d" "r" "e" "s" "s" " " "c" "h" "u" "n" "k" "y" "l" "o"
## [35] "v" "e" "r" "@" "a" "o" "l" "." "c" "o" "m" "."
\\D does not give us any digits because it is equivalent to [^[:digit:]], which matches any character except a digit. The correct symbol uses a lowercase d, not an uppercase D: \\d.
I use the expression \\d{1,}, which gives us the number “53”. \\d matches a single digit, and the {1,} matches the preceding \\d one or more times.
unlist(str_extract_all(six.a, "\\d{1,}"))
## [1] "53"
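Since \\D is the complement of \\d, it can still be used to get at the digits, just in the opposite direction: by deleting everything it matches. A minimal sketch, with the cleaned address stored under the made-up name `address`:

```r
library(stringr)

# Hypothetical variable holding the cleaned address
address <- "chunkylover53@aol.com"

# Keep the digits: extract runs of \d
digits_extracted <- str_extract_all(address, "\\d+")[[1]]

# Drop the non-digits: replace every \D with an empty string
digits_by_removal <- str_replace_all(address, "\\D", "")

print(digits_extracted)
print(digits_by_removal)
```

Both approaches should give "53" here. They would differ on a string with several separate numbers, because removing the non-digits glues all the digits together into one string.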