Assignment - Regular Expressions

Assignment Instructions

Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to problems 3 and 4 from chapter 8 of Automated Data Collection with R. Problem 9 is extra credit. You may work in a small group, but please submit separately with names of all group participants in your submission.

Chapter 8, Problem 3

Copy the introductory example. The vector name stores the extracted names.

library(stringr)
raw.data <- paste("555-1239Moe Szyslak(636) 555-0113Burns, C. ",
                  "Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned ",
                  "Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert")
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name

## [1] "Moe Szyslak"           "Burns, C.  Montgomery" "Rev. Timothy Lovejoy" 
## [4] "Ned  Flanders"         "Simpson, Homer"        "Dr. Julius Hibbert"

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

First normal form requires that different types of information are not mixed within one column, so the first and last names are split into two columns. ⁱ The first step is optional: it expands Mr. Burns’ first name from an initial to the full name. The next step finds a word that follows a punctuation mark and a space, strips the punctuation, and uses that word as the first name. Where no such word is found, the first word of the string is used as the first name. The last name is the word that sits either at the end of the string or directly before a comma, again with the punctuation stripped. Finally, the results are displayed as a data frame.

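# Optional: expand Mr. Burns' initial; the period is escaped so it matches literally.
first_name <- str_replace(name, "C\\.", "Charles")
# A word that follows punctuation and a space is the first name; strip the punctuation.
first_name <- str_extract(first_name, "[[:punct:]]\\s[[:alpha:]]+")
first_name <- str_extract(first_name, "[[:alpha:]]+")
# Where no such word was found, fall back to the first word of the name.
first_name[is.na(first_name)] <- str_extract(name, "[[:alpha:]]+")[is.na(first_name)]
# The last name ends the string or precedes a comma; strip the comma.
last_name <- str_extract(name, "[[:alpha:]]+($|,)")
last_name <- str_extract(last_name, "[[:alpha:]]+")
data.frame(first_name, last_name)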

##   first_name last_name
## 1        Moe   Szyslak
## 2    Charles     Burns
## 3    Timothy   Lovejoy
## 4        Ned  Flanders
## 5      Homer   Simpson
## 6     Julius   Hibbert
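
The exercise asks for a vector rather than a data frame. The following is a minimal alternative sketch, assuming the six names shown in the introductory output. A back-reference swaps the "Last, First" entries and a second pattern drops the titles. Unlike the solution above, it keeps Mr. Burns’ initial and middle name. The variable names swapped and cleaned are illustrative only.

```r
library(stringr)

# Names as extracted in the introductory example (copied from its output).
name <- c("Moe Szyslak", "Burns, C.  Montgomery", "Rev. Timothy Lovejoy",
          "Ned  Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Swap "Last, First ..." into "First ... Last" using two capture groups.
swapped <- str_replace(name, "^(\\w+),\\s+(.+)$", "\\2 \\1")

# Drop a leading title (two or three letters plus a period), then
# collapse repeated spaces left over from the original paste().
cleaned <- str_squish(str_replace(swapped, "^[[:alpha:]]{2,3}\\.\\s*", ""))
cleaned
```

Under these assumptions the result is one string per character in first_name last_name order, for example "Homer Simpson" and "C. Montgomery Burns".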

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

The pattern matches two or three letters followed by a literal period, which covers Rev. and Dr. but not the single-letter initial C.

str_detect(name, "[[:alpha:]]{2,3}[.]")

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
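
The pattern is not anchored, so it would also flag a name that merely ends in a period. Below is a small sketch of a stricter check, assuming the only titles of interest are Rev. and Dr. The entry "Flanders, Ned." is made up purely to show the difference, and the names with_extra, loose, and strict are illustrative.

```r
library(stringr)

# Names as extracted in the introductory example (copied from its output).
name <- c("Moe Szyslak", "Burns, C.  Montgomery", "Rev. Timothy Lovejoy",
          "Ned  Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Hypothetical extra entry: a name that ends with a period but has no title.
with_extra <- c(name, "Flanders, Ned.")

# The original pattern flags the extra entry: "Ned." is three letters and a period.
loose <- str_detect(with_extra, "[[:alpha:]]{2,3}[.]")

# Anchoring to the start of the string and listing the titles avoids that.
strict <- str_detect(with_extra, "^(Rev|Dr)\\.\\s")

rbind(loose, strict)
```

The two rows agree on the six original names and differ only on the hypothetical seventh entry.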

Construct a logical vector indicating whether a character has a second name.

First, the titles are removed using the title pattern from the previous part. Then the number of words (names) is counted. Finally, a logical vector records whether each name has more than two words.

second_name <- str_replace(name, "[[:alpha:]]{2,3}[.]", "")
second_name <- str_count(second_name, "\\w+")
second_name <- second_name > 2; second_name

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

References

ⁱ Munzert, Simon, et al. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. John Wiley & Sons, 2015, p. 171.

Chapter 8, Problem 4

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

[0-9]+\$

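Vectors containing strings with one or more consecutive digits immediately followed by a literal dollar sign. Because the dollar sign is escaped, it is an ordinary character rather than the end-of-string anchor, so the match may occur anywhere in the string.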

a <- c("asdad  0123456789$")
unlist(str_extract_all(a, "[0-9]+\\$"))

## [1] "0123456789$"
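
A short sketch of the point that the escaped dollar sign is a literal character and not an anchor. The vector amounts is made up for illustration.

```r
library(stringr)

# Illustrative inputs: digits before "$" mid-string, "$" before digits, and a bare match.
amounts <- c("cost 100$ today", "$100", "7$")

# Only digits directly followed by "$" match; "$100" has the sign on the wrong side.
found <- str_extract(amounts, "[0-9]+\\$")
found
# expected: "100$" NA "7$"
```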

\b[a-z]{1,4}\b

Vectors containing strings with a whole word of one to four consecutive lowercase letters from a to z, delimited on both sides by word boundaries (the beginning and end of a word).

b <- c("one two three four five six seven eight nine ten")
unlist(str_extract_all(b, "\\b[a-z]{1,4}\\b"))

## [1] "one"  "two"  "four" "five" "six"  "nine" "ten"
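
The effect of the word boundaries and the lowercase class can be sketched on a few made-up inputs. The vector words is illustrative, and the comments assume stringr's default regex engine, in which digits count as word characters.

```r
library(stringr)

# Illustrative inputs, one per case described in the comments below.
words <- c("tree", "Tree", "trees", "a1", "go-kart")

# "tree"    : four lowercase letters between boundaries, so it matches.
# "Tree"    : the uppercase T is a word character, so "ree" has no boundary before it.
# "trees"   : five letters, so no run of one to four is bounded on both sides.
# "a1"      : the digit is a word character, so there is no boundary after "a".
# "go-kart" : the hyphen is a boundary, and the first match is "go".
short_word <- str_extract(words, "\\b[a-z]{1,4}\\b")
short_word
# expected: "tree" NA NA NA "go"
```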

.*?\.txt$

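Vectors containing strings that end in .txt, regardless of what, if anything, precedes it. The escaped period is a literal character, and the unescaped dollar sign anchors the match to the end of the string.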

c <- c("characters,  spaces, filename: example.txt")
unlist(str_extract_all(c, ".*?\\.txt$"))

## [1] "characters,  spaces, filename: example.txt"
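
A brief sketch of which strings the anchor and the escaped period accept. The file names are made up for illustration.

```r
library(stringr)

# Illustrative inputs: a plain match, a longer extension, a missing period, and a bare extension.
files <- c("example.txt", "example.txt.bak", "exampletxt", ".txt")

# The string must end in a literal ".txt"; nothing is required in front of it.
ends_in_txt <- str_detect(files, ".*?\\.txt$")
ends_in_txt
# expected: TRUE FALSE FALSE TRUE
```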

\d{2}/\d{2}/\d{4}

Vectors containing strings that have a pattern of two digits, a forward slash, two more digits, another forward slash, and then four digits.

d <- c("12/34/5678 1234/56/789")
unlist(str_extract_all(d, "\\d{2}/\\d{2}/\\d{4}"))

## [1] "12/34/5678"
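
The pattern has no anchors or word boundaries, so it can also match inside a longer run of digits. Here is a minimal sketch on a made-up input. Adding \b on both sides is one possible way to require a standalone match and is not part of the exercise.

```r
library(stringr)

# Illustrative input: extra digits on both ends of a date-like pattern.
d2 <- "112/34/56789"

# Unanchored: the pattern matches in the middle of the longer digit runs.
inner <- str_extract(d2, "\\d{2}/\\d{2}/\\d{4}")
inner
# expected: "12/34/5678"

# With word boundaries, the extra digits prevent any match.
bounded <- str_extract(d2, "\\b\\d{2}/\\d{2}/\\d{4}\\b")
bounded
# expected: NA
```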

<(.+?)>.+?</\1>

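Vectors containing strings with an opening tag, at least one character of content, and a closing tag of the same name, as in HTML or XML. The back-reference \1 repeats whatever the first group captured, so the closing tag must match the opening tag.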

e <- c("<!DOCTYPE html><html><body>Hello World</body></html></html>")
unlist(str_extract_all(e, "<(.+?)>.+?</\\1>"))

## [1] "<html><body>Hello World</body></html>"
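
The role of the back-reference can be seen with str_match, which returns the captured group alongside the full match. This is a small sketch on two made-up inputs. The vector tags and the matrix m are illustrative names.

```r
library(stringr)

# Illustrative inputs: matching tags and mismatched tags.
tags <- c("<b>bold</b>", "<b>bold</i>")

# Column 1 holds the full match, column 2 the tag name captured by the group.
m <- str_match(tags, "<(.+?)>.+?</\\1>")
m
# Row 1: "<b>bold</b>" and "b". Row 2: NA and NA, because "</i>" does not repeat "b".
```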

Chapter 8, Problem 9

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr ⁱⁱ

secret <- paste("clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo",
                "Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO",
                "d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5",
                "fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr")
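# Keep only runs of uppercase letters and periods; the spaces added by paste() are ignored.
message <- unlist(str_extract_all(secret, "[[:upper:].]{1,}"))
# Join the pieces into one string and turn the periods into spaces.
message <- str_replace_all(paste(message, collapse = ''), "[.]", " "); message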

## [1] "CONGRATULATIONS YOU ARE A SUPERNERD"

References

ⁱⁱ http://stackoverflow.com/questions/35542346/r-using-regmatches-to-extract-certain-characters


Jose Zuniga
