Fall 2019 - Data 607 - Week 3 Assignment

I started by loading stringr to facilitate analysis.

library(stringr)

3. Copy the introductory example.

I extracted the names from the character string by matching runs of two or more letters, periods, commas, apostrophes, or spaces.

raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
names <- unlist(str_extract_all(raw.data, "[A-Za-z.,' ]{2,}"))
names

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
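A side note on letter ranges in patterns like this one: a range written as A-z is not the same as A-Za-z, because six punctuation characters sit between "Z" and "a" in ASCII. The sketch below shows the difference with stringr; the vector name `between` is my own and not part of the assignment.

```r
library(stringr)

# The six ASCII characters that fall between "Z" and "a"
# (hypothetical vector name, used only for this illustration)
between <- c("[", "\\", "]", "^", "_", "`")

# The range A-z covers them, so every element is detected
wide_range <- str_detect(between, "[A-z]")

# Spelling out both cases matches letters only
letters_only <- str_detect(between, "[A-Za-z]")

wide_range
letters_only
```

None of these characters occur in raw.data, so either range extracts the same six names here.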

1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard “first_name last_name”.

I copied the vector, then identified and replaced the names stored as “last_name, first_name”, using my knowledge of The Simpsons to expand “C.” to “Charles”. After removing the titles, I used the word function in stringr to extract the first and last names, which I displayed as columns in a data frame.

names1 <- names
grep(",", names1, value = TRUE)

## [1] "Burns, C. Montgomery" "Simpson, Homer"

names1 <- str_replace(names1, "Burns, C. Montgomery", "Charles Montgomery Burns")
names1 <- str_replace(names1, "Simpson, Homer", "Homer Simpson")
grep(".", names1, value = TRUE, fixed = TRUE)

## [1] "Rev. Timothy Lovejoy" "Dr. Julius Hibbert"

names1 <- str_replace(names1, "[A-Za-z]{2,3}\\.\\s", "")
names1

## [1] "Moe Szyslak"              "Charles Montgomery Burns"
## [3] "Timothy Lovejoy"          "Ned Flanders"            
## [5] "Homer Simpson"            "Julius Hibbert"

first_name <- word(names1, 1)
last_name <- word(names1, -1)
data.frame(first_name, last_name)

##   first_name last_name
## 1        Moe   Szyslak
## 2    Charles     Burns
## 3    Timothy   Lovejoy
## 4        Ned  Flanders
## 5      Homer   Simpson
## 6     Julius   Hibbert
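The two hard-coded replacements above could also be written as one general rule. The following is only a sketch under my own assumptions: it swaps any “last_name, first_name” entry using capture groups and strips a leading two- or three-letter title. The names `name_vec`, `swapped`, and `no_title` are hypothetical, and unlike my answer it leaves “C.” as an initial rather than expanding it to “Charles”.

```r
library(stringr)

# Same six names as in the assignment (hypothetical variable name)
name_vec <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
              "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Swap "Last, First ..." into "First ... Last" with two capture groups
swapped <- str_replace(name_vec, "^(\\w+),\\s(.+)$", "\\2 \\1")

# Drop a leading title of two or three letters followed by a period
no_title <- str_replace(swapped, "^[A-Za-z]{2,3}\\.\\s", "")

no_title
```

Because the title pattern requires at least two letters, the one-letter initial “C.” is left in place.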

2. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

I flagged the persons with titles by matching two or three letters immediately followed by a period. Requiring at least two letters skips the single-letter initial “C.” for Mr. Burns.

titles <- grepl("[A-Za-z]{2,3}\\.", names)
titles

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

data.frame(names, titles)

##                  names titles
## 1          Moe Szyslak  FALSE
## 2 Burns, C. Montgomery  FALSE
## 3 Rev. Timothy Lovejoy   TRUE
## 4         Ned Flanders  FALSE
## 5       Simpson, Homer  FALSE
## 6   Dr. Julius Hibbert   TRUE
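An alternative, sketched here as an assumption rather than as the book's intended answer, is to list the titles explicitly and anchor them to the start of the string. This avoids flagging any other short abbreviation that happens to end in a period. The names `name_vec` and `has_title` are hypothetical.

```r
library(stringr)

# Same six names as in the assignment (hypothetical variable name)
name_vec <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
              "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Only the two titles named in the question, at the start of the string;
# the list inside the parentheses is an assumption and can be extended
has_title <- str_detect(name_vec, "^(Rev|Dr)\\.\\s")

has_title
```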

3. Construct a logical vector indicating whether a character has a second name.

Using the cleaned vector names1, I counted the words in each name. I then created a logical vector that is TRUE for each person with more than two words, that is, with a second name.

second_name_check <- str_count(names1, "\\w+")
second_name <- second_name_check > 2
second_name

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

data.frame(names1, second_name_check)

##                     names1 second_name_check
## 1              Moe Szyslak                 2
## 2 Charles Montgomery Burns                 3
## 3          Timothy Lovejoy                 2
## 4             Ned Flanders                 2
## 5            Homer Simpson                 2
## 6           Julius Hibbert                 2
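One thing to watch when counting names with a regular expression is that a minimum-length quantifier such as {3,} skips short tokens like initials. The sketch below uses a made-up variant of one name (`example_name` is hypothetical) to show how the two counts can differ.

```r
library(stringr)

# Hypothetical variant of a name, with a one-letter initial
example_name <- "C Montgomery Burns"

# A three-character floor skips the initial
count_min3 <- str_count(example_name, "\\w{3,}")

# Counting every run of word characters includes it
count_all <- str_count(example_name, "\\w+")

c(count_min3, count_all)
```

For the six cleaned names in this assignment both patterns give the same counts, since every word has at least three letters.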

4. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

1. “[0-9]+\$”

This regular expression matches one or more digits immediately followed by a literal dollar sign, anywhere in the string.

one <- c("abcdef$", "123456", "abcdef", "123456$")
grepl("[0-9]+\\$", one)

## [1] FALSE FALSE FALSE  TRUE
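Since the pattern is not anchored, it also matches inside longer strings. A small sketch with made-up inputs (the vector `prices` is hypothetical) shows which part of each string is matched:

```r
library(stringr)

# Hypothetical inputs: the digits must come directly before the dollar sign
prices <- c("Total: 100$", "abc123$def", "$100")

# TRUE where some run of digits is directly followed by "$"
has_match <- str_detect(prices, "[0-9]+\\$")

# The matched substring itself, or NA where there is no match
matched <- str_extract(prices, "[0-9]+\\$")

has_match
matched
```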

2. “\b[a-z]{1,4}\b”

This regular expression matches a standalone word of one to four lowercase letters. The \b on each side is a word boundary, so the letters cannot be part of a longer word.

two <- c("Charlie", "Name", "charlie", "one", "four")
grepl("\\b[a-z]{1,4}\\b", two)

## [1] FALSE FALSE FALSE  TRUE  TRUE
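To see the word boundaries at work on running text, the sketch below extracts every match from a made-up sentence (`sentence` and `short_words` are hypothetical names):

```r
library(stringr)

# Hypothetical sentence mixing short and long lowercase words
sentence <- "a quick brown fox jumps over the lazy dog"

# Keep only whole words of one to four lowercase letters
short_words <- str_extract_all(sentence, "\\b[a-z]{1,4}\\b")[[1]]

short_words
```

The five-letter words are skipped entirely instead of being cut down to their first four letters, because the closing \b must fall at the end of a word.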

3. “.*?\.txt$”

This regular expression matches any string ending in “.txt” (e.g. the name of a .txt file). The lazy .*? prefix allows any characters, including none, before the extension.

three <- c("some.txt", "file.csv", "another.txt", "other.xls")
grepl(".*?\\.txt$", three)

## [1]  TRUE FALSE  TRUE FALSE
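The backslash before the dot matters here. As a sketch with made-up file names (`files` is hypothetical), an unescaped dot stands for any character and accepts a string that has no extension at all:

```r
library(stringr)

# Hypothetical names: a real .txt file, a string that merely ends in "txt",
# and a backup file whose final extension is not .txt
files <- c("notes.txt", "mytxt", "notes.txt.bak")

# Escaped dot: a literal "." is required before "txt"
escaped <- str_detect(files, "\\.txt$")

# Unescaped dot: any character may precede "txt"
unescaped <- str_detect(files, ".txt$")

escaped
unescaped
```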

4. “\d{2}/\d{2}/\d{4}”

This regular expression matches, in order, two digits, a forward slash, two digits, a forward slash, and four digits (e.g. a date such as 09/12/2019). It checks the format only, not whether the digits form a valid date.

four <- c("9/12/19", "09/12/19", "09/12/2019")
grepl("\\d{2}/\\d{2}/\\d{4}", four)

## [1] FALSE FALSE  TRUE
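A short sketch with made-up inputs (`dates` and `found` are hypothetical names) shows that the pattern can pull a date out of a longer string, and that it also accepts digit groups that are not a real date:

```r
library(stringr)

# Hypothetical inputs: a date inside a sentence, an impossible date,
# and a date with a one-digit month
dates <- c("Due 09/12/2019 at noon", "99/99/9999", "9/12/2019")

# The matched substring, or NA where the format does not fit
found <- str_extract(dates, "\\d{2}/\\d{2}/\\d{4}")

found
```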

5. “<(.+?)>.+?</\1>”

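This regular expression matches an opening tag, at least one character of content, and a closing tag whose name repeats the name captured from the opening tag through the backreference \1 (e.g. a properly paired HTML element and whatever it contains).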

five <- c("<html><body>Some text</body></html>", "</html>something</body>")
grepl("<(.+?)>.+?</\\1>", five)

## [1]  TRUE FALSE
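To make the role of the capture group visible, the sketch below uses str_match, which returns the full match in the first column and the captured tag name in the second. The inputs and the names `tags` and `m` are made up for illustration.

```r
library(stringr)

# Hypothetical inputs: one correctly paired tag, one mismatched pair
tags <- c("<b>bold</b>", "<b>bold</i>")

# Column 1 is the whole match, column 2 is the captured tag name
m <- str_match(tags, "<(.+?)>.+?</\\1>")

m
```

The second string yields NA in both columns because no closing tag repeats the captured name.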

9. The following code hides a secret message. Crack it with R and regular expressions.

I had difficulty finding the snippet from the book online, so I typed it out. In doing so, I guessed that the secret message was hidden in either the numbers or the uppercase letters (I wasn’t sure what to make of “clcop”). Extracting the uppercase letters spelled out a phrase, and on closer inspection the words in the phrase were separated by periods in the snippet. I extracted the uppercase letters together with the periods and the exclamation mark, then replaced the periods with spaces, producing “CONGRATULATIONS YOU ARE A SUPERNERD!”

snippet <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
str_extract_all(snippet, "[0-9]")

## [[1]]
##  [1] "1" "0" "8" "7" "7" "9" "2" "8" "5" "5" "0" "7" "8" "0" "3" "5" "3"
## [18] "0" "7" "5" "5" "3" "3" "6" "4" "1" "1" "6" "2" "2" "4" "9" "0" "5"
## [35] "6" "5" "1" "7" "2" "4" "6" "3" "9" "5" "8" "9" "6" "5" "9" "4" "9"
## [52] "0" "5" "4" "5"

str_extract_all(snippet, "[A-Z]")

## [[1]]
##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"

uppercase <- unlist(str_extract_all(snippet, "[A-Z.!]"))
secret_message <- str_replace_all(paste(uppercase, collapse = ""), "\\.", " ")
secret_message

## [1] "CONGRATULATIONS YOU ARE A SUPERNERD!"
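The same idea can be expressed without extracting and pasting, by deleting every character that is not wanted. The sketch below runs on a short made-up string built the same way as the snippet (`toy`, `kept`, and `message_text` are hypothetical names), not on the book's snippet itself.

```r
library(stringr)

# Hypothetical toy string: capitals, periods, and "!" hidden among noise
toy <- "abH3xI.7kM4zO0qM!r"

# Delete everything except uppercase letters, periods, and "!"
kept <- str_replace_all(toy, "[^A-Z.!]", "")

# Periods act as word separators, so turn them into spaces
message_text <- str_replace_all(kept, fixed("."), " ")

message_text
```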

Sources:

“Basic Regular Expressions in R, Cheat Sheet”. RStudio. Accessed 091219 from https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf.

“Extract first word from a column and insert into new column [duplicate]”. StackOverflow. Accessed 091219 from https://stackoverflow.com/questions/31925811/extract-first-word-from-a-column-and-insert-into-new-column.

“Chapter 8”. Automated Data Collection with R. Accessed 091219 from http://kek.ksu.ru/eos/WM/AutDataCollectR.pdf.