DATA 607 Week 3 Assignment

Week 3

# Load packages
library(stringr)

Problem 3

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:],. ]{2,}"))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

Part (a)

# Isolate only names containing commas
names_with_commas <- name[str_detect(name,",")]

# Find a comma and extract anything to the left of it = Last names
# Trim just in case
last_names <- str_trim(str_sub(names_with_commas, 1, str_locate(names_with_commas, ",")[,1]-1))

# Find a comma and extract anything to the right of it = First names
# Trim to get rid of space after comma
first_names <- str_trim(str_sub(names_with_commas, str_locate(names_with_commas,",")[,1]+1, str_length(names_with_commas)))

# Populate original vector
name[str_detect(name,",")] <- str_c(first_names, last_names, sep = " ")
name

## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"

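Note: The code above mishandles names containing suffixes (Jr., Sr., Esq., II, etc.). It splits at the first comma only, so a suffix set off by a second comma would end up in the middle of the rearranged name.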
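
For comparison, the comma swap can also be sketched as a single substitution with backreferences. This is only an alternative under the same assumption as above (at most one comma per name); `swapped` is a hypothetical variable name, and suffixes are still not handled.

```r
library(stringr)

# Names as extracted from raw.data, before any rearranging
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Group 1 captures everything before the first comma (last name),
# group 2 captures everything after it (first name); emit them swapped.
# Names without a comma do not match and are returned unchanged.
swapped <- str_replace(name, "^([^,]+),\\s*(.+)$", "\\2 \\1")
swapped
```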

Part (b)

# Create a vector of possible titles
titles <- c("dr", "rev", "hon", "mr", "mrs", "ms")

# Extract first words of all names 
check_titles <- unlist(str_trim(str_extract(name, "^.\\w+")))

# Convert titles and first words to upper case to force search to be case insensitive
# Check if there are any matches
check_titles <- pmatch(toupper(check_titles), toupper(titles)) > 0
check_titles

## [1]   NA   NA TRUE   NA   NA TRUE

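Note: The code above relies on the title being at the beginning of the name; however, it does account for titles without a trailing period. Names without a title come back as NA rather than FALSE because pmatch returns NA when nothing matches. Since pmatch also accepts partial matches, a first word that is merely a prefix of a title (for example "Re") would be flagged as well.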
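
A minimal sketch of a stricter check, assuming exact (not partial) matching is what is wanted. It uses %in% instead of pmatch, so the result is a plain TRUE/FALSE vector. The names `first_word` and `has_title` are hypothetical and are not used elsewhere in this assignment.

```r
library(stringr)

# Names after the rearrangement in part (a)
name <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")
titles <- c("dr", "rev", "hon", "mr", "mrs", "ms")

# First run of word characters in each name (drops a trailing period)
first_word <- str_extract(name, "^\\w+")

# Exact, case-insensitive membership test; no match gives FALSE, not NA
has_title <- tolower(first_word) %in% titles
has_title
```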

Part (c)

# Get a number of words per name
word_count <- str_count(name, "\\w+")

# Subtract 1 from names with a title
word_count <- word_count - ifelse(is.na(check_titles), 0, 1)
word_count

## [1] 2 3 2 2 2 2

# Any name over 2 words should have a second name
second_name <- word_count > 2
second_name

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

Note: As in part (a), the code above does not account for suffixes.

Problem 4

[0-9]+\\$ - matches one or more uninterrupted digits followed by a dollar sign. It could be used to find dollar amounts, but it does not account for numbers with comma separators or cents, so only the digits immediately before the dollar sign are captured.

unlist(str_extract_all("Total amount is 9,999.99$.", "[0-9]+\\$"))

## [1] "99$"
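
One possible way to capture the whole amount is sketched below. The pattern `amount_pattern` is an assumption of mine, not part of the exercise, and it is deliberately loose: it would also accept malformed grouping such as "1,,2$".

```r
library(stringr)

# A digit, then any run of digits or commas, an optional decimal part,
# and finally a literal dollar sign.
amount_pattern <- "[0-9][0-9,]*(\\.[0-9]+)?\\$"

amounts <- unlist(str_extract_all("Total amount is 9,999.99$.", amount_pattern))
amounts
```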

\\b[a-z]{1,4}\\b - matches runs of one to four lowercase letters delimited by word boundaries, i.e. short lowercase words. A word containing an uppercase letter or a digit is skipped entirely.

unlist(str_extract_all("One two three four.", "\\b[a-z]{1,4}\\b"))

## [1] "two"  "four"
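
If capitalized words should count too, the same pattern can be wrapped in stringr's regex() with ignore_case = TRUE. This is a small sketch of that variation; `short_words` is a hypothetical name.

```r
library(stringr)

# Same pattern as above, but matched case-insensitively,
# so "One" is no longer skipped because of its capital letter.
short_words <- unlist(str_extract_all(
  "One two three four.",
  regex("\\b[a-z]{1,4}\\b", ignore_case = TRUE)
))
short_words
```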

.*?\\.txt$ - matches any string ending with .txt including the string that equals .txt. Can be used to find text files in a file listing.

unlist(str_extract_all("c:\\temp\\test.txt", ".*?\\.txt$"))

## [1] "c:\\temp\\test.txt"
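
To illustrate the file-listing use, here is a sketch that filters a made-up vector of file names (the names in `files` are hypothetical, not taken from a real directory).

```r
library(stringr)

# Hypothetical file listing
files <- c("notes.txt", "data.csv", ".txt", "readme.txt.bak")

# Keep only the entries that end in .txt; the bare ".txt" also qualifies
txt_files <- files[str_detect(files, ".*?\\.txt$")]
txt_files
```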

\\d{2}/\\d{2}/\\d{4} - matches two digits followed by a slash, another two digits followed by a slash, and finally four digits. Can be used to find strings formatted like dates (mm/dd/yyyy or dd/mm/yyyy), although it checks only the shape and not whether the day and month values are valid.

unlist(str_extract_all("02/16/2017", "\\d{2}/\\d{2}/\\d{4}"))

## [1] "02/16/2017"
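
Because the pattern checks only the shape, a string such as "99/99/9999" matches as well. A minimal sketch of one way to catch that, assuming the dates are meant to be mm/dd/yyyy, is to follow the regex with base R's as.Date, which returns NA for impossible dates. The names `candidates`, `shape_ok`, and `parsed` are hypothetical.

```r
library(stringr)

candidates <- c("02/16/2017", "99/99/9999")

# Both strings have the right shape...
shape_ok <- str_detect(candidates, "^\\d{2}/\\d{2}/\\d{4}$")

# ...but only the first one parses as a real mm/dd/yyyy date
parsed <- as.Date(candidates, format = "%m/%d/%Y")

shape_ok
is.na(parsed)
```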

<(.+?)>.+?</\\1> - matches a pair of opening and closing tags: one or more characters enclosed in < > brackets, followed later by a closing tag with the same content and a leading slash, i.e. </ >. There must be one or more characters between the tags. Can be used to find HTML or XML elements; however, with nested tags this expression will not grab the inner pair but will instead span across pairs (see the example below, which matches from the first opening tag to the first closing tag when the most likely desired result is from the second opening tag to the first closing tag).

unlist(str_extract_all("<a>Test-123<a>654.test</a></a>", "<(.+?)>.+?</\\1>"))

## [1] "<a>Test-123<a>654.test</a>"
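
If the innermost pair is what is wanted, one option is to forbid angle brackets between the tags. The sketch below is an assumption of mine rather than part of the exercise; it only finds elements that contain no nested tags at all.

```r
library(stringr)

# [^<>]+ for the tag name and [^<>]* for the content mean the match
# cannot run across another tag, so only the innermost pair qualifies.
inner <- unlist(str_extract_all(
  "<a>Test-123<a>654.test</a></a>",
  "<([^<>]+)>[^<>]*</\\1>"
))
inner
```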

Problem 9

secret <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
secret

## [1] "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# Not so secret message
paste(unlist(str_extract_all(secret, "[[:upper:][:punct:]]")), collapse = "")

## [1] "CONGRATULATIONS.YOU.ARE.A.SUPERNERD!"
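
As an optional last step, the periods can be swapped for spaces to make the message easier to read. This is a small sketch that starts from the decoded string; `decoded` and `readable` are hypothetical names.

```r
library(stringr)

decoded <- "CONGRATULATIONS.YOU.ARE.A.SUPERNERD!"

# fixed() treats the period as a literal character rather than a regex wildcard
readable <- str_replace_all(decoded, fixed("."), " ")
readable
```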

I figured that it was most likely just a matter of matching some specific characters. My first thought was to get rid of the numbers and go from there. However, I was too lazy to go online and search for the snippet, so I retyped it from the book instead. As I was typing, I noticed that there were very few uppercase characters, so that seemed significant. Sure enough, that was all it took. The exclamation point at the end highlighted the need to match punctuation as well.