DATA 607 - Homework 3

Exercise 8.3

(a) Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

Solution:

# Load the stringr library

library(stringr)

# Input the data

raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"   

# Extract just the names (by looking for letters)

name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

# Run a loop through the subset name. If there's a comma in that name, split it into a 1x2 vector, and switch the order. Lastly, 'trim' the whitespace from the left side.

for (i in 1:(length(name)))
{
  pat <- ","
  if (grepl(pat, name[i]) == 1)
  {
    separate_names <- unlist(str_split(name[i], ","))
    name[i] <- str_c(separate_names[2], separate_names[1], sep = " ")
  }
}
str_trim(name, side = "left")

## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"

Exercise 8.3b

(b) Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

Solution:

# Search our subset for either of these titles

grep("(Rev.|Dr.)", name)

## [1] 3 6

# Identify the two results

name[c(3,6)]

## [1] "Rev. Timothy Lovejoy" "Dr. Julius Hibbert"

Exercise 8.3c

(c) Construct a logical vector indicating whether a character has a second name.

Solution:

second_name <- str_count(fixed(name[c(1,2,4,5)]), "\\w+")
grep("3", second_name)

## [1] 2

name[2]

## [1] " C. Montgomery Burns"

Exercise 8.4

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

8.4a

[0-9]+ \ $

# Matches one or more digits followed by a dollar sign.
pat <- "[0-9]+\\$"
ex <- c("1$", "abc123$", "123$abc")
grepl(pat, ex)

## [1] TRUE TRUE TRUE

8.4b

\ b[a-z]{1,4} \ b

# Matches between 1 and 4 consecutive lowercase letters
pat <- "\\b[a-z]{1,4}\\b"
ex <- c("alpha", "beta", "gamma", "ZeTa", "eta")
grepl(pat, ex)

## [1] FALSE  TRUE FALSE FALSE  TRUE

8.4c

.*? \ .txt$

# Matches wildcard string ending in .txt
pat <- ".*?\\.txt$"
ex <- c("abc.txt", "123.txt", "123abc123.txt", "%#LMV#Ckdnf3i2.txt")
grepl(pat, ex)

## [1] TRUE TRUE TRUE TRUE

8.4d

\ d{2}/ \ d{2}/ \ d{4}

# Matches a string of digits in the format 12/34/5678
pat <- "\\d{2}/\\d{2}/\\d{4}"
ex <- c("98/76/5432", "1/2/3", "12/34/56")
grepl(pat, ex)

## [1]  TRUE FALSE FALSE

8.4e

<(.+?)>.+?</ \ 1>

# Matches

Exercise 8.9

The following code hides a secret message. Crack it with R and regular expressions.

clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr

Solution:

My first guess was all the capital letters, since there were only a handful of them. Extracting each manually, I found this: CONGRATULATIONSYOUAREASUPERNERD. Now to do it in R:

# Load the encrypted message 

encrypt <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# Extract just the capital letters

decrypt <- unlist(str_extract_all(encrypt, "[[:upper:]]{1,}"))
decrypt

##  [1] "C"  "O"  "N"  "G"  "R"  "A"  "T"  "U"  "L"  "AT" "I"  "O"  "N"  "S" 
## [15] "Y"  "O"  "U"  "A"  "R"  "E"  "A"  "S"  "U"  "P"  "E"  "R"  "N"  "E" 
## [29] "R"  "D"

DATA 607 - Homework 3

Joshua Sturm

09/14/2017

Exercise 8.3

Exercise 8.3b

Exercise 8.3c

Exercise 8.4

8.4a

8.4b

8.4c

8.4d

8.4e

Exercise 8.9