Load libraries:

library(stringr)

Problem 3

Copy the introductory example. The vector name stores the extracted names.

raw.data <- '555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert'

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

First, let’s try to extract just the names into a character vector where each element corresponds to a full name. This is somewhat tricky, as there is no delimiter separating names from phone numbers.

Extracting Names

As a first attempt, let’s try a simple solution, extracting all alphabetic characters:

unlist(str_extract_all(raw.data, '[[:alpha:]]'))

##  [1] "M" "o" "e" "S" "z" "y" "s" "l" "a" "k" "B" "u" "r" "n" "s" "C" "M"
## [18] "o" "n" "t" "g" "o" "m" "e" "r" "y" "R" "e" "v" "T" "i" "m" "o" "t"
## [35] "h" "y" "L" "o" "v" "e" "j" "o" "y" "N" "e" "d" "F" "l" "a" "n" "d"
## [52] "e" "r" "s" "S" "i" "m" "p" "s" "o" "n" "H" "o" "m" "e" "r" "D" "r"
## [69] "J" "u" "l" "i" "u" "s" "H" "i" "b" "b" "e" "r" "t"

This does indeed return every alphabetic character, but as single letters rather than the full names we want. To fix this, let’s require each match to be at least two characters long:

unlist(str_extract_all(raw.data, '[[:alpha:]]{2,}'))

##  [1] "Moe"        "Szyslak"    "Burns"      "Montgomery" "Rev"       
##  [6] "Timothy"    "Lovejoy"    "Ned"        "Flanders"   "Simpson"   
## [11] "Homer"      "Dr"         "Julius"     "Hibbert"

This is much closer, but we want to keep the titles and first/last names together. Let’s specify not to ‘break’ on spaces by including a space along with the alphabetic characters:

unlist(str_extract_all(raw.data, '[[:alpha:] ]{2,}'))

##  [1] "Moe Szyslak"      "Burns"            " C"              
##  [4] " Montgomery"      "Rev"              " Timothy Lovejoy"
##  [7] "Ned Flanders"     "Simpson"          " Homer"          
## [10] "Dr"               " Julius Hibbert"

An improvement, but it is still ‘breaking’ on the punctuation within each name. Let’s add the comma and period alongside the space:

name <- unlist(str_extract_all(raw.data, '[[:alpha:],. ]{2,}'))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

Perfect! name can be described as every sequence of at least two characters in raw.data that consists only of alphabetic characters, periods, commas, and spaces.

Standardizing `name`

Now to the core of the question.

Normally, I would probably detect if a comma is present, and if so, split the string on the comma with strsplit() and rev() the resulting strings.
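For comparison, the split-and-reverse idea can be sketched in base R. The helper name swap_on_comma is hypothetical, and the sketch assumes each name contains at most one comma:

```r
# Hypothetical helper: turn "last, first" into "first last" without capture groups
swap_on_comma <- function(s) {
  # Leave names without a comma untouched
  if (!grepl(',', s, fixed = TRUE)) return(s)
  # Split on the comma, trim stray whitespace, and reverse the two pieces
  parts <- trimws(strsplit(s, ',', fixed = TRUE)[[1]])
  paste(rev(parts), collapse = ' ')
}

swap_on_comma('Simpson, Homer')        # "Homer Simpson"
swap_on_comma('Burns, C. Montgomery')  # "C. Montgomery Burns"
swap_on_comma('Moe Szyslak')           # "Moe Szyslak"
```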

Instead, I’ll attempt to do this with RegEx and stringr.

Though not mentioned in the text, RegEx has a feature called capture groups (or matching groups), which number the parenthesized parts of a pattern so that each can be retrieved separately. This is apparent in str_match():

str_match('Simpson, Homer', '(\\w+),\\s(\\w+)')

##      [,1]             [,2]      [,3]   
## [1,] "Simpson, Homer" "Simpson" "Homer"

Parentheses are used to break up the pattern into groups, in this case, into two words (\\w+). The resulting matrix allows access to these groups via subscripts.
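As a small illustration of those subscripts, the matrix can be indexed like any other. The variable name m is just a placeholder for this sketch:

```r
library(stringr)

# Column 1 holds the whole match; columns 2 and 3 hold the two groups
m <- str_match('Simpson, Homer', '(\\w+),\\s(\\w+)')

m[1, 1]  # "Simpson, Homer"
m[1, 2]  # "Simpson"
m[1, 3]  # "Homer"
```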

We can use these groups with str_replace() to replace the words such that ‘last_name, first_name’ becomes ‘first_name last_name’:

str_replace('Simpson, Homer', '(\\w+),\\s(\\w+)', '\\2 \\1')

## [1] "Homer Simpson"

Mr. Burns’s name is more challenging because of the first name initial, but can be solved along similar lines:

str_replace('Burns, C. Montgomery', '(\\w+),\\s(\\w\\.)\\s(\\w+)', '\\2 \\3 \\1')

## [1] "C. Montgomery Burns"

Pulling It All Together

Let’s construct a function that converts any name string to a standardized format, and that can then be applied to name:

standardize_name <- function(s) {
  # Both patterns are anchored, so they must match the whole trimmed string
  last_first <- '^(\\w+),\\s(\\w+)$'
  # The period after the initial is escaped so it matches a literal dot
  last_first_init <- '^(\\w+),\\s(\\w\\.)\\s(\\w+)$'

  s <- str_trim(s)

  if (str_detect(s, last_first)) {
    # last_name, first_name
    standard <- str_replace(s, last_first, '\\2 \\1')
  } else if (str_detect(s, last_first_init)) {
    # last_name, first_initial middle_name
    standard <- str_replace(s, last_first_init, '\\2 \\3 \\1')
  } else {
    # Already in first_name last_name order (possibly with a title)
    standard <- s
  }
  return(standard)
}

Test to make sure the function works as desired:

standardize_name('Simpson, Homer')

## [1] "Homer Simpson"

standardize_name('Burns, C. Montgomery')

## [1] "C. Montgomery Burns"

standardize_name('Lisa Simpson')

## [1] "Lisa Simpson"

Finally, apply the function to the entire name vector:

lapply(name, FUN=standardize_name)

## [[1]]
## [1] "Moe Szyslak"
## 
## [[2]]
## [1] "C. Montgomery Burns"
## 
## [[3]]
## [1] "Rev. Timothy Lovejoy"
## 
## [[4]]
## [1] "Ned Flanders"
## 
## [[5]]
## [1] "Homer Simpson"
## 
## [[6]]
## [1] "Dr. Julius Hibbert"
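Because str_replace() is vectorized and leaves non-matching strings unchanged, the same result can probably be obtained as a plain character vector without a helper function or lapply(). This is a minimal sketch, with the name vector retyped by hand so the block runs on its own and the period after the initial escaped:

```r
library(stringr)

# The six extracted names, retyped here so the example is self-contained
name <- c('Moe Szyslak', 'Burns, C. Montgomery', 'Rev. Timothy Lovejoy',
          'Ned Flanders', 'Simpson, Homer', 'Dr. Julius Hibbert')

# Each call rewrites only the elements its pattern matches
std <- str_replace(name, '^(\\w+),\\s(\\w\\.)\\s(\\w+)$', '\\2 \\3 \\1')
std <- str_replace(std, '^(\\w+),\\s(\\w+)$', '\\2 \\1')
std
```

The result is an ordinary character vector rather than a list, which may be easier to use in later steps.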

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

Using str_detect() with the alternation operator (|), escaping each period so that it matches a literal dot:

str_detect(name, 'Dr\\.|Rev\\.')

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
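One caveat is worth illustrating: an unescaped . matches any character, so a title pattern needs \\. to require a literal period. The sketch below uses a made-up second name ('Drew Barrymore') that is not part of the exercise data:

```r
library(stringr)

# 'Drew Barrymore' is a hypothetical name added only to show the difference
candidates <- c('Dr. Julius Hibbert', 'Drew Barrymore')

str_detect(candidates, 'Dr.|Rev.')      # TRUE TRUE  ("Dre" satisfies 'Dr.')
str_detect(candidates, 'Dr\\.|Rev\\.')  # TRUE FALSE
```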

Construct a logical vector indicating whether a character has a second name.

Mr. Burns is the only one with a second name in this list, and it is observable as a single capital letter followed by a period and a space (the period is escaped so that it matches a literal dot):

str_detect(name, '[A-Z]\\.\\s')

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

Problem 4

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

[0-9]+\\$

This matches one or more digits followed by a literal dollar sign. The double backslash escapes $, which would otherwise anchor the match to the end of the string.

str_extract_all('1$', '[0-9]+\\$')

## [[1]]
## [1] "1$"

str_extract_all('1000000$ is a lot of money', '[0-9]+\\$')

## [[1]]
## [1] "1000000$"

\\b[a-z]{1,4}\\b

This pattern matches whole words made up of one to four lowercase letters. The \\b on each side is a word boundary, so fragments of longer words, or of words containing digits or capitals, do not count. The command below should return only test and for, as they are the only words that meet all the criteria.

str_extract_all('test test1 amalgamation for Ben eB4n', '\\b[a-z]{1,4}\\b')

## [[1]]
## [1] "test" "for"

.*?\\.txt$

This matches any string that ends in .txt, such as the name of a text file. The first two commands will return a match and the third will not:

str_extract('notes.txt', '.*?\\.txt$')

## [1] "notes.txt"

str_extract('KKK*8293ufskf.txt', '.*?\\.txt$')

## [1] "KKK*8293ufskf.txt"

str_extract('txt.exe', '.*?\\.txt$')

## [1] NA

\\d{2}/\\d{2}/\\d{4}

This matches a date-like pattern of two digits, two digits, and four digits separated by slashes, like ‘09/08/2018’ for September 8, 2018 (or August 9, 2018 under the European convention). The first command returns a match; the second does not.

str_extract('09/08/2018', '\\d{2}/\\d{2}/\\d{4}')

## [1] "09/08/2018"

str_extract('2018/09/08', '\\d{2}/\\d{2}/\\d{4}')

## [1] NA
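Note that the pattern only checks the shape of the string, not whether it is a real date. The sketch below uses made-up inputs to show this, and adds \\b word boundaries as one possible way to avoid matching inside a longer run of digits:

```r
library(stringr)

date_like <- '\\d{2}/\\d{2}/\\d{4}'

# The shape matches even though this is not a valid calendar date
str_detect('99/99/9999', date_like)                      # TRUE
is.na(as.Date('99/99/9999', format = '%m/%d/%Y'))        # TRUE: base R rejects it

# Without boundaries, the pattern matches inside a longer digit run
str_extract('109/08/20189', date_like)                   # "09/08/2018"
str_extract('109/08/20189', '\\b\\d{2}/\\d{2}/\\d{4}\\b')  # NA
```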

<(.+?)>.+?</\\1>

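This matches a pair of HTML-style opening and closing tags with some content in between. The backreference \\1 requires the closing tag to repeat exactly what was captured in the opening tag, so tags with attributes (like href) do not match: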

str_extract('<h1>Trump Shocks Nation</h1>', '<(.+?)>.+?</\\1>')

## [1] "<h1>Trump Shocks Nation</h1>"

str_extract('<p>The president\'s decline continues this week as...</p>', '<(.+?)>.+?</\\1>')

## [1] "<p>The president's decline continues this week as...</p>"

str_extract('<a href="google.com">Google</a>', '<(.+?)>.+?</\\1>')

## [1] NA
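If tags with attributes should match as well, one possible variant captures only the tag name and lets any attributes fall outside the group. This is a sketch, not part of the exercise, and the pattern name tag_with_attrs is made up here:

```r
library(stringr)

# Capture just the tag name; [^>]* skips over any attributes before the closing >
tag_with_attrs <- '<(\\w+)[^>]*>.+?</\\1>'

str_extract('<a href="google.com">Google</a>', tag_with_attrs)
str_extract('<h1>Title</h1>', tag_with_attrs)
```

Both calls should return the full input string, since the closing tag only has to repeat the captured tag name.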

Problem 9

The following code hides a secret message. Crack it with R and regular expressions.

Try out different combinations of character classes until one works:

msg <- 'clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr'
str_c(unlist(str_extract_all(msg, '[[:upper:].]')), collapse='')

## [1] "CONGRATULATIONS.YOU.ARE.A.SUPERNERD"
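The periods appear to stand in for spaces, so a final optional step could swap them out. A minimal sketch, starting from the decoded string above and using fixed() so the period is treated literally:

```r
library(stringr)

decoded <- 'CONGRATULATIONS.YOU.ARE.A.SUPERNERD'

# fixed('.') matches a literal period rather than "any character"
str_replace_all(decoded, fixed('.'), ' ')  # "CONGRATULATIONS YOU ARE A SUPERNERD"
```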

DATA 607—Homework No. 3

Ben Horvath

September 16, 2018
