Data_607_Assignment

Copy the introductory example. The vector name stores the extracted names.

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

library('stringr')

## Warning: package 'stringr' was built under R version 3.4.4

raw.data = "555-1239Moe Szyslak(636) 555-0113 Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name = unlist(str_extract_all(raw.data,"[[:alpha:] ,.]{2,}"))

This does most of the work: where a name contains a comma, it moves the last name (the part before the comma) to the end:

name = paste(str_replace(name,'[[:alpha:]., ]+,',''), str_replace_na(str_extract(name,'[[:alpha:]]+,'),''))
name

## [1] "Moe Szyslak "          " C. Montgomery Burns," "Rev. Timothy Lovejoy "
## [4] "Ned Flanders "         " Homer Simpson,"       "Dr. Julius Hibbert "

The only remaining issues are stray leading and trailing spaces and trailing commas:

name = str_trim(str_replace(name, ',$',''))
name

## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
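The same rearrangement can also be sketched in a single step with capture groups and backreferences. This is an alternative to the two-step approach above, not part of the original solution; the names raw.names, swap.pattern, and tidy.names are made up for this illustration.

```r
library(stringr)

# Names as extracted from raw.data above (note the leading space on the second element)
raw.names <- c("Moe Szyslak", " Burns, C. Montgomery", "Rev. Timothy Lovejoy",
               "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Group 1 captures the last name before the comma, group 2 everything after it
swap.pattern <- "^\\s*([[:alpha:]]+),\\s*(.+)$"

# "\\2 \\1" writes the groups back in swapped order;
# names without a comma do not match and are left unchanged
tidy.names <- str_trim(str_replace(raw.names, swap.pattern, "\\2 \\1"))
tidy.names
```

Because the pattern only matches when a comma is present, no separate cleanup of trailing commas is needed in this variant.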

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
I found a list of common titles on the internet; I’ll assume it is exhaustive for now.

common.titles = c('Dr', 'Esq', 'Hon', 'Jr', 'Mr', 'Mrs', 'Ms', 'Messrs', 'Mmes', 'Msgr', 'Prof', 'Rev', 'Rt Hon', 'Sr', 'St')

This builds one long alternation that looks for any of these titles at a word boundary, followed by a literal “.”:

title.expression = paste(paste('\\b',common.titles,'\\.',sep=""),collapse="|")
title.expression

## [1] "\\bDr\\.|\\bEsq\\.|\\bHon\\.|\\bJr\\.|\\bMr\\.|\\bMrs\\.|\\bMs\\.|\\bMessrs\\.|\\bMmes\\.|\\bMsgr\\.|\\bProf\\.|\\bRev\\.|\\bRt Hon\\.|\\bSr\\.|\\bSt\\."

Look for those titles, ignoring case

has.title = str_detect(name, regex(title.expression, ignore_case=TRUE))
cbind(name, has.title)

##      name                   has.title
## [1,] "Moe Szyslak"          "FALSE"  
## [2,] "C. Montgomery Burns"  "FALSE"  
## [3,] "Rev. Timothy Lovejoy" "TRUE"   
## [4,] "Ned Flanders"         "FALSE"  
## [5,] "Homer Simpson"        "FALSE"  
## [6,] "Dr. Julius Hibbert"   "TRUE"
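The long alternation above repeats the word boundary and the escaped period for every title. A more compact sketch wraps the titles in one group, so the boundary and the period appear only once. The names grouped.expression, clean.names, and has.title.grouped are assumptions made for this example.

```r
library(stringr)

# Same title list as above
common.titles <- c('Dr', 'Esq', 'Hon', 'Jr', 'Mr', 'Mrs', 'Ms', 'Messrs',
                   'Mmes', 'Msgr', 'Prof', 'Rev', 'Rt Hon', 'Sr', 'St')

# One word boundary, one group of alternatives, one escaped period
grouped.expression <- paste0("\\b(", paste(common.titles, collapse = "|"), ")\\.")

# Cleaned names from the previous step
clean.names <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
                 "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

has.title.grouped <- str_detect(clean.names, regex(grouped.expression, ignore_case = TRUE))
has.title.grouped
```

On these six names the grouped pattern flags the same two entries (Rev. and Dr.) as the expanded one.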

Construct a logical vector indicating whether a character has a second name.
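I want to count the names, which is easy, but I don’t want to include titles, since those aren’t actually second names.
I will use as.integer(has.title) to “count” 1 if the name has a title, subtract that from the word count, and flag a second name when at least three names remain. Note that an initial such as “C.” counts as a name here.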

second.name <- str_count(name,'[[:alpha:]]+')-as.integer(has.title)>=3
cbind(name, second.name)

##      name                   second.name
## [1,] "Moe Szyslak"          "FALSE"    
## [2,] "C. Montgomery Burns"  "TRUE"     
## [3,] "Rev. Timothy Lovejoy" "FALSE"    
## [4,] "Ned Flanders"         "FALSE"    
## [5,] "Homer Simpson"        "FALSE"    
## [6,] "Dr. Julius Hibbert"   "FALSE"
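An alternative sketch of the same idea removes a leading title first and then counts the remaining words, instead of subtracting as.integer(has.title). The pattern title.pattern below is a hypothetical one that covers only the two titles present in this data; the other names are also made up for the example.

```r
library(stringr)

# Cleaned names from the first step
clean.names <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
                 "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Hypothetical pattern: only the titles that occur in this data set
title.pattern <- "^(Rev|Dr)\\.\\s*"

# Drop a leading title, then count the words that remain
without.title <- str_replace(clean.names, title.pattern, "")
has.second.name <- str_count(without.title, "[[:alpha:]]+") >= 3
has.second.name
```

As in the original approach, the initial in "C. Montgomery Burns" is counted as a name, so that entry is the only one flagged.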

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

[0-9]+\$
One or more digits followed by a literal $ (the backslash escapes the $, so it is not an end-of-string anchor)

str_detect('12341234$', '[0-9]+\\$')

## [1] TRUE
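A few extra test strings make the behaviour of this pattern easier to see. This is a small sketch; dollar.pattern, dollar.cases, and dollar.hits are names invented for the example.

```r
library(stringr)

dollar.pattern <- "[0-9]+\\$"

# The digits must come directly before the dollar sign,
# but the match may sit anywhere inside the string
dollar.cases <- c("12341234$", "cost: 5$ each", "$1234", "1234")
dollar.hits <- str_detect(dollar.cases, dollar.pattern)
dollar.hits
```

The first two strings match; "$1234" fails because the $ precedes the digits, and "1234" fails because there is no $ at all.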

\b[a-z]{1,4}\b
A 1-4 letter word, all lowercase

str_detect('duck', '\\b[a-z]{1,4}\\b')

## [1] TRUE
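To see where the word boundaries and the length limit matter, the pattern can be tried on a few more strings. This is only a sketch; word.pattern, word.cases, and word.hits are made-up names.

```r
library(stringr)

word.pattern <- "\\b[a-z]{1,4}\\b"

# Only a complete lowercase word of one to four letters counts as a match
word.cases <- c("duck", "a big elephant", "ducks", "Duck", "elephant")
word.hits <- str_detect(word.cases, word.pattern)
word.hits
```

"a big elephant" matches because of its short words; "ducks" and "elephant" are too long, and "Duck" fails because the capital D is not in [a-z] and there is no word boundary between "D" and "u".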

.*?\.txt$
A string ending in .txt, such as a text file name: any number of characters (including none) followed by a literal .txt at the end of the string

str_detect('blah blah blah 123.txt', '.*?\\.txt$')

## [1] TRUE
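The escaped period and the $ anchor both matter here, which a few counterexamples can show. This is a sketch with invented names (txt.pattern, txt.cases, txt.hits).

```r
library(stringr)

txt.pattern <- ".*?\\.txt$"

# The string must end in a literal ".txt"
txt.cases <- c("blah blah blah 123.txt", ".txt", "notes.txt.bak", "notestxt")
txt.hits <- str_detect(txt.cases, txt.pattern)
txt.hits
```

"notes.txt.bak" fails because .txt is not at the end, and "notestxt" fails because the escaped period requires an actual "." before "txt".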

\d{2}/\d{2}/\d{4}
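Two digits, a slash, two digits, a slash, and four digits: a date written as MM/DD/YYYY (or DD/MM/YYYY) with leading zeros required (i.e., 7/5/2019 would not match). The pattern does not check that the month and day are valid.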

str_detect('02/22/1985', '\\d{2}/\\d{2}/\\d{4}')

## [1] TRUE
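Since the pattern has no anchors, it also matches when the date is embedded in a longer string, and it accepts impossible dates. The sketch below compares it with an anchored variant; strict.pattern is an assumption added for illustration, not part of the exercise.

```r
library(stringr)

date.pattern   <- "\\d{2}/\\d{2}/\\d{4}"      # pattern from the exercise
strict.pattern <- "^\\d{2}/\\d{2}/\\d{4}$"    # hypothetical anchored variant

date.cases <- c("02/22/1985", "7/5/2019", "99/99/9999", "id 102/22/19851")

loose.hits  <- str_detect(date.cases, date.pattern)
strict.hits <- str_detect(date.cases, strict.pattern)
cbind(date.cases, loose.hits, strict.hits)
```

Both patterns accept "99/99/9999", since neither checks that the values are a valid month or day; only the anchored variant rejects the last string.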

<(.+?)>.+?</\1>
These look like HTML or XML elements: an opening tag with any name inside <>, at least one character in between, then a closing tag </> containing the EXACT same character string as the opening tag

str_detect('<class>Data 607</class>', '<(.+?)>.+?</\\1>')

## [1] TRUE
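The backreference \1 is what forces the closing tag to repeat the opening one. A short sketch with made-up test strings (tag.pattern, tag.cases, and tag.hits are names invented here) shows two ways the match can fail.

```r
library(stringr)

tag.pattern <- "<(.+?)>.+?</\\1>"

# \\1 must repeat whatever group 1 captured in the opening tag,
# and .+? requires at least one character between the tags
tag.cases <- c("<class>Data 607</class>", "<b>bold</i>", "<a></a>")
tag.hits <- str_detect(tag.cases, tag.pattern)
tag.hits
```

"<b>bold</i>" fails because the closing tag does not repeat "b", and "<a></a>" fails because there is nothing between the two tags.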

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

raw.message <- 'clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr'

I’m going to pull out a few different categories of characters to see if I can make sense of it:

unlist(str_extract_all(raw.message,'\\d+'))

##  [1] "1"   "0"   "87"  "7"   "92"  "8"   "5"   "5"   "0"   "7"   "8"  
## [12] "03"  "5"   "3"   "0"   "7"   "55"  "3"   "3"   "6"   "4"   "1"  
## [23] "1"   "6"   "2"   "2"   "4"   "9"   "05"  "65"  "1"   "7"   "24" 
## [34] "6"   "3"   "9"   "5"   "89"  "6"   "5"   "9"   "4"   "905" "4"  
## [45] "5"

unlist(str_extract_all(raw.message,'[a-z]+'))

##  [1] "clcop"    "ow"       "zmstc"    "d"        "wnkig"    "vdicp"   
##  [7] "uggvhryn" "juwczi"   "hqrfp"    "xs"       "j"        "dwpn"    
## [13] "anwo"     "wisdij"   "j"        "kpf"      "dr"       "coc"     
## [19] "bt"       "yczjat"   "aootj"    "t"        "j"        "ne"      
## [25] "c"        "fek"      "r"        "w"        "wwojig"   "d"       
## [31] "vrf"      "rbz"      "bk"       "nbhzgv"   "i"        "z"       
## [37] "crop"     "w"        "gnb"      "qo"       "f"        "a"       
## [43] "otfb"     "w"        "m"        "k"        "t"        "s"       
## [49] "zqe"      "fy"       "n"        "d"        "t"        "kc"      
## [55] "f"        "gmc"      "gxo"      "nh"       "k"        "gr"

unlist(str_extract_all(raw.message,'[A-Z]+'))

##  [1] "C"  "O"  "N"  "G"  "R"  "A"  "T"  "U"  "L"  "AT" "I"  "O"  "N"  "S" 
## [15] "Y"  "O"  "U"  "A"  "R"  "E"  "A"  "S"  "U"  "P"  "E"  "R"  "N"  "E" 
## [29] "R"  "D"

Well, I can see what the message is supposed to be, but I’m wondering if the punctuation is helpful as well. Let’s get rid of the lowercase letters and digits:

str_replace_all(str_replace_all(raw.message,'[a-z]+',''), '\\d+', '')

## [1] "CONGRATULATIONS.YOU.ARE.A.SUPERNERD!"

There we go. Just one more change: replace the periods with spaces.

str_replace_all(str_replace_all(str_replace_all(raw.message,'[a-z]+',''), '\\d+', ''), '\\.', ' ')

## [1] "CONGRATULATIONS YOU ARE A SUPERNERD!"
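The nested calls above can also be written more compactly. The sketch below removes lowercase letters and digits with a single character class and then swaps the periods for spaces; visible and decoded are names made up for this example.

```r
library(stringr)

raw.message <- 'clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr'

# Remove lowercase letters and digits in one pass with a single character class
visible <- str_replace_all(raw.message, "[a-z0-9]", "")

# fixed() treats "." as a literal period rather than the regex wildcard
decoded <- str_replace_all(visible, fixed("."), " ")
decoded
```

Splitting the work into two named steps keeps the intermediate string (with the periods still in place) available for inspection.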

Data_607_Assignment_3

Steven Ellingson

September 9, 2019