3 Extracting Names From raw.data

In raw.data, each telephone number precedes the name of the person it belongs to. Using the stringr package and a regular expression, we extract the names:

library(stringr)  # provides str_extract_all(), str_detect(), str_count(), str_extract()

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542
Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226
Simpson, Homer5553642Dr. Julius Hibbert"

(name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}")))
## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
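To see why the quantifier is {2,} rather than +, the sketch below compares the two on the same data. It uses base R's gregexpr() and regmatches() so that it runs without stringr, and it writes raw.data on a single line. The names extract_all, loose, and strict are introduced here for illustration only.

```r
raw.data <- paste0(
  "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542",
  "Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226",
  "Simpson, Homer5553642Dr. Julius Hibbert"
)

# Hypothetical helper: all matches of a pattern within one string
extract_all <- function(x, pattern) regmatches(x, gregexpr(pattern, x))[[1]]

loose  <- extract_all(raw.data, "[[:alpha:]., ]+")     # one or more characters
strict <- extract_all(raw.data, "[[:alpha:]., ]{2,}")  # at least two characters

length(loose)   # also counts the lone spaces in "(636) 555-0113" and "555 8904"
length(strict)  # names only
```

With +, the single spaces inside two of the phone numbers are returned as matches; requiring at least two characters filters them out.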

3.a Names in Uniform Order

The names in name are not in a uniform format: some are written as last name, first name, separated by a comma. The following code uses capture groups and a lookahead to rearrange those into first name, last name order:

(name <- sub("(^\\w+(?=,))(, )(.*)", "\\3 \\1", name, perl = TRUE))
## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
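The lookahead is not strictly required, since the comma can be matched directly. A minimal alternative sketch with two capture groups, assuming every "last, first" entry has a single-word last name; it re-creates the extracted names so the block runs on its own, and reordered is an illustrative name:

```r
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Group 1: the word before the comma; group 2: everything after ", "
# Names without a comma do not match and are returned unchanged
reordered <- sub("^(\\w+), (.*)$", "\\2 \\1", name)
reordered
```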

3.b Detecting a Title in a Name

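Some of the names in name have a title, i.e., Rev. or Dr. Keeping things simple, we can create a list of titles and pass it through functions in the stringr package to see whether each name contains a title from the list:

# Escape the periods so they match a literal "." rather than any character
titles <- c("Rev\\.", "Dr\\.")

# Collapse the titles into one alternation so every name is tested against
# every title; a bare two-element pattern vector is paired with names elementwise
title <- str_detect(name, paste(titles, collapse = "|"))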

data.frame(name = name, title = title)
##                   name title
## 1          Moe Szyslak FALSE
## 2  C. Montgomery Burns FALSE
## 3 Rev. Timothy Lovejoy  TRUE
## 4         Ned Flanders FALSE
## 5        Homer Simpson FALSE
## 6   Dr. Julius Hibbert  TRUE
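A single anchored alternation can both detect a title and report which one it is. A hedged sketch in base R, assuming only the two titles seen in this data; title_pattern, has_title, and title_text are names introduced here:

```r
name <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Anchored with ^ so a title only counts at the start of the name
title_pattern <- "^(Rev|Dr)\\."

has_title <- grepl(title_pattern, name)

# regexpr() returns -1 where nothing matches; leave NA for those names
m <- regexpr(title_pattern, name)
title_text <- rep(NA_character_, length(name))
title_text[m > 0] <- regmatches(name, m)

data.frame(name = name, has_title = has_title, title_text = title_text)
```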

3.c Detecting Second Names

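Since the individuals with titles are known from above, a person has a second name if their name has more than two parts, not counting the title. Counting whitespace gives the number of parts minus one, and subtracting title (coerced to 0 or 1) removes the title from that count. Applying this logic: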

data.frame(name = name, secondName = ((str_count(name, "\\s") - title) > 1))
##                   name secondName
## 1          Moe Szyslak      FALSE
## 2  C. Montgomery Burns       TRUE
## 3 Rev. Timothy Lovejoy      FALSE
## 4         Ned Flanders      FALSE
## 5        Homer Simpson      FALSE
## 6   Dr. Julius Hibbert      FALSE
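An equivalent way to express this logic is to strip the title first and then count the remaining name parts. A sketch under the assumption that Rev. and Dr. are the only titles; bare_name, n_parts, and second_name are illustrative names:

```r
name <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Remove a leading title and the whitespace after it
bare_name <- sub("^(Rev|Dr)\\.\\s+", "", name)

# Split on whitespace and count the parts of each name
n_parts <- lengths(strsplit(bare_name, "\\s+"))

# More than two parts means there is a second (middle) name or initial
second_name <- n_parts > 2
data.frame(name = name, secondName = second_name)
```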

4 String Types and Regex

The following examples pair regular expressions with the kinds of strings they match.

4.a [0-9]+\\$

The above regex matches one or more digits followed by a literal dollar sign, i.e., a number without commas or decimal points and with the dollar sign trailing. Some countries outside of the U.S. write dollar amounts this way. If the number does contain commas or decimal points, only the digits to the right of the last separator are matched, e.g.:

pat1 <- "[0-9]+\\$"

dummy_str1 <- c("Jane Doe Account balance = 123123$", "John Doe Account balance = 123,432$", 
    "John Lennon Account balance = $1234213")

data.frame(string = dummy_str1, match = str_extract(dummy_str1, pat1))
##                                   string   match
## 1     Jane Doe Account balance = 123123$ 123123$
## 2    John Doe Account balance = 123,432$    432$
## 3 John Lennon Account balance = $1234213    <NA>
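If amounts with separators should be captured whole, the character class can be widened. The following is a sketch rather than a full currency parser: pat1_wide and the helper extract_first are names invented here, and the pattern accepts any mix of digits, commas, and periods after the first digit without checking that the grouping is sensible.

```r
dummy_str1 <- c("Jane Doe Account balance = 123123$",
                "John Doe Account balance = 123,432$",
                "John Lennon Account balance = $1234213")

# Hypothetical helper: first match per string, NA where there is none
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

# A digit, then any run of digits, commas, or periods, then a literal "$"
pat1_wide <- "[0-9][0-9,.]*\\$"

amounts <- extract_first(dummy_str1, pat1_wide)
data.frame(string = dummy_str1, match = amounts)
```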

4.b \\b[a-z]{1,4}\\b

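The above regex matches a run of one to four lowercase letters bounded on both sides by word boundaries. Because digits and underscores count as word characters, letters adjacent to them (as in an1 or _abc) do not match, e.g.: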

pat2 <- "\\b[a-z]{1,4}\\b"

dummy_str2 <- c("an1 _abc abcd.", "ab1 _a_ (abcd)", "ab1 123?a?123")

data.frame(string = dummy_str2, match = str_extract(dummy_str2, pat2))
##           string match
## 1 an1 _abc abcd.  abcd
## 2 ab1 _a_ (abcd)  abcd
## 3  ab1 123?a?123     a
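To see every match in a string rather than only the first, the pattern can be applied with base R's gregexpr(). The test string below is made up for illustration, and perl = TRUE is used so that \\b is handled by the PCRE engine:

```r
pat2 <- "\\b[a-z]{1,4}\\b"

# Made-up example: two clean words, one glued to a digit, one glued to an
# underscore, and one that is too long
x <- "up, over1 _in out abcde"

all_matches <- regmatches(x, gregexpr(pat2, x, perl = TRUE))[[1]]
all_matches
```

Only "up" and "out" qualify: "over" is followed by a digit, "in" is preceded by an underscore, and "abcde" has five letters with no word boundary inside it.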

4.c .*?\\.txt$

The above regex matches any string that ends in .txt, so it can pick out the file names in a list that have the .txt extension, e.g.:

pat3 <- ".*?\\.txt$"

dummy_str3 <- c("Data607_HW3.csv", "Data607_HW3.txt", "Data607_HW3.Rmd")

data.frame(string = dummy_str3, match = str_extract(dummy_str3, pat3))
##            string           match
## 1 Data607_HW3.csv            <NA>
## 2 Data607_HW3.txt Data607_HW3.txt
## 3 Data607_HW3.Rmd            <NA>
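When the goal is to filter a list rather than extract text, the leading .*? is not needed: testing for the suffix is enough. A minimal sketch with base R's grepl(); notes.txt.bak is a made-up file name added to show the effect of the $ anchor.

```r
files <- c("Data607_HW3.csv", "Data607_HW3.txt", "Data607_HW3.Rmd",
           "notes.txt.bak")  # hypothetical file with .txt in the middle

# Keep only names whose final characters are ".txt"
txt_files <- files[grepl("\\.txt$", files)]
txt_files

# Optional: treat the extension case-insensitively
upper_ok <- grepl("\\.txt$", "README.TXT", ignore.case = TRUE)
```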

4.d \\d{2}/\\d{2}/\\d{4}

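The above regex finds dates written as mm/dd/yyyy or dd/mm/yyyy: two digits, a slash, two digits, a slash, and four digits. It checks only the digit layout, not whether the month or day is in range, e.g.: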

pat4 <- "\\d{2}/\\d{2}/\\d{4}"

dummy_str4 <- c("1/23/1994", "01/23/1994", "08/27/80", "08/27/1980", "1/1/2001", 
    "23/01/2001", "23-01-2001")

data.frame(string = dummy_str4, match = str_extract(dummy_str4, pat4))
##       string      match
## 1  1/23/1994       <NA>
## 2 01/23/1994 01/23/1994
## 3   08/27/80       <NA>
## 4 08/27/1980 08/27/1980
## 5   1/1/2001       <NA>
## 6 23/01/2001 23/01/2001
## 7 23-01-2001       <NA>
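Because the pattern only checks the layout, a string such as 99/99/9999 also matches. One way to separate real dates from look-alikes, sketched here with base R's as.Date(), is to try parsing each candidate under an explicit format; the candidates vector and the valid_* names are illustrative.

```r
pat4 <- "\\d{2}/\\d{2}/\\d{4}"
candidates <- c("01/23/1994", "23/01/2001", "99/99/9999")

# All three have the right shape
looks_like_date <- grepl(pat4, candidates, perl = TRUE)

# as.Date() returns NA when a string cannot be parsed under the given format
valid_mdy <- !is.na(as.Date(candidates, format = "%m/%d/%Y"))
valid_dmy <- !is.na(as.Date(candidates, format = "%d/%m/%Y"))

data.frame(candidates, looks_like_date, valid_mdy, valid_dmy)
```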

4.e <(.+?)>.+?</\\1>

The above regex matches the first instance, if any, of an opening tag <string> followed by at least one character and then the matching closing tag </string>. The backreference \\1 requires the closing tag to repeat whatever the group captured inside the opening tag, e.g.:

pat5 <- "<(.+?)>.+?</\\1>"

dummy_str5 <- c("<123> </123>", "<123> </a23>", "<123> abcd </123>abcd", "<123></123>", 
    "<123> <123> </123>", "<123> abc </124>", "abc<123> def <<</1234>")

data.frame(string = dummy_str5, match = str_extract(dummy_str5, pat5))
##                   string              match
## 1           <123> </123>       <123> </123>
## 2           <123> </a23>               <NA>
## 3  <123> abcd </123>abcd  <123> abcd </123>
## 4            <123></123>               <NA>
## 5     <123> <123> </123> <123> <123> </123>
## 6       <123> abc </124>               <NA>
## 7 abc<123> def <<</1234>               <NA>
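To inspect what the backreference refers to, the captured pieces can be pulled out with base R's regexec(). This sketch adds a second capture group around the content, which is a modification of the pattern above made for illustration; the test strings are made up.

```r
# Group 1: tag name; group 2 (added here): the content between the tags
pat5_groups <- "<(.+?)>(.+?)</\\1>"

ok  <- "<b>bold</b> text"
bad <- "<b>bold</i>"   # closing tag does not repeat the opening tag

# regmatches() on a regexec() result gives: full match, group 1, group 2
parts_ok  <- regmatches(ok,  regexec(pat5_groups, ok,  perl = TRUE))[[1]]
parts_bad <- regmatches(bad, regexec(pat5_groups, bad, perl = TRUE))[[1]]

parts_ok
length(parts_bad)  # nothing is returned when the tags do not agree
```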

9 Cryptography With regex

The hint that “some characters are more revealing than others” suggests that the way certain chars are represented is the key. Most of the string is lowercase letters and digits, but reading the first few chars reveals nothing. There are capitalized chars sprinkled throughout, so we look at only these:

cipher <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7L
j8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zE
crop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

gsub("[^A-Z]", "", cipher)
## [1] "CONGRATULATIONSYOUAREASUPERNERD"
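The cipher also contains a few periods and an exclamation mark. Assuming these are part of the message, with the periods standing in for spaces (an interpretation, not something the hint states), a small extension of the same idea recovers the word breaks. The cipher is rebuilt with paste0() so the block runs on its own; spaced and message_text are illustrative names.

```r
cipher <- paste0(
  "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7L",
  "j8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zE",
  "crop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
)

# Keep capitals plus "." and "!"; drop everything else
spaced <- gsub("[^A-Z.!]", "", cipher)

# Assumption: each period marks a space between words
message_text <- gsub(".", " ", spaced, fixed = TRUE)
message_text
```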

Ouch! With discovery comes insult. I’ve been called worse though.