raw.data
The raw.data string has a telephone number preceding the name of the person it belongs to. Using the stringr package and a regular expression, we extract the names:
library(stringr)
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542
Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226
Simpson, Homer5553642Dr. Julius Hibbert"
(name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}")))
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
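The `{2,}` quantifier matters here because the character class also allows spaces, and single spaces occur inside some phone numbers. The following is a minimal base R sketch of that point, using a shortened, hypothetical sample string and a hypothetical helper `extract_all` rather than the full raw.data:

```r
# Shortened, hypothetical sample in the same style as raw.data
sample_txt <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery"

# Hypothetical helper: every match of `pattern` in a single string (base R)
extract_all <- function(x, pattern) regmatches(x, gregexpr(pattern, x))[[1]]

# At least two characters from the class: only the names survive
with_min_two <- extract_all(sample_txt, "[[:alpha:]., ]{2,}")

# One or more characters: the lone space inside "(636) 555" is also returned
with_plus <- extract_all(sample_txt, "[[:alpha:]., ]+")

print(with_min_two)
print(with_plus)
```

With `+` instead of `{2,}`, the stray space would have to be filtered out afterwards.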
The names in name are not in a uniform format: some are written as last name, first name, separated by a comma. The following code uses capture groups and a lookahead to put every name in first name, last name order:
(name <- sub("(^\\w+(?=,))(, )(.*)", "\\3 \\1", name, perl = TRUE))
## [1] "Moe Szyslak" "C. Montgomery Burns" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Dr. Julius Hibbert"
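In the pattern above, the lookahead `(?=,)` only checks that a comma follows the first word without consuming it, and the second group then consumes the comma and space. As a sketch, assuming the names look like the ones here (a single word before the comma), a pattern without the lookahead gives the same result:

```r
# A few of the extracted names, copied here so the example is self-contained
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Simpson, Homer")

# Pattern from the text: three groups plus a lookahead (needs perl = TRUE)
with_lookahead <- sub("(^\\w+(?=,))(, )(.*)", "\\3 \\1", name, perl = TRUE)

# Simpler sketch: capture the word before the comma and everything after it
without_lookahead <- sub("^(\\w+), (.*)$", "\\2 \\1", name)

print(with_lookahead)
print(without_lookahead)
```

Names without a comma do not match either pattern, so `sub` returns them unchanged.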
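Some of the names in name have a title, i.e., Rev. or Dr. Keeping things simple, we can create a list of titles and pass it through functions in the stringr package to see whether each name starts with a title from the list:
titles <- c("Rev\\.", "Dr\\.")  # escape the dot so it matches a literal period
# Collapse the titles into one alternation so every name is tested against every title
title <- str_detect(name, str_c("^(", str_c(titles, collapse = "|"), ")"))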
data.frame(name = name, title = title)
##                   name title
## 1          Moe Szyslak FALSE
## 2  C. Montgomery Burns FALSE
## 3 Rev. Timothy Lovejoy  TRUE
## 4         Ned Flanders FALSE
## 5        Homer Simpson FALSE
## 6   Dr. Julius Hibbert  TRUE
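One thing worth checking when a vector of patterns is used is how the patterns are paired with the names. str_detect is vectorised over both the string and the pattern, so a pattern vector is recycled element by element rather than tried in full against every name. The base R sketch below emulates that pairing with `mapply` (an illustration, not stringr itself) and compares it with a single alternation pattern on a hypothetical reordering of the names:

```r
# Hypothetical reordering of three of the names
name <- c("Dr. Julius Hibbert", "Rev. Timothy Lovejoy", "Moe Szyslak")
titles <- c("Rev\\.", "Dr\\.")

# Element-wise pairing: title 1 vs name 1, title 2 vs name 2, title 1 vs name 3
paired <- mapply(grepl, rep_len(titles, length(name)), name, USE.NAMES = FALSE)

# One alternation pattern: every name is tested against every title
pattern <- paste0("^(", paste(titles, collapse = "|"), ")")
combined <- grepl(pattern, name)

print(data.frame(name = name, paired = paired, combined = combined))
```

In this ordering the element-wise pairing misses both titled names, while the alternation finds them regardless of order.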
Since the individuals with titles are known from above, a person has a second name if their name has more than two parts, not counting the title. Counting the spaces in each name and subtracting one for a title gives:
data.frame(name = name, secondName = ((str_count(name, "\\s") - title) > 1))
##                   name secondName
## 1          Moe Szyslak      FALSE
## 2  C. Montgomery Burns       TRUE
## 3 Rev. Timothy Lovejoy      FALSE
## 4         Ned Flanders FALSE
## 5        Homer Simpson      FALSE
## 6   Dr. Julius Hibbert      FALSE
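The same logic can be written by counting name parts directly. This is a base R sketch under the assumption that the only titles are Rev. and Dr. and that parts are separated by single spaces:

```r
# Names copied here so the example is self-contained
name <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy", "Dr. Julius Hibbert")

# Drop a leading title, then count the remaining space-separated parts
no_title <- sub("^(Rev|Dr)\\. ", "", name)
n_parts <- lengths(strsplit(no_title, " ", fixed = TRUE))

# More than two parts means there is a second name
second_name <- n_parts > 2
print(data.frame(name = name, parts = n_parts, secondName = second_name))
```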
The following examples show regular expressions and the kinds of strings each would match.
[0-9]+\\$
The above regex matches a run of digits followed immediately by a dollar sign. Some countries outside of the U.S. write dollar amounts this way. If the number contains commas or a decimal point, only the digits to the right of the last one are matched, e.g.:
pat1 <- "[0-9]+\\$"
dummy_str1 <- c("Jane Doe Account balance = 123123$", "John Doe Account balance = 123,432$",
"John Lennon Account balance = $1234213")
data.frame(string = dummy_str1, match = str_extract(dummy_str1, pat1))
##                                   string   match
## 1     Jane Doe Account balance = 123123$ 123123$
## 2    John Doe Account balance = 123,432$    432$
## 3 John Lennon Account balance = $1234213    <NA>
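If the whole amount is wanted, including separators, the character class can be widened. The sketch below uses base R and a hypothetical helper `extract_first` that returns NA when nothing matches. The pattern is an assumption for illustration and does not check that commas and decimal points are placed sensibly:

```r
# Hypothetical helper: first match per string, NA when there is none (base R)
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

# Hypothetical sample strings
balances <- c("balance = 123123$", "balance = 123,432$",
              "balance = 1,234.50$", "balance = $1234213")

# A digit, then any mix of digits, commas and periods, then a dollar sign
amounts <- extract_first(balances, "[0-9][0-9,.]*\\$")
print(amounts)
```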
\\b[a-z]{1,4}\\b
The above regex matches a run of one to four lowercase letters with a word boundary on each side. Digits and underscores count as word characters, so the letters in an1 and _abc are not matched, e.g.:
pat2 <- "\\b[a-z]{1,4}\\b"
dummy_str2 <- c("an1 _abc abcd.", "ab1 _a_ (abcd)", "ab1 123?a?123")
data.frame(string = dummy_str2, match = str_extract(dummy_str2, pat2))
##           string match
## 1 an1 _abc abcd.  abcd
## 2 ab1 _a_ (abcd)  abcd
## 3  ab1 123?a?123     a
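To make the role of `\\b` concrete, the sketch below compares it with a letters-only boundary built from lookarounds. The second pattern is an illustrative alternative, and `extract_all` is a hypothetical base R helper:

```r
# Hypothetical helper: every match in a single string, using PCRE
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

x <- "an1 _abc abcd. abcde"

# \b treats digits and underscores as part of the word
word_bounded <- extract_all(x, "\\b[a-z]{1,4}\\b")

# Lookarounds that only forbid a lowercase letter on either side
letter_bounded <- extract_all(x, "(?<![a-z])[a-z]{1,4}(?![a-z])")

print(word_bounded)
print(letter_bounded)
```

Neither pattern matches abcde, because any run of up to four letters inside it has another letter next to it.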
.*?\\.txt$
The above regex matches strings that end with the .txt extension, so it can be used to find .txt file names in a list, e.g.:
pat3 <- ".*?\\.txt$"
dummy_str3 <- c("Data607_HW3.csv", "Data607_HW3.txt", "Data607_HW3.Rmd")
data.frame(string = dummy_str3, match = str_extract(dummy_str3, pat3))
##            string           match
## 1 Data607_HW3.csv            <NA>
## 2 Data607_HW3.txt Data607_HW3.txt
## 3 Data607_HW3.Rmd            <NA>
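When the goal is only to filter a list of file names, the anchored suffix on its own is enough. A minimal base R sketch with hypothetical file names:

```r
# Hypothetical file names, including some edge cases
files <- c("Data607_HW3.csv", "Data607_HW3.txt", "notes.txt.bak", "README.TXT", "txt")

# Case-sensitive: the name must end in a literal ".txt"
txt_files <- files[grepl("\\.txt$", files)]

# Case-insensitive variant, in case extensions are upper case
txt_files_any_case <- files[grepl("\\.txt$", files, ignore.case = TRUE)]

print(txt_files)
print(txt_files_any_case)
```

The escaped dot and the `$` anchor do the work here: notes.txt.bak and the bare string txt are both rejected.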
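\\d{2}/\\d{2}/\\d{4}
The above regex finds dates in the format mm/dd/yyyy or dd/mm/yyyy. It only checks the shape (two digits, two digits and four digits, separated by slashes), not whether the date is valid, e.g.: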
pat4 <- "\\d{2}/\\d{2}/\\d{4}"
dummy_str4 <- c("1/23/1994", "01/23/1994", "08/27/80", "08/27/1980", "1/1/2001",
"23/01/2001", "23-01-2001")
data.frame(string = dummy_str4, match = str_extract(dummy_str4, pat4))
##       string      match
## 1  1/23/1994       <NA>
## 2 01/23/1994 01/23/1994
## 3   08/27/80       <NA>
## 4 08/27/1980 08/27/1980
## 5   1/1/2001       <NA>
## 6 23/01/2001 23/01/2001
## 7 23-01-2001       <NA>
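Because the pattern checks only the shape, a string such as 13/45/2001 would also match. One way to tell the two layouts apart and reject impossible dates is to parse the candidates with `as.Date` and an explicit format. The strings below are hypothetical examples:

```r
# Hypothetical candidates that all have the dd/dd/dddd shape
candidates <- c("01/23/1994", "23/01/2001", "13/45/2001")

# All three pass the shape check
shape_ok <- grepl("^\\d{2}/\\d{2}/\\d{4}$", candidates, perl = TRUE)

# Parsing returns NA where the string is not a valid date in that layout
as_mdy <- as.Date(candidates, format = "%m/%d/%Y")
as_dmy <- as.Date(candidates, format = "%d/%m/%Y")

print(data.frame(string = candidates, shape_ok = shape_ok,
                 valid_mdy = !is.na(as_mdy), valid_dmy = !is.na(as_dmy)))
```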
<(.+?)>.+?</\\1>
The above regex matches the first occurrence, if any, of a tag such as <string>, followed by at least one character, up to the next matching closing tag </string>. The backreference \\1 requires the closing tag to repeat the text captured inside the opening tag, e.g.:
pat5 <- "<(.+?)>.+?</\\1>"
dummy_str5 <- c("<123> </123>", "<123> </a23>", "<123> abcd </123>abcd", "<123></123>",
"<123> <123> </123>", "<123> abc </124>", "abc<123> def <<</1234>")
data.frame(string = dummy_str5, match = str_extract(dummy_str5, pat5))
##                   string              match
## 1           <123> </123>       <123> </123>
## 2           <123> </a23>               <NA>
## 3  <123> abcd </123>abcd  <123> abcd </123>
## 4            <123></123>               <NA>
## 5     <123> <123> </123> <123> <123> </123>
## 6       <123> abc </124>               <NA>
## 7 abc<123> def <<</1234>               <NA>
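The same pattern can pull every matching tag pair out of a longer string. The sketch below uses base R with `perl = TRUE` and a hypothetical HTML-like string. It is an illustration of the backreference, not a general way to parse HTML:

```r
# Hypothetical HTML-like input; the last tag pair is deliberately mismatched
html <- "<b>bold</b> plain <i>italic</i> <b>mismatch</i>"

# \\1 must repeat whatever the first group captured in the opening tag
pairs <- regmatches(html, gregexpr("<(.+?)>.+?</\\1>", html, perl = TRUE))[[1]]
print(pairs)
```

The mismatched pair is skipped because no closing tag repeats the captured b.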
The hint that “some characters are more revealing than others” suggests that the way certain characters are represented is the key. Most of the string is lowercase letters and digits, but reading the first few characters reveals nothing. Capital letters are sprinkled throughout, so we look at only these:
cipher <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7L
j8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zE
crop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
gsub("[^A-Z]", "", cipher)
## [1] "CONGRATULATIONSYOUAREASUPERNERD"
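The same filtering idea can be sketched without a regex by splitting the string into characters and keeping those found in the built-in `LETTERS` vector. The toy string below is hypothetical and much shorter than the real cipher:

```r
# Hypothetical toy string with a hidden word in capitals
toy <- "xHa1Eb.Lc2LdO!"

# Regex approach from the text: delete everything that is not A-Z
by_regex <- gsub("[^A-Z]", "", toy)

# Alternative sketch: split into characters and keep the capitals
chars <- strsplit(toy, "")[[1]]
by_filter <- paste(chars[chars %in% LETTERS], collapse = "")

print(by_regex)
print(by_filter)
```

Character ranges such as `[A-Z]` can depend on the locale in R, so `[^[:upper:]]` or the `LETTERS` filter may be the safer choice outside plain ASCII settings.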
Ouch! With discovery comes insult. I’ve been called worse though.