We begin by loading the required libraries. Because stringr is part of the tidyverse, loading tidyverse is sufficient.
library(tidyverse)
We first initialize the raw data and isolate the names from phone numbers as provided in the textbook.
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5543642Dr. Julius Hibbert"
(name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}" ) ) )
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
Next, we construct a tibble with three columns: the raw name string, the first name, and the last name. The first and last names start out as empty strings.
df = tibble( raw_string = name ) # Store the raw list of names in a tibble
df$first_name = character(length=length(name)) # allocate for the first name but leave blank
df$last_name = character(length=length(name)) # allocate for the last name but leave blank
To parse and map the names, we illustrate two methods.
Both methods use the same underlying regular expression strategy. Rather than use one large regular expression, we test for three patterns (regular, titled, and inverted names) and handle each one separately.
We also use the str_match function explained in Wickham’s R for Data Science. It returns a character matrix with one row per input string, whose columns hold the complete match followed by each captured group, in order.
Field 1 (the first column) is always the complete match and is ignored. The remaining fields are the captured groups, which map to first and last names as follows:
| Case | First Name | Last Name |
|---|---|---|
| Regular Name | Field 2 | Field 3 |
| Titled Name | Field 3 | Field 4 |
| Inverted Name | Field 3 | Field 2 |
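To make the field numbering concrete, here is a minimal sketch of what str_match returns for a single inverted name. The simplified pattern and the variable name `m` are illustrative assumptions, not the expressions used in the rest of this section.

```r
library(stringr)

# Simplified inverted-name pattern (illustration only): "Last, First"
m <- str_match("Simpson, Homer", "(\\w+),\\s+(\\w+)")

dim(m)    # one row per input string, one column per field
m[1, 1]   # Field 1: the complete match, "Simpson, Homer"
m[1, 2]   # Field 2: first captured group, the last name "Simpson"
m[1, 3]   # Field 3: second captured group, the first name "Homer"
```

For an input that does not match, every column of that row is NA.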
We define the regular expressions in advance and store them in named variables for ease of reference.
rgxInvertedName = "(\\w+),\\s+(\\w+[.]?)(\\s+\\w+)?"
rgxTitledName = "\\b(\\w+\\.)\\s+(\\w+)\\s+(\\w+)(\\s+\\w+)?"
rgxRegularName = "^(\\w+)\\s+(\\w+)"
The procedural approach uses a for-loop to test each name in turn. The solution below works but is verbose.
for (i in seq_along(df$raw_string)) {
  raw_string = df$raw_string[i]  # a plain character string rather than a 1x1 tibble

  # Case 1: inverted name ("Last, First"): first name is field 3, last name is field 2
  detectInverted = str_detect(raw_string, rgxInvertedName)
  if (detectInverted) {
    matches = str_match(raw_string, rgxInvertedName)
    df[i, "first_name"] = matches[1, 3]
    df[i, "last_name"] = matches[1, 2]
    next
  }

  # Case 2: titled name ("Title First Last"): first name is field 3, last name is field 4
  detectTitledName = str_detect(raw_string, rgxTitledName)
  if (detectTitledName) {
    matches = str_match(raw_string, rgxTitledName)
    df[i, "first_name"] = matches[1, 3]
    df[i, "last_name"] = matches[1, 4]
    next
  }

  # Case 3: regular name ("First Last"): first name is field 2, last name is field 3
  detectRegularName = str_detect(raw_string, rgxRegularName)
  if (detectRegularName) {
    matches = str_match(raw_string, rgxRegularName)
    df[i, "first_name"] = matches[1, 2]
    df[i, "last_name"] = matches[1, 3]
    next
  }
}
df # The dataframe is displayed and shown to produce correct results.
## # A tibble: 6 x 3
## raw_string first_name last_name
## <chr> <chr> <chr>
## 1 Moe Szyslak Moe Szyslak
## 2 Burns, C. Montgomery C. Burns
## 3 Rev. Timothy Lovejoy Timothy Lovejoy
## 4 Ned Flanders Ned Flanders
## 5 Simpson, Homer Homer Simpson
## 6 Dr. Julius Hibbert Julius Hibbert
Using vectorized operations, we can extract the first and last names with less code. First, we store a boolean vector for each name type, marking the names where that type is detected. Then, we store the string matches for each type in a character matrix. Finally, we assign to the data frame conditionally for each type of name.
detectInverted = str_detect(df$raw_string, rgxInvertedName )
detectRegular = str_detect(df$raw_string, rgxRegularName)
detectTitledName = str_detect(df$raw_string, rgxTitledName)
matchInverted = str_match( df$raw_string, rgxInvertedName)
matchRegular = str_match(df$raw_string, rgxRegularName)
matchTitled = str_match(df$raw_string, rgxTitledName)
df$first_name[detectInverted] = matchInverted[detectInverted,3]
df$last_name[detectInverted] = matchInverted[detectInverted,2]
df$first_name[detectRegular] = matchRegular[detectRegular, 2]
df$last_name[detectRegular] = matchRegular[detectRegular, 3]
df$first_name[detectTitledName] = matchTitled[detectTitledName, 3]
df$last_name[detectTitledName] = matchTitled[detectTitledName, 4]
df
## # A tibble: 6 x 3
## raw_string first_name last_name
## <chr> <chr> <chr>
## 1 Moe Szyslak Moe Szyslak
## 2 Burns, C. Montgomery C. Burns
## 3 Rev. Timothy Lovejoy Timothy Lovejoy
## 4 Ned Flanders Ned Flanders
## 5 Simpson, Homer Homer Simpson
## 6 Dr. Julius Hibbert Julius Hibbert
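The same mapping can be sketched more compactly by relying on the fact that str_match returns NA for names that do not match a pattern. The version below is only an illustration: it assumes dplyr's coalesce (loaded with tidyverse) to pick the first non-missing candidate in the same order of precedence as the loop, and the names `inv`, `ttl`, `reg`, and `df2` are hypothetical.

```r
library(tidyverse)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

rgxInvertedName <- "(\\w+),\\s+(\\w+[.]?)(\\s+\\w+)?"
rgxTitledName   <- "\\b(\\w+\\.)\\s+(\\w+)\\s+(\\w+)(\\s+\\w+)?"
rgxRegularName  <- "^(\\w+)\\s+(\\w+)"

# One match matrix per pattern; rows for non-matching names are all NA
inv <- str_match(name, rgxInvertedName)
ttl <- str_match(name, rgxTitledName)
reg <- str_match(name, rgxRegularName)

# coalesce() takes, element by element, the first non-NA value
df2 <- tibble(
  raw_string = name,
  first_name = coalesce(inv[, 3], ttl[, 3], reg[, 2]),
  last_name  = coalesce(inv[, 2], ttl[, 4], reg[, 3])
)
df2
```

This avoids the separate str_detect calls, at the cost of making the precedence between patterns implicit in the argument order.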
To identify titled names, we observe that a title is a run of one or more word characters ending in a period, followed by two further words (in this data, the title always begins the string). str_detect returns the required boolean vector below.
rgxTitledName = "\\b\\w+\\.\\s+\\w+\\s+\\w+"
(bool_TitledName = str_detect(name, rgxTitledName) )
## [1] FALSE FALSE TRUE FALSE FALSE TRUE
str_view_all(name, rgxTitledName) # Matches on titled names: Rev. Timothy Lovejoy and Dr. Julius Hibbert
Detecting a second name involves two cases, written as grouped patterns and combined with the “OR” operator |.
First case: the name is not inverted. Look for three words, not counting an optional title.
Second case: the name is inverted and contains three words.
rgxSecondName = "(\\b(\\w+\\.\\s+)?\\w+\\s+\\w+\\s+\\w+)|(\\w+,\\s+\\w+[.]?\\s+\\w+[.]?)"
str_view_all(name, rgxSecondName )
The boolean vector is shown below.
(bool_second_name = str_detect( name,rgxSecondName ))
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
For each regular expression below, we describe the type of string it matches and give an example. The description of each regex’s purpose is in the comment next to its definition.
rgxA = "[0-9]+\\$" # Matches a numeric string of length > 0 followed by a dollar sign
matchA = "The following is matched: 11234$"
str_view_all(matchA, rgxA)
rgxB = "\\b[a-z]{1,4}\\b" # Matches a word boundary followed by 1-4 lowercase letters followed by word boundary
matchB = "There is no justice." # Matches "is" and "no"
str_view_all(matchB, rgxB)
rgxC = ".*?\\.txt$" # Zero or more of any character followed by the suffix .txt at the end of the string
matchC = c("myfile.txt", ".txt") # Two matches because the prefix text is optional
str_view_all(matchC, rgxC)
rgxD = "\\d{2}/\\d{2}/\\d{4}" # A date-like string such as month/day/year, where two digits for month and day and four digits for year are mandatory
matchD = c("02/01/2019", "11/15/200010") # in the second string, only the first 4 digits of the year are matched
str_view_all(matchD, rgxD)
rgxE = "<(.+?)>.+?</\\1>" # Matches an opening tag, some body text, and a closing tag whose name repeats the captured opening tag name
matchE = "<foo>Some text body</foo> and stuff outside of the match."
str_view_all(matchE, rgxE)
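Since str_view_all only renders the matches visually, the behaviour described in the comments can also be checked programmatically. The sketch below pairs each pattern with a matching and a non-matching input; the non-matching inputs and the variable names are illustrative assumptions added here.

```r
library(stringr)

# Digits must be followed by a literal dollar sign
hasDollar <- str_detect(c("11234$", "$11234"), "[0-9]+\\$")

# Only standalone lowercase words of 1 to 4 letters are extracted
shortWords <- str_extract_all("There is no justice.", "\\b[a-z]{1,4}\\b")[[1]]

# The suffix .txt must sit at the very end of the string
isTxt <- str_detect(c("myfile.txt", ".txt", "myfile.txt.bak"), ".*?\\.txt$")

# Two-digit month and day are required; extra trailing digits are not matched
dates <- str_extract(c("02/01/2019", "11/15/200010", "2/1/2019"),
                     "\\d{2}/\\d{2}/\\d{4}")

# The backreference \\1 forces the closing tag to repeat the opening tag name
sameTag <- str_detect(c("<foo>body</foo>", "<foo>body</bar>"), "<(.+?)>.+?</\\1>")

hasDollar   # TRUE FALSE
shortWords  # "is" "no"
isTxt       # TRUE TRUE FALSE
dates       # "02/01/2019" "11/15/2000" NA
sameTag     # TRUE FALSE
```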
A natural first assumption is that the ciphertext was produced by a simple substitution cipher, in which case frequency analysis of the text can help in guessing the plaintext.
So I split the text into individual characters using str_split, counted how often each character occurs, and compared the most frequent characters in the sample with the most common letters in English.
exercise9str = "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
exer9_chars = str_split(exercise9str, "") # split into individual characters
exer9_table = table( exer9_chars) # for a frequency count of each character
freq = data.frame(names = names(exer9_table), values = as.vector(exer9_table)) # put in a data frame
(mostCommonFreq = freq[order(freq$values, decreasing=TRUE),] ) # order letters by most frequent usage
## names values
## 16 c 12
## 8 5 11
## 35 o 11
## 49 w 11
## 23 g 9
## 33 n 9
## 22 f 8
## 28 j 8
## 40 r 8
## 44 t 8
## 18 d 7
## 26 i 7
## 29 k 7
## 53 z 7
## 3 0 6
## 12 9 6
## 15 b 6
## 37 p 6
## 6 3 5
## 7 4 5
## 9 6 5
## 10 7 5
## 2 . 4
## 4 1 4
## 5 2 4
## 11 8 4
## 13 a 4
## 14 A 4
## 25 h 4
## 41 R 4
## 42 s 4
## 48 v 4
## 20 e 3
## 21 E 3
## 32 m 3
## 34 N 3
## 36 O 3
## 39 q 3
## 47 U 3
## 51 y 3
## 43 S 2
## 45 T 2
## 46 u 2
## 50 x 2
## 1 ! 1
## 17 C 1
## 19 D 1
## 24 G 1
## 27 I 1
## 30 l 1
## 31 L 1
## 38 P 1
## 52 Y 1
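The counting step can be illustrated on a short string where the result is easy to verify by hand. This is a minimal sketch; the sample text "hello world" and the names `chars` and `counts` are assumptions made for the example.

```r
library(stringr)

# Split into single characters; [[1]] takes the character vector out of the list
chars <- str_split("hello world", "")[[1]]

# Count each character and sort from most to least frequent
counts <- sort(table(chars), decreasing = TRUE)

names(counts)[1]  # the most frequent character, "l"
counts[["l"]]     # it occurs 3 times
counts[["o"]]     # "o" occurs 2 times
```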
Attempts to replace the most frequent characters in the sample with the most frequent letters in English (E, T, A, O) did not work.
Typically, with simple substitutions, simple plaintext words emerge from the ciphertext. That didn’t happen.
However, a stackoverflow.com link provided the immediate solution.
The lowercase letters and digits were a red herring designed to look meaningful. They were all randomly inserted to mask the plaintext message, which is formed by the capital letters.
exercise9str %>% str_replace_all("[^A-Z]", " ") -> theAnswer
theAnswer
## [1] " C O N G R A T U L AT I O N S Y O U A R E A S U P E R N E R D "
Removing the whitespace as well gives the final answer, an awkward truth:
theAnswer %>% str_replace_all(" ", "") -> theAwkwardTruth
theAwkwardTruth
## [1] "CONGRATULATIONSYOUAREASUPERNERD"
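An equivalent way to reach the same result in one step is to extract the capital letters directly rather than blanking out everything else. The helper below is a hypothetical sketch (the name `extract_capitals` and the short test string are assumptions), shown on a small input so the result can be checked by eye.

```r
library(stringr)

# Hypothetical helper: keep only the capital letters of a single string
extract_capitals <- function(x) {
  capitals <- str_extract_all(x, "[A-Z]")[[1]]  # character vector of capitals
  str_c(capitals, collapse = "")                # join them into one string
}

extract_capitals("abCd1Oe2Df3E")  # "CODE"
```

Applied to exercise9str, the same call should reproduce the message found above.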