Assignment 3 - DATA 607

We initialize the session with the required libraries. stringr is included in tidyverse so we just load the latter.

library(tidyverse)

Problem 3

We first initialize the raw data and isolate the names from phone numbers as provided in the textbook.

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5543642Dr. Julius Hibbert"

(name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}" ) ) )

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

Next, we construct a tibble to store 3 columns: raw name string, first and last names. Initially, the first and last names are obviously empty.

df = tibble( raw_string = name )  # Store the raw list of names in a tibble
df$first_name = character(length=length(name))  # allocate for the first name but leave blank
df$last_name = character(length=length(name))   # allocate for the last name but leave blank

To parse and map the names, we illustrate two methods.

The first uses a for-loop to iterate through each person.
The second uses vectorized operations.

Both methods use the same underlying regular expression strategy. Rather than use one large regular expression, we test for 3 patterns (regular, titled and inverted names) and extract the groups from whichever pattern matches.

We use the str_match function explained in Wickham’s R for Data Science to extract the grouped matches in order. It returns a character matrix with one row per input string.
Field 1 of each row is always the full matched text (and is ignored). The remaining fields are the captured groups in order, and they map to first and last names as follows:

Case	First Name	Last Name
Regular Name	Field 2	Field 3
Titled Name	Field 3	Field 4
Inverted Name	Field 3	Field 2
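As a minimal illustration of the field numbering, the sketch below applies str_match to a single inverted name. The two-group pattern is a simplified assumption made for this example only; it is not one of the patterns used in the solution.

```r
library(stringr)

# Simplified pattern for illustration only: "Last, First"
m <- str_match("Simpson, Homer", "(\\w+),\\s+(\\w+)")

dim(m)   # one row per input string, one column per field
m[1, 1]  # field 1: the full matched text
m[1, 2]  # field 2: first captured group (the last name here)
m[1, 3]  # field 3: second captured group (the first name here)
```

Because the last name is captured first, an inverted name takes its first name from field 3 and its last name from field 2, as in the table.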

We define the regular expressions in advance and store them in named variables for ease of reference.

rgxInvertedName = "(\\w+),\\s+(\\w+[.]?)(\\s+\\w+)?"
rgxTitledName = "\\b(\\w+\\.)\\s+(\\w+)\\s+(\\w+)(\\s+\\w+)?"
rgxRegularName = "^(\\w+)\\s+(\\w+)"
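The strategy relies on each name being claimed by exactly one of the three patterns. The following sketch is one way to check that assumption on the six sample names; the `patterns` and `hits` variables are introduced here for illustration and are not part of the original solution.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# The same three patterns, collected in a named vector (illustrative helper)
patterns <- c(
  inverted = "(\\w+),\\s+(\\w+[.]?)(\\s+\\w+)?",
  titled   = "\\b(\\w+\\.)\\s+(\\w+)\\s+(\\w+)(\\s+\\w+)?",
  regular  = "^(\\w+)\\s+(\\w+)"
)

# Logical matrix: one row per name, one column per pattern
hits <- sapply(patterns, function(p) str_detect(name, p))
rownames(hits) <- name
hits

# TRUE when every name matches exactly one pattern
all(rowSums(hits) == 1)
```

If a new name matched two patterns, or none, the final check would return FALSE and the order in which the patterns are tested would start to matter.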

The procedural programming approach is to use a for-loop to test each name iteratively. This solution is shown below and works but is verbose.

for (i in seq_along(df$raw_string)) {

  raw_string = df$raw_string[i]  # plain character value rather than a 1x1 tibble

  # Inverted name ("Last, First"): the captured groups are in reverse order
  if (str_detect(raw_string, rgxInvertedName)) {
      matches = str_match(raw_string, rgxInvertedName)
      df[i, "first_name"] = matches[1,3]
      df[i, "last_name"] = matches[1,2]
      next
  }

  # Titled name ("Title First Last"): the title occupies field 2 and is skipped
  if (str_detect(raw_string, rgxTitledName)) {
      matches = str_match(raw_string, rgxTitledName)
      df[i, "first_name"] = matches[1,3]
      df[i, "last_name"] = matches[1,4]
      next
  }

  # Regular name ("First Last")
  if (str_detect(raw_string, rgxRegularName)) {
      matches = str_match(raw_string, rgxRegularName)
      df[i, "first_name"] = matches[1,2]
      df[i, "last_name"] = matches[1,3]
      next
  }

}

df  # The dataframe is displayed and shown to produce correct results.

## # A tibble: 6 x 3
##   raw_string           first_name last_name
##   <chr>                <chr>      <chr>    
## 1 Moe Szyslak          Moe        Szyslak  
## 2 Burns, C. Montgomery C.         Burns    
## 3 Rev. Timothy Lovejoy Timothy    Lovejoy  
## 4 Ned Flanders         Ned        Flanders 
## 5 Simpson, Homer       Homer      Simpson  
## 6 Dr. Julius Hibbert   Julius     Hibbert

Using vectorized operations we can extract the first and last names with less code. First, we store a boolean vector for each name type, recording which names match it. Then, we store the string matches for each pattern in a character matrix. Finally, we assign the first and last names conditionally to the data frame for each type of name.

detectInverted = str_detect(df$raw_string, rgxInvertedName )
detectRegular = str_detect(df$raw_string, rgxRegularName)
detectTitledName = str_detect(df$raw_string, rgxTitledName)

matchInverted = str_match( df$raw_string, rgxInvertedName)
matchRegular = str_match(df$raw_string, rgxRegularName)
matchTitled = str_match(df$raw_string, rgxTitledName)

df$first_name[detectInverted] = matchInverted[detectInverted,3]
df$last_name[detectInverted] = matchInverted[detectInverted,2]

df$first_name[detectRegular] = matchRegular[detectRegular, 2]
df$last_name[detectRegular] = matchRegular[detectRegular, 3]

df$first_name[detectTitledName] = matchTitled[detectTitledName, 3]
df$last_name[detectTitledName] = matchTitled[detectTitledName, 4]

df

## # A tibble: 6 x 3
##   raw_string           first_name last_name
##   <chr>                <chr>      <chr>    
## 1 Moe Szyslak          Moe        Szyslak  
## 2 Burns, C. Montgomery C.         Burns    
## 3 Rev. Timothy Lovejoy Timothy    Lovejoy  
## 4 Ned Flanders         Ned        Flanders 
## 5 Simpson, Homer       Homer      Simpson  
## 6 Dr. Julius Hibbert   Julius     Hibbert
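As a possible alternative, the same conditional assignment can be sketched with dplyr's case_when, which keeps the three cases in one expression per column. This is an illustrative variant rather than the method used above, and the `parsed` tibble is a name introduced here for the example.

```r
library(stringr)
library(dplyr)
library(tibble)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

rgxInvertedName <- "(\\w+),\\s+(\\w+[.]?)(\\s+\\w+)?"
rgxTitledName   <- "\\b(\\w+\\.)\\s+(\\w+)\\s+(\\w+)(\\s+\\w+)?"
rgxRegularName  <- "^(\\w+)\\s+(\\w+)"

# case_when picks, for each row, the value from the first condition that is TRUE
parsed <- tibble(raw_string = name) %>%
  mutate(
    first_name = case_when(
      str_detect(raw_string, rgxInvertedName) ~ str_match(raw_string, rgxInvertedName)[, 3],
      str_detect(raw_string, rgxTitledName)   ~ str_match(raw_string, rgxTitledName)[, 3],
      str_detect(raw_string, rgxRegularName)  ~ str_match(raw_string, rgxRegularName)[, 2]
    ),
    last_name = case_when(
      str_detect(raw_string, rgxInvertedName) ~ str_match(raw_string, rgxInvertedName)[, 2],
      str_detect(raw_string, rgxTitledName)   ~ str_match(raw_string, rgxTitledName)[, 4],
      str_detect(raw_string, rgxRegularName)  ~ str_match(raw_string, rgxRegularName)[, 3]
    )
  )

parsed
```

A name that matches none of the patterns would be left as NA in both columns instead of an empty string.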

Problem 3b.

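To identify titled names, we observe that a title is a run of one or more word characters followed by a period, which is then followed by a first and a last name. Requiring two words after the period keeps an initial such as "C." in "Burns, C. Montgomery" from being counted as a title. str_detect returns the required boolean vector below.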

rgxTitledName = "\\b\\w+\\.\\s+\\w+\\s+\\w+"

(bool_TitledName = str_detect(name, rgxTitledName) )

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

str_view_all(name, rgxTitledName)  # Matches on titled names: Rev. Timothy Lovejoy and Dr. Julius Hibbert

Problem 3c.

Detecting a second name involves two cases, combined with the “OR” operator | and grouped patterns.
First case: a name that is not inverted. Look for three words, not counting the optional title.
Second case: an inverted name with three words.

rgxSecondName = "(\\b(\\w+\\.\\s+)?\\w+\\s+\\w+\\s+\\w+)|(\\w+,\\s+\\w+[.]?\\s+\\w+[.]?)"
str_view_all(name, rgxSecondName )

The boolean vector is shown below.

(bool_second_name = str_detect( name,rgxSecondName ))

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

Problem 4

For each regular expression, we describe the type of string it matches and give an example. The description of each regex’s purpose is contained in the comment line next to the regex definition.

rgxA = "[0-9]+\\$"   # Matches a numeric string of length > 0 followed by a dollar sign
matchA = "The following is matched:  11234$"
str_view_all(matchA, rgxA)

rgxB = "\\b[a-z]{1,4}\\b" # Matches a word boundary followed by 1-4 lowercase letters followed by word boundary
matchB = "There is no justice."   # Matches "is" and "no"
str_view_all(matchB, rgxB)

rgxC = ".*?\\.txt$"   # Any run of zero or more characters followed by the suffix .txt at the end of the string
matchC = c("myfile.txt", ".txt")  # Two matches because the prefix text is optional
str_view_all(matchC, rgxC)

rgxD = "\\d{2}/\\d{2}/\\d{4}"  # A date-like string: two digits, slash, two digits, slash, four digits (both digits of the first two parts are mandatory)
matchD = c("02/01/2019", "11/15/200010") # for the second string, the year part matches only the first 4 digits
str_view_all(matchD, rgxD)

rgxE = "<(.+?)>.+?</\\1>"  # Matches on angle bracketed text followed by some body text followed by end bracket with same inside text
matchE = "<foo>Some text body</foo> and stuff outside of the match."
str_view_all(matchE, rgxE)
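Since str_view_all renders its matches in a viewer, a minimal sketch using str_extract can show the matched text directly in the console. The inputs "notes.txt.bak" and "<foo>text</bar>" are hypothetical non-matching examples added here for contrast.

```r
library(stringr)

# A: digits followed by a literal dollar sign
resA <- str_extract("The following is matched:  11234$", "[0-9]+\\$")

# B: whole words made of 1-4 lowercase letters
resB <- str_extract_all("There is no justice.", "\\b[a-z]{1,4}\\b")[[1]]

# C: anything ending in .txt; a trailing ".bak" prevents the match
resC <- str_extract(c("myfile.txt", ".txt", "notes.txt.bak"), ".*?\\.txt$")

# D: two digits / two digits / four digits; extra trailing digits are left out
resD <- str_extract(c("02/01/2019", "11/15/200010"), "\\d{2}/\\d{2}/\\d{4}")

# E: the closing tag must repeat the opening tag's name via the backreference \\1
resE <- str_extract(c("<foo>Some text body</foo> and stuff outside of the match.",
                      "<foo>text</bar>"),
                    "<(.+?)>.+?</\\1>")

resA; resB; resC; resD; resE
```

Non-matching inputs come back as NA, which makes the boundary cases of patterns C and E easy to see.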

Problem 9 Decryption

A common first assumption in cryptanalysis is that the ciphertext uses a simple substitution cipher, in which case frequency analysis of the text can be useful in guessing the plaintext.

So I split the text into characters using str_split, counted how often each character occurs in the sample, and compared the most common ones with the most frequent letters in English.

exercise9str = "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

exer9_chars = str_split(exercise9str, "")  # split into individual characters
exer9_table = table( exer9_chars)   # for a frequency count of each character

freq = data.frame(names = names(exer9_table), values = as.vector(exer9_table))  # put in a data frame

(mostCommonFreq = freq[order(freq$values, decreasing=TRUE),] )  # order letters by most frequent usage

##    names values
## 16     c     12
## 8      5     11
## 35     o     11
## 49     w     11
## 23     g      9
## 33     n      9
## 22     f      8
## 28     j      8
## 40     r      8
## 44     t      8
## 18     d      7
## 26     i      7
## 29     k      7
## 53     z      7
## 3      0      6
## 12     9      6
## 15     b      6
## 37     p      6
## 6      3      5
## 7      4      5
## 9      6      5
## 10     7      5
## 2      .      4
## 4      1      4
## 5      2      4
## 11     8      4
## 13     a      4
## 14     A      4
## 25     h      4
## 41     R      4
## 42     s      4
## 48     v      4
## 20     e      3
## 21     E      3
## 32     m      3
## 34     N      3
## 36     O      3
## 39     q      3
## 47     U      3
## 51     y      3
## 43     S      2
## 45     T      2
## 46     u      2
## 50     x      2
## 1      !      1
## 17     C      1
## 19     D      1
## 24     G      1
## 27     I      1
## 30     l      1
## 31     L      1
## 38     P      1
## 52     Y      1

Attempts to replace the most frequent characters in the sample with the most frequent letters in English did not work.

These letters are E, T, A and O.

Typically, with simple substitutions, simple plaintext words emerge from the ciphertext. That didn’t happen.

However, a stackoverflow.com link provided the immediate solution.

https://stackoverflow.com/questions/35542346/r-using-regmatches-to-extract-certain-characters/35542480

The lowercase letters and digits were a red herring designed to look meaningful. They were all randomly inserted to mask the plaintext message, which is formed by the capital letters alone.

exercise9str %>% str_replace_all("[^A-Z]", " ") -> theAnswer

theAnswer

## [1] "     C                  O     N          G            R   A       T    U       L       AT I                O         N       S        Y      O     U        A       R     E      A    S  U   P        E        R           N        E       R      D    "

Removing the remaining whitespace gives the final answer, an awkward truth:

theAnswer %>% str_replace_all(" ", "") -> theAwkwardTruth

theAwkwardTruth

## [1] "CONGRATULATIONSYOUAREASUPERNERD"
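As a variation, the frequency table above shows that the ciphertext also contains four periods and one exclamation mark. The sketch below keeps those two punctuation characters along with the capital letters, on the assumption that they are part of the message; the `decoded` variable is introduced here for illustration.

```r
library(stringr)

exercise9str <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# Keep capital letters plus the period and exclamation mark; drop everything else
decoded <- str_c(str_extract_all(exercise9str, "[A-Z.!]")[[1]], collapse = "")
decoded

# Turn the periods into spaces for readability
str_replace_all(decoded, "\\.", " ")
```

Extracting the wanted characters directly, instead of replacing the unwanted ones with spaces and then removing the spaces, reaches the message in one step and keeps the periods that separate its words.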