Data_607_AS

First, load all required libraries

library(stringr)

Problem #3
Copy the introductory example. The vector name stores the extracted names.

raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
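
The {2,} quantifier in the extraction pattern matters: the raw string contains isolated spaces between phone-number fragments, and a looser quantifier would return them as matches. A minimal sketch on a shortened piece of the input (snippet, with_min_two and with_plus are names made up for this illustration):

```r
library(stringr)

# Shortened piece of the raw data, used only for illustration
snippet <- "555-1239Moe Szyslak(636) 555-0113Burns"

# At least two characters from the class: only the names survive
with_min_two <- unlist(str_extract_all(snippet, "[[:alpha:]., ]{2,}"))

# One or more characters: the lone space after "(636)" is matched as well
with_plus <- unlist(str_extract_all(snippet, "[[:alpha:]., ]+"))

with_min_two  # "Moe Szyslak" "Burns"
with_plus     # "Moe Szyslak" " " "Burns"
```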

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
First, remove all titles. A title is matched as a string of at least 2 letters followed by a ‘.’ and a space.

q1Name <- str_replace(name, "[[:alpha:]]{2,}\\. ", "")
q1Name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Timothy Lovejoy"     
## [4] "Ned Flanders"         "Simpson, Homer"       "Julius Hibbert"
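
Before removing anything, it can help to look at what the title pattern actually matches. The sketch below rebuilds the six names so that it runs on its own; matched is a name introduced for illustration. The initial "C." is not picked up because only one letter precedes the period.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Extract the part of each name that the title pattern would remove
# (NA means the pattern found nothing)
matched <- str_extract(name, "[[:alpha:]]{2,}\\. ")
matched
```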

Next, remove all second names. A second name appears as an initial: a single letter followed by a ‘.’. An initial can also stand in for a first name or last name, however, so the initial is removed only from names with more than 2 components. Once titles are removed, a name that has a second name has 3 components (first name, second name and last name).

q1Name <- ifelse(lengths(str_split(q1Name, " ")) > 2, str_replace(q1Name, " [[:alpha:]]\\. ", " "), q1Name)
q1Name

## [1] "Moe Szyslak"       "Burns, Montgomery" "Timothy Lovejoy"  
## [4] "Ned Flanders"      "Simpson, Homer"    "Julius Hibbert"
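
The component count that drives this step can be inspected directly. A small sketch, where "J. Smith" is a hypothetical name whose first name is only an initial, and x, n_parts and cleaned are illustrative names:

```r
library(stringr)

# "J. Smith" is a hypothetical input, not part of the exercise data
x <- c("Burns, C. Montgomery", "J. Smith")

# Number of space-separated components in each name
n_parts <- lengths(str_split(x, " "))

# Drop the middle initial only when there are more than two components
cleaned <- ifelse(n_parts > 2, str_replace(x, " [[:alpha:]]\\. ", " "), x)

n_parts  # 3 2
cleaned  # "Burns, Montgomery" "J. Smith"
```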

Finally, rearrange the names written as last_name, first_name into the standard first_name last_name

q1Name <- str_replace(q1Name, "(\\w+),\\s(\\w+)","\\2 \\1")
q1Name

## [1] "Moe Szyslak"      "Montgomery Burns" "Timothy Lovejoy" 
## [4] "Ned Flanders"     "Homer Simpson"    "Julius Hibbert"
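
The swap relies on backreferences: \\1 and \\2 in the replacement refer to the first and second parenthesised groups of the pattern. A minimal sketch with hypothetical names (x and swapped are illustrative); a name without a comma does not match and is returned unchanged:

```r
library(stringr)

# Hypothetical inputs: one in "last, first" order, one already in order
x <- c("Doe, Jane", "Jane Doe")

# Group 1 = last name, group 2 = first name; the replacement swaps them
swapped <- str_replace(x, "(\\w+),\\s(\\w+)", "\\2 \\1")
swapped  # "Jane Doe" "Jane Doe"
```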

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.)
The pattern of a title is a string of at least 2 letters followed by a ‘.’ and a space

q2Name <- str_detect(name, "[[:alpha:]]{2,}\\. ")
q2Name

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

Construct a logical vector indicating whether a character has a second name.
First, remove all titles:

q3Name <- str_replace(name, "[[:alpha:]]{2,}\\. ", "")
q3Name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Timothy Lovejoy"     
## [4] "Ned Flanders"         "Simpson, Homer"       "Julius Hibbert"

Next, find all names with more than 2 components:

q3Name <- lengths(str_split(q3Name, " ")) > 2
q3Name

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
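
A possible shortcut, assuming second names only ever appear as a single uppercase initial (an assumption that holds for this data but is not stated in the exercise), is to detect the initial directly. The word boundary \\b keeps the final letter of a title such as "Dr." from counting as an initial. has_second is an illustrative name.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# A standalone uppercase letter followed by ". " is treated as a second name
has_second <- str_detect(name, "\\b[[:upper:]]\\. ")
has_second  # FALSE TRUE FALSE FALSE FALSE FALSE
```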

Problem #4
Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

[0-9]+\$
Description: One or more digits (0 to 9) followed by a literal ‘$’ symbol

pat <- "[0-9]+\\$"
exampleStrs <- c("23 46$","456s df 4r5$","gte arf0 01$&#","$46 %46$","0$")
str_extract(exampleStrs, pat)

## [1] "46$" "5$"  "01$" "46$" "0$"
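
The backslash is what makes ‘$’ a literal character. Left unescaped, ‘$’ is an anchor for the end of the string, which changes the meaning of the pattern. A short sketch of the difference (escaped, unescaped and anchored are illustrative names):

```r
library(stringr)

# Escaped: digits followed by a literal dollar sign
escaped <- str_detect("46$", "[0-9]+\\$")

# Unescaped: digits must sit at the very end of the string,
# but this string ends in "$", so there is no match
unescaped <- str_detect("46$", "[0-9]+$")

# The unescaped form does match when the string ends in digits
anchored <- str_detect("46", "[0-9]+$")

c(escaped, unescaped, anchored)  # TRUE FALSE TRUE
```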

\b[a-z]{1,4}\b
Description: A whole word made up of one to four lowercase letters

pat <- "\\b[a-z]{1,4}\\b"
exampleStrs <- c("hts tht","456s df 4r5$","gte arf0 01$&#","g s hs f","rshg")
str_extract(exampleStrs, pat)

## [1] "hts"  "df"   "gte"  "g"    "rshg"
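
The word boundaries on both sides are what enforce the length limit: a longer word cannot contribute a four-letter piece, because no boundary falls inside it. A sketch with a made-up input (all_short is an illustrative name), using str_extract_all to list every match rather than only the first:

```r
library(stringr)

pat <- "\\b[a-z]{1,4}\\b"

# "hello" has five letters, so it is skipped entirely
all_short <- unlist(str_extract_all("hts tht hello", pat))
all_short  # "hts" "tht"
```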

.*?\.txt$
Description: A string that ends with ‘.txt’, including the bare string ‘.txt’ itself

pat <- ".*?\\.txt$"
exampleStrs <- c("thisfile.txt","whateverfile.txt","...txt","gra$#%txt%$.txt",".txt")
str_extract(exampleStrs, pat)

## [1] "thisfile.txt"     "whateverfile.txt" "...txt"          
## [4] "gra$#%txt%$.txt"  ".txt"
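
Counter-examples make the role of the ‘$’ anchor and the escaped dot clearer. In this sketch the file names are hypothetical and res is an illustrative name; a string with anything after ‘.txt’, or without the literal dot, yields NA:

```r
library(stringr)

pat <- ".*?\\.txt$"

# Hypothetical file names: only the first one ends with a literal ".txt"
res <- str_extract(c("notes.txt", "notes.txt.bak", "notestxt"), pat)
res  # "notes.txt" NA NA
```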

\d{2}/\d{2}/\d{4}
Description: Two digits, followed by a ‘/’ symbol, then another two digits, again followed by a ‘/’ symbol, and finally four more digits. This is the layout of dates such as 01/01/2019, although the pattern does not check that the digits form a valid date

pat <- "\\d{2}/\\d{2}/\\d{4}"
exampleStrs <- c("today is 09/08/2019","09/09/2019 will be tomorrow","Jim's birthday is 02/05/2015","13/32/9999 is not a date","h545rwajh88/23/7355/234/hgw")
str_extract(exampleStrs, pat)

## [1] "09/08/2019" "09/09/2019" "02/05/2015" "13/32/9999" "88/23/7355"
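
The pattern checks only the layout, so strings like 13/32/9999 pass. One way to screen matches afterwards is to parse them with base R's as.Date, which returns NA for impossible dates. This sketch assumes month/day/year order, which the exercise does not state; candidates, parsed and is_valid are illustrative names.

```r
# Three of the strings extracted above
candidates <- c("09/08/2019", "13/32/9999", "88/23/7355")

# Assumed order: month/day/year. Impossible dates parse to NA.
parsed <- as.Date(candidates, format = "%m/%d/%Y")
is_valid <- !is.na(parsed)

is_valid  # TRUE FALSE FALSE
```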

<(.+?)>.+?</\1>
Description: A string that starts with an opening ‘< >’ holding one or more characters between ‘<’ and ‘>’, followed by one or more characters, and ends with a closing ‘</ >’ whose contents are exactly the same characters as in the opening ‘< >’. This is the format of paired HTML tags

pat <- "<(.+?)>.+?</\\1>"
exampleStrs <- c("dhj <a>abc</a>bgfhf", "<a><b>fgerger</a></b>","<a><b>fgerger</b></b>","<a><b>fgerger</b></a>","<div 1-2> RR</aaa>RRR </div 1-2>")
str_extract(exampleStrs, pat)

## [1] "<a>abc</a>"                       "<a><b>fgerger</a>"               
## [3] "<b>fgerger</b>"                   "<a><b>fgerger</b></a>"           
## [5] "<div 1-2> RR</aaa>RRR </div 1-2>"
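
To see what the backreference \\1 is tied to, str_match returns the captured group alongside the full match. A minimal sketch (m and bad are illustrative names; the second input is a made-up string with mismatched tags):

```r
library(stringr)

pat <- "<(.+?)>.+?</\\1>"

# Column 1 holds the full match, column 2 the captured tag name
m <- str_match("dhj <a>abc</a>bgfhf", pat)
m[1, 1]  # "<a>abc</a>"
m[1, 2]  # "a"

# Mismatched opening and closing tags: no match, so the row is NA
bad <- str_match("<a>abc</b>", pat)
is.na(bad[1, 1])  # TRUE
```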

Problem #9
The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

enCodedMessage <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
enCodedMessage

## [1] "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo\nUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO\nd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5\nfy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

The hint is that some of the characters are more revealing than others. We can first split the characters by type: uppercase letters, lowercase letters, digits and punctuation

paste(unlist(str_extract_all(enCodedMessage, "[[:upper:]]")), collapse = "")

## [1] "CONGRATULATIONSYOUAREASUPERNERD"

paste(unlist(str_extract_all(enCodedMessage, "[[:lower:]]")), collapse = "")

## [1] "clcopowzmstcdwnkigvdicpuggvhrynjuwczihqrfpxsjdwpnanwowisdijjkpfdrcocbtyczjataootjtjnecfekrwwwojigdvrfrbzbknbhzgvizcropwgnbqofaotfbwmktszqefyndtkcfgmcgxonhkgr"

paste(unlist(str_extract_all(enCodedMessage, "[[:punct:]]")), collapse = "")

## [1] "....!"

paste(unlist(str_extract_all(enCodedMessage, "[[:digit:]]")), collapse = "")

## [1] "1087792855078035307553364116224905651724639589659490545"

We can clearly see that the uppercase letters spell out a message, but the punctuation needs to be added back to separate the words

deCodedMessage <- paste(unlist(str_extract_all(enCodedMessage, "[[:upper:][:punct:]]")), collapse = "")
deCodedMessage

## [1] "CONGRATULATIONS.YOU.ARE.A.SUPERNERD!"
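
For display, the dots can be swapped for spaces. A small sketch, assuming the dots act as word separators; decoded and readable are names introduced here, and fixed() makes the dot a literal character rather than a regex wildcard:

```r
library(stringr)

decoded <- "CONGRATULATIONS.YOU.ARE.A.SUPERNERD!"

# Replace every literal "." with a space
readable <- str_replace_all(decoded, fixed("."), " ")
readable  # "CONGRATULATIONS YOU ARE A SUPERNERD!"
```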

Data_607_AS_3

Euclid Zhang

9/8/2019