Data 607 Assignment Week 3

R Markdown

References:

https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf

https://www.hackerearth.com/zh/practice/machine-learning/advanced-techniques/regular-expressions-string-manipulation-r/tutorial/

https://rstudio-pubs-static.s3.amazonaws.com/74603_76cd14d5983f47408fdf0b323550b846.html

https://stackoverflow.com/questions/33826650/last-name-first-name-to-first-name-last-name

Problem 3: Copy the introductory example. The vector name stores the extracted names. Solution : (a) Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

library(stringr)
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

extractedNames <- unlist(str_extract_all(raw.data, "[[A-z]., ]{2,}"))
#For debugging: extractedNames

#Now, take the extracted names and convert the names to first and last names, but only to the ones where a comma is present.
#(\\w+) matches the names, ,\\s matches ", "(comma and space), \\2 \\1 returns the names in opposite order.

reverseNames <- unlist(sub("(\\w+),\\s(\\w+)","\\2 \\1", extractedNames))
#reverseNames
  
#Now, take care of anomalies as a result of the name reversal.  NOTE: If there was more than > 1 anomaly put this into a loop.  
position <- which(str_detect(reverseNames, "[:upper:]\\s[[:alpha:]]{1}.") == TRUE)[1]

if(position > 0){
  reverseNames[position] <- sub("(\\w+)\\s(\\w+).\\s(\\w+)","\\3 \\1. \\2", reverseNames[position])
}

reverseNames

## [1] "Moe Szyslak"          "Montgomery C. Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

titles <- str_detect(extractedNames, pattern="Rev|Dr[.]")
titles

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

Construct a logical vector indicating whether a character has a second name.

middle <- str_detect(extractedNames, "[[:alpha:]]{1}\\, [:upper:]. ")
middle

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

Problem 4: Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression. Solution: (a) [0-9]+$

[0-9] ==> Match any character in set; matches a character in the range of “0” to “9”. Case sensitive

‘+’ ==> Quantfier. Match 1 or more of the preceding token

$ ==> Matches the dollar sign at the end of a number set.

str <- "sadjfalskdfj$9348349$0934830adskjkallkjafklj"
str_extract(str,"[0-9]+\\$")

## [1] "9348349$"

==> Matches a word boundary position between a word character and non-word character or position (start / end of string).

[a-z] ==> Matches a character having a character code between the two specified characters inclusive.

{1,4} ==> Matches the specified quantity of the previous token. {1,3} will match 1 to 3. {3} will match exactly 3. {3,} will match 3 or more.

str <-  "ajay $ajay6+elephant dog5959"
str_extract(str,"\\b[a-z]{1,4}\\b")

## [1] "ajay"

.*?.txt$

. ===> Matches any character except linebreaks. Equivalent to [^\n\r].

===> Matches 0 or more of the preceding token.

? ===> Makes the preceding quantifier lazy, causing it to match as few characters as possible. By default, quantifiers are greedy, and will match as many characters as possible.

.txt===> Matches anything ends in ‘.txt’ characters.

str <- "394839sadksdafkj948490#$#$##^&^&^& file.txt"
unlist(str_extract_all(str, ".*?\\.txt$"))

## [1] "394839sadksdafkj948490#$#$##^&^&^& file.txt"

// ===> Matches any digit character (0-9). Equivalent to [0-9].

{2} ===> Matches the specified quantity of the previous token. {1,3} will match 1 to 3. {3} will match exactly 3. {3,} will match 3 or more.

/ ===> Matches a ‘/’ character.

str <- "afkljasdfkljsd09/04/2020jasdfkljasdf"
unlist(str_extract_all(str, "\\d{2}/\\d{2}/\\d{4}"))

## [1] "09/04/2020"

<(.+?)>.+?</> < ===> Matches the ‘<’ character.

(.+?) ===> Groups multiple tokens together and creates a capture group for extracting a substring or using a backreference.

.*? ===> Discussed in previous example.

/ ===> Matches the ‘/’ character.

/ ===> Matches the results of a capture group. For example matches the results of the first capture group & matches the third.

===> Matches the '>' character.

str_extract("afkljasd<tag>fkljsd09/04/2020ja</tag>sdfkljasdf","<(.+?)>.+?</\\1>")

## [1] "<tag>fkljsd09/04/2020ja</tag>"

Problem 9. The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com. Solution:

Observation: Investigate characters that are capitals.

encodedmessage <- c("clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr")



message <- str_extract_all(encodedmessage,"[[:upper:]]")
message

## [[1]]
##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"

Data 607 Assignment Week 3

Ajay Arora

September 2, 2019

R Markdown