Expressions Homework

Initial Setup

For this homework, the stringr library is needed to manipulate character (string) data.

library(stringr)

Loading Data

Load the given raw data and extract the names into a vector called names

data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
 
names <- unlist(str_extract_all(data, "[[:alpha:]., ]{2,}"))

names

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
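The quantifier {2,} matters in the pattern above: the raw data contains isolated spaces inside phone numbers (for example in "555 8904"), and a plain + would return those as matches too. A minimal sketch on a fragment of the data; the variable names fragment, loose and strict are illustrative and not part of the exercise.

```r
library(stringr)

# A fragment of the raw data: a phone number with an inner space, then a name
fragment <- "555 8904Ned Flanders"

# With "+", the lone space inside the phone number counts as a match
loose <- unlist(str_extract_all(fragment, "[[:alpha:]., ]+"))

# With "{2,}", matches shorter than two characters are dropped
strict <- unlist(str_extract_all(fragment, "[[:alpha:]., ]{2,}"))

print(loose)
print(strict)
```

Here loose contains a single-space element in front of the name, while strict contains only the name.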

Question 3

Copy the introductory example. The vector names stores the extracted names.

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

# Loop over the vector and reorder any "last_name, first_name" entries
for (i in seq_along(names)) {
  # grepl() tests whether the name contains a comma
  if (grepl(",", names[[i]], fixed = TRUE)) {
    # split the string at the comma and trim the surrounding whitespace
    get_str <- str_trim(unlist(str_split(names[[i]], ",")))
    # swap the two parts and join them with a single space
    names[[i]] <- str_c(get_str[2], " ", get_str[1])
  }
}

names

## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
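The same rearrangement can also be sketched without a loop, using a single vectorized replacement with backreferences. This is an alternative, not the approach used in this homework; raw_names and tidy_names are illustrative names, and the pattern assumes each entry contains at most one comma.

```r
library(stringr)

# The six names as extracted from the raw data
raw_names <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
               "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Group 1 is the text before the comma, group 2 the text after it.
# "\\s*" consumes the whitespace that follows the comma, so nothing needs trimming.
# Entries without a comma do not match and are returned unchanged.
tidy_names <- str_replace(raw_names, "^(.+),\\s*(.+)$", "\\2 \\1")

print(tidy_names)
```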

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

The grepl function is used to indicate whether each character has a title.

# use grepl() to detect "Rev." or "Dr."; the dots are escaped so they match literally
title <- grepl("Rev\\.|Dr\\.", names)
# pair each name with its logical flag for display
names_title <- data.frame(names, title)
names_title

##                  names title
## 1          Moe Szyslak FALSE
## 2  C. Montgomery Burns FALSE
## 3 Rev. Timothy Lovejoy  TRUE
## 4         Ned Flanders FALSE
## 5        Homer Simpson FALSE
## 6   Dr. Julius Hibbert  TRUE
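One detail worth checking in a title pattern is the dot: unescaped, it matches any character, so "Dr." also matches the first three letters of a name such as "Drake". The sketch below compares an unescaped pattern with an escaped, anchored one. "Andrea Drake" is a made-up name that is not in the data, and anchoring with ^ assumes the title always comes first.

```r
# "Andrea Drake" is a hypothetical name used only to expose the problem
candidates <- c("Dr. Julius Hibbert", "Rev. Timothy Lovejoy", "Andrea Drake")

# Unescaped dots: "Dr." matches "Dra" inside "Drake"
loose_title <- grepl("Rev.|Dr.", candidates)

# Escaped dot and start anchor: only a real title at the beginning matches
strict_title <- grepl("^(Rev|Dr)\\.", candidates)

print(loose_title)
print(strict_title)
```

The unescaped pattern flags all three entries, while the escaped pattern flags only the first two.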

Construct a logical vector indicating whether a character has a second name.

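# a second name appears here as an initial: an uppercase letter followed by a period
mid_name <- str_detect(names, "[[:upper:]]\\.")
# pair each name with its logical flag for display
name.mid_name <- data.frame(names, mid_name)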
name.mid_name

##                  names mid_name
## 1          Moe Szyslak    FALSE
## 2  C. Montgomery Burns     TRUE
## 3 Rev. Timothy Lovejoy    FALSE
## 4         Ned Flanders    FALSE
## 5        Homer Simpson    FALSE
## 6   Dr. Julius Hibbert    FALSE
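The initial-based test works for this data because the only second name is abbreviated. As a hedged alternative, a second name could be defined as "more than two words once the title is removed". The sketch below assumes that definition; "Homer Jay Simpson" is a hypothetical entry added only to show where the two tests differ.

```r
library(stringr)

# The six rearranged names plus one hypothetical entry with a spelled-out second name
people <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
            "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert",
            "Homer Jay Simpson")

# Test 1: an uppercase letter followed by a period (an abbreviated second name)
by_initial <- str_detect(people, "[[:upper:]]\\.")

# Test 2: strip a leading title, then count whitespace-separated words
without_title <- str_remove(people, "^(Rev|Dr)\\.\\s*")
by_word_count <- str_count(without_title, "\\S+") > 2

print(data.frame(people, by_initial, by_word_count))
```

Both tests agree on the six original names; only the word count flags the hypothetical spelled-out second name.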

Question 4

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

[0-9]+\$

This matches one or more digits immediately followed by a literal dollar sign. The backslash escapes $, which would otherwise anchor the match to the end of the string.

mystring <- "the cost of two t-shirts is 50$"

str_extract(mystring, "[0-9]+\\$")

## [1] "50$"
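To see what the escape changes, the sketch below runs the pattern with and without the backslash on a made-up sentence (the text and the variable names are illustrative). Escaped, $ is a literal dollar sign; unescaped, it is an end-of-string anchor.

```r
library(stringr)

# Made-up text: two prices and a number at the very end
prices <- "shirts cost 50$, socks cost 8$, stock 12"

# Escaped "$": digits followed by a literal dollar sign
with_dollar <- unlist(str_extract_all(prices, "[0-9]+\\$"))

# Unescaped "$": digits at the end of the string
at_end <- str_extract(prices, "[0-9]+$")

print(with_dollar)
print(at_end)
```

The escaped pattern returns the two prices, and the unescaped pattern returns only the trailing "12".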

\b[a-z]{1,4}\b

This matches a standalone word of one to four lowercase letters. The word boundaries (\b) ensure that the letters are not part of a longer word and are not attached to digits.

mystring=c("can","fatime","of","F","f","wxsvyz","with","abc popo 12c","d12c")
str_extract(mystring,"\\b[a-z]{1,4}\\b")

## [1] "can"  NA     "of"   NA     "f"    NA     "with" "abc"  NA

str_detect(mystring,"\\b[a-z]{1,4}\\b")

## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
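Applied to a whole sentence with str_extract_all, the same pattern collects every short lowercase word and skips the longer ones. A minimal sketch on a made-up sentence:

```r
library(stringr)

# Made-up sentence: "quick" and "brown" have five letters and should be skipped
sentence <- "the quick brown fox is lazy"

# Every standalone run of 1 to 4 lowercase letters
short_words <- unlist(str_extract_all(sentence, "\\b[a-z]{1,4}\\b"))

print(short_words)
```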

.*?\.txt$

This matches any string that ends in .txt, such as a text file name. The lazy .*? matches whatever precedes the extension, the escaped \. matches a literal period, and $ anchors the match to the end of the string.

mystring=c("ali.txt","123&.txt"," the file has a.txt extension"," names.txt","alex.ipt")
str_extract(mystring,".*?\\.txt$")

## [1] "ali.txt"    "123&.txt"   NA           " names.txt" NA

str_detect(mystring,".*?\\.txt$")

## [1]  TRUE  TRUE FALSE  TRUE FALSE
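The escaped period is what makes this a test for a file extension. The sketch below compares the pattern with an unescaped variant on a few made-up file names; without the backslash, any character followed by "txt" at the end of the string is accepted.

```r
library(stringr)

# Made-up file names
files <- c("notes.txt", "notes_txt", "notes.txt.bak")

# Escaped period: the string must end in a literal ".txt"
escaped <- str_detect(files, ".*?\\.txt$")

# Unescaped period: any character followed by "txt" at the end is accepted
unescaped <- str_detect(files, ".*?.txt$")

print(escaped)
print(unescaped)
```

Only the unescaped variant accepts "notes_txt".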

\d{2}/\d{2}/\d{4}

This matches two digits, a slash, two digits, a slash, and four digits, which is the shape of a date written as dd/mm/yyyy or mm/dd/yyyy. The pattern checks only the format, not whether the date is valid.

mystring <- c("02/15/2000","02-15-2000","born in 02/15/1975")
str_extract(mystring,"\\d{2}/\\d{2}/\\d{4}")

## [1] "02/15/2000" NA           "02/15/1975"

str_detect(mystring,"\\d{2}/\\d{2}/\\d{4}")

## [1]  TRUE FALSE  TRUE
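Because the pattern only checks the shape, it also accepts impossible dates such as 99/99/9999. The sketch below contrasts it with a stricter pattern that assumes the mm/dd/yyyy order used in the examples above. The stricter pattern is an illustration rather than a full validator: it still accepts dates like 02/31/2000.

```r
library(stringr)

dates <- c("02/15/2000", "99/99/9999", "15/02/2000")

# The pattern from the exercise: format only
shape_only <- str_detect(dates, "\\d{2}/\\d{2}/\\d{4}")

# Assumed mm/dd/yyyy: month 01-12, day 01-31, four-digit year
strict_pattern <- "\\b(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/\\d{4}\\b"
month_day_checked <- str_detect(dates, strict_pattern)

print(shape_only)
print(month_day_checked)
```

All three strings pass the format-only pattern, and only the first passes the stricter one.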

<(.+?)>.+?</\1>

This matches a tag-delimited string such as an HTML or XML element: an opening tag, at least one character of content, and a closing tag. The backreference \1 requires the closing tag to carry the same name as the opening tag.

mystring <- c("<html> Alex </html>","<xyz> some text</xyz>"," abc <xyz> some text</xyz>")
str_extract(mystring,"<(.+?)>.+?</\\1>")

## [1] "<html> Alex </html>"   "<xyz> some text</xyz>" "<xyz> some text</xyz>"

str_detect(mystring,"<(.+?)>.+?</\\1>")

## [1] TRUE TRUE TRUE
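The effect of the backreference is easiest to see on a pair of tags that do not agree. The sketch below uses two made-up snippets and also pulls out the captured tag name with str_match, whose second column holds the first capture group.

```r
library(stringr)

tag_pattern <- "<(.+?)>.+?</\\1>"

# Made-up snippets: matching tags, then mismatched tags
snippets <- c("<b>bold</b>", "<a>text</b>")

# The mismatched pair fails because "\\1" must repeat the opening tag name
matched <- str_detect(snippets, tag_pattern)

# Column 2 of str_match() is capture group 1, the tag name (NA when nothing matches)
tag_names <- str_match(snippets, tag_pattern)[, 2]

print(matched)
print(tag_names)
```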

Question 9

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr

There are many approaches, some much simpler than mine, but I chose my own solution: split the string at the periods, then collect the uppercase letters from each piece.

mystring <-"clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

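# split the string at each period; every piece holds one word of the message
split_mystring <- unlist(str_split(mystring, "\\."))

secret_message <- ""

for (i in seq_along(split_mystring)) {
  # extract the uppercase letters of the current piece
  get_secret_words <- unlist(str_extract_all(split_mystring[i], "[[:upper:]]"))

  # concatenate the letters into a single word
  get_secret_char <- ""
  for (j in seq_along(get_secret_words)) {
    get_secret_char <- str_c(get_secret_char, get_secret_words[j])
  }

  # append the word to the message, separated by a space
  secret_message <- str_c(secret_message, " ", get_secret_char)
}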

secret_message

## [1] " CONGRATULATIONS YOU ARE A SUPERNERD"
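One of the simpler approaches can be sketched as follows: keep the uppercase letters together with the periods and the exclamation mark in a single pass, then turn the periods into spaces. This is an alternative to the loop, not the solution used in this homework; the names code, kept and message_text are illustrative.

```r
library(stringr)

code <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# Keep uppercase letters plus "." and "!" (inside brackets the period is literal)
kept <- unlist(str_extract_all(code, "[[:upper:].!]"))

# Join the characters, then replace the periods, which mark word breaks, with spaces
message_text <- str_replace_all(str_c(kept, collapse = ""), fixed("."), " ")

print(message_text)
```

Unlike the loop, this version has no leading space and keeps the closing exclamation mark.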