Stringr Package

library(stringr)

Raw Data

raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
raw.data

## [1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

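We use the predefined character class [:alpha:], which matches alphabetic characters. To that class we add a period, because three names contain one (Burns, Lovejoy, and Hibbert); a comma, because two names contain one (Burns and Simpson); and a space, so that the parts of each name stay together. Lastly, we add the quantifier {2,}, which requires at least two consecutive matching characters, so a stray single character, such as the space inside a phone number, is not returned as a name.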

#unlist flattens the list returned by str_extract_all into a character vector
#str_extract_all extracts every match of the pattern from a string
#str_extract_all(string, pattern, simplify = FALSE)
name <- unlist(str_extract_all(raw.data, "[[:alpha:],. ]{2,}"))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
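
To see why the quantifier matters, here is a minimal sketch on a shortened input. The variable names sample_data, with_plus, and with_min_two are illustrative only and do not appear in the assignment. It compares the quantifier + (one or more) with {2,} (two or more) on the same character class.

```r
library(stringr)

# Hypothetical shortened input: two phone numbers contain a lone space
sample_data <- "(636) 555-0113Burns, C. Montgomery555 8904Ned Flanders"

# "+" accepts runs of length one, so the lone spaces come back as matches
with_plus <- unlist(str_extract_all(sample_data, "[[:alpha:],. ]+"))

# "{2,}" requires at least two characters, which drops the lone spaces
with_min_two <- unlist(str_extract_all(sample_data, "[[:alpha:],. ]{2,}"))

with_plus
with_min_two
```

With +, the spaces inside "(636) 555" and "555 8904" are extracted alongside the names; with {2,}, only the two names remain.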

Identify the listed names that already fit the pattern of first name and last name. Place them in a vector called name_good.

#create the vector name_good
name_good <- c(name[1],name[3],name[4],name[6])
#view the results
name_good

## [1] "Moe Szyslak"          "Rev. Timothy Lovejoy" "Ned Flanders"        
## [4] "Dr. Julius Hibbert"

From the remaining names, create a vector named name_Homer to hold "Simpson, Homer" and a vector named name_Burns to hold "Burns, C. Montgomery".

#Convert Simpson, Homer to Homer Simpson
name_Homer <- c(name[5])
name_Homer

## [1] "Simpson, Homer"

The data element name_Homer needs to be split using str_split

#unlist simplifies a list by producing a vector which contains all the components
#unlist(x, recursive = TRUE, use.names = TRUE)
#use str_split to split up the name
#str_split(string, pattern, n = Inf, simplify = FALSE)
name_Homer <- unlist(str_split(name_Homer, ","))
name_Homer

## [1] "Simpson" " Homer"

name_Homer is now split; however, there is whitespace before the name Homer. This can be fixed using str_trim.

#str_trim trims whitespace from start and end of string
#str_trim(string, side = c("both", "left", "right"))
name_Homer <- str_trim(name_Homer)
name_Homer

## [1] "Simpson" "Homer"

Using the str_c function we will concatenate the names in their desired order.

#str_c(..., sep = "", collapse = NULL)
name_Homer <- str_c(name_Homer[2],name_Homer[1], sep = " ")
name_Homer

## [1] "Homer Simpson"

We will now convert Burns, C. Montgomery to C. Montgomery Burns

name_Burns <- c(name[2])
name_Burns

## [1] "Burns, C. Montgomery"

We will convert name_Burns by first utilizing the str_split function

name_Burns <- unlist(str_split(name_Burns, ","))
name_Burns

## [1] "Burns"          " C. Montgomery"

Using str_trim we will clean up name_Burns

name_Burns <- str_trim(name_Burns)
name_Burns

## [1] "Burns"         "C. Montgomery"

Using str_c we will now concatenate the first and last names

name_Burns <- str_c(name_Burns[2],name_Burns[1], sep = " ")
name_Burns

## [1] "C. Montgomery Burns"

Create a vector entitled SimpsonsCharacters by merging the name_good, name_Homer, and name_Burns vectors.

SimpsonsCharacters <- c(name_good,name_Homer,name_Burns)
SimpsonsCharacters

## [1] "Moe Szyslak"          "Rev. Timothy Lovejoy" "Ned Flanders"        
## [4] "Dr. Julius Hibbert"   "Homer Simpson"        "C. Montgomery Burns"
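
The split, trim, and concatenate steps above can also be sketched as a single vectorized call. This is an alternative to the step-by-step approach, not a replacement for it; it assumes every name that needs reordering has the form "Last, First" with exactly one comma followed by a space. The variable reordered is illustrative only.

```r
library(stringr)

# The six names as extracted earlier, in their original order
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Capture the text before and after ", " and swap the two groups;
# names without a comma do not match and are returned unchanged
reordered <- str_replace(name, "^(.+), (.+)$", "\\2 \\1")
reordered
```

Unlike SimpsonsCharacters, this result keeps the names in the order in which they appear in the raw data.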

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.)

Create a vector named title_chk using several stringr features. First, we use the predefined character class [:alpha:]. Second, we apply the quantifier {2,}, which requires at least two alphabetic characters in a row. Lastly, we add an escaped period (\\.), which matches a literal period rather than any character. Together, the pattern matches two or more letters followed by a period, which identifies Rev. and Dr. but not the initial C. All of this is passed to the str_detect function.

#view the name vector
SimpsonsCharacters

## [1] "Moe Szyslak"          "Rev. Timothy Lovejoy" "Ned Flanders"        
## [4] "Dr. Julius Hibbert"   "Homer Simpson"        "C. Montgomery Burns"

#str_detect(string, pattern)
title_chk <- str_detect(SimpsonsCharacters, "[[:alpha:]]{2,}\\.")
title_chk

## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE

As you can see, str_detect returned TRUE for Rev. Timothy Lovejoy and Dr. Julius Hibbert only.
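
The escape in front of the period is what makes this check work. The following sketch compares the escaped pattern with an unescaped one; the variable names escaped and unescaped are illustrative only.

```r
library(stringr)

SimpsonsCharacters <- c("Moe Szyslak", "Rev. Timothy Lovejoy", "Ned Flanders",
                        "Dr. Julius Hibbert", "Homer Simpson", "C. Montgomery Burns")

# Escaped period: two or more letters followed by a literal "."
escaped <- str_detect(SimpsonsCharacters, "[[:alpha:]]{2,}\\.")

# Unescaped period: two or more letters followed by ANY character,
# which every name in the vector satisfies
unescaped <- str_detect(SimpsonsCharacters, "[[:alpha:]]{2,}.")

escaped
unescaped
```

Without the escape, the pattern no longer distinguishes titles from ordinary words, so every element is flagged.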

Construct a logical vector indicating whether a character has a second name.

By eyeballing the SimpsonsCharacters vector, one can see that C. Montgomery Burns is the only character with a second name. Programmatically, this can be detected with str_detect again, using a pattern that matches a single uppercase letter followed by a literal period. The quantifier {1} applies to the period and makes explicit that exactly one is expected.

#view the name vector
SimpsonsCharacters

## [1] "Moe Szyslak"          "Rev. Timothy Lovejoy" "Ned Flanders"        
## [4] "Dr. Julius Hibbert"   "Homer Simpson"        "C. Montgomery Burns"

#str_detect(string, pattern)
secondary_name <- str_detect(SimpsonsCharacters, "[A-Z]\\.{1}")
secondary_name

## [1] FALSE FALSE FALSE FALSE FALSE  TRUE
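
The pattern above works for this data, but it would also flag any uppercase letter that merely ends an abbreviation. As a hedged sketch, a word boundary (\\b) in front of the letter restricts the match to a standalone initial. The third test string and the variable names are hypothetical and are not part of the assignment data.

```r
library(stringr)

# Two real names plus one hypothetical name ending in an uppercase abbreviation
test_names <- c("C. Montgomery Burns", "Homer Simpson", "Hypothetical Person PHD.")

# Original pattern: any uppercase letter followed by a period
loose <- str_detect(test_names, "[A-Z]\\.")

# Stricter variant: the uppercase letter must start a word,
# so the "D." at the end of "PHD." no longer counts as an initial
strict <- str_detect(test_names, "\\b[A-Z]\\.")

loose
strict
```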

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

Example 1

#assign the example 1 expression to the ex1 variable
ex1 = "[0-9]+\\$"
#create and view good example
ex1_good <-c("1219$","12$19","7$")
ex1_good

## [1] "1219$" "12$19" "7$"

#use str_detect to detect the presence or absence of ex1
str_detect(ex1_good,ex1)

## [1] TRUE TRUE TRUE

#create and view bad example
ex1_bad <-c("1219","0000","7")
ex1_bad

## [1] "1219" "0000" "7"

#use str_detect to detect the presence or absence of ex1
str_detect(ex1_bad,ex1)

## [1] FALSE FALSE FALSE
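
In words, this expression matches one or more digits immediately followed by a literal dollar sign. Because str_detect only reports whether a match exists somewhere in the string, it can help to look at the matched portion itself. A short sketch using str_extract (the variable matched is illustrative only):

```r
library(stringr)

ex1 <- "[0-9]+\\$"
ex1_good <- c("1219$", "12$19", "7$")

# str_extract returns the first matching substring of each element
matched <- str_extract(ex1_good, ex1)
matched
```

For "12$19", only "12$" is matched; the trailing digits are not part of the match, since the pattern is not anchored to the end of the string.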

Example 2

#assign the example 2 expression to the ex2 variable
ex2="\\b[a-z]{1,4}\\b"
#create and view good example
ex2_good <-c("stop","four","lens")
ex2_good

## [1] "stop" "four" "lens"

#use str_detect to detect the presence or absence of ex2
str_detect(ex2_good,ex2)

## [1] TRUE TRUE TRUE

#create and view bad example
ex2_bad <-c("STOP","seven","Crafters")
ex2_bad

## [1] "STOP"     "seven"    "Crafters"

#use str_detect to detect the presence or absence of ex2
str_detect(ex2_bad,ex2)

## [1] FALSE FALSE FALSE
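
In words, this expression matches a whole word made of one to four lowercase letters; the word boundaries (\\b) keep it from matching the first four letters of a longer word. The sketch below applies it to a hypothetical sentence (not from the assignment) to show which words qualify.

```r
library(stringr)

ex2 <- "\\b[a-z]{1,4}\\b"

# Hypothetical sentence mixing short and long lowercase words
sentence <- "stop the seven crafters at once"

# Extract every word of one to four lowercase letters
short_words <- unlist(str_extract_all(sentence, ex2))
short_words
```

The words "seven" and "crafters" are skipped because they are longer than four letters.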

Example 3

#assign the example 3 expression to the ex3 variable
ex3=".*?\\.txt$"
#create and view good example
ex3_good <-c("MyNameIsBrianLiles.txt","Why did you leave early?.txt")
ex3_good

## [1] "MyNameIsBrianLiles.txt"       "Why did you leave early?.txt"

#use str_detect to detect the presence or absence of ex3
str_detect(ex3_good,ex3)

## [1] TRUE TRUE

#create and view bad example
ex3_bad <-c("MyNameIsBrianLiles","Why did you leave early?.text")
ex3_bad

## [1] "MyNameIsBrianLiles"            "Why did you leave early?.text"

#use str_detect to detect the presence or absence of ex3
str_detect(ex3_bad,ex3)

## [1] FALSE FALSE
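
In words, this expression matches any string that ends in ".txt"; the leading .*? allows any characters, including none at all, before the extension. A few edge cases, using hypothetical file names that are not part of the assignment, illustrate this:

```r
library(stringr)

ex3 <- ".*?\\.txt$"

# Hypothetical edge cases: bare extension, extra suffix, uppercase extension
edge_cases <- c(".txt", "notes.txt.bak", "report.TXT")

edge_result <- str_detect(edge_cases, ex3)
edge_result
```

The bare ".txt" matches because .*? may match zero characters. "notes.txt.bak" fails because ".txt" is not at the end, and "report.TXT" fails because matching is case-sensitive by default.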

Example 4

#assign the example 4 expression to the ex4 variable
ex4="\\d{2}/\\d{2}/\\d{4}"
#create and view good example
ex4_good <-c("12/19/2006","07/23/1975","99/99/9999")
ex4_good

## [1] "12/19/2006" "07/23/1975" "99/99/9999"

#use str_detect to detect the presence or absence of ex4
str_detect(ex4_good,ex4)

## [1] TRUE TRUE TRUE

#create and view bad example
ex4_bad <-c("December 19,2006","2006/12/19","99-99-9999")
ex4_bad

## [1] "December 19,2006" "2006/12/19"       "99-99-9999"

#use str_detect to detect the presence or absence of ex4
str_detect(ex4_bad,ex4)

## [1] FALSE FALSE FALSE
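
In words, this expression matches two digits, a slash, two digits, a slash, and four digits, which is the shape of a date such as dd/mm/yyyy or mm/dd/yyyy. It checks the shape only, not whether the date is valid, and it is not anchored. The sketch below shows the effect of adding anchors; the anchored pattern and the test string are assumptions added for illustration.

```r
library(stringr)

ex4 <- "\\d{2}/\\d{2}/\\d{4}"

# Hypothetical anchored variant: the whole string must have the date shape
ex4_anchored <- "^\\d{2}/\\d{2}/\\d{4}$"

# Hypothetical input with an extra digit on each end
padded <- "112/19/20061"

unanchored_hit <- str_detect(padded, ex4)
anchored_hit <- str_detect(padded, ex4_anchored)

unanchored_hit
anchored_hit
```

The unanchored pattern finds "12/19/2006" inside the longer string, while the anchored variant rejects it.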

Example 5

#assign the example 5 expression to the ex5 variable
ex5="<(.+?)>.+?</\\1>"
#create and view good example
ex5_good <-c("<title>Black Panther</title>","<h1>Cooking with Laura</h1>",
             "<p>I believe in equality</p>")
ex5_good

## [1] "<title>Black Panther</title>" "<h1>Cooking with Laura</h1>" 
## [3] "<p>I believe in equality</p>"

#use str_detect to detect the presence or absence of ex5
str_detect(ex5_good,ex5)

## [1] TRUE TRUE TRUE

#create and view bad example
ex5_bad <-c("<title>Black Panther<title>","h1>Cooking with Laura<h1",
            "<p></p>")
ex5_bad

## [1] "<title>Black Panther<title>" "h1>Cooking with Laura<h1"   
## [3] "<p></p>"

#use str_detect to detect the presence or absence of ex5
str_detect(ex5_bad,ex5)

## [1] FALSE FALSE FALSE
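
In words, this expression matches an opening tag, at least one character of content, and a closing tag with the same name. The parentheses capture the tag name and \\1 refers back to it. As a sketch, str_match exposes the captured group (the variable m is illustrative only):

```r
library(stringr)

ex5 <- "<(.+?)>.+?</\\1>"

# str_match returns a matrix: column 1 is the full match,
# column 2 is the text captured by the first group (the tag name)
m <- str_match("<title>Black Panther</title>", ex5)
m
```

This also explains why "<p></p>" fails: the .+? between the tags requires at least one character of content.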

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others!

secret <- c(
"clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr")

#the code is cracked once all of the uppercase letters are extracted
secret <- unlist(str_extract_all(secret, "[[:upper:]]"))
secret

##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"
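
The extracted letters can be joined into a readable sentence. The sketch below rests on an assumption that is not stated in the assignment: that the periods in the string act as word separators and the exclamation mark ends the message. The variable names are illustrative, and the string is rebuilt from the three lines above without the line breaks.

```r
library(stringr)

# The secret string, joined from its three lines
secret_text <- str_c(
  "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO",
  "d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5",
  "fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
)

# Keep uppercase letters plus "." and "!" (assumed to be punctuation of the message)
pieces <- unlist(str_extract_all(secret_text, "[[:upper:].!]"))

# Collapse into one string, then turn the periods into spaces
message_text <- str_replace_all(str_c(pieces, collapse = ""), fixed("."), " ")
message_text
```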

DATA 607 - Week 3 Assignment

Brian Liles

February 15, 2018