library(stringr)
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
raw.data
## [1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
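Use the predefined character class [:alpha:], which matches alphabetic characters. In addition, we add a period to the class because three names contain one (Burns, Lovejoy, and Hibbert), a comma because two names contain one (Burns and Simpson), and a space so that the parts of each name stay together. Lastly, we add the quantifier {2,}, which requires at least two consecutive characters from the class and so drops the isolated spaces inside the phone numbers.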
#unlist simplifies it to produce a vector which contains all the atomic components which occur in x.
#use str_extract_all which extracts matching patterns from a string
#str_extract_all(string, pattern, simplify = FALSE)
name <- unlist(str_extract_all(raw.data, "[[:alpha:],. ]{2,}"))
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
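To see what the {2,} quantifier contributes, the sketch below reruns the extraction with a + quantifier instead. The variable names loose and stray are illustrative and are not part of the original walkthrough.

```r
library(stringr)

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Same character class, but "+" also accepts runs of length one
loose <- unlist(str_extract_all(raw.data, "[[:alpha:],. ]+"))

# Matches that are a single character long (the spaces inside phone numbers)
stray <- loose[str_length(loose) == 1]

length(loose)  # more matches than the six names
stray
```

The extra matches are the lone spaces in "(636) 555-0113" and "555 8904", which is why the walkthrough asks for at least two characters.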
Identify the listed names that already fit the pattern of first name and last name. Place them in a vector called name_good.
#create the vector name_good
name_good <- c(name[1],name[3],name[4],name[6])
#view the results
name_good
## [1] "Moe Szyslak" "Rev. Timothy Lovejoy" "Ned Flanders"
## [4] "Dr. Julius Hibbert"
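Picking elements by position works for six names but does not scale. As a hedged alternative, the names that are already in order can be selected by the absence of a comma. The names has_comma and name_good_alt are hypothetical and only used for this illustration.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Names without a comma are already in first-name-last-name order
has_comma <- str_detect(name, ",")
name_good_alt <- name[!has_comma]
name_good_alt
```

This assumes that a comma appears only in names written as "Last, First", which holds for this data set.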
Out of the remaining names, create a vector named name_Homer for Simpson, Homer and a vector named name_Burns for Burns, C. Montgomery.
#Convert Simpson, Homer to Homer Simpson
name_Homer <- c(name[5])
name_Homer
## [1] "Simpson, Homer"
The data element name_Homer needs to be split at the comma using str_split.
#unlist simplifies a list by producing a vector which contains all the components
#unlist(x, recursive = TRUE, use.names = TRUE)
#use str_split to split up the name
#str_split(string, pattern, n = Inf, simplify = FALSE)
name_Homer <- unlist(str_split(name_Homer, ","))
name_Homer
## [1] "Simpson" " Homer"
name_Homer is now split, but there is whitespace before the name Homer. This can be fixed using str_trim.
#str_trim trims whitespace from start and end of string
#str_trim(string, side = c("both", "left", "right"))
name_Homer <- str_trim(name_Homer)
name_Homer
## [1] "Simpson" "Homer"
Using the str_c function we will concatenate the names in their desired order.
#str_c(..., sep = "", collapse = NULL)
name_Homer <- str_c(name_Homer[2],name_Homer[1], sep = " ")
name_Homer
## [1] "Homer Simpson"
We will now convert Burns, C. Montgomery to C. Montgomery Burns
name_Burns <- c(name[2])
name_Burns
## [1] "Burns, C. Montgomery"
We will convert name_Burns by first utilizing the str_split function
name_Burns <- unlist(str_split(name_Burns, ","))
name_Burns
## [1] "Burns" " C. Montgomery"
Using str_trim we will clean up name_Burns
name_Burns <- str_trim(name_Burns)
name_Burns
## [1] "Burns" "C. Montgomery"
Using str_c we will now concatenate the first and last names
name_Burns <- str_c(name_Burns[2],name_Burns[1], sep = " ")
name_Burns
## [1] "C. Montgomery Burns"
Create a vector entitled SimpsonsCharacters by merging the name_good, name_Homer, and name_Burns vectors.
SimpsonsCharacters <- c(name_good,name_Homer,name_Burns)
SimpsonsCharacters
## [1] "Moe Szyslak" "Rev. Timothy Lovejoy" "Ned Flanders"
## [4] "Dr. Julius Hibbert" "Homer Simpson" "C. Montgomery Burns"
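The split, trim, and concatenate steps can also be collapsed into one vectorized call. The following is a minimal sketch using str_replace with backreferences, not the approach taken above; the name flipped is hypothetical.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Capture the text before and after ", " and swap the two groups.
# Names without a comma do not match and are returned unchanged.
flipped <- str_replace(name, "^(.+), (.+)$", "\\2 \\1")
flipped
```

Unlike SimpsonsCharacters, the result keeps the names in their original order, because nothing is pulled out and appended at the end.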
Create a logical vector named title_chk using several stringr features. First, we use the [:alpha:] predefined character class. Second, we apply the quantifier {2,} so that at least two alphabetic characters must appear in a row. Lastly, we add an escaped period (\\.), which matches a literal period rather than any character. The pattern is passed to the str_detect function, which returns TRUE for every name that contains a title such as Rev. or Dr.
#view the name vector
SimpsonsCharacters
## [1] "Moe Szyslak" "Rev. Timothy Lovejoy" "Ned Flanders"
## [4] "Dr. Julius Hibbert" "Homer Simpson" "C. Montgomery Burns"
#str_detect(string, pattern)
title_chk <- str_detect(SimpsonsCharacters, "[[:alpha:]]{2,}\\.")
title_chk
## [1] FALSE TRUE FALSE TRUE FALSE FALSE
As you can see str_detect returned a value of TRUE for Rev. Timothy Lovejoy and Dr. Julius Hibbert.
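The logical vector can be used directly as an index, and the same pattern can pull out the title itself. This is a small sketch; the names titled and titles are illustrative.

```r
library(stringr)

SimpsonsCharacters <- c("Moe Szyslak", "Rev. Timothy Lovejoy", "Ned Flanders",
                        "Dr. Julius Hibbert", "Homer Simpson", "C. Montgomery Burns")

title_chk <- str_detect(SimpsonsCharacters, "[[:alpha:]]{2,}\\.")

# Keep only the names flagged as having a title
titled <- SimpsonsCharacters[title_chk]

# Extract the matched title (two or more letters plus the period)
titles <- str_extract(titled, "[[:alpha:]]{2,}\\.")
titled
titles
```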
By eyeballing the SimpsonsCharacters vector, one can see that C. Montgomery Burns is the only character with a second name. Programmatically, this can be checked by calling str_detect again with a pattern that matches a single uppercase letter followed by a literal period. The {1} quantifier on the period is redundant but harmless.
#view the name vector
SimpsonsCharacters
## [1] "Moe Szyslak" "Rev. Timothy Lovejoy" "Ned Flanders"
## [4] "Dr. Julius Hibbert" "Homer Simpson" "C. Montgomery Burns"
#str_detect(string, pattern)
secondary_name <- str_detect(SimpsonsCharacters, "[A-Z]\\.{1}")
secondary_name
## [1] FALSE FALSE FALSE FALSE FALSE TRUE
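As a quick check on the pattern, the sketch below compares it with a version that drops the {1} quantifier and then uses the result to select the matching name. The names with_quantifier and without_quantifier are hypothetical.

```r
library(stringr)

SimpsonsCharacters <- c("Moe Szyslak", "Rev. Timothy Lovejoy", "Ned Flanders",
                        "Dr. Julius Hibbert", "Homer Simpson", "C. Montgomery Burns")

with_quantifier    <- str_detect(SimpsonsCharacters, "[A-Z]\\.{1}")
without_quantifier <- str_detect(SimpsonsCharacters, "[A-Z]\\.")

# Both patterns flag the same elements
identical(with_quantifier, without_quantifier)

# The name that carries an initial
SimpsonsCharacters[with_quantifier]
```

Rev. and Dr. are not flagged because the letter directly before their period is lowercase.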
#assign the example 1 expression to the ex1 variable
ex1 = "[0-9]+\\$"
#create and view good example
ex1_good <-c("1219$","12$19","7$")
ex1_good
## [1] "1219$" "12$19" "7$"
#use str_detect to detect the presence or absence of ex1
str_detect(ex1_good,ex1)
## [1] TRUE TRUE TRUE
#create and view bad example
ex1_bad <-c("1219","0000","7")
ex1_bad
## [1] "1219" "0000" "7"
#use str_detect to detect the presence or absence of ex1
str_detect(ex1_bad,ex1)
## [1] FALSE FALSE FALSE
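The ex1 pattern matches one or more digits followed by a literal dollar sign, and it may match anywhere in the string. To make that visible, the sketch below uses str_extract to show the part of each good example that is actually matched. The name matched is illustrative.

```r
library(stringr)

ex1 <- "[0-9]+\\$"
ex1_good <- c("1219$", "12$19", "7$")

# str_extract returns the first substring that matches the pattern
matched <- str_extract(ex1_good, ex1)
matched
```

For "12$19" only "12$" is returned, which shows that the trailing digits play no part in the match.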
#assign the example 2 expression to the ex2 variable
ex2="\\b[a-z]{1,4}\\b"
#create and view good example
ex2_good <-c("stop","four","lens")
ex2_good
## [1] "stop" "four" "lens"
#use str_detect to detect the presence or absence of ex2
str_detect(ex2_good,ex2)
## [1] TRUE TRUE TRUE
#create and view bad example
ex2_bad <-c("STOP","seven","Crafters")
ex2_bad
## [1] "STOP" "seven" "Crafters"
#use str_detect to detect the presence or absence of ex2
str_detect(ex2_bad,ex2)
## [1] FALSE FALSE FALSE
#assign the example 3 expression to the ex3 variable
ex3=".*?\\.txt$"
#create and view good example
ex3_good <-c("MyNameIsBrianLiles.txt","Why did you leave early?.txt")
ex3_good
## [1] "MyNameIsBrianLiles.txt" "Why did you leave early?.txt"
#use str_detect to detect the presence or absence of ex3
str_detect(ex3_good,ex3)
## [1] TRUE TRUE
#create and view bad example
ex3_bad <-c("MyNameIsBrianLiles","Why did you leave early?.text")
ex3_bad
## [1] "MyNameIsBrianLiles" "Why did you leave early?.text"
#use str_detect to detect the presence or absence of ex3
str_detect(ex3_bad,ex3)
## [1] FALSE FALSE
#assign the example 4 expression to the ex4 variable
ex4="\\d{2}/\\d{2}/\\d{4}"
#create and view good example
ex4_good <-c("12/19/2006","07/23/1975","99/99/9999")
ex4_good
## [1] "12/19/2006" "07/23/1975" "99/99/9999"
#use str_detect to detect the presence or absence of ex4
str_detect(ex4_good,ex4)
## [1] TRUE TRUE TRUE
#create and view bad example
ex4_bad <-c("December 19,2006","2006/12/19","99-99-9999")
ex4_bad
## [1] "December 19,2006" "2006/12/19" "99-99-9999"
#use str_detect to detect the presence or absence of ex4
str_detect(ex4_bad,ex4)
## [1] FALSE FALSE FALSE
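Because ex4 is not anchored, it also accepts strings that merely contain a date-like run of digits. The sketch below contrasts it with an anchored variant. Both ex4_anchored and the padded test strings are assumptions added for illustration.

```r
library(stringr)

ex4 <- "\\d{2}/\\d{2}/\\d{4}"
# Hypothetical stricter variant: the whole string must be dd/dd/dddd
ex4_anchored <- "^\\d{2}/\\d{2}/\\d{4}$"

padded <- c("12/19/2006", "112/19/20066")

# Unanchored: the second string matches on the "12/19/2006" inside it
unanchored_hits <- str_detect(padded, ex4)

# Anchored: the extra leading and trailing digits cause a rejection
anchored_hits <- str_detect(padded, ex4_anchored)

unanchored_hits
anchored_hits
```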
#assign the example 5 expression to the ex5 variable
ex5="<(.+?)>.+?</\\1>"
#create and view good example
ex5_good <-c("<title>Black Panther</title>","<h1>Cooking with Laura</h1>",
"<p>I believe in equality</p>")
ex5_good
## [1] "<title>Black Panther</title>" "<h1>Cooking with Laura</h1>"
## [3] "<p>I believe in equality</p>"
#use str_detect to detect the presence or absence of ex5
str_detect(ex5_good,ex5)
## [1] TRUE TRUE TRUE
#create and view bad example
ex5_bad <-c("<title>Black Panther<title>","h1>Cooking with Laura<h1",
"<p></p>")
ex5_bad
## [1] "<title>Black Panther<title>" "h1>Cooking with Laura<h1"
## [3] "<p></p>"
#use str_detect to detect the presence or absence of ex5
str_detect(ex5_bad,ex5)
## [1] FALSE FALSE FALSE
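The \\1 in ex5 is a backreference: the closing tag must repeat whatever the first group captured. As a rough illustration, str_match exposes that captured group alongside the full match. The names m and tag_names are hypothetical.

```r
library(stringr)

ex5 <- "<(.+?)>.+?</\\1>"
ex5_good <- c("<title>Black Panther</title>", "<h1>Cooking with Laura</h1>",
              "<p>I believe in equality</p>")

# Column 1 holds the full match, column 2 the first capture group
m <- str_match(ex5_good, ex5)
tag_names <- m[, 2]
tag_names
```

The "<p></p>" bad example fails for a different reason: .+? needs at least one character between the tags.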
secret <- c(
"clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr")
#the code is cracked by extracting all of the uppercase letters
secret <- unlist(str_extract_all(secret, "[[:upper:]]"))
secret
## [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"
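The extracted letters can be joined into a readable message with str_c and its collapse argument. The sketch below also keeps the "." and "!" characters and treats each period as a word break, which is an assumption about how the message was encoded. The names secret_raw, pieces, and message_text are hypothetical.

```r
library(stringr)

# The scrambled text, joined from three pieces without line breaks
secret_raw <- str_c(
  "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO",
  "d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5",
  "fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
)

# Keep uppercase letters plus the "." and "!" characters
pieces <- unlist(str_extract_all(secret_raw, "[[:upper:].!]"))

# Collapse into one string, then turn periods into spaces
# (assumption: periods mark word boundaries)
message_text <- str_replace_all(str_c(pieces, collapse = ""), fixed("."), " ")
message_text
```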