Load libraries.
library(stringr)
The vector name stores the extracted names.
raw.data <- "555-1239Moe Szyslak( 636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
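The {2,} quantifier matters here because single spaces sit between some of the phone-number fragments. A minimal sketch compares it with {1,}, using base R's gregexpr and regmatches instead of stringr and only a shortened piece of raw.data. The names snippet, loose, and strict are illustrative.

```r
# Shortened copy of the start of raw.data, for illustration only
snippet <- "555-1239Moe Szyslak( 636) 555-0113Burns, C. Montgomery"

# {1,} accepts a single character, so the lone spaces around "636" also match
loose <- regmatches(snippet, gregexpr("[[:alpha:]., ]{1,}", snippet))[[1]]

# {2,} requires at least two characters and drops those stray spaces
strict <- regmatches(snippet, gregexpr("[[:alpha:]., ]{2,}", snippet))[[1]]

print(loose)
print(strict)
```

The looser pattern returns extra one-space elements that would have to be filtered out afterwards.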
Rearrange the vector to the standard first_name last_name.
# Several adjustments need to be made to the name vector.
# First, remove the titles and initials.
name.2 <- str_trim(sub("[A-Za-z]{1,}\\.","",name))
name.2
## [1] "Moe Szyslak" "Burns, Montgomery" "Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Julius Hibbert"
# Now, will need to reverse the last names and first names.
# Will use the sub function again.
# Reference: http://stackoverflow.com/questions/33826650/last-name-first-name-to-first-name-last-name
name.3 <- sub("(\\w+),\\s+(\\w+)","\\2 \\1", name.2 )
name.3
## [1] "Moe Szyslak" "Montgomery Burns" "Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Julius Hibbert"
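The swap relies on two capture groups and the replacement "\\2 \\1". The sketch below, written with base R's sub, shows one limitation of capturing with \\w+ and one possible workaround. The input "Van Houten, Milhouse" is a hypothetical name that is not in the exercise data, and the wider pattern is an assumption, not part of the original solution.

```r
# The first two names come from the cleaned vector; the third is hypothetical
cleaned <- c("Simpson, Homer", "Ned Flanders", "Van Houten, Milhouse")

# \\w+ captures only the single word directly before the comma
narrow <- sub("(\\w+),\\s+(\\w+)", "\\2 \\1", cleaned)

# Assumed alternative: capture everything before and after the comma
wide <- sub("^(.+),\\s+(.+)$", "\\2 \\1", cleaned)

print(narrow)
print(wide)
```

Names without a comma are returned unchanged by both patterns, which is why the whole vector can be passed through sub in one step.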
title <- str_detect(name, "[A-Za-z]{2,}\\.")
title.df <- data.frame(Name = name, Has_Title = title)
knitr::kable(title.df, caption = "Person has title?")
Name | Has_Title |
---|---|
Moe Szyslak | FALSE |
Burns, C. Montgomery | FALSE |
Rev. Timothy Lovejoy | TRUE |
Ned Flanders | FALSE |
Simpson, Homer | FALSE |
Dr. Julius Hibbert | TRUE |
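The title check works because {2,} demands at least two letters before the period, which separates "Rev." and "Dr." from the initial "C.". A small sketch with base R's grepl makes the difference visible. The name "Mr. Smithers" is a hypothetical addition, and the variable names are illustrative.

```r
# Three names from the exercise plus one hypothetical name
candidates <- c("Rev. Timothy Lovejoy", "Burns, C. Montgomery",
                "Dr. Julius Hibbert", "Mr. Smithers")

# {1,} also flags the single-letter initial "C."
loose <- grepl("[A-Za-z]{1,}\\.", candidates)

# {2,} needs two or more letters before the period, so "C." is skipped
strict <- grepl("[A-Za-z]{2,}\\.", candidates)

print(data.frame(Name = candidates, Loose = loose, Strict = strict))
```

This heuristic would still misfire on a multi-letter abbreviation that is not a title, so it suits this data set rather than names in general.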
# Count the spaces in each name, since spaces separate the names. Once the title is removed, a person with a second name has 2 spaces instead of 1. The titles are therefore removed into a new vector first.
name.4 <- str_trim(sub("[A-Za-z]{2,}\\.","",name))
space.count <- str_count(name.4, "\\s")
space.count
## [1] 1 2 1 1 1 1
# A name with more than one space has a second name, so a direct comparison gives the logical vector.
Two.Name <- space.count > 1
# Now create a data.frame and display it as a table.
Two.Name.df <- data.frame(Name = name, Two_Names = Two.Name)
knitr::kable(Two.Name.df, caption = "Does this person have more than 1 name?")
Name | Two_Names |
---|---|
Moe Szyslak | FALSE |
Burns, C. Montgomery | TRUE |
Rev. Timothy Lovejoy | FALSE |
Ned Flanders | FALSE |
Simpson, Homer | FALSE |
Dr. Julius Hibbert | FALSE |
This expression matches one or more digits (the [0-9]+) immediately followed by a literal dollar sign (the \$, escaped so that it is not read as the end-of-string anchor). An example would be 534221097$.
x <- c("43234$", "adbd3222$", "$33432", "1$")
str_extract_all(x, "[0-9]+\\$")
## [[1]]
## [1] "43234$"
##
## [[2]]
## [1] "3222$"
##
## [[3]]
## character(0)
##
## [[4]]
## [1] "1$"
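The escape is what makes the dollar sign literal. As a quick check, the sketch below uses base R's grepl on the same x to compare the escaped pattern with an unescaped one, where $ acts as the end-of-string anchor. The variable names are illustrative.

```r
x <- c("43234$", "adbd3222$", "$33432", "1$")

# Escaped: digits followed by a literal "$" character
literal_dollar <- grepl("[0-9]+\\$", x)

# Unescaped: digits that run up to the end of the string
anchor_dollar <- grepl("[0-9]+$", x)

print(data.frame(x = x, literal = literal_dollar, anchor = anchor_dollar))
```

The two patterns select exactly opposite elements of this vector.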
This regex matches a lowercase word of 1 to 4 letters with a word boundary (\b) at either edge, so the word cannot be attached to other letters, digits, or underscores. An example would be the "usa" in " usa ".
x <- c(" usa ", "usa1", "United States")
str_extract_all(x, "\\b[a-z]{1,4}\\b")
## [[1]]
## [1] "usa"
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
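To see what the \b boundaries contribute, the following sketch runs the pattern with and without them on the same x. It uses base R's gregexpr and regmatches with perl = TRUE rather than stringr, and the variable names are illustrative.

```r
x <- c(" usa ", "usa1", "United States")

# Without boundaries, pieces of longer words are picked up as well
no_boundary <- regmatches(x, gregexpr("[a-z]{1,4}", x, perl = TRUE))

# With \\b on both sides, only free-standing short words remain
with_boundary <- regmatches(x, gregexpr("\\b[a-z]{1,4}\\b", x, perl = TRUE))

print(no_boundary)
print(with_boundary)
```

Without the boundaries, "usa1" and "United States" both yield fragments, which is usually not what a search for short words intends.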
This matches a string that ends in ".txt", such as a .txt file name. The .*? lazily matches any characters, possibly none, before the ".txt", the \. matches a literal period, and the $ anchors the match to the end of the string. An example would be "hello.txt".
x <- c("hello.txt", "hello.txt!", "datascience.pdf")
str_extract_all(x, ".*?\\.txt$")
## [[1]]
## [1] "hello.txt"
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
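The $ anchor is the part that rejects "hello.txt!". A short sketch with base R's grepl, using perl = TRUE for the lazy quantifier, compares the anchored pattern with an unanchored version on the same x. The variable names are illustrative.

```r
x <- c("hello.txt", "hello.txt!", "datascience.pdf")

# Anchored: ".txt" must be the last thing in the string
anchored <- grepl(".*?\\.txt$", x, perl = TRUE)

# Unanchored: ".txt" may be followed by anything
unanchored <- grepl(".*?\\.txt", x, perl = TRUE)

print(data.frame(x = x, anchored = anchored, unanchored = unanchored))
```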
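This matches two digits, a slash, two digits, a slash, and four digits, which is likely a date with a two digit month, two digit day, and four digit year (or day first, since the pattern does not check the values). An example would be 01/01/2000.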
x <- c("06/30/00", "01/01/2000", "June 1st, 2017")
str_extract_all(x, "\\d{2}/\\d{2}/\\d{4}")
## [[1]]
## character(0)
##
## [[2]]
## [1] "01/01/2000"
##
## [[3]]
## character(0)
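The pattern checks only the shape of the string, not whether the numbers form a real date. A hedged sketch of one way to add that check combines the regex with base R's as.Date. The month-first format "%m/%d/%Y" is an assumption, and "99/99/2000" is a hypothetical input chosen to fail the second check.

```r
# The second element is a hypothetical, impossible date
candidates <- c("01/01/2000", "99/99/2000", "06/30/00")

# Step 1: the string has the two/two/four digit shape
shape_ok <- grepl("^\\d{2}/\\d{2}/\\d{4}$", candidates, perl = TRUE)

# Step 2: the string parses as a calendar date (NA when it does not)
parsed <- as.Date(candidates, format = "%m/%d/%Y")

# A value counts as valid only when both checks pass
valid <- shape_ok & !is.na(parsed)

print(data.frame(candidates, shape_ok, valid))
```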
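This matches a pair of HTML-style tags with content between them. The (.+?) captures the tag name, and the \1 is a backreference to that first captured group, so the closing tag must repeat the name of the opening tag. An example would be <tag>Hello</tag>.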
x <- c("<tag>Hello</tag>", "<fire>house</fire>", "<oreo cream oreo>", "<fire>house<fire>", "<a>a</a>")
str_extract_all(x, "<(.+?)>.+?</\\1>")
## [[1]]
## [1] "<tag>Hello</tag>"
##
## [[2]]
## [1] "<fire>house</fire>"
##
## [[3]]
## character(0)
##
## [[4]]
## character(0)
##
## [[5]]
## [1] "<a>a</a>"
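The effect of the backreference is easiest to see next to a pattern that lacks it. The sketch below uses base R's grepl with perl = TRUE. The input strings and the second pattern are hypothetical and serve only as a comparison.

```r
# Hypothetical inputs: a matching pair and a mismatched pair of tags
tags <- c("<b>bold</b>", "<b>bold</i>")

# With \\1 the closing tag must repeat the captured opening name
with_ref <- grepl("<(.+?)>.+?</\\1>", tags, perl = TRUE)

# With a second independent group, any closing tag is accepted
without_ref <- grepl("<(.+?)>.+?</(.+?)>", tags, perl = TRUE)

print(data.frame(tags, with_ref, without_ref))
```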
9. The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.
clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk! gr
data.code <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk! gr"
secret <- unlist(str_extract_all(data.code, "[[:upper:].]+"))
secret
## [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "AT" "I" "O" "N" "S"
## [15] "." "Y" "O" "U" "." "A" "R" "E" "." "A" ".S" "U" "P" "E"
## [29] "R" "N" "E" "R" "D"
message <- str_replace_all(paste0(secret, collapse = ''), "[.]", " ")
message
## [1] "CONGRATULATIONS YOU ARE A SUPERNERD"
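The same idea can be expressed by deleting the uninformative characters instead of extracting the informative ones. The sketch below is an alternative, not the solution used above. It relies on base R's gsub and on a short hypothetical string (toy.code) built the same way as the exercise text.

```r
# Hypothetical toy message: capitals and periods carry the content
toy.code <- "abH1cEdL2LfO.gWh3OiRjLkD"

# Drop every character that is neither an uppercase letter nor a period
kept <- gsub("[^[:upper:].]", "", toy.code)

# Turn the periods into spaces to separate the words
decoded <- gsub(".", " ", kept, fixed = TRUE)

print(decoded)
```

Because the result is already a single string, no paste0 step is needed with this variant.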