Copy the introductory example. The vector name stores the extracted names.
library(stringr)
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
raw.data
## [1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
# Grab either "word " or ", word" (optionally followed by ". word")
first_name_raw <- unlist(str_extract_all(name, "\\w* |, \\w*(\\. \\w*|$)"))
# Drop the lone spaces that titles such as "Rev." leave behind
first_name_raw <- first_name_raw[first_name_raw != " "]
first_name_cand <- gsub(",", "", first_name_raw)
first_name <- str_trim(first_name_cand)
# Last name: the final word, or the word directly before a comma
last_name_raw <- str_extract(name, " \\w*$|\\w*,")
last_name_cand <- gsub(",", "", last_name_raw)
last_name <- str_trim(last_name_cand)
data.frame(first_name = first_name, last_name = last_name)
##      first_name last_name
## 1           Moe   Szyslak
## 2 C. Montgomery     Burns
## 3       Timothy   Lovejoy
## 4           Ned  Flanders
## 5         Homer   Simpson
## 6        Julius   Hibbert
Definitely not the best solution, but it does the job.
- For first_name, I extracted either a word followed by a space OR a comma and a space followed by a word (optionally followed by a period, a space, and another word). The intermediate result is a vector of first names that still contains commas and stray spaces, plus lone spaces left behind by titles such as "Rev.", which I filtered out. I used gsub to remove the commas and str_trim to strip the leading and trailing whitespace from each first_name string.
- For last_name, I extracted either a space followed by a word at the end of the string or a word followed by a comma. The intermediate result is a vector of last names that still contains commas and spaces. I used gsub to remove the commas and str_trim to strip the leading and trailing whitespace from each last_name string.
- The final result is a data frame with first_name and last_name columns.
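For comparison, the same split can be sketched in base R by first reordering the "Last, First" entries. This is only one possible alternative, not the approach used above. The helper names `reordered` and `no_title` are made up for illustration, and the sketch assumes every last name is a single word.

```r
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Turn "Last, First" into "First Last"; other entries are left unchanged
reordered <- sub("^(\\w+), (.+)$", "\\2 \\1", name)

# Drop a leading title (two or more word characters, a period, a space)
no_title <- sub("^\\w{2,}\\. ", "", reordered)

# Everything before the final word is the first name; the final word is the last name
first_name <- sub(" \\w+$", "", no_title)
last_name <- sub("^.* ", "", no_title)

data.frame(first_name = first_name, last_name = last_name)
```

Reordering first means a single rule ("the final word is the last name") covers both name layouts, at the cost of an extra substitution step.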
title_vector <- grepl("\\w{2,}\\. ",name)
data.frame(first_name = first_name, last_name = last_name, title = title_vector)
##      first_name last_name title
## 1           Moe   Szyslak FALSE
## 2 C. Montgomery     Burns FALSE
## 3       Timothy   Lovejoy  TRUE
## 4           Ned  Flanders FALSE
## 5         Homer   Simpson FALSE
## 6        Julius   Hibbert  TRUE
- For the logical vector title, I used grepl to find any word of at least 2 characters that is followed by a period and a space. It returns TRUE if the pattern exists and FALSE otherwise.
- The final result is a data frame with first_name, last_name, and title columns.
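To see which text triggers a TRUE, the matched title itself can be pulled out. The following is a minimal base R sketch, assuming titles only appear at the start of a name; the `^` anchor and the variable `title_text` are additions for illustration and are not part of the pattern above.

```r
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Locate a leading title: two or more word characters followed by a period
m <- regexpr("^\\w{2,}\\.", name)

# regmatches() returns only the successful matches, so fill them into an NA vector
title_text <- rep(NA_character_, length(name))
title_text[m > 0] <- regmatches(name, m)
title_text
```

The initial "C." is skipped because the `{2,}` quantifier requires at least two word characters before the period.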
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
##      first_name last_name title secondname
## 1           Moe   Szyslak FALSE      FALSE
## 2 C. Montgomery     Burns FALSE       TRUE
## 3       Timothy   Lovejoy  TRUE      FALSE
## 4           Ned  Flanders FALSE      FALSE
## 5         Homer   Simpson FALSE      FALSE
## 6        Julius   Hibbert  TRUE      FALSE
- For the logical vector secondname, I used grepl to find any word of exactly 1 character that is followed by a period and a space. It returns TRUE if the pattern exists and FALSE otherwise.
- The final result is a data frame with first_name, last_name, title, and secondname columns.
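The call that produced the secondname vector is not shown above. One pattern that fits the description and reproduces the printed output is sketched below; the name `secondname_vector` and the exact regular expression are assumptions, not necessarily what was originally run.

```r
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# The word boundary (\b) directly before a single word character means the
# "word" is exactly one character long; it must be followed by ". "
secondname_vector <- grepl("\\b\\w\\. ", name)
secondname_vector
```

Without the word boundary, the last letter of "Rev." or "Dr." would also count as a one-character word and the titles would be flagged as well.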
Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
All answers were derived by experimentation and liberal use of the grepl function, which proved quite useful.
a <- c("Frank","3Reginald","Robert4","Lisa36","8","345","8$","A6S12$","85$")
Four_1_vector <- grepl("[0-9]+\\$",a)
Four_1_vector
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
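The pattern matches any string that contains one or more digits immediately followed by a literal $. The match can occur anywhere in the string, so the $ does not have to be the last character.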
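Note that the pattern is not anchored, so the digits and the dollar sign may sit anywhere in the string. A short sketch with made-up test strings shows the difference; the anchored variant is an addition for comparison only.

```r
x <- c("8$", "cost: 85$ total", "$85", "abc$")

# Unanchored: digits immediately followed by a literal $ anywhere in the string
unanchored <- grepl("[0-9]+\\$", x)

# Anchored variant: the whole string must be digits followed by a literal $
anchored <- grepl("^[0-9]+\\$$", x)

data.frame(x, unanchored, anchored)
```

Only "8$" passes the anchored version, while "cost: 85$ total" passes the unanchored one as well.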
b <- c("Frank","3Reginald","Robert4","Lisa","Reginald abc Rupert", "Reginald abc","Reginald abcde","Reginald abc3")
Four_2_vector <- grepl("\\b[a-z]{1,4}\\b",b)
Four_2_vector
## [1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
The pattern matches any string that contains a standalone word of 1 to 4 lowercase letters, delimited by word boundaries. No preceding word is required. In the example above, "Reginald abc Rupert" and "Reginald abc" return TRUE because of the word abc, while "Lisa" returns FALSE because of its capital letter. I also demonstrated two more FALSE returns: a second word with 5 lowercase letters, and a second word made of lowercase letters but ending with a number.
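A few extra made-up strings help to isolate what the word boundaries and the lowercase class each contribute. In this sketch, `perl = TRUE` is my own choice to pin `[a-z]` to ASCII lowercase regardless of locale; it is not used in the call above.

```r
x <- c("abc", "lisa", "Lisa", "abcde", "abc3", "to be")

# A standalone run of 1 to 4 lowercase letters between two word boundaries
short_word <- grepl("\\b[a-z]{1,4}\\b", x, perl = TRUE)

data.frame(x, short_word)
```

"abc" matches on its own, which shows that no preceding word is needed, and "Lisa" fails only because of the capital letter.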
c <- c("Frank.txt","3Reginald.txt","Robert4","Lisa.txtx")
Four_3_vector <- grepl(".*?\\.txt$",c)
Four_3_vector
## [1] TRUE TRUE FALSE FALSE
The pattern matches any string that ends with .txt, preceded by any characters (or none at all). I added one that ends with .txtx and it returned FALSE.
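Since `.*?` may match nothing at all, the leading part of the pattern does not restrict what comes before the extension. The sketch below, using made-up file names, compares it with the shorter pattern `\\.txt$`, which is my own simplification for illustration.

```r
x <- c(".txt", "my notes.txt", "a.txt.bak", "atxt")

# Original pattern: anything (lazily), then a literal ".txt" at the end
lazy <- grepl(".*?\\.txt$", x)

# Shorter pattern: only require the literal ".txt" at the end
suffix_only <- grepl("\\.txt$", x)

data.frame(x, lazy, suffix_only)
```

For grepl the two give the same answers on these inputs; the prefix would only matter when extracting the matched text.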
d <- c("Frank","February 22, 1965","3/4/18","03/04/2019","12/31/2018")
Four_4_vector <- grepl("\\d{2}/\\d{2}/\\d{4}",d)
Four_4_vector
## [1] FALSE FALSE FALSE TRUE TRUE
The pattern matches any string that contains a date of the form nn/nn/nnnn: two digits, a slash, two digits, a slash, and four digits. I used a long date format and a short date format to demonstrate that they return FALSE. Only strings containing the prescribed pattern return TRUE.
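Two caveats are worth a quick check: the pattern is not anchored, and it only tests the shape of the string, not whether the date is valid. The test strings and the anchored variant below are made up for illustration.

```r
x <- c("03/04/2019", "due 03/04/2019", "99/99/9999", "3/4/18")

# Unanchored: the nn/nn/nnnn shape may appear anywhere in the string
unanchored <- grepl("\\d{2}/\\d{2}/\\d{4}", x)

# Anchored variant: the whole string must have the nn/nn/nnnn shape
anchored <- grepl("^\\d{2}/\\d{2}/\\d{4}$", x)

data.frame(x, unanchored, anchored)
```

"99/99/9999" passes both checks even though it is not a real date, so a separate validation step would be needed for that.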
e <- c("Frank","<html>something</html>", "<docNumber>herewego</nothis>", "<docNumber></docNumber>","<docNumber> </docNumber>")
Four_5_vector <- grepl("<(.+?)>.+?</\\1>",e)
Four_5_vector
## [1] FALSE TRUE FALSE FALSE TRUE
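The pattern matches any string that contains an XML-style pair of matching tags, <tag>someword</tag>, where the closing tag must repeat the name of the opening tag (this is what the \\1 backreference enforces). someword must not be empty. For example, in the example above, "<docNumber></docNumber>" returns FALSE, but "<docNumber> </docNumber>" returns TRUE because there is a space between the opening and closing tags.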
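The backreference can also be used to pull the pieces apart. In this sketch a second capture group is added around the content purely for illustration, so the pattern differs slightly from the one above; `tag` and `content` are hypothetical names, and `perl = TRUE` is my own choice of regex engine.

```r
x <- c("<html>something</html>", "<docNumber>herewego</nothis>", "<b> </b>")

# Group 1 captures the tag name, group 2 (added here) captures the content;
# the backreference forces the closing tag to repeat group 1
pattern <- "<(.+?)>(.+?)</\\1>"
matched <- grepl(pattern, x, perl = TRUE)

# In the replacement string, "\\1" and "\\2" insert the captured groups
tag <- ifelse(matched, sub(pattern, "\\1", x, perl = TRUE), NA)
content <- ifelse(matched, sub(pattern, "\\2", x, perl = TRUE), NA)

data.frame(x, matched, tag, content)
```

The second string is rejected because its closing tag does not repeat the opening tag name, and the third is accepted because a single space counts as content.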
The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others!
clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr
rawstring <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
enc2native(rawstring)
## [1] "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"