Regular Expressions and String Functions

Question 1

Copy the introductory example. The vector name stores the extracted names.

R> name
[1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

(a) Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
(b) Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
(c) Construct a logical vector indicating whether a character has a second name.

Answer 1

# installing required package 
install.packages('stringr', repos = "http://cran.us.r-project.org")

library('stringr')

# creating the example raw data
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
raw.data

## [1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

#creating the example name list
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
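
As a hedged illustration of why the quantifier {2,} appears in the pattern: some phone numbers in the raw data contain a single space, and that space would also match the character class if one character were enough. The sketch below compares the two quantifiers; the object names with_min_two and with_min_one are introduced here for illustration only.

```r
library(stringr)

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# at least two consecutive characters from the class: only the names match
with_min_two <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

# one character is enough: lone spaces inside phone numbers match as well
with_min_one <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]+"))

length(with_min_two)
length(with_min_one)
```

The second vector is longer because it picks up the stray spaces in "(636) 555-0113", "555 -6542" and "555 8904".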

Answer (a)

Looking at the name vector, we need to take care of the following:

Title of a person
Spaces
Order of the first name and the last name
Middle Initial of an individual.
Commas

# removing the comma
names_without_comma <- str_replace_all(name, pattern = ",", replacement = "")
names_without_comma

## [1] "Moe Szyslak"          "Burns C. Montgomery"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson Homer"        "Dr. Julius Hibbert"

# removing the title
names_without_titles <- str_replace_all(names_without_comma, pattern = "[[:alpha:]]{2,}\\. ", replacement = "")
names_without_titles

## [1] "Moe Szyslak"         "Burns C. Montgomery" "Timothy Lovejoy"    
## [4] "Ned Flanders"        "Simpson Homer"       "Julius Hibbert"

# removing the middle names
names_without_middle_names <- str_replace_all(names_without_titles, pattern = "[[:alpha:]]{1,}\\. ", replacement = "" )
names_without_middle_names

## [1] "Moe Szyslak"      "Burns Montgomery" "Timothy Lovejoy" 
## [4] "Ned Flanders"     "Simpson Homer"    "Julius Hibbert"

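# displaying first_name last_name: only the entries that were written as "last, first"
# (identified by the comma in the original vector) need to be swapped
has_comma <- str_detect(name, ",")
first_name_last_name <- ifelse(has_comma, str_replace(names_without_middle_names, pattern = "(\\w+)\\s+(\\w+)", replacement = "\\2 \\1"), names_without_middle_names)
first_name_last_name

## [1] "Moe Szyslak"      "Montgomery Burns" "Timothy Lovejoy" 
## [4] "Ned Flanders"     "Homer Simpson"    "Julius Hibbert"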
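
The rearrangement can also be sketched in two steps that touch only the entries written as "last name, first name". This is a minimal alternative, not the approach used above: it keeps the middle initial instead of dropping it, and the object names swapped and standardized are introduced here for illustration.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# move everything after the comma in front of the word before it;
# entries without a comma do not match and are returned unchanged
swapped <- str_replace(name, "^(\\w+),\\s*(.+)$", "\\2 \\1")

# drop a leading title: two or more letters, a period, and whitespace
standardized <- str_replace(swapped, "^[[:alpha:]]{2,}\\.\\s+", "")
standardized
```

Because the title pattern requires at least two letters before the period, the single-letter initial "C." is left in place.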

Answer (b) Logical Vector indicating whether a character has a title

# using str_detect to flag (TRUE or FALSE) which names contain a title
str_detect(names_without_comma, "[[:alpha:]]{2,}\\.")

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
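
The pattern above treats any run of two or more letters followed by a period as a title. A stricter sketch, assuming Rev. and Dr. are the only titles of interest (as the question states), spells them out explicitly; has_title is a name chosen here for illustration.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# TRUE only when the element starts with one of the listed titles
has_title <- str_detect(name, "^(Rev|Dr)\\.")
has_title
```

Listing the titles avoids false positives from other abbreviations, at the cost of having to extend the alternation when new titles appear.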

Answer (c) Logical Vector indicating whether a character has a middle name

# using str_detect to flag (TRUE or FALSE) which names contain a middle name
str_detect(names_without_titles, "[[:alpha:]]{1,}\\.")

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
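
Detecting the period only finds second names that are abbreviated. As a hedged alternative, one could count the name parts once commas and titles are gone and flag every entry with more than two; the vector is retyped here so the sketch runs on its own, and has_second_name is an illustrative name.

```r
library(stringr)

names_without_titles <- c("Moe Szyslak", "Burns C. Montgomery", "Timothy Lovejoy",
                          "Ned Flanders", "Simpson Homer", "Julius Hibbert")

# count runs of word characters; more than two parts implies a second name
has_second_name <- str_count(names_without_titles, "\\w+") > 2
has_second_name
```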

Question 2

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

(a) [0-9]+\$
(b) \b[a-z]{1,4}\b
(c) .*?\.txt$
(d) \d{2}/\d{2}/\d{4}
(e) <(.+?)>.+?</\1>

Answer 2

Answer (a)

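This expression matches one or more digits (0 to 9) immediately followed by a literal dollar sign, for example 589$. Because the $ is escaped, it stands for the character itself rather than the end-of-string anchor.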

web_text <- "In Europe they use $ sign at the end of the numbers, for example 435$"
web_text

## [1] "In Europe they use $ sign at the end of the numbers, for example 435$"

str_extract_all(web_text, "[0-9]+\\$")

## [[1]]
## [1] "435$"
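
To make the role of the backslash concrete, the sketch below contrasts the escaped dollar sign with an unescaped one on a small made-up vector (amounts is hypothetical test data, not part of the exercise).

```r
library(stringr)

amounts <- c("589$", "589", "cost: 12$ each")

# \\$ is a literal dollar sign, so it must follow the digits directly
literal_dollar <- str_detect(amounts, "[0-9]+\\$")

# an unescaped $ anchors the match at the end of the string instead
end_anchor <- str_detect(amounts, "[0-9]+$")

literal_dollar
end_anchor
```

Only the strings that contain digits followed by the character $ satisfy the first pattern, while the second pattern is satisfied only by the string that ends in digits.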

Answer (b)

This expression matches whole words made up of one to four lowercase letters; the \b on each side marks a word boundary.

str_extract_all(web_text, "\\b[a-z]{1,4}\\b")

## [[1]]
## [1] "they" "use"  "sign" "at"   "the"  "end"  "of"   "the"  "for"
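
Note that "In" is missing from the result because [a-z] covers lowercase letters only. As a small sketch of one way to include capitalized words, stringr's regex() helper can switch on case-insensitive matching; short_words is an illustrative name.

```r
library(stringr)

web_text <- "In Europe they use $ sign at the end of the numbers, for example 435$"

# ignore_case = TRUE lets [a-z] match uppercase letters as well
short_words <- unlist(str_extract_all(web_text, regex("\\b[a-z]{1,4}\\b", ignore_case = TRUE)))
short_words
```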

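Answer (c)

This expression matches any string that ends with .txt. The lazy .*? consumes the characters before the literal .txt, and the $ anchors the match at the end of the string.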

file_formats <- "We have a lot of files named info.txt"
file_formats

## [1] "We have a lot of files named info.txt"

str_extract_all(file_formats, ".*?\\.txt$")

## [[1]]
## [1] "We have a lot of files named info.txt"
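
Since the lazy .*? still starts at the beginning of the string, the whole sentence is returned above. A typical use of this pattern is checking file names, sketched below on a hypothetical vector files.

```r
library(stringr)

files <- c("info.txt", "notes.txt.bak", "readme.md", "txt")

# TRUE only when the name ends in a literal ".txt"
is_txt <- str_detect(files, ".*?\\.txt$")
is_txt
```

The entry "notes.txt.bak" contains .txt but does not end with it, so the end-of-string anchor rejects it.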

Answer (d)

This expression matches two digits, a "/", two digits, another "/", and four digits. It looks like a date written as day/month/year or month/day/year.

web_date <- "The last day for enrollment to the class was 08/27/2019"

str_extract_all(web_date, "\\d{2}/\\d{2}/\\d{4}")

## [[1]]
## [1] "08/27/2019"
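
The pattern checks only the shape of the string, not whether it is a real date. A minimal sketch of a follow-up check, assuming the month/day/year reading, parses the matches with base R's as.Date; candidates is made-up test data.

```r
library(stringr)

candidates <- c("08/27/2019", "99/99/9999")

# both strings have the right shape for the regular expression
shape_ok <- str_detect(candidates, "^\\d{2}/\\d{2}/\\d{4}$")

# as.Date returns NA when the string is not a valid calendar date
parsed <- as.Date(candidates, format = "%m/%d/%Y")
is_real_date <- !is.na(parsed)

shape_ok
is_real_date
```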

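Answer (e)

This expression matches an opening tag <...>, at least one character of content, and a closing tag </...> whose name repeats the opening tag's name through the backreference \1. It looks like an HTML element.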

web_code <- "The way you construct the html body is that you start with <p>These pretzels are making me thirsty!</p>"
str_extract_all(web_code, "<(.+?)>.+?</\\1>")

## [[1]]
## [1] "<p>These pretzels are making me thirsty!</p>"
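
The effect of the backreference can be sketched on a few hypothetical snippets (tags is made-up test data): the closing tag must repeat whatever the first group captured, and str_match exposes that captured tag name in its second column.

```r
library(stringr)

tags <- c("<p>text</p>", "<p>text</b>", "<b>bold</b>")
pattern <- "<(.+?)>.+?</\\1>"

# mismatched opening and closing tags are rejected
tag_matches <- str_detect(tags, pattern)

# column 2 of str_match holds the first capture group, here the tag name
tag_names <- str_match(tags, pattern)[, 2]

tag_matches
tag_names
```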

Question 9

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr

Answer 9

# creating the message object

scripted_message <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

scripted_message

## [1] "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# let's see if we can find anything in the digits

str_extract_all(scripted_message, "[[:digit:]]")

## [[1]]
##  [1] "1" "0" "8" "7" "7" "9" "2" "8" "5" "5" "0" "7" "8" "0" "3" "5" "3"
## [18] "0" "7" "5" "5" "3" "3" "6" "4" "1" "1" "6" "2" "2" "4" "9" "0" "5"
## [35] "6" "5" "1" "7" "2" "4" "6" "3" "9" "5" "8" "9" "6" "5" "9" "4" "9"
## [52] "0" "5" "4" "5"

# let's see if we can find anything in the capital letters
str_extract_all(scripted_message, "[[:upper:]]")

## [[1]]
##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "Y" "O"
## [18] "U" "A" "R" "E" "A" "S" "U" "P" "E" "R" "N" "E" "R" "D"

The capital letters spell out "Congratulations you are a super nerd", but each character is returned as a separate element.

# printing the capital letters on one line, keeping the periods that separate the words
cat(unlist(str_extract_all(scripted_message, "[[:upper:].]")))

## C O N G R A T U L A T I O N S . Y O U . A R E . A . S U P E R N E R D
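
As a final polish, the extracted characters can be collapsed into a single readable string. The sketch below is one possible way to do it: it also keeps the exclamation mark near the end of the code and turns the periods into spaces. The names revealing and message_text are introduced here for illustration.

```r
library(stringr)

scripted_message <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"

# keep capital letters, periods and the exclamation mark
revealing <- unlist(str_extract_all(scripted_message, "[[:upper:].!]"))

# glue the characters together and use the periods as word separators
message_text <- str_replace_all(paste(revealing, collapse = ""), "\\.", " ")
message_text
```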