Introduction

Using regular expression and string manipulation to extract relevant information from structured, un-structed text data using R. The below few examples from Week3 assignment, shows how we can use regular expression to extract strings, on which we can then apply R packages like stringr / concatenation / BaseR etc.. to manipulate the strings and construct something meangingfull.

library(stringr)
library(concatenate)

## Warning: package 'concatenate' was built under R version 3.5.2

Question 3:-

Copy the introductory example. The vector name stores the extracted names.

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.
Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
Construct a logical vector indicating whether a character has a second name.

Answer 3:-

raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer55536420Dr. Julius Hibbert"

name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
phoneNum <- unlist(str_extract_all(raw.data, "\\(?(\\d{3})?\\)?(-| )?(\\d{3})(-| )?(\\d{4})"))

# a) 
name1 <- str_replace(name ,pattern = "C. " , replacement = "")
name1 <- str_replace(name1 ,pattern = "," , replacement = "")
unlist(str_extract_all(name1, "[[:alpha:],]{2,}[:space:]+[:alpha:]{2,}"))

## [1] "Moe Szyslak"      "Burns Montgomery" "Timothy Lovejoy" 
## [4] "Ned Flanders"     "Simpson Homer"    "Julius Hibbert"

# b)
str_detect(name,"[[:alpha:]]{2,3}[\\.]")

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

# c) 
str_detect(name,"[:SPACE:][[:alpha:]]{1,}[\\.][:SPACE:]")

## [1] FALSE  TRUE FALSE FALSE FALSE FALSE

Question 4:-

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.
(a) [0-9]+\$
(b) \b[a-z]{1,4}\b
(c) .*?\.txt$
(d) \d{2}/\d{2}/\d{4}
(e) <(.+?)>.+?</\1>

Answer 4 :-

# (i)
str_4i <- c("ADGYRE67575$", "1245638ABCV$","123BAC45")
str_detect(str_4i,"[0-9]+\\$" )

## [1]  TRUE FALSE FALSE

# (ii)
str_4ii <- c( "test in the week","Cat", "1234")
str_detect(str_4ii,"\\b[a-z]{1,4}\\b")

## [1]  TRUE FALSE FALSE

# (iii)
strPattern <- ".*?\\.txt$"
str_4iii <- c("Assingment.txt","assignment")
str_detect(str_4iii,strPattern)

## [1]  TRUE FALSE

# (iv)
strPatt <- "\\d{2}/\\d{2}/\\d{4}"
str_4iv <-c("2/2/1977","02/02/1977","1977/03/29","03/29/86")
str_detect(str_4iv,strPatt)

## [1] FALSE  TRUE FALSE FALSE

# (v)
strPatt <- "<(.+?)>.+?</\\1>"
str_4v <- c("<h1> this is an paragraph header</h1>","<p>test</pt>","<test>b</test")
str_detect(str_4v,strPatt)

## [1]  TRUE FALSE FALSE

Question 9:-

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com. clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr

Answer 9 :-

secret_mesg <- c("clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr")

secret_msg <- unlist(str_extract_all(secret_mesg,"[[:upper:].]{1,}"))
secret_msg <- cc(secret_msg)
secret_msg <- str_replace_all(secret_msg,",","")
secret_msg <- str_replace_all(secret_msg,"\\."," ")
secret_msg

## [1] "C O N G R A T U L AT I O N S   Y O U   A R E   A  S U P E R N E R D"

Week 3 Assignment

Vishal Arora

February 17, 2019