DATA 607 - HW 3

Copy the introductory example. The vector name stores the extracted names.

R> name
[1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"

name<-c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy","Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")
name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

library(stringr)
name2<-str_split(name, ", ",simplify=TRUE)
name2

##      [,1]                   [,2]           
## [1,] "Moe Szyslak"          ""             
## [2,] "Burns"                "C. Montgomery"
## [3,] "Rev. Timothy Lovejoy" ""             
## [4,] "Ned Flanders"         ""             
## [5,] "Simpson"              "Homer"        
## [6,] "Dr. Julius Hibbert"   ""

#str_trim() removes the leading space left behind when a name has no comma to split on.
name3<-str_trim(str_c(name2[,2]," ",name2[,1]))
name3

## [1] "Moe Szyslak"          "C. Montgomery Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
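Under a stricter reading of first_name last_name, the titles and the initial could be dropped as well. The following is a minimal sketch rather than part of the original answer. The names swapped and strict are introduced here for illustration, and it assumes that a title is two or more letters followed by a period and that an initial is a single capital letter followed by a period.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Swap "Last, First" into "First Last" with two capture groups;
# names without a comma are left unchanged.
swapped <- str_replace(name, "^(.+), (.+)$", "\\2 \\1")

# Assumption: a leading title is 2+ letters followed by a period and a space.
strict <- str_replace(swapped, "^[[:alpha:]]{2,}\\. ", "")

# Assumption: a leading initial is one capital letter, a period, and a space.
strict <- str_replace(strict, "^[A-Z]\\. ", "")
strict
```

Using capture groups avoids the empty second column that str_split() produces for names without a comma.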

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

title<-str_detect(name,"[[:alpha:]]{2,}\\.")
title

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
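To check what the pattern actually matched, the same expression can be passed to str_extract(). This is a small sketch; the variable title_text is introduced here for illustration only.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# str_extract() returns the first match in each element, or NA when nothing matches.
# "C." is not picked up because the pattern requires at least two letters before the period.
title_text <- str_extract(name, "[[:alpha:]]{2,}\\.")
title_text
```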

Construct a logical vector indicating whether a character has a second name.

#Note: the pattern "\\. " also matches titles (Rev., Dr.), so titled names are flagged TRUE along with the initial "C.".
secondname<-str_detect(name3, "\\. ")
secondname

## [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE
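If titles should not count as a second name, one possible alternative is to remove them first and then count the remaining name parts. This is a hedged sketch, not the original answer. The names no_title and secondname_strict are hypothetical, and it assumes a second name exists when three or more name parts remain.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Drop titles (assumed to be 2+ letters followed by a period and a space).
no_title <- str_replace(name, "[[:alpha:]]{2,}\\. ", "")

# Count runs of letters; three or more parts is taken to mean a second name.
secondname_strict <- str_count(no_title, "[[:alpha:]]+") >= 3
secondname_strict
```

With this definition only "Burns, C. Montgomery" is flagged.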

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

[0-9]+\$

This regular expression matches one or more consecutive digits that are immediately followed by a literal $ sign.

x<-"this54444$ is example$ str77777ing another exa$5555"
a<-str_extract_all(x, "[0-9]+\\$")
a

## [[1]]
## [1] "54444$"
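The position of the $ sign matters, which a quick str_detect() check makes visible. This is a small sketch with made-up test strings; prices and has_amount are names introduced here.

```r
library(stringr)

# Hypothetical test strings: only digits directly followed by "$" should match.
prices <- c("100$", "$100", "abc$", "price: 5$ each")

has_amount <- str_detect(prices, "[0-9]+\\$")
has_amount
```

"$100" fails because the dollar sign comes before the digits, and "abc$" fails because there are no digits at all.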

\b[a-z]{1,4}\b

This regular expression matches lowercase words of 1 to 4 letters that sit between word boundaries (\b). Punctuation such as a period, comma, hyphen, or $ sign is not a word character, so it creates a boundary and splits a string like c.c into two matches. Digits and uppercase letters are word characters, so strings such as 4dog or cAt produce no match.

x1<-c("cat cats dogs DOG cAt Dog CatDog dog 4dog DOGSS dogss catsss c.c c,c c-c")
a<-(str_extract_all(x1, "\\b[a-z]{1,4}\\b"))
a

## [[1]]
##  [1] "cat"  "cats" "dogs" "dog"  "c"    "c"    "c"    "c"    "c"    "c"

x2<-c("cat cats dogs DOG cAt Dog CatDog dog 4dog DOGSS dogss catsss c$c")
b<-(str_extract_all(x2, "\\b[a-z]{1,4}\\b"))
b

## [[1]]
## [1] "cat"  "cats" "dogs" "dog"  "c"    "c"
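The difference between punctuation and digits next to a word can be isolated in a shorter string. This is a minimal sketch; the string words and the result found are introduced here for illustration.

```r
library(stringr)

# A digit is a word character, so there is no boundary between "4" and "d"
# or between "g" and "2". A period is not, so "dog." still yields "dog".
words <- "dog 4dog dog2 dog."

found <- str_extract_all(words, "\\b[a-z]{1,4}\\b")[[1]]
found
```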

.*?\.txt$

This regular expression matches any string that ends in .txt: .*? lazily matches any leading characters, \. matches a literal period, and $ anchors txt to the end of the string. It is useful for selecting files by extension.

x<- c("yankee77343.txt", "yankees33.xls", "53$%22-.txt", "side")
x<-(str_extract_all(x,".*?\\.txt$"))
x

## [[1]]
## [1] "yankee77343.txt"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "53$%22-.txt"
## 
## [[4]]
## character(0)
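For the file-selection use case, str_detect() can be used to subset a vector of file names directly. This is a sketch with made-up file names; files and txt_files are names introduced here, and the leading .*? is omitted because it is not needed when only detecting a match.

```r
library(stringr)

# Hypothetical file names; "notes.txt.bak" contains ".txt" but does not end with it.
files <- c("yankee77343.txt", "yankees33.xls", "notes.txt.bak", "53$%22-.txt")

# Keep only the names that end in ".txt".
txt_files <- files[str_detect(files, "\\.txt$")]
txt_files
```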

\d{2}/\d{2}/\d{4}

This regular expression matches strings shaped like dates: two digits, a slash, two digits, a slash, and four digits (XX/XX/XXXX). It checks only the format, so it cannot tell whether the first pair is the month or the day, and it does not verify that the date is valid.

x<-"8/7/1985 06/12/1950 2018/25/10 30/03/2015 05-02-2018"
a<-(str_extract_all(x,"\\d{2}/\\d{2}/\\d{4}"))
a

## [[1]]
## [1] "06/12/1950" "30/03/2015"
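Because the pattern has no anchors, it can also match inside a longer run of digits. One possible tightening, shown here as a sketch and not as part of the exercise, is to add word boundaries. The names loose and bounded and the test string are introduced for illustration.

```r
library(stringr)

# Hypothetical input: the first token is not a date but contains a date-shaped piece.
x <- "123/45/67890 06/12/1950"

# Original pattern: also picks up "23/45/6789" from inside the longer token.
loose <- str_extract_all(x, "\\d{2}/\\d{2}/\\d{4}")[[1]]

# With \b on both sides, the digits must start and end at a word boundary.
bounded <- str_extract_all(x, "\\b\\d{2}/\\d{2}/\\d{4}\\b")[[1]]

loose
bounded
```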

<(.+?)>.+?</\1>

This regular expression matches text wrapped in a matching pair of HTML-style tags. <(.+?)> captures the name of the opening tag, .+? matches the content, and </\1> uses a backreference to require a closing tag with the same name. It is useful for pulling tagged elements out of HTML text.

x<-c("<<body>Hello world</body> hello world")
a<-(str_extract_all(x,"<(.+?)>.+?</\\1>"))
a

## [[1]]
## [1] "<body>Hello world</body>"
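The role of the backreference can be made visible with str_match(), which returns the captured group alongside the full match. This is a small sketch; the test strings and the names pattern, m, and mismatched are introduced here for illustration.

```r
library(stringr)

pattern <- "<(.+?)>.+?</\\1>"

# Column 1 holds the full match, column 2 the captured tag name.
m <- str_match("<b>bold</b> and <i>mismatched</b>", pattern)
m

# The closing tag must repeat the captured name, so <i>...</b> does not match.
mismatched <- str_detect("<i>mismatched</b>", pattern)
mismatched
```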

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr

The hint suggested looking at the upper case letters. The punctuation characters were extracted as well, since they separate the words and make the message easier to read.

Message reads: “CONGRATULATIONS YOU ARE A SUPERNERD!”

x<-"clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo
Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO
d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5
fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
x1<-str_extract_all(x,"[:upper:]|[:punct:]")
x1

## [[1]]
##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "." "Y"
## [18] "O" "U" "." "A" "R" "E" "." "A" "." "S" "U" "P" "E" "R" "N" "E" "R"
## [35] "D" "!"
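The extracted characters can also be assembled into a single string. This is a sketch and not part of the original answer. The names code, pieces, and decoded are introduced here, the character class is written in the bracketed form [[:upper:][:punct:]], and the periods are assumed to stand for spaces.

```r
library(stringr)

# The coded text, joined from its four lines without line breaks.
code <- str_c(
  "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo",
  "Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO",
  "d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5",
  "fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
)

# Keep upper case letters and punctuation, then glue them together.
pieces <- str_extract_all(code, "[[:upper:][:punct:]]")[[1]]

# Assumption: each period marks a word break, so replace it with a space.
decoded <- str_replace_all(str_c(pieces, collapse = ""), "\\.", " ")
decoded
```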