607 Week3 ALiu-Ferrara

3. Copy the introductory example. The vector name stores the extracted names.

R> name
[1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
[4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

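library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# First name: a word followed by a space, or the final word after ", " or ". "
first <- unlist(str_extract_all(name, '\\w+ |, \\w+$|\\. \\w+$'))
first <- unlist(str_extract_all(first, '\\w+'))  # strip punctuation and spaces
# Last name: a word after a space that directly follows a word, or a word followed by a comma
last <- unlist(str_extract_all(name, '\\b \\w+|\\w+,'))
last <- unlist(str_extract_all(last, '\\w+'))    # strip punctuation and spaces
full_name <- paste(first, last)
full_name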

## [1] "Moe Szyslak"      "Montgomery Burns" "Timothy Lovejoy" 
## [4] "Ned Flanders"     "Homer Simpson"    "Julius Hibbert"
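The same rearrangement can also be sketched with replacements instead of extractions. This is only an illustrative alternative, not the approach used above; the intermediate names `no_title`, `no_initial`, and `swapped` are hypothetical, and it assumes that titles are two or more letters followed by a dot and that second names appear as a single uppercase initial.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Step 1: drop a leading title such as "Rev. " or "Dr. "
# (assumption: two or more letters followed by a dot).
no_title <- str_replace(name, "^[A-Za-z]{2,}\\.\\s+", "")

# Step 2: drop a single-letter initial such as "C. ".
no_initial <- str_replace(no_title, "\\b[A-Z]\\.\\s+", "")

# Step 3: turn "Last, First" into "First Last"; other names stay unchanged.
swapped <- str_replace(no_initial, "^(\\w+),\\s+(\\w+)$", "\\2 \\1")
swapped
```

On this vector, `swapped` should equal the `full_name` result shown above. Because each step edits the string in place, the six elements stay aligned without relying on `unlist()` returning exactly one match per name.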

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

title <- str_detect(name, 'Rev\\.|Dr\\.')
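In a regular expression an unescaped `.` matches any character, so it is worth checking how a title pattern behaves on names that merely start with the same letters. The sketch below compares a loose and a strict pattern; the vector `candidates` and the name "Drew Smith" are made up for illustration.

```r
library(stringr)

# "Drew Smith" is a hypothetical name added to expose the difference.
candidates <- c("Rev. Timothy Lovejoy", "Dr. Julius Hibbert",
                "Drew Smith", "Ned Flanders")

# Unescaped dot: "Dr." also matches "Dre" in "Drew".
loose <- str_detect(candidates, "Rev.|Dr.")

# Escaped dot, anchored at the start: only real titles match.
strict <- str_detect(candidates, "^(Rev|Dr)\\.")

data.frame(candidates, loose, strict)
```

Only the strict pattern rejects "Drew Smith", which is why the literal dot is escaped.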

Construct a logical vector indicating whether a character has a second name.

second_name <- str_detect(name, ' [A-Z]\\. ')
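To see which element is flagged, the logical vector can be labelled with the names it was computed from. This is a minimal sketch that assumes a second name always appears as a single uppercase initial surrounded by spaces, such as " C. ".

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Assumption: a second name is a single uppercase initial, e.g. " C. ".
second_name <- str_detect(name, " [A-Z]\\. ")

# Label each TRUE/FALSE with the name it belongs to.
setNames(second_name, name)
```

Only "Burns, C. Montgomery" is TRUE. The titles "Rev." and "Dr." are not flagged because they have no space before them.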

4. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

[0-9]+\$

One or more digits followed by a literal dollar sign.

unlist(str_extract_all('212$13 33$', '[0-9]+\\$'))

## [1] "212$" "33$"
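The backslash matters here: without it, `$` is an anchor for the end of the string rather than a literal character. A small sketch of the difference, using a made-up test string held in the hypothetical variable `x`:

```r
library(stringr)

x <- "212$13 33$ 7"  # made-up test string

# Escaped: digits followed by a literal "$".
with_dollar <- str_extract_all(x, "[0-9]+\\$")[[1]]

# Unescaped: digits located at the very end of the string.
at_end <- str_extract_all(x, "[0-9]+$")[[1]]

with_dollar
at_end
```

The escaped pattern returns "212$" and "33$", while the unescaped one returns only the trailing "7".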

\b[a-z]{1,4}\b

A standalone word made of one to four lowercase letters, delimited by word boundaries on both sides.

unlist(str_extract_all('a wordExample4 2 word', '\\b[a-z]{1,4}\\b'))

## [1] "a"    "word"
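The word boundaries are what exclude longer words and words glued to other word characters. The sketch below extends the test string with two made-up words, "hello" and "cat", to show the length limit; `short_words` is a hypothetical variable name.

```r
library(stringr)

# "hello" has five letters and "wordExample4" continues past "word",
# so neither offers a boundary after one to four lowercase letters.
short_words <- str_extract_all("a wordExample4 2 word hello cat",
                               "\\b[a-z]{1,4}\\b")[[1]]
short_words
```

The result is "a", "word", and "cat". The digit "2" is skipped because it is not in `[a-z]`.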

.*?\.txt$

Any string that ends in .txt, with zero or more arbitrary characters before the extension.

unlist(str_extract_all('a wordExample4 2 word.txt', '.*?\\.txt$'))

## [1] "a wordExample4 2 word.txt"

unlist(str_extract_all('.txt', '.*?\\.txt$'))

## [1] ".txt"
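Two details of this pattern are easy to miss: the `$` anchor rejects strings where .txt is not at the very end, and the lazy `.*?` does not shorten the match, because the match starts at the first possible position and must still reach the end of the string. A sketch with made-up file names (`files`, `ends_in_txt`, and `lazy_match` are hypothetical names):

```r
library(stringr)

files <- c("report.txt", "report.txt.bak", "txt")  # made-up file names

# Only a string that truly ends in ".txt" matches.
ends_in_txt <- str_detect(files, ".*?\\.txt$")

# The lazy quantifier still spans both names, since the match begins
# at the first character and has to extend to the end of the string.
lazy_match <- str_extract("a.txt b.txt", ".*?\\.txt$")

ends_in_txt
lazy_match
```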

\d{2}/\d{2}/\d{4}

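Dates written as two digits, two digits, and four digits separated by '/', for example dd/mm/yyyy. The pattern checks only the shape of the string, not whether the date is valid.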

unlist(str_extract_all('23/02/1902 23/12/1998', '\\d{2}/\\d{2}/\\d{4}'))

## [1] "23/02/1902" "23/12/1998"
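Since the pattern constrains only the layout of digits and slashes, an impossible date still matches. The sketch below pairs the regex with base R's `as.Date()` as one possible validity check; the dd/mm/yyyy reading of the format and the test values in `dates` are assumptions.

```r
library(stringr)

dates <- c("23/02/1902", "99/99/9999", "3/2/1902", "2023/02/19")  # made-up values

# Shape check: two digits, two digits, four digits.
shape_ok <- str_detect(dates, "\\d{2}/\\d{2}/\\d{4}")

# Validity check (assumption: day/month/year order); invalid dates become NA.
parsed <- as.Date(dates, format = "%d/%m/%Y")

data.frame(dates, shape_ok, valid = shape_ok & !is.na(parsed))
```

"99/99/9999" passes the regex but fails to parse, so it is flagged as invalid in the last column.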

<(.+?)>.+?</\1>

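An HTML- or XML-style element: an opening tag such as <tag>, at least one character of content, and a closing tag </tag> whose name repeats the opening tag's name through the backreference \1.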

unlist(str_extract_all('<html>content</html>, <tag>text</tag>', '<(.+?)>.+?</\\1>'))

## [1] "<html>content</html>" "<tag>text</tag>"
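The backreference `\1` is what forces the closing tag to repeat the opening tag's name, and `.+?` requires at least one character of content. A sketch with made-up snippets (`snippets`, `matched`, and `tag_name` are hypothetical names):

```r
library(stringr)

pattern <- "<(.+?)>.+?</\\1>"
snippets <- c("<b>bold</b>", "<b>bold</i>", "<p></p>")  # made-up inputs

# Mismatched tag names and empty elements do not match.
matched <- str_detect(snippets, pattern)

# Column 2 of str_match() holds the first capture group, the tag name.
tag_name <- str_match("<tag>text</tag>", pattern)[, 2]

matched
tag_name
```

Only the first snippet matches, and the captured tag name is "tag".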

Reference: https://github.com/wwells/CUNY_DATA_607/blob/master/Week3/Regex_Week3.Rmd

Ann Liu-Ferrara

February 18, 2017