Week 4

library(stringr)
obj <- "1$ 12$  1234$"
unlist(str_extract_all (obj, "[0-9]+\\$"))

## [1] "1$"    "12$"   "1234$"

“\b” matches the empty string at either edge (boundary) of a word
[a-z]: matches any single lowercase ASCII character
{1,4}: matches [a-z] at least 1 time but no more than 4 times
Example:

obj <- "a bc def ghij klmno"
unlist(str_extract_all (obj, "\\b[a-z]{1,4}\\b"))

## [1] "a"    "bc"   "def"  "ghij"
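To see what the word boundaries contribute, the same pattern can be run with and without “\b”. This is only a minimal sketch; the variable names with_boundary and without_boundary are illustrative and not part of the original exercise.

```r
library(stringr)

obj <- "a bc def ghij klmno"

# With \\b on both sides, a run of letters only matches when the
# whole word is 1 to 4 letters long, so "klmno" is skipped.
with_boundary <- unlist(str_extract_all(obj, "\\b[a-z]{1,4}\\b"))

# Without \\b, the five-letter word is split into a four-letter
# chunk ("klmn") and a leftover single letter ("o").
without_boundary <- unlist(str_extract_all(obj, "[a-z]{1,4}"))

print(with_boundary)
print(without_boundary)
```

Without the boundaries, the result additionally contains "klmn" and "o", which are pieces of a longer word rather than words of at most four letters.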

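“.” matches any single character
“*” repeats the “.” 0 or more times
“?” after “*” makes the repetition lazy, so it matches as few characters as possible
“\.txt$” matches the literal “.txt” at the end of the string
Because matching starts at the first character and “$” forces the match to end at the end of the string, the whole string is returned even though the repetition is lazy.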
Example:

obj <- ".txt b 52.txt d&e.txt"
unlist(str_extract_all (obj, ".*?\\.txt$"))

## [1] ".txt b 52.txt d&e.txt"
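The effect of the “?” after “*” only becomes visible once the “$” anchor is removed. The following sketch compares the lazy and the greedy form on the same string; the variable names lazy and greedy are illustrative.

```r
library(stringr)

obj <- ".txt b 52.txt d&e.txt"

# Lazy, no anchor: each match stops at the first ".txt" it can reach,
# so the string is cut into three pieces.
lazy <- unlist(str_extract_all(obj, ".*?\\.txt"))

# Greedy, no anchor: a single match that runs to the last ".txt".
greedy <- unlist(str_extract_all(obj, ".*\\.txt"))

print(lazy)
print(greedy)
```

The lazy form returns ".txt", " b 52.txt" and " d&e.txt" (the leading spaces are part of the matches), while the greedy form returns the whole string as one match.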

This expression matches exactly two digits, followed by “/”, followed by exactly two digits, followed by “/”, followed by exactly four digits.
Example:

obj <- "09/16/2015 99/99/1999  1234/56/78"
unlist(str_extract_all (obj, "\\d{2}/\\d{2}/\\d{4}"))

## [1] "09/16/2015" "99/99/1999"

#Note: This is not a good regex for dates, since it accepts values such as 99/99/1999. [01]\d[- /.][0-3]\d[- /.]\d{4} is better, though not perfect: it still accepts impossible dates such as 19/39/2015.
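As a rough check of the note, a variant of the stricter pattern can be run on the same string. The sketch below writes the year as “\d{4}” so that four-digit years are matched in full; that choice, and the variable names strict, on_obj and still_accepted, are assumptions made for this illustration.

```r
library(stringr)

obj <- "09/16/2015 99/99/1999  1234/56/78"

# First digit of the month limited to 0-1, first digit of the day to 0-3;
# the separator may be "-", " ", "/" or ".".
strict <- "[01]\\d[- /.][0-3]\\d[- /.]\\d{4}"

# "99/99/1999" is now rejected because the month starts with 9.
on_obj <- unlist(str_extract_all(obj, strict))

# The pattern is still not a real date check: month 19 and day 39 pass.
still_accepted <- str_detect("19/39/2015", strict)

print(on_obj)
print(still_accepted)
```

Only "09/16/2015" is extracted from obj, but the second check shows why the pattern is still not perfect. Proper validation would need to parse the date rather than rely on a regex alone.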

This expression matches the character “<”, followed by the group (.+?): any single character, one or more times, matched lazily (as few characters as possible).
This is followed by the character “>”, followed by any single character, one or more times, again matched lazily.
This is followed by “</”, followed by the same text that was captured by the parentheses (the backreference \1), then “>”.
Example:

obj <- "<b>Week 4</b> <b>2015</b> <title></title> <></>" 
unlist(str_extract_all (obj, "<(.+?)>.+?</\\1>")) #Note: Neither the tag name nor the tag content can be empty.

## [1] "<b>Week 4</b>" "<b>2015</b>"
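To inspect what the group and the backreference actually capture, str_match_all() can be used instead of str_extract_all(). In this sketch a second pair of parentheses is added around the tag content; that extra group and the variable names m, tag_names and contents are additions for illustration only.

```r
library(stringr)

obj <- "<b>Week 4</b> <b>2015</b> <title></title> <></>"

# str_match_all() returns a list with one character matrix per input string:
# column 1 is the full match, the following columns are the capture groups.
m <- str_match_all(obj, "<(.+?)>(.+?)</\\1>")[[1]]

tag_names <- m[, 2]  # text captured by group 1 and reused by \\1
contents  <- m[, 3]  # text between the opening and the closing tag

print(tag_names)
print(contents)
```

Both matches have the tag name "b", and the contents are "Week 4" and "2015". The pairs “<title></title>” and “<></>” do not appear because “.+?” requires at least one character.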

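“[:digit:]{1,}[$]” is equivalent to “[0-9]+\$”: “[:digit:]” replaces “[0-9]”, “{1,}” replaces “+”, and “[$]” replaces “\$”.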

obj <- "1$ 12$  1234$"
unlist(str_extract_all (obj, "[:digit:]{1,}[$]"))

## [1] "1$"    "12$"   "1234$"

mail <- "chunkylover53[at]aol[dot]com"
mail <- sub("[at]", "@", mail, fixed=TRUE) 
email <- sub("[dot]", ".", mail, fixed=TRUE) 
email

## [1] "chunkylover53@aol.com"
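The argument fixed=TRUE matters here because “[at]” would otherwise be read as a character class that matches a single “a” or “t”. A small sketch of the difference, using base R's sub() as above (the variable names as_regex and as_literal are illustrative):

```r
mail <- "chunkylover53[at]aol[dot]com"

# Interpreted as a regex: "[at]" is a character class, so only the
# first "a" in the string is replaced and the brackets stay in place.
as_regex <- sub("[at]", "@", mail)

# Interpreted literally: the four characters "[at]" are replaced as a unit.
as_literal <- sub("[at]", "@", mail, fixed = TRUE)

print(as_regex)
print(as_literal)
```

The regex version yields "chunkylover53[@t]aol[dot]com", while the literal version yields "chunkylover53@aol[dot]com".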

“[:digit:]” matches only a single digit. To extract the whole number, the class must be matched one or more times:

unlist(str_extract_all(email, "[:digit:]+"))

## [1] "53"

“\D” matches non-digits. The correct predefined class for digits is “\d”, matched one or more times:

unlist(str_extract_all(email, "\\d+"))

## [1] "53"
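For comparison, the two predefined classes can be run side by side on the same address. This is a minimal sketch; the variable names non_digits and digits are illustrative.

```r
library(stringr)

email <- "chunkylover53@aol.com"

# \\D+ matches runs of non-digit characters, i.e. everything around the number.
non_digits <- unlist(str_extract_all(email, "\\D+"))

# \\d+ matches runs of digits, which is what the task asks for.
digits <- unlist(str_extract_all(email, "\\d+"))

print(non_digits)
print(digits)
```

“\D+” returns "chunkylover" and "@aol.com", the two pieces on either side of the digits, while “\d+” returns "53".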