Chapter 8: Regular Expressions

4.

  1. +: match one or more times
    \$ match dollar sign
    Example:
library(stringr)
obj <- "1$ 12$  1234$"
unlist(str_extract_all (obj, "[0-9]+\\$"))
## [1] "1$"    "12$"   "1234$"
  1. “\b” match empty string at either edge (boundary) of a word
    [a-z]: match any single lower case ascii character
    {1,4}: match [a-z] at least 1 time but no more than 4 times
    Example:
obj <- "a bc def ghij klmno"
unlist(str_extract_all (obj, "\\b[a-z]{1,4}\\b"))
## [1] "a"    "bc"   "def"  "ghij"
  1. “.” match any single character
    “*" repeats the “.” 0 or more times
    “?” is a qualifier that matches the preceding pattern 0 times or once
    “\.txt$” matches the literal “.txt” at the end of line
    Example:
obj <- ".txt b 52.txt d&e.txt"
unlist(str_extract_all (obj, ".*?\\.txt$"))
## [1] ".txt b 52.txt d&e.txt"
  1. This expression matches single digits exactly twice; followed by “/”, followed by single digits exactly twice, followed by “/”, followed by single digits exactly four times
    Example:
obj <- "09/16/2015 99/99/1999  1234/56/78"
unlist(str_extract_all (obj, "\\d{2}/\\d{2}/\\d{4}"))
## [1] "09/16/2015" "99/99/1999"
#Note: This is not a good regex for dates. [01]\d[- /.][0-3]\d[- /.]\d\d is better, though not perfect.
  1. This expression matches character “<”, followed by (.+?) grouping: match any single charcter, one or more times, and the preceding patten is optional.
    This is followed by character “>”, followed by any single character, one or more times, optional.
    This is followed by “<”, followed by the same pattern as enclosed in the parenthesis before, then “/>”.
    Example:
obj <- "<b>Week 4</b> <b>2015</b> <title></title> <></>" 
unlist(str_extract_all (obj, "<(.+?)>.+?</\\1>")) #Note: The tag cannot be empty.
## [1] "<b>Week 4</b>" "<b>2015</b>"

5.

[:digit:]{1,}[$]

obj <- "1$ 12$  1234$"
unlist(str_extract_all (obj, "[:digit:]{1,}[$]"))
## [1] "1$"    "12$"   "1234$"

6.

mail <- "chunkylover53[at]aol[dot]com"
mail <- sub("[at]", "@", mail, fixed=TRUE) 
email <- sub("[dot]", ".", mail, fixed=TRUE) 
email
## [1] "chunkylover53@aol.com"
  1. “[:digit:]” will extract a single digit. To extract the digits we need to match one or more times:
unlist(str_extract_all(email, "[:digit:]+"))
## [1] "53"
  1. “\D” will extract non-digits. The correct predifined symbol is \d, matched one or more times.
unlist(str_extract_all(email, "\\d+"))
## [1] "53"