Source file ⇒ lec22.Rmd

Today

  1. Finish regular expressions (chapter 16)

Regular expressions

Here again is a useful cheat sheet

Tasks for you

Write a regular expression that matches

  1. Any number (ex -56.899 or 0)
# c() coerces every entry to character, so +12.4 becomes "12.4" and both .000 and -0 become "0"
x <- c(1, +12.4, -56.899, 0, -23, .000, -0, "4.5.3")
grepl("^[-+]?[0-9]+\\.?[0-9]*$",x)
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
#or
grepl("^[-+]?[[:digit:]]+\\.?[[:digit:]]*$",x)
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
  2. Any line with exactly two words separated by any amount of whitespace (spaces or tabs). There may or may not be whitespace at the beginning or end of the line.
x <- c("   stats rocks", "I love stats")
grepl("^[[:blank:]]*[[:alpha:]]+[[:blank:]]+[[:alpha:]]+[[:blank:]]*$",x)
## [1]  TRUE FALSE
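
The test vector in task 1 mixes numbers with a string, so R converts the numbers to text before matching and the pattern never sees a string such as ".000". The sketch below reruns the same pattern on character inputs. The vector `x_chr` and the looser pattern are illustrative assumptions, not part of the original exercise.

```r
# Character inputs, so nothing is reformatted before matching
x_chr <- c("1", "+12.4", "-56.899", "0", ".000", "5.", "4.5.3")

# Pattern from task 1: needs at least one digit before the decimal point
strict <- grepl("^[-+]?[0-9]+\\.?[0-9]*$", x_chr)

# Hypothetical looser variant: also accepts a bare fraction such as ".000"
loose <- grepl("^[-+]?([0-9]+\\.?[0-9]*|\\.[0-9]+)$", x_chr)

# Side-by-side comparison of the two patterns
data.frame(x_chr, strict, loose)
```

The two columns differ only for ".000", which the task 1 pattern rejects because it requires a leading digit.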

grepl(), gsub(), extractMatches()

Regular expressions are used for several purposes:

  • to detect whether a pattern is contained in a string. Use filter() and grepl()

details:

  1. grepl() searches for matches to regexp within each element of a character vector and returns a logical vector

  2. syntax: grepl(regex,vec)

example: Here is a list of names from the Bible:

BibleNames <- read.file("http://tiny.cc/dcf/BibleNames.csv")
head(BibleNames)
##      name                                 meaning
## 1   Aaron  a teacher; lofty; mountain of strength
## 2 Abaddon                           the destroyer
## 3 Abagtha                father of the wine-press
## 4   Abana               made of stone; a building
## 5  Abarim                    passages; passengers
## 6    Abba                                  father

Which names contain any of the strings “bar”, “dam”, or “lory”?

BibleNames %>%
  filter(grepl("(bar)|(dam)|(lory)", name)) %>%
  head()
##        name                     meaning
## 1    Abarim        passages; passengers
## 2  Aceldama              field of blood
## 3      Adam                 earthy; red
## 4    Adamah         red earth; of blood
## 5     Adami  my man; red; earthy; human
## 6 Bethabara     the house of confidence
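
The same detection step can be tried without downloading the data. This is a minimal base-R sketch that uses only the six names printed by head(BibleNames) above and logical indexing in place of filter(). The vector name `hits` is an assumption for illustration.

```r
# First six names from head(BibleNames) above
name <- c("Aaron", "Abaddon", "Abagtha", "Abana", "Abarim", "Abba")

# TRUE where the name contains "bar", "dam", or "lory"
hits <- grepl("bar|dam|lory", name)

# Logical indexing keeps only the matching names
name[hits]
```

The parentheses around each alternative are optional here, because `|` already separates whole alternatives.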
  • to replace the elements of that pattern with something else. Use mutate() and gsub()

details:

  1. gsub() searches for matches to argument regexp within each element of a character vector and performs replacement of all matches

  2. syntax: gsub(regex, replacement, vec)

Clean up the Debt data table below so that the debt and percentGDP columns contain only numbers:

head(Debt)
##         country            debt percentGDP
## 1         World $56,308 billion        64%
## 2 United States $17,607 billion     73.60%
## 3         Japan  $9,872 billion    214.30%
Debt %>% 
  mutate( debt=gsub("[$,]| billion","",debt),
          percentGDP=gsub("%", "", percentGDP))
##         country  debt percentGDP
## 1         World 56308         64
## 2 United States 17607      73.60
## 3         Japan  9872     214.30
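
After gsub() the cleaned values are still character strings. The base-R sketch below rebuilds the three rows shown by head(Debt) and converts both columns to numeric. It swaps in a negated character class, which removes everything that is not a digit or a decimal point, rather than listing the unwanted pieces. The name `Debt_sample` is hypothetical.

```r
# The three rows shown by head(Debt) above
Debt_sample <- data.frame(
  country    = c("World", "United States", "Japan"),
  debt       = c("$56,308 billion", "$17,607 billion", "$9,872 billion"),
  percentGDP = c("64%", "73.60%", "214.30%"),
  stringsAsFactors = FALSE
)

# Strip every character that is not a digit or a decimal point, then convert
Debt_sample$debt       <- as.numeric(gsub("[^0-9.]", "", Debt_sample$debt))
Debt_sample$percentGDP <- as.numeric(gsub("[^0-9.]", "", Debt_sample$percentGDP))

# Both cleaned columns are now numeric
str(Debt_sample)
```

Because this pattern also removes the blank before "billion", as.numeric() receives clean digit strings.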
  • to pull out the matches for marked regions of a regular expression. Use extractMatches() from the DataComputing package.

details:

  1. input is a data table and output is a data table with one or more additional columns with the extractions

  2. syntax is df %>% extractMatches(regex, var,...)

  3. wrap part of the regexp in parentheses to signal that the matching content is to be extracted as a string

  4. when there is no match, extractMatches() returns NA

example:

data.frame(string=c("hi there there","bye")) %>% extractMatches("r(e)",string)
##           string match1
## 1 hi there there      e
## 2            bye   <NA>
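
For readers without the DataComputing package, the same extraction can be approximated in base R. This is a sketch using regexec() and regmatches(), not the way extractMatches() is implemented. The names `hits` and `match1` are chosen for illustration.

```r
strings <- c("hi there there", "bye")

# regexec() finds the first match and records the position of each group
hits <- regmatches(strings, regexec("r(e)", strings))

# Each element is c(full match, group 1), or character(0) when nothing matched
match1 <- vapply(
  hits,
  function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
  character(1)
)

# One row per input string, with NA where the pattern did not match
data.frame(string = strings, match1 = match1)
```

Only the first match in each string is examined, so the second "there" in the first string is ignored.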

What are the most common end vowels for Bible names?

To answer this question, extract the final letter of each name when it is a vowel. The extractMatches() transformation function can do this; names that end in a consonant get NA.

BibleNames %>% 
  extractMatches( "([aeiou])$", name, vowel=1) %>% 
  group_by(vowel ) %>% 
  summarise( total= n()) %>%
  arrange( desc(total) )
## Source: local data frame [6 x 2]
## 
##    vowel total
##   (fctr) (int)
## 1     NA  2089
## 2      a   233
## 3      i   207
## 4      e    55
## 5      o    24
## 6      u    18
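
The counting step can also be sketched in base R. The example below uses only a handful of names that appear in this document's outputs, so its counts will not match the full table. The variable names are assumptions for illustration.

```r
# A few names that appear in this document's outputs (not the full table)
name <- c("Aaron", "Abaddon", "Abagtha", "Abana", "Abarim",
          "Abba", "Abda", "Abdi", "Abednego")

# Final letter when it is a vowel, NA otherwise
ends_in_vowel <- grepl("[aeiou]$", name)
vowel <- ifelse(ends_in_vowel, sub("^.*(.)$", "\\1", name), NA_character_)

# Count each ending, most common first; useNA keeps the NA group
sort(table(vowel, useNA = "ifany"), decreasing = TRUE)
```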

Which Bible names start and end with a vowel?

BibleNames %>% 
  extractMatches( "^([AEIOU]).*([aeiou])$", name, beg_vowel=1, end_vowel=2) %>% 
  filter(!is.na(beg_vowel) & !is.na(end_vowel)) %>%
  select(-meaning) %>%
  head()
##       name beg_vowel end_vowel
## 1  Abagtha         A         a
## 2    Abana         A         a
## 3     Abba         A         a
## 4     Abda         A         a
## 5     Abdi         A         i
## 6 Abednego         A         o

Tasks for you

  1. Which Bible names in BibleNames have the word “man” in the meaning (not as the first or last word)?
BibleNames %>% 
  filter(grepl("[[:blank:]]+man[[:blank:]]+",meaning))
##         name                               meaning
## 1 Andronicus                a man excelling others
## 2  Elkoshite                     a man of Elkeshai
## 3   Iscariot           a man of murder; a hireling
## 4 Ishbosheth                        a man of shame
## 5    Lebbeus  a man of heart; praising; confessing
  2. Grab everything but the last letter of names in BibleNames that end with a vowel, counting y as a vowel (example: grab Ev from Eva)
BibleNames %>% 
  extractMatches(pattern="(.*)[aeiouy]$", name, root=1) %>% 
  filter(!is.na(root)) %>%
  select(-meaning) %>%
  head()
##       name    root
## 1  Abagtha  Abagth
## 2    Abana    Aban
## 3     Abba     Abb
## 4     Abda     Abd
## 5     Abdi     Abd
## 6 Abednego Abedneg
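
Both task patterns can be checked on small inputs in base R. In this sketch the strings come from the outputs above, except "man of war", which is made up to show a leading "man". The name `middle_man` is likewise an assumption.

```r
# Task 1: "man" needs a blank on both sides, so it cannot be the first or last word
meaning <- c("a man of shame", "man of war", "my man; red; earthy; human")
middle_man <- grepl("[[:blank:]]+man[[:blank:]]+", meaning)
middle_man

# Task 2: drop a final vowel (y included) to get the root, NA when there is none
name <- c("Abagtha", "Abdi", "Abednego", "Aaron")
root <- ifelse(grepl("[aeiouy]$", name), sub("[aeiouy]$", "", name), NA_character_)
data.frame(name, root)
```

The third meaning is rejected because "man" is followed by a semicolon rather than a blank, which is consistent with Adami being absent from the task 1 output.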