Source file ⇒ 2017-lec13.Rmd

Today

  1. Regular expressions (chapter 16 DC book) (end of material for midterm)

Regular expressions

Here again is a useful cheat sheet

or here are some basics:

example

Match all of the strings in the vector c("a","ab","abb","abbb") but not the strings in c("ca","abbd").

Soln:

First try

grepl("ab*",c("a","ab", "abb","abbb", "ca","abbd"))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

Second try

grepl("^ab*$",c("a","ab", "abb","abbb", "ca","abbd"))
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

or

grepl("(^a$|^ab$|^abb$|^abbb$)",c("a","ab", "abb","abbb", "ca","abbd"))
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
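
One way to see why the first try fails is to look at what the unanchored pattern actually matched. The sketch below uses base R's regexpr() and regmatches(), which are not used elsewhere in these notes, to print the matched substring for each string.

```r
x <- c("a", "ab", "abb", "abbb", "ca", "abbd")

# regexpr() locates the first match in each string;
# regmatches() pulls out the matched text.
matched <- regmatches(x, regexpr("ab*", x))
matched

# TRUE only where the match covers the whole string
matched == x
```

For "ca" the match is just "a", and for "abbd" it is "abb". The anchors ^ and $ force the match to cover the whole string.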

example

Sometimes we want to combine regular expressions with R expressions:

  1. Match strings with either Q or O but not both (exclusive OR):

x <- c("OQ", "Q","Or","5")
(O <- grepl("O",x))
## [1]  TRUE FALSE  TRUE FALSE
(Q <- grepl("Q",x))
## [1]  TRUE  TRUE FALSE FALSE
xor(O,Q)
## [1] FALSE  TRUE  TRUE FALSE

  2. Match strings with exactly one Q and exactly one O:

first match strings with exactly one Q

(Q <- grepl("^[^Q]*Q[^Q]*$",c("OQ", "Q","Or","QQO","5QOO")))
## [1]  TRUE  TRUE FALSE FALSE  TRUE

then match strings with exactly one O

(O <- grepl("^[^O]*O[^O]*$",c("OQ", "Q","Or","QQO","5QOO")))
## [1]  TRUE FALSE  TRUE  TRUE FALSE

Then combine the two logical vectors with R's elementwise AND operator, &.

#c("OQ", "Q","Or","QQO","5QOO")
O & Q
## [1]  TRUE FALSE FALSE FALSE FALSE
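
As a cross-check on the two patterns, the occurrences can also be counted directly. This is a minimal sketch in base R; count_char() is a hypothetical helper written for this example, and it assumes ch is a single literal character.

```r
x <- c("OQ", "Q", "Or", "QQO", "5QOO")

# Hypothetical helper: how many times the literal character `ch`
# occurs in each element of `x`.
count_char <- function(ch, x) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

count_char("Q", x)
count_char("O", x)

# Exactly one Q and exactly one O
both_once <- count_char("Q", x) == 1 & count_char("O", x) == 1
both_once
```

Only "OQ" has exactly one of each, which agrees with the result of O & Q.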

example:

In your work as a police detective, you have access to a database of car registrations. The data table CarRegistrations looks like this:

CarRegistrations <- data.frame(c("Joe Smith","Tim Allen"),c("Ford","Toyota"),c("Taurus","Prius"),c(2007,2011),c("metalic gray","blue"),c("337 HBQ","843 OSS"),c(98739,93134))
names(CarRegistrations) <- c("owner","brand","model","model_year","color","plate","zip")
CarRegistrations
owner      brand   model   model_year  color         plate    zip
Joe Smith  Ford    Taurus  2007        metalic gray  337 HBQ  98739
Tim Allen  Toyota  Prius   2011        blue          843 OSS  93134

A hit-and-run collision has been reported. The witness reports seeing a blue car with a license plate starting with either 3 or 8, followed somewhere by the letter O or Q, and then by another character that is either S or 8. You are in the Santa Barbara, CA area, and hit-and-run incidents tend to be local, so you’ll search first for vehicles registered in zip codes starting with 931.

Write a filter() expression that will extract matches to the information at hand.

Soln:

The expression will look like this

CarRegistrations %>%
  filter( grepl("blue", color), # blue car
          grepl("^[38].*[OQ].*[S8].*", plate), # license plate
          grepl("^931..$", as.character(zip)) # zip code
          )
owner      brand   model  model_year  color  plate    zip
Tim Allen  Toyota  Prius  2011        blue   843 OSS  93134

The order of the grepl() statements doesn’t matter; they must all be true for a case to pass through the filter.
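
To check the plate pattern on its own, here is a small sketch with a few plates. Apart from the two plates in CarRegistrations, the plates are made up for illustration. The trailing .* from the filter is dropped because grepl() does not require the match to reach the end of the string.

```r
# The first two plates come from CarRegistrations; the rest are hypothetical.
plates <- c("843 OSS", "337 HBQ", "3Q8 XYZ", "8SA QAB", "138 OSS")

# Starts with 3 or 8, then O or Q somewhere later, then S or 8 after that
plate_ok <- grepl("^[38].*[OQ].*[S8]", plates)
plate_ok
```

"8SA QAB" fails because its S comes before the Q, and "138 OSS" fails because it does not start with 3 or 8.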

In Class exercise

Do problem 1:

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-13-collection/

The metacharacters in Regular Expressions are:

. \ | ( ) [ ] { ^ $ * + ?

Square brackets [] define a character class. For example, the character class [ab] means a or b. The only special characters (metacharacters) inside a character class are the closing bracket ], the backslash \, the caret ^, and the hyphen -. The usual metacharacters are ordinary characters inside a character class and do not need to be escaped with a backslash. Inside a class, a leading ^ negates the class, - between two characters denotes a range, and \ escapes the next character (written \\ in an R string).

Example:

grepl("[a$]", c("hello$bye", "hellobye$","adam"))  #$ has literal meaning inside []
## [1] TRUE TRUE TRUE
grepl("[a$]$", c("hello$bye","hellobye$", "adam")) #$ is metacharacter outside of []
## [1] FALSE  TRUE FALSE
grepl("[a\\^]", c("5^"))
## [1] TRUE
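
The same contrast can be shown with the dot, along with negation and a range. This is a small sketch with made-up strings.

```r
x <- c("a.b", "ab", "a^b")

any_char   <- grepl(".", x)      # . outside a class: any character
esc_dot    <- grepl("\\.", x)    # escaped: a literal dot
class_dot  <- grepl("[.]", x)    # inside a class: also a literal dot
class_hat  <- grepl("[\\^]", x)  # escaped ^ inside a class: a literal caret

# Leading ^ negates the class, a-c is a range:
# strings made only of characters other than a, b, c
no_abc <- grepl("^[^a-c]+$", c("xyz", "abc", "xbz"))

any_char; esc_dot; class_dot; class_hat; no_abc
```

Note the two roles of ^ in the last pattern: outside the brackets it anchors the match to the start of the string, and as the first character inside the brackets it negates the class.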

There are also built-in character sets for commonly used collections.

POSIX bracket expressions are a special kind of character class. They match one character out of a set of characters, just like regular character classes, and they use the same syntax with square brackets. A hyphen creates a range, and a caret at the start negates the bracket expression. A named class such as [:digit:] is written inside the outer brackets, as in [[:digit:]].

Example:

x <- c("a0b", "a1b","a12b","a&b")
grepl("a[[:digit:]]b",x)
## [1]  TRUE  TRUE FALSE FALSE
grepl("a[[:digit:]]+b",x)
## [1]  TRUE  TRUE  TRUE FALSE
grepl("a[^[:digit:]]+b",x)
## [1] FALSE FALSE FALSE  TRUE
x <- c (3,"5","a","b","?")
grepl("[[:digit:]abc]",x)
## [1]  TRUE  TRUE  TRUE  TRUE FALSE
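
A few of the other named classes can be tried the same way. This is a short sketch with made-up strings; only [[:digit:]], [[:alpha:]] and [[:blank:]] appear elsewhere in these notes.

```r
x <- c("abc", "ABC", "a b", "a-b", "123")

only_letters <- grepl("^[[:alpha:]]+$", x)  # letters only
has_space    <- grepl("[[:space:]]", x)     # any whitespace character
has_punct    <- grepl("[[:punct:]]", x)     # any punctuation character
only_alnum   <- grepl("^[[:alnum:]]+$", x)  # letters and digits only

only_letters; has_space; has_punct; only_alnum
```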

example:

Write a regular expression that matches any string with exactly two words separated by any amount of whitespace (spaces or tabs). There may or may not be whitespace at the beginning or end of the line.

x <- c("   stats rocks", "I love stats")
grepl("^[[:blank:]]*[[:alpha:]]+[[:blank:]]+[[:alpha:]]+[[:blank:]]*$",x)
## [1]  TRUE FALSE
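
It is worth trying the pattern on a few more cases, including tabs, trailing whitespace, and punctuation. The test strings below are made up for this sketch.

```r
two_words <- "^[[:blank:]]*[[:alpha:]]+[[:blank:]]+[[:alpha:]]+[[:blank:]]*$"

tests <- c("stats rocks",          # two words, single space
           "  stats\trocks  ",     # tab separator, padding on both sides
           "stats",                # one word
           "stats rocks!",         # punctuation is not [[:alpha:]]
           "I love stats")         # three words

is_two_words <- grepl(two_words, tests)
is_two_words
```

Only the first two strings match. Because [[:alpha:]] covers letters only, a "word" containing digits or punctuation is rejected by this pattern.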

grepl(), gsub(), extractMatches()

Regular expressions are used for several purposes:

  • to detect whether a pattern is contained in a string. Use filter() and grepl()

details:

  1. grepl() searches for matches to regexp within each element of a character vector and returns a logical vector

  2. syntax: grepl(regex,vec)
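
A minimal sketch of grepl() next to its base R relative grep(), using a made-up vector. grepl() returns one TRUE or FALSE per element, which is what filter() needs, while grep() returns the positions (or values) of the matching elements.

```r
fruits <- c("apple", "Banana", "cherry")

is_match  <- grepl("an", fruits)                 # logical vector, one per element
positions <- grep("an", fruits)                  # indices of matching elements
values    <- grep("an", fruits, value = TRUE)    # the matching elements themselves

# Matching is case sensitive unless ignore.case = TRUE
starts_b <- grepl("^b", fruits, ignore.case = TRUE)

is_match; positions; values; starts_b
```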

example: Here is a list of names from the Bible:

BibleNames <- read.csv("http://tiny.cc/dcf/BibleNames.csv")
head(BibleNames)
name meaning
Aaron a teacher; lofty; mountain of strength
Abaddon the destroyer
Abagtha father of the wine-press
Abana made of stone; a building
Abarim passages; passengers
Abba father

Which names contain any of the substrings “bar”, “dam”, or “lory”?

BibleNames %>%
  filter(grepl("(bar)|(dam)|(lory)", name)) %>%
  head()
name meaning
Abarim passages; passengers
Aceldama field of blood
Adam earthy; red
Adamah red earth; of blood
Adami my man; red; earthy; human
Bethabara the house of confidence

  • to replace the elements of that pattern with something else. Use mutate() and gsub()

details:

  1. gsub() searches for matches to argument regexp within each element of a character vector and performs replacement of all matches

  2. syntax: gsub(regex, replacement, vec)

Clean up the Debt data table below so it only has numbers

head(Debt)
country debt percentGDP
World $56,308 billion 64%
United States $17,607 billion 73.60%
Japan $9,872 billion 214.30%

Debt %>% 
  mutate( debt=gsub("[$,]|billion","",debt),
          percentGDP=gsub("%", "", percentGDP))
country debt percentGDP
World 56308 64
United States 17607 73.60
Japan 9872 214.30
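
After gsub() the columns are still character strings. The base R sketch below uses the three debt values shown above. It contrasts gsub() with sub(), which replaces only the first match, and then converts the cleaned strings with as.numeric().

```r
debt <- c("$56,308 billion", "$17,607 billion", "$9,872 billion")

# sub() stops after the first match, so only the $ is removed
first_only <- sub("[$,]|billion", "", debt)
first_only

# gsub() replaces every match; a trailing space is left behind,
# which as.numeric() ignores
cleaned <- gsub("[$,]|billion", "", debt)
debt_num <- as.numeric(cleaned)
debt_num
```

The same conversion could be done inside mutate() by wrapping the gsub() call in as.numeric().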

  • to extract matches for marked regions of regular expressions. Use DataComputing::extractMatches()

Details:

  1. input is a data table and output is a data table with one or more additional columns with the extractions

  2. syntax is df %>% extractMatches(regex, var)

  3. wrap part of the regexp in parentheses to signal that the matching content is to be extracted as a string

  4. when there is no match extractMatches() returns NA
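
For readers without the DataComputing package, a rough base R analogue of the idea is sketched below. It is not how extractMatches() is implemented; it only illustrates extracting the parenthesized part of a match with regexec() and regmatches(), and returning NA when there is no match.

```r
strings <- c("hi there there", "bye")

# regexec() finds the first match and its parenthesized groups.
# For each string, regmatches() returns the full match followed by
# the groups, or character(0) when nothing matched.
m <- regmatches(strings, regexec("r(e)", strings))

# Keep the first group, or NA when there was no match
match1 <- vapply(m, function(v) if (length(v) >= 2) v[2] else NA_character_,
                 character(1))

data.frame(string = strings, match1 = match1)
```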

example:

data.frame(string=c("hi there there","bye")) %>% extractMatches("r(e)",string)
string          match1
hi there there  e
bye             NA

example:

Which vowels most commonly end Bible names?

To answer this question, extract the final letter of each name when it is a vowel. The extractMatches() transformation function can do this; names that do not end in a vowel get NA.

BibleNames %>% 
  extractMatches( "([aeiou])$", name, vowel=1) %>% 
  group_by(vowel ) %>% 
  summarise( total= n()) %>%
  arrange( vowel, desc(total) )
vowel total
a 233
e 55
i 207
o 24
u 18
NA 2089
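
The same tally can be sketched in base R on a handful of names. The six names below are a small sample chosen for illustration, not the full BibleNames table, so the counts are not comparable to the output above. Sorting by count puts the most common ending first.

```r
sample_names <- c("Aaron", "Abba", "Abdi", "Abednego", "Adam", "Abana")

# Final letter of each name, kept only when it is a lowercase vowel
ends_in_vowel <- grepl("[aeiou]$", sample_names)
last_letter   <- substring(sample_names, nchar(sample_names))
vowel         <- ifelse(ends_in_vowel, last_letter, NA)

# Count each ending (including NA) and sort by decreasing count
sort(table(vowel, useNA = "ifany"), decreasing = TRUE)
```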

Which Bible names start and end with a vowel?

BibleNames %>% 
  extractMatches( "^([AEIOU]).*([aeiou])$", name, beg_vowel=1, end_vowel=2) %>% 
  filter(!is.na(beg_vowel) & !is.na(end_vowel)) %>%
  select(-meaning) %>%
  head()
name beg_vowel end_vowel
Abagtha A a
Abana A a
Abba A a
Abda A a
Abdi A i
Abednego A o
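
If only the matching names are needed, and not the extracted vowels, grepl() with the same pattern is enough. This is a sketch on a made-up sample of names.

```r
sample_names <- c("Abba", "Abdi", "Aaron", "Eli", "Adam", "Jericho")

# Uppercase vowel first, lowercase vowel last, anything in between
both_vowels <- sample_names[grepl("^[AEIOU].*[aeiou]$", sample_names)]
both_vowels
```

Because the pattern needs one character at the start and another at the end, a one-letter name would not match.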

i-clicker

Soln: both abababa and aabbaa are matches. Note that aa is a match