Regular expressions

Here again is a useful cheat sheet, or here are some basics:

Very simple patterns:
- A single . means “any character.”
- A character, e.g., b, means just that character.
- Characters enclosed in square brackets, e.g., [aeiou], mean any one of those characters. (So, [aeiou] is a pattern describing a lower-case vowel.)
- The ^ inside square brackets means “any except these.” So, [^aeiou] matches any single character that is not a lower-case vowel: consonants, but also digits, spaces, and capital letters.
Alternatives. A vertical bar means “either.” For example, (A|a)dam matches Adam and adam.
Repeats
- Two simple patterns in a row mean those patterns consecutively. Example: M[aeiou] means a capital M followed by a lower-case vowel.
- A simple pattern followed by a + means “one or more times.” For example M(ab)+ means M followed by one or more ab.
- A simple pattern followed by a ? means “zero or one time.”
- A simple pattern followed by a * means “zero or more times.”
- A simple pattern followed by {2} means “exactly two times.” Similarly, {2,5} means between two and five times, {6,} means six times or more. For instance, [aeiou]{2} means “exactly two vowels in a row.”
Start and end of strings.
- ^ at the beginning of a regular expression means “the start of the string”
- $ at the end means “the end of the string.”
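
The basics above can be tried out directly with grepl(). This is a minimal sketch; the vector `words` is a made-up set of test strings, not data from the lecture.

```r
# Hypothetical test strings, chosen only to exercise the patterns listed above.
words <- c("Adam", "adam", "Mabab", "Mo", "book", "sky")

# Alternation: "Adam" or "adam".
adam_hit <- grepl("(A|a)dam", words)

# Repeat: a capital M followed by one or more "ab".
m_ab_hit <- grepl("M(ab)+", words)

# Counted repeat: two lower-case vowels in a row.
two_vowels <- grepl("[aeiou]{2}", words)

# Anchors plus a negated class: the whole string has no lower-case vowel.
no_lower_vowel <- grepl("^[^aeiou]+$", words)

data.frame(words, adam_hit, m_ab_hit, two_vowels, no_lower_vowel)
```

Note that "Adam" fails the last pattern only because of its lower-case a; the capital A is not in [aeiou].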

example

Match all of the strings in the vector c("a","ab", "abb","abbb") but not the strings c("ca","abbd").

Soln:

First try

grepl("ab*",c("a","ab", "abb","abbb", "ca","abbd"))

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

Second try

grepl("^ab*$",c("a","ab", "abb","abbb", "ca","abbd"))

## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

grepl("(^a$|^ab$|^abb$|^abbb$)",c("a","ab", "abb","abbb", "ca","abbd"))

## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
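
The two solutions above agree on these six strings, but they are not the same pattern: ^ab*$ allows any number of b's, while the spelled-out alternatives stop at three. A short sketch of the difference, adding one extra test string ("abbbb") that is not part of the original exercise:

```r
# Same test vector as above, plus "abbbb" (an added case for illustration).
x <- c("a", "ab", "abb", "abbb", "abbbb", "ca", "abbd")

# Zero or more b's after the a.
star <- grepl("^ab*$", x)

# Between zero and three b's after the a.
bounded <- grepl("^ab{0,3}$", x)

# The spelled-out alternatives, with the anchors factored out of the group.
enumerated <- grepl("^(a|ab|abb|abbb)$", x)

rbind(star, bounded, enumerated)
```

Here `bounded` and `enumerated` give the same answer, and only `star` accepts "abbbb".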

example

Sometimes we want to combine regular expressions with R expressions:

Match strings that contain either Q or O, but not both (exclusive OR):

x <- c("OQ", "Q","Or","5")
(O <- grepl("O",x))

## [1]  TRUE FALSE  TRUE FALSE

x <- c("OQ", "Q","Or","5")
(Q <- grepl("Q",x))

## [1]  TRUE  TRUE FALSE FALSE

x <- c("OQ", "Q","Or","5")
xor(O,Q)

## [1] FALSE  TRUE  TRUE FALSE

Match strings with exactly one Q and exactly one O:

First match strings with exactly one Q:

(Q <- grepl("^[^Q]*Q[^Q]*$",c("OQ", "Q","Or","QQO","5QOO")))

## [1]  TRUE  TRUE FALSE FALSE  TRUE

Then match strings with exactly one O:

(O <- grepl("^[^O]*O[^O]*$",c("OQ", "Q","Or","QQO","5QOO")))

## [1]  TRUE FALSE  TRUE  TRUE FALSE

Then combine the two logical vectors with R's & operator.

#c("OQ", "Q","Or","QQO","5QOO")
O & Q

## [1]  TRUE FALSE FALSE FALSE FALSE
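
Another way to get the same answer is to count the matches instead of writing an "exactly one" pattern. The sketch below uses base R's gregexpr() and regmatches(); the helper name `count_matches` is made up for this illustration.

```r
x <- c("OQ", "Q", "Or", "QQO", "5QOO")

# Hypothetical helper: number of non-overlapping matches of `pattern`
# in each element of `strings` (0 when there is no match).
count_matches <- function(pattern, strings) {
  lengths(regmatches(strings, gregexpr(pattern, strings)))
}

n_Q <- count_matches("Q", x)
n_O <- count_matches("O", x)

# Exactly one Q and exactly one O.
one_each <- n_Q == 1 & n_O == 1
one_each
```

This trades a harder regular expression for a little more R code, which is the same idea as combining grepl() results with xor() or &.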

example:

In your work as a police detective, you have access to a database of car registrations. The data table CarRegistrations looks like this:

CarRegistrations <- data.frame(c("Joe Smith","Tim Allen"),c("Ford","Toyota"),c("Taurus","Prius"),c(2007,2011),c("metalic gray","blue"),c("337 HBQ","843 OSS"),c(98739,93134))
names(CarRegistrations) <- c("owner","brand","model","model_year","color","plate","zip")
CarRegistrations

owner	brand	model	model_year	color	plate	zip
Joe Smith	Ford	Taurus	2007	metalic gray	337 HBQ	98739
Tim Allen	Toyota	Prius	2011	blue	843 OSS	93134

A hit-and-run collision has been reported. The witness reports seeing a blue car with a license plate starting with either 3 or 8, followed somewhere by the letter O or Q, and later by either an S or an 8. You are in the Santa Barbara, CA area, and hit-and-run incidents tend to be local, so you’ll search first for vehicles registered in zip codes starting with 931.

Write a filter() expression that will extract matches to the information at hand.

Soln:

The expression will look like this

library(dplyr)  # provides %>% and filter()

CarRegistrations %>%
  filter( grepl("blue", color), # blue car
          grepl("^[38].*[OQ].*[S8].*", plate), # license plate
          grepl("^931..$", as.character(zip)) # zip code
          )

owner	brand	model	model_year	color	plate	zip
Tim Allen	Toyota	Prius	2011	blue	843 OSS	93134

The order of the grepl() statements doesn’t matter; they must all be true for a case to pass through the filter.
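
To see why the order does not matter, it can help to write the same filter in base R, where the three conditions are combined explicitly with &. This is a sketch that rebuilds the two-row table shown above; the variable names `is_blue`, `plate_ok`, `zip_ok`, and `keep` are made up for the illustration.

```r
# The two registrations shown above, rebuilt so the example is self-contained.
CarRegistrations <- data.frame(
  owner = c("Joe Smith", "Tim Allen"),
  brand = c("Ford", "Toyota"),
  model = c("Taurus", "Prius"),
  model_year = c(2007, 2011),
  color = c("metalic gray", "blue"),
  plate = c("337 HBQ", "843 OSS"),
  zip = c(98739, 93134),
  stringsAsFactors = FALSE
)

# One logical vector per clue.
is_blue  <- grepl("blue", CarRegistrations$color)
plate_ok <- grepl("^[38].*[OQ].*[S8]", CarRegistrations$plate)
zip_ok   <- grepl("^931..$", as.character(CarRegistrations$zip))

# A row passes only if every clue is TRUE; & is commutative, so order is irrelevant.
keep <- is_blue & plate_ok & zip_ok
CarRegistrations[keep, ]
```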

In Class exercise

Do problem 1:

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-13-collection/

The metacharacters in Regular Expressions are:

. \ | ( ) [ ] { ^ $ * + ?

Square brackets [] denote a character class. For example, the character class [ab] means a or b. The only special characters (metacharacters) inside a character class are the closing bracket ], the backslash \, the caret ^, and the hyphen -. The usual metacharacters are ordinary characters inside a character class and do not need to be escaped with a backslash. Inside a class, a leading ^ negates the class, - means a range between two characters, and \\ (a single backslash, written twice in an R string) escapes the next character.

Example:

grepl("[a$]", c("hello$bye", "hellobye$","adam"))  #$ has literal meaning inside []

## [1] TRUE TRUE TRUE

grepl("[a$]$", c("hello$bye","hellobye$", "adam")) #$ is metacharacter outside of []

## [1] FALSE  TRUE FALSE

grepl("[a\\^]", c("5^"))

## [1] TRUE
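
Outside a character class the metacharacters do need escaping when you want their literal meaning. The sketch below compares three ways of matching a literal dot; the vector `files` is made up for the illustration.

```r
# Hypothetical strings that differ only in the middle character.
files <- c("a.b", "axb", "a+b", "aab")

# Unescaped: . is a wildcard, so every string matches.
wildcard <- grepl("a.b", files)

# Escaped with a backslash (doubled inside an R string literal).
escaped <- grepl("a\\.b", files)

# Placed inside a character class, where . is an ordinary character.
in_class <- grepl("a[.]b", files)

# Or switch regular expressions off entirely with fixed = TRUE.
literal <- grepl("a.b", files, fixed = TRUE)

# The same class trick works for other metacharacters such as +.
plus <- grepl("a[+]b", files)

rbind(wildcard, escaped, in_class, literal, plus)
```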

There are also built-in character sets for commonly used collections.

POSIX bracket expressions are a special kind of character class. They match one character out of a set of characters, just like regular character classes, and they use the same square-bracket syntax. A hyphen creates a range, and a caret at the start negates the bracket expression.

Example:

x <- c ("a0b", "a1b","a12b","a&b")
grepl("a[[:digit:]]b",x)

## [1]  TRUE  TRUE FALSE FALSE

x <- c ("a0b", "a1b","a12b","a&b")
grepl("a[[:digit:]]+b",x)

## [1]  TRUE  TRUE  TRUE FALSE

x <- c ("a0b", "a1b","a12b","a&b")
grepl("a[^[:digit:]]+b",x)

## [1] FALSE FALSE FALSE  TRUE

x <- c (3,"5","a","b","?")
grepl("[[:digit:]abc]",x)

## [1]  TRUE  TRUE  TRUE  TRUE FALSE
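
[:digit:] is only one of the built-in sets. A short sketch of a few others ([:alpha:], [:upper:], [:alnum:], [:space:], [:punct:]) on made-up ASCII strings; results for non-ASCII text can depend on the locale.

```r
# Hypothetical ASCII test strings.
x <- c("abc", "ABC", "a b", "a,b", "42")

# Letters only, from start to end.
alpha_only <- grepl("^[[:alpha:]]+$", x)

# Upper-case letters only.
upper_only <- grepl("^[[:upper:]]+$", x)

# Letters or digits only.
alnum_only <- grepl("^[[:alnum:]]+$", x)

# Contains any whitespace character.
has_space <- grepl("[[:space:]]", x)

# Contains any punctuation character.
has_punct <- grepl("[[:punct:]]", x)

data.frame(x, alpha_only, upper_only, alnum_only, has_space, has_punct)
```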

example:

Write a regular expression that matches any string with exactly two words separated by any amount of whitespace (spaces or tabs). There may or may not be whitespace at the beginning or end of the line.

x <- c("   stats rocks", "I love stats")
grepl("^[[:blank:]]*[[:alpha:]]+[[:blank:]]+[[:alpha:]]+[[:blank:]]*$",x)

## [1]  TRUE FALSE
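
It is worth checking a pattern like this against a few more cases, including tabs and strings that should fail. The extra test strings below are made up for the illustration.

```r
pattern <- "^[[:blank:]]*[[:alpha:]]+[[:blank:]]+[[:alpha:]]+[[:blank:]]*$"

# Hypothetical test cases: "\t" is a tab.
tests <- c("stats rocks",          # two words, single space
           "stats\t \trocks  ",    # tabs and spaces between, trailing spaces
           "stats",                # only one word
           "stats  rocks!",        # "!" is neither a letter nor a blank
           "")                     # empty string

two_words <- grepl(pattern, tests)
two_words
```

Because a "word" here is [[:alpha:]]+, anything with digits or punctuation in it does not count as a word under this pattern.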

`grepl()`, `gsub()`, `extractMatches()`

Regular expressions are used for several purposes:

To detect whether a pattern is contained in a string: use filter() and grepl().

details:

grepl() searches for matches to regexp within each element of a character vector and returns a logical vector
syntax: grepl(regex,vec)

example: Here is a list of names from the Bible:

BibleNames <- read.csv("http://tiny.cc/dcf/BibleNames.csv")
head(BibleNames)

name	meaning
Aaron	a teacher; lofty; mountain of strength
Abaddon	the destroyer
Abagtha	father of the wine-press
Abana	made of stone; a building
Abarim	passages; passengers
Abba	father

Which names contain any of the substrings “bar”, “dam”, or “lory”?

BibleNames %>%
  filter(grepl("(bar)|(dam)|(lory)", name)) %>%
  head()

name	meaning
Abarim	passages; passengers
Aceldama	field of blood
Adam	earthy; red
Adamah	red earth; of blood
Adami	my man; red; earthy; human
Bethabara	the house of confidence
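
If the CSV is not reachable, the same idea can be tried on a small hand-typed vector. In this sketch `sample_names` is an assumption standing in for the name column; it also shows grep(..., value = TRUE) and the ignore.case argument, both part of base R.

```r
# A few hand-typed names, standing in for BibleNames$name.
sample_names <- c("Aaron", "Abarim", "Adam", "Bethabara", "Abba", "Barabbas")

# The parentheses around each alternative are optional here.
has_word <- grepl("bar|dam|lory", sample_names)

# grep() with value = TRUE returns the matching strings themselves.
matched <- grep("bar|dam|lory", sample_names, value = TRUE)

# Matching is case-sensitive by default, so "Barabbas" needs ignore.case.
any_case <- grepl("bar|dam|lory", sample_names, ignore.case = TRUE)

matched
```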

To replace the parts of a string that match a pattern with something else: use mutate() and gsub().

details:

gsub() searches for matches to the regular expression within each element of a character vector and replaces every match
syntax: gsub(regex, replacement, vec)

Clean up the Debt data table below so it only has numbers

head(Debt)

country	debt	percentGDP
World	$56,308 billion	64%
United States	$17,607 billion	73.60%
Japan	$9,872 billion	214.30%

Debt %>% 
  mutate( debt=gsub("[$,]| billion","",debt), # drop "$", ",", and " billion" (with its leading space)
          percentGDP=gsub("%", "", percentGDP))

country	debt	percentGDP
World	56308	64
United States	17607	73.60
Japan	9872	214.30
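
After gsub() the columns are still character strings. A base R sketch of the full clean-up, converting to numbers with as.numeric(); the Debt table here is rebuilt by hand from the three rows shown above, since the original data source is not given.

```r
# The three rows shown above, typed in by hand so the example is self-contained.
Debt <- data.frame(
  country = c("World", "United States", "Japan"),
  debt = c("$56,308 billion", "$17,607 billion", "$9,872 billion"),
  percentGDP = c("64%", "73.60%", "214.30%"),
  stringsAsFactors = FALSE
)

# Remove everything that is not a digit or a decimal point, then convert.
Debt$debt <- as.numeric(gsub("[^0-9.]", "", Debt$debt))

# fixed = TRUE treats "%" as a plain string rather than a regular expression.
Debt$percentGDP <- as.numeric(gsub("%", "", Debt$percentGDP, fixed = TRUE))

# sub() replaces only the first match, gsub() replaces all of them.
first_only <- sub(",", "", "1,234,567")
all_commas <- gsub(",", "", "1,234,567")

Debt
```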

To extract the parts of a string matched by marked regions of a regular expression: use `DataComputing::extractMatches()`.

Details:

input is a data table and output is a data table with one or more additional columns with the extractions
syntax is df %>% extractMatches(regex, var)
wrap part of the regexp in parentheses to signal that the matching content is to be extracted as a string
when there is no match extractMatches() returns NA

example:

data.frame(string=c("hi there there","bye")) %>% extractMatches("r(e)",string)

string	match1
hi there there	e
bye	NA
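
If the DataComputing package is not installed, a similar result can be had in base R with regexec() and regmatches(). This is a rough analogue under that assumption, not how extractMatches() itself is implemented; the column name `match1` simply mirrors the output above.

```r
strings <- c("hi there there", "bye")

# regexec() records the first match and its parenthesized groups;
# regmatches() then returns, per string, the full match followed by each group
# (or character(0) when there is no match).
hits <- regmatches(strings, regexec("r(e)", strings))

# Keep the first group, or NA when the pattern did not match.
match1 <- vapply(
  hits,
  function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
  character(1)
)

data.frame(string = strings, match1 = match1)
```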

example:

What are the most common end vowels for Bible names?

To answer this question, you have to extract the final letter of each name when that letter is a vowel (names ending in a consonant give NA). The extractMatches() transformation function can do this.

BibleNames %>% 
  extractMatches( "([aeiou])$", name, vowel=1) %>% 
  group_by(vowel ) %>% 
  summarise( total= n()) %>%
  arrange( vowel, desc(total) )

vowel	total
a	233
e	55
i	207
o	24
u	18
NA	2089
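
The same tally can be sketched in base R on a handful of names. The vector `sample_names` is typed in from the tables shown in these notes and is only a stand-in for the full data, so the counts below are not the real answer.

```r
# A few names copied from the tables above (not the full data set).
sample_names <- c("Aaron", "Abaddon", "Abagtha", "Abana",
                  "Abarim", "Abba", "Abdi", "Abednego")

# Capture the last character when it is a lower-case vowel.
hits <- regmatches(sample_names, regexec("([aeiou])$", sample_names))
vowel <- vapply(
  hits,
  function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
  character(1)
)

# Count each ending, keeping NA for names that end in a consonant,
# and put the most common endings first.
counts <- sort(table(vowel, useNA = "ifany"), decreasing = TRUE)
counts
```

Sorting by the count, rather than by the vowel, puts the most common endings at the top.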

Which Bible names start and end with a vowel?

BibleNames %>% 
  extractMatches( "^([AEIOU]).*([aeiou])$", name, beg_vowel=1, end_vowel=2) %>% 
  filter(!is.na(beg_vowel) & !is.na(end_vowel)) %>%
  select(-meaning) %>%
  head()

name	beg_vowel	end_vowel
Abagtha	A	a
Abana	A	a
Abba	A	a
Abda	A	a
Abdi	A	i
Abednego	A	o

i-clicker

Soln: both abababa and aabbaa are matches. Note that aa is a match

lec 13
