Source file ⇒ 2017-lec13.Rmd
Here again is a useful cheat sheet, or here are some basics:
A single `.` means “any character.”
A character, e.g., `b`, means just that character.
Characters enclosed in square brackets, e.g., `[aeiou]`, mean any one of those characters. (So, `[aeiou]` is a pattern describing a vowel.)
The `^` inside square brackets means “any except these.” So, `[^aeiou]` matches any character that is not a lower-case vowel, for example a consonant.
Alternatives. A vertical bar means “either.” For example, `(A|a)dam` matches `Adam` and `adam`.
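As a quick check of the simple patterns above, here is a minimal sketch using `grepl()`. The test strings are hypothetical and chosen only so that each pattern gives a different answer.

```r
# Hypothetical test strings, chosen only to illustrate each pattern.
x <- c("Adam", "adam", "aei", "xyz")

any_char    <- grepl("a.a", x)       # "a", any single character, then "a"
literal_x   <- grepl("x", x)         # the character "x" itself
has_vowel   <- grepl("[aeiou]", x)   # at least one lower-case vowel
has_nonvow  <- grepl("[^aeiou]", x)  # at least one character that is not a lower-case vowel
either_adam <- grepl("(A|a)dam", x)  # "Adam" or "adam"

# One row per pattern, one column per test string.
print(rbind(any_char, literal_x, has_vowel, has_nonvow, either_adam))
```

Note that `[^aeiou]` also matches capital letters, digits, and spaces, so it is broader than “consonant.”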
Repeats
Two simple patterns in a row mean those patterns consecutively. Example: `M[aeiou]` means a capital M followed by a lower-case vowel.
A simple pattern followed by a `+` means “one or more times.” For example, `M(ab)+` means `M` followed by one or more `ab`.
A simple pattern followed by a `?` means “zero or one time.”
A simple pattern followed by a `*` means “zero or more times.”
A simple pattern followed by `{2}` means “exactly two times.” Similarly, `{2,5}` means between two and five times, and `{6,}` means six times or more. For instance, `[aeiou]{2}` means “exactly two vowels in a row.”
Start and end of strings.
`^` at the beginning of a regular expression means “the start of the string.”
`$` at the end means “the end of the string.”
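The repeat operators and the anchors are easiest to compare side by side. The sketch below uses a hypothetical vector of test strings; the last line drops the anchors to show why they matter.

```r
# Hypothetical test strings for the repeat operators.
x <- c("M", "Mab", "Mabab", "Mabc")

one_or_more  <- grepl("^M(ab)+$", x)    # M followed by one or more "ab", nothing else
zero_or_one  <- grepl("^M(ab)?$", x)    # M followed by at most one "ab"
zero_or_more <- grepl("^M(ab)*$", x)    # M followed by any number of "ab"
exactly_two  <- grepl("^M(ab){2}$", x)  # M followed by exactly two "ab"

# Without ^ and $ the pattern may match anywhere inside the string,
# so "Mabc" also counts as a match.
unanchored   <- grepl("M(ab)+", x)

print(rbind(one_or_more, zero_or_one, zero_or_more, exactly_two, unanchored))
```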
Write a regular expression that matches all of the strings in the vector `c("a","ab", "abb","abbb")` but not the strings `c("ca","abbd")`.
Soln:
First try
grepl("ab*",c("a","ab", "abb","abbb", "ca","abbd"))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
Second try
grepl("^ab*$",c("a","ab", "abb","abbb", "ca","abbd"))
## [1] TRUE TRUE TRUE TRUE FALSE FALSE
or
grepl("(^a$|^ab$|^abb$|^abbb$)",c("a","ab", "abb","abbb", "ca","abbd"))
## [1] TRUE TRUE TRUE TRUE FALSE FALSE
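The two solutions are not equivalent on other inputs. As a sketch, the alternation can be written more compactly with a bounded repeat, and adding one extra test string (`"abbbb"`, an assumption for illustration) shows where it differs from `^ab*$`.

```r
# The solution's test vector plus one extra, hypothetical string.
x <- c("a", "ab", "abb", "abbb", "ca", "abbd", "abbbb")

unbounded <- grepl("^ab*$", x)      # "a" followed by any number of "b"
bounded   <- grepl("^ab{0,3}$", x)  # "a" followed by zero to three "b"

print(rbind(unbounded, bounded))
```

The bounded form accepts exactly the four listed strings, while `^ab*$` also accepts longer runs of `b`.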
Sometimes we want to combine regular expressions with R expressions:
x <- c("OQ", "Q","Or","5")
(O <- grepl("O",x))
## [1] TRUE FALSE TRUE FALSE
(Q <- grepl("Q",x))
## [1] TRUE TRUE FALSE FALSE
xor(O,Q)
## [1] FALSE TRUE TRUE FALSE
To find strings with exactly one Q and exactly one O, first match strings with exactly one Q:
(Q <- grepl("^[^Q]*Q[^Q]*$",c("OQ", "Q","Or","QQO","5QOO")))
## [1] TRUE TRUE FALSE FALSE TRUE
Then match strings with exactly one O:
(O <- grepl("^[^O]*O[^O]*$",c("OQ", "Q","Or","QQO","5QOO")))
## [1] TRUE FALSE TRUE TRUE FALSE
Then combine the two logical vectors with R's `&` operator.
#c("OQ", "Q","Or","QQO","5QOO")
O & Q
## [1] TRUE FALSE FALSE FALSE FALSE
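The same idea can be wrapped in a small helper. This is only a sketch: `exactly_one()` is a hypothetical name, and it assumes `ch` is a single ordinary character (not `^`, `]`, `-`, or a backslash, which are special inside a character class).

```r
# Hypothetical helper: TRUE where `ch` occurs exactly once in each string.
# Assumes `ch` is one ordinary character with no special meaning inside [].
exactly_one <- function(x, ch) {
  # Delete every character that is not `ch`, then count what is left.
  nchar(gsub(paste0("[^", ch, "]"), "", x)) == 1
}

x <- c("OQ", "Q", "Or", "QQO", "5QOO")
one_O <- exactly_one(x, "O")
one_Q <- exactly_one(x, "Q")

# Combine the two regular-expression results with an ordinary R operator.
both <- one_O & one_Q
print(both)
```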
In your work as a police detective, you have access to a database of car registrations. The data table `CarRegistrations` looks like this:
CarRegistrations <- data.frame(c("Joe Smith","Tim Allen"),c("Ford","Toyota"),c("Taurus","Prius"),c(2007,2011),c("metalic gray","blue"),c("337 HBQ","843 OSS"),c(98739,93134))
names(CarRegistrations) <- c("owner","brand","model","model_year","color","plate","zip")
CarRegistrations
owner | brand | model | model_year | color | plate | zip |
---|---|---|---|---|---|---|
Joe Smith | Ford | Taurus | 2007 | metalic gray | 337 HBQ | 98739 |
Tim Allen | Toyota | Prius | 2011 | blue | 843 OSS | 93134 |
A hit-and-run collision has been reported. The witness reports seeing a blue car with a license plate starting with either 3 or 8, followed somewhere by the letter O or Q, and after that by another character that is either S or 8. You are in the Santa Barbara, CA area, and hit-and-run incidents tend to be local, so you’ll search first for vehicles registered in zip codes starting with 931.
Write a `filter()` expression that will extract matches to the information at hand.
Soln:
The expression will look like this
CarRegistrations %>%
filter( grepl("blue", color), # blue car
grepl("^[38].*[OQ].*[S8].*", plate), # license plate
grepl("^931..$", as.character(zip)) # zip code
)
owner | brand | model | model_year | color | plate | zip |
---|---|---|---|---|---|---|
Tim Allen | Toyota | Prius | 2011 | blue | 843 OSS | 93134 |
The order of the `grepl()` statements doesn’t matter; they must all be true for a case to pass through the filter.
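To see why the order is irrelevant, here is a base-R sketch of the same filter in which each condition is a logical vector and the conditions are combined with `&`. The data frame `cars` is a cut-down copy of the two rows shown above, and the variable names are assumptions made for illustration.

```r
# Cut-down copy of the two registrations shown above (illustration only).
cars <- data.frame(
  owner = c("Joe Smith", "Tim Allen"),
  color = c("metalic gray", "blue"),
  plate = c("337 HBQ", "843 OSS"),
  zip   = c(98739, 93134)
)

is_blue  <- grepl("blue", cars$color)               # blue car
plate_ok <- grepl("^[38].*[OQ].*[S8]", cars$plate)  # license plate
zip_ok   <- grepl("^931", as.character(cars$zip))   # zip code

# `&` is commutative, so any ordering of the conditions keeps the same rows.
keep <- is_blue & plate_ok & zip_ok
hits <- cars[keep, ]
print(hits)
```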
Do problem 1:
https://scf.berkeley.edu:3838/shiny/alucas/Lecture-13-collection/
The metacharacters in regular expressions are: `.` `\` `|` `(` `)` `[` `]` `{` `^` `$` `*` `+` `?`
Square brackets `[]` define a character class. For example, the character class `[ab]` means `a` or `b`. The only special characters or metacharacters inside a character class are the closing bracket `]`, the backslash `\`, the caret `^`, and the hyphen `-`. The usual metacharacters are normal characters inside a character class and do not need to be escaped by a backslash. Inside a character class, `^` at the start means the opposite, `-` means between, and `\\` (a backslash, written as two backslashes in an R string) means escape.
Example:
grepl("[a$]", c("hello$bye", "hellobye$","adam")) #$ has literal meaning inside []
## [1] TRUE TRUE TRUE
grepl("[a$]$", c("hello$bye","hellobye$", "adam")) #$ is metacharacter outside of []
## [1] FALSE TRUE FALSE
grepl("[a\\^]", c("5^"))
## [1] TRUE
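Outside a character class a metacharacter has to be escaped to be matched literally. The sketch below compares three ways of matching a literal `.` or `$`; the test strings are hypothetical.

```r
# Hypothetical test strings.
x <- c("cost: $5", "a.b", "ab")

dot_meta    <- grepl(".", x)                # . as metacharacter: any character
dot_escaped <- grepl("\\.", x)              # escaped with a backslash (doubled in an R string)
dot_class   <- grepl("[.]", x)              # literal inside a character class
dot_fixed   <- grepl(".", x, fixed = TRUE)  # no regular expression at all

dollar_escaped <- grepl("\\$", x)           # literal dollar sign, escaped
dollar_class   <- grepl("[$]", x)           # literal dollar sign, inside []

print(rbind(dot_meta, dot_escaped, dot_class, dot_fixed, dollar_escaped, dollar_class))
```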
There are also built-in character sets for commonly used collections.
POSIX bracket expressions are a special kind of character class. They match one character out of a set of characters, just like regular character classes, and they use the same syntax with square brackets. A hyphen creates a range, and a caret at the start negates the bracket expression. For example, `[:digit:]` stands for the digits and is written inside a bracket expression as `[[:digit:]]`.
Example:
x <- c("a0b", "a1b", "a12b", "a&b")
grepl("a[[:digit:]]b",x)
## [1] TRUE TRUE FALSE FALSE
grepl("a[[:digit:]]+b",x)
## [1] TRUE TRUE TRUE FALSE
grepl("a[^[:digit:]]+b",x)
## [1] FALSE FALSE FALSE TRUE
x <- c(3, "5", "a", "b", "?")
grepl("[[:digit:]abc]",x)
## [1] TRUE TRUE TRUE TRUE FALSE
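Besides `[:digit:]`, a few other built-in sets come up often. The sketch below tries some of them on hypothetical ASCII test strings; results for non-ASCII text can depend on the locale.

```r
# Hypothetical ASCII test strings.
x <- c("abc", "ABC", "a b", "a-b", "42")

all_letters <- grepl("^[[:alpha:]]+$", x)  # only letters
has_space   <- grepl("[[:space:]]", x)     # contains whitespace
has_punct   <- grepl("[[:punct:]]", x)     # contains punctuation
all_upper   <- grepl("^[[:upper:]]+$", x)  # only upper-case letters
all_alnum   <- grepl("^[[:alnum:]]+$", x)  # only letters and digits

print(rbind(all_letters, has_space, has_punct, all_upper, all_alnum))
```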
Write a regular expression that matches any string with exactly two words separated by any amount of whitespace (spaces or tabs). There may or may not be whitespace at the beginning or end of the line.
x <- c(" stats rocks", "I love stats")
grepl("^[[:blank:]]*[[:alpha:]]+[[:blank:]]+[[:alpha:]]+[[:blank:]]*$",x)
## [1] TRUE FALSE
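It is worth trying the pattern on a few more cases, including tabs. The extra test strings below are hypothetical; `"\t"` is a tab character in an R string.

```r
# Same pattern as above, stored once so it can be reused.
two_words <- "^[[:blank:]]*[[:alpha:]]+[[:blank:]]+[[:alpha:]]+[[:blank:]]*$"

# Hypothetical extra test cases.
x <- c("stats rocks",             # two words, single space
       "\tstats\t \trocks  ",     # tabs and spaces around and between the words
       "stats",                   # only one word
       "I love stats",            # three words
       "stats 101")               # second "word" is digits, not letters

result <- grepl(two_words, x)
print(result)
```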
`grepl()`, `gsub()`, `extractMatches()`
Regular expressions are used for several purposes:
`filter()` and `grepl()`
details:
`grepl()` searches for matches to a regular expression within each element of a character vector and returns a logical vector.
syntax: `grepl(regex, vec)`
example: Here is a list of names from the Bible:
BibleNames <- read.csv("http://tiny.cc/dcf/BibleNames.csv")
head(BibleNames)
name | meaning |
---|---|
Aaron | a teacher; lofty; mountain of strength |
Abaddon | the destroyer |
Abagtha | father of the wine-press |
Abana | made of stone; a building |
Abarim | passages; passengers |
Abba | father |
Which names contain any of the substrings “bar”, “dam”, or “lory”?
BibleNames %>%
filter(grepl("(bar)|(dam)|(lory)", name)) %>%
head()
name | meaning |
---|---|
Abarim | passages; passengers |
Aceldama | field of blood |
Adam | earthy; red |
Adamah | red earth; of blood |
Adami | my man; red; earthy; human |
Bethabara | the house of confidence |
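Because `grepl()` returns a logical vector, it also works outside `filter()`. The sketch below uses a small hypothetical vector of names rather than the full `BibleNames` table, and compares `grepl()` with base R's `grep()`.

```r
# Small hypothetical vector of names (not the full BibleNames data).
names_vec <- c("Abarim", "Adam", "Barabbas", "Eve")
pattern   <- "(bar)|(dam)|(lory)"

is_match   <- grepl(pattern, names_vec)                      # logical, one entry per name
any_case   <- grepl(pattern, names_vec, ignore.case = TRUE)  # also matches "Bar"
matched    <- grep(pattern, names_vec, value = TRUE)         # the matching names themselves
matched_at <- grep(pattern, names_vec)                       # positions of the matches

print(is_match)
print(matched)
```

Matching is case sensitive by default, which is why `"Barabbas"` only matches once `ignore.case = TRUE` is set.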
`mutate()` and `gsub()`
details:
`gsub()` searches for matches to a regular expression within each element of a character vector and replaces all matches.
syntax: `gsub(regex, replacement, vec)`
Clean up the `Debt` data table below so it contains only numbers:
head(Debt)
country | debt | percentGDP |
---|---|---|
World | $56,308 billion | 64% |
United States | $17,607 billion | 73.60% |
Japan | $9,872 billion | 214.30% |
Debt %>%
mutate( debt=gsub("[$,]| billion","",debt), # drop "$", "," and " billion" (with its leading space)
percentGDP=gsub("%", "", percentGDP)) # drop the percent sign
country | debt | percentGDP |
---|---|---|
World | 56308 | 64 |
United States | 17607 | 73.60 |
Japan | 9872 | 214.30 |
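After `gsub()` the columns still hold character strings, so one more step is needed before doing arithmetic. A minimal base-R sketch, assuming a `Debt` table that contains just the three rows shown above:

```r
# Assumed stand-in for the Debt table: only the three rows shown above.
Debt <- data.frame(
  country    = c("World", "United States", "Japan"),
  debt       = c("$56,308 billion", "$17,607 billion", "$9,872 billion"),
  percentGDP = c("64%", "73.60%", "214.30%")
)

# Keep only digits and the decimal point, then convert to numbers.
Debt$debt       <- as.numeric(gsub("[^0-9.]", "", Debt$debt))
Debt$percentGDP <- as.numeric(gsub("%", "", Debt$percentGDP))

str(Debt)
```

The negated class `[^0-9.]` removes everything that cannot be part of a plain number, which avoids listing each unwanted symbol separately.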
`DataComputing::extractMatches()`
Details:
The input is a data table, and the output is a data table with one or more additional columns holding the extractions.
The syntax is `df %>% extractMatches(regex, var)`.
Wrap part of the regexp in parentheses to signal that the matching content is to be extracted as a string.
When there is no match, `extractMatches()` returns `NA`.
example:
data.frame(string=c("hi there there","bye")) %>% extractMatches("r(e)",string)
string | match1 |
---|---|
hi there there | e |
bye | NA |
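If the DataComputing package is not available, a similar result can be sketched in base R with `regexec()` and `regmatches()`. This is an approximation for illustration, not the package's implementation.

```r
strings <- c("hi there there", "bye")

# regexec() finds the first match in each string together with its
# parenthesized groups; regmatches() pulls out the matching text.
hits <- regmatches(strings, regexec("r(e)", strings))

# Each element is c(full match, group 1), or character(0) when nothing matched.
match1 <- vapply(
  hits,
  function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
  character(1)
)

result <- data.frame(string = strings, match1 = match1)
print(result)
```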
example:
What are the most common end vowels for Bible names?
To answer this question, you have to extract the final letter of each name when that letter is a vowel. The `extractMatches()` transformation function can do this.
BibleNames %>%
extractMatches( "([aeiou])$", name, vowel=1) %>%
group_by(vowel ) %>%
summarise( total= n()) %>%
arrange( vowel, desc(total) )
vowel | total |
---|---|
a | 233 |
e | 55 |
i | 207 |
o | 24 |
u | 18 |
NA | 2089 |
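To rank the endings by frequency rather than alphabetically, the counts can be sorted by total. Here is a base-R sketch on a handful of names taken from the tables in these notes (a sample, not the full data), so the counts differ from the table above.

```r
# A few names that appear in the tables of these notes (not the full BibleNames data).
bible_sample <- c("Aaron", "Abana", "Abba", "Abdi", "Abednego", "Adam")

# Final letter when it is a lower-case vowel, otherwise NA.
ends_in_vowel <- grepl("[aeiou]$", bible_sample)
vowel <- ifelse(ends_in_vowel,
                substring(bible_sample, nchar(bible_sample)),
                NA_character_)

# Tally each ending, keep NA as its own entry, and list the largest counts first.
counts <- sort(table(vowel, useNA = "ifany"), decreasing = TRUE)
print(counts)
```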
Which Bible names start and end with a vowel?
BibleNames %>%
extractMatches( "^([AEIOU]).*([aeiou])$", name, beg_vowel=1, end_vowel=2) %>%
filter(!is.na(beg_vowel) & !is.na(end_vowel)) %>%
select(-meaning) %>%
head()
name | beg_vowel | end_vowel |
---|---|---|
Abagtha | A | a |
Abana | A | a |
Abba | A | a |
Abda | A | a |
Abdi | A | i |
Abednego | A | o |
Soln: both `abababa` and `aabbaa` are matches. Note that `aa` is a match.