Source file ⇒ 2017-lec14.Rmd
Finish Regex
Lazy evaluation
Here again is a useful cheat sheet, or here are some basics:
A single `.` means “any character.”
A character, e.g., `b`, means just that character.
Characters enclosed in square brackets, e.g., `[aeiou]`, mean any one of those characters. (So, `[aeiou]` is a pattern describing a vowel.)
The `^` inside square brackets means “any except these.” So, a consonant is `[^aeiou]`.
Alternatives. A vertical bar means “either.” For example, `(A|a)dam` matches `Adam` and `adam`.
Repeats
Two simple patterns in a row mean those patterns consecutively. Example: `M[aeiou]` means a capital M followed by a lower-case vowel.
A simple pattern followed by a `+` means “one or more times.” For example, `M(ab)+` means `M` followed by one or more `ab`.
A simple pattern followed by a `?` means “zero or one time.”
A simple pattern followed by a `*` means “zero or more times.”
A simple pattern followed by `{2}` means “exactly two times.” Similarly, `{2,5}` means between two and five times, and `{6,}` means six times or more. For instance, `[aeiou]{2}` means “exactly two vowels in a row.”
Start and end of strings
`^` at the beginning of a regular expression means “the start of the string.”
`$` at the end means “the end of the string.”
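The basics above can be tried out with base R's `grepl()`, which reports whether a pattern matches each string. This is only a small sketch; the test strings and variable names are made up for illustration.

```r
# Each call checks one of the patterns described above against a few strings.
capital_m_vowel <- grepl("M[aeiou]", c("Mary", "Mr", "mary"))       # TRUE FALSE FALSE
not_a_vowel     <- grepl("[^aeiou]", c("b", "a"))                   # TRUE FALSE
either_case     <- grepl("(A|a)dam", c("Adam", "adam", "Edam"))     # TRUE TRUE FALSE
one_or_more     <- grepl("M(ab)+", c("Mab", "Mabab", "Ma"))         # TRUE TRUE FALSE
zero_or_one     <- grepl("colou?r", c("color", "colour", "colouur")) # TRUE TRUE FALSE

# Anchoring with ^ and $ makes {2} mean "the whole string is exactly two vowels".
exactly_two     <- grepl("^[aeiou]{2}$", c("oo", "o", "ooo"))       # TRUE FALSE FALSE
starts_with_b   <- grepl("^b", c("bat", "tab"))                     # TRUE FALSE
ends_with_b     <- grepl("b$", c("bat", "tab"))                     # FALSE TRUE
```

Without the anchors, `[aeiou]{2}` would also match inside a longer run of vowels, since the pattern only has to match somewhere in the string.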
DataComputing::extractMatches()
Details:
input is a data table and output is a data table with one or more additional columns with the extractions
syntax is df %>% extractMatches(regex, var)
wrap part of the regexp in parentheses to signal that the matching content is to be extracted as a string
when there is no match, `extractMatches()` returns NA
example:
data.frame(string=c("hi there","bye!","5 kids")) %>% extractMatches("^([[:alpha:]])?.*([[:alpha:]])$",string, "first_letter"=1, "last_letter"=2)
string | first_letter | last_letter |
---|---|---|
hi there | h | e |
bye! | NA | NA |
5 kids | NA | s |
The City of Boston publishes various public-safety data online. A data table listing almost 300,000 crime reports from February 2012 up through the present is available via https://data.cityofboston.gov/Public-Safety/Crime-Incident-Reports/7cdf-6fgx
A small extract of the Boston crime data is available to you:
CrimeSample <- read.csv("http://tiny.cc/dcf/Boston-Crimes-50.csv") %>% select(Location, STREETNAME)
CrimeSample %>% head(3)
Location | STREETNAME |
---|---|
(42.33556635, -71.10772955) | VINING ST |
(42.26468806, -71.15577959) | ROCKINGHAM AV |
(42.32753635, -71.08322955) | WARREN ST |
The `Location` variable contains information about latitude and longitude. Write a regular expression that will extract the latitude and longitude as numbers into separate variables.
my_regex <- "\\(([+-]?[0-9]+\\.[0-9]+), ([+-]?[0-9]+\\.[0-9]+)\\)"
Result <- CrimeSample %>%
  extractMatches(my_regex, Location, "latitude"=1, "longitude"=2)
Result %>% head(3)
Location | STREETNAME | latitude | longitude |
---|---|---|---|
(42.33556635, -71.10772955) | VINING ST | 42.33556635 | -71.10772955 |
(42.26468806, -71.15577959) | ROCKINGHAM AV | 42.26468806 | -71.15577959 |
(42.32753635, -71.08322955) | WARREN ST | 42.32753635 | -71.08322955 |
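The exercise asks for numbers, and a regular expression on its own only yields text. As a hedged sketch, the same pattern can be applied with base R's `regexec()` and `regmatches()` and the captured groups converted with `as.numeric()`. The vector `loc` below is a hand-typed copy of the first two `Location` values shown above, not the full data set, and the first number in each pair is taken to be the latitude.

```r
# Two Location strings copied from the sample rows above (illustrative only).
loc <- c("(42.33556635, -71.10772955)", "(42.26468806, -71.15577959)")

my_regex <- "\\(([+-]?[0-9]+\\.[0-9]+), ([+-]?[0-9]+\\.[0-9]+)\\)"

# For each string, regmatches() returns the full match followed by
# the text captured by each pair of parentheses.
parts <- regmatches(loc, regexec(my_regex, loc))

# Element 2 is the first captured group, element 3 the second.
latitude  <- as.numeric(sapply(parts, `[`, 2))
longitude <- as.numeric(sapply(parts, `[`, 3))
```

After this, `latitude` and `longitude` are numeric vectors, so they can be used directly in arithmetic or plotting.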
Do problem 2: (Note: after class I modified these questions slightly)
https://scf.berkeley.edu:3838/shiny/alucas/Lecture-13-collection/
Answer: d, since `..man` means that there must be a space and two characters before `man`
Answer: true; see below. There was some confusion about what happens if the name doesn’t end in a vowel. In that case no match is made and nothing (i.e., NA) is extracted
##Iclicker
BibleNames %>% extractMatches("^[A-Z]..(.*)[aeiou]$", name, match=1) %>% head(4)
name | meaning | match |
---|---|---|
Aaron | a teacher; lofty; mountain of strength | NA |
Abaddon | the destroyer | NA |
Abagtha | father of the wine-press | gth |
Abana | made of stone; a building | n |
With lazy evaluation, the argument of a function call isn’t evaluated unless it is actually used inside the function.
For example,
f <- function(a) 5
f(2+2)
## [1] 5
will not evaluate `2+2` and assign it to `a`, since `a` isn’t used in the definition of the function `f`.
However,
f <- function(a) a+1
f(2+2)
will evaluate the argument `2+2` and assign it to `a`, since `a` is used in the function `f`.
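One way to see that an unused argument really is never evaluated is to pass an expression that would fail if it were run. The sketch below uses `stop()` for this; the names `f_unused` and `f_used` are chosen here for illustration.

```r
f_unused <- function(a) 5      # never touches a
f_used   <- function(a) a + 1  # needs the value of a

# stop() would raise an error if evaluated, but f_unused never asks for a,
# so the call simply returns 5.
unused_result <- f_unused(stop("argument was evaluated"))

# f_used forces the argument, so the error surfaces; tryCatch() captures
# its message instead of halting the script.
used_result <- tryCatch(
  f_used(stop("argument was evaluated")),
  error = function(e) conditionMessage(e)
)
```

Here `unused_result` is 5, while `used_result` holds the error message, showing that the argument was evaluated only in the second call.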
An advantage of lazy evaluation is a performance gain from avoiding needless calculations and a smaller memory footprint, since values are created only when needed. Haskell and R are two languages that have lazy evaluation. Most other languages evaluate arguments eagerly. A disadvantage of lazy evaluation is the loss of control over when expressions are evaluated, which can lead to errors.
What is the output of the following functions?
f <- function(x){
10
}
f(x=print("hi"))
## [1] 10
f <- function(x){
10
x
}
f(x=print("hi"))
## [1] "hi"
The first function doesn’t print out “hi”. This is because `x` isn’t used in the body of the function `f`, so the argument `print("hi")` is never evaluated. This is lazy evaluation in R.
In lazy evaluation, a promise object temporarily holds the expression `print("hi")`, without evaluating it, until `x` is used in the function `f`. The goal is to avoid doing work and using memory until the value is actually needed. You can see in the following example why this can be useful, since the result of `rnorm(1000)` would take up memory if it were evaluated.
f <- function(x){
10
}
f(x=rnorm(1000))
## [1] 10
The vector of length 1000 is never created or stored in memory, since `x` isn’t used in the function.
Here is another example:
f <- function(a, b = d) {
d <- a
return(a*b) # b is first used here, so its default d is evaluated only now, after d <- a
}
f(2)
## [1] 4
This function works because of lazy evaluation. We don’t need to know what `d` is when the default `b = d` is declared. We only need to know what `d` is when `b` is first used, in `a*b`, and by then `d <- a` has already run inside the function.
Another example:
g <- function(i) i + 3
vec <- c()
for (i in 1:2){
vec[i] <- g(i)
}
vec[1]
## [1] 4
vec[2]
## [1] 5
Here `i` is used as soon as `g(i)` is called, so `vec[1]` is 4 and `vec[2]` is 5.
Another example:
g <- function(i) function(x) x + i
ls <- list()
for (i in 1:2){
ls[[i]] <- g(i)
}
ls[[1]](10)
## [1] 12
ls[[2]](10)
## [1] 12
We need a list `ls` to store our functions. Here `i` isn’t evaluated in `g(i)` until `ls[[1]]` and `ls[[2]]` are called. By that time the loop has finished and `i` is 2, so both calls return 12.
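If each stored function should remember the value of `i` from its own iteration, a common remedy is to evaluate the argument explicitly with base R's `force()` before the inner function is created. The following is a minimal sketch of that idea; the names `g_lazy`, `g_forced`, `lazy_fns` and `forced_fns` are introduced here for illustration and are not part of the lecture code.

```r
# Same factory as above: i stays an unevaluated promise until the inner function runs.
g_lazy <- function(i) function(x) x + i

# force(i) evaluates the promise immediately, fixing i at its current value.
g_forced <- function(i) {
  force(i)
  function(x) x + i
}

lazy_fns <- list()
forced_fns <- list()
for (i in 1:2) {
  lazy_fns[[i]] <- g_lazy(i)
  forced_fns[[i]] <- g_forced(i)
}

lazy_fns[[1]](10)    # 12: i is looked up only now, after the loop has ended
forced_fns[[1]](10)  # 11: i was fixed at 1 when g_forced(1) was called
forced_fns[[2]](10)  # 12
```

The lists are given names other than `ls` so that they do not mask the base function `ls()`.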