Source file ⇒ 2017-lec14.Rmd
Finish Regex
Lazy evaluation
Here again is a useful cheat sheet, or here are some basics:
A single . means “any character.”
A character, e.g., b, means just that character.
Characters enclosed in square brackets, e.g., [aeiou], mean any one of those characters. (So, [aeiou] is a pattern describing a lower-case vowel.)
The ^ inside square brackets means “any except these.” So [^aeiou] matches any single character that is not a lower-case vowel (loosely, a consonant).
Alternatives. A vertical bar means “either.” For example, (A|a)dam matches Adam and adam.
Repeats
Two simple patterns in a row mean those patterns consecutively. Example: M[aeiou] means a capital M followed by a lower-case vowel.
A simple pattern followed by a + means “one or more times.” For example, M(ab)+ means M followed by one or more ab.
A simple pattern followed by a ? means “zero or one time.”
A simple pattern followed by a * means “zero or more times.”
A simple pattern followed by {2} means “exactly two times.” Similarly, {2,5} means between two and five times, and {6,} means six times or more.
For instance, [aeiou]{2} means “exactly two vowels in a row.”
Start and end of strings
^ at the beginning of a regular expression means “the start of the string.”
$ at the end means “the end of the string.”
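To try these building blocks on a few strings, here is a minimal sketch using base R's grepl(), which returns TRUE wherever the pattern matches somewhere in the string. The word list is made up for illustration and is not part of the lecture data.

```r
# Made-up test strings (not from the lecture data)
words <- c("Adam", "adam", "Mab", "Mabab", "M", "moon", "sky")

# Alternatives: "Adam" or "adam" anywhere in the string
has_adam <- grepl("(A|a)dam", words)

# Repeats with anchors: the whole string is M followed by one or more "ab"
m_then_ab <- grepl("^M(ab)+$", words)

# Repeat count: two lower-case vowels in a row somewhere in the string
two_vowels <- grepl("[aeiou]{2}", words)

# Negated class with anchors: no lower-case vowel anywhere in the string
no_vowel <- grepl("^[^aeiou]+$", words)

print(data.frame(words, has_adam, m_then_ab, two_vowels, no_vowel))
```

Note that "Adam" is not flagged by no_vowel even though it starts with a capital A: the class [^aeiou] excludes only lower-case vowels, and the string still contains a lower-case a.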
DataComputing::extractMatches()
Details:
input is a data table and output is a data table with one or more additional columns with the extractions
syntax is df %>% extractMatches(regex, var)
wrap part of the regexp in parentheses to signal that the matching content is to be extracted as a string
when there is no match, extractMatches() returns NA
example:
data.frame(string=c("hi there","bye!","5 kids")) %>% extractMatches("^([[:alpha:]])?.*([[:alpha:]])$",string, "first_letter"=1, "last_letter"=2)
string | first_letter | last_letter |
---|---|---|
hi there | h | e |
bye! | NA | NA |
5 kids | NA | s |
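If DataComputing is not installed, the same idea (parentheses mark what to extract, and no match gives NA) can be sketched in base R with regexec() and regmatches(). The helper extract_group() below is hypothetical and written only for this illustration; it is not how extractMatches() is implemented.

```r
# Hypothetical helper (not part of DataComputing): return the text captured by
# one parenthesized group, or NA when the pattern does not match at all.
extract_group <- function(x, pattern, group = 1) {
  hits <- regmatches(x, regexec(pattern, x))
  vapply(hits, function(hit) {
    # hit[1] is the whole match; hit[group + 1] is the requested group
    if (length(hit) == 0) NA_character_ else hit[group + 1]
  }, character(1))
}

strings <- c("hi there", "bye!", "5 kids")

# Capture the final character only if it is a letter
last_letter <- extract_group(strings, "([[:alpha:]])$")
print(last_letter)
```

Here "bye!" yields NA because the string ends in a character that is not a letter, so the pattern does not match.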
The City of Boston publishes various public-safety data online. A data table listing almost 300,000 crime reports from February 2012 up through the present is available via https://data.cityofboston.gov/Public-Safety/Crime-Incident-Reports/7cdf-6fgx
A small extract of the Boston crime data is available to you:
CrimeSample <- read.csv("http://tiny.cc/dcf/Boston-Crimes-50.csv") %>% select(Location, STREETNAME)
CrimeSample %>% head(3)
Location | STREETNAME |
---|---|
(42.33556635, -71.10772955) | VINING ST |
(42.26468806, -71.15577959) | ROCKINGHAM AV |
(42.32753635, -71.08322955) | WARREN ST |
The Location variable contains information about latitude and longitude. Write a regular expression that will extract the latitude and longitude as numbers into separate variables.
my_regex <- "\\(([+-]?[0-9]+\\.[0-9]+), ([+-]?[0-9]+\\.[0-9]+)\\)"
Result <- CrimeSample %>%
  extractMatches(my_regex, Location, "latitude"=1, "longitude"=2)
Result %>% head(3)
Location | STREETNAME | latitude | longitude |
---|---|---|---|
(42.33556635, -71.10772955) | VINING ST | 42.33556635 | -71.10772955 |
(42.26468806, -71.15577959) | ROCKINGHAM AV | 42.26468806 | -71.15577959 |
(42.32753635, -71.08322955) | WARREN ST | 42.32753635 | -71.08322955 |
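The exercise asks for numbers, while a regular-expression extraction yields text. Assuming the extracted values are character strings, a conversion step with as.numeric() is still needed. The sketch below does the whole thing in base R on two Location values copied from the table; the first number in each pair is the latitude (Boston is near 42 degrees north) and the second is the longitude.

```r
# Two Location values copied from the table above
loc <- c("(42.33556635, -71.10772955)", "(42.26468806, -71.15577959)")
my_regex <- "\\(([+-]?[0-9]+\\.[0-9]+), ([+-]?[0-9]+\\.[0-9]+)\\)"

# regexec() reports the full match followed by each parenthesized group
parts <- regmatches(loc, regexec(my_regex, loc))

# Element 2 is the first group, element 3 the second; convert text to numeric
latitude  <- as.numeric(vapply(parts, `[`, character(1), 2))
longitude <- as.numeric(vapply(parts, `[`, character(1), 3))

print(data.frame(latitude, longitude))
```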
Do problem 2: (Note: after class I modified these questions slightly)
https://scf.berkeley.edu:3838/shiny/alucas/Lecture-13-collection/
Answer: d, since ..man means there must be a space and two characters before man.
Answer: true (see below). There was some confusion about what happens if the name doesn’t end in a vowel. In that case no match is made and nothing (i.e., NA) is grabbed.
## iClicker
BibleNames %>% extractMatches("^[A-Z]..(.*)[aeiou]$", name, match=1) %>% head(4)
name | meaning | match |
---|---|---|
Aaron | a teacher; lofty; mountain of strength | NA |
Abaddon | the destroyer | NA |
Abagtha | father of the wine-press | gth |
Abana | made of stone; a building | n |
With lazy evaluation, an expression is not evaluated as soon as it is bound to a variable, but only when the evaluator is forced to produce the expression’s value. In R this applies to function arguments: in a call f(x = 2+2), the expression 2+2 isn’t evaluated until x is actually used inside f. (A top-level assignment such as a <- 2+2 is evaluated right away.) An advantage of lazy evaluation is a performance increase from avoiding needless calculations and a reduction in memory footprint, since values are created only when needed. R evaluates function arguments lazily, and Scheme offers explicit delayed evaluation; most other widely used languages evaluate eagerly by default. A disadvantage of lazy evaluation is the loss of control over when expressions are evaluated, which can lead to errors.
Does the output of the following functions surprise you?
f <- function(x){
10
}
f(x=print("hi"))
## [1] 10
f <- function(x){
10
x
}
f(x=print("hi"))
## [1] "hi"
It may be surprising that the first function doesn’t print out “hi”. This is a consequence of lazy evaluation in R.
In lazy evaluation, a promise object temporarily holds the expression print("hi") without evaluating it until x is used in the function f. The goal is to avoid doing work, and using memory, until a value is actually needed. The following example shows why this can be useful: if rnorm(1000) were evaluated, R would have to generate and store 1000 random numbers.
f <- function(x){
10
}
f(x=rnorm(1000))
## [1] 10
The vector of length 1000 is never created or stored in memory, since x is never used in the function.
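One way to see that an unused argument is never evaluated is to count evaluations with a side effect. In this minimal sketch, expensive(), ignore_arg(), and use_arg_twice() are hypothetical names introduced for the illustration. It also shows that a promise is evaluated at most once: the second use of x reuses the stored value.

```r
evaluated <- 0  # counts how many times the expensive expression really runs

# Hypothetical stand-in for an expensive computation
expensive <- function() {
  evaluated <<- evaluated + 1
  rnorm(1000)
}

ignore_arg    <- function(x) 10
use_arg_twice <- function(x) length(x) + length(x)

r1 <- ignore_arg(expensive())
count_after_ignore <- evaluated   # still 0: the promise was never forced

r2 <- use_arg_twice(expensive())
count_after_use <- evaluated      # 1: forced once, then the value is reused

print(c(count_after_ignore, count_after_use, r2))
```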
Here is another example:
f <- function(a, b = d) {
d <- a
return(a*b) # b's default, d, is evaluated only here, after d has been assigned
}
f(2)
## [1] 4
This function makes sense because of lazy evaluation. We don’t need to know what d is when the argument b = d is declared. The default is evaluated inside the function, so we only need to know what d is when we evaluate a*b, and by then d <- a has already run.
#simple adding function
g <- function(rr) {(h <- function(x) rr + x)}
g(1)(2)
## [1] 3
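Function factories like g are a place where lazy evaluation can surprise you, which illustrates the loss of control mentioned earlier. The sketch below uses hypothetical names (make_adder, make_adder_forced) and base R's force() to compare an argument that is left as an unevaluated promise with one that is evaluated immediately.

```r
# Same shape as g above: returns a function that adds rr to its input
make_adder <- function(rr) {
  function(x) rr + x
}

# Variant that evaluates rr immediately with force()
make_adder_forced <- function(rr) {
  force(rr)
  function(x) rr + x
}

n <- 1
add_lazy   <- make_adder(n)         # rr is still an unevaluated promise for n
add_forced <- make_adder_forced(n)  # rr is fixed at 1 right now
n <- 100

lazy_result   <- add_lazy(2)    # rr is evaluated only now, when n is 100
forced_result <- add_forced(2)  # rr was already evaluated as 1

print(c(lazy_result, forced_result))
```

The lazy version returns 102 rather than 3 because n changed before rr was first used. Calling force() at the top of a factory is a common way to pin the value down at creation time.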