Source file ⇒ 2017-lec14.Rmd
Finish Regex
Lazy evaluation
Here again is a useful cheat sheet
or here are some basics:
A single . means “any character.”
A character, e.g., b, means just that character.
Characters enclosed in square brackets, e.g., [aeiou], mean any one of those characters. (So, [aeiou] is a pattern describing a lower-case vowel.)
The ^ inside square brackets means “any except these.” So [^aeiou] matches any single character that is not a lower-case vowel: consonants, but also digits, spaces, and punctuation.
Alternatives. A vertical bar means “either.” For example, (A|a)dam matches Adam and adam.
Two simple patterns in a row mean those patterns consecutively. Example: M[aeiou] means a capital M followed by a lower-case vowel.
Repeats
A simple pattern followed by a + means “one or more times.” For example M(ab)+ means M followed by one or more ab.
A simple pattern followed by a ? means “zero or one time.”
A simple pattern followed by a * means “zero or more times.”
A simple pattern followed by {2} means “exactly two times.” Similarly, {2,5} means between two and five times, {6,} means six times or more. For instance, [aeiou]{2} means “exactly two vowels in a row.”
Start and end of strings.
^ at the beginning of a regular expression means “the start of the string”
$ at the end means “the end of the string.”
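The basics above can be tried out with base R's grepl(), which reports whether a pattern matches anywhere in each string. This is only a small sketch; the vector words is made up for illustration and does not come from the lecture.

```r
# Made-up sample strings (an assumption for illustration only).
words <- c("Adam", "adam", "Madam", "Mabab", "book", "sky")

# Alternatives: "Adam" or "adam" anywhere in the string ("Madam" contains "adam").
either_case <- grepl("(A|a)dam", words)

# Anchors: the whole string must be "Adam" or "adam".
whole_string <- grepl("^(A|a)dam$", words)

# Repeats: M followed by one or more "ab".
m_then_ab <- grepl("M(ab)+", words)

# Two lower-case vowels in a row.
double_vowel <- grepl("[aeiou]{2}", words)

# Negated class with anchors: strings containing no lower-case vowel at all.
no_vowels <- grepl("^[^aeiou]+$", words)

data.frame(words, either_case, whole_string, m_then_ab, double_vowel, no_vowels)
```

Without the ^ and $ anchors a pattern may match in the middle of a string, which is why "Madam" matches (A|a)dam but not ^(A|a)dam$.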
DataComputing::extractMatches() details:
input is a data table and output is a data table with one or more additional columns with the extractions
syntax is df %>% extractMatches(regex, var)
wrap part of the regexp in parentheses to signal that the matching content is to be extracted as a string
when there is no match extractMatches() returns NA
example:
data.frame(string=c("hi there","bye!","5 kids")) %>% extractMatches("^([[:alpha:]])?.*([[:alpha:]])$",string, "first_letter"=1, "last_letter"=2)
| string | first_letter | last_letter |
|---|---|---|
| hi there | h | e |
| bye! | NA | NA |
| 5 kids | NA | s |
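To see what the parentheses are doing, here is a rough base-R sketch of the same idea using regexec() and regmatches(). It is not how extractMatches() is implemented, and it uses a simplified pattern (an assumption: the first group is required rather than optional), so "5 kids" yields NA for both letters here.

```r
strings <- c("hi there", "bye!", "5 kids")

# Simplified pattern: first character and last character must both be letters.
pattern <- "^([[:alpha:]]).*([[:alpha:]])$"

# For each string: c(full match, group 1, group 2), or character(0) if no match.
matches <- regmatches(strings, regexec(pattern, strings))

# Pull out one capture group, returning NA when the string did not match.
get_group <- function(m, k) if (length(m) == 0) NA_character_ else m[k + 1]

first_letter <- vapply(matches, get_group, character(1), k = 1)
last_letter  <- vapply(matches, get_group, character(1), k = 2)

data.frame(string = strings, first_letter, last_letter)
```

Only "hi there" matches this stricter pattern, so it is the only row with extracted letters (h and e).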
The City of Boston publishes various public-safety data online. A data table listing almost 300,000 crime reports from February 2012 up through the present is available via https://data.cityofboston.gov/Public-Safety/Crime-Incident-Reports/7cdf-6fgx
A small extract of the Boston crime data is available to you:
CrimeSample <- read.csv("http://tiny.cc/dcf/Boston-Crimes-50.csv") %>% select(Location, STREETNAME)
CrimeSample %>% head(3)
| Location | STREETNAME |
|---|---|
| (42.33556635, -71.10772955) | VINING ST |
| (42.26468806, -71.15577959) | ROCKINGHAM AV |
| (42.32753635, -71.08322955) | WARREN ST |
The Location variable contains information about latitude and longitude. Write a regular expression that will extract the latitude and longitude as numbers into separate variables.
my_regex <- "\\(([+-]?[0-9]+\\.[0-9]+), ([+-]?[0-9]+\\.[0-9]+)\\)"
Result <- CrimeSample %>%
extractMatches(my_regex, Location, "latitude"=1, "longitude"=2)
Result %>% head(3)
| Location | STREETNAME | latitude | longitude |
|---|---|---|---|
| (42.33556635, -71.10772955) | VINING ST | 42.33556635 | -71.10772955 |
| (42.26468806, -71.15577959) | ROCKINGHAM AV | 42.26468806 | -71.15577959 |
| (42.32753635, -71.08322955) | WARREN ST | 42.32753635 | -71.08322955 |
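The extracted values are character strings, so one more step is needed to get numbers. The following base-R sketch applies the same regular expression to two of the Location values shown above and converts the captured groups with as.numeric(). Boston lies near 42°N, 71°W, so the first number is taken to be the latitude. The variable names are illustrative.

```r
location <- c("(42.33556635, -71.10772955)", "(42.26468806, -71.15577959)")

my_regex <- "\\(([+-]?[0-9]+\\.[0-9]+), ([+-]?[0-9]+\\.[0-9]+)\\)"

# Each element is c(full match, first group, second group).
parts <- regmatches(location, regexec(my_regex, location))

# Convert the captured strings to numbers.
latitude  <- as.numeric(vapply(parts, function(m) m[2], character(1)))
longitude <- as.numeric(vapply(parts, function(m) m[3], character(1)))

data.frame(location, latitude, longitude)
```

With extractMatches() the analogous step would be to convert the new columns with as.numeric() after extraction, for example inside mutate().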
Do problem 2: (Note: after class I modified these questions slightly)
https://scf.berkeley.edu:3838/shiny/alucas/Lecture-13-collection/
Answ: d, since ..man means there must be a space and two characters before man.
Answ: true (see below). There was some confusion about what happens if the name doesn’t end in a vowel. In that case no match is made and nothing (i.e., NA) is extracted.
Iclicker
BibleNames %>% extractMatches("^[A-Z]..(.*)[aeiou]$", name, match=1) %>% head(4)
| name | meaning | match |
|---|---|---|
| Aaron | a teacher; lofty; mountain of strength | NA |
| Abaddon | the destroyer | NA |
| Abagtha | father of the wine-press | gth |
| Abana | made of stone; a building | n |
With lazy evaluation, an expression is not evaluated as soon as it is bound to a variable, but only when the evaluator is forced to produce the expression’s value. In R this applies to function arguments: in a call such as f(x = 2+2), the expression 2+2 isn’t evaluated until x is actually used inside f. (An ordinary assignment such as a <- 2+2 is evaluated immediately.) An advantage of lazy evaluation is that it avoids needless calculations, which can improve performance and reduce the memory footprint, since values are created only when needed. Haskell is lazy by default, and R evaluates function arguments lazily; Scheme is eager by default but offers explicit delay and force. Most other widely used languages evaluate eagerly. A disadvantage of lazy evaluation is the loss of control over when expressions are evaluated, which can lead to errors.
Does the output of the following functions surprise you?
f <- function(x){
10
}
f(x=print("hi"))
## [1] 10
f <- function(x){
10
x
}
f(x=print("hi"))
## [1] "hi"
It may be surprising that the first function doesn’t print out “hi”. This is a consequence of lazy evaluation in R.
In lazy evaluation a promise object temporarily holds the expression print("hi"), without evaluating it, until x is used in the function f. The goal is to avoid doing work or using memory until a value is actually needed. The following example shows why this can be useful: evaluating rnorm(1000) would generate and store 1000 random numbers that are never needed.
f <- function(x){
10
}
f(x=rnorm(1000))
## [1] 10
The vector of length 1000 is never created or stored in memory, since x is never used in the function.
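One way to check that an unused argument is never evaluated is to count evaluations. In this sketch the names counter, noisy, ignores_x, and uses_x_twice are invented for illustration; noisy() bumps a counter every time it actually runs.

```r
# An environment used as a mutable counter (illustrative helper).
counter <- new.env()
counter$n <- 0

# Records each time it is really evaluated, then returns 42.
noisy <- function() {
  counter$n <- counter$n + 1
  42
}

ignores_x    <- function(x) 10      # never touches x
uses_x_twice <- function(x) x + x   # touches x twice

ignores_x(noisy())
count_after_ignore <- counter$n     # 0: the promise was never forced

result <- uses_x_twice(noisy())     # 84
count_after_use <- counter$n        # 1: forced once, then the value is reused
```

The second count is 1 rather than 2 because a promise is evaluated at most once; later uses of x reuse the stored value.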
Here is another example:
f <- function(a, b = d) {
d <- a
return(a*b) # the default b = d is evaluated only here, after d has been assigned
}
f(2)
## [1] 4
This function makes sense because of lazy evaluation. We don’t need to know what d is when the default b = d is declared. We only need to know what d is when a*b is evaluated in the function, and by then d has been assigned.
#simple adding function
g <- function(rr) {(h <- function(x) rr + x)}
g(1)(2)
## [1] 3
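Function factories like g are also where the loss of control over evaluation time can cause errors. In the sketch below (make_adder and make_adder_forced are illustrative names, not from the lecture), the argument rr is a promise for the expression i, and it is only evaluated when the returned function is first called. By then the loop has finished and i is 3. Calling force(rr) evaluates the promise right away.

```r
# Same shape as g above: rr stays an unevaluated promise until first use.
make_adder <- function(rr) function(x) rr + x

# force() evaluates rr immediately, when the adder is created.
make_adder_forced <- function(rr) {
  force(rr)
  function(x) rr + x
}

lazy_adders   <- list()
forced_adders <- list()
for (i in 1:3) {
  lazy_adders[[i]]   <- make_adder(i)
  forced_adders[[i]] <- make_adder_forced(i)
}

lazy_first   <- lazy_adders[[1]](10)    # 13: rr is looked up now, when i is 3
forced_first <- forced_adders[[1]](10)  # 11: rr was fixed at 1 when created
```

In g(1)(2) the problem does not arise because the argument is the constant 1, which has the same value whenever it is evaluated.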