Source file ⇒ 2017-lec14.Rmd

Announcements

I will post a discussion board for you to list the questions you want to discuss Tuesday.

Today

Finish Regex
Lazy evaluation

1. Finish up Regex

Here again is a useful cheat sheet

or here are some basics:

Very simple patterns:
- A single . means “any character.”
- A character, e.g., b, means just that character.
- Characters enclosed in square brackets, e.g., [aeiou], mean any one of those characters. (So, [aeiou] is a pattern describing a lower-case vowel.)
- The ^ inside square brackets means “any except these.” So [^aeiou] matches any single character that is not a lower-case vowel, for example a consonant.
Alternatives. A vertical bar means “either.” For example (A|a)dam matches Adam and adam
Repeats
- Two simple patterns in a row, means those patterns consecutively. Example: M[aeiou] means a capital M followed by a lower-case vowel.
- A simple pattern followed by a + means “one or more times.” For example M(ab)+ means M followed by one or more ab.
- A simple pattern followed by a ? means “zero or one time.”
- A simple pattern followed by a * means “zero or more times.”
- A simple pattern followed by {2} means “exactly two times.” Similarly, {2,5} means between two and five times, {6,} means six times or more. For instance, [aeiou]{2} means “exactly two vowels in a row.”
Start and end of strings.
- ^ at the beginning of a regular expression means “the start of the string”
- $ at the end means “the end of the string.”
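
To make the basics above concrete, here is a small sketch that tries several of the patterns with base R's grepl(). The test strings are made up for illustration; they are not from the lecture.

```r
# grepl() returns TRUE where the pattern matches somewhere in the string.
alt      <- grepl("(A|a)dam",   c("Adam", "adam", "Madam"))  # TRUE TRUE TRUE ("Madam" contains "adam")
anchored <- grepl("^(A|a)dam$", c("Adam", "adam", "Madam"))  # TRUE TRUE FALSE (whole string must match)
m_vowel  <- grepl("M[aeiou]",   c("Mark", "Mr", "mark"))     # TRUE FALSE FALSE (case matters)
repeats  <- grepl("M(ab)+",     c("Mab", "Mabab", "Ma"))     # TRUE TRUE FALSE
two_vow  <- grepl("[aeiou]{2}", c("book", "bat"))            # TRUE FALSE
not_vow  <- grepl("^[^aeiou]",  c("tree", "apple"))          # TRUE FALSE (first character is not a vowel)
```

Note that without ^ and $ a pattern may match anywhere inside the string, which is why "Madam" matches (A|a)dam.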

extract matches for marked regions of regular expressions. Used with `DataComputing::extractMatches()`

Details:

input is a data table and output is a data table with one or more additional columns with the extractions
syntax is df %>% extractMatches(regex, var)
wrap part of the regexp in parentheses to signal that the matching content is to be extracted as a string
when there is no match extractMatches() returns NA

example:

data.frame(string=c("hi there","bye!","5 kids")) %>% extractMatches("^([[:alpha:]])?.*([[:alpha:]])$",string, "first_letter"=1, "last_letter"=2)

string	first_letter	last_letter
hi there	h	e
bye!	NA	NA
5 kids	NA	s

example (data scraping and cleaning)

The City of Boston publishes various public-safety data online. A data table listing almost 300,000 crime reports from February 2012 up through the present is available via https://data.cityofboston.gov/Public-Safety/Crime-Incident-Reports/7cdf-6fgx

A small extract of the Boston crime data is available to you:

CrimeSample <- read.csv("http://tiny.cc/dcf/Boston-Crimes-50.csv")  %>% select(Location, STREETNAME)
CrimeSample %>% head(3)

Location	STREETNAME
(42.33556635, -71.10772955)	VINING ST
(42.26468806, -71.15577959)	ROCKINGHAM AV
(42.32753635, -71.08322955)	WARREN ST

The Location variable contains information about latitude and longitude. Write a regular expression that will extract the latitude and longitude as numbers into separate variables.

my_regex <- "\\(([+-]?[0-9]+\\.[0-9]+), ([+-]?[0-9]+\\.[0-9]+)\\)"
Result <- CrimeSample %>%
  extractMatches(my_regex, Location, "latitude"=1, "longitude"=2)
Result %>% head(3)

Location	STREETNAME	latitude	longitude
(42.33556635, -71.10772955)	VINING ST	42.33556635	-71.10772955
(42.26468806, -71.15577959)	ROCKINGHAM AV	42.26468806	-71.15577959
(42.32753635, -71.08322955)	WARREN ST	42.32753635	-71.08322955
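
The extracted values are still character strings. As a rough sketch of how the same extraction could be done in base R and converted to numbers, the code below applies the same regular expression with regexec() and regmatches(). The three Location strings are copied from the table above; the names location, pieces, and coords are only illustrative. The first number in each pair is treated as the latitude, since Boston lies near 42° N, 71° W.

```r
location <- c("(42.33556635, -71.10772955)",
              "(42.26468806, -71.15577959)",
              "(42.32753635, -71.08322955)")
my_regex <- "\\(([+-]?[0-9]+\\.[0-9]+), ([+-]?[0-9]+\\.[0-9]+)\\)"

# regexec() locates the full match and each parenthesized group;
# regmatches() returns the matched substrings as a list of character vectors.
pieces <- regmatches(location, regexec(my_regex, location))

# In each vector, element 1 is the whole match; elements 2 and 3 are the groups.
coords <- data.frame(
  latitude  = as.numeric(sapply(pieces, `[`, 2)),
  longitude = as.numeric(sapply(pieces, `[`, 3))
)
coords
```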

In Class exercise

Do problem 2: (Note: after class I modified these questions slightly)

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-13-collection/

i-clicker

Answer: d, since ..man means there must be a space and two characters before man.

Answer: true; see below. There was some confusion about what happens if the name doesn’t end in a vowel. In that case no match is made and nothing (i.e., NA) is extracted.

##Iclicker
BibleNames %>% extractMatches("^[A-Z]..(.*)[aeiou]$", name, match=1) %>% head(4)

name	meaning	match
Aaron	a teacher; lofty; mountain of strength	NA
Abaddon	the destroyer	NA
Abagtha	father of the wine-press	gth
Abana	made of stone; a building	n

2. Lazy evaluation

With lazy evaluation, an argument in a function call isn’t evaluated unless, and until, it is actually used inside the function.

For example

f <- function(a) 5

f(2+2)

## [1] 5

Here R never evaluates 2+2, because a isn’t used in the body of the function f.

However,

f <- function(a) a+1

f(2+2)

## [1] 5

Here R does evaluate the argument 2+2 (so a is 4), because a is used in the body of the function f.

An advantage of lazy evaluation is better performance, since needless calculations are skipped, and a smaller memory footprint, since values are created only when needed. R evaluates function arguments lazily, and Haskell is lazy by default; Scheme offers it through explicit delay and force. Most other languages evaluate arguments eagerly. A disadvantage of lazy evaluation is that you lose control over when expressions are evaluated, which can lead to errors.
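
A minimal sketch of that loss of control: with lazy evaluation, arguments are evaluated in the order the function uses them, not the order in which they are written in the call. The helper note() and the variable eval_order are hypothetical names introduced only for this illustration.

```r
eval_order <- character()

# note() records its tag when (and only when) it is actually evaluated.
note <- function(tag, value) {
  eval_order <<- c(eval_order, tag)
  value
}

f <- function(a, b) {
  force(b)   # b is used first, so its argument is evaluated first
  a          # a is evaluated only here
}

result <- f(note("a", 1), note("b", 2))
result       # 1
eval_order   # "b" "a"
```

If the arguments have side effects, as note() does here, the order in which those side effects happen depends on the body of f, which the caller may not be able to see.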

What is the output of the following functions?

f <- function(x){
  
  10
}
f(x=print("hi"))

## [1] 10

f <- function(x){
  10
  x
}
f(x=print("hi"))

## [1] "hi"

The first function doesn’t print out “hi”. This is because x isn’t used in the definition of the function f. This is lazy evaluation in R.

In lazy evaluation a promise object temporarily holds the expression print("hi"), without evaluating it, until x is used in the function f. The goal is to avoid doing work and using memory for a value that may never be needed. The following example shows why this can be useful: rnorm(1000) would have to generate and store 1000 numbers if it were evaluated.

f <- function(x){
  10
}
f(x=rnorm(1000))

## [1] 10

The vector of length 1000 is never created or stored in memory, since x isn’t used in the function.

Here is another example:

f <- function(a, b = d) {
  d <- a
  return(a*b)  # b is first used here; its default, d, is evaluated only now
}

f(2)

## [1] 4

This function works because of lazy evaluation. The default b = d isn’t evaluated when f is called, so d doesn’t need to exist yet. We only need to know what d is when b is used in a*b, and by then d <- a has already run inside the function.
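
A related point, sketched here under the assumption that a global variable d is defined only for this illustration: a default argument is evaluated inside the function, but an argument supplied by the caller is evaluated where the call is made.

```r
f <- function(a, b = d) {
  d <- a
  a * b
}

# Default used: d is looked up inside f, where d <- a has set it to 2.
default_result <- f(2)            # 2 * 2 = 4

# Argument supplied: d is looked up in the calling environment instead.
d <- 10
supplied_result <- f(2, b = d)    # 2 * 10 = 20
```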

Another example:

g <- function(i) i + 3

vec <- c()
for (i in 1:2){
  vec[i] <- g(i)
}
vec[1]

## [1] 4

vec[2]

## [1] 5

Here i is used in g(i), so vec[1] = 4 and vec[2] = 5.

Another example:

g <- function(i) function(x) x + i
  

ls <- list()
for (i in 1:2){
  ls[[i]] <- g(i)
}
ls[[1]](10)

## [1] 12

ls[[2]](10)

## [1] 12

We need a list ls to store our functions. Here the argument i of g(i) isn’t evaluated until ls[[1]] and ls[[2]] are called. By that time the loop has finished and i is 2, so both functions add 2.
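
One common way to avoid this surprise is to call force() on the argument so that it is evaluated when g is called rather than later. The sketch below is a variation on the example above, not part of the original lecture code; the list name adders is illustrative.

```r
g <- function(i) {
  force(i)            # evaluate i now, while it still holds the loop's current value
  function(x) x + i
}

adders <- list()
for (i in 1:2) {
  adders[[i]] <- g(i)
}

adders[[1]](10)   # 11
adders[[2]](10)   # 12
```

Each returned function now carries its own already-evaluated copy of i, so the results no longer depend on the value of i after the loop ends.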

In Class exercise

Do problem 1:

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-14-collection/
