Source file ⇒ 2017-lec14.Rmd

Announcements

  1. I will post a discussion board for you to list the questions you want to discuss Tuesday.

Today

  1. Finish Regex

  2. Lazy evaluation

1. Finish up Regex

Here again is a useful cheat sheet

or here are some basics:

  • Very simple patterns:
    • A single . means “any character.”

    • A character, e.g., b, means just that character.

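    • Characters enclosed in square brackets, e.g., [aeiou], mean any one of those characters. (So, [aeiou] is a pattern describing a lower-case vowel.)

    • The ^ inside square brackets means “any except these.” So, [^aeiou] matches any single character that is not a lower-case vowel, which includes consonants but also digits, spaces, and punctuation.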

  • Alternatives. A vertical bar means “either.” For example (A|a)dam matches Adam and adam

  • Repeats

    • Two simple patterns in a row mean those patterns consecutively. Example: M[aeiou] means a capital M followed by a lower-case vowel.

    • A simple pattern followed by a + means “one or more times.” For example M(ab)+ means M followed by one or more ab.

    • A simple pattern followed by a ? means “zero or one time.”

    • A simple pattern followed by a * means “zero or more times.”

    • A simple pattern followed by {2} means “exactly two times.” Similarly, {2,5} means between two and five times, {6,} means six times or more. For instance, [aeiou]{2} means “exactly two vowels in a row.”

  • Start and end of strings.

    • ^ at the beginning of a regular expression means “the start of the string”

    • $ at the end means “the end of the string.”
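
As a quick check of the patterns above, here is a minimal sketch using base R's grepl(), which returns TRUE for each string that contains a match. The words vector and the result names are made up for illustration.

```r
# Made-up test strings (not from the lecture data)
words <- c("Adam", "adam", "Mabab", "Mo", "book", "sky")

# Alternatives: either A or a, followed by dam
has_adam <- grepl("(A|a)dam", words)

# Anchors and repeats: the whole string is M followed by one or more "ab"
is_m_ab <- grepl("^M(ab)+$", words)

# Counted repeats: two lower-case vowels in a row somewhere in the string
two_vowels <- grepl("[aeiou]{2}", words)

# Negated class with anchors: the whole string has no lower-case vowels
no_vowels <- grepl("^[^aeiou]+$", words)

data.frame(words, has_adam, is_m_ab, two_vowels, no_vowels)
```

Without the ^ and $ anchors, a pattern only has to match somewhere inside the string, which is why the third pattern finds the "oo" in "book".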

Parentheses in a regular expression mark regions whose matching content can be extracted. This is used with DataComputing::extractMatches().

Details:

  1. input is a data table and output is a data table with one or more additional columns with the extractions

  2. syntax is df %>% extractMatches(regex, var)

  3. wrap part of the regexp in parentheses to signal that the matching content is to be extracted as a string

  4. when there is no match extractMatches() returns NA

example:

data.frame(string=c("hi there","bye!","5 kids")) %>%
  extractMatches("^([[:alpha:]])?.*([[:alpha:]])$", string, "first_letter"=1, "last_letter"=2)
string     first_letter   last_letter
hi there   h              e
bye!       NA             NA
5 kids     NA             s

example (data scraping and cleaning)

The City of Boston publishes various public-safety data online. A data table listing almost 300,000 crime reports from February 2012 up through the present is available via https://data.cityofboston.gov/Public-Safety/Crime-Incident-Reports/7cdf-6fgx

A small extract of the Boston crime data is available to you:

CrimeSample <- read.csv("http://tiny.cc/dcf/Boston-Crimes-50.csv")  %>% select(Location, STREETNAME)
CrimeSample %>% head(3)
Location                      STREETNAME
(42.33556635, -71.10772955)   VINING ST
(42.26468806, -71.15577959)   ROCKINGHAM AV
(42.32753635, -71.08322955)   WARREN ST

The Location variable contains information about latitude and longitude. Write a regular expression that will extract the latitude and longitude as numbers into separate variables.

my_regex <- "\\(([+-]?[0-9]+\\.[0-9]+), ([+-]?[0-9]+\\.[0-9]+)\\)"
Result <- CrimeSample %>%
  extractMatches(my_regex, Location, "latitude"=1, "longitude"=2)
Result %>% head(3)
Location                      STREETNAME      latitude      longitude
(42.33556635, -71.10772955)   VINING ST       42.33556635   -71.10772955
(42.26468806, -71.15577959)   ROCKINGHAM AV   42.26468806   -71.15577959
(42.32753635, -71.08322955)   WARREN ST       42.32753635   -71.08322955
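
Note that the extracted values are character strings, so a conversion step is still needed to get numbers. A minimal base-R sketch of the same extraction, assuming the DataComputing package is not available, uses regexec() and regmatches() and then converts with as.numeric(). The location vector and the helper get_group() are made up for illustration; the third element shows what happens when there is no match.

```r
# Made-up input: two strings in the Location format plus one non-match
location <- c("(42.33556635, -71.10772955)",
              "(42.26468806, -71.15577959)",
              "not a location")

my_regex <- "\\(([+-]?[0-9]+\\.[0-9]+), ([+-]?[0-9]+\\.[0-9]+)\\)"

# One list element per string: the full match followed by each
# parenthesized group, or character(0) when there is no match
matches <- regmatches(location, regexec(my_regex, location))

# Hypothetical helper: return group i as a number, NA when nothing matched
get_group <- function(m, i) {
  if (length(m) == 0) NA_real_ else as.numeric(m[i + 1])
}

latitude  <- vapply(matches, get_group, numeric(1), i = 1)
longitude <- vapply(matches, get_group, numeric(1), i = 2)

data.frame(location, latitude, longitude)
```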

In Class exercise

Do problem 2: (Note: after class I modified these questions slightly)

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-13-collection/

i-clicker

Answer: d, since ..man means that the string must have a space and two characters before man.

Answer: true (see below). There was some confusion about what happens if the name doesn’t end in a vowel. In that case no match is made and nothing (i.e., NA) is grabbed.

##Iclicker
BibleNames %>% extractMatches("^[A-Z]..(.*)[aeiou]$", name, match=1) %>% head(4)
name      meaning                                  match
Aaron     a teacher; lofty; mountain of strength   NA
Abaddon   the destroyer                            NA
Abagtha   father of the wine-press                 gth
Abana     made of stone; a building                n

2. Lazy evaluation

With lazy evaluation, an expression is not evaluated as soon as it is bound to a variable, but only when the evaluator is forced to produce the expression’s value. In R this applies to function arguments: in a call such as f(x = 2+2), the sum isn’t computed until x is actually used inside f. (An ordinary assignment such as a <- 2+2 is evaluated immediately.) An advantage of lazy evaluation is better performance from avoiding needless calculations, along with a smaller memory footprint, since values are created only when needed. R evaluates function arguments lazily and Haskell is lazy by default; most other languages evaluate eagerly (Scheme, for instance, is eager unless you ask for laziness explicitly with delay and force). A disadvantage of lazy evaluation is the loss of control over when expressions are evaluated, which can lead to errors.

Does the output of the following functions surprise you?

f <- function(x){
  
  10
}
f(x=print("hi"))
## [1] 10
f <- function(x){
  10
  x
}
f(x=print("hi"))
## [1] "hi"

It may be surprising that the first function doesn’t print out “hi”. This is a consequence of lazy evaluation in R.

In lazy evaluation a promise object temporarily holds the unevaluated expression print("hi") until x is used in the function f. This avoids computing, and allocating memory for, values that are never needed. The following example shows why this can be useful: the call never generates the 1000 random numbers that rnorm(1000) would otherwise have to compute and store.

f <- function(x){
  10
}
f(x=rnorm(1000))
## [1] 10

The vector of length 1000 is never created or stored in memory, since x is not used in the function.

Here is another example:
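
One way to see this directly is to count how often an argument expression actually runs. The sketch below is illustrative only: expensive(), unused(), used_twice(), and the calls counter are hypothetical names, not part of the lecture code. It also shows that once a promise has been forced, its value is reused rather than recomputed.

```r
# Hypothetical helper: counts how many times it is actually run
calls <- 0
expensive <- function() {
  calls <<- calls + 1
  42
}

unused     <- function(x) 10      # never touches x
used_twice <- function(x) x + x   # touches x twice

r1 <- unused(expensive())         # promise is never forced
calls_after_unused <- calls       # still 0

r2 <- used_twice(expensive())     # promise is forced once, value reused
calls_after_used <- calls         # now 1, not 2
```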

f <- function(a, b = d) {
  d <- a
  return(a*b)  # the default for b is evaluated here, by which point d exists
}
f(2)
## [1] 4

This function makes sense because of lazy evaluation. We don’t need to know what d is when the default argument b = d is declared. We only need to know what d is when a*b is evaluated inside the function.
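
The flip side is that the order of statements inside the function matters. A minimal sketch, using a hypothetical variant f_early() and a hypothetical variable name d_local, shows the kind of error that results when the default argument is forced before the variable it refers to exists:

```r
# Hypothetical variant of f: b is used before d_local has been created
f_early <- function(a, b = d_local) {
  result <- a * b   # forces b, which looks up d_local: not defined yet
  d_local <- a
  result
}

# Capture the error message instead of stopping the script
outcome <- tryCatch(f_early(2), error = function(e) conditionMessage(e))
outcome

# Supplying b explicitly avoids the default, so the call succeeds
explicit <- f_early(2, b = 5)
```

This assumes no variable named d_local exists in the calling environment; if one did, R would find it there and the call would succeed.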

#simple adding function
g <- function(rr) {(h <- function(x) rr + x)}
g(1)(2)   
## [1] 3
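
In g(1)(2) the argument rr is also a promise: it is not evaluated when g(1) returns the inner function, only when that function is called. The sketch below shows one case where this matters, using hypothetical helpers make_adder() and make_adder_forced() together with base R's force(), which evaluates an argument immediately.

```r
# Same idea as g: rr stays an unevaluated promise until the adder is called
make_adder <- function(rr) function(x) rr + x

# Variant that forces rr as soon as the adder is created
make_adder_forced <- function(rr) {
  force(rr)
  function(x) rr + x
}

lazy  <- list()
eager <- list()
for (i in 1:3) {
  lazy[[i]]  <- make_adder(i)         # promise refers to the variable i
  eager[[i]] <- make_adder_forced(i)  # value of i captured right away
}

lazy_result  <- lazy[[1]](10)   # rr evaluated now, when i is 3
eager_result <- eager[[1]](10)  # rr was fixed at 1 when the adder was made
```

Here lazy_result is 13 rather than 11, because the promise for rr looks up i only after the loop has finished.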