The goal of these notes is to go over one way to organize your work within R.
Let's use an example: accessing NYC's 311 open data from within R.
The City of New York has data sets available for all to use. One such is
[311 Service Requests from 2010 to Present](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9).
This entire data set is pretty large – over 300M compressed. As such, downloading the entire data set and working with it in memory is not so practical.
(Just updating a 500M data set requires at least twice that storage, as copies are made.)
There are alternatives (databases, say), but here we will discuss how to use a URL to pre-filter the data we fetch into R.
For more information on how to access this data set via the Socrata Open Data API, see the Socrata documentation.
In brief, one can cook up a URL that returns the matching records in JSON format (or others, such as XML or CSV).
Let's look at an example session (the script.R file in this directory records all these commands in one place).
Load some packages (jsonlite to parse JSON outputs; dplyr for manipulation of data frames; lubridate to work with dates):
require(jsonlite)
## Loading required package: jsonlite
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
require(lubridate)
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
##     date
Our first query will narrow the time window for events to the past two days.
We can easily create the corresponding POSIXct times using lubridate functions:
d2 = now()
d1 = d2 - days(2)  ## go back two days
c(d1, d2)
## [1] "2016-08-11 13:12:29 EDT" "2016-08-13 13:12:29 EDT"
These dates can be converted into the time format Socrata wants as follows:
d1 = sprintf("%4d-%02d-%02dT%02d:%02d:%3.3f",
year(d1), month(d1), day(d1), hour(d1), minute(d1), second(d1))
d2 = sprintf("%4d-%02d-%02dT%02d:%02d:%3.3f",
year(d2), month(d2), day(d2), hour(d2), minute(d2), second(d2))
c(d1, d2)
## [1] "2016-08-11T13:12:29.334" "2016-08-13T13:12:29.334"
(These commands were done by copying and pasting, with slight edits.)
The data set has a created_date field we will work against. (All fields are listed here).
baseurl = "https://data.cityofnewyork.us/resource/fhrw-4uyv.json"
query = sprintf("%s?$where=created_date between '%s' and '%s'",
baseurl, d1, d2)
query
## [1] "https://data.cityofnewyork.us/resource/fhrw-4uyv.json?$where=created_date between '2016-08-11T13:12:29.334' and '2016-08-13T13:12:29.334'"
We should encode the query. Within the browser this isn't necessary, but my version of R required it. There is a built-in URLencode function:
query = URLencode(query)
query
## [1] "https://data.cityofnewyork.us/resource/fhrw-4uyv.json?$where=created_date%20between%20'2016-08-11T13:12:29.334'%20and%20'2016-08-13T13:12:29.334'"
But that doesn't handle the conversion of ' to %27:
query = gsub("'", "%27", query)
query
## [1] "https://data.cityofnewyork.us/resource/fhrw-4uyv.json?$where=created_date%20between%20%272016-08-11T13:12:29.334%27%20and%20%272016-08-13T13:12:29.334%27"
(Not ideal – this approach can't handle a value like "can't", for example…)
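One possible workaround, just a sketch: percent-encode each value separately with URLencode's reserved = TRUE argument, which does convert apostrophes, and then paste the encoded values into the query ourselves:
URLencode("can't", reserved = TRUE)  ## encode a single value, not the whole URL
## [1] "can%27t"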
Requesting this URL returns a JSON file.
To convert this JSON object into a data frame, the super convenient fromJSON function does all the work:
a = fromJSON(query) ## large queries can take a while to download
head(a)
##   agency                                        agency_name       borough
## 1    HRA                       HRA Benefit Card Replacement   Unspecified
## 2    DHS Operations Unit - Department of Homeless Services     MANHATTAN
## 3    DOT                       Department of Transportation   Unspecified
## 4    DOT                       Department of Transportation         BRONX
## 5    DPR                 Department of Parks and Recreation STATEN ISLAND
## 6    DOF                              Department of Finance   Unspecified
##               closed_date  community_board             complaint_type
## 1 2016-08-11T13:19:47.000    0 Unspecified   Benefit Card Replacement
## 2 2016-08-11T23:52:23.000     05 MANHATTAN Homeless Person Assistance
## 3                    <NA>    0 Unspecified     Street Light Condition
## 4                    <NA>         11 BRONX           Street Condition
## 5                    <NA> 01 STATEN ISLAND            Dead/Dying Tree
## 6                    <NA>    0 Unspecified     DOF Literature Request
##              created_date
## 1 2016-08-11T13:12:36.000
## 2 2016-08-11T13:12:43.000
## 3 2016-08-11T13:13:00.000
## 4 2016-08-11T13:13:14.000
## 5 2016-08-11T13:13:28.000
## 6 2016-08-11T13:13:42.000
##                                                 descriptor facility_type
## 1                                                 Medicaid           N/A
## 2                                                      N/A           N/A
## 3                                         Lamppost Damaged           N/A
## 4                                   Blocked - Construction           N/A
## 5                            Planted More Than 2 Years Ago           N/A
## 6  Property Tax Exemption Application for Owners - English           N/A
##        location_type  park_borough park_facility_name
## 1 NYC Street Address   Unspecified        Unspecified
## 2   Store/Commercial     MANHATTAN        Unspecified
## 3               <NA>   Unspecified        Unspecified
## 4             Street         BRONX        Unspecified
## 5             Street STATEN ISLAND        Unspecified
## 6               <NA>   Unspecified        Unspecified
##   resolution_description
## 1 The Human Resources Administration received your request. If you are eligible for a benefit card replacement, you will receive it in 7 to 10 business days. No further updates will be available by calling 311 or on 311 Online.
## 2 The mobile outreach response team went to the location provided but could not find the individual that you reported.
## 3 Service Request status for this request is available on the Department of Transportationâ\u0080\u0099s website. Please click the â\u0080\u009cLearn Moreâ\u0080\u009d link below.
## 4 The Department of Transportation requires 10 days to respond to this type of complaint. Please note your Service Request number for future reference.
## 5 The Department of Parks and Recreation usually requires 7 days to inspect the issue and may take up to 23 days to resolve it, if warranted. Please note your Service Request number for future reference.
## 6 The appropriate agency will fulfill this literature request within 7 days. Please allow additional time for delivery. Please note your Service Request number for future reference.
##   school_address school_city school_code school_name school_not_found
## 1    Unspecified Unspecified Unspecified Unspecified                N
## 2    Unspecified Unspecified Unspecified Unspecified                N
## 3    Unspecified Unspecified Unspecified Unspecified             <NA>
## 4    Unspecified Unspecified Unspecified Unspecified                N
## 5    Unspecified Unspecified Unspecified Unspecified                N
## 6    Unspecified Unspecified Unspecified Unspecified                N
##   school_number school_phone_number school_region school_state  school_zip
## 1   Unspecified         Unspecified   Unspecified  Unspecified Unspecified
## 2   Unspecified         Unspecified   Unspecified  Unspecified Unspecified
## 3   Unspecified         Unspecified   Unspecified  Unspecified Unspecified
## 4   Unspecified         Unspecified   Unspecified  Unspecified Unspecified
## 5   Unspecified         Unspecified   Unspecified  Unspecified Unspecified
## 6   Unspecified         Unspecified   Unspecified  Unspecified Unspecified
##     status unique_key address_type          city      cross_street_1
## 1   Closed   34060837         <NA>          <NA>                <NA>
## 2   Closed   34058889      ADDRESS      NEW YORK      WEST 27 STREET
## 3 Assigned   34060044    BLOCKFACE          <NA>        VICTORY BLVD
## 4 Assigned   34065070      ADDRESS         BRONX        ARNOW AVENUE
## 5     Open   34060307      ADDRESS STATEN ISLAND METROPOLITAN AVENUE
## 6     Open   34061784         <NA>          <NA>                <NA>
##          cross_street_2                due_date   incident_address
## 1                  <NA>                    <NA>               <NA>
## 2        WEST 28 STREET 2016-08-12T01:50:01.000       305 7 AVENUE
## 3 STATEN ISL EXPY ET RP                    <NA>    STATEN ISL EXPY
## 4           ADEE AVENUE 2016-08-21T13:13:14.000 2915 THROOP AVENUE
## 5        CITY BOULEVARD 2016-08-18T13:13:28.000  15 SHAWNEE STREET
## 6                  <NA> 2016-08-18T13:13:42.000               <NA>
##   incident_zip           latitude location.type location.coordinates
## 1         <NA>               <NA>          <NA>                  NULL
## 2        10001 40.746816267937895         Point  -73.99366, 40.74682
## 3         <NA>               <NA>          <NA>                  NULL
## 4        10469  40.86804790517581         Point  -73.85094, 40.86805
## 5        10301 40.625245405328585         Point  -74.10320, 40.62524
## 6         <NA>               <NA>          <NA>                  NULL
##            longitude resolution_action_updated_date     street_name
## 1               <NA>                           <NA>            <NA>
## 2 -73.99365540195873        2016-08-11T23:52:23.000        7 AVENUE
## 3               <NA>        2016-08-10T13:13:00.000 STATEN ISL EXPY
## 4 -73.85093738114138                           <NA>   THROOP AVENUE
## 5 -74.10319833876885                           <NA>  SHAWNEE STREET
## 6               <NA>                           <NA>            <NA>
##   x_coordinate_state_plane y_coordinate_state_plane intersection_street_1
## 1                     <NA>                     <NA>                  <NA>
## 2                   986008                   211362                  <NA>
## 3                     <NA>                     <NA>                  <NA>
## 4                  1025478                   255566                  <NA>
## 5                   955603                   167087                  <NA>
## 6                     <NA>                     <NA>                  <NA>
##   intersection_street_2 taxi_pick_up_location taxi_company_borough
## 1                  <NA>                  <NA>                 <NA>
## 2                  <NA>                  <NA>                 <NA>
## 3                  <NA>                  <NA>                 <NA>
## 4                  <NA>                  <NA>                 <NA>
## 5                  <NA>                  <NA>                 <NA>
## 6                  <NA>                  <NA>                 <NA>
##   bridge_highway_segment ferry_terminal_name bridge_highway_direction
## 1                   <NA>                <NA>                     <NA>
## 2                   <NA>                <NA>                     <NA>
## 3                   <NA>                <NA>                     <NA>
## 4                   <NA>                <NA>                     <NA>
## 5                   <NA>                <NA>                     <NA>
## 6                   <NA>                <NA>                     <NA>
##   bridge_highway_name road_ramp ferry_direction
## 1                <NA>      <NA>            <NA>
## 2                <NA>      <NA>            <NA>
## 3                <NA>      <NA>            <NA>
## 4                <NA>      <NA>            <NA>
## 5                <NA>      <NA>            <NA>
## 6                <NA>      <NA>            <NA>
The default display for data frames is a bit messy. We could use the newer tibble format which is loaded with dplyr:
tbl_df(a)
What different things do people call 311 about? We look at the complaint_type variable's greatest hits. The top 6 issues are:
table(a$complaint_type) %>% sort %>% tail
##
##  Noise - Street/Sidewalk        Street Condition        Blocked Driveway
##                      129                     150                     164
##      Noise - Residential         Illegal Parking            Water System
##                      233                     236                     418
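(As an aside, a dplyr-flavored way to get the same counts might be the following sketch, assuming the count function available in dplyr versions of this vintage:)
count(a, complaint_type, sort = TRUE) %>% head  ## counts, largest first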
First we try to filter the data to get just those events:
filter(a, complaint_type == "Noise - Residential")
Error: Each variable must be a 1d atomic vector or list. Problem variables: 'location'
What gives? Let's look:
class(a$location)
## [1] "data.frame"
We can get rid of the complicated data-frame variable, location, as it is redundant here anyway:
a$location <- NULL
noisy_neighbors <- filter(a, complaint_type == "Noise - Residential")
tbl_df(noisy_neighbors)
## # A tibble: 233 x 48
##    agency                     agency_name       borough
##     <chr>                           <chr>         <chr>
## 1    NYPD New York City Police Department STATEN ISLAND
## 2    NYPD New York City Police Department     MANHATTAN
## 3    NYPD New York City Police Department      BROOKLYN
## 4    NYPD New York City Police Department        QUEENS
## 5    NYPD New York City Police Department      BROOKLYN
## 6    NYPD New York City Police Department        QUEENS
## 7    NYPD New York City Police Department         BRONX
## 8    NYPD New York City Police Department        QUEENS
## 9    NYPD New York City Police Department        QUEENS
## 10   NYPD New York City Police Department      BROOKLYN
## # ... with 223 more rows, and 45 more variables: closed_date <chr>,
## #   community_board <chr>, complaint_type <chr>, created_date <chr>,
## #   descriptor <chr>, facility_type <chr>, location_type <chr>,
## #   park_borough <chr>, park_facility_name <chr>,
## #   resolution_description <chr>, school_address <chr>, school_city <chr>,
## #   school_code <chr>, school_name <chr>, school_not_found <chr>,
## #   school_number <chr>, school_phone_number <chr>, school_region <chr>,
## #   school_state <chr>, school_zip <chr>, status <chr>, unique_key <chr>,
## #   address_type <chr>, city <chr>, cross_street_1 <chr>,
## #   cross_street_2 <chr>, due_date <chr>, incident_address <chr>,
## #   incident_zip <chr>, latitude <chr>, longitude <chr>,
## #   resolution_action_updated_date <chr>, street_name <chr>,
## #   x_coordinate_state_plane <chr>, y_coordinate_state_plane <chr>,
## #   intersection_street_1 <chr>, intersection_street_2 <chr>,
## #   taxi_pick_up_location <chr>, taxi_company_borough <chr>,
## #   bridge_highway_segment <chr>, ferry_terminal_name <chr>,
## #   bridge_highway_direction <chr>, bridge_highway_name <chr>,
## #   road_ramp <chr>, ferry_direction <chr>
Mapping is always a fun visualization. For mapping we have RgoogleMaps (also ggmap, …). Let's see if we can just borrow and adjust someone's example.
require(RgoogleMaps)
## Loading required package: RgoogleMaps
noisy_neighbors[["longitude"]] = as.numeric(noisy_neighbors[["longitude"]])
noisy_neighbors[["latitude"]] = as.numeric(noisy_neighbors[["latitude"]])
center <- sapply(select(noisy_neighbors, latitude, longitude),
mean, na.rm=TRUE)
map <- GetMap(center=center, zoom=11)
coords <- with(noisy_neighbors,
LatLon2XY.centered(map, latitude, longitude, 11))
coords <- data.frame(coords)
PlotOnStaticMap(map)
points(coords$newX, coords$newY, pch=16, cex=0.3)
Giving a default zoom level is an issue. We'd rather have the map zoom to just include our data. For that we have the bounding box concept. Looking into the RgoogleMaps documentation we find there is a GetMap.bbox function – we just need to make a bounding box and go:
bb <- with(noisy_neighbors, qbbox(latitude, longitude))
map <- GetMap.bbox(bb$latR, bb$lonR)
And again:
coords <- with(noisy_neighbors,
LatLon2XY.centered(map, latitude, longitude)) # no zoom level here
coords <- data.frame(coords)
PlotOnStaticMap(map)
with(coords, points(newX, newY, pch=16, cex=0.3))
Well, we see that GetMap.bbox uses the order lon, lat – not lat, lon like the others! This should be better:
map <- GetMap.bbox(bb$lonR, bb$latR)
coords <- with(noisy_neighbors,
LatLon2XY.centered(map, latitude, longitude))
coords <- data.frame(coords)
PlotOnStaticMap(map)
with(coords, points(newX, newY, pch=16, cex=0.3))
Hmm, lots of noise in Brooklyn, not so much in SI. Maybe people new to the city complain more? Let's peek by narrowing down a neighborhood – Williamsburg.
A quick peek at Google Maps shows that the center has basically these coordinates:
lat = 40.7104541
lon = -73.9644729
From the USGS we see that one minute of latitude at 38 degrees N is basically 1.15 miles. A minute is 1/60 of a degree, so 1/60 of a degree is about 1.15 * 12 blocks, and a single block is basically \(1/60/(1.15 \cdot 12)\) degrees, which we round down to \(0.001\).
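A quick sanity check of that arithmetic:
(1/60) / (1.15 * 12)  ## degrees of latitude per block, roughly
## [1] 0.001207729
With that, we can set up a small box around the center: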
lats = lat + 0.001 * c(-1, 1)
lons = lon + 0.001 * c(-1, 1)
The previously annoying location variable is actually valuable server side. The within_box query can be used to pick out only those records that happened within a box:
condition_1 <- sprintf("within_box(location, %2.6f, %2.6f, %2.6f, %2.6f)",
lats[1], lons[1], lats[2], lons[2])
condition_2 <- sprintf("complaint_type='Noise - Residential'")
We put this into a query:
query = sprintf("%s?$where=%s and %s", baseurl, condition_1, condition_2)
query = URLencode(query)
query = gsub("'", "%27", query)
query
## [1] "https://data.cityofnewyork.us/resource/fhrw-4uyv.json?$where=within_box(location,%2040.709454,%20-73.965473,%2040.711454,%20-73.963473)%20and%20complaint_type=%27Noise%20-%20Residential%27"
a = fromJSON(query)
a$location <- NULL
tbl_df(a)
## # A tibble: 108 x 40
##    address_type agency                     agency_name  borough     city
## *         <chr>  <chr>                           <chr>    <chr>    <chr>
## 1       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 2       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 3       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 4       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 5       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 6       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 7       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 8       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 9       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 10      ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## # ... with 98 more rows, and 35 more variables: closed_date <chr>,
## #   community_board <chr>, complaint_type <chr>, created_date <chr>,
## #   cross_street_1 <chr>, cross_street_2 <chr>, descriptor <chr>,
## #   due_date <chr>, facility_type <chr>, incident_address <chr>,
## #   incident_zip <chr>, latitude <chr>, location_type <chr>,
## #   longitude <chr>, park_borough <chr>, park_facility_name <chr>,
## #   resolution_action_updated_date <chr>, resolution_description <chr>,
## #   school_address <chr>, school_city <chr>, school_code <chr>,
## #   school_name <chr>, school_not_found <chr>, school_number <chr>,
## #   school_phone_number <chr>, school_region <chr>, school_state <chr>,
## #   school_zip <chr>, status <chr>, street_name <chr>, unique_key <chr>,
## #   x_coordinate_state_plane <chr>, y_coordinate_state_plane <chr>,
## #   intersection_street_1 <chr>, intersection_street_2 <chr>
Let's look at the creation dates of these complaints. We have:
head(a$created_date) # characters
## [1] "2016-06-08T23:37:42.000" "2016-04-18T16:23:49.000" ## [3] "2016-07-03T22:27:49.000" "2016-08-07T11:36:03.000" ## [5] "2016-04-03T23:34:15.000" "2016-04-04T22:17:14.000"
Coercion is easy – lubridate does all the hard work:
a$created_date <- ymd_hms(a$created_date) # R date-time format
Let's try a simple visualization, a count by month and year:
xtabs(~ month(created_date) + year(created_date), a)
##                    year(created_date)
## month(created_date) 2010 2011 2012 2013 2014 2015 2016
##                  1     9    0    0    0    1    3    0
##                  2     1    3    0    0    0    0    0
##                  3     0    0    0    0    2    4    6
##                  4     1    0    2    1    0    0    4
##                  5     0    1    3    0    2    2    5
##                  6     0    1    6    3    2    1    4
##                  7     0    0    3    0    2    4    1
##                  8     3    0    1    2    2    4    1
##                  9     0    0    3    1    0    0    0
##                  10    0    0    2    3    1    1    0
##                  11    0    1    2    0    0    0    0
##                  12    0    0    1    0    0    3    0
This was meant to simulate how one might work interactively at the command line or with a script. We bounce along from thought to thought. This is natural, but it is helpful to go back over the session and think about how we could do things more efficiently/easily.
Usually this reflection involves creating functions to capture the repeated patterns.
For example, we had these two nearly identical commands to convert a POSIXct time into the timestamp format expected by Socrata:
d1 = sprintf("%4d-%02d-%02dT%02d:%02d:%3.3f", year(d1), month(d1), day(d1), hour(d1), minute(d1), second(d1))
d2 = sprintf("%4d-%02d-%02dT%02d:%02d:%3.3f", year(d2), month(d2), day(d2), hour(d2), minute(d2), second(d2))
We clearly see a template (the d1 and d2 values are all that change). When we use copy-and-paste to create new commands, just slightly modifying things, we should be thinking: write a function – the variable will form the template:
posix_to_timestamp <- function(d) {
sprintf("%4d-%02d-%02dT%02d:%02d:%3.3f", year(d), month(d), day(d),
hour(d), minute(d), second(d))
}
Better, but there is already a helper for this. The sprintf function is of great utility, but we can do better: the strftime function is designed for exactly this conversion. It is similar – create a template and pass in values – but here the placeholders are time-and-date centric and the values are all contained in d:
posix_to_timestamp <- function(d) {
strftime(d, "%Y-%m-%dT%H:%M:%OS3") # %OS3 is possible on some architectures
}
Testing our function to see it is vectorized:
posix_to_timestamp(c(now()-days(2), now()))
## [1] "2016-08-11T13:12:33.685" "2016-08-13T13:12:33.689"
(An improvement would be to ensure that indeed d is a time object.)
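For instance, a minimal sketch of such a guard, assuming a check against the POSIXt class is good enough for our purposes:
posix_to_timestamp <- function(d) {
  stopifnot(inherits(d, "POSIXt"))  ## insist on a date-time object
  strftime(d, "%Y-%m-%dT%H:%M:%OS3")
}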
Functions are defined by the function keyword. They have two visible things: a set of arguments and the body:
function(...arguments...) {
...body...
}
The return value of a function is the last expression evaluated. This can be implicit, or explicit by using the return function. It need not be the last line in the body, as control flow operations may affect this.
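A contrived sketch (sign_of is our own made-up name) showing an explicit early return alongside an implicit one:
sign_of <- function(x) {
  if (x < 0) return(-1)  ## explicit return; control flow exits here for negatives
  1                      ## implicit return value: the last expression evaluated
}
c(sign_of(-5), sign_of(5))
## [1] -1  1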
A third part of the function is an environment where variables within the body are found. This is a more advanced topic, but in short this is what R uses to map symbols in the body to values so that the function can be evaluated.
In our example, we repeated the following code:
query <- URLencode(query)
query <- gsub("'", "%27", query)
Repeated code should also lend itself to a function:
encodeURL <- function(query) {
query <- URLencode(query)
query <- gsub("'", "%27", query) # XXX fails on "can't" ...
query
}
(This one can also be improved – but that takes us a bit afield.)
In the past two examples, our functions take a single argument, so the usage is pretty clear – when the body refers to the variable, it uses the value passed to the function when it is called.
Functions in R are not limited to single arguments; in fact, many functions have too many arguments (IMHO :)
Arguments can be passed to functions in R in various ways:
by position: log(10, 2) and log(10, 10) give different answers – the second argument is for the base
by name: log(10, base=2) and log(10, base=10) (base is the name of the second argument, so these are identical to the positional calls above)
However, we can also name both arguments and mix order:
c(log(10, 2), log(10, base=2), log(base=2, x=10))
## [1] 3.321928 3.321928 3.321928
(As well, names are partially matched, so we need not type the entire name.)
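For example, this computes a base-10 logarithm, with b partially matching base:
log(100, b = 10)  ## partial match: b expands to base
## [1] 2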
The signature of log is log(x, base = exp(1)).
What does the base = exp(1) do? It provides a default if no base value is specified – hence log can take 1 or 2 arguments.
These defaults are evaluated lazily – not when the function is defined, not when the function is called, rather when the value is needed. This allows one to make defaults that depend on values:
Log = function(x, base=x) log(x, base)
Log(10)  # in fact log_x(x) = 1 always when x > 0
## [1] 1
R also has a convention to allow for an arbitrary number of arguments, the cutely named variable .... Here is a pattern of use:
f <- function(...) list(...)
f(1, 2, three=3)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## $three
## [1] 3
This is helpful in many circumstances, but is maybe most widely used for passing arguments on to other function calls.
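For instance, a made-up helper (scatter_with_line is our own name) can collect plotting options with ... and hand them on to plot unexamined:
scatter_with_line <- function(x, y, ...) {
  plot(x, y, ...)    ## pch, col, main, etc. pass through untouched
  abline(lm(y ~ x))  ## add a least-squares line
}
scatter_with_line(mtcars$wt, mtcars$mpg, pch=16, col="steelblue")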
We have queries and the base url. A function to combine them might be:
query_to_url <- function(query) {
url <- sprintf("%s?$where=%s", baseurl, query)
encodeURL(url)
}
Seems simple enough – but where is the value of baseurl found if it isn't passed in?
R uses nested environments to look up values for variable names. When a function is defined, the body has an environment consisting of the local variables for the function call.
There is a distinction between the environment the function is defined in and the environment where the function is called. This can be of importance when functions call other functions, but we won't see that here.
What happens in query_to_url is that baseurl is not found in the environment of local variables, so it must be found in an enclosing environment. Thus we must have defined it; using a top-level constant is one way to do this, as was done in the script.
A closure is a formal name for the function's arguments, body and environment. Closures are useful for certain programming tasks, but not for this task at hand.
x = 0
f1 <- function() x
g1 <- function() {
x = 1
f1()
}
g2 <- function() {
x = 1
f2 <- function() x
f2()
}
c(g1(), g2())
## [1] 0 1
x <- 2
g3 <- function(x=3) {
f1()
}
g4 <- function(x=4) {
f4 <- function() x
f4()
}
c(g3(), g4())
## [1] 2 4
The enclosing environment is the one where the function is defined, not where it is called.
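This lookup rule is what makes closures possible. A tiny sketch (make_adder is our own made-up name):
make_adder <- function(n) {
  function(x) x + n  ## n is found in the enclosing environment, the one created by the call to make_adder
}
add2 <- make_adder(2)
add2(10)
## [1] 12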
We saw three types of queries: a range of times, within a bounding box, matching a level for a field. There are many more. Making a nicer interface for that might be helpful.
(Had we downloaded the entire data set, we could just use subsetting and filtering to slice and dice the data R-side, but here we want to filter our request to cut down on data transfer, so we use the SODA commands to filter.)
Three helpful functions might be:
soda_character <- function(var, val) {
sprintf("%s='%s'", var, val) ## could use shQuote(val) and just %s ...
}
soda_date <- function(var, from, to=NULL) {
if (is.null(to)) {
sprintf("%s>='%s'", var, posix_to_timestamp(from))
} else {
sprintf("%s between '%s' and '%s'", var,
posix_to_timestamp(from), posix_to_timestamp(to))
}
}
(Here var names a field of the appropriate type; the location field is a Point, which is what within_box expects.)
soda_bbox <- function(var, lats, lons) {
sprintf("within_box(%s, %2.6f, %2.6f, %2.6f, %2.6f)", var, lats[1], lons[1], lats[2], lons[2])
}
Then we have, say:
soda_character("complaint_type", "Noise - Residential")
## [1] "complaint_type='Noise - Residential'"
Not as slick as indexing with `complaint_type == "Noise - Residential"`, but not too bad.
This is kind of ugly:
we have 3 functions that basically do the same action
we need to mentally keep track of which function name goes with the class of each value we want to use.
shouldn't the computer help here?
Computer languages are now designed to minimize the taxing of the brain, if they can. In this case, we can use one function name to call three similar but different functionalities and let the computer decide which based on the class of the value:
We have soda_character, soda_date, and soda_bbox all doing the same kind of thing. R has many different programming styles that implement polymorphism (one name, many actions); the S3 system used below is the simplest, and there are others (S4, reference classes, and add-on packages such as proto – older, one of the first, with different semantics).
If you bless a function name, then the following pattern will be used when the variable x matches the specified class:
fn.class_name <- function(x, ...) { ... }
So we specialize this function definition to calls where x is of the specified class.
This is widely used in R. For example, "tab-completing" plot shows these by default:
plot.default    plot.new              plot.ts
plot.design     plot.spec.coherency   plot.window
plot.ecdf       plot.spec.phase       plot.xy
plot.function   plot.stepfun
Among others, we see:
default: the catch-all definition
new: actually shows how the naming is just a convention; there is no "new" class.
Tab completion just lists visible functions by name; we should instead call methods:
methods(plot)
##  [1] plot.acf*           plot.data.frame*    plot.decomposed.ts*
##  [4] plot.default        plot.dendrogram*    plot.density*
##  [7] plot.ecdf           plot.factor*        plot.formula*
## [10] plot.function       plot.hclust*        plot.histogram*
## [13] plot.HoltWinters*   plot.isoreg*        plot.lm*
## [16] plot.medpolish*     plot.mlm*           plot.ppr*
## [19] plot.prcomp*        plot.princomp*      plot.profile.nls*
## [22] plot.raster*        plot.spec*          plot.stepfun
## [25] plot.stl*           plot.table*         plot.ts
## [28] plot.tskernel*      plot.TukeyHSD*
## see '?methods' for accessing help and source code
Bless the function name:
Soda <- function(x,...) UseMethod("Soda")
UseMethod is the circus master that controls the dispatching of which method to call based on the class of the first argument.
Then we have:
Soda.character <- function(val, var) {
sprintf("%s='%s'", var, val) ## could use shQuote(val) and just %s ...
}
Very similar – but val and var are switched (a bit unnaturally!) as we need to dispatch on the class of the value, so it goes first.
Thinking in terms of vectorization, we might get passed a vector of values for val. Let's do a bit better – R-wise – and use the in feature of SoQL instead:
Soda.character <- function(val, var) {
vals = paste(shQuote(val), collapse = ", ")
if (length(val) == 1) {
sprintf("%s = %s", var, vals)
} else {
sprintf("%s in(%s)", var, vals)
}
}
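A quick check of both branches (the single quotes come from shQuote):
Soda("Noise - Residential", "complaint_type")
## [1] "complaint_type = 'Noise - Residential'"
Soda(c("Noise - Residential", "Illegal Parking"), "complaint_type")
## [1] "complaint_type in('Noise - Residential', 'Illegal Parking')"
Similarly, we specialize for times: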
Soda.POSIXct <- function(from, var, to=NULL) {
if (is.null(to)) {
out <- sprintf("%s>='%s'", var, posix_to_timestamp(from))
} else {
out <- sprintf("%s between '%s' and '%s'", var,
posix_to_timestamp(from), posix_to_timestamp(to))
}
out
}
Or we might also have written this using from as a length 1 or 2 vector; then we would have something like:
sprintf("%s %s %s", var, ifelse(length(from) == 1, ">=", "between"),
        paste(shQuote(posix_to_timestamp(from)), collapse=" and "))
This gives a cleaned up version:
Soda.POSIXct <- function(vals, var) {
  if (length(vals) > 1)
    vals <- vals[1:2]
  compare_with <- ifelse(length(vals) == 1, ">=", "between")
  val <- paste(shQuote(posix_to_timestamp(vals)), collapse=" and ")  ## format times for SODA before quoting
  sprintf("%s %s %s", var, compare_with, val)
}
The advantage here is the consistent calling pattern makes it easier for users to not make mistakes.
What to do about the bounding box? A bounding box is just a set of 4 numbers. We'd like some way to mark these 4 numbers as a "bounding box," rather than just a set of 4 numbers (which might otherwise be used for numeric comparisons, say).
We need to make a bounding box class. Is that hard?
Well, with R's S3 classes, it is almost too easy – classes are just a naming convention. Objects have a class attribute, a character vector, that can be adjusted. Below, we just push a name onto the front. This constructor, bbox, creates a BBOX object:
bbox <- function(lats, lons) {
lats <- range(as.numeric(lats), na.rm=TRUE)
lons <- range(as.numeric(lons), na.rm=TRUE)
out <- c(lats[1], lons[1], lats[2], lons[2])
class(out) <- c("BBOX", class(out)) ## <--- the key
out
}
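Once the class is in place, any generic can be specialized on it. For instance, we might give BBOX objects a friendlier print method (a sketch of our own devising; nothing below depends on it):
print.BBOX <- function(x, ...) {
  cat(sprintf("Bounding box: lat [%.6f, %.6f], lon [%.6f, %.6f]\n",
              x[1], x[3], x[2], x[4]))  ## recall the c(lat1, lon1, lat2, lon2) layout
  invisible(x)
}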
With this, we can control the dispatch:
Soda.BBOX <- function(b, var) {
sprintf("within_box(%s, %2.6f, %2.6f, %2.6f, %2.6f)", var, b[1], b[2], b[3], b[4])
}
And we have, for example:
Soda(bbox(a$latitude, a$longitude), "location")
(We compose our two functions here.)
The above helps us make one query. But it is much more natural to combine queries. In indexing notation, we use & and | to combine logical expressions. Can we do the same?
Well, R lets us define & and | using the same S3 conventions. So we just need a class to control dispatch:
Query <- function(x) {
class(x) <- c("Query", class(x))
x
}
Great. We could go back and edit each of our Soda methods to return a Query object, but instead we use a wrapper function. This also allows us to verify the variable name along the way.
For that task, we take advantage of R's partial matching of names mechanism:
varnames <- c('address_type', 'agency', 'agency_name', 'borough', 'city',
              'closed_date', 'community_board', 'complaint_type', 'created_date',
              'cross_street_1', 'cross_street_2', 'descriptor', 'due_date',
              'facility_type', 'incident_address', 'incident_zip', 'latitude',
              'location.type', 'location', 'location_type', 'longitude',
              'park_borough', 'park_facility_name',
              'resolution_action_updated_date', 'resolution_description',
              'school_address', 'school_city', 'school_code', 'school_name',
              'school_not_found', 'school_number', 'school_phone_number',
              'school_region', 'school_state', 'school_zip', 'status',
              'street_name', 'unique_key', 'x_coordinate_state_plane',
              'y_coordinate_state_plane') # hard-coded list
varnames <- sapply(varnames, identity, simplify=FALSE)
verify_name <- function(nm) {
out <- varnames[[nm, exact=FALSE]] ## use partial matching. More directly, could use `charmatch`.
if (is.null(out))
stop(nm, " does not match")
out
}
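Trying it out (the second call should fail, as school_c matches both school_city and school_code):
verify_name("complaint_")
## [1] "complaint_type"
try(verify_name("school_c"))  ## ambiguous partial match, so it errors
## Error in verify_name("school_c") : school_c does not match
With the name check in hand, the wrapper is: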
soda <- function(x, var, ...) {
if (length(var) > 1) warning("Only one variable at a time, first one being used.")
Query(Soda(x, verify_name(var[1]), ...))
}
"&.Query" <- function(x, y) Query(paste(x, y, sep=" and "))
"|.Query" <- function(x, y) Query(paste(x, y, sep=" or "))
And we have:
soda(bbox(a$latitude, a$longitude), "location") & soda("Noise - Residential", "created_d")
## [1] "within_box(location, 40.709534, -73.965395, 40.711445, -73.963508) and created_date = 'Noise - Residential'" ## attr(,"class") ## [1] "Query" "character"
There is a package, RSocrata, for querying Socrata data, but it is more concerned with getting the data than with constructing the query. That work has a more robust means to download files, but we will stick with the magical fromJSON function, which returns a data frame from the query. This next function tidies up our downloaded data frame through some conversions:
tidy_up <- function(d) {
d$location <- NULL # clean up
for (nm in c("created_date", "due_date"))
d[[nm]] <- ymd_hms(d[[nm]])
for (nm in c("latitude", "longitude"))
d[[nm]] <- as.numeric(d[[nm]])
d
}
We can make a request with the following. (There is a bit more here than is needed.)
request <- function(query) {
url <- query_to_url(query)
out <- Reduce(rbind, lapply(url, fromJSON)) # handles a vector of urls; a single fromJSON(url) would do otherwise
out <- tidy_up(out)
class(out) <- c("Response", class(out))
out
}
We can see how we have broken each step into a separate function call.
Each line takes care of something: the first maps a query to a url, the second requests the url (which may possibly be vectorized), the third tidies up this data frame, and the fourth adds to the class attribute – leaving us a hook in case we want to specialize some function on these data sets.
The use of GoogleMaps to visualize the request is nice. We can make it the default plotting method for our requests as follows:
plot.Response <- function(a, pch=16, cex=0.3, ...) {
bb <- with(a, qbbox(latitude, longitude))
map <- GetMap.bbox(bb$lonR, bb$latR)
## Translate original data
coords <- with(a, LatLon2XY.centered(map, latitude, longitude))
coords <- data.frame(coords)
## Plot
PlotOnStaticMap(map)
with(coords, points(newX, newY, pch=pch, cex=cex, ...))
}
lat = 40.7104541
lon = -73.9644729
query = soda(bbox(lat + 0.001 * c(-1,1), lon + 0.001 * c(-1,1)), "location") &
soda("Noise - Residential", "complaint_type")
a = request(query)
plot(a)
This example naturally lends itself to more generalizations:
more visualizations
taking advantage of the different query types available
If this were to be pursued with the intent of sharing it with others, we would want to "package" up the work.
We have seen how to
install external packages in R
load external packages in R
Both tasks are fairly painless. Now we see how to make an external package.
The hardest part is a good name! We will use the311.
RStudio's Projects make it easy to create a new package when working on a new project. Basically, it makes a call to
devtools::create("the311")
This makes a directory with the minimal structure needed to be a package.
Package development in R is greatly assisted by the devtools package, which can be used outside of RStudio.
After calling create we need to do some customization:
edit the DESCRIPTION file
navigate to the R subdirectory and create the necessary files
Some packages use one big file to hold everything, others use small files arranged around different tasks.
The latter is better for packages that may grow; the former may be better while a package is being developed and its best organization is still being worked out.
Our package, the311, uses the latter.
By default, R uses the alphabetical order of the files to determine which file is loaded first. This can lead to files named aaa.R or zzz.R. Kinda ugly.
The devtools package gives directives that can make one file depend on another; the dependent file will then load later. Looking at the example package, we have lines like:
##' @include utils.R
NULL
This would force utils.R to be loaded before the file containing this line is.
The comment character is trailed by a ', which is important: this is the roxygen documentation format. Here we aren't actually documenting anything, so we put in the NULL as a trick.
Packages must be loaded. One development cycle is: load the package, test the package, edit the package, install the package, restart R, load the package … lather, rinse, repeat …
This can get tedious. The devtools package provides load_all to temporarily install the package – avoiding the need to restart R.
load_all(pkgname)
load_all("the311")
lat = 40.7104541
lon = -73.9644729
lats = lat + 0.001 * c(-1,1)
lons = lon + 0.001 * c(-1,1)
query = soda(bbox(lats, lons), "location") &
soda("Noise - Residential", "complaint_type")
a = request(query)
dim(a)
## [1] 108 40
plot(a, cex=2)
When using load_all every function in the package is available.
This is not the case if we had installed the311 and loaded it using require.
Then only exported functions are immediately visible to the user.
As exported functions can mask other packages' functions or names, package developers should be conservative in what they export, and should re-use common names within the generic spirit they represent (such as plot for the default plot, summary for a short summary, …).
Marking a function for export can be done when using roxygen with the @export directive.
After mentioning it twice, let's describe it. The roxygen package gives a relatively simple way to document the functions in your package so that the documentation is accessible to users of the package. The format for a function is fairly simple:
document what the function does
document the arguments (parameters)
indicate if the function is exported
optionally give details, examples, notes, references.
Documentation isn't limited to functions.
Good documentation is hard to write, but there should at least be bad documentation.
For example,
##' Request a subset of the 311 data set
##' @param query A query built up using `soda` calls
##' @note A more robust method is here https://github.com/Chicago/RSocrata/
##' @export
request <- function(query) {
...
}