The goal of these notes is to go over one way to organize your work within R.
Let's use an example: accessing NYC's 311 open data from within R.
The City of New York has data sets available for all to use. One such is
[311 Service Requests from 2010 to Present](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9).
This entire data set is pretty large – over 300M compressed. As such, downloading the entire data set and working with it in memory is not so practical.
(Just updating a 500M data set requires at least twice that storage, as copies are made.)
There are alternatives (databases, say), but here we will discuss how to use a URL to pre-filter the data we fetch into R.
For more information on how to access this data set via the Socrata Open Data API, see the Socrata documentation.
In brief, one can cook up a URL that returns the matching records in JSON format (or others, such as XML or CSV).
Let's look at an example session (the script.R file in this directory records all these commands in one place).
Load some packages (jsonlite to parse JSON outputs; dplyr for manipulation of data frames; lubridate to work with dates):
require(jsonlite)
## Loading required package: jsonlite
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
require(lubridate)
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
##     date
Our first query will narrow the time window for events to the past two days.
We can easily create the corresponding POSIXct times using lubridate functions:
d2 = now()
d1 = d2 - days(2)  ## go back two days
c(d1, d2)
## [1] "2016-08-11 13:12:29 EDT" "2016-08-13 13:12:29 EDT"
These dates can be converted into the time format Socrata wants as follows:
d1 = sprintf("%4d-%02d-%02dT%02d:%02d:%3.3f",
year(d1), month(d1), day(d1), hour(d1), minute(d1), second(d1))
d2 = sprintf("%4d-%02d-%02dT%02d:%02d:%3.3f",
year(d2), month(d2), day(d2), hour(d2), minute(d2), second(d2))
c(d1, d2)
## [1] "2016-08-11T13:12:29.334" "2016-08-13T13:12:29.334"
(These commands were done by copying and pasting, with slight edits.)
The data set has a created_date field we will work against. (All fields are listed here).
baseurl = "https://data.cityofnewyork.us/resource/fhrw-4uyv.json"
query = sprintf("%s?$where=created_date between '%s' and '%s'",
baseurl, d1, d2)
query
## [1] "https://data.cityofnewyork.us/resource/fhrw-4uyv.json?$where=created_date between '2016-08-11T13:12:29.334' and '2016-08-13T13:12:29.334'"
We should encode the query. Within the browser this isn't necessary, but my version of R required it. There is a built-in URLencode function:
query = URLencode(query)
query
## [1] "https://data.cityofnewyork.us/resource/fhrw-4uyv.json?$where=created_date%20between%20'2016-08-11T13:12:29.334'%20and%20'2016-08-13T13:12:29.334'"
But that doesn't handle the conversion of ' to %27:
query = gsub("'", "%27", query)
query
## [1] "https://data.cityofnewyork.us/resource/fhrw-4uyv.json?$where=created_date%20between%20%272016-08-11T13:12:29.334%27%20and%20%272016-08-13T13:12:29.334%27"
(Not ideal – this approach can't handle a value like "can't", for example…)
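One possible workaround, just a sketch: percent-encode each value separately with URLencode's reserved = TRUE argument, which does convert apostrophes, and then paste the encoded values into the query ourselves:
URLencode("can't", reserved = TRUE)  ## encode a single value, not the whole URL
## [1] "can%27t"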
Requesting this URL returns a JSON file.
To convert this JSON object into a data frame, the super convenient fromJSON function does all the work:
a = fromJSON(query) ## large queries can take a while to download
head(a)
##   agency                                        agency_name       borough
## 1    HRA                       HRA Benefit Card Replacement   Unspecified
## 2    DHS Operations Unit - Department of Homeless Services     MANHATTAN
## 3    DOT                       Department of Transportation   Unspecified
## 4    DOT                       Department of Transportation         BRONX
## 5    DPR                 Department of Parks and Recreation STATEN ISLAND
## 6    DOF                              Department of Finance   Unspecified
##               closed_date  community_board             complaint_type
## 1 2016-08-11T13:19:47.000    0 Unspecified   Benefit Card Replacement
## 2 2016-08-11T23:52:23.000     05 MANHATTAN Homeless Person Assistance
## 3                    <NA>    0 Unspecified     Street Light Condition
## 4                    <NA>         11 BRONX           Street Condition
## 5                    <NA> 01 STATEN ISLAND            Dead/Dying Tree
## 6                    <NA>    0 Unspecified     DOF Literature Request
##              created_date
## 1 2016-08-11T13:12:36.000
## 2 2016-08-11T13:12:43.000
## 3 2016-08-11T13:13:00.000
## 4 2016-08-11T13:13:14.000
## 5 2016-08-11T13:13:28.000
## 6 2016-08-11T13:13:42.000
##                                                 descriptor facility_type
## 1                                                 Medicaid           N/A
## 2                                                      N/A           N/A
## 3                                         Lamppost Damaged           N/A
## 4                                   Blocked - Construction           N/A
## 5                            Planted More Than 2 Years Ago           N/A
## 6  Property Tax Exemption Application for Owners - English           N/A
##        location_type  park_borough park_facility_name
## 1 NYC Street Address   Unspecified        Unspecified
## 2   Store/Commercial     MANHATTAN        Unspecified
## 3               <NA>   Unspecified        Unspecified
## 4             Street         BRONX        Unspecified
## 5             Street STATEN ISLAND        Unspecified
## 6               <NA>   Unspecified        Unspecified
##   resolution_description
## 1 The Human Resources Administration received your request. If you are eligible for a benefit card replacement, you will receive it in 7 to 10 business days. No further updates will be available by calling 311 or on 311 Online.
## 2 The mobile outreach response team went to the location provided but could not find the individual that you reported.
## 3 Service Request status for this request is available on the Department of Transportationâ\u0080\u0099s website. Please click the â\u0080\u009cLearn Moreâ\u0080\u009d link below.
## 4 The Department of Transportation requires 10 days to respond to this type of complaint. Please note your Service Request number for future reference.
## 5 The Department of Parks and Recreation usually requires 7 days to inspect the issue and may take up to 23 days to resolve it, if warranted. Please note your Service Request number for future reference.
## 6 The appropriate agency will fulfill this literature request within 7 days. Please allow additional time for delivery. Please note your Service Request number for future reference.
##   school_address school_city school_code school_name school_not_found
## 1    Unspecified Unspecified Unspecified Unspecified                N
## 2    Unspecified Unspecified Unspecified Unspecified                N
## 3    Unspecified Unspecified Unspecified Unspecified             <NA>
## 4    Unspecified Unspecified Unspecified Unspecified                N
## 5    Unspecified Unspecified Unspecified Unspecified                N
## 6    Unspecified Unspecified Unspecified Unspecified                N
##   school_number school_phone_number school_region school_state  school_zip
## 1   Unspecified         Unspecified   Unspecified  Unspecified Unspecified
## 2   Unspecified         Unspecified   Unspecified  Unspecified Unspecified
## 3   Unspecified         Unspecified   Unspecified  Unspecified Unspecified
## 4   Unspecified         Unspecified   Unspecified  Unspecified Unspecified
## 5   Unspecified         Unspecified   Unspecified  Unspecified Unspecified
## 6   Unspecified         Unspecified   Unspecified  Unspecified Unspecified
##     status unique_key address_type          city      cross_street_1
## 1   Closed   34060837         <NA>          <NA>                <NA>
## 2   Closed   34058889      ADDRESS      NEW YORK      WEST 27 STREET
## 3 Assigned   34060044    BLOCKFACE          <NA>        VICTORY BLVD
## 4 Assigned   34065070      ADDRESS         BRONX        ARNOW AVENUE
## 5     Open   34060307      ADDRESS STATEN ISLAND METROPOLITAN AVENUE
## 6     Open   34061784         <NA>          <NA>                <NA>
##          cross_street_2                due_date   incident_address
## 1                  <NA>                    <NA>               <NA>
## 2        WEST 28 STREET 2016-08-12T01:50:01.000       305 7 AVENUE
## 3 STATEN ISL EXPY ET RP                    <NA>    STATEN ISL EXPY
## 4           ADEE AVENUE 2016-08-21T13:13:14.000 2915 THROOP AVENUE
## 5        CITY BOULEVARD 2016-08-18T13:13:28.000  15 SHAWNEE STREET
## 6                  <NA> 2016-08-18T13:13:42.000               <NA>
##   incident_zip           latitude location.type location.coordinates
## 1         <NA>               <NA>          <NA>                  NULL
## 2        10001 40.746816267937895         Point  -73.99366, 40.74682
## 3         <NA>               <NA>          <NA>                  NULL
## 4        10469  40.86804790517581         Point  -73.85094, 40.86805
## 5        10301 40.625245405328585         Point  -74.10320, 40.62524
## 6         <NA>               <NA>          <NA>                  NULL
##            longitude resolution_action_updated_date     street_name
## 1               <NA>                           <NA>            <NA>
## 2 -73.99365540195873        2016-08-11T23:52:23.000        7 AVENUE
## 3               <NA>        2016-08-10T13:13:00.000 STATEN ISL EXPY
## 4 -73.85093738114138                           <NA>   THROOP AVENUE
## 5 -74.10319833876885                           <NA>  SHAWNEE STREET
## 6               <NA>                           <NA>            <NA>
##   x_coordinate_state_plane y_coordinate_state_plane intersection_street_1
## 1                     <NA>                     <NA>                  <NA>
## 2                   986008                   211362                  <NA>
## 3                     <NA>                     <NA>                  <NA>
## 4                  1025478                   255566                  <NA>
## 5                   955603                   167087                  <NA>
## 6                     <NA>                     <NA>                  <NA>
##   intersection_street_2 taxi_pick_up_location taxi_company_borough
## 1                  <NA>                  <NA>                 <NA>
## 2                  <NA>                  <NA>                 <NA>
## 3                  <NA>                  <NA>                 <NA>
## 4                  <NA>                  <NA>                 <NA>
## 5                  <NA>                  <NA>                 <NA>
## 6                  <NA>                  <NA>                 <NA>
##   bridge_highway_segment ferry_terminal_name bridge_highway_direction
## 1                   <NA>                <NA>                     <NA>
## 2                   <NA>                <NA>                     <NA>
## 3                   <NA>                <NA>                     <NA>
## 4                   <NA>                <NA>                     <NA>
## 5                   <NA>                <NA>                     <NA>
## 6                   <NA>                <NA>                     <NA>
##   bridge_highway_name road_ramp ferry_direction
## 1                <NA>      <NA>            <NA>
## 2                <NA>      <NA>            <NA>
## 3                <NA>      <NA>            <NA>
## 4                <NA>      <NA>            <NA>
## 5                <NA>      <NA>            <NA>
## 6                <NA>      <NA>            <NA>
The default display for data frames is a bit messy. We could use the newer tibble format which is loaded with dplyr:
tbl_df(a)
What different things do people call 311 about? We look at the complaint_type variable's greatest hits. The top 6 issues are:
table(a$complaint_type) %>% sort %>% tail
##
##  Noise - Street/Sidewalk        Street Condition        Blocked Driveway
##                      129                     150                     164
##      Noise - Residential         Illegal Parking            Water System
##                      233                     236                     418
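(As an aside, a dplyr-flavored way to get the same counts might be the following sketch, assuming the count function available in dplyr versions of this vintage:)
count(a, complaint_type, sort = TRUE) %>% head  ## counts, largest first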
First we try to filter the data to get just those events:
filter(a, complaint_type == "Noise - Residential")
Error: Each variable must be a 1d atomic vector or list. Problem variables: 'location'
What gives? Let's look:
class(a$location)
## [1] "data.frame"
We can get rid of the complicated data-frame variable, location, as it is redundant here anyway:
a$location <- NULL
noisy_neighbors <- filter(a, complaint_type == "Noise - Residential")
tbl_df(noisy_neighbors)
## # A tibble: 233 x 48
##    agency                     agency_name       borough
##     <chr>                           <chr>         <chr>
## 1    NYPD New York City Police Department STATEN ISLAND
## 2    NYPD New York City Police Department     MANHATTAN
## 3    NYPD New York City Police Department      BROOKLYN
## 4    NYPD New York City Police Department        QUEENS
## 5    NYPD New York City Police Department      BROOKLYN
## 6    NYPD New York City Police Department        QUEENS
## 7    NYPD New York City Police Department         BRONX
## 8    NYPD New York City Police Department        QUEENS
## 9    NYPD New York City Police Department        QUEENS
## 10   NYPD New York City Police Department      BROOKLYN
## # ... with 223 more rows, and 45 more variables: closed_date <chr>,
## #   community_board <chr>, complaint_type <chr>, created_date <chr>,
## #   descriptor <chr>, facility_type <chr>, location_type <chr>,
## #   park_borough <chr>, park_facility_name <chr>,
## #   resolution_description <chr>, school_address <chr>, school_city <chr>,
## #   school_code <chr>, school_name <chr>, school_not_found <chr>,
## #   school_number <chr>, school_phone_number <chr>, school_region <chr>,
## #   school_state <chr>, school_zip <chr>, status <chr>, unique_key <chr>,
## #   address_type <chr>, city <chr>, cross_street_1 <chr>,
## #   cross_street_2 <chr>, due_date <chr>, incident_address <chr>,
## #   incident_zip <chr>, latitude <chr>, longitude <chr>,
## #   resolution_action_updated_date <chr>, street_name <chr>,
## #   x_coordinate_state_plane <chr>, y_coordinate_state_plane <chr>,
## #   intersection_street_1 <chr>, intersection_street_2 <chr>,
## #   taxi_pick_up_location <chr>, taxi_company_borough <chr>,
## #   bridge_highway_segment <chr>, ferry_terminal_name <chr>,
## #   bridge_highway_direction <chr>, bridge_highway_name <chr>,
## #   road_ramp <chr>, ferry_direction <chr>
Mapping is always a fun visualization. For mapping we have RgoogleMaps (also ggmap, …). Let's see if we can just borrow and adjust someone's example.
require(RgoogleMaps)
## Loading required package: RgoogleMaps
noisy_neighbors[["longitude"]] = as.numeric(noisy_neighbors[["longitude"]])
noisy_neighbors[["latitude"]] = as.numeric(noisy_neighbors[["latitude"]])
center <- sapply(select(noisy_neighbors, latitude, longitude),
mean, na.rm=TRUE)
map <- GetMap(center=center, zoom=11)
coords <- with(noisy_neighbors,
LatLon2XY.centered(map, latitude, longitude, 11))
coords <- data.frame(coords)
PlotOnStaticMap(map)
points(coords$newX, coords$newY, pch=16, cex=0.3)
Giving a default zoom level is an issue. We'd rather have the map zoom to just include our data. For that we have the bounding box concept. Looking into the RgoogleMaps documentation we find there is a GetMap.bbox function – we just need to make a bounding box and go:
bb <- with(noisy_neighbors, qbbox(latitude, longitude))
map <- GetMap.bbox(bb$latR, bb$lonR)
And again:
coords <- with(noisy_neighbors,
LatLon2XY.centered(map, latitude, longitude)) # no zoom level here
coords <- data.frame(coords)
PlotOnStaticMap(map)
with(coords, points(newX, newY, pch=16, cex=0.3))
Well, we see that GetMap.bbox uses the order lon, lat – not lat, lon like the others! This should be better:
map <- GetMap.bbox(bb$lonR, bb$latR)
coords <- with(noisy_neighbors,
LatLon2XY.centered(map, latitude, longitude))
coords <- data.frame(coords)
PlotOnStaticMap(map)
with(coords, points(newX, newY, pch=16, cex=0.3))
Hmm, lots of noise in Brooklyn, not so much in SI. Maybe people new to the city complain more? Let's peek by narrowing down a neighborhood – Williamsburg.
A quick peek at Google Maps shows that the center has basically these coordinates:
lat = 40.7104541
lon = -73.9644729
From the USGS we see that one minute of latitude at 38 degrees N is basically 1.15 miles. A minute is 1/60 of a degree, so 1/60 of a degree is about 1.15 * 12 blocks, and a single block is basically \(1/60/(1.15 \cdot 12)\) degrees, which we round down to \(0.001\).
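A quick sanity check of that arithmetic:
(1/60) / (1.15 * 12)  ## degrees of latitude per block, roughly
## [1] 0.001207729
With that, we can set up a small box around the center: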
lats = lat + 0.001 * c(-1, 1)
lons = lon + 0.001 * c(-1, 1)
The previously annoying location variable is actually valuable server side. The within_box query can be used to pick out only those records that happened within a box:
condition_1 <- sprintf("within_box(location, %2.6f, %2.6f, %2.6f, %2.6f)",
lats[1], lons[1], lats[2], lons[2])
condition_2 <- sprintf("complaint_type='Noise - Residential'")
We put this into a query:
query = sprintf("%s?$where=%s and %s", baseurl, condition_1, condition_2)
query = URLencode(query)
query = gsub("'", "%27", query)
query
## [1] "https://data.cityofnewyork.us/resource/fhrw-4uyv.json?$where=within_box(location,%2040.709454,%20-73.965473,%2040.711454,%20-73.963473)%20and%20complaint_type=%27Noise%20-%20Residential%27"
a = fromJSON(query)
a$location <- NULL
tbl_df(a)
## # A tibble: 108 x 40
##    address_type agency                     agency_name  borough     city
## *         <chr>  <chr>                           <chr>    <chr>    <chr>
## 1       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 2       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 3       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 4       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 5       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 6       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 7       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 8       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 9       ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## 10      ADDRESS   NYPD New York City Police Department BROOKLYN BROOKLYN
## # ... with 98 more rows, and 35 more variables: closed_date <chr>,
## #   community_board <chr>, complaint_type <chr>, created_date <chr>,
## #   cross_street_1 <chr>, cross_street_2 <chr>, descriptor <chr>,
## #   due_date <chr>, facility_type <chr>, incident_address <chr>,
## #   incident_zip <chr>, latitude <chr>, location_type <chr>,
## #   longitude <chr>, park_borough <chr>, park_facility_name <chr>,
## #   resolution_action_updated_date <chr>, resolution_description <chr>,
## #   school_address <chr>, school_city <chr>, school_code <chr>,
## #   school_name <chr>, school_not_found <chr>, school_number <chr>,
## #   school_phone_number <chr>, school_region <chr>, school_state <chr>,
## #   school_zip <chr>, status <chr>, street_name <chr>, unique_key <chr>,
## #   x_coordinate_state_plane <chr>, y_coordinate_state_plane <chr>,
## #   intersection_street_1 <chr>, intersection_street_2 <chr>
Let's look at the creation dates of these complaints. We have:
head(a$created_date) # characters
## [1] "2016-06-08T23:37:42.000" "2016-04-18T16:23:49.000" ## [3] "2016-07-03T22:27:49.000" "2016-08-07T11:36:03.000" ## [5] "2016-04-03T23:34:15.000" "2016-04-04T22:17:14.000"
Coercion is easy – lubridate does all the hard work:
a$created_date <- ymd_hms(a$created_date) # R date-time format
Let's try a simple visualization, a count by month and year:
xtabs(~ month(created_date) + year(created_date), a)
##                    year(created_date)
## month(created_date) 2010 2011 2012 2013 2014 2015 2016
##                  1     9    0    0    0    1    3    0
##                  2     1    3    0    0    0    0    0
##                  3     0    0    0    0    2    4    6
##                  4     1    0    2    1    0    0    4
##                  5     0    1    3    0    2    2    5
##                  6     0    1    6    3    2    1    4
##                  7     0    0    3    0    2    4    1
##                  8     3    0    1    2    2    4    1
##                  9     0    0    3    1    0    0    0
##                  10    0    0    2    3    1    1    0
##                  11    0    1    2    0    0    0    0
##                  12    0    0    1    0    0    3    0
This was meant to simulate how one might work interactively at the command line or with a script. We bounce along from thought to thought. This is natural, but it is helpful to go back over the session and think about how we could do things more efficiently/easily.
Usually this reflection involves creating functions to capture the repeated patterns.
For example, we had these two nearly identical commands to convert a POSIXct time into the timestamp format expected by Socrata:
d1 = sprintf("%4d-%02d-%02dT%02d:%02d:%3.3f", year(d1), month(d1), day(d1), hour(d1), minute(d1), second(d1))
d2 = sprintf("%4d-%02d-%02dT%02d:%02d:%3.3f", year(d2), month(d2), day(d2), hour(d2), minute(d2), second(d2))
We clearly see a template (the d1 and d2 values are all that change). When we use copy-and-paste to create new commands, just slightly modifying things, we should be thinking: write a function – the variable will form the template:
posix_to_timestamp <- function(d) {
sprintf("%4d-%02d-%02dT%02d:%02d:%3.3f", year(d), month(d), day(d),
hour(d), minute(d), second(d))
}
Better, but there is already a helper for this. The sprintf function is of great utility, but we can do better: the strftime function is designed for exactly this conversion. It is similar – create a template and pass in values – but here the placeholders are time-and-date centric and the values are all contained in d:
posix_to_timestamp <- function(d) {
strftime(d, "%Y-%m-%dT%H:%M:%OS3") # %OS3 is possible on some architectures
}
Testing our function to see it is vectorized:
posix_to_timestamp(c(now()-days(2), now()))
## [1] "2016-08-11T13:12:33.685" "2016-08-13T13:12:33.689"
(An improvement would be to ensure that indeed d is a time object.)
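For instance, a minimal sketch of such a guard, assuming a check against the POSIXt class is good enough for our purposes:
posix_to_timestamp <- function(d) {
  stopifnot(inherits(d, "POSIXt"))  ## insist on a date-time object
  strftime(d, "%Y-%m-%dT%H:%M:%OS3")
}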
Functions are defined by the function keyword. They have two visible things: a set of arguments and the body:
function(...arguments...) {
...body...
}
The return value of a function is the last expression evaluated. This can be implicit, or explicit by using the return function. It need not be the last line in the body, as control flow operations may affect this.
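A contrived sketch (sign_of is our own made-up name) showing an explicit early return alongside an implicit one:
sign_of <- function(x) {
  if (x < 0) return(-1)  ## explicit return; control flow exits here for negatives
  1                      ## implicit return value: the last expression evaluated
}
c(sign_of(-5), sign_of(5))
## [1] -1  1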
A third part of the function is an environment where variables within the body are found. This is a more advanced topic, but in short this is what R uses to map symbols in the body to values so that the function can be evaluated.
In our example, we repeated the following code:
query <- URLencode(query)
query <- gsub("'", "%27", query)
Repeated code should also lend itself to a function:
encodeURL <- function(query) {
query <- URLencode(query)
query <- gsub("'", "%27", query) # XXX fails on "can't" ...
query
}
(This one can also be improved – but that takes us a bit afield.)
In the past two examples, our functions take a single argument, so the usage is pretty clear – when the body refers to the variable, it uses the value passed to the function when it is called.
Functions in R are not limited to single arguments; in fact, many functions have too many arguments (IMHO :)
Arguments can be passed to functions in R in various ways:
by position: log(10, 2) and log(10, 10) give different answers – the second argument is for the base
by name: log(10, base=2) and log(10, base=10) (base is the name of the second argument, so these are identical to the positional calls above)
However, we can also name both arguments and mix order:
c(log(10, 2), log(10, base=2), log(base=2, x=10))
## [1] 3.321928 3.321928 3.321928
(As well, names are partially matched, so we need not type the entire name.)
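For example, this computes a base-10 logarithm, with b partially matching base:
log(100, b = 10)  ## partial match: b expands to base
## [1] 2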
The signature of log is log(x, base = exp(1)).
What does the base = exp(1) do? It provides a default if no base value is specified – hence log can take 1 or 2 arguments.
These defaults are evaluated lazily – not when the function is defined, not when the function is called, rather when the value is needed. This allows one to make defaults that depend on values:
Log = function(x, base=x) log(x, base)
Log(10)  # in fact log_x(x) = 1 always when x > 0
## [1] 1
R also has a convention to allow for an arbitrary number of arguments, the cutely named variable .... Here is a pattern of use:
f <- function(...) list(...)
f(1, 2, three=3)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## $three
## [1] 3
This is helpful in many circumstances, but is maybe most widely used for passing arguments on to other function calls.
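For instance, a made-up helper (scatter_with_line is our own name) can collect plotting options with ... and hand them on to plot unexamined:
scatter_with_line <- function(x, y, ...) {
  plot(x, y, ...)    ## pch, col, main, etc. pass through untouched
  abline(lm(y ~ x))  ## add a least-squares line
}
scatter_with_line(mtcars$wt, mtcars$mpg, pch=16, col="steelblue")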
We have queries and the base url. A function to combine them might be:
query_to_url <- function(query) {
url <- sprintf("%s?$where=%s", baseurl, query)
encodeURL(url)
}
Seems simple enough – but where is the value of baseurl found if it isn't passed in?
R uses nested environments to look up values for variable names. When a function is defined, the body has an environment consisting of the local variables for the function call.
There is a distinction between the environment the function is defined in and the environment where the function is called. This can be of importance when functions call other functions, but we won't see that here.
What happens in query_to_url is that baseurl is not found in the environment of local variables, so it must be found in an enclosing environment. Thus we must have defined it; using a top-level constant is one way to do this, as was done in the script.
A closure is a formal name for the function's arguments, body and environment. Closures are useful for certain programming tasks, but not for this task at hand.
x = 0
f1 <- function() x
g1 <- function() {
x = 1
f1()
}
g2 <- function() {
x = 1
f2 <- function() x
f2()
}
c(g1(), g2())
## [1] 0 1
x <- 2
g3 <- function(x=3) {
f1()
}
g4 <- function(x=4) {
f4 <- function() x
f4()
}
c(g3(), g4())
## [1] 2 4
The enclosing environment is the one where the function is defined, not where it is called.
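This lookup rule is what makes closures possible. A tiny sketch (make_adder is our own made-up name):
make_adder <- function(n) {
  function(x) x + n  ## n is found in the enclosing environment, the one created by the call to make_adder
}
add2 <- make_adder(2)
add2(10)
## [1] 12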
We saw three types of queries: a range of times, within a bounding box, matching a level for a field. There are many more. Making a nicer interface for that might be helpful.
(Had we downloaded the entire data set, we could just use subsetting and filtering to slice and dice the data R-side, but here we want to filter our request to cut down on data transfer, so we use the SODA commands to filter.)
Three helpful functions might be:
soda_character <- function(var, val) {
sprintf("%s='%s'", var, val) ## could use shQuote(val) and just %s ...
}
soda_date <- function(var, from, to=NULL) {
if (is.null(to)) {
sprintf("%s>='%s'", var, posix_to_timestamp(from))
} else {
sprintf("%s between '%s' and '%s'", var,
posix_to_timestamp(from), posix_to_timestamp(to))
}
}
(Here var names a field of the appropriate type; the location field is a Point, which is what within_box expects.)
soda_bbox <- function(var, lats, lons) {
sprintf("within_box(%s, %2.6f, %2.6f, %2.6f, %2.6f)", var, lats[1], lons[1], lats[2], lons[2])
}
Then we have, say:
soda_character("complaint_type", "Noise - Residential")
## [1] "complaint_type='Noise - Residential'"
Not as slick as indexing with `complaint_type == "Noise - Residential"`, but not too bad.
This is kind of ugly:
we have 3 functions that basically do the same action
we need to mentally keep track of which function name goes with the class of each value we want to use.
shouldn't the computer help here?
Computer languages are now designed to minimize the taxing of the brain, if they can. In this case, we can use one function name to call three similar but different functionalities and let the computer decide which based on the class of the value:
We have soda_character, soda_date, and soda_bbox all doing the same kind of thing. R has many different programming styles that implement polymorphism (one name, many actions); the S3 system used below is the simplest, and there are others (S4, reference classes, and add-on packages such as proto – older, one of the first, with different semantics).
If you bless a function name, then the following pattern will be used when the variable x matches the specified class:
fn.class_name <- function(x, ...) { ... }
So we specialize this function definition to calls where x is of the specified class.
This is widely used in R. For example, "tab-completing" plot shows these by default:
plot.default    plot.new              plot.ts
plot.design     plot.spec.coherency   plot.window
plot.ecdf       plot.spec.phase       plot.xy
plot.function   plot.stepfun
Among others, we see:
default: the catch-all definition
new: actually shows how the naming is just a convention; there is no "new" class.
Tab completion just lists visible functions by name; we should instead call methods:
methods(plot)
##  [1] plot.acf*           plot.data.frame*    plot.decomposed.ts*
##  [4] plot.default        plot.dendrogram*    plot.density*
##  [7] plot.ecdf           plot.factor*        plot.formula*
## [10] plot.function       plot.hclust*        plot.histogram*
## [13] plot.HoltWinters*   plot.isoreg*        plot.lm*
## [16] plot.medpolish*     plot.mlm*           plot.ppr*
## [19] plot.prcomp*        plot.princomp*      plot.profile.nls*
## [22] plot.raster*        plot.spec*          plot.stepfun
## [25] plot.stl*           plot.table*         plot.ts
## [28] plot.tskernel*      plot.TukeyHSD*
## see '?methods' for accessing help and source code
Bless the function name:
Soda <- function(x,...) UseMethod("Soda")
UseMethod is the circus master that controls the dispatching of which method to call based on the class of the first argument.
Then we have:
Soda.character <- function(val, var) {
sprintf("%s='%s'", var, val) ## could use shQuote(val) and just %s ...
}
Very similar – but val and var are switched (a bit unnaturally!) as we need to dispatch on the class of the value, so it goes first.
Thinking in terms of vectorization, we might get passed a vector of values for val. Let's do a bit better – R-wise – and use the in feature of SoQL instead:
Soda.character <- function(val, var) {
vals = paste(shQuote(val), collapse = ", ")
if (length(val) == 1) {
sprintf("%s = %s", var, vals)
} else {
sprintf("%s in(%s)", var, vals)
}
}
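A quick check of both branches (the single quotes come from shQuote):
Soda("Noise - Residential", "complaint_type")
## [1] "complaint_type = 'Noise - Residential'"
Soda(c("Noise - Residential", "Illegal Parking"), "complaint_type")
## [1] "complaint_type in('Noise - Residential', 'Illegal Parking')"
Similarly, we specialize for times: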
Soda.POSIXct <- function(from, var, to=NULL) {
if (is.null(to)) {
out <- sprintf("%s>='%s'", var, posix_to_timestamp(from))
} else {
out <- sprintf("%s between '%s' and '%s'", var,
posix_to_timestamp(from), posix_to_timestamp(to))
}
out
}
Or we might also have written this using from as a length 1 or 2 vector; then we would have something like:
sprintf("%s %s %s", var, ifelse(length(from) == 1, ">=", "between"),
        paste(shQuote(posix_to_timestamp(from)), collapse=" and "))
This gives a cleaned up version:
Soda.POSIXct <- function(vals, var) {
  if (length(vals) > 1)
    vals <- vals[1:2]
  compare_with <- ifelse(length(vals) == 1, ">=", "between")
  val <- paste(shQuote(posix_to_timestamp(vals)), collapse=" and ")  ## format times for SODA before quoting
  sprintf("%s %s %s", var, compare_with, val)
}
The advantage here is the consistent calling pattern makes it easier for users to not make mistakes.
What to do about the bounding box? A bounding box is just a set of 4 numbers. We'd like some way to mark these 4 numbers as a "bounding box," rather than just a set of 4 numbers (which might otherwise be used for numeric comparisons, say).
We need to make a bounding box class. Is that hard?
Well, with R's S3 classes, it is almost too easy – classes are just a naming convention. Objects have a class attribute, a character vector, that can be adjusted. Below, we just push a name onto the front. This constructor, bbox, creates a BBOX object:
bbox <- function(lats, lons) {
lats <- range(as.numeric(lats), na.rm=TRUE)
lons <- range(as.numeric(lons), na.rm=TRUE)
out <- c(lats[1], lons[1], lats[2], lons[2])
class(out) <- c("BBOX", class(out)) ## <--- the key
out
}
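Once the class is in place, any generic can be specialized on it. For instance, we might give BBOX objects a friendlier print method (a sketch of our own devising; nothing below depends on it):
print.BBOX <- function(x, ...) {
  cat(sprintf("Bounding box: lat [%.6f, %.6f], lon [%.6f, %.6f]\n",
              x[1], x[3], x[2], x[4]))  ## recall the c(lat1, lon1, lat2, lon2) layout
  invisible(x)
}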
With this, we can control the dispatch:
Soda.BBOX <- function(b, var) {
sprintf("within_box(%s, %2.6f, %2.6f, %2.6f, %2.6f)", var, b[1], b[2], b[3], b[4])
}
And we have, for example:
Soda(bbox(a$latitude, a$longitude), "location")
(We compose our two functions here.)
The above helps us make one query. But it is much more natural to combine queries. In indexing notation, we use & and | to combine logical expressions. Can we do the same?
Well, R lets us define & and | using the same S3 conventions. So we just need a class to control dispatch:
Query <- function(x) {
class(x) <- c("Query", class(x))
x
}
Great. We could go back and edit each of our Soda methods to return a Query object, but instead we use a wrapper function. This also allows us to verify the variable name along the way.
For that task, we take advantage of R's partial matching of names mechanism:
varnames <- c('address_type', 'agency', 'agency_name', 'borough', 'city',
              'closed_date', 'community_board', 'complaint_type', 'created_date',
              'cross_street_1', 'cross_street_2', 'descriptor', 'due_date',
              'facility_type', 'incident_address', 'incident_zip', 'latitude',
              'location.type', 'location', 'location_type', 'longitude',
              'park_borough', 'park_facility_name',
              'resolution_action_updated_date', 'resolution_description',
              'school_address', 'school_city', 'school_code', 'school_name',
              'school_not_found', 'school_number', 'school_phone_number',
              'school_region', 'school_state', 'school_zip', 'status',
              'street_name', 'unique_key', 'x_coordinate_state_plane',
              'y_coordinate_state_plane') # hard-coded list
varnames <- sapply(varnames, identity, simplify=FALSE)
verify_name <- function(nm) {
out <- varnames[[nm, exact=FALSE]] ## use partial matching. More directly, could use `charmatch`.
if (is.null(out))
stop(nm, " does not match")
out
}
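Trying it out (the second call should fail, as school_c matches both school_city and school_code):
verify_name("complaint_")
## [1] "complaint_type"
try(verify_name("school_c"))  ## ambiguous partial match, so it errors
## Error in verify_name("school_c") : school_c does not match
With the name check in hand, the wrapper is: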
soda <- function(x, var, ...) {
if (length(var) > 1) warning("Only one variable at a time, first one being used.")
Query(Soda(x, verify_name(var[1]), ...))
}
"&.Query" <- function(x, y) Query(paste(x, y, sep=" and "))
"|.Query" <- function(x, y) Query(paste(x, y, sep=" or "))
And we have:
soda(bbox(a$latitude, a$longitude), "location") & soda("Noise - Residential", "created_d")
## [1] "within_box(location, 40.709534, -73.965395, 40.711445, -73.963508) and created_date = 'Noise - Residential'" ## attr(,"class") ## [1] "Query" "character"
There is a package, RSocrata, for querying Socrata data, but it is more concerned with getting the data than with constructing the query. That work has a more robust means to download files, but we will stick with the magical fromJSON function, which returns a data frame from the query. This next function tidies up our downloaded data frame through some conversions:
tidy_up <- function(d) {
d$location <- NULL # clean up
for (nm in c("created_date", "due_date"))
d[[nm]] <- ymd_hms(d[[nm]])
for (nm in c("latitude", "longitude"))
d[[nm]] <- as.numeric(d[[nm]])
d
}
We can make a request with the following. (There is a bit more here than is needed.)
request <- function(query) {
url <- query_to_url(query)
out <- Reduce(rbind, lapply(url, fromJSON)) # handles a vector of urls; a single fromJSON(url) would do otherwise
out <- tidy_up(out)
class(out) <- c("Response", class(out))
out
}
We can see how we have broken each step into a separate function call.
Each line takes care of something: the first maps a query to a url, the second requests the url (which may possibly be vectorized), the third tidies up this data frame, and the fourth adds to the class attribute – leaving us a hook in case we want to specialize some function on these data sets.
The use of GoogleMaps to visualize the request is nice. We can make it the default plotting method for our requests as follows:
plot.Response <- function(a, pch=16, cex=0.3, ...) {
bb <- with(a, qbbox(latitude, longitude))
map <- GetMap.bbox(bb$lonR, bb$latR)
## Translate original data
coords <- with(a, LatLon2XY.centered(map, latitude, longitude))
coords <- data.frame(coords)
## Plot
PlotOnStaticMap(map)
with(coords, points(newX, newY, pch=pch, cex=cex, ...))
}
lat = 40.7104541
lon = -73.9644729
query = soda(bbox(lat + 0.001 * c(-1,1), lon + 0.001 * c(-1,1)), "location") &
soda("Noise - Residential", "complaint_type")
a = request(query)
plot(a)
This example naturally lends itself to more generalizations:
more visualizations
taking advantage of the different query types available
If this were to be pursued with the intent of sharing it with others, we would want to "package" up the work.
We have seen how to
install external packages in R
load external packages in R
Both tasks are fairly painless. Now we see how to make an external package.
The hardest part is a good name! We will use the311.
RStudio's Projects make it easy to create a new package when working on a new project. Basically, it makes a call to
devtools::create("the311")
This makes a directory with the minimal structure needed to be a package.
Package development in R is greatly assisted by the devtools package, which can be used outside of RStudio.
After calling create we need to do some customization:
edit the DESCRIPTION file
navigate to the R subdirectory and create the necessary files
Some packages use one big file to hold everything, others use small files arranged around different tasks.
The latter is better for packages that may grow; the former may be better while a package is being developed and its best organization is still being worked out.
Our package, the311, uses the latter.
By default, R uses the alphabetical order of the files to determine which file is loaded first. This can lead to files named aaa.R or zzz.R. Kinda ugly.
The devtools package gives directives that can make one file depend on another; the dependent file will then load later. Looking at the example package, we have lines like:
##' @include utils.R
NULL
This would force utils.R to be loaded before the file containing this line is.
The comment character is trailed by a ', which is important: this is the roxygen documentation format. Here we aren't actually documenting anything, so we put in the NULL as a trick.
Packages must be loaded. One development cycle is: load the package, test the package, edit the package, install the package, restart R, load the package … lather, rinse, repeat …
This can get tedious. The devtools package provides load_all to temporarily install the package – avoiding the need to restart R.
load_all(pkgname)
load_all("the311")
lat = 40.7104541
lon = -73.9644729
lats = lat + 0.001 * c(-1,1)
lons = lon + 0.001 * c(-1,1)
query = soda(bbox(lats, lons), "location") &
soda("Noise - Residential", "complaint_type")
a = request(query)
dim(a)
## [1] 108 40
plot(a, cex=2)
When using load_all every function in the package is available.
This is not the case if we had installed the311 and loaded it using require.
Then only exported functions are immediately visible to the user.
As exported functions can mask other packages' functions or names, package developers should be conservative in what they export, and should re-use common names within the generic spirit they represent (such as plot for the default plot, summary for a short summary, …).
Marking a function for export can be done when using roxygen with the @export directive.
After mentioning it twice, let's describe it. The roxygen package gives a relatively simple way to document the functions in your package so that the documentation is accessible to users of the package. The format for a function is fairly simple:
document what the function does
document the arguments (parameters)
indicate if the function is exported
optionally give details, examples, notes, references.
Documentation isn't limited to functions.
Good documentation is hard to write, but there should at least be bad documentation.
For example,
##' Request a subset of the 311 data set
##' @param query A query built up using `soda` calls
##' @note A more robust method is here https://github.com/Chicago/RSocrata/
##' @export
request <- function(query) {
...
}