A domain-specific language for efficient data reformation in R.

Hadley Wickham, Rice University.

Overview

R hits a sweet spot for data analysis. It provides the tools needed for fluid, interactive, exploratory data analysis: tools for visualisation, modelling and aggregation. If your data fits in memory, R provides an unparalleled environment for exploration.

But what if your data doesn’t fit in memory? R provides few out-of-memory algorithms, and R, while getting faster, is not a suitable implementation environment for high-performance algorithms. Nevertheless, R is still widely used for exploring big data: it is used at Google, eBay, Facebook and LinkedIn, among other companies with terabytes to petabytes of data. R is still useful for big data because most of the time you don’t look at all the data at once: you look at subsets and aggregations. Collectively, we’ll call these operations data reformation.

The current infrastructure for big data in R is unwieldy. Most reformation is done in the datastore, and requires writing code in another language, e.g. SQL for relational databases or HQL for Hive, a Hadoop-based datastore. This is problematic because one data retrieval iteration is never enough: you almost always realise that you need a different level of aggregation (finer or coarser), or a different subsample. Stepping out to another language breaks flow and dramatically increases iteration time, hampering the cycle of aggregation, modelling and visualisation necessary for good data exploration and analysis.

In this project, I propose building an efficient domain-specific language (DSL) for data reformation in R. This extends my existing work on the plyr package, which provides an elegant, but slow, tool for in-memory aggregation, and my research on the process of data analysis, which suggests the basic building blocks of the language. The goal is to create a DSL that seamlessly abstracts away the data storage format: if the data is already in R, it will be worked with there; otherwise it will be fetched efficiently from a remote datastore. The challenge is to create an expressive language for data analysis tasks that pushes the hard computational work off to tools that already do it well.

Prior work

There is much existing work on building domain-specific languages to support various components of the data analysis cycle. There has been a particularly large effort in visualisation: ggplot2 in R, and Protovis and D3 in JavaScript, to name a few. R provides a unified DSL for modelling, the formula interface, which is used in many packages, including zelig and VGAM, which extend it to deal with a large class of models. My plyr package is one attempt to develop a DSL for data reformation, but it is slow and rather procedural. data.table provides an efficient implementation, but it is implemented as an extension of subsetting and is not easily generalisable.

Outside of R, there are many tools that map relational databases to objects: object-relational mappers (ORMs). Arel, a Ruby library, is a particularly nice example of the species because it explicitly considers the relational algebra underlying SQL, and is designed to be closed under composition and to work with appropriate low-level data structures (not just strings). Compared to these existing efforts, the proposed work will focus on the tools needed for data analysis (filter, arrange, transform, summarise), not arbitrary database manipulation, and will abstract tasks across multiple types of data store, not just SQL.

Overview of DSL

Fluid data analysis requires specific tools to support data reformation. This section introduces a common data analysis scenario, and the verbs, adverbs and conjunctions needed to support it. To motivate the ideas, below I show some code from a recent data analysis, presented as part of a Google tech talk. It counts deaths by cause of death (cod) and hour of death (hod), subsets to remove missing values, and then transforms by calculating the proportion of deaths per hour for each cause of death:

library(plyr)  # provides count() and ddply()
hod2 <- count(deaths, c("cod", "hod"))                          # deaths per cod and hod
hod2 <- subset(hod2, !is.na(hod))                               # drop missing hours
hod2 <- ddply(hod2, "cod", transform, prop = freq / sum(freq))  # per-cod proportions

This code is very procedural, which makes it hard to separate the operations from the data they are performed on. A declarative DSL might look more like this:

deaths +
  count(c("cod", "hod")) +
  subset(!is.na(hod)) +
  by("cod") + 
    transform(prop = freq / sum(freq)) + 
  collapse()

Here we use operator overloading to allow + to progressively build up a set of operations. The key property of the DSL is that it is declarative: it describes what result is wanted, not how to compute it, which leaves the implementation free to rearrange the operations and execute them efficiently on each backend.
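
One way this overloading might be implemented, as a minimal sketch in the style of ggplot2's + operator (all function and class names here are hypothetical), is for each verb to return an operation object, with + appending it to a growing pipeline without computing anything:

# Minimal sketch: verbs build "op" objects; + accumulates them lazily
op <- function(type, args) structure(list(type = type, args = args), class = "op")
pipeline <- function(data) structure(list(data = data, ops = list()), class = "pipeline")

"+.pipeline" <- function(e1, e2) {
  e1$ops <- c(e1$ops, list(e2))   # record the operation; evaluate later
  e1
}

# illustrative re-definition of count() as a lazy op constructor
count <- function(vars) op("count", list(vars = vars))

# Nothing is computed here; we have only described the reformation
p <- pipeline(deaths) + count(c("cod", "hod"))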

The following sections describe the putative components of the grammar in more detail.

Verbs

Verbs operate on the current dataset and return a new dataset. I’m currently aware of four verbs that seem to cover a very large proportion of data analysis reformations; each is sketched in code after the list:

  • filter for extracting a subset of the rows and/or columns
  • transform for adding new columns or replacing existing columns
  • arrange for reordering the rows
  • summarise for reducing many numbers down to a single number
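
As a point of reference, each verb has a direct in-memory analogue in base R and plyr; a hedged sketch of what the four verbs compute, written against the deaths data (the am column is invented purely for illustration):

library(plyr)  # mutate(), arrange() and summarise()

d <- subset(deaths, !is.na(hod))      # filter: extract a subset of rows
d <- mutate(d, am = hod < 12)         # transform: add a new column
d <- arrange(d, cod, hod)             # arrange: reorder the rows
summarise(d, n = length(hod))         # summarise: reduce to a single number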

To ensure that users of this framework are not hampered by my possibly partial understanding of what verbs are necessary for data analysis, the verbs will be implemented using a standard API that can easily be extended to cover more cases.
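
A hedged sketch of what such an API might look like, building on the op() constructor sketched above (none of these functions exist yet): verbs dispatch on the class of the data source, so a new verb or a new backend only needs to supply additional S3 methods.

# Hypothetical extension point: one generic, one method per backend
apply_op <- function(data, op) UseMethod("apply_op")

apply_op.data.frame <- function(data, op) {
  switch(op$type,
    filter = data[eval(op$args$cond, data), , drop = FALSE],
    stop("unknown op: ", op$type)
  )
}

# A user-contributed verb is just a new op constructor
filter <- function(cond) op("filter", list(cond = substitute(cond)))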

Adverbs

Adverbs modify the operation of verbs. I’m currently aware of one pair: by and collapse. The by operator breaks a data frame up into a list of data frames, and the collapse operator collapses the list back down into a single data frame. Together with the verbs described above, this makes group-wise aggregations and transformations possible, which are extremely common operations. Note that this is a virtual operation: it may not actually require the data to be divided into pieces (and when done efficiently it almost certainly does not).
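
The semantics of by and collapse can be spelled out naively in base R as split, apply, and recombine; a minimal sketch using the hod2 counts computed earlier (a real implementation would avoid materialising the pieces):

pieces <- split(hod2, hod2$cod)                              # by("cod")
pieces <- lapply(pieces, transform, prop = freq / sum(freq)) # the verb, per group
hod2   <- do.call(rbind, pieces)                             # collapse()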

Conjunctions

Conjunctions combine multiple datasets, combining the active data with supplemental data. The join conjunction is familiar to users of SQL, and has inner, left, right and outer variants, which control what happens when there are no matching rows in the active or supplemental data. Another useful conjunction is match, which operates like join but does not add columns: it just restricts the active data to rows that match rows in a supplemental dataset. Left and right variants control whether it’s the active or the supplemental data that is returned.
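
In base R terms, the join variants correspond to merge(), and match corresponds to a semi-join; a hedged sketch, where codes is a hypothetical supplemental table mapping each cod to a label:

inner <- merge(deaths, codes, by = "cod")                # inner join
left  <- merge(deaths, codes, by = "cod", all.x = TRUE)  # left join
right <- merge(deaths, codes, by = "cod", all.y = TRUE)  # right join
outer <- merge(deaths, codes, by = "cod", all = TRUE)    # outer join

# match (left variant): restrict rows, but add no columns
matched <- deaths[deaths$cod %in% codes$cod, , drop = FALSE]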

Challenges

A big advantage of a declarative approach is that we describe much more in a single step, and so more information is available for implementing the operations efficiently. A compilation phase will rearrange the steps into the most efficient form. A simple example: whenever a summary is used, we can work backwards to figure out the minimal set of variables needed to compute it.
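
For instance, R can recover the variables an expression mentions with all.vars(), so a compiler could take the union across all steps and fetch only those columns from the datastore; a minimal sketch using the expressions from the pipeline above:

ops <- list(
  quote(!is.na(hod)),        # the subset condition
  quote(freq / sum(freq))    # the transform expression
)
unique(unlist(lapply(ops, all.vars)))
#> [1] "hod"  "freq"
# plus any grouping and counting variables, here "cod" and "hod"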

A substantial challenge is to do this efficiently regardless of the backend. This is challenging for an individual backend (e.g. SQL), let alone for multiple backends, which may have radically different performance characteristics. It will be a challenge to develop a data structure that captures the essence of the reformation operations and can be compiled to efficient calls for diverse backends. I hope to tackle this challenge by recruiting an advanced CS student with an interest in statistics.

Outcomes and results

The output of this project will be an open source R package that implements a DSL that describes the common data transformations needed for data analysis. We will also implement two backends, one for data frames in R and one for data in relational databases using SQL.
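
As an illustration of the SQL backend (the generated query below is a hypothetical compilation target, not an existing implementation), the declarative deaths pipeline above could collapse the count and subset steps into a single query:

sql <- "
  SELECT cod, hod, COUNT(*) AS freq
  FROM deaths
  WHERE hod IS NOT NULL
  GROUP BY cod, hod
"
# the group-wise prop = freq / sum(freq) would compile to a window
# function or a self-join, depending on the database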

We will document the connector API to make sure it is easy for others to contribute backends, and identify collaborators for future backends. Future work could also provide backends for other datastores, such as Hive and other Hadoop-based systems.

This is an ambitious project and will not be fully complete in one year. The Google grant will allow me to explore the idea fully, implementing one or two backends and providing evidence as to whether or not this approach can be successful. I will use it as a pilot study when applying for future grant funding.