In every data analysis you have to string together many tools. You need data munging, visualisation and modelling tools to understand what’s going on in your data. To be effective, you need to be able to flow easily from one tool to the next, focussing on asking and answering questions of the data, not struggling to get the output from one function into the right format to feed to the next step. This is a problem I spend a lot of time thinking about for R, and today I want to show you a new technique that I’ve found particularly useful.

R, at its heart, is a functional programming language: you do data analysis in R by composing functions. The problem, however, is that lots of function composition makes it hard to read your code. For example, here’s some R code that munges flight delay data from New York City in 2013. What does it do?

library(nycflights13)
library(dplyr)

arrange(
  summarise(
    group_by(
      filter(
        flights,
        !is.na(dep_delay)
      ),
      year, month, day
    ),
    delay = mean(dep_delay)
  ),
  desc(delay)
)

It’s hard to read this code because:

  1. The innermost operation is performed first, so to read the operations in sequence you have to read from inside-out, from right-to-left.

  2. Function arguments become increasingly spread apart (this is known as the Dagwood sandwich problem).

We can make the code easier to read by using the new pipe operator, %>%, provided by the magrittr package. It turns function composition into a sequence of imperative commands: “Start here, then do this, then do that, then do something else”. Here’s what the previous code looks like if we use %>% (when reading, pronounce %>% as “then”):

flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(year, month, day) %>%
  summarise(delay = mean(dep_delay)) %>%
  arrange(desc(delay)) %>%
  head(5)
#> Source: local data frame [5 x 4]
#> Groups: year, month
#> 
#>   year month day delay
#> 1 2013     3   8 83.54
#> 2 2013     7   1 56.23
#> 3 2013     9   2 53.03
#> 4 2013     7  10 52.86
#> 5 2013    12   5 52.33

Even if you’ve never used R before, you should be able to make sense of this: we start with the flights data, then remove all flights with a missing departure delay. Next we group by date and summarise to work out the average daily delay. Finally we arrange in descending order of delay and look at the five worst days. That shows us the worst day, by a substantial margin, was March 8. Apart from the final head(5), which limits the output to five rows, this code does the same thing as the first snippet, but it’s much easier to read!
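One way to convince yourself that the two styles are interchangeable is to run both on a small data frame. The sketch below relies on magrittr's rule that x %>% f(y) is evaluated as f(x, y), with the left-hand side becoming the first argument. The toy data frame and its values are made up for illustration and merely stand in for flights.

```r
library(dplyr)  # dplyr re-exports %>% from magrittr

# Hypothetical toy data standing in for `flights`; the values are invented.
toy <- data.frame(
  day       = c(1, 1, 2, 2, 3),
  dep_delay = c(10, NA, 30, 50, 5)
)

# A single step: the left-hand side is passed as the first argument.
piped_step  <- toy %>% filter(!is.na(dep_delay))
nested_step <- filter(toy, !is.na(dep_delay))

# A whole chain, written as a pipeline ...
piped_chain <- toy %>%
  filter(!is.na(dep_delay)) %>%
  group_by(day) %>%
  summarise(delay = mean(dep_delay)) %>%
  arrange(desc(delay))

# ... and as nested calls, which must be read from the inside out.
nested_chain <- arrange(
  summarise(
    group_by(filter(toy, !is.na(dep_delay)), day),
    delay = mean(dep_delay)
  ),
  desc(delay)
)

# Both comparisons should report TRUE.
same_step  <- identical(piped_step, nested_step)
same_chain <- isTRUE(all.equal(as.data.frame(piped_chain),
                               as.data.frame(nested_chain)))
print(c(same_step, same_chain))
```

Because the pipe only changes how the call is written, not what is computed, you can convert existing nested code one step at a time and check the result after each change.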

%>% has analogues in many other languages: it’s similar to the pipe operator in F#, method chaining in JavaScript and Python, Clojure’s thread-first macro (->), and Haskell’s reverse application operator (&). The idea is certainly not new, but the implementation in R is novel, and it’s a very powerful tool. It has really transformed the way that I code new data analyses.

It’s easy to write functions that work in a pipeline. A function only needs to adhere to a simple contract: its first argument and its result should be the same type of thing. This makes it easy to write pipelines for different types of thing. That’s a big theme of the R day at StrataNYC. If you come along, you’ll see:

Another theme of the day is ensuring your work in R engages seamlessly with the rest of your company. You’ll learn how to intermingle code and text into static reports with rmarkdown, and how to create engaging analytic experiences with shiny. You’ll also learn about packrat. Have you ever upgraded a package so you could use the latest and greatest feature, and then cursed when it broke old code in another project? Packrat solves that problem by managing an independent set of packages for each project.