The dplyr package

Rob Schick
27 March 2014

Outline

Brief intro to plyr
- Philosophy
- Examples
Why dplyr is better
- Speed up from base R, to plyr to dplyr
- 5 “verbs” in dplyr
- Chaining syntax together
- Other advantages

History of dplyr

Origins in the plyr package (Wickham 2011)
split-apply-combine philosophy
Similar to the map-reduce algorithm from functional programming
- Originally primitives in Lisp
- Made massively scalable by Google
- map step: master node takes the input and divides it into smaller sub-problems for sub-nodes
- reduce step: master node collects answers and reassembles them to form the answer
In R: base::split(), base::*apply(), base::rbind()

plyr's Aims

replaces loops, not because they are slow, but because they mask intent
- Within loops, important code is mixed with bookkeeping code
- plyr eliminates extra code & illuminates key components of the code

Apply functions vs plyr

pieces <- split(ozonedf, list(ozonedf$lat, ozonedf$long))

models <- lapply(pieces, deseasf_df)

results <- mapply(function(model, df) {
cbind(df[rep(1, 72), c("lat", "long")], resid(model))
}, models, pieces)

deseasdf <- do.call("rbind", results)

Apply functions vs plyr, II

models <- dlply(ozonedf, .(lat, long), deseasf_df)

deseas <- ldply(models, resid)

plyr functions

Input\Output	Array	Data frame	List	Discarded
Array	aaply	adply	alply	a_ply
Data frame	daply	ddply	dlply	d_ply
List	laply	ldply	llply	l_ply

ddply example - summarise

# calculate mean runs batted in (that's baseball speak) for each year

tail(ddply(baseball, .(year), summarise, mean_rbi = mean(rbi, na.rm = TRUE)))

    year mean_rbi
132 2002    28.24
133 2003    24.44
134 2004    24.89
135 2005    21.33
136 2006    18.07
137 2007    18.83

Advantages of dplyr to plyr

It works exclusively with data in R data frames
It can process data in remote databases
It introduces a “grammar of data manipulation”, allowing you to string together operations with the %.% operator
It's much, much faster than plyr or standard R operations (most processing is done in parallel in C++)
Speed comparison:

Top 3 baseball hitters in base R - for loop

n.b. there are 17908 players in the Lahman package with 96600 rows

inds <- unique(Batting$playerID)
totals <- data.frame(ID = inds, runs = rep(NA, length(inds)))

for(i in 1:length(inds)){
  totals$runs[i] <- sum(Batting[Batting$playerID == inds[i], 'R'], na.rm=TRUE)
}
head(runs[order(-totals$runs),],3)

Top 3 baseball hitters in base R - aggregate

library(Lahman)

totals <- aggregate(. ~ playerID, data=Batting[,c("playerID","R")], sum)

ranks <- sort.list(-totals$R)

totals[ranks[1:3],]

      playerID    R
6733 henderi01 2295
2869  cobbty01 2246
1398 bondsba01 2227

Top 3 baseball hitters in plyr

runs <- ddply(Batting, .(playerID), summarise, total_runs = sum(R, na.rm = TRUE))

head(runs[order(-runs$total_runs),],3)

      playerID total_runs
6967 henderi01       2295
2973  cobbty01       2246
1446 bondsba01       2227

Top 3 baseball hitters in dplyr

Batting %.%
  group_by(playerID) %.%
  summarise(total = sum(R)) %.%
  arrange(desc(total)) %.%
  head(3)

Source: local data frame [3 x 2]

   playerID total
1 henderi01  2295
2  cobbty01  2246
3 bondsba01  2227

Speed Comparison

Method	Proc Time (s)	Speed Up
`for(i in ...)`	80.74
`plyr`	8.191	9-10x faster
`aggregate`	0.298	270x faster
`dplyr`	0.044	1835x faster

Top 3 baseball hitters in the flesh

alt text

dplyr's 5 verbs

Filter rows with filter()
Arrange rows with arrange()
Select columns with select()
Add new columns with mutate()
Summarise values with summarise()
Commonalities
- First argument is a data frame
- Subsequent args describe what to do with it
- The result is a new data frame

filter()

M$-Excel's filter tool:

filter()

filter(hflights_df, Month == 12, DayofMonth == 25)

filter() vs which()

filter(hflights_df, Month == 12, DayofMonth == 25)

hflights_df[which(hflights_df$Month == 12 & hflights_df$DayofMonth == 25),]

filter(): 0.03119 seconds
which(): 0.04858 seconds
Difference: 0.01739 seconds

arrange()

It's like filter, but reorders rows instead of filtering them

arrange(hflights_df, desc(ArrDelay))

arrange()

It's a straightforward wrapper around order() with less typing. Compare the first and second lines

hflights[order(hflights$ArrDelay, hflights$DayofMonth, hflights$Month),]

arrange(hflights_df, ArrDelay, DayofMonth, Month)

select()

select() allows you to rapidly zoom in on a useful subset of a data frame. This works similarly to base::subset()

select(hflights_df, Month, DayofMonth)

mutate()

Add new columns that are functions of existing columns

mutate(hflights_df, 
       gain = ArrDelay - DepDelay,
       speed = Distance / AirTime * 60)

summarise()

Collapsed a data frame to a single row

summarise(hflights_df, 
          delay = mean(DepDelay, na.rm = TRUE))

Source: local data frame [1 x 1]

  delay
1 9.445

Let's see the verbs in action

Data are from the hflights package, which has all flights from Houston, TX in 2011 (n = 227496)
Demo the Code

Grouped Operations

The verbs become powerful when combined with the idea of “group by”
In dplyr the group_by() function tells R how to break a dataset down into groups of rows
Use the new object in exactly the same functions as above
Example to find distance by delay for individual planes:

Grouped Operations - Code

planes <- group_by(hflights_df, TailNum)

delay <- summarise(planes,
  count = n(),
  dist = mean(Distance, na.rm = TRUE),
  delay = mean(ArrDelay, na.rm = TRUE))

delay <- filter(delay, count > 20, dist < 2000)

Chaining

dplyr API is functional in the sense that function calls don't have side-effects & you must always save the results
With many operations at once –> inelegant code
Three options
1. one by one
2. nested
3. chained

One by One

a1 <- group_by(hflights, Year, Month, DayofMonth)

a2 <- select(a1, Year:DayofMonth, ArrDelay, DepDelay)

a3 <- summarise(a2,
  arr = mean(ArrDelay, na.rm = TRUE),
  dep = mean(DepDelay, na.rm = TRUE))

a4 <- filter(a3, arr > 30 | dep > 30)

Nested

filter(
  summarise(
    select(
      group_by(hflights, Year, Month, DayofMonth),
      Year:DayofMonth, ArrDelay, DepDelay
    ),
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)

Difficult to read because it's inside out
Arguments are a long way from the function

Chained

To get around these, dplyr offers the %.% operator

hflights %.%
  group_by() %.%
  select() %.%
  summarise() %.%
  filter()

See demoCode.R

Other Advantages

Aggregate functions (see demoCode.R)
window-functions (see demoCode.R)

Hammerin' Hank & The Babe

alt text

Other Advantages

Works with data tables
Allows processing in remote databases
- with the transformations done in-database
- only the result is returned to R

Summary

Fast, compact, and elegant
Makes it easy to stay with the data
Makes it easy to get more out of the data

data %.%
  group_by() %.%
  select() %.%
  summarise() %.%
  mutate() %.% 
  filter() %.%
  arrange()