The dplyr package

Rob Schick
27 March 2014

Outline

  • Brief intro to plyr
    • Philosophy
    • Examples
  • Why dplyr is better
    • Speed up from base R, to plyr to dplyr
    • 5 “verbs” in dplyr
    • Chaining syntax together
    • Other advantages

History of dplyr

  • Origins in the plyr package (Wickham 2011)
  • split-apply-combine philosophy
  • Similar to the map-reduce algorithm from functional programming
    • Originally primitives in Lisp
    • Made massively scalable by Google
    • map step: master node takes the input and divides it into smaller sub-problems for sub-nodes
    • reduce step: master node collects answers and reassembles them to form the answer
  • In R: base::split(), base::*apply(), base::rbind()

plyr's Aims

  • replaces loops, not because they are slow, but because they mask intent
    • Within loops, important code is mixed with bookkeeping code
    • plyr eliminates extra code & illuminates key components of the code

Apply functions vs plyr

pieces <- split(ozonedf, list(ozonedf$lat, ozonedf$long))

models <- lapply(pieces, deseasf_df)

results <- mapply(function(model, df) {
cbind(df[rep(1, 72), c("lat", "long")], resid(model))
}, models, pieces)

deseasdf <- do.call("rbind", results)

Apply functions vs plyr, II

models <- dlply(ozonedf, .(lat, long), deseasf_df)

deseas <- ldply(models, resid)

plyr functions

Input\Output Array Data frame List Discarded
Array aaply adply alply a_ply
Data frame daply ddply dlply d_ply
List laply ldply llply l_ply

ddply example - summarise

# calculate mean runs batted in (that's baseball speak) for each year

tail(ddply(baseball, .(year), summarise, mean_rbi = mean(rbi, na.rm = TRUE)))
    year mean_rbi
132 2002    28.24
133 2003    24.44
134 2004    24.89
135 2005    21.33
136 2006    18.07
137 2007    18.83

Advantages of dplyr to plyr

  • It works exclusively with data in R data frames
  • It can process data in remote databases
  • It introduces a “grammar of data manipulation”, allowing you to string together operations with the %.% operator
  • It's much, much faster than plyr or standard R operations (most processing is done in parallel in C++)
  • Speed comparison:

Top 3 baseball hitters in base R - for loop

n.b. there are 17908 players in the Lahman package with 96600 rows

inds <- unique(Batting$playerID)
totals <- data.frame(ID = inds, runs = rep(NA, length(inds)))

for(i in 1:length(inds)){
  totals$runs[i] <- sum(Batting[Batting$playerID == inds[i], 'R'], na.rm=TRUE)
}
head(runs[order(-totals$runs),],3)

Top 3 baseball hitters in base R - aggregate

library(Lahman)

totals <- aggregate(. ~ playerID, data=Batting[,c("playerID","R")], sum)

ranks <- sort.list(-totals$R)

totals[ranks[1:3],]
      playerID    R
6733 henderi01 2295
2869  cobbty01 2246
1398 bondsba01 2227

Top 3 baseball hitters in plyr

runs <- ddply(Batting, .(playerID), summarise, total_runs = sum(R, na.rm = TRUE))

head(runs[order(-runs$total_runs),],3)
      playerID total_runs
6967 henderi01       2295
2973  cobbty01       2246
1446 bondsba01       2227

Top 3 baseball hitters in dplyr

Batting %.%
  group_by(playerID) %.%
  summarise(total = sum(R)) %.%
  arrange(desc(total)) %.%
  head(3)
Source: local data frame [3 x 2]

   playerID total
1 henderi01  2295
2  cobbty01  2246
3 bondsba01  2227

Speed Comparison

Method Proc Time (s) Speed Up
for(i in ...) 80.74
plyr 8.191 9-10x faster
aggregate 0.298 270x faster
dplyr 0.044 1835x faster

Top 3 baseball hitters in the flesh

alt text alt text alt text

dplyr's 5 verbs

  • Filter rows with filter()
  • Arrange rows with arrange()
  • Select columns with select()
  • Add new columns with mutate()
  • Summarise values with summarise()
  • Commonalities
    • First argument is a data frame
    • Subsequent args describe what to do with it
    • The result is a new data frame

filter()

M$-Excel's filter tool:

filter()

filter(hflights_df, Month == 12, DayofMonth == 25)

filter() vs which()

filter(hflights_df, Month == 12, DayofMonth == 25)

hflights_df[which(hflights_df$Month == 12 & hflights_df$DayofMonth == 25),]
  • filter(): 0.03119 seconds

  • which(): 0.04858 seconds

  • Difference: 0.01739 seconds

arrange()

  • It's like filter, but reorders rows instead of filtering them
arrange(hflights_df, desc(ArrDelay))

arrange()

  • It's a straightforward wrapper around order() with less typing. Compare the first and second lines
hflights[order(hflights$ArrDelay, hflights$DayofMonth, hflights$Month),]

arrange(hflights_df, ArrDelay, DayofMonth, Month)

select()

  • select() allows you to rapidly zoom in on a useful subset of a data frame. This works similarly to base::subset()
select(hflights_df, Month, DayofMonth)

mutate()

  • Add new columns that are functions of existing columns
mutate(hflights_df, 
       gain = ArrDelay - DepDelay,
       speed = Distance / AirTime * 60)

summarise()

  • Collapsed a data frame to a single row
summarise(hflights_df, 
          delay = mean(DepDelay, na.rm = TRUE))
Source: local data frame [1 x 1]

  delay
1 9.445

Let's see the verbs in action

  • Data are from the hflights package, which has all flights from Houston, TX in 2011 (n = 227496)
  • Demo the Code

Grouped Operations

  • The verbs become powerful when combined with the idea of “group by”
  • In dplyr the group_by() function tells R how to break a dataset down into groups of rows
  • Use the new object in exactly the same functions as above
  • Example to find distance by delay for individual planes:

Grouped Operations - Code

planes <- group_by(hflights_df, TailNum)

delay <- summarise(planes,
  count = n(),
  dist = mean(Distance, na.rm = TRUE),
  delay = mean(ArrDelay, na.rm = TRUE))

delay <- filter(delay, count > 20, dist < 2000)

Chaining

  • dplyr API is functional in the sense that function calls don't have side-effects & you must always save the results
  • With many operations at once –> inelegant code
  • Three options
    1. one by one
    2. nested
    3. chained

One by One

a1 <- group_by(hflights, Year, Month, DayofMonth)

a2 <- select(a1, Year:DayofMonth, ArrDelay, DepDelay)

a3 <- summarise(a2,
  arr = mean(ArrDelay, na.rm = TRUE),
  dep = mean(DepDelay, na.rm = TRUE))

a4 <- filter(a3, arr > 30 | dep > 30)

Nested

filter(
  summarise(
    select(
      group_by(hflights, Year, Month, DayofMonth),
      Year:DayofMonth, ArrDelay, DepDelay
    ),
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)

  • Difficult to read because it's inside out
  • Arguments are a long way from the function

Chained

  • To get around these, dplyr offers the %.% operator
hflights %.%
  group_by() %.%
  select() %.%
  summarise() %.%
  filter()
  • See demoCode.R

Other Advantages

  • Aggregate functions (see demoCode.R)
  • window-functions (see demoCode.R)

Hammerin' Hank & The Babe

alt text alt text

Other Advantages

  • Works with data tables
  • Allows processing in remote databases
    • with the transformations done in-database
    • only the result is returned to R

Summary

  • Fast, compact, and elegant
  • Makes it easy to stay with the data
  • Makes it easy to get more out of the data
data %.%
  group_by() %.%
  select() %.%
  summarise() %.%
  mutate() %.% 
  filter() %.%
  arrange()