Benchmark on baseball data:

dplyr (0.1) and data.table (1.8.10)

http://arunsrinivasan.github.io/dplyr_benchmark/

dplyr looks like a fantastic initiative. Congratulations to Hadley and Romain.

The recent CRAN release of dplyr showcased a benchmarking vignette using baseball data. Following that, we noticed under the comments section of the Rstudio blog post:

Comment from a reader: From the introduction vignette: “dplyr also provides data.table methods for all verbs. While data.table is extremely, fast, the current benchmarks suggest that dplyr is 2-3x faster for most single operations, and up to 10x faster for grouped summaries (see the benchmark-baseball vignette for more details).” This is really really really exciting. I can’t wait to try this package out!! Yay for Hadley and collaborators!!!!

Hadley: Ooops, I forgot to remove that statement from the vignette. That was based on some initial benchmarks when I didn’t understand data.table that well. I’d now claim that dplyr and data.table are pretty similar for most operations. Sometimes data.table is faster, sometimes dplyr is faster.

Hadley had asked us in advance what we thought of the baseball vignette and we had several rounds of emails to correct it. But unfortunately due to an oversight the vignette went to CRAN unchanged.

Hadley: It was an oversight, and I’ve filed a bug to remind me to fix it for the next version.

Since the vignette is on CRAN though and the comparison is to data.table, and since our users have been asking us about the comparison made to data.table, we feel it is appropriate to respond. Therefore for the sake of completeness, here’s the benchmark for the relevant cases.

Note that we perform the benchmark with data.table 1.8.10 as on CRAN currently. Many improvements are in 1.8.11 which we will ignore here.

Initial setup

Note 1 An important point is that dplyr’s group_by should be included in the benchmarks as it is an essential precursor for grouping operations in dplyr, whereas setkey is not mandatory for data.table for grouping. If setkey is to be used in the analysis, of course it must be timed as well.

Note 2 A common method to benchmark big-data is to run system.time 3 times and choose the minimum (to report timings without effects of cache). However the Batting data size here is so small (10MB) that it almost fits in cache! Relative speed at such small granularity seems utterly pointless to us. But people may believe it is informative. We’ll therefore use microbenchmark as well on the same 10MB dataset.

So, let’s start with initial setup as follows. Note that we don't pre-calculate group_by operations here.

require(Lahman)
## Loading required package: Lahman
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(data.table)
## Loading required package: data.table
## 
## Attaching package: 'data.table'
## 
## The following object is masked from 'package:dplyr':
## 
##     last
require(microbenchmark)
## Loading required package: microbenchmark

You can also embed plots, for example:

batting_df = tbl_df(Batting)  # for dplyr's data.frame
batting_dt = tbl_dt(Batting)  # for dplyr's data.table
DT = data.table(Batting)  # for native data.table

1. Summarise

Compute the average number of bats for each player:

microbenchmark(dplyr_df = group_by(batting_df, playerID) %.% summarise(ab = mean(AB)), 
    dplyr_dt = group_by(batting_dt, playerID) %.% summarise(ab = mean(AB)), 
    dt_raw = DT[, list(ab = mean(AB)), by = playerID], times = 100)
## Unit: milliseconds
##      expr   min    lq median    uq   max neval
##  dplyr_df 16.69 17.01  17.18 17.64 48.95   100
##  dplyr_dt 22.82 25.05  27.19 30.13 60.85   100
##    dt_raw 18.49 19.04  19.51 20.56 51.30   100

Arun Srinivasan's results:

Unit: milliseconds
     expr      min       lq   median        uq      max neval
 dplyr_df 37.51670 38.61942 39.36357  41.69157 109.1514   100
 dplyr_dt 68.30773 75.87789 79.31867 135.32441 162.7321   100
   dt_raw 38.22706 39.66757 43.07817  45.80393 117.2201   100

It shows a very different picture now. The time to group_by was excluded in the vignette, but summarise can’t run without group_by having been run first.

2. Arrange

Arrange by year within each player:

microbenchmark(dplyr_df = arrange(batting_df, playerID, yearID), dplyr_dt = arrange(batting_dt, 
    playerID, yearID), dt_raw = setkey(copy(DT), playerID, yearID), times = 100)
## Unit: milliseconds
##      expr    min     lq median     uq    max neval
##  dplyr_df  28.84  29.71  30.77  32.49  61.40   100
##  dplyr_dt 447.57 453.32 458.29 480.43 493.83   100
##    dt_raw  10.43  11.61  12.85  14.06  43.71   100

Arun Srinivasan's results:

Unit: milliseconds
     expr      min       lq   median        uq      max neval
 dplyr_df 37.51670 38.61942 39.36357  41.69157 109.1514   100
 dplyr_dt 68.30773 75.87789 79.31867 135.32441 162.7321   100
   dt_raw 38.22706 39.66757 43.07817  45.80393 117.2201   100

Once again, a very different picture. This is because in the benchmarks from dplyr vignette, the ordering by playerID was already done in players_dt. So basically, the ordering here had to be done only on yearID for the dplyr_* cases!

Note that dplyr_dt performs much worse than data.table run directly. Hadley has said that he considers such differences to be bugs in dplyr.

3. Filter

Find the year for which each player played the most games:

microbenchmark(dplyr_df = group_by(batting_df, playerID) %.% filter(G == max(G)), 
    dplyr_dt = group_by(batting_dt, playerID) %.% filter(G == max(G)), dt_raw = DT[DT[, 
        .I[G == max(G)], by = playerID]$V1], times = 100)
## Unit: milliseconds
##      expr   min    lq median    uq   max neval
##  dplyr_df 45.59 46.93  47.56 48.56 80.08   100
##  dplyr_dt 38.19 39.30  39.96 42.29 72.02   100
##    dt_raw 30.94 32.88  33.33 33.64 63.90   100

Arun Srinivasan's results:

Unit: milliseconds
     expr       min        lq    median        uq      max neval
 dplyr_df  84.75893  88.29261  90.86771  96.03549 181.3004   100
 dplyr_dt 105.82635 112.10389 116.55127 124.82832 202.0836   100
   dt_raw  68.33252  73.67968  74.88609  77.91639 165.2599   100

4. Mutate

The benchmark in dplyr vignette is incorrect because it doesn’t add a column (doesn’t use := at all), rather it aggregates. We'll just highlight one case here - rank.

microbenchmark(dplyr_df = group_by(batting_df, playerID) %.% mutate(rank = rank(desc(AB))), 
    dplyr_dt = group_by(batting_dt, playerID) %.% mutate(rank = rank(desc(AB))), 
    dt_raw = DT[, `:=`(rank, rank(desc(AB))), by = playerID], times = 10)
## Unit: milliseconds
##      expr   min    lq median    uq   max neval
##  dplyr_df 602.3 610.2  624.5 632.9 636.4    10
##  dplyr_dt 573.3 581.5  596.5 607.4 608.6    10
##    dt_raw 562.7 566.1  576.5 590.6 594.9    10

Arun Srinivasan's results:

Unit: seconds
     expr      min       lq   median       uq      max neval
 dplyr_df 1.173205 1.183997 1.195302 1.266843 1.518752    10
 dplyr_dt 1.205231 1.255513 1.262523 1.279185 1.328591    10
   dt_raw 1.101572 1.111967 1.120616 1.130896 1.155168    10
##         playerID yearID stint teamID lgID   G G_batting  AB  R   H X2B X3B
##     1: aardsda01   2004     1    SFN   NL  11        11   0  0   0   0   0
##     2: aardsda01   2006     1    CHN   NL  45        43   2  0   0   0   0
##     3: aardsda01   2007     1    CHA   AL  25         2   0  0   0   0   0
##     4: aardsda01   2008     1    BOS   AL  47         5   1  0   0   0   0
##     5: aardsda01   2009     1    SEA   AL  73         3   0  0   0   0   0
##    ---                                                                    
## 96596: zuverge01   1959     1    BAL   AL   6         6   0  0   0   0   0
## 96597: zwilldu01   1910     1    CHA   AL  27        27  87  7  16   5   0
## 96598: zwilldu01   1914     1    CHF   FL 154       154 592 91 185  38   8
## 96599: zwilldu01   1915     1    CHF   FL 150       150 548 65 157  32   7
## 96600: zwilldu01   1916     1    CHN   NL  35        35  53  4   6   1   0
##        HR RBI SB CS BB SO IBB HBP SH SF GIDP G_old
##     1:  0   0  0  0  0  0   0   0  0  0    0    11
##     2:  0   0  0  0  0  0   0   0  1  0    0    45
##     3:  0   0  0  0  0  0   0   0  0  0    0     2
##     4:  0   0  0  0  0  1   0   0  0  0    0     5
##     5:  0   0  0  0  0  0   0   0  0  0    0    NA
##    ---                                            
## 96596:  0   0  0  0  2  0   0   0  0  0    0     6
## 96597:  0   5  1 NA 11 NA  NA   1  1 NA   NA    27
## 96598: 16  95 21 NA 46 68  NA   1 10 NA   NA   154
## 96599: 13  94 24 NA 67 65  NA   2 18 NA   NA   150
## 96600:  1   8  0 NA  4  6  NA   0  2 NA   NA    35

This is a bit hard to benchmark because the philosophy of dplyr and data.table are quite different. data.table is designed for adding columns by reference in place (therefore columns are over allocated during data.table creation). But dplyr doesn’t like doing this. However, when benchmarking one should be benchmarking the equivalent of an operation in each tool, not how one thinks the design should be.

But if you insist, this can be accomplished the dplyr way, using data.table's shallow function which, like dplyr's mutate, shallow copies the data. Here’s the benchmark:

microbenchmark(dplyr_df = group_by(batting_df, playerID) %.% mutate(rank = rank(desc(AB))), 
    dplyr_dt = group_by(batting_dt, playerID) %.% mutate(rank = rank(desc(AB))), 
    dt_raw = data.table:::shallow(DT)[, `:=`(rank, rank(desc(AB))), by = playerID], 
    times = 10)
## Unit: milliseconds
##      expr   min    lq median    uq   max neval
##  dplyr_df 586.5 590.5  603.8 612.7 631.7    10
##  dplyr_dt 556.9 566.8  573.4 593.2 609.7    10
##    dt_raw 537.8 552.9  555.9 569.3 615.1    10

Arun Srinivasan's results:

Unit: seconds
     expr      min       lq   median       uq      max neval
 dplyr_df 1.165308 1.179277 1.214480 1.269659 1.300810    10
 dplyr_dt 1.183548 1.210675 1.257438 1.270460 1.275579    10
   dt_raw 1.106987 1.109709 1.120783 1.137172 1.192628    10

To summarise:

Re-produced RMarkdown/RStudio, having cleared Global Environment generated as html.

sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] microbenchmark_1.3-0 data.table_1.8.10    dplyr_0.1           
## [4] Lahman_2.0-3         knitr_1.5           
## 
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 evaluate_0.5.1 formatR_0.10   Rcpp_0.10.6   
## [5] stringr_0.6.2  tools_3.0.2
rm(list = ls(all.names = TRUE))