dplyr (0.1) and data.table (1.8.10)
http://arunsrinivasan.github.io/dplyr_benchmark/
dplyr looks like a fantastic initiative. Congratulations to Hadley and Romain.
The recent CRAN release of dplyr showcased a benchmarking vignette using baseball data. Following that, we noticed under the comments section of the Rstudio blog post:
Comment from a reader: From the introduction vignette: dplyr also provides data.table methods for all verbs. While data.table is extremely, fast, the current benchmarks suggest that dplyr is 2-3x faster for most single operations, and up to 10x faster for grouped summaries (see the benchmark-baseball vignette for more details). This is really really really exciting. I cant wait to try this package out!! Yay for Hadley and collaborators!!!!
Hadley: Ooops, I forgot to remove that statement from the vignette. That was based on some initial benchmarks when I didnt understand data.table that well. Id now claim that dplyr and data.table are pretty similar for most operations. Sometimes data.table is faster, sometimes dplyr is faster.
Hadley had asked us in advance what we thought of the baseball vignette and we had several rounds of emails to correct it. But unfortunately due to an oversight the vignette went to CRAN unchanged.
Hadley: It was an oversight, and Ive filed a bug to remind me to fix it for the next version.
Since the vignette is on CRAN though and the comparison is to data.table, and since our users have been asking us about the comparison made to data.table, we feel it is appropriate to respond. Therefore for the sake of completeness, heres the benchmark for the relevant cases.
Note that we perform the benchmark with data.table 1.8.10 as on CRAN currently. Many improvements are in 1.8.11 which we will ignore here.
Initial setup
Note 1 An important point is that dplyrs group_by should be included in the benchmarks as it is an essential precursor for grouping operations in dplyr, whereas setkey is not mandatory for data.table for grouping. If setkey is to be used in the analysis, of course it must be timed as well.
Note 2 A common method to benchmark big-data is to run system.time 3 times and choose the minimum (to report timings without effects of cache). However the Batting data size here is so small (10MB) that it almost fits in cache! Relative speed at such small granularity seems utterly pointless to us. But people may believe it is informative. Well therefore use microbenchmark as well on the same 10MB dataset.
So, lets start with initial setup as follows. Note that we don't pre-calculate group_by operations here.
require(Lahman)
## Loading required package: Lahman
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(data.table)
## Loading required package: data.table
##
## Attaching package: 'data.table'
##
## The following object is masked from 'package:dplyr':
##
## last
require(microbenchmark)
## Loading required package: microbenchmark
You can also embed plots, for example:
batting_df = tbl_df(Batting) # for dplyr's data.frame
batting_dt = tbl_dt(Batting) # for dplyr's data.table
DT = data.table(Batting) # for native data.table
Compute the average number of bats for each player:
microbenchmark(dplyr_df = group_by(batting_df, playerID) %.% summarise(ab = mean(AB)),
dplyr_dt = group_by(batting_dt, playerID) %.% summarise(ab = mean(AB)),
dt_raw = DT[, list(ab = mean(AB)), by = playerID], times = 100)
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 16.69 17.01 17.18 17.64 48.95 100
## dplyr_dt 22.82 25.05 27.19 30.13 60.85 100
## dt_raw 18.49 19.04 19.51 20.56 51.30 100
My first results:
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 16.60107 17.17294 17.40236 18.38174 50.32965 100
## dplyr_dt 22.90540 23.81762 25.02265 25.62184 56.22527 100
## dt_raw 18.47471 19.28566 19.89119 21.02603 22.81938 100
My second results, when knitting to HTML:
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 16.45 16.82 16.95 17.62 48.22 100
## dplyr_dt 22.72 24.67 26.92 29.64 59.07 100
## dt_raw 18.55 19.04 19.58 20.64 52.15 100
Arun Srinivasan's results:
Unit: milliseconds
expr min lq median uq max neval
dplyr_df 37.51670 38.61942 39.36357 41.69157 109.1514 100
dplyr_dt 68.30773 75.87789 79.31867 135.32441 162.7321 100
dt_raw 38.22706 39.66757 43.07817 45.80393 117.2201 100
It shows a very different picture now. The time to group_by
was excluded in the vignette, but summarise
cant run without group_by having been run first.
Arrange by year within each player:
microbenchmark(dplyr_df = arrange(batting_df, playerID, yearID), dplyr_dt = arrange(batting_dt,
playerID, yearID), dt_raw = setkey(copy(DT), playerID, yearID), times = 100)
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 28.84 29.71 30.77 32.49 61.40 100
## dplyr_dt 447.57 453.32 458.29 480.43 493.83 100
## dt_raw 10.43 11.61 12.85 14.06 43.71 100
My first results, note large difference from AS/MD results for dplyr_dt
(versions? testing issue?):
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 28.71620 29.29577 29.50587 30.01918 32.42515 100
## dplyr_dt 439.69194 444.83275 446.75892 449.50418 515.68236 100
## dt_raw 10.65927 11.05093 11.23672 11.69962 14.01624 100
My second results, when knitting to HTML, same discrepancy for dplyr_dt
:
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 28.92 29.44 30.49 31.86 62.42 100
## dplyr_dt 450.62 454.88 457.13 481.79 489.11 100
## dt_raw 10.46 11.07 12.72 13.82 43.88 100
Arun Srinivasan's results:
Unit: milliseconds
expr min lq median uq max neval
dplyr_df 37.51670 38.61942 39.36357 41.69157 109.1514 100
dplyr_dt 68.30773 75.87789 79.31867 135.32441 162.7321 100
dt_raw 38.22706 39.66757 43.07817 45.80393 117.2201 100
Once again, a very different picture. This is because in the benchmarks from dplyr
vignette, the ordering by playerID was already done in players_dt
. So basically, the ordering here had to be done only on yearID for the dplyr_*
cases!
Note that dplyr_dt
performs much worse than data.table
run directly. Hadley has said that he considers such differences to be bugs in dplyr
.
Find the year for which each player played the most games:
microbenchmark(dplyr_df = group_by(batting_df, playerID) %.% filter(G == max(G)),
dplyr_dt = group_by(batting_dt, playerID) %.% filter(G == max(G)), dt_raw = DT[DT[,
.I[G == max(G)], by = playerID]$V1], times = 100)
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 45.59 46.93 47.56 48.56 80.08 100
## dplyr_dt 38.19 39.30 39.96 42.29 72.02 100
## dt_raw 30.94 32.88 33.33 33.64 63.90 100
My first results:
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 45.99887 47.70634 47.94616 48.27006 82.56583 100
## dplyr_dt 38.31443 39.67914 39.99865 40.37115 75.17269 100
## dt_raw 31.71185 33.45494 33.88872 34.20521 68.11732 100
My second results, when knitting to HTML:
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 45.59 46.65 47.11 48.29 77.79 100
## dplyr_dt 37.79 39.36 39.85 42.52 74.06 100
## dt_raw 31.47 32.78 33.27 33.63 64.30 100
Arun Srinivasan's results:
Unit: milliseconds
expr min lq median uq max neval
dplyr_df 84.75893 88.29261 90.86771 96.03549 181.3004 100
dplyr_dt 105.82635 112.10389 116.55127 124.82832 202.0836 100
dt_raw 68.33252 73.67968 74.88609 77.91639 165.2599 100
The benchmark in dplyr
vignette is incorrect because it doesnt add a column (doesnt use :=
at all), rather it aggregates. We'll just highlight one case here - rank.
microbenchmark(dplyr_df = group_by(batting_df, playerID) %.% mutate(rank = rank(desc(AB))),
dplyr_dt = group_by(batting_dt, playerID) %.% mutate(rank = rank(desc(AB))),
dt_raw = DT[, `:=`(rank, rank(desc(AB))), by = playerID], times = 10)
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 602.3 610.2 624.5 632.9 636.4 10
## dplyr_dt 573.3 581.5 596.5 607.4 608.6 10
## dt_raw 562.7 566.1 576.5 590.6 594.9 10
My first results, large discrepancy from AS test results for all three:
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 609.6373 612.1543 617.4226 629.3222 648.5280 10
## dplyr_dt 571.3057 583.2877 587.1879 597.6218 615.6105 10
## dt_raw 565.3005 568.8247 574.0326 601.5043 610.5097 10
My second results, when knitting to HTML, the same discrepancy for all:
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 598.4 600.7 605.0 611.9 663.2 10
## dplyr_dt 572.7 579.9 600.4 606.7 619.4 10
## dt_raw 550.0 557.0 563.4 567.6 599.0 10
Arun Srinivasan's results:
Unit: seconds
expr min lq median uq max neval
dplyr_df 1.173205 1.183997 1.195302 1.266843 1.518752 10
dplyr_dt 1.205231 1.255513 1.262523 1.279185 1.328591 10
dt_raw 1.101572 1.111967 1.120616 1.130896 1.155168 10
## playerID yearID stint teamID lgID G G_batting AB R H X2B X3B
## 1: aardsda01 2004 1 SFN NL 11 11 0 0 0 0 0
## 2: aardsda01 2006 1 CHN NL 45 43 2 0 0 0 0
## 3: aardsda01 2007 1 CHA AL 25 2 0 0 0 0 0
## 4: aardsda01 2008 1 BOS AL 47 5 1 0 0 0 0
## 5: aardsda01 2009 1 SEA AL 73 3 0 0 0 0 0
## ---
## 96596: zuverge01 1959 1 BAL AL 6 6 0 0 0 0 0
## 96597: zwilldu01 1910 1 CHA AL 27 27 87 7 16 5 0
## 96598: zwilldu01 1914 1 CHF FL 154 154 592 91 185 38 8
## 96599: zwilldu01 1915 1 CHF FL 150 150 548 65 157 32 7
## 96600: zwilldu01 1916 1 CHN NL 35 35 53 4 6 1 0
## HR RBI SB CS BB SO IBB HBP SH SF GIDP G_old
## 1: 0 0 0 0 0 0 0 0 0 0 0 11
## 2: 0 0 0 0 0 0 0 0 1 0 0 45
## 3: 0 0 0 0 0 0 0 0 0 0 0 2
## 4: 0 0 0 0 0 1 0 0 0 0 0 5
## 5: 0 0 0 0 0 0 0 0 0 0 0 NA
## ---
## 96596: 0 0 0 0 2 0 0 0 0 0 0 6
## 96597: 0 5 1 NA 11 NA NA 1 1 NA NA 27
## 96598: 16 95 21 NA 46 68 NA 1 10 NA NA 154
## 96599: 13 94 24 NA 67 65 NA 2 18 NA NA 150
## 96600: 1 8 0 NA 4 6 NA 0 2 NA NA 35
This is a bit hard to benchmark because the philosophy of dplyr
and data.table
are quite different. data.table
is designed for adding columns by reference in place (therefore columns are over allocated during data.table
creation). But dplyr
doesnt like doing this. However, when benchmarking one should be benchmarking the equivalent of an operation in each tool, not how one thinks the design should be.
But if you insist, this can be accomplished the dplyr
way, using data.table's shallow
function which, like dplyr's mutate
, shallow
copies the data. Heres the benchmark:
microbenchmark(dplyr_df = group_by(batting_df, playerID) %.% mutate(rank = rank(desc(AB))),
dplyr_dt = group_by(batting_dt, playerID) %.% mutate(rank = rank(desc(AB))),
dt_raw = data.table:::shallow(DT)[, `:=`(rank, rank(desc(AB))), by = playerID],
times = 10)
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 586.5 590.5 603.8 612.7 631.7 10
## dplyr_dt 556.9 566.8 573.4 593.2 609.7 10
## dt_raw 537.8 552.9 555.9 569.3 615.1 10
My first results, again a large discrepancy from AS test results for all three:
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 610.5565 614.7448 621.0452 624.1577 657.4528 10
## dplyr_dt 581.7522 589.4291 598.6482 615.4261 624.2944 10
## dt_raw 553.4368 569.5537 577.8976 584.6868 601.5122 10
My second results, when knitting to HTML, and same pattern repeated:
## Unit: milliseconds
## expr min lq median uq max neval
## dplyr_df 594.4 599.4 604.5 612.7 620.8 10
## dplyr_dt 549.4 589.3 596.8 601.0 626.9 10
## dt_raw 549.6 552.2 561.5 582.6 613.6 10
Arun Srinivasan's results:
Unit: seconds
expr min lq median uq max neval
dplyr_df 1.165308 1.179277 1.214480 1.269659 1.300810 10
dplyr_dt 1.183548 1.210675 1.257438 1.270460 1.275579 10
dt_raw 1.106987 1.109709 1.120783 1.137172 1.192628 10
The micro benchmark results are not that different between dplyr
and data.table
here. The 2-3x faster and 10x faster figures were an oversight.
We'll propose a range of benchmarks on large data in a future note once the next version of data.table has been released to CRAN.
Re-produced RMarkdown/RStudio, having cleared Global Environment generated as html.
sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] microbenchmark_1.3-0 data.table_1.8.10 dplyr_0.1
## [4] Lahman_2.0-3 knitr_1.5
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 evaluate_0.5.1 formatR_0.10 Rcpp_0.10.6
## [5] stringr_0.6.2 tools_3.0.2
rm(list = ls(all.names = TRUE))