BIMS8382: %>% performance demonstration

Data

Let’s use starwars, a dataset that ships with dplyr.

library(dplyr)
starwars

## # A tibble: 87 x 13
##    name     height  mass hair_color skin_color eye_color birth_year gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke Sk…    172  77.0 blond      fair       blue            19.0 male  
##  2 C-3PO       167  75.0 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2        96  32.0 <NA>       white, bl… red             33.0 <NA>  
##  4 Darth V…    202 136   none       white      yellow          41.9 male  
##  5 Leia Or…    150  49.0 brown      light      brown           19.0 female
##  6 Owen La…    178 120   brown, gr… light      blue            52.0 male  
##  7 Beru Wh…    165  75.0 brown      light      blue            47.0 female
##  8 R5-D4        97  32.0 <NA>       white, red red             NA   <NA>  
##  9 Biggs D…    183  84.0 black      light      brown           24.0 male  
## 10 Obi-Wan…    182  77.0 auburn, w… fair       blue-gray       57.0 male  
## # ... with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

Simple grouped summarize

Let’s group by species and calculate the mean mass of each species using pipes.

starwars %>% 
  group_by(species) %>% 
  summarize(mean(mass))

## # A tibble: 38 x 2
##    species   `mean(mass)`
##    <chr>            <dbl>
##  1 Aleena            15.0
##  2 Besalisk         102  
##  3 Cerean            82.0
##  4 Chagrian          NA  
##  5 Clawdite          55.0
##  6 Droid             NA  
##  7 Dug               40.0
##  8 Ewok              20.0
##  9 Geonosian         80.0
## 10 Gungan            NA  
## # ... with 28 more rows
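Some species (Chagrian, Droid, Gungan) come back as NA because mean() returns NA whenever any value in the group is missing. Here is a minimal sketch of one way to handle that, assuming you want to ignore missing masses; the names mass_by_species and mean_mass are only illustrative.

```r
library(dplyr)

# Same grouped summary, but drop missing masses before averaging
# and give the result column a readable name (mean_mass is an illustrative name).
mass_by_species <- starwars %>%
  group_by(species) %>%
  summarize(mean_mass = mean(mass, na.rm = TRUE))

mass_by_species
```

Species with no recorded mass at all will show NaN rather than NA, since there is nothing left to average.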

Do it without pipes, by nesting the function calls.

summarize(group_by(starwars, species), mean(mass))

## # A tibble: 38 x 2
##    species   `mean(mass)`
##    <chr>            <dbl>
##  1 Aleena            15.0
##  2 Besalisk         102  
##  3 Cerean            82.0
##  4 Chagrian          NA  
##  5 Clawdite          55.0
##  6 Droid             NA  
##  7 Dug               40.0
##  8 Ewok              20.0
##  9 Geonosian         80.0
## 10 Gungan            NA  
## # ... with 28 more rows

Do it with a temporary variable.

tmp <- group_by(starwars, species)
summarize(tmp, mean(mass))

## # A tibble: 38 x 2
##    species   `mean(mass)`
##    <chr>            <dbl>
##  1 Aleena            15.0
##  2 Besalisk         102  
##  3 Cerean            82.0
##  4 Chagrian          NA  
##  5 Clawdite          55.0
##  6 Droid             NA  
##  7 Dug               40.0
##  8 Ewok              20.0
##  9 Geonosian         80.0
## 10 Gungan            NA  
## # ... with 28 more rows
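The three printouts above look the same, but you don’t have to compare them by eye. A small sketch that checks it directly (the object names piped, nested and stepwise are just for illustration):

```r
library(dplyr)

# Compute the same grouped summary three ways
piped    <- starwars %>% group_by(species) %>% summarize(mean(mass))
nested   <- summarize(group_by(starwars, species), mean(mass))
tmp      <- group_by(starwars, species)
stepwise <- summarize(tmp, mean(mass))

# all.equal() returns TRUE when two results match
same_nested   <- isTRUE(all.equal(piped, nested))
same_stepwise <- isTRUE(all.equal(piped, stepwise))
c(nested = same_nested, stepwise = same_stepwise)
```

The pipe only changes how the code is written: x %>% f(y) is evaluated as f(x, y).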

Benchmark

Let’s use the microbenchmark package to time each approach. Don’t worry about how this code works: it runs each named expression 100 times (the default) and reports the distribution of run times.

microbenchmark::microbenchmark(
  pipe = starwars %>% group_by(species) %>% summarize(mean(mass)), 
  nopipe = summarize(group_by(starwars, species), mean(mass)), 
  temp_variables = {tmp <- group_by(starwars); summarize(starwars, mean(mass))}
)

## Unit: milliseconds
##            expr      min       lq     mean   median       uq       max
##            pipe 2.020793 2.398608 2.637022 2.494255 2.685257  4.727662
##          nopipe 1.863022 2.230478 2.466623 2.330475 2.533246  3.939843
##  temp_variables 1.266467 1.458763 2.057072 1.517044 1.649115 42.447261
##  neval cld
##    100   a
##    100   a
##    100   a

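Using %>% adds a tiny performance overhead, but not much: the median run time is about 2.49 ms with the pipe versus 2.33 ms for the nested call, a difference of roughly 0.16 ms. The cld column gives all three expressions the same letter, which indicates that the differences between them are not statistically significant in this run. Note that the temp_variables timing is not a fair comparison: as written, that expression calls group_by(starwars) without species and then summarizes the ungrouped starwars, so it does less work than the other two. Its lower median says nothing about temporary variables being faster, and even if it did, the millisecond you save wouldn’t be worth the cluttered workspace or cognitive overhead.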
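For a like-for-like comparison, the temp_variables expression needs to group by species and summarize tmp, just as in the temporary-variable example earlier. A sketch of that version follows. The name bench is only illustrative, and times = 100 just spells out the default. Timings will vary by machine, so no output is shown.

```r
library(dplyr)
library(microbenchmark)

bench <- microbenchmark(
  pipe = starwars %>% group_by(species) %>% summarize(mean(mass)),
  nopipe = summarize(group_by(starwars, species), mean(mass)),
  temp_variables = {
    # Group by species and summarize the grouped object,
    # so all three expressions do the same work
    tmp <- group_by(starwars, species)
    summarize(tmp, mean(mass))
  },
  times = 100
)

bench
```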

Profiling

Also check out code profiling with the profvis package, which is built into RStudio. You can highlight particular lines of code, and in the RStudio menu click Profile > Profile Selected Line(s).

https://support.rstudio.com/hc/en-us/articles/218221837-Profiling-with-RStudio
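You can also call profvis directly from code instead of using the menu. A minimal sketch: the loop and the count of 200 are arbitrary choices, there only because a single grouped summarize finishes too quickly for the sampling profiler to capture much.

```r
library(dplyr)
library(profvis)

# Wrap the code to profile in profvis({ ... }).
# The summary is repeated so the profiler has enough samples to work with.
prof <- profvis({
  for (i in 1:200) {
    starwars %>% group_by(species) %>% summarize(mean(mass))
  }
})

# Printing the result opens the interactive profile view
prof
```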


Stephen Turner

2/19/2018
