Benchmarking apply() vs dplyr

For this analysis I will run a basic benchmark of two commonly used tools: the base apply() function and the summarise() function from the dplyr package. Both are used to compute the mean departure delay in the nycflights13 data set.

# Loading the necessary packages.
library(nycflights13)
library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Packages for doing the benchmarking analysis.
library(microbenchmark)
library(ggplot2)

For this benchmark we will calculate the average departure delay using the two approaches.

apply(flights[5], 2, mean, na.rm = TRUE)

## dep_delay 
##  12.63907

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

## Source: local data frame [1 x 1]
## 
##      delay
## 1 12.63907
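
To make clear what the apply() call is doing, here is a minimal sketch on a made-up data frame (the name `toy` and its values are assumptions for illustration, not part of nycflights13). apply() coerces a data frame to a matrix with as.matrix() before looping over its columns, and selecting the column by name gives the same result as selecting it by position while not depending on the column order.

```r
# Hypothetical toy data; the second column plays the role of dep_delay.
toy <- data.frame(
  carrier_id = c(1, 2, 3, 4),
  dep_delay  = c(5, NA, -2, 30)
)

# Select the column by position, as in the call above.
by_position <- apply(toy[2], 2, mean, na.rm = TRUE)

# Select the same column by name, which survives column reordering.
by_name <- apply(toy["dep_delay"], 2, mean, na.rm = TRUE)

# The plain vector mean, for comparison.
direct <- mean(toy$dep_delay, na.rm = TRUE)

print(by_position)
print(by_name)
print(direct)
```

All three values should agree; the apply() versions return a named vector because they operate on a one-column matrix.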

Now to benchmark these two functions.

results <- microbenchmark(
    apply(flights[5], 2, mean, na.rm = TRUE),
    summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
    )

Finally we will take a look at the results.

print(results)

## Unit: milliseconds
##                                                       expr      min
##                   apply(flights[5], 2, mean, na.rm = TRUE) 32.16812
##  summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) 14.71828
##        lq     mean   median       uq      max neval
##  32.75570 38.42403 37.79327 39.54100 90.31104   100
##  14.78279 15.19423 14.83704 14.89202 22.74656   100

qplot(y=time, data=results, colour=expr) + scale_y_log10()

From the above results we see an interesting feature. The dplyr summarise() call is roughly 2.5 times as fast as the built-in apply() function (median of about 14.8 ms versus 37.8 ms) and at the same time shows much less variation than apply(). This is a very nice fact, since dplyr is also much easier to use than apply(), especially for more complicated tasks.
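
The speed ratio can also be computed from the benchmark object instead of being read off the printed table. The sketch below is only an illustration: it uses a synthetic data frame (`fake_flights`, an assumed stand-in for the real data) and fewer repetitions, so its timings will differ from the ones reported above.

```r
library(dplyr)
library(microbenchmark)

# Synthetic stand-in for the dep_delay column (assumption, not the real data).
set.seed(1)
fake_flights <- data.frame(dep_delay = c(rnorm(1e5, mean = 12, sd = 40), NA))

# Both approaches should return the same mean.
base_mean  <- apply(fake_flights["dep_delay"], 2, mean, na.rm = TRUE)
dplyr_mean <- summarise(fake_flights, delay = mean(dep_delay, na.rm = TRUE))$delay

# Naming the expressions gives short labels in the summary table.
res <- microbenchmark(
  apply_version     = apply(fake_flights["dep_delay"], 2, mean, na.rm = TRUE),
  summarise_version = summarise(fake_flights, delay = mean(dep_delay, na.rm = TRUE)),
  times = 20L
)

# summary() returns a data frame with one row per expression.
s <- summary(res)
ratio <- s$median[s$expr == "apply_version"] /
  s$median[s$expr == "summarise_version"]

# How many times slower the median apply() call was in this run.
print(ratio)
```

Comparing medians rather than means keeps the occasional slow run, such as the 90 ms maximum seen for apply(), from distorting the ratio.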

Erik Nylander

Wednesday, November 26, 2014