Assignment 13 - Performance Comparison in R

Find an R package that covers the same functionality as some base R code. Write some code to benchmark (probably with loaded or generated data of a reasonable size) to compare speed in the two environments. Publish your code and benchmark results at rpubs.com.

library(dplyr)
library(microbenchmark)
library(nycflights13)

Here, we will compare the performance of the first step in generating a new field in the flights data.frame, distance_index, which will compare the actual distance of each flight to the average distance for all flights. The benchmarks time the computation of that average in base R and in dplyr.

flights.base <- flights
flights.dplyr <- flights

benchmark <- microbenchmark(
    apply(flights.base[14], 2, mean, na.rm = TRUE),
    summarize(flights, dist_index = mean(distance, na.rm = TRUE))
  )

Now, we can view the results:

benchmark

## Unit: milliseconds
##                                                           expr       min
##                 apply(flights.base[14], 2, mean, na.rm = TRUE) 16.995501
##  summarize(flights, dist_index = mean(distance, na.rm = TRUE))  4.767606
##         lq      mean    median        uq      max neval cld
##  19.897992 22.395884 21.579603 23.335008 62.78303   100   b
##   5.218185  5.734025  5.595525  5.984423 12.71280   100  a

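This shows that the summarize function is faster than the apply function: its median time is about 5.6 ms against about 21.6 ms, roughly a quarter. However, apply is a roundabout way to average a single column, so we should also test a simpler base function.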
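
Before timing it, a quick check that the simpler call computes the same thing. This is a minimal sketch on generated data, not the flights table: the data frame `df` and its values are made up for illustration. It also points at one plausible contributor to the gap, namely that apply converts a data frame to a matrix before looping over its columns.

```r
# Hypothetical stand-in for the flights table: one numeric column with a missing value.
set.seed(1)
df <- data.frame(distance = c(runif(1000, min = 100, max = 3000), NA))

# apply() coerces the one-column data frame to a matrix, then calls mean() per column.
via_apply <- apply(df["distance"], 2, mean, na.rm = TRUE)

# mean() on the column vector skips the matrix conversion.
via_mean <- mean(df$distance, na.rm = TRUE)

# apply() returns a named vector, so drop the name before comparing the two values.
all.equal(unname(via_apply), via_mean)
```

Selecting the column by name rather than by position (as in `flights.base[14]`) is a choice made for this sketch; it avoids depending on the column order of the table.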

flights.base <- flights
flights.dplyr <- flights

benchmark2 <- microbenchmark(
    mean(flights.base$distance, na.rm = TRUE),
    summarize(flights, dist_index = mean(distance, na.rm = TRUE))
  )

benchmark2

## Unit: milliseconds
##                                                           expr      min
##                      mean(flights.base$distance, na.rm = TRUE) 9.623818
##  summarize(flights, dist_index = mean(distance, na.rm = TRUE)) 4.781975
##         lq      mean    median        uq      max neval cld
##  10.995952 13.251323 12.512284 14.005903 64.89978   100   b
##   5.288839  5.814705  5.744264  6.131248 10.15069   100  a

While using the base package’s mean function is faster than apply (median about 12.5 ms against about 21.6 ms), the summarize function is still the fastest of the three, at a median of about 5.7 ms.
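
The benchmarks above time only the averaging step. To round out the distance_index field described at the start, here is a minimal sketch on a small made-up data frame. The source does not give a formula, so defining the index as each flight's distance divided by the overall mean is an assumption, as are the names `toy`, `avg`, `base_result`, and `dplyr_result`.

```r
# Made-up data for illustration; not the flights table.
toy <- data.frame(flight = c("A", "B", "C"), distance = c(500, 1000, 1500))

# Base R: compute the overall mean once, then divide (assumed definition of the index).
avg <- mean(toy$distance, na.rm = TRUE)
base_result <- transform(toy, distance_index = distance / avg)

# dplyr equivalent, run only if the package is installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_result <- dplyr::mutate(
    toy,
    distance_index = distance / mean(distance, na.rm = TRUE)
  )
  # Both approaches should produce the same column.
  stopifnot(isTRUE(all.equal(base_result$distance_index,
                             dplyr_result$distance_index)))
}

base_result$distance_index
# [1] 0.5 1.0 1.5
```

With the flights table, the same two approaches could be applied to flights.base and flights.dplyr and wrapped in microbenchmark() to time the whole step rather than just the mean.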

Matt Moramarco

December 5, 2014