Find an R package that covers the same functionality as some base R code. Write some code to benchmark the two, probably with loaded or generated data of a reasonable size, and compare their speed. Publish your code and benchmark results at rpubs.com.
library(dplyr)
library(microbenchmark)
library(nycflights13)
Here, we compare the performance of base R and dplyr on the first step of generating a new field in the flights data frame: distance_index, which compares the actual distance of each flight to the average distance for all flights. That first step is computing the average distance.
flights.base <- flights
flights.dplyr <- flights
benchmark <- microbenchmark(
  apply(flights.base[14], 2, mean, na.rm = TRUE),
  summarize(flights, dist_index = mean(distance, na.rm = TRUE))
)
Now, we can view the results:
benchmark
## Unit: milliseconds
##                                                           expr       min        lq      mean    median        uq      max neval cld
##                 apply(flights.base[14], 2, mean, na.rm = TRUE) 16.995501 19.897992 22.395884 21.579603 23.335008 62.78303   100   b
##  summarize(flights, dist_index = mean(distance, na.rm = TRUE))  4.767606  5.218185  5.734025  5.595525  5.984423 12.71280   100  a
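This shows that summarize is faster than apply on this task, with a median of about 5.6 ms against about 21.6 ms. However, apply is not the most direct way to average a single column in base R, so we should also test a simpler function.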
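Part of the gap may come from how apply works: when given a data frame, it converts it to a matrix before calling the function on each column. A minimal sketch of the two base R routes, run on generated data (the df data frame is a hypothetical stand-in, not the flights data), suggests they return the same value and differ only in how they get there:

```r
# Generated stand-in for the distance column (hypothetical data, not nycflights13)
set.seed(42)
df <- data.frame(distance = runif(1e5, min = 100, max = 3000))

# apply() coerces the one-column data frame to a matrix,
# then calls mean() on each column
via_apply <- apply(df["distance"], 2, mean, na.rm = TRUE)

# mean() on the extracted column works on the numeric vector directly
via_mean <- mean(df$distance, na.rm = TRUE)

# apply() returns a named vector, so drop the name before comparing
all.equal(unname(via_apply), via_mean)
```

The next benchmark times the direct call to mean against summarize.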
flights.base <- flights
flights.dplyr <- flights
benchmark2 <- microbenchmark(
  mean(flights.base$distance, na.rm = TRUE),
  summarize(flights, dist_index = mean(distance, na.rm = TRUE))
)
benchmark2
## Unit: milliseconds
##                                                           expr      min        lq      mean    median        uq      max neval cld
##                      mean(flights.base$distance, na.rm = TRUE) 9.623818 10.995952 13.251323 12.512284 14.005903 64.89978   100   b
##  summarize(flights, dist_index = mean(distance, na.rm = TRUE)) 4.781975  5.288839  5.814705  5.744264  6.131248 10.15069   100  a
While the base package's mean function is faster than apply (a median of about 12.5 ms against about 21.6 ms), summarize is still the fastest in this benchmark, with a median of about 5.7 ms.
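Both benchmarks time only the averaging step. A sketch of the full distance_index column could look like the following. It assumes the index is each flight's distance divided by the overall mean distance, since the exact formula is not given above. The helper names add_index_base and add_index_dplyr and the generated df data frame are hypothetical, used here only for illustration:

```r
library(dplyr)
library(microbenchmark)

# Generated stand-in data (hypothetical; the report itself uses nycflights13::flights)
set.seed(1)
df <- data.frame(distance = runif(1e5, min = 100, max = 3000))

# Assumed definition: distance divided by the overall mean distance
add_index_base <- function(d) {
  d$distance_index <- d$distance / mean(d$distance, na.rm = TRUE)
  d
}

add_index_dplyr <- function(d) {
  mutate(d, distance_index = distance / mean(distance, na.rm = TRUE))
}

# Confirm both versions build the same column before timing them
stopifnot(isTRUE(all.equal(add_index_base(df), add_index_dplyr(df))))

# Time the full column creation; unit = "relative" scales each timing
# to the expression with the smallest median
index_bm <- microbenchmark(
  base  = add_index_base(df),
  dplyr = add_index_dplyr(df),
  times = 20
)
summary(index_bm, unit = "relative")[, c("expr", "median")]
```

Checking that both expressions return the same result before timing them guards against comparing code that does different work. Timings on generated data may not match those on the flights data.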