It is a striking question: which is faster, Spark or data.table? While studying the performance results, I could not help remembering the story of the hare and the tortoise.
For this exposition, I use the sparklyr interface to run the Spark jobs, and data.table 1.10.4.
require(sparklyr)
## Loading required package: sparklyr
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(nycflights13)
## Loading required package: nycflights13
require(data.table)
## Loading required package: data.table
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
The flights data is loaded from the nycflights13 package into both Spark and a data.table.
sc<-spark_connect(master = 'local')
flights_tbl<-copy_to(sc,flights,'flights')
DT<-data.table(flights)
Basic profiling: a filter on departure delay.
system.time(flights_tbl %>% filter(dep_delay==10))
## user system elapsed
## 0.00 0.00 0.08
system.time(DT[dep_delay==2])
## user system elapsed
## 0.06 0.00 0.06
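Going by user time (0.00 vs 0.06 seconds), Spark looks hare-like and far faster than data.table, although the elapsed times are nearly the same (0.08 vs 0.06 seconds) and the two calls filter on different values (10 and 2). The Spark figure is also flattering because sparklyr evaluates dplyr verbs lazily: without collect(), the call only builds the query and never runs it. But consider
a slightly more complex snippet: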
system.time(DT[,.(distance=mean(distance),delay=mean(arr_delay),count=.N),by=tailnum][count>20 & distance<2000 & !is.na(delay)])
## user system elapsed
## 0.03 0.00 0.06
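Timings of a few hundredths of a second from a single system.time() call are noisy, so repeating the run gives a steadier figure. The following is a minimal sketch using base R only; the helper name `time_dt` and the choice of 20 repetitions are assumptions made for illustration, and no timings are shown because they depend on the machine.

```r
library(data.table)
library(nycflights13)

DT <- data.table(flights)

# Hypothetical helper: run the aggregation once and return elapsed seconds.
time_dt <- function() {
  system.time(
    DT[, .(distance = mean(distance), delay = mean(arr_delay), count = .N),
       by = tailnum][count > 20 & distance < 2000 & !is.na(delay)]
  )[["elapsed"]]
}

# Repeat the measurement; the median is less sensitive to one-off pauses
# (garbage collection, first-call overhead) than a single run.
elapsed <- replicate(20, time_dt())
summary(elapsed)
median(elapsed)
```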
system.time(delay<-flights_tbl %>% group_by(tailnum) %>% summarise(count=n(),dist=mean(distance),delay=mean(arr_delay)) %>% filter(count>20, dist<2000, !is.na(delay)) %>% collect())
## user system elapsed
## 0.02 0.00 1.70
This time Spark again uses less user time (0.02 vs 0.03 seconds), but it makes the user wait much longer than data.table for the results to be collected: 1.70 seconds elapsed against 0.06. The fastest hare, Spark, sleeps for a while.
Moral: slow and steady wins the race. Spark?
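To see where the waiting happens, the pipeline can be split into two timed steps: building the query, which sparklyr does lazily without running a Spark job, and collect(), which executes the query and pulls the rows back into R. This is only a sketch: it assumes a local Spark installation that sparklyr can find, the names `delay_query`, `build_time` and `collect_time` are made up for illustration, and `na.rm = TRUE` is added to match SQL semantics, where averages skip missing values.

```r
library(sparklyr)
library(dplyr)
library(nycflights13)

# Assumes a local Spark installation that sparklyr can find
# (for example one set up earlier with spark_install()).
sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, flights, "flights", overwrite = TRUE)

# Step 1: build the query. This is lazy: no Spark job runs here.
build_time <- system.time(
  delay_query <- flights_tbl %>%
    group_by(tailnum) %>%
    summarise(count = n(),
              dist = mean(distance, na.rm = TRUE),
              delay = mean(arr_delay, na.rm = TRUE)) %>%
    filter(count > 20, dist < 2000, !is.na(delay))
)

# Inspect the SQL that will be sent to Spark.
show_query(delay_query)

# Step 2: execute the query in Spark and bring the result into R.
collect_time <- system.time(delay <- collect(delay_query))

build_time[["elapsed"]]
collect_time[["elapsed"]]

spark_disconnect(sc)
```

If most of the elapsed time shows up in the second step, the wait is the cost of running the job and moving the result, not of describing the query.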