An intriguing question: is Spark or data.table faster? While studying the performance results, I couldn’t help remembering the story of the hare and the tortoise.

For this comparison, I run the Spark jobs through the sparklyr interface and use data.table 1.10.4.

require(sparklyr)
## Loading required package: sparklyr
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(nycflights13)
## Loading required package: nycflights13
require(data.table)
## Loading required package: data.table
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

The flights data from the nycflights13 package is loaded into both Spark and a data.table.

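# Connect to a local Spark instance and copy the flights data into it
sc <- spark_connect(master = 'local')
flights_tbl <- copy_to(sc, flights, 'flights')
# Keep the same data as an in-memory data.table
DT <- data.table(flights)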

A basic profiling run: filtering by departure delay.

system.time(flights_tbl %>% filter(dep_delay==10))
##    user  system elapsed 
##    0.00    0.00    0.08
system.time(DT[dep_delay==2])
##    user  system elapsed 
##    0.06    0.00    0.06

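At first glance Spark looks hare-like: the call uses no measurable user time (0.00 s against 0.06 s for data.table), although the elapsed times are close (0.08 s against 0.06 s). Two caveats apply. The two calls filter on different values (10 and 2), and dplyr verbs on a Spark table are lazy, so without collect() this timing covers building the query rather than running it.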
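
Because dplyr verbs on a Spark table are evaluated lazily, a fairer version of the filter timing would use the same value on both sides and force Spark to run the query with collect(). The following is a minimal sketch under that assumption. It assumes a local Spark installation is available (for example through spark_install()) and reuses the object names from above; lazy_query, spark_res and dt_res are names introduced here for illustration. No timings are shown because they depend on the machine.

```r
library(sparklyr)
library(dplyr)
library(nycflights13)
library(data.table)

# Same setup as above: one copy of the data in Spark, one in a data.table
sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, flights, "flights", overwrite = TRUE)
DT <- data.table(flights)

# Without collect(), this only builds a lazy query; no Spark job runs yet
lazy_query <- flights_tbl %>% filter(dep_delay == 10)

# collect() forces execution and brings the matching rows back into R
system.time(spark_res <- collect(lazy_query))

# The data.table filter on the same value is evaluated eagerly
system.time(dt_res <- DT[dep_delay == 10])

spark_disconnect(sc)
```

Timed this way, both calls do the same work: select the rows with a departure delay of 10 and return them to the R session.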

Now a slightly more complex query:

system.time(DT[,.(distance=mean(distance),delay=mean(arr_delay),count=.N),by=tailnum][count>20 & distance<2000 & !is.na(delay)])
##    user  system elapsed 
##    0.03    0.00    0.06
system.time(delay<-flights_tbl %>% group_by(tailnum) %>% summarise(count=n(),dist=mean(distance),delay=mean(arr_delay)) %>% filter(count>20, dist<2000, !is.na(delay)) %>% collect())
##    user  system elapsed 
##    0.02    0.00    1.70

This time Spark again uses slightly less user time (0.02 s against 0.03 s), but it makes the user wait much longer for the results to be collected: 1.70 s elapsed against 0.06 s for data.table. The fastest hare, Spark, sleeps for a while.
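
A single system.time() call on a query that finishes in a fraction of a second is noisy, and the first Spark job in a session may include start-up overhead. One way to check how stable the gap is would be to repeat each query several times. The sketch below does that; time_elapsed is a hypothetical helper written for this post, not part of sparklyr or data.table, and the comparison only runs if DT and flights_tbl from the setup above exist.

```r
# Hypothetical helper: run a zero-argument function n times and
# return the elapsed wall-clock seconds of each run
time_elapsed <- function(f, n = 5) {
  vapply(seq_len(n), function(i) system.time(f())[["elapsed"]], numeric(1))
}

# Only run the comparison if the objects from the setup above exist
if (exists("DT") && exists("flights_tbl")) {
  dt_times <- time_elapsed(function() {
    DT[, .(distance = mean(distance), delay = mean(arr_delay), count = .N),
       by = tailnum][count > 20 & distance < 2000 & !is.na(delay)]
  })
  spark_times <- time_elapsed(function() {
    flights_tbl %>%
      group_by(tailnum) %>%
      summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
      filter(count > 20, dist < 2000, !is.na(delay)) %>%
      collect()
  })
  # Compare the first run with the median of all runs
  print(c(dt_first = dt_times[1], dt_median = median(dt_times)))
  print(c(spark_first = spark_times[1], spark_median = median(spark_times)))
}
```

If the first Spark run is much slower than the median, part of the 1.70 s is a one-off cost; if all runs are similar, the wait is inherent to running and collecting the query.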

Moral: slow and steady wins the race. Spark?