Data Science In Context Presentation

Albert Gilharry

11 April 2018

Intro
Synchronous Programming
Asynchronous Programming
Comparison using R’s Future Package

Intro

The Caribbean Community Climate Change Centre (CCCCC) is mandated to guide CARICOM’s response to a changing climate.

Hosted in Belize
CARICOM’s official repository for climate change related data and information.
A subset of over 400 billion data points and growing!

Belize

Domain

Synchronous Programming

The “traditional” way of programming, in which code is executed “line by line” within the main thread.
Each statement blocks until it finishes, which keeps results consistent and in order.
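As a minimal sketch of this blocking behaviour, the base R snippet below runs three slow steps one after another. The helper `slow_step()` and its 0.2 second delay are made up for illustration; the point is only that the total time is roughly the sum of the individual steps.

```r
# Hypothetical helper: pretends to do slow work, then returns its label.
slow_step <- function(label, seconds = 0.2) {
  Sys.sleep(seconds)  # blocks the main thread for `seconds`
  label
}

start <- Sys.time()

# Each call must finish before the next line starts.
r1 <- slow_step("first")
r2 <- slow_step("second")
r3 <- slow_step("third")

# Total elapsed time is about the sum of the three delays.
elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
print(elapsed)
```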

Synchronous

Asynchronous Programming

Tasks are executed independently from the main thread in a non-blocking manner!
Can become complex and difficult to trace
Recursive functions allow looping in asynchronous environments
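The points above can be sketched with the future package, assuming it is installed. The helper `wait_for()` is hypothetical: it polls a future recursively rather than with a `while` loop, and the two background workers and the 0.1 second polling interval are arbitrary choices for illustration.

```r
library(future)

# Assumption: two background R sessions are enough for this demo.
plan(multisession, workers = 2)

# future() hands the work to a background session and returns without waiting.
f <- future({
  Sys.sleep(2)  # stand-in for a long-running task
  42
})

# Hypothetical helper: check the future, and call itself again if it is not done.
wait_for <- function(fut, polls = 0) {
  if (resolved(fut)) {
    return(list(value = value(fut), polls = polls))
  }
  Sys.sleep(0.1)  # the main thread could do other useful work here
  wait_for(fut, polls + 1)
}

result <- wait_for(f)
print(result$value)

plan(sequential)  # shut the background workers down again
```

Calling `value(f)` directly would also work, but it blocks until the result is ready; polling with `resolved()` keeps the main thread free in the meantime.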

Asynchronous

Use asynchronous programming when

aiming for efficiency over simplicity
performing potentially long-running operations
tasks are independent of each other and do not need to communicate
user experience (not waiting) is of high importance
making AJAX requests

R’s Future Package For Parallel & Distributed Processing

We will load six relatively large files with synchronous and asynchronous methods and compare their performance.

## [1] "There are 10669 rows and 369 columns in this data set."

Synchronous Version

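load_data_sync <- function(){
  # Read the files one after another; each read.csv() blocks until it finishes.
  # skip = 6 skips the first six lines of each file.
  df1 <- read.csv("data/2070.csv", skip = 6)
  df2 <- read.csv("data/2071.csv", skip = 6)
  df3 <- read.csv("data/2072.csv", skip = 6)
  df4 <- read.csv("data/2073.csv", skip = 6)
  df5 <- read.csv("data/2074.csv", skip = 6)
  df6 <- read.csv("data/2075.csv", skip = 6)

  # Combine all six data frames and return the result.
  rbind(df1, df2, df3, df4, df5, df6)
}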

Asynchronous Version

library(future)

load_data_async <- function(){
  # Start each read as a future; the calls return without waiting for the file.
  # Later releases of the future package deprecate multiprocess in favour of multisession.
  df1 <- future({ read.csv("data/2070.csv", skip = 6) }) %plan% multiprocess
  df2 <- future({ read.csv("data/2071.csv", skip = 6) }) %plan% multiprocess
  df3 <- future({ read.csv("data/2072.csv", skip = 6) }) %plan% multiprocess
  df4 <- future({ read.csv("data/2073.csv", skip = 6) }) %plan% multiprocess
  df5 <- future({ read.csv("data/2074.csv", skip = 6) }) %plan% multiprocess
  df6 <- future({ read.csv("data/2075.csv", skip = 6) }) %plan% multiprocess

  # value() blocks until each future is resolved; combine all six and return.
  rbind(value(df1), value(df2), value(df3), value(df4), value(df5), value(df6))
}
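The same pattern can be written more compactly by mapping over a vector of file paths. This is only a sketch under stated assumptions: `load_all_async()` is a hypothetical helper, the three small temporary CSV files stand in for the real data (so no `skip = 6` is needed), and `multisession` with two workers is an assumed plan.

```r
library(future)

# Assumption: two background R sessions.
plan(multisession, workers = 2)

# Create three small stand-in CSV files in a temporary location.
paths <- vapply(1:3, function(i) {
  path <- tempfile(fileext = ".csv")
  write.csv(data.frame(id = i, x = 1:5), path, row.names = FALSE)
  path
}, character(1))

# Hypothetical helper: one future per file, then collect and combine.
load_all_async <- function(paths) {
  # Launch every read first so the files load in the background.
  futures <- lapply(paths, function(p) future(read.csv(p)))
  # value() waits for each result; rbind stacks them into one data frame.
  do.call(rbind, lapply(futures, value))
}

df <- load_all_async(paths)
print(dim(df))

plan(sequential)  # shut the background workers down again
```

Creating all futures before calling `value()` on any of them matters: collecting each result immediately after launching it would make the reads run one at a time again.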

Results

microbenchmark(load_data_async(), list = NULL, times = 10)

## Unit: seconds
##               expr      min       lq     mean   median     uq      max
##  load_data_async() 9.895663 10.22015 11.01696 10.45476 10.746 16.52279
##  neval
##     10

microbenchmark(load_data_sync(), list = NULL, times = 10)

## Unit: seconds
##              expr      min      lq     mean   median       uq      max
##  load_data_sync() 34.70808 35.0382 35.87707 35.32625 35.80394 38.79051
##  neval
##     10
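As a quick reading of the two benchmark tables, the speedup can be computed from the reported mean and median times. The numbers below are copied from the output above; the variable names are chosen here for illustration.

```r
# Mean and median run times in seconds, taken from the benchmark output.
async_mean   <- 11.01696
sync_mean    <- 35.87707
async_median <- 10.45476
sync_median  <- 35.32625

# Speedup = synchronous time divided by asynchronous time.
speedup_mean   <- sync_mean / async_mean
speedup_median <- sync_median / async_median

print(round(c(mean = speedup_mean, median = speedup_median), 2))
```

On this machine and data set, the asynchronous version is a little over three times faster by either measure.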

Thank You!

Any Questions?

Sources:

https://cran.r-project.org/web/packages/future/future.pdf

https://en.wikipedia.org/wiki/Asynchrony_(computer_programming)

https://stackify.com/when-to-use-asynchronous-programming/

http://www.i-programmer.info/programming/theory/6040-what-is-asynchronous-programming.html