Data Science In Context Presentation

Albert Gilharry

11 April 2018

  • Intro

  • Synchronous Programming

  • Asynchronous Programming

  • Comparison using R’s Future Package

Intro

The Caribbean Community Climate Change Centre (CCCCC) is mandated to guide CARICOM’s response to a changing climate.

  • Hosted in Belize

  • CARICOM’s official repository for climate change related data and information.

  • A subset of over 400 billion data points and growing!

Belize

Domain

Synchronous Programming

  • The “traditional” way of programming, in which code is executed “line by line” within the main thread.

  • Each statement blocks until it completes, which keeps the order of execution predictable and the results consistent.
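To make the blocking behaviour concrete, here is a minimal sketch in base R. The function `slow_task()` is hypothetical and simply sleeps to stand in for a slow operation such as reading a large file; the timings are illustrative only.

```r
# Hypothetical stand-in for a slow operation such as reading a large file.
slow_task <- function(id, seconds = 0.2) {
  Sys.sleep(seconds)
  paste("task", id, "done")
}

start <- Sys.time()
results <- character(0)
for (id in 1:3) {
  # Each call blocks: the loop cannot move on until slow_task() returns.
  results <- c(results, slow_task(id))
}
elapsed_sync <- as.numeric(difftime(Sys.time(), start, units = "secs"))

print(results)
# Total time is roughly the sum of the individual task times.
print(elapsed_sync)
```

Because every call waits for the previous one, the total run time grows with the number of tasks.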

Synchronous

Asynchronous Programming

  • Tasks are executed independently from the main thread in a non-blocking manner!

  • Can become complex and difficult to trace

  • Recursive functions allow looping in asynchronous environments
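The points above can be sketched with the future package. This is a minimal illustration, assuming the package is installed and that background R sessions can be started. The helper `wait_for()` is hypothetical and shows one possible reading of "recursive looping": it polls the future by calling itself instead of using a loop.

```r
library(future)
plan(multisession, workers = 2)  # assumption: background R sessions are available

# future() returns immediately; the expression runs in a background session.
f <- future({
  Sys.sleep(0.5)   # hypothetical stand-in for a long-running operation
  sum(1:10)
})

# The main session is free to do other work while the future runs.
main_work <- sqrt(16)

# Hypothetical helper: poll recursively until the future has resolved.
# resolved() checks for completion without blocking.
wait_for <- function(fut, polls = 0) {
  if (resolved(fut)) return(polls)
  Sys.sleep(0.1)
  wait_for(fut, polls + 1)
}

polls <- wait_for(f)
result <- value(f)   # returns at once here because the future has resolved

plan(sequential)     # shut down the background workers
```

In practice `value()` alone is enough to wait for a result; the recursive poll only illustrates how control flow changes when nothing blocks.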

Asynchronous

Use asynchronous programming when

  • aiming for efficiency over simplicity

  • performing potentially long-running operations

  • tasks are independent of each other and inter-communication is not necessary

  • user experience (not waiting) is of high importance

  • doing AJAX requests

R’s Future Package For Parallel & Distributed Processing

We will load six relatively large files using asynchronous and synchronous methods to compare performance.

## [1] "There are 10669 rows and 369 columns in this data set."

Synchronous Version

load_data_sync <- function() {
  # Read the files one after another; each call blocks until its file is loaded.
  df1 <- read.csv("data/2070.csv", skip = 6)
  df2 <- read.csv("data/2071.csv", skip = 6)
  df3 <- read.csv("data/2072.csv", skip = 6)
  df4 <- read.csv("data/2073.csv", skip = 6)
  df5 <- read.csv("data/2074.csv", skip = 6)
  df6 <- read.csv("data/2075.csv", skip = 6)

  # Combine all six data frames into one.
  rbind(df1, df2, df3, df4, df5, df6)
}

Asynchronous Version

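library(future)

load_data_async <- function() {
  # Each future() call returns immediately and reads its file in a separate
  # process. Newer releases of future deprecate multiprocess in favor of
  # multisession (or multicore on Unix-like systems).
  df1 <- future({ read.csv("data/2070.csv", skip = 6) }) %plan% multiprocess
  df2 <- future({ read.csv("data/2071.csv", skip = 6) }) %plan% multiprocess
  df3 <- future({ read.csv("data/2072.csv", skip = 6) }) %plan% multiprocess
  df4 <- future({ read.csv("data/2073.csv", skip = 6) }) %plan% multiprocess
  df5 <- future({ read.csv("data/2074.csv", skip = 6) }) %plan% multiprocess
  df6 <- future({ read.csv("data/2075.csv", skip = 6) }) %plan% multiprocess

  # value() blocks until the corresponding future has resolved.
  rbind(value(df1), value(df2), value(df3), value(df4), value(df5), value(df6))
}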
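The same pattern can be written more compactly by creating one future per file path. The sketch below is an assumption-laden illustration, not the presentation's code: it writes three small hypothetical CSV files to a temporary directory so that it runs without the original data, and it uses `plan(multisession)` with two workers.

```r
library(future)
plan(multisession, workers = 2)  # assumption: two background R sessions

# Hypothetical small files standing in for the presentation's data files.
paths <- file.path(tempdir(), paste0("demo_", 1:3, ".csv"))
for (p in paths) {
  write.csv(data.frame(x = 1:4, y = letters[1:4]), p, row.names = FALSE)
}

# Start one future per file; each read.csv() call runs in a background session.
futures <- lapply(paths, function(p) future(read.csv(p)))

# Collect the results; value() blocks until each future has resolved.
frames <- lapply(futures, value)
combined <- do.call(rbind, frames)

print(dim(combined))
plan(sequential)  # shut down the background workers
```

Building the list of futures from a vector of paths avoids repeating one line per file and makes it harder to leave a file out of the final `rbind()`.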

Results

microbenchmark(load_data_async(), list = NULL, times = 10)
## Unit: seconds
##               expr      min       lq     mean   median     uq      max
##  load_data_async() 9.895663 10.22015 11.01696 10.45476 10.746 16.52279
##  neval
##     10
microbenchmark(load_data_sync(), list = NULL, times = 10)
## Unit: seconds
##              expr      min      lq     mean   median       uq      max
##  load_data_sync() 34.70808 35.0382 35.87707 35.32625 35.80394 38.79051
##  neval
##     10
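In the output above, the asynchronous version has a median of about 10.5 seconds against about 35.3 seconds for the synchronous version, roughly a 3.4x speedup. The sketch below reproduces the shape of that comparison on a toy workload. It assumes the future package is installed; `slow_task()` is hypothetical and sleeps instead of reading a file; and it uses base R's `system.time()` in place of microbenchmark to stay self-contained. Actual timings depend on the machine and on worker start-up overhead.

```r
library(future)
plan(multisession, workers = 2)  # assumption: two background R sessions

# Hypothetical stand-in for loading one large file.
slow_task <- function(id) {
  Sys.sleep(0.5)
  id * 2
}

# Synchronous: the second task starts only after the first has finished.
run_sync <- function() {
  c(slow_task(1), slow_task(2))
}

# Asynchronous: both tasks start right away in separate background sessions.
run_async <- function() {
  f1 <- future(slow_task(1))
  f2 <- future(slow_task(2))
  c(value(f1), value(f2))
}

sync_time  <- system.time(sync_result  <- run_sync())[["elapsed"]]
async_time <- system.time(async_result <- run_async())[["elapsed"]]

print(c(sync = sync_time, async = async_time))
plan(sequential)  # shut down the background workers
```

Both versions return the same values; only the elapsed time differs, and the gap grows with the number and duration of independent tasks.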

Thank You!

Any Questions?

Sources:

https://cran.r-project.org/web/packages/future/future.pdf

https://en.wikipedia.org/wiki/Asynchrony_(computer_programming)

https://stackify.com/when-to-use-asynchronous-programming/

http://www.i-programmer.info/programming/theory/6040-what-is-asynchronous-programming.html