Running R at Scale

with Apache Arrow on Spark

Javier Luraschi

Spark Summit 2019

Overview

  • Intro to R
  • R with Spark
  • Intro to Arrow
  • Arrow with R
  • Arrow on Spark

Intro to R

R Language

R is a programming language for statistical computing that is: vectorized, columnar and flexible.
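To make "vectorized, columnar and flexible" concrete, here is a minimal base R sketch. The data and the helper name summarise_col are illustrative assumptions, not material from the talk.

```r
# Vectorized: arithmetic applies to whole vectors, no explicit loop needed.
temps_c <- c(12.5, 18.0, 21.3)
temps_f <- temps_c * 9 / 5 + 32

# Columnar: a data frame is a list of equal-length column vectors.
weather <- data.frame(
  city = c("Oslo", "Rome", "Cairo"),
  temp_c = temps_c,
  stringsAsFactors = FALSE
)
print(is.list(weather))      # each list element is one column
print(mean(weather$temp_c))  # works on a single column vector

# Flexible: functions are ordinary values and can be passed as arguments.
# summarise_col is a hypothetical helper defined only for this example.
summarise_col <- function(df, col, f) f(df[[col]])
print(summarise_col(weather, "temp_c", max))
```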

R Packages

CRAN is R’s package repository, comparable to NPM or Maven Central. It hosts thousands of packages, and usage grows every year.

An R package

One of many packages, rayrender: A ray tracer written in R using Rcpp.

R with Spark

sparklyr 0.4 - Initial Release

Support to install, connect, analyze, model and extend Spark.

sparklyr 0.5 - Connections

Support for Apache Livy and Databricks connections, dplyr improvements, and certification with Cloudera.

sparklyr 0.6 - Distributed R

Distribute R computations by executing arbitrary R code over each partition, using your favorite R packages.
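A minimal sketch of this pattern, assuming sparklyr's spark_apply() as the entry point and a local Spark installation. The function scale_partition and the run_on_spark switch are hypothetical names introduced for illustration. The per-partition function is an ordinary R function, so it can be checked on a local data frame before anything is sent to a cluster.

```r
# Per-partition function: receives one partition as a plain R data frame
# and returns a data frame. (Hypothetical example function.)
scale_partition <- function(df) {
  df$mpg_scaled <- as.numeric(scale(df$mpg))
  df
}

# Check the function locally first; no cluster is needed for this step.
local_result <- scale_partition(mtcars)
print(head(local_result[, c("mpg", "mpg_scaled")], 3))

# Hypothetical switch: set to TRUE on a machine with sparklyr and Spark.
run_on_spark <- FALSE
if (run_on_spark) {
  library(sparklyr)
  sc <- spark_connect(master = "local")
  cars_tbl <- sdf_copy_to(sc, mtcars, "cars_tbl", overwrite = TRUE)
  # The same function now runs once per partition on the Spark workers.
  scaled_tbl <- spark_apply(cars_tbl, scale_partition)
  print(scaled_tbl)
  spark_disconnect(sc)
}
```

Note that the scaling here is computed within each partition, so results on a multi-partition table differ from scaling the whole dataset at once.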

sparklyr 0.7 - Pipelines and Machine Learning

Provides a uniform set of high-level APIs to help create, tune, and deploy machine learning pipelines at scale, with support for all MLlib algorithms.

sparklyr 0.8 - MLeap and Graphs

MLeap allows you to use your Spark pipelines in any Java-enabled device or service.

graphframes provides an interface to the GraphFrames Spark package.

sparklyr 0.9 - Streams

Spark structured streams provide parallel and fault-tolerant data processing.

This release also adds support for Kubernetes and for properly interrupting long-running operations.

sparklyr 1.0 - Arrow

  • Arrow enables faster and larger data transfers between Spark and R.
  • XGBoost enables training gradient boosting models over distributed datasets.
  • TFRecords writes TensorFlow records from Spark to support deep learning workflows.

Intro to Arrow

What is Arrow?

Apache Arrow is a cross-language development platform for in-memory data.

Source: arrow.apache.org

Memory Layout

Columnar memory layout allows applications to avoid unnecessary IO and accelerate analytical processing performance on modern CPUs and GPUs.

Source: arrow.apache.org

Arrow with R

Feather package

A lightweight binary columnar data store designed for maximum speed, based on Arrow’s memory layout.

Arrow package

Currently, install from GitHub:

The R arrow package supports feather, parquet, streams, and more.

[1] 44 02 00 00 10 00 00 00 00 00 0a 00
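As one illustration of the Feather support mentioned above, the following sketch round-trips a data frame through a Feather file. It assumes the arrow package is installed and uses its write_feather() and read_feather() functions; feather_roundtrip is a hypothetical helper name, and the function returns NULL when arrow is not available.

```r
# Round-trip a data frame through the Feather format.
# Returns NULL if the arrow package is not installed.
feather_roundtrip <- function(df) {
  if (!requireNamespace("arrow", quietly = TRUE)) return(NULL)
  path <- tempfile(fileext = ".feather")
  on.exit(unlink(path))           # remove the temporary file afterwards
  arrow::write_feather(df, path)  # columnar, binary on-disk format
  as.data.frame(arrow::read_feather(path))
}

back <- feather_roundtrip(mtcars)
if (!is.null(back)) print(head(back, 3))
```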

Arrow on Spark

Requirements

To use Arrow with Spark and R you’ll need:

  • A Spark 2.3.0+ cluster.
  • Arrow 0.13+ installed on every node (Arrow 0.11+ is usable).
  • R 3.5+; the next version is likely to support R 3.1+.
  • sparklyr 1.0+.
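The R-side requirements above can be checked with a short sketch like the following. The function name check_arrow_requirements is hypothetical, and the check covers only the local R session; it cannot verify the Spark cluster version or the Arrow installation on worker nodes.

```r
# Compare the local R setup against the R-side requirements listed above.
# (Hypothetical helper; does not inspect the Spark cluster itself.)
check_arrow_requirements <- function() {
  has_version <- function(pkg, min_version) {
    requireNamespace(pkg, quietly = TRUE) &&
      utils::packageVersion(pkg) >= min_version
  }
  c(
    r        = getRversion() >= "3.5.0",
    sparklyr = has_version("sparklyr", "1.0.0"),
    arrow    = has_version("arrow", "0.13.0")
  )
}

print(check_arrow_requirements())
```

According to the sparklyr documentation, once these are in place sparklyr uses Arrow when the arrow package is attached with library(arrow) before connecting.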

Implementation

R transformations in Spark without and with Arrow:

Copy with Arrow

Copy datasets that are 10x larger, and copy them 3x faster, with Arrow and Spark.

Collect with Arrow

Collect datasets that are 5x larger, and collect them 3x faster, with Arrow and Spark.

Transform with Arrow

Transform datasets 40x faster with R, Arrow and Spark.

Thank you!

Resources

  • Docs: spark.rstudio.com
  • GitHub: github.com/rstudio/sparklyr
  • Blog: blog.rstudio.com/tags/sparklyr
  • R Help: community.rstudio.com
  • Spark Help: stackoverflow.com/tags/sparklyr
  • Issues: github.com/rstudio/sparklyr/issues
  • Chat: gitter.im/rstudio/sparklyr
  • Twitter: twitter.com/hashtag/sparklyr