Scaling Spark with Streams and Arrow

@javierluraschi / @rstudio

05/31/2019

Intro

Outline

  • Spark
  • Streams
  • Arrow

What to do when code is slow?

Scaling Out with R and Spark

Using Spark from R
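
A minimal sketch of what this looks like with sparklyr (the local master and the mtcars dataset are placeholders for a real cluster and data):

    library(sparklyr)
    library(dplyr)

    # connect to Spark; use your cluster URL instead of "local"
    sc <- spark_connect(master = "local")

    # copy an R data frame into Spark and query it with dplyr
    cars <- copy_to(sc, mtcars)
    cars %>% group_by(cyl) %>% summarise(mpg = mean(mpg))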

Streams

What about real-time data?

Using Spark Streams

Spark structured streams provide parallel and fault-tolerant data processing, useful when analyzing real-time data.
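
A minimal sketch using sparklyr's stream functions (the "source/" and "destination/" folders are placeholders):

    library(sparklyr)
    sc <- spark_connect(master = "local")

    # read a folder of CSV files as a stream and continuously
    # write the incoming rows to another folder
    stream_read_csv(sc, "source/") %>%
      stream_write_csv("destination/")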

What can you do with streams?

Streaming with Spark, Kafka and Shiny

Apache Kafka is an open-source stream-processing software platform that provides a unified, high-throughput, low-latency platform for handling real-time data feeds.
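
One way to wire this up from sparklyr, as a sketch; the Kafka package version, broker address, and topic name are assumptions you would adapt:

    library(sparklyr)

    # the Spark/Kafka integration package is loaded through spark-submit
    config <- spark_config()
    config$sparklyr.shell.packages <-
      "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0"
    sc <- spark_connect(master = "local", config = config)

    # subscribe to a Kafka topic and keep results in an in-memory
    # table, which a Shiny app could then poll and plot
    stream_read_kafka(sc, options = list(
      "kafka.bootstrap.servers" = "localhost:9092",  # placeholder broker
      "subscribe" = "topic1"                         # placeholder topic
    )) %>%
      stream_write_memory("kafka_stream")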

Arrow

What is Arrow?

Apache Arrow is a cross-language development platform for in-memory data.

Source: arrow.apache.org

Memory Layout

Columnar memory layout allows applications to avoid unnecessary IO and accelerate analytical processing performance on modern CPUs and GPUs.

Source: arrow.apache.org

Requirements

To use Arrow with Spark and R you’ll need:

  • A Spark 2.3.0+ cluster.
  • Arrow 0.13+ installed on every node (Arrow 0.11+ is usable).
  • R 3.5+; the next release is likely to support R 3.1+.
  • sparklyr 1.0+.
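
With these in place, no extra configuration is needed; attaching the arrow package is enough (a minimal sketch):

    # attaching arrow before running sparklyr operations routes
    # serialization between R and Spark through Arrow
    library(arrow)
    library(sparklyr)

    sc <- spark_connect(master = "local")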

Implementation

R transformations in Spark, without and with Arrow.

Copy with Arrow

Copy datasets 10x larger and 3x faster with Arrow and Spark.
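
A sketch of copying data with Arrow enabled (nycflights13::flights is just an example dataset; any R data frame works):

    library(arrow)     # attached, so copy_to() serializes through Arrow
    library(sparklyr)
    sc <- spark_connect(master = "local")

    flights <- copy_to(sc, nycflights13::flights)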

Collect with Arrow

Collect datasets 5x larger and 3x faster with Arrow and Spark.
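
Collecting back into R also goes through Arrow when the package is attached; continuing the sketch above:

    # deserialize Spark's results directly into an R data frame
    local_flights <- collect(flights)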

Transform with Arrow

Transform datasets 40x faster with R, Arrow and Spark.
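
The speedup comes from spark_apply(), which runs arbitrary R functions over each partition and, with arrow attached, avoids row-by-row conversion. A minimal sketch; the million-row multiplication is a toy computation typical of these benchmarks, not the benchmark itself:

    library(arrow)
    library(sparklyr)
    sc <- spark_connect(master = "local")

    # run an R function over each partition of a Spark data frame
    sdf_len(sc, 10^6) %>%
      spark_apply(function(df) df * 10) %>%
      collect()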

Thank You!

Resources