Scaling Spark with Streams and Arrow

@javierluraschi / @rstudio

05/31/2019

Intro

Outline

  • Spark
  • Streams
  • Arrow

What to do when code is slow?

Scaling Out with R and Spark

Using Spark from R
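
A minimal sketch of what this looks like with sparklyr (the local master and the mtcars dataset are placeholders for a real cluster and data):

    library(sparklyr)
    library(dplyr)

    # connect to Spark; use your cluster URL instead of "local"
    sc <- spark_connect(master = "local")

    # copy an R data frame into Spark and query it with dplyr
    cars <- copy_to(sc, mtcars)
    cars %>% group_by(cyl) %>% summarise(mpg = mean(mpg))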

Streams

What about real-time data?

Using Spark Streams

Spark structured streams provide parallel and fault-tolerant data processing, useful when analyzing real-time data.
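
A minimal sketch using sparklyr's stream functions (the "source/" and "destination/" folders are placeholders):

    library(sparklyr)
    sc <- spark_connect(master = "local")

    # read a folder of CSV files as a stream and continuously
    # write the incoming rows to another folder
    stream_read_csv(sc, "source/") %>%
      stream_write_csv("destination/")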

What can you do with streams?

Streaming with Spark, Kafka and Shiny

Apache Kafka is an open-source stream-processing software platform that provides a unified, high-throughput, low-latency platform for handling real-time data feeds.
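
One way to wire this up from sparklyr, as a sketch; the Kafka package version, broker address, and topic name are assumptions you would adapt:

    library(sparklyr)

    # the Spark/Kafka integration package is loaded through spark-submit
    config <- spark_config()
    config$sparklyr.shell.packages <-
      "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0"
    sc <- spark_connect(master = "local", config = config)

    # subscribe to a Kafka topic and keep results in an in-memory
    # table, which a Shiny app could then poll and plot
    stream_read_kafka(sc, options = list(
      "kafka.bootstrap.servers" = "localhost:9092",  # placeholder broker
      "subscribe" = "topic1"                         # placeholder topic
    )) %>%
      stream_write_memory("kafka_stream")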

Arrow

What is Arrow?

Apache Arrow is a cross-language development platform for in-memory data.

Source: arrow.apache.org

Memory Layout

Columnar memory layout allows applications to avoid unnecessary IO and accelerate analytical processing performance on modern CPUs and GPUs.

Source: arrow.apache.org

Requirements

To use Arrow with Spark and R you’ll need:

  • A Spark 2.3.0+ cluster.
  • Arrow 0.13+ installed on every node (Arrow 0.11+ is usable).
  • R 3.5+; the next release is likely to support R 3.1+.
  • sparklyr 1.0+.
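
With these in place, no extra configuration is needed; attaching the arrow package is enough (a minimal sketch):

    # attaching arrow before running sparklyr operations routes
    # serialization between R and Spark through Arrow
    library(arrow)
    library(sparklyr)

    sc <- spark_connect(master = "local")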

Implementation

R transformations in Spark, without and with Arrow.

Copy with Arrow

Copy datasets 10x larger and 3x faster with Arrow and Spark.
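
A sketch of copying data with Arrow enabled (nycflights13::flights is just an example dataset; any R data frame works):

    library(arrow)     # attached, so copy_to() serializes through Arrow
    library(sparklyr)
    sc <- spark_connect(master = "local")

    flights <- copy_to(sc, nycflights13::flights)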

Collect with Arrow

Collect datasets 5x larger and 3x faster with Arrow and Spark.
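
Collecting back into R also goes through Arrow when the package is attached; continuing the sketch above:

    # deserialize Spark's results directly into an R data frame
    local_flights <- collect(flights)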

Transform with Arrow

Transform datasets 40x faster with R, Arrow and Spark.
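
The speedup comes from spark_apply(), which runs arbitrary R functions over each partition and, with arrow attached, avoids row-by-row conversion. A minimal sketch; the million-row multiplication is a toy computation typical of these benchmarks, not the benchmark itself:

    library(arrow)
    library(sparklyr)
    sc <- spark_connect(master = "local")

    # run an R function over each partition of a Spark data frame
    sdf_len(sc, 10^6) %>%
      spark_apply(function(df) df * 10) %>%
      collect()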

Thank You!

Resources