Cluster Computing Made Easy with Spark and R

@javierluraschi / @rstudio

06/08/2019

Overview

Technology Periods

Stone Age: 3.4M BC - 5000 BC

Ancient History: 5000 BC - 500 AD

Machine Age: 1880 - 1945

Space Age: 1957 - Present

Selfie Age: 1970 - Present

Information Age: 1970 - Present

Information Age: Present

World’s Capacity to Store Information

Hadoop and Map Reduce

Intro to Spark

Spark Record Sorting

Metric	Hadoop Record	Spark Record
Data Size	102.5 TB	100 TB
Elapsed Time	72 mins	23 mins
Nodes	2100	206
Cores	50400	6592
Disk	3150 GB/s	618 GB/s
Network	10 Gbps	10 Gbps
Sort rate	1.42 TB/min	4.27 TB/min
Sort rate / node	0.67 GB/min	20.7 GB/min
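
The per-node rates in the table can be re-derived from the total sort rate and the node count. The following is a small base R check that uses only figures from the table and assumes decimal units (1 TB = 1000 GB); the data frame and column names are illustrative.

```r
# Figures copied from the sorting records table
records <- data.frame(
  system     = c("Hadoop", "Spark"),
  tb_per_min = c(1.42, 4.27),   # total sort rate
  nodes      = c(2100, 206)     # cluster size
)

# Assumption: 1 TB = 1000 GB
records$gb_per_min_per_node <- records$tb_per_min * 1000 / records$nodes

# How much more data each Spark node sorted per minute
per_node_ratio <- records$gb_per_min_per_node[2] / records$gb_per_min_per_node[1]

print(records)
print(per_node_ratio)
```

Under these assumptions the per-node figures land close to the table's 0.67 and 20.7 GB/min, which is roughly a 30x difference per node.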

What can I do with cluster computing?

AlphaGo

OpenAI Dota

Distributed Training

Intro

What to do when code is slow?

library(magrittr)                                    # Provides the %>% pipe

# Baseline
mtcars %>% lm(mpg ~ wt + cyl, .)

# Sample
mtcars %>% dplyr::sample_n(10) %>% lm(mpg ~ wt + cyl, .)

# Profile
profvis::profvis(mtcars %>% lm(mpg ~ wt + cyl, .))

# Scale Up
cloudml::cloudml_train("train.R")

# Scale Out
mtcars_tbl %>% sparklyr::ml_linear_regression(mpg ~ wt + cyl)

Scaling Out with R and Spark

# Scale Out
mtcars_tbl %>% sparklyr::ml_linear_regression(mpg ~ wt + cyl)

Using Spark from R

install.packages("sparklyr")                         # R interface to Spark
library(sparklyr)

spark_install()                                      # Install Apache Spark
sc <- spark_connect(master = "local")                # Connect to Spark cluster

cars <- spark_read_csv(sc, "cars", "mtcars/")        # Read data in Spark

dplyr::summarize(cars, n = n())                      # Count records with dplyr
DBI::dbGetQuery(sc, "SELECT count(*) FROM cars")     # Count records with DBI

ml_linear_regression(cars, mpg ~ wt + cyl)           # Perform linear regression

spark_context(sc) %>% invoke("version")              # Extend sparklyr with Scala
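
The spark_read_csv() call above reads from a folder named "mtcars/". One way to create that folder before connecting is sketched below in base R; the file name mtcars.csv is an assumption, since any CSV file in the folder would be picked up.

```r
# Assumption: Spark is run locally and reads from ./mtcars/
dir.create("mtcars", showWarnings = FALSE)

# Write the built-in mtcars dataset as a CSV file (car names are dropped)
write.csv(mtcars, file.path("mtcars", "mtcars.csv"), row.names = FALSE)

# Confirm the file is in place
list.files("mtcars")
```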

Streams

What about realtime data?

Using Spark Streams

Spark structured streams provide parallel and fault-tolerant data processing:

stream_read_text(sc, "s3a://your-s3-bucket/") %>%    # Define input stream
  spark_apply(~webreadr::read_s3(.x$line)) %>%       # Transform with R
  group_by(uri) %>%                                  # Group using dplyr
  summarize(n = n()) %>%                             # Count using dplyr
  arrange(desc(n)) %>%                               # Arrange using dplyr
  stream_write_memory("urls", mode = "complete")     # Define output stream
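
To make the aggregation step concrete, here is a sketch of the same group, count, and arrange logic on a small static data frame in base R. The request URIs are hypothetical stand-ins for parsed log lines; a real stream would recompute this result as new data arrives.

```r
# Hypothetical parsed requests standing in for the stream's contents
requests <- data.frame(
  uri = c("/index.html", "/about.html", "/index.html",
          "/data.csv", "/index.html", "/about.html"),
  stringsAsFactors = FALSE
)

# Count rows per uri (the group_by + summarize step)
counts <- as.data.frame(table(uri = requests$uri),
                        responseName = "n", stringsAsFactors = FALSE)

# Sort by descending count (the arrange(desc(n)) step)
counts <- counts[order(-counts$n), ]

print(counts)
```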

What can you do with streams?

cars_str <- stream_read_csv(sc, "mtcars/", "cars")     # Read stream in Spark

out_str <- summarize(cars_str, n = n())                # Count records with dplyr
out_str <- dbGetQuery(sc, "SELECT count(*) FROM cars") # Count records with DBI

out_str <- ml_transform(fitted, cars_str)              # Transform stream with model

out_str <- spark_apply(cars_str, nrow)                 # Extend streams with R

stream_write_csv(out_str, "output/")                   # Write as a CSV stream
reactiveSpark(out_str)                                 # Use as a Shiny reactive

Streaming with Spark, Kafka and Shiny

Apache Kafka is an open-source stream-processing software platform that provides a unified, high-throughput, low-latency platform for handling real-time data feeds.


R Markdown

Donald Knuth

The Art of Computer Programming

Literate Programming

I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: “Literate Programming.”

R Markdown and R Notebooks

Thank You!

Resources

spark.rstudio.com:
Main documentation site with examples and reference functions.
community.rstudio.com:
sparklyr questions? Use the RStudio Community.
github.com/rstudio/sparklyr:
Something needs fixing? Open a GitHub issue.
stackoverflow.com/tags/sparklyr:
General questions? Stack Overflow is a good place to start.
gitter.im/rstudio/sparklyr:
Anything urgent? Chat with us in Gitter!
rpubs.com/jluraschi:
Want to review these slides?
github.com/javierluraschi/talks:
Want to run this R Notebook yourself?