Cluster Computing Made Easy with Spark and R

@javierluraschi / @rstudio

06/08/2019

Overview

Technology Periods

Stone Age: 3.4M BC - 5000 BC

Ancient History: 5000 BC - 500

Machine Age: 1880 - 1945

Space Age: 1957 - Present

Selfie Age: 1970 - Present

Information Age: 1970 - Present

Information Age: Present

World’s Capacity to Store Information

Hadoop and Map Reduce

Intro to Spark

Spark Record Sorting

                    Hadoop Record   Spark Record
Data Size           102.5 TB        100 TB
Elapsed Time        72 min          23 min
Nodes               2,100           206
Cores               50,400          6,592
Disk throughput     3,150 GB/s      618 GB/s
Network             10 Gbps         10 Gbps
Sort rate           1.42 TB/min     4.27 TB/min
Sort rate / node    0.67 GB/min     20.7 GB/min

What can I do with cluster computing?

AlphaGo

OpenAI Dota

Distributed Training

Intro

What to do when code is slow?

Scaling Out with R and Spark

Using Spark from R
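
A minimal sketch of what this looks like with sparklyr, assuming a local Spark installation and using mtcars as stand-in data:

library(sparklyr)
library(dplyr)

# Connect to Spark; "local" runs a single-node instance for testing,
# while a cluster URL (YARN, Mesos, etc.) would be used in production.
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark as a distributed table.
cars <- copy_to(sc, mtcars)

# dplyr verbs are translated to Spark SQL and executed in the cluster.
cars %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg, na.rm = TRUE))

spark_disconnect(sc)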

Streams

What about real-time data?

Using Spark Streams

Spark structured streams provide parallel and fault-tolerant data processing.
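
As a sketch, streaming in sparklyr pairs a stream reader with a stream writer; the folder names below are placeholders, and the 'value' column is assumed to exist in the incoming CSV files:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Watch a folder for new CSV files, transform each micro-batch with dplyr,
# and continuously write the results to a destination folder.
stream_read_csv(sc, "source/") %>%
  filter(value > 0) %>%
  stream_write_csv("destination/")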

What can you do with streams?

Streaming with Spark, Kafka and Shiny

Apache Kafka is an open-source stream-processing software platform that provides unified, high-throughput, low-latency handling of real-time data feeds.
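
A hedged sketch of reading from Kafka with sparklyr; the broker address, topic name, and connector version below are placeholders that must match your deployment:

library(sparklyr)

# The Spark-Kafka connector must be on the classpath; the artifact
# version has to match the Spark version in use (2.4.0 assumed here).
config <- spark_config()
config$sparklyr.shell.packages <-
  "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0"

sc <- spark_connect(master = "local", config = config)

# Subscribe to a topic; each record arrives with key, value
# and metadata columns.
stream <- stream_read_kafka(
  sc,
  options = list(
    kafka.bootstrap.servers = "localhost:9092",  # assumed broker address
    subscribe = "events"                         # assumed topic name
  )
)

From there, sparklyr's reactiveSpark() can surface the stream as a reactive value inside a Shiny app, which is what ties Spark, Kafka and Shiny together.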

R Markdown

Donald Knuth

The Art of Computer Programming

Literate Programming

I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: “Literate Programming.”

R Markdown and R Notebooks
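
As a sketch, a literate sparklyr analysis is an .Rmd file that interleaves prose with executable chunks; the minimal, assumed example below renders to HTML with the code and its results inline:

---
title: "mtcars on Spark"
output: html_document
---

The prose explains the analysis; the chunk below runs it on Spark.

```{r}
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
copy_to(sc, mtcars) %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg))
```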

Thank You!

Resources