Apache Spark

Real-time analytics at scale

Tom Swann

Spark is a general purpose cluster computing engine, that is fast and easy to use.

Created at UC Berkeley AMPLab
100% Open Source
Major commercial supporter is Databricks

General

Spark can handle batch and low-latency workloads.

Spark Stack

Spark powers a unified stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

General

Fast

Spark caches data in-memory across the cluster.

Logistic Regression

Programs can run up to 100X faster than MapReduce in-memory
Crucuially, they are also faster on disk

Iterative processes such as many machine learning algorithms benefit from caching data in-memory.

MapReduce typically performs poorly in these scenarios.

Easy to Use

Spark offers a range of APIs for:

Scala
Python
Java
SQL

Spark supports functional programming. It provides over 80 high-level operators that make it easy to build highly parallel applications.

You can use Spark interactively from the Scala and Python shells.

Word Count in Spark

Scala

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Python

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Parallel Execution

Spark Workers

Use SparkContext to create RDDs
Manages a number of nodes called executors
RDD functions are automatically parallelised

RDD (Resilient Distributed Dataset)

RDDs are the fundamental unit of work in Spark

DataSet: Can be created from file, programmatically or from another RDD
Distributed: cached in memory across cluster nodes
Resilient: if data in memory is lost, it can be recreated

Spark depends heavily on concepts of functional programming

Functions have input and output, no state or side effects
Spark programming consists primarily of performing functional operations on RDDs

RDD Operations

Processing on an RDD forms a DAG [Directed Acyclic Graph] consisting of two types of operation.

Transformations

RDD Transformations

Defines a new RDD (RDDs are immutable)
May be chained together
Execute lazily

Actions

RDD Actions

Return values
Causes transformations to be evaluated

DEMO

Spark Streaming

Sparh Streaming is an extension of the core Spark API for real-time processing of live data streams.

Streaming Architecture

Second-scale latencies
"Once and only once" processing
Same API as "Spark batch"

Spark Streaming

Spark divides a stream up into micro-batches of n seconds.

Spark Streaming

Streaming jobs are typically long-running processes.

val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()

Test with nc -lk 9999 + frantic typing!

DEMO PART THE SECOND

Want to know more?

References

Spark Project Home: https://spark.apache.org
Berkeley whitepaper: http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
DataBricks: https://databricks.com

Twitter:

@databricks: Main Spark supporters
@matei_zaharia: Databricks CTO, Spark creator
@TedMalaska: Cloudera, Spark Streaming constributor
@sean_r_owen: Cloudera, Oryx project

Apache Spark

Real-time analytics at scale

General

General

Fast

Easy to Use

Spark offers a range of APIs for:

Word Count in Spark

Scala

Python

Parallel Execution

RDD (Resilient Distributed Dataset)

RDD Operations

Transformations

Actions

DEMO

Spark Streaming

Spark Streaming

Spark Streaming

DEMO PART THE SECOND

Want to know more?

References

Twitter:

Thanks!