Apache Spark

Real-time analytics at scale

Tom Swann

Spark is a general purpose cluster computing engine, that is fast and easy to use.

     

  • Created at UC Berkeley AMPLab
  • 100% Open Source
  • Major commercial supporter is Databricks

General

Spark can handle batch and low-latency workloads.

Spark Stack

Spark powers a unified stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

General

Fast

Spark caches data in-memory across the cluster.

Logistic Regression

  • Programs can run up to 100X faster than MapReduce in-memory
  • Crucuially, they are also faster on disk

Iterative processes such as many machine learning algorithms benefit from caching data in-memory.

MapReduce typically performs poorly in these scenarios.

Easy to Use

Spark offers a range of APIs for:

  • Scala
  • Python
  • Java
  • SQL

Spark supports functional programming. It provides over 80 high-level operators that make it easy to build highly parallel applications.

You can use Spark interactively from the Scala and Python shells.

Word Count in Spark

Scala

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Python

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Parallel Execution

Spark Workers

  • Use SparkContext to create RDDs
  • Manages a number of nodes called executors
  • RDD functions are automatically parallelised

RDD (Resilient Distributed Dataset)

RDDs are the fundamental unit of work in Spark

  • DataSet: Can be created from file, programmatically or from another RDD
  • Distributed: cached in memory across cluster nodes
  • Resilient: if data in memory is lost, it can be recreated

Spark depends heavily on concepts of functional programming

  • Functions have input and output, no state or side effects
  • Spark programming consists primarily of performing functional operations on RDDs

RDD Operations

Processing on an RDD forms a DAG [Directed Acyclic Graph] consisting of two types of operation.

Transformations

RDD Transformations

  • Defines a new RDD (RDDs are immutable)
  • May be chained together
  • Execute lazily

Actions

RDD Actions

  • Return values
  • Causes transformations to be evaluated

DEMO

Spark Streaming

Sparh Streaming is an extension of the core Spark API for real-time processing of live data streams.

Streaming Architecture

  • Second-scale latencies
  • "Once and only once" processing
  • Same API as "Spark batch"

Spark Streaming

Spark divides a stream up into micro-batches of n seconds.

Spark Streaming

Spark Streaming

Streaming jobs are typically long-running processes.

val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()

Test with nc -lk 9999 + frantic typing!

DEMO PART THE SECOND

Want to know more?

References

  • Spark Project Home: https://spark.apache.org
  • Berkeley whitepaper: http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
  • DataBricks: https://databricks.com

Twitter:

  • @databricks: Main Spark supporters
  • @matei_zaharia: Databricks CTO, Spark creator
  • @TedMalaska: Cloudera, Spark Streaming constributor
  • @sean_r_owen: Cloudera, Oryx project

Thanks!