Tom Swann
Spark is a
general purpose cluster computing engine, that is fast and easy to use.
DatabricksSpark can handle batch and low-latency workloads.
Spark powers a unified stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Spark caches data in-memory across the cluster.
100X faster than MapReduce in-memoryon diskIterative processes such as many machine learning algorithms benefit from caching data in-memory.
MapReduce typically performs poorly in these scenarios.
Spark supports functional programming. It provides over 80 high-level operators that make it easy to build highly parallel applications.
You can use Spark interactively from the Scala and Python shells.
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
SparkContext to create RDDsexecutorsparallelisedRDDs are the fundamental unit of work in Spark
DataSet: Can be created from file, programmatically or from another RDDDistributed: cached in memory across cluster nodesResilient: if data in memory is lost, it can be recreatedSpark depends heavily on concepts of functional programming
Processing on an RDD forms a DAG [Directed Acyclic Graph] consisting of two types of operation.
new RDD (RDDs are immutable)lazilyvaluesevaluatedSparh Streaming is an extension of the core Spark API for real-time processing of live data streams.
Second-scale latenciesSpark divides a stream up into micro-batches of n seconds.
Streaming jobs are typically long-running processes.
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
Test with nc -lk 9999 + frantic typing!
https://spark.apache.orghttp://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdfhttps://databricks.com@databricks: Main Spark supporters@matei_zaharia: Databricks CTO, Spark creator@TedMalaska: Cloudera, Spark Streaming constributor@sean_r_owen: Cloudera, Oryx project