I'm going to use R and Spark for a sample data exploration task, to explore R's capabilities for distributed data processing with Spark
Setup
Almost no installation is needed for this tutorial, but here are the tools required to run this script on your machine.
Java, which I assume is already installed on your machine. Why Java? The answer is very simple: Spark is written in Scala, which is a JVM (Java Virtual Machine) language. You can download Java from https://java.com/en/download/
The latest Spark binaries, version 2.2. You can download them from https://spark.apache.org/downloads.html and follow the setup instructions. If you don't, no worries, I will do this step for you ;)
R version 3
RStudio to write this notebook
ggplot2 for graphs
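The Spark step in the list above can be scripted. The chunk below is a minimal sketch that points SPARK_HOME at a Spark 2.2.0 folder when the variable is not already set. The folder name spark-2.2.0-bin-hadoop2.7 and its location under the current working directory are assumptions about where the binaries were unpacked, so adjust them to your own path.

```r
# If SPARK_HOME is not set, point it at a local Spark 2.2.0 folder.
# Assumption: the binaries were unpacked into the current working directory
# under the name "spark-2.2.0-bin-hadoop2.7".
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = file.path(getwd(), "spark-2.2.0-bin-hadoop2.7"))
}

# Show the value that SparkR will be loaded from
Sys.getenv("SPARK_HOME")
```

If SPARK_HOME is already defined in your environment, the chunk leaves it untouched.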
- SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark 2.2.0, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.
- The entry point into SparkR is the SparkSession, which connects your R program to a Spark cluster. You can create a SparkSession using sparkR.session and pass in options such as the application name and any Spark packages depended on.
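A sketch of loading SparkR and starting a session, assuming SPARK_HOME points at a Spark installation that ships the SparkR package under R/lib. The master local[*] runs Spark on this machine with all available cores, and the 2g driver memory is only an example value.

```r
library(ggplot2)
library(dplyr)

# Load SparkR from the Spark installation that SPARK_HOME points to.
# It is attached after dplyr, so SparkR's filter(), summarize() etc. take precedence.
library(SparkR, lib.loc = c(normalizePath(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))))

# Start (or reuse) a local Spark session; the driver memory is an example value
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
```

When you are done, sparkR.session.stop() shuts the session down.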
Creating SparkDataFrames
With a SparkSession, applications can create SparkDataFrames from:
A local R data frame, for example "df <- as.DataFrame(faithful)".
A Hive table, for example "results <- sql('FROM src SELECT key, value')".
Other data sources, which is our case: reading data from a CSV file.
The following dataset represents the flights in 2015 for x airlines, so my first step is to read it. My analysis follows a Q/A style: I ask some questions and try to answer each one with a simple graph.
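# Path to the 2015 flight summary CSV, relative to the working directory
csvPath <- "data/flight-data/csv/2015-summary.csv"
# Read the CSV into a SparkDataFrame, using the header row and letting Spark infer the column types
df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")
head(df)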
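Each question that follows comes down to summing count per country and sorting the totals. As a Spark-free illustration of that aggregation, here is a sketch in base R. The rows are hypothetical and only mirror the three columns of the CSV; they are not taken from the dataset.

```r
# Hypothetical rows with the same three columns as the flight summary CSV
flights <- data.frame(
  DEST_COUNTRY_NAME   = c("United States", "United States", "Canada", "Mexico"),
  ORIGIN_COUNTRY_NAME = c("Canada", "Mexico", "United States", "United States"),
  count               = c(10, 5, 7, 3),
  stringsAsFactors    = FALSE
)

# Sum count per destination: the local analogue of groupBy + summarize
per_dest <- aggregate(count ~ DEST_COUNTRY_NAME, data = flights, FUN = sum)

# Sort by descending total so the busiest destination comes first
per_dest <- per_dest[order(-per_dest$count), ]
per_dest
```

The SparkR version does the same work, except that the grouping and summing run inside Spark and only the small aggregated result is brought back into R.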
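What are the top 10 countries that people like to travel to most?
# Total flight count per destination country, computed in Spark
new_df <- summarize(groupBy(df, df$DEST_COUNTRY_NAME), count = sum(df$count))
head(new_df)
# Collect the aggregated result into a local R data frame, sorted by descending count
plot_df <- as.data.frame(new_df) %>% dplyr::arrange(dplyr::desc(count))
View(plot_df)
plot_df$DEST_COUNTRY_NAME <- factor(plot_df$DEST_COUNTRY_NAME)
# Plot the 10 destinations with the highest counts
p <- ggplot(plot_df[1:10,], aes(DEST_COUNTRY_NAME, count))
p + geom_point(aes(colour = DEST_COUNTRY_NAME, size = count))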
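One detail about the plot: factor() sorts its levels alphabetically by default, and ggplot2 lays out a discrete axis in level order, so the x axis follows the alphabet rather than the ranking. If you would rather see the countries in rank order, a possible tweak is to pass the levels explicitly. The sketch below uses a hypothetical three-row data frame that is already sorted by count.

```r
# Hypothetical aggregated result, already sorted by descending count
toy <- data.frame(
  DEST_COUNTRY_NAME = c("United States", "Canada", "Mexico"),
  count = c(500, 80, 70),
  stringsAsFactors = FALSE
)

# Default factor(): levels are sorted alphabetically
alphabetical <- factor(toy$DEST_COUNTRY_NAME)
levels(alphabetical)

# Explicit levels: keep the row order, i.e. the ranking by count
ranked <- factor(toy$DEST_COUNTRY_NAME, levels = toy$DEST_COUNTRY_NAME)
levels(ranked)
```

Assigning the ranked factor back to the name column before calling ggplot() would make the axis follow the count ranking.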
What are the top 10 countries that people hate to travel to most?
plot_df$DEST_COUNTRY_NAME <- factor(plot_df$DEST_COUNTRY_NAME)
n <- nrow(plot_df)
# plot_df is sorted by descending count, so its last 10 rows are the least-visited destinations
p <- ggplot(plot_df[(n-9):n,], aes(DEST_COUNTRY_NAME, count))
p + geom_point(aes(colour = DEST_COUNTRY_NAME, size = count))
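Selecting the last ten rows of a sorted data frame is easy to get wrong by one. The sketch below checks the index arithmetic on a hypothetical 25-row data frame: (n - 9):n and tail(x, 10) both give exactly ten rows, while (n - 10):n gives eleven.

```r
# Hypothetical data frame standing in for the sorted plot_df
toy <- data.frame(id = 1:25)
n <- nrow(toy)

# An inclusive range from n - 10 to n covers 11 rows
rows_wide <- nrow(toy[(n - 10):n, , drop = FALSE])

# An inclusive range from n - 9 to n covers exactly 10 rows
rows_exact <- nrow(toy[(n - 9):n, , drop = FALSE])

# tail() avoids the index arithmetic altogether
rows_tail <- nrow(tail(toy, 10))

c(rows_wide, rows_exact, rows_tail)
```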
What are the top 10 countries that people like to leave most?
# Total flight count per origin country, computed in Spark
new_df <- summarize(groupBy(df, df$ORIGIN_COUNTRY_NAME), count = sum(df$count))
head(new_df)
# Collect the aggregated result into a local R data frame, sorted by descending count
plot_df <- as.data.frame(new_df) %>% dplyr::arrange(dplyr::desc(count))
View(plot_df)
plot_df$ORIGIN_COUNTRY_NAME <- factor(plot_df$ORIGIN_COUNTRY_NAME)
# Plot the 10 origins with the highest counts
p <- ggplot(plot_df[1:10,], aes(ORIGIN_COUNTRY_NAME, count))
p + geom_point(aes(colour = ORIGIN_COUNTRY_NAME, size = count))
What are the top 10 countries that people hate to leave most?
plot_df$ORIGIN_COUNTRY_NAME <- factor(plot_df$ORIGIN_COUNTRY_NAME)
n <- nrow(plot_df)
# plot_df is sorted by descending count, so its last 10 rows are the origins with the fewest flights
p <- ggplot(plot_df[(n-9):n,], aes(ORIGIN_COUNTRY_NAME, count))
p + geom_point(aes(colour = ORIGIN_COUNTRY_NAME, size = count))
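What are the top 10 countries from which people like to travel to the U.S. most?
# Keep only the flights whose destination is the United States (the result is a SparkDataFrame)
usa_flights <- filter(df, df$DEST_COUNTRY_NAME == "United States")
# Total flight count per origin country among those flights
new_df <- summarize(groupBy(usa_flights, usa_flights$ORIGIN_COUNTRY_NAME), count = sum(usa_flights$count))
head(new_df)
# Collect the aggregated result into a local R data frame, sorted by descending count
plot_df <- as.data.frame(new_df) %>% dplyr::arrange(dplyr::desc(count))
View(plot_df)
plot_df$ORIGIN_COUNTRY_NAME <- factor(plot_df$ORIGIN_COUNTRY_NAME)
# Plot the 10 origins with the most flights to the United States
p <- ggplot(plot_df[1:10,], aes(ORIGIN_COUNTRY_NAME, count))
p + geom_point(aes(colour = ORIGIN_COUNTRY_NAME, size = count))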
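The steps above collect the whole aggregated table into R and sort it there, which is fine for a per-country summary. As an alternative sketch, the sort and the top-10 cut can also run inside Spark, so that only ten rows reach the R session. This assumes SparkR is on the library path, a session is running, and df is the flights SparkDataFrame read earlier; the names usa, agg and top10 are only illustrative.

```r
library(SparkR)  # assumes SparkR is on the library path and a session is already running

# Assumes `df` is the flights SparkDataFrame read from the CSV
usa <- SparkR::filter(df, df$DEST_COUNTRY_NAME == "United States")

# Total flight count per origin country, still a SparkDataFrame
agg <- SparkR::summarize(
  SparkR::groupBy(usa, usa$ORIGIN_COUNTRY_NAME),
  count = sum(usa$count)
)

# Sort in Spark and bring back only the first 10 rows as a local data.frame
top10 <- head(SparkR::arrange(agg, SparkR::desc(agg$count)), 10)
top10
```

The SparkR:: prefixes are there because dplyr exports functions with the same names (filter, summarize, arrange, desc), and which one wins depends on the order in which the packages were attached.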
What are the top 10 countries from which people hate to travel to the U.S. most?
plot_df$ORIGIN_COUNTRY_NAME <- factor(plot_df$ORIGIN_COUNTRY_NAME)
n <- nrow(plot_df)
# plot_df is sorted by descending count, so its last 10 rows are the origins with the fewest flights to the U.S.
p <- ggplot(plot_df[(n-9):n,], aes(ORIGIN_COUNTRY_NAME, count))
p + geom_point(aes(colour = ORIGIN_COUNTRY_NAME, size = count))