I am going to use R and Spark for a sample data exploration task, to explore R's capabilities for using Spark's distributed data processing.

Setup

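Almost no installation is needed for this tutorial, but here are the tools required to run this script on your machine:

- Java, which I assume is already installed on your machine. Why Java? Because Spark is written in Scala, which is a JVM (Java Virtual Machine) language. You can download Java from https://java.com/en/download/

- The Spark 2.2 binaries, which you can download from https://spark.apache.org/downloads.html (follow the setup instructions there)

- R version 3

- RStudio, to write this notebook

- ggplot2, for graphs
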
- SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark 2.2.0, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.

- The entry point into SparkR is the SparkSession, which connects your R program to a Spark cluster. You can create a SparkSession using sparkR.session and pass in options such as the application name and any Spark packages your program depends on.
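As a rough illustration of the entry point described above, here is a minimal sketch of starting a local session. It assumes SPARK_HOME points at an unpacked Spark 2.2 distribution; the helper name start_local_session and the application name are made up for this example.

```r
# Minimal sketch: start a local SparkR session.
# Assumption: SPARK_HOME points at an unpacked Spark distribution.
start_local_session <- function(driver_memory = "2g") {
  spark_home <- Sys.getenv("SPARK_HOME")
  if (!nzchar(spark_home)) stop("SPARK_HOME is not set")

  # SparkR ships inside the Spark distribution, under R/lib.
  library(SparkR, lib.loc = file.path(spark_home, "R", "lib"))

  sparkR.session(
    master = "local[*]",                 # use all local cores
    appName = "flight-exploration",      # hypothetical application name
    sparkConfig = list(spark.driver.memory = driver_memory)
  )
}
```

Calling start_local_session() once at the top of the notebook should be enough; sparkR.session.stop() ends the session when you are done.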

Creating SparkDataFrames

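With a SparkSession, applications can create SparkDataFrames from:

- A local R data frame, for example df <- as.DataFrame(faithful)

- A Hive table, for example results <- sql('FROM src SELECT key, value')

- Other data sources, which is our case here: reading data from a CSV file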

The following dataset represents the flights in 2015 for x airlines, so my first step is to read the dataset. My analysis follows a Q/A style: I ask some questions and try to answer each one with a simple graph.

csvPath <- "data/flight-data/csv/2015-summary.csv"
df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")
head(df)

What are the top 10 countries that people most like to travel to?

new_df <- summarize(groupBy(df, df$DEST_COUNTRY_NAME), count = sum(df$count))
head(new_df)

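# Collect into a local data frame, then sort locally. dplyr::desc is spelled out
# because desc() resolves to SparkR's Column method when SparkR is attached after dplyr.
plot_df <- as.data.frame(new_df) %>% dplyr::arrange(dplyr::desc(count))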
View(plot_df)
plot_df$DEST_COUNTRY_NAME <- factor(plot_df$DEST_COUNTRY_NAME)

p <- ggplot(plot_df[1:10,], aes(DEST_COUNTRY_NAME, count))
p + geom_point(aes(colour = DEST_COUNTRY_NAME, size = count))
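One detail worth knowing about the plot above: factor() orders its levels alphabetically, so the x-axis does not follow the ranking. A small sketch of one alternative, reorder(), on made-up toy values (not taken from the flight data):

```r
# Made-up toy values standing in for a few rows of plot_df.
toy <- data.frame(
  DEST_COUNTRY_NAME = c("A", "B", "C"),
  count = c(5, 20, 10),
  stringsAsFactors = FALSE
)

# factor() sorts levels alphabetically; reorder() sorts them by a statistic.
alphabetical <- factor(toy$DEST_COUNTRY_NAME)
by_count <- reorder(toy$DEST_COUNTRY_NAME, -toy$count)

levels(alphabetical)  # "A" "B" "C"
levels(by_count)      # "B" "C" "A"

# With the reordered factor, the x-axis runs from largest to smallest count.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  toy$DEST_COUNTRY_NAME <- by_count
  p <- ggplot2::ggplot(toy, ggplot2::aes(DEST_COUNTRY_NAME, count)) +
    ggplot2::geom_point(ggplot2::aes(size = count))
}
```

Whether to sort the axis this way is a presentation choice; the alphabetical order in the notebook is not wrong, just harder to read as a ranking.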

What are the top 10 countries that people least like to travel to?

plot_df$DEST_COUNTRY_NAME <- factor(plot_df$DEST_COUNTRY_NAME)
n <- nrow(plot_df)
# The last 10 rows are (n-9):n; (n-10):n would select 11 rows.
p <- ggplot(plot_df[(n-9):n,], aes(DEST_COUNTRY_NAME, count))
p + geom_point(aes(colour = DEST_COUNTRY_NAME, size = count))
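When slicing the last rows of a sorted data frame by index, the bounds are easy to get off by one: (n - 10):n spans 11 rows, not 10. A quick check on a made-up 15-row ranking shows the difference, along with tail() as a way to avoid the arithmetic:

```r
# Made-up ranking with 15 rows, already sorted by descending count.
ranked <- data.frame(country = LETTERS[1:15], count = 15:1)
n <- nrow(ranked)

eleven_rows <- ranked[(n - 10):n, ]   # indices 5..15
last_ten    <- ranked[(n - 9):n, ]    # indices 6..15
same_ten    <- tail(ranked, 10)       # no index arithmetic needed

nrow(eleven_rows)  # 11
nrow(last_ten)     # 10
nrow(same_ten)     # 10
```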

What are the top 10 countries that people most like to leave?

new_df <- summarize(groupBy(df, df$ORIGIN_COUNTRY_NAME), count = sum(df$count))
head(new_df)

# Sort locally; dplyr::desc avoids picking up SparkR's desc() for Columns.
plot_df <- as.data.frame(new_df) %>% dplyr::arrange(dplyr::desc(count))
View(plot_df)
plot_df$ORIGIN_COUNTRY_NAME <- factor(plot_df$ORIGIN_COUNTRY_NAME)

p <- ggplot(plot_df[1:10,], aes(ORIGIN_COUNTRY_NAME, count))
p + geom_point(aes(colour = ORIGIN_COUNTRY_NAME, size = count))
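The destination and origin questions repeat the same group, sum, and sort steps, so they could be folded into one function. The sketch below is a hypothetical helper (top_counts is not part of the notebook or of SparkR); it assumes SparkR is loaded with an active session and that the SparkDataFrame has a count column, as this dataset does. Unlike the notebook, it sorts inside Spark and collects only the first n rows.

```r
# Hypothetical helper: total flights per value of `key`, sorted in Spark,
# with only the first `n` rows collected into a local data.frame.
# Assumes SparkR is loaded and a session is active.
top_counts <- function(sdf, key, n = 10) {
  totals <- SparkR::summarize(
    SparkR::groupBy(sdf, sdf[[key]]),
    count = sum(sdf$count)   # sum() dispatches to SparkR's Column method
  )
  ranked <- SparkR::arrange(totals, SparkR::desc(totals$count))
  SparkR::take(ranked, n)    # take() returns the first n rows as a data.frame
}
```

A call such as top_counts(df, "ORIGIN_COUNTRY_NAME") would then give a 10-row local data frame ready for ggplot. For a dataset this small the difference is negligible, but on large data it avoids pulling every group into R.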

What are the top 10 countries that people least like to leave?

plot_df$ORIGIN_COUNTRY_NAME <- factor(plot_df$ORIGIN_COUNTRY_NAME)
n <- nrow(plot_df)
# The last 10 rows are (n-9):n; (n-10):n would select 11 rows.
p <- ggplot(plot_df[(n-9):n,], aes(ORIGIN_COUNTRY_NAME, count))
p + geom_point(aes(colour = ORIGIN_COUNTRY_NAME, size = count))

What are the top 10 countries from which people most like to travel to the U.S.?


usa_flights_rdd <- filter(df, df$DEST_COUNTRY_NAME == "United States")

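# Group and aggregate on the filtered SparkDataFrame's own columns rather than the parent df's.
new_df <- summarize(groupBy(usa_flights_rdd, usa_flights_rdd$ORIGIN_COUNTRY_NAME), count = sum(usa_flights_rdd$count))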
head(new_df)

# Sort locally; dplyr::desc avoids picking up SparkR's desc() for Columns.
plot_df <- as.data.frame(new_df) %>% dplyr::arrange(dplyr::desc(count))
View(plot_df)
plot_df$ORIGIN_COUNTRY_NAME <- factor(plot_df$ORIGIN_COUNTRY_NAME)

p <- ggplot(plot_df[1:10,], aes(ORIGIN_COUNTRY_NAME, count))
p + geom_point(aes(colour = ORIGIN_COUNTRY_NAME, size = count))

What are the top 10 countries from which people least like to travel to the U.S.?

plot_df$ORIGIN_COUNTRY_NAME <- factor(plot_df$ORIGIN_COUNTRY_NAME)
n <- nrow(plot_df)
# The last 10 rows are (n-9):n; (n-10):n would select 11 rows.
p <- ggplot(plot_df[(n-9):n,], aes(ORIGIN_COUNTRY_NAME, count))
p + geom_point(aes(colour = ORIGIN_COUNTRY_NAME, size = count))

References

- SparkR Tutorials: https://spark.apache.org/docs/latest/sparkr.html