Setup
Almost nothing needs to be installed for this tutorial, but here are the tools required to run this script on your machine.
- Java, which I assume is already installed on your machine. Why Java? The answer is simple: Spark is written in Scala, which is a JVM (Java Virtual Machine) language. If Java is missing, download and install it first.
- Spark 2.2 binaries (the latest version at the time of writing). Download them and follow the setup instructions; if you don't, no worries, the script below covers this step for you ;)
- R version 3
- RStudio to write this notebook
- ggplot2 and plotly for graphs
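# Check whether SPARK_HOME is declared as an environment variable.
# If it is not, point it at a Spark 2.2.0 folder in the working directory.
# Uncomment the download.file() and untar() lines to fetch the binaries first.
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  # download.file(url =
  #   'https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz',
  #   destfile = 'spark-2.2.0-bin-hadoop2.7.tgz')
  # untar('spark-2.2.0-bin-hadoop2.7.tgz')
  Sys.setenv(SPARK_HOME = paste0(getwd(), "/spark-2.2.0-bin-hadoop2.7"))
}
# Load the plotting packages first and SparkR last, so that SparkR's
# filter() is the one found by the code below.
library(ggplot2)
library(plotly)  # also provides the %>% pipe
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))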
- SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark 2.2.0, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.
- The entry point into SparkR is the SparkSession which connects your R program to a Spark cluster. You can create a SparkSession using sparkR.session and pass in options such as the application name, any spark packages depended on.
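The session setup described above can be sketched as follows. This is only a minimal sketch, assuming a local Spark install under SPARK_HOME; the `local[*]` master, the application name `flights-2015`, and the driver memory value are illustrative assumptions, not settings taken from this notebook.

```r
# Sketch only: start a local SparkSession from R.
# Assumptions (not from the original notebook): the "local[*]" master,
# the application name, and the driver memory value.
spark_home <- Sys.getenv("SPARK_HOME")
sparkr_lib <- file.path(spark_home, "R", "lib")
app_name <- "flights-2015"
session_config <- list(spark.driver.memory = "2g")

if (nzchar(spark_home) && dir.exists(sparkr_lib)) {
  # SparkR ships inside the Spark distribution, so load it from there.
  library(SparkR, lib.loc = sparkr_lib)
  sparkR.session(master = "local[*]", appName = app_name,
                 sparkConfig = session_config)
} else {
  message("SPARK_HOME is unset or has no R/lib directory; skipping session start.")
}
```

When the analysis is finished, `sparkR.session.stop()` shuts the session down.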
With a SparkSession, applications can create SparkDataFrames from:
- A local R data frame, for example df <- as.DataFrame(faithful).
- A Hive table, for example results <- sql("FROM src SELECT key, value").
- Other data sources, which is our case: reading data from a CSV file.
The following dataset represents the flights in 2015 for a number of airlines, so my first step is to read it. The analysis follows a Q/A style: I ask a question and then try to answer it with a simple graph.
csvPath <- "data/flight-data/csv/2015-summary.csv"
df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")
head(df)
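What are the top 10 countries that people travel to most?
# Sum the flight counts per destination country (runs in Spark)
new_df <- summarize(groupBy(df, df$DEST_COUNTRY_NAME), count = sum(df$count))
head(new_df)
# Collect the result into a local R data frame and sort it, largest first
plot_df <- as.data.frame(new_df) %>% dplyr::arrange(dplyr::desc(count))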
plot_df$DEST_COUNTRY_NAME <- factor(plot_df$DEST_COUNTRY_NAME)
ggplotly({
p <- ggplot(plot_df[1:10,], aes(DEST_COUNTRY_NAME, count))
p + geom_point(aes(colour = DEST_COUNTRY_NAME, size = count))
})
We recommend that you use the dev version of ggplot2 with `ggplotly()`
Install it with: `devtools::install_github('hadley/ggplot2')`
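To make the group, sum, and sort steps easier to follow, here is a small local sketch in base R that does not need Spark. The rows are hypothetical and only mimic the three columns used in this notebook; the real file has many more rows.

```r
# Hypothetical rows that mimic the layout of the flights summary file.
flights <- data.frame(
  DEST_COUNTRY_NAME   = c("United States", "United States", "Egypt", "Canada", "Egypt"),
  ORIGIN_COUNTRY_NAME = c("Egypt", "Canada", "United States", "United States", "Canada"),
  count               = c(15, 8000, 12, 8500, 3),
  stringsAsFactors    = FALSE
)

# Sum the flight counts per destination (the groupBy + sum step).
totals <- aggregate(count ~ DEST_COUNTRY_NAME, data = flights, FUN = sum)

# Sort from busiest to quietest destination (the arrange + desc step).
totals <- totals[order(-totals$count), ]

# Keep at most the top 10 rows; head() is safe when fewer rows exist.
top10 <- head(totals, 10)
print(top10)
```

The SparkR version does the grouping and summing inside Spark and only the small aggregated result is collected into R for plotting.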
What are the 10 countries that people travel to least?
plot_df$DEST_COUNTRY_NAME <- factor(plot_df$DEST_COUNTRY_NAME)
n <- nrow(plot_df)
ggplotly({
  # plot_df is sorted largest first, so the last 10 rows are the least visited
  p <- ggplot(plot_df[(n-9):n,], aes(DEST_COUNTRY_NAME, count))
  p + geom_point(aes(colour = DEST_COUNTRY_NAME, size = count))
})
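A note on slicing the last rows of a sorted data frame: in R, a:b includes both endpoints, so a range that ends at n has to start at n - 9 to cover exactly ten rows. The sketch below checks this on a hypothetical data frame of 25 made-up countries; `tail()` is shown as an alternative that avoids the index arithmetic.

```r
# Hypothetical sorted totals for 25 destinations, largest first.
plot_df <- data.frame(
  DEST_COUNTRY_NAME = paste("Country", 1:25),
  count = 25:1
)
n <- nrow(plot_df)

# (n - 10):n spans 11 indices, because both endpoints are included.
eleven <- plot_df[(n - 10):n, ]

# Either form below returns exactly the 10 smallest totals.
bottom10_index <- plot_df[(n - 9):n, ]
bottom10_tail  <- tail(plot_df, 10)

# Row counts of the three slices.
print(c(nrow(eleven), nrow(bottom10_index), nrow(bottom10_tail)))
```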
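What are the top 10 countries that people leave most?
# Sum the flight counts per origin country (runs in Spark)
new_df <- summarize(groupBy(df, df$ORIGIN_COUNTRY_NAME), count = sum(df$count))
head(new_df)
# Collect the result into a local R data frame and sort it, largest first
plot_df <- as.data.frame(new_df) %>% dplyr::arrange(dplyr::desc(count))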
plot_df$ORIGIN_COUNTRY_NAME <- factor(plot_df$ORIGIN_COUNTRY_NAME)
ggplotly({
p <- ggplot(plot_df[1:10,], aes(ORIGIN_COUNTRY_NAME, count))
p + geom_point(aes(colour = ORIGIN_COUNTRY_NAME, size = count))
})
What are the 10 countries that people leave least?
plot_df$ORIGIN_COUNTRY_NAME <- factor(plot_df$ORIGIN_COUNTRY_NAME)
n <- nrow(plot_df)
ggplotly({
  # plot_df is sorted largest first, so the last 10 rows have the fewest departures
  p <- ggplot(plot_df[(n-9):n,], aes(ORIGIN_COUNTRY_NAME, count))
  p + geom_point(aes(colour = ORIGIN_COUNTRY_NAME, size = count))
})
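What are the top 10 countries of origin for travel to the U.S.?
# Keep only the flights that land in the United States (still a SparkDataFrame)
usa_flights_df <- filter(df, df$DEST_COUNTRY_NAME == "United States")
# Group and sum on the filtered frame's own columns
new_df <- summarize(groupBy(usa_flights_df, usa_flights_df$ORIGIN_COUNTRY_NAME), count = sum(usa_flights_df$count))
head(new_df)
# Collect the result into a local R data frame and sort it, largest first
plot_df <- as.data.frame(new_df) %>% dplyr::arrange(dplyr::desc(count))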
plot_df$ORIGIN_COUNTRY_NAME <- factor(plot_df$ORIGIN_COUNTRY_NAME)
ggplotly({
p <- ggplot(plot_df[1:10,], aes(ORIGIN_COUNTRY_NAME, count))
p + geom_point(aes(colour = ORIGIN_COUNTRY_NAME, size = count))
})
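The filter-then-aggregate pattern used for the U.S. question can also be tried locally without Spark. The sketch below uses base R and hypothetical rows with the same three column names; it filters first and then groups on the filtered data only.

```r
# Hypothetical rows; only the column names match the real dataset.
flights <- data.frame(
  DEST_COUNTRY_NAME   = c("United States", "United States", "Egypt",
                          "United States", "Canada"),
  ORIGIN_COUNTRY_NAME = c("Egypt", "Canada", "United States",
                          "Canada", "United States"),
  count               = c(15, 8000, 12, 500, 8500),
  stringsAsFactors    = FALSE
)

# Step 1: keep only the flights whose destination is the United States.
usa_flights <- subset(flights, DEST_COUNTRY_NAME == "United States")

# Step 2: sum the counts per origin country, using the filtered rows only.
by_origin <- aggregate(count ~ ORIGIN_COUNTRY_NAME, data = usa_flights, FUN = sum)

# Step 3: sort largest first and keep at most 10 rows.
by_origin <- by_origin[order(-by_origin$count), ]
top_origins <- head(by_origin, 10)
print(top_origins)
```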
What are the 10 countries of origin with the least travel to the U.S.?
plot_df$ORIGIN_COUNTRY_NAME <- factor(plot_df$ORIGIN_COUNTRY_NAME)
n <- nrow(plot_df)
ggplotly({
  # plot_df is sorted largest first, so the last 10 rows are the smallest origins
  p <- ggplot(plot_df[(n-9):n,], aes(ORIGIN_COUNTRY_NAME, count))
  p + geom_point(aes(colour = ORIGIN_COUNTRY_NAME, size = count))
})