Below is the summary of the assignment -
Load necessary libraries -
library(kableExtra)
library(stringr)
library(dplyr)
library(tidyr)
library(scales)
library(ggplot2)
Below are the steps to be followed to generate the final output -
airlineDF <- read.csv("https://raw.githubusercontent.com/soumya2g/R-CUNY-MSDS/master/DATA-607/Airline%20Delays/airlines.csv",header = TRUE, stringsAsFactors = FALSE)
head(airlineDF) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
airline | status | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
airlineTidyDF <- airlineDF %>% gather(key = destination, value = "No. of Flights", -airline, -status)
airlineTidyDF$destination <- str_replace_all(airlineTidyDF$destination, "\\."," ")
airlineTidyDF %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
airline | status | destination | No. of Flights |
---|---|---|---|
ALASKA | on time | Los Angeles | 497 |
ALASKA | delayed | Los Angeles | 62 |
AM WEST | on time | Los Angeles | 694 |
AM WEST | delayed | Los Angeles | 117 |
ALASKA | on time | Phoenix | 221 |
ALASKA | delayed | Phoenix | 12 |
AM WEST | on time | Phoenix | 4840 |
AM WEST | delayed | Phoenix | 415 |
ALASKA | on time | San Diego | 212 |
ALASKA | delayed | San Diego | 20 |
AM WEST | on time | San Diego | 383 |
AM WEST | delayed | San Diego | 65 |
ALASKA | on time | San Francisco | 503 |
ALASKA | delayed | San Francisco | 102 |
AM WEST | on time | San Francisco | 320 |
AM WEST | delayed | San Francisco | 129 |
ALASKA | on time | Seattle | 1841 |
ALASKA | delayed | Seattle | 305 |
AM WEST | on time | Seattle | 201 |
AM WEST | delayed | Seattle | 61 |
After spreading the data on ‘status’, using dplyr mutate() function, I have calculated ‘total_flights’ and ‘deayed_percentage’ for each airline and destination combination.
airlineDF1 <- airlineTidyDF %>% spread(status,"No. of Flights") %>% mutate(total_flights = (delayed + `on time`)) %>% mutate(delayed_percent = (delayed/total_flights))
airlineDF1 %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
airline | destination | delayed | on time | total_flights | delayed_percent |
---|---|---|---|---|---|
ALASKA | Los Angeles | 62 | 497 | 559 | 0.1109123 |
ALASKA | Phoenix | 12 | 221 | 233 | 0.0515021 |
ALASKA | San Diego | 20 | 212 | 232 | 0.0862069 |
ALASKA | San Francisco | 102 | 503 | 605 | 0.1685950 |
ALASKA | Seattle | 305 | 1841 | 2146 | 0.1421249 |
AM WEST | Los Angeles | 117 | 694 | 811 | 0.1442663 |
AM WEST | Phoenix | 415 | 4840 | 5255 | 0.0789724 |
AM WEST | San Diego | 65 | 383 | 448 | 0.1450893 |
AM WEST | San Francisco | 129 | 320 | 449 | 0.2873051 |
AM WEST | Seattle | 61 | 201 | 262 | 0.2328244 |
Based on the summerized data by airline across all destinations and the Bar plot below ALASKA arline shows a higher percentage of flight delays compared to AM WEST airline.
airlineSummaryDF <- airlineDF1 %>% group_by(airline) %>% summarise(total_delayed = sum(delayed), total_on_time = sum(`on time`), total_fights = sum(total_flights)) %>% mutate(delayed_percent = total_delayed/total_fights)
airlineSummaryDF %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="200px")
airline | total_delayed | total_on_time | total_fights | delayed_percent |
---|---|---|---|---|
ALASKA | 501 | 3274 | 3775 | 0.1327152 |
AM WEST | 787 | 6438 | 7225 | 0.1089273 |
## Bar plot with overall Delayed% by airline
ggplot(airlineSummaryDF, aes(y=delayed_percent, x=airline, color=airline, fill=airline)) +
geom_bar( stat="identity") +
geom_text(aes(label = percent(delayed_percent)), vjust = -.5) +
scale_y_continuous(labels = scales::percent)
Based on the below Grouped Bar plot, in all of the destinations AM WEST has higher percentage of delayed flights compared to ALASKA.
ggplot(airlineDF1, aes(y=delayed_percent, x=airline, color=airline, fill=airline)) +
geom_bar( stat="identity") +
geom_text(aes(label = percent(delayed_percent)), vjust = -.5) +
facet_wrap(~destination) +
scale_y_continuous(labels = scales::percent)
ggplot(airlineTidyDF, aes(x = airline, y = `No. of Flights`, fill = status)) +
geom_bar(stat = 'identity', position = "stack") +
geom_text(aes(label = `No. of Flights`), position = position_stack(vjust = 0.5)) +
facet_grid(~ destination)
Comparing above 3 Bar Plots, few interesting facts can be highlighted -
Even though, ALASKA airline shows a better track record in terms of having lesser percentage of flight delays when analyzed by individual destinations but overall ALASKA arline has a higher percentage of delayed flights compared to AM WEST.
In Phoenix, AM WEST airline operates many more flights (5,000+) compared to ALASKA (200+). But the percentages of delays for AM WEST airline is only 2.7% higher compared to ALASKA airline.
In Seattle, ALASKA operates more flights (2146 vs. 262) compared to AM WEST, but has a much lower (9.1% lower) flight delay percentage compared to AM WEST.