Assignment Summary:

Below is the summary of the assignment -

R Libraries:

Load necessary libraries -

library(kableExtra)
library(stringr)
library(dplyr)
library(tidyr)
library(scales)
library(ggplot2)

Data Processing Steps :

Below are the steps to be followed to generate the final output -

  1. Create a CSV file in a “Wide” format which has the same information as the table shown above in the problem statement:
  1. Import data into R:
airlineDF <- read.csv("https://raw.githubusercontent.com/soumya2g/R-CUNY-MSDS/master/DATA-607/Airline%20Delays/airlines.csv",header = TRUE, stringsAsFactors = FALSE)
head(airlineDF) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61
  1. Use tidyr function gather() to unpivot destination columns into a variable and replace ‘.’ character in Destination variable with a space character:
airlineTidyDF <- airlineDF %>% gather(key = destination, value = "No. of Flights", -airline, -status) 
airlineTidyDF$destination <- str_replace_all(airlineTidyDF$destination, "\\."," ")
airlineTidyDF  %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
airline status destination No. of Flights
ALASKA on time Los Angeles 497
ALASKA delayed Los Angeles 62
AM WEST on time Los Angeles 694
AM WEST delayed Los Angeles 117
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12
AM WEST on time Phoenix 4840
AM WEST delayed Phoenix 415
ALASKA on time San Diego 212
ALASKA delayed San Diego 20
AM WEST on time San Diego 383
AM WEST delayed San Diego 65
ALASKA on time San Francisco 503
ALASKA delayed San Francisco 102
AM WEST on time San Francisco 320
AM WEST delayed San Francisco 129
ALASKA on time Seattle 1841
ALASKA delayed Seattle 305
AM WEST on time Seattle 201
AM WEST delayed Seattle 61
  1. Use tidyr spread() function on ‘status’ column to to have two separate columns of ‘on time’ and ‘delayed’ flights:

After spreading the data on ‘status’, using dplyr mutate() function, I have calculated ‘total_flights’ and ‘deayed_percentage’ for each airline and destination combination.

airlineDF1 <- airlineTidyDF %>% spread(status,"No. of Flights") %>% mutate(total_flights = (delayed + `on time`)) %>%  mutate(delayed_percent = (delayed/total_flights)) 

airlineDF1 %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
airline destination delayed on time total_flights delayed_percent
ALASKA Los Angeles 62 497 559 0.1109123
ALASKA Phoenix 12 221 233 0.0515021
ALASKA San Diego 20 212 232 0.0862069
ALASKA San Francisco 102 503 605 0.1685950
ALASKA Seattle 305 1841 2146 0.1421249
AM WEST Los Angeles 117 694 811 0.1442663
AM WEST Phoenix 415 4840 5255 0.0789724
AM WEST San Diego 65 383 448 0.1450893
AM WEST San Francisco 129 320 449 0.2873051
AM WEST Seattle 61 201 262 0.2328244
  1. Summerize the data set by airline:

Based on the summerized data by airline across all destinations and the Bar plot below ALASKA arline shows a higher percentage of flight delays compared to AM WEST airline.

airlineSummaryDF <- airlineDF1 %>% group_by(airline) %>% summarise(total_delayed = sum(delayed), total_on_time = sum(`on time`), total_fights = sum(total_flights)) %>% mutate(delayed_percent = total_delayed/total_fights)

airlineSummaryDF  %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="200px")
airline total_delayed total_on_time total_fights delayed_percent
ALASKA 501 3274 3775 0.1327152
AM WEST 787 6438 7225 0.1089273
## Bar plot with overall Delayed% by airline

ggplot(airlineSummaryDF, aes(y=delayed_percent, x=airline, color=airline, fill=airline)) + 
    geom_bar( stat="identity") +
    geom_text(aes(label = percent(delayed_percent)), vjust = -.5) +
    scale_y_continuous(labels = scales::percent)

  1. Plot Delayed% by Airline for each Destination:

Based on the below Grouped Bar plot, in all of the destinations AM WEST has higher percentage of delayed flights compared to ALASKA.

ggplot(airlineDF1, aes(y=delayed_percent, x=airline, color=airline, fill=airline)) + 
    geom_bar( stat="identity") +
    geom_text(aes(label = percent(delayed_percent)), vjust = -.5) +
    facet_wrap(~destination) + 
    scale_y_continuous(labels = scales::percent)

  1. Comparing the ‘on time’ and delayed ‘delayed’ flight count comarison:
ggplot(airlineTidyDF, aes(x = airline, y = `No. of Flights`, fill = status)) + 
  geom_bar(stat = 'identity', position = "stack") + 
  geom_text(aes(label = `No. of Flights`), position = position_stack(vjust = 0.5)) +
  facet_grid(~ destination)

Conclusion:

Comparing above 3 Bar Plots, few interesting facts can be highlighted -