Overview

What does it mean to you, as a passenger to be on a delayed flight ? Yes, the frustration, anxiety and temptation to lodge a complaint is very natural (unless you are like me, who would like to sleep for those few extra minutes). But ever wondered, what it means to the airlines ?

Recognizing the significant strain placed by flight delays on the U.S. air transportation system, in August 2008 the Federal Aviation Administration (FAA) commissioned five NEXTOR universities (UC Berkeley, MIT, George Mason University, the University of Maryland and Virginia Tech) and the Brattle Group to conduct a comprehensive study on the total delay impact (TDI) in the United States.

In 2007, domestic flight delays were found to cost the U.S. economy $31.2 billion in 2007, including $8.3 billion in direct costs to airlines, $16.7 billion in direct costs to passengers, $2.2 billion from lost demand and $4.0 billion in forgone GDP.Source:airlines.org

Let’s pick two airlines specifically and analyze their delay times.

Read the CSV dataset

Step 1 is to load the csv from the github library and to load the required R libraries.

knitr::opts_chunk$set(eval = TRUE, results = FALSE)

library(RCurl)
library(knitr)


flights_url <- getURL("https://raw.githubusercontent.com/BharaniNittala/DATA-607/master/flights_dataset.csv") 
flights_raw <- read.csv(text = flights_url)

knitr:: kable(flights_raw) 
X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
delayed 62 12 20 102 305
NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
delayed 117 415 65 129 61

Tidy the data with tidyr

library(tidyverse) 

#Tried 'fill' but somehow it did not work, so have to try rudimentary way  
airline <- c("ALASKA","ALASKA", "","AM WEST","AM WEST")

flights_raw[1] <- airline

col_name <- colnames(flights_raw)
tall_raw <- gather(flights_raw,col_name[3:7],key = "Destination",value = "Total_flights", na.rm = TRUE)

# The above step can also be done using pivot_longer() function, let's give it a try
flights_tall <- pivot_longer(flights_raw,col_name[3:7],names_to = "Destination", values_to = "Total_flights",values_drop_na = TRUE)

flights_tall <- dplyr::rename(flights_tall, Airline = X)
flights_tall <- dplyr::rename(flights_tall, Status = X.1)

# Let's write the table in quality table
library(formattable)
formattable(flights_tall, list(
  Airline = color_tile("light blue", "orange"),
  Status = formatter("span", style = x ~ ifelse(x == "delayed", 
  style(color = "red", font.weight = "bold"), NA)),  
  Total_flights = color_bar("gray", proportion)))
## Warning in gradient(as.numeric(x), ...): NAs introduced by coercion
Airline Status Destination Total_flights
ALASKA on time Los.Angeles 497
ALASKA on time Phoenix 221
ALASKA on time San.Diego 212
ALASKA on time San.Francisco 503
ALASKA on time Seattle 1841
ALASKA delayed Los.Angeles 62
ALASKA delayed Phoenix 12
ALASKA delayed San.Diego 20
ALASKA delayed San.Francisco 102
ALASKA delayed Seattle 305
AM WEST on time Los.Angeles 694
AM WEST on time Phoenix 4840
AM WEST on time San.Diego 383
AM WEST on time San.Francisco 320
AM WEST on time Seattle 201
AM WEST delayed Los.Angeles 117
AM WEST delayed Phoenix 415
AM WEST delayed San.Diego 65
AM WEST delayed San.Francisco 129
AM WEST delayed Seattle 61

Analysis

To calculate the performance, we need to tweak the table structure a bit to obtain total flights by status.

flights_tall_stat <- pivot_wider(flights_tall, names_from = "Status", values_from = "Total_flights")
flights_tall_stat <- mutate(flights_tall_stat, "Total_flights" = round(flights_tall_stat$delayed+flights_tall_stat$`on time`))

flights_tall_stat <- mutate(flights_tall_stat, "delayed_flights_pct" = round(( flights_tall_stat$delayed / Total_flights ) * 100, 2), "on_time_flights_pct" = round(( flights_tall_stat$`on time` / Total_flights ) * 100, 2) )

Question 1: Compare the per-city on-time performance for both airlines

dodger = position_dodge(width = 0.9)
ggplot(flights_tall_stat, aes(y=on_time_flights_pct, x=Destination, color=Airline, fill=Airline)) + 
    geom_bar( stat="identity",position = dodger, color="black")+
    geom_text(aes(label=on_time_flights_pct),color = "blue",position = dodger,vjust=-0.5)+
  scale_fill_manual(values = alpha(c("#000000", "#FF5733"))) +
  ylab("Percentage of On-time Flights")

Question 2: Compare the overall on-time performance for both airlines

airline_level <- flights_tall_stat[-c(2,6,7)] %>%
    group_by(Airline) %>% 
    summarise_each(funs(sum))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
airline_level <- mutate(airline_level, "on_time_flights_pct" = round(( airline_level$`on time` / Total_flights ) * 100, 2))

dodger = position_dodge(width = 0.9)
ggplot(airline_level, aes(y=on_time_flights_pct, x=Airline, color=Airline, fill=Airline)) + 
    geom_bar( stat="identity",position = dodger, color="black")+
    geom_text(aes(label=on_time_flights_pct),color = "blue",position = dodger,vjust=-0.5)+
  scale_fill_manual(values = alpha(c("#000000", "#FF5733"))) +
  ylab("Percentage of On-time Flights")

Conclusion

Question 3: Explain the apparent discrepancy between the per-city and overall performances

We see that on time performance when aggregated at overall airline level shows AM West to be best amonng AM West and Alaska but when we look at by destination, in each of the cases the Alaska flights had better on time performance

Question 4: Provide an interesting paradoxical conclusion

This is because of a phenamenon called “Simpson’s Paradox”. This happens when one conducts analysis ignoring the confounding variable (here destination). More about Simpson’s Paradox