What does it mean to you, as a passenger to be on a delayed flight ? Yes, the frustration, anxiety and temptation to lodge a complaint is very natural (unless you are like me, who would like to sleep for those few extra minutes). But ever wondered, what it means to the airlines ?
Recognizing the significant strain placed by flight delays on the U.S. air transportation system, in August 2008 the Federal Aviation Administration (FAA) commissioned five NEXTOR universities (UC Berkeley, MIT, George Mason University, the University of Maryland and Virginia Tech) and the Brattle Group to conduct a comprehensive study on the total delay impact (TDI) in the United States.
In 2007, domestic flight delays were found to cost the U.S. economy $31.2 billion in 2007, including $8.3 billion in direct costs to airlines, $16.7 billion in direct costs to passengers, $2.2 billion from lost demand and $4.0 billion in forgone GDP.Source:airlines.org
Let’s pick two airlines specifically and analyze their delay times.
Step 1 is to load the csv from the github library and to load the required R libraries.
knitr::opts_chunk$set(eval = TRUE, results = FALSE)
library(RCurl)
library(knitr)
flights_url <- getURL("https://raw.githubusercontent.com/BharaniNittala/DATA-607/master/flights_dataset.csv")
flights_raw <- read.csv(text = flights_url)
knitr:: kable(flights_raw)
| X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| delayed | 62 | 12 | 20 | 102 | 305 | |
| NA | NA | NA | NA | NA | ||
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| delayed | 117 | 415 | 65 | 129 | 61 |
library(tidyverse)
#Tried 'fill' but somehow it did not work, so have to try rudimentary way
airline <- c("ALASKA","ALASKA", "","AM WEST","AM WEST")
flights_raw[1] <- airline
col_name <- colnames(flights_raw)
tall_raw <- gather(flights_raw,col_name[3:7],key = "Destination",value = "Total_flights", na.rm = TRUE)
# The above step can also be done using pivot_longer() function, let's give it a try
flights_tall <- pivot_longer(flights_raw,col_name[3:7],names_to = "Destination", values_to = "Total_flights",values_drop_na = TRUE)
flights_tall <- dplyr::rename(flights_tall, Airline = X)
flights_tall <- dplyr::rename(flights_tall, Status = X.1)
# Let's write the table in quality table
library(formattable)
formattable(flights_tall, list(
Airline = color_tile("light blue", "orange"),
Status = formatter("span", style = x ~ ifelse(x == "delayed",
style(color = "red", font.weight = "bold"), NA)),
Total_flights = color_bar("gray", proportion)))
## Warning in gradient(as.numeric(x), ...): NAs introduced by coercion
| Airline | Status | Destination | Total_flights |
|---|---|---|---|
| ALASKA | on time | Los.Angeles | 497 |
| ALASKA | on time | Phoenix | 221 |
| ALASKA | on time | San.Diego | 212 |
| ALASKA | on time | San.Francisco | 503 |
| ALASKA | on time | Seattle | 1841 |
| ALASKA | delayed | Los.Angeles | 62 |
| ALASKA | delayed | Phoenix | 12 |
| ALASKA | delayed | San.Diego | 20 |
| ALASKA | delayed | San.Francisco | 102 |
| ALASKA | delayed | Seattle | 305 |
| AM WEST | on time | Los.Angeles | 694 |
| AM WEST | on time | Phoenix | 4840 |
| AM WEST | on time | San.Diego | 383 |
| AM WEST | on time | San.Francisco | 320 |
| AM WEST | on time | Seattle | 201 |
| AM WEST | delayed | Los.Angeles | 117 |
| AM WEST | delayed | Phoenix | 415 |
| AM WEST | delayed | San.Diego | 65 |
| AM WEST | delayed | San.Francisco | 129 |
| AM WEST | delayed | Seattle | 61 |
To calculate the performance, we need to tweak the table structure a bit to obtain total flights by status.
flights_tall_stat <- pivot_wider(flights_tall, names_from = "Status", values_from = "Total_flights")
flights_tall_stat <- mutate(flights_tall_stat, "Total_flights" = round(flights_tall_stat$delayed+flights_tall_stat$`on time`))
flights_tall_stat <- mutate(flights_tall_stat, "delayed_flights_pct" = round(( flights_tall_stat$delayed / Total_flights ) * 100, 2), "on_time_flights_pct" = round(( flights_tall_stat$`on time` / Total_flights ) * 100, 2) )
Question 1: Compare the per-city on-time performance for both airlines
dodger = position_dodge(width = 0.9)
ggplot(flights_tall_stat, aes(y=on_time_flights_pct, x=Destination, color=Airline, fill=Airline)) +
geom_bar( stat="identity",position = dodger, color="black")+
geom_text(aes(label=on_time_flights_pct),color = "blue",position = dodger,vjust=-0.5)+
scale_fill_manual(values = alpha(c("#000000", "#FF5733"))) +
ylab("Percentage of On-time Flights")
Question 2: Compare the overall on-time performance for both airlines
airline_level <- flights_tall_stat[-c(2,6,7)] %>%
group_by(Airline) %>%
summarise_each(funs(sum))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
airline_level <- mutate(airline_level, "on_time_flights_pct" = round(( airline_level$`on time` / Total_flights ) * 100, 2))
dodger = position_dodge(width = 0.9)
ggplot(airline_level, aes(y=on_time_flights_pct, x=Airline, color=Airline, fill=Airline)) +
geom_bar( stat="identity",position = dodger, color="black")+
geom_text(aes(label=on_time_flights_pct),color = "blue",position = dodger,vjust=-0.5)+
scale_fill_manual(values = alpha(c("#000000", "#FF5733"))) +
ylab("Percentage of On-time Flights")
Question 3: Explain the apparent discrepancy between the per-city and overall performances
We see that on time performance when aggregated at overall airline level shows AM West to be best amonng AM West and Alaska but when we look at by destination, in each of the cases the Alaska flights had better on time performance
Question 4: Provide an interesting paradoxical conclusion
This is because of a phenamenon called “Simpson’s Paradox”. This happens when one conducts analysis ignoring the confounding variable (here destination). More about Simpson’s Paradox