Assignment 5: Tidying and Transforming Data

Overview

What does it mean to you, as a passenger to be on a delayed flight ? Yes, the frustration, anxiety and temptation to lodge a complaint is very natural (unless you are like me, who would like to sleep for those few extra minutes). But ever wondered, what it means to the airlines ?

Recognizing the significant strain placed by flight delays on the U.S. air transportation system, in August 2008 the Federal Aviation Administration (FAA) commissioned five NEXTOR universities (UC Berkeley, MIT, George Mason University, the University of Maryland and Virginia Tech) and the Brattle Group to conduct a comprehensive study on the total delay impact (TDI) in the United States.

In 2007, domestic flight delays were found to cost the U.S. economy $31.2 billion in 2007, including $8.3 billion in direct costs to airlines, $16.7 billion in direct costs to passengers, $2.2 billion from lost demand and $4.0 billion in forgone GDP.Source:airlines.org

Let’s pick two airlines specifically and analyze their delay times.

Read the CSV dataset

Step 1 is to load the csv from the github library and to load the required R libraries.

knitr::opts_chunk$set(eval = TRUE, results = FALSE)

library(RCurl)
library(knitr)


flights_url <- getURL("https://raw.githubusercontent.com/BharaniNittala/DATA-607/master/flights_dataset.csv") 
flights_raw <- read.csv(text = flights_url)

knitr:: kable(flights_raw)

X	X.1	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
	delayed	62	12	20	102	305
		NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
	delayed	117	415	65	129	61

Tidy the data with tidyr

library(tidyverse) 

#Tried 'fill' but somehow it did not work, so have to try rudimentary way  
airline <- c("ALASKA","ALASKA", "","AM WEST","AM WEST")

flights_raw[1] <- airline

col_name <- colnames(flights_raw)
tall_raw <- gather(flights_raw,col_name[3:7],key = "Destination",value = "Total_flights", na.rm = TRUE)

# The above step can also be done using pivot_longer() function, let's give it a try
flights_tall <- pivot_longer(flights_raw,col_name[3:7],names_to = "Destination", values_to = "Total_flights",values_drop_na = TRUE)

flights_tall <- dplyr::rename(flights_tall, Airline = X)
flights_tall <- dplyr::rename(flights_tall, Status = X.1)

# Let's write the table in quality table
library(formattable)
formattable(flights_tall, list(
  Airline = color_tile("light blue", "orange"),
  Status = formatter("span", style = x ~ ifelse(x == "delayed", 
  style(color = "red", font.weight = "bold"), NA)),  
  Total_flights = color_bar("gray", proportion)))

## Warning in gradient(as.numeric(x), ...): NAs introduced by coercion

Airline	Status	Destination	Total_flights
ALASKA	on time	Los.Angeles	497
ALASKA	on time	Phoenix	221
ALASKA	on time	San.Diego	212
ALASKA	on time	San.Francisco	503
ALASKA	on time	Seattle	1841
ALASKA	delayed	Los.Angeles	62
ALASKA	delayed	Phoenix	12
ALASKA	delayed	San.Diego	20
ALASKA	delayed	San.Francisco	102
ALASKA	delayed	Seattle	305
AM WEST	on time	Los.Angeles	694
AM WEST	on time	Phoenix	4840
AM WEST	on time	San.Diego	383
AM WEST	on time	San.Francisco	320
AM WEST	on time	Seattle	201
AM WEST	delayed	Los.Angeles	117
AM WEST	delayed	Phoenix	415
AM WEST	delayed	San.Diego	65
AM WEST	delayed	San.Francisco	129
AM WEST	delayed	Seattle	61

Analysis

To calculate the performance, we need to tweak the table structure a bit to obtain total flights by status.

flights_tall_stat <- pivot_wider(flights_tall, names_from = "Status", values_from = "Total_flights")
flights_tall_stat <- mutate(flights_tall_stat, "Total_flights" = round(flights_tall_stat$delayed+flights_tall_stat$`on time`))

flights_tall_stat <- mutate(flights_tall_stat, "delayed_flights_pct" = round(( flights_tall_stat$delayed / Total_flights ) * 100, 2), "on_time_flights_pct" = round(( flights_tall_stat$`on time` / Total_flights ) * 100, 2) )

Question 1: Compare the per-city on-time performance for both airlines

dodger = position_dodge(width = 0.9)
ggplot(flights_tall_stat, aes(y=on_time_flights_pct, x=Destination, color=Airline, fill=Airline)) + 
    geom_bar( stat="identity",position = dodger, color="black")+
    geom_text(aes(label=on_time_flights_pct),color = "blue",position = dodger,vjust=-0.5)+
  scale_fill_manual(values = alpha(c("#000000", "#FF5733"))) +
  ylab("Percentage of On-time Flights")

Question 2: Compare the overall on-time performance for both airlines

airline_level <- flights_tall_stat[-c(2,6,7)] %>%
    group_by(Airline) %>% 
    summarise_each(funs(sum))

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

airline_level <- mutate(airline_level, "on_time_flights_pct" = round(( airline_level$`on time` / Total_flights ) * 100, 2))

dodger = position_dodge(width = 0.9)
ggplot(airline_level, aes(y=on_time_flights_pct, x=Airline, color=Airline, fill=Airline)) + 
    geom_bar( stat="identity",position = dodger, color="black")+
    geom_text(aes(label=on_time_flights_pct),color = "blue",position = dodger,vjust=-0.5)+
  scale_fill_manual(values = alpha(c("#000000", "#FF5733"))) +
  ylab("Percentage of On-time Flights")

Conclusion

Question 3: Explain the apparent discrepancy between the per-city and overall performances

We see that on time performance when aggregated at overall airline level shows AM West to be best amonng AM West and Alaska but when we look at by destination, in each of the cases the Alaska flights had better on time performance

Question 4: Provide an interesting paradoxical conclusion

This is because of a phenamenon called “Simpson’s Paradox”. This happens when one conducts analysis ignoring the confounding variable (here destination). More about Simpson’s Paradox