Tidying and Transforming Data

This week’s assignment required reproducing provided flight delay data for two airlines across five destinations in the form of a CSV file, reading the CSV data into R, tidying and transforming the data with tidyr and dplyr, and performing a comparative analysis of arrival delays for the airlines.

Load required packages

library(readr)
library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)

Import CSV data

Read in the CSV file, coerce to a data frame, and check the structure:

delays <- data.frame(read_csv("https://github.com/juddanderman/cuny-data-607/raw/master/Week5_Assignment/airline_delays.csv"))
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
str(delays)
## 'data.frame':    4 obs. of  7 variables:
##  $ X1           : chr  "ALASKA" NA "AM WEST" NA
##  $ X2           : chr  "on time" "delayed" "on time" "delayed"
##  $ Los.Angeles  : int  497 62 694 117
##  $ Phoenix      : int  221 12 4840 415
##  $ San.Diego    : int  212 20 383 65
##  $ San.Francisco: int  503 102 320 129
##  $ Seattle      : int  1841 305 201 61

Prepare delays data frame for tidying

Replicate each airline name twice to fill in missing values in the first column and rename the first two columns of delays:

delays[1] <- rep(delays[which(!is.na(delays[1])), 1], each = 2)
colnames(delays)[1:2] <- c("Airline", "Status")
str(delays)
## 'data.frame':    4 obs. of  7 variables:
##  $ Airline      : chr  "ALASKA" "ALASKA" "AM WEST" "AM WEST"
##  $ Status       : chr  "on time" "delayed" "on time" "delayed"
##  $ Los.Angeles  : int  497 62 694 117
##  $ Phoenix      : int  221 12 4840 415
##  $ San.Diego    : int  212 20 383 65
##  $ San.Francisco: int  503 102 320 129
##  $ Seattle      : int  1841 305 201 61

Tidy the delays data frame

Tidy delays using gather() to melt the dataset with the colvar Destination and spread() to rotate the elements of Status (i.e. on time and delayed) that represent the names of variables or types of observation into columns:

delays <- delays %>% gather(Destination, Count, Los.Angeles:Seattle)
delays$Destination <- str_replace(delays$Destination, "[.]", " ")
head(delays)
##   Airline  Status Destination Count
## 1  ALASKA on time Los Angeles   497
## 2  ALASKA delayed Los Angeles    62
## 3 AM WEST on time Los Angeles   694
## 4 AM WEST delayed Los Angeles   117
## 5  ALASKA on time     Phoenix   221
## 6  ALASKA delayed     Phoenix    12
delays <- delays %>% spread(Status, Count)
colnames(delays) <- str_to_title(colnames(delays))
colnames(delays) <- str_replace(colnames(delays), " ", "_")
head(delays)
##   Airline   Destination Delayed On_Time
## 1  ALASKA   Los Angeles      62     497
## 2  ALASKA       Phoenix      12     221
## 3  ALASKA     San Diego      20     212
## 4  ALASKA San Francisco     102     503
## 5  ALASKA       Seattle     305    1841
## 6 AM WEST   Los Angeles     117     694

Analysis of delays for ALASKA and AM WEST airlines

Below I calculate the proportion of delayed flights arriving at the five destination airports for both airlines, summary statistics of the airlines’ delay rates across destinations, and the proportion of delayed flights for both airlines over their respective total flight counts.

delays <- delays %>% mutate(Delay_Rate = Delayed / (Delayed + On_Time))
knitr::kable(delays[c("Airline", "Destination", "Delay_Rate")] %>% 
               arrange(Destination, Airline))
Airline Destination Delay_Rate
ALASKA Los Angeles 0.1109123
AM WEST Los Angeles 0.1442663
ALASKA Phoenix 0.0515021
AM WEST Phoenix 0.0789724
ALASKA San Diego 0.0862069
AM WEST San Diego 0.1450893
ALASKA San Francisco 0.1685950
AM WEST San Francisco 0.2873051
ALASKA Seattle 0.1421249
AM WEST Seattle 0.2328244
summary_dly_rates <- summarize(group_by(delays, Airline), Min = min(Delay_Rate), 
                               Median = median(Delay_Rate), Max = max(Delay_Rate), 
                               SD = sd(Delay_Rate), Mean_Dly_Rate = mean(Delay_Rate))

agg_dly_rate <- summarize(group_by(delays, Airline), Agg_Dly_Rate = sum(Delayed) / sum(Delayed + On_Time))

summary_stats <- inner_join(summary_dly_rates, agg_dly_rate)
knitr::kable(summary_stats)
Airline Min Median Max SD Mean_Dly_Rate Agg_Dly_Rate
ALASKA 0.0515021 0.1109123 0.1685950 0.0459262 0.1118683 0.1327152
AM WEST 0.0789724 0.1450893 0.2873051 0.0821285 0.1776915 0.1089273
simpsons_paradox <- rbind(c("Mean Delay Rate Across Destinations", 
                            summary_stats$Airline[which.min(summary_stats$Mean_Dly_Rate)]), 
                          c("Aggregate Delay Rate Across All Flights", 
                            summary_stats$Airline[which.min(summary_stats$Agg_Dly_Rate)]))
colnames(simpsons_paradox) <- c("Timeliness Statistic", "Airline with Minimum or Most Favorable Value")
knitr::kable(simpsons_paradox)
Timeliness Statistic Airline with Minimum or Most Favorable Value
Mean Delay Rate Across Destinations ALASKA
Aggregate Delay Rate Across All Flights AM WEST

The delay rate calculations reveal that ALASKA has a lower propoprtion of delayed flights at each of the five destinations and so a lower average delay rate across destinations, while AM WEST has a lower proportion of delayed arrivals across all flights.

Plotting delays by destination and airline

The seemingly counter-intuitive results of the flight delays comparison performed above are visualized below. The final bar plot of counts of flights arriving at the five destination airports helps to illustrate how in this case the airline with the lowest proportion of delayed flights switches when one evaluates either the data partitioned by destination or the aggregate flight counts for the two airlines. As Fung (2013) notes in the prologue to Numbersense, the given data set does not capture confounding variables or factors that might cause increased rates of delays like weather conditions, flight origin, pilot experience, and so on. AM WEST has far more flights arriving in Phoenix compared to the other destinations, where the proportion of delayed flights is relatively low for both airlines, while ALASKA has a greater share of its flights arriving in Seattle, where the delay rates are greater for both airlines. As a result, the aggregate delay rates across all flights are heavily weighted toward arrivals in Phoenix for AM WEST, where delays are relatively rare, and toward Seattle for ALASKA, where delays are relatively common, and so the trend found in the airport-by-airport comparison does not hold when one considers all flights in aggregate. Following Fung, we might conclude that delay rates may be more influenced by a lurking variable like weather than by either airline or flight destination, and that our results reflect the correlation between weather and flight destination and the different distributions of routes among airlines.

ggplot(delays, aes(Destination, Delay_Rate)) + 
  geom_bar(aes(fill = Airline), position = "dodge", stat = "identity") + 
  ylab("Delay Rate")

ggplot(agg_dly_rate, aes(Airline, Agg_Dly_Rate)) + 
  geom_bar(aes(fill = Airline), position = "dodge", stat = "identity") + 
  ylab("Aggregate Delay Rate")

delays_gathered <- delays %>% gather(Status, Flights, Delayed, On_Time)

ggplot(delays_gathered, aes(x = Airline, y = Flights, fill = Status)) + 
  geom_bar(position = "stack", stat = "identity") + facet_grid(~ Destination) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_fill_discrete(labels = c("Delayed", "On Time"))

Sources

Fung, Kaiser (2013). Numbersense: How to Use Big Data to Your Advantage. McGraw-Hill: 1-15.

  • Our data for this week’s assignment come from Fung’s Numbersense. In the book’s prologue, Fung shows how the flight delay data manifest Simpson’s paradox, and reminds readers to pay close attention to the ways that different methods of aggregation and analysis can produce conflicting interpretations of the same data.

Wickham, Hadley (2014). Tidy Data. Journal of Statistical Software, 59(10).