This week’s assignment required reproducing provided flight delay data for two airlines across five destinations in the form of a CSV file, reading the CSV data into R, tidying and transforming the data with tidyr and dplyr, and performing a comparative analysis of arrival delays for the airlines.
library(readr)
library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
Read in the CSV file, coerce to a data frame, and check the structure:
delays <- data.frame(read_csv("https://github.com/juddanderman/cuny-data-607/raw/master/Week5_Assignment/airline_delays.csv"))
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
str(delays)
## 'data.frame': 4 obs. of 7 variables:
## $ X1 : chr "ALASKA" NA "AM WEST" NA
## $ X2 : chr "on time" "delayed" "on time" "delayed"
## $ Los.Angeles : int 497 62 694 117
## $ Phoenix : int 221 12 4840 415
## $ San.Diego : int 212 20 383 65
## $ San.Francisco: int 503 102 320 129
## $ Seattle : int 1841 305 201 61
delays data frame for tidyingReplicate each airline name twice to fill in missing values in the first column and rename the first two columns of delays:
delays[1] <- rep(delays[which(!is.na(delays[1])), 1], each = 2)
colnames(delays)[1:2] <- c("Airline", "Status")
str(delays)
## 'data.frame': 4 obs. of 7 variables:
## $ Airline : chr "ALASKA" "ALASKA" "AM WEST" "AM WEST"
## $ Status : chr "on time" "delayed" "on time" "delayed"
## $ Los.Angeles : int 497 62 694 117
## $ Phoenix : int 221 12 4840 415
## $ San.Diego : int 212 20 383 65
## $ San.Francisco: int 503 102 320 129
## $ Seattle : int 1841 305 201 61
delays data frameTidy delays using gather() to melt the dataset with the colvar Destination and spread() to rotate the elements of Status (i.e. on time and delayed) that represent the names of variables or types of observation into columns:
delays <- delays %>% gather(Destination, Count, Los.Angeles:Seattle)
delays$Destination <- str_replace(delays$Destination, "[.]", " ")
head(delays)
## Airline Status Destination Count
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA delayed Los Angeles 62
## 3 AM WEST on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
delays <- delays %>% spread(Status, Count)
colnames(delays) <- str_to_title(colnames(delays))
colnames(delays) <- str_replace(colnames(delays), " ", "_")
head(delays)
## Airline Destination Delayed On_Time
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
ALASKA and AM WEST airlinesBelow I calculate the proportion of delayed flights arriving at the five destination airports for both airlines, summary statistics of the airlines’ delay rates across destinations, and the proportion of delayed flights for both airlines over their respective total flight counts.
delays <- delays %>% mutate(Delay_Rate = Delayed / (Delayed + On_Time))
knitr::kable(delays[c("Airline", "Destination", "Delay_Rate")] %>%
arrange(Destination, Airline))
| Airline | Destination | Delay_Rate |
|---|---|---|
| ALASKA | Los Angeles | 0.1109123 |
| AM WEST | Los Angeles | 0.1442663 |
| ALASKA | Phoenix | 0.0515021 |
| AM WEST | Phoenix | 0.0789724 |
| ALASKA | San Diego | 0.0862069 |
| AM WEST | San Diego | 0.1450893 |
| ALASKA | San Francisco | 0.1685950 |
| AM WEST | San Francisco | 0.2873051 |
| ALASKA | Seattle | 0.1421249 |
| AM WEST | Seattle | 0.2328244 |
summary_dly_rates <- summarize(group_by(delays, Airline), Min = min(Delay_Rate),
Median = median(Delay_Rate), Max = max(Delay_Rate),
SD = sd(Delay_Rate), Mean_Dly_Rate = mean(Delay_Rate))
agg_dly_rate <- summarize(group_by(delays, Airline), Agg_Dly_Rate = sum(Delayed) / sum(Delayed + On_Time))
summary_stats <- inner_join(summary_dly_rates, agg_dly_rate)
knitr::kable(summary_stats)
| Airline | Min | Median | Max | SD | Mean_Dly_Rate | Agg_Dly_Rate |
|---|---|---|---|---|---|---|
| ALASKA | 0.0515021 | 0.1109123 | 0.1685950 | 0.0459262 | 0.1118683 | 0.1327152 |
| AM WEST | 0.0789724 | 0.1450893 | 0.2873051 | 0.0821285 | 0.1776915 | 0.1089273 |
simpsons_paradox <- rbind(c("Mean Delay Rate Across Destinations",
summary_stats$Airline[which.min(summary_stats$Mean_Dly_Rate)]),
c("Aggregate Delay Rate Across All Flights",
summary_stats$Airline[which.min(summary_stats$Agg_Dly_Rate)]))
colnames(simpsons_paradox) <- c("Timeliness Statistic", "Airline with Minimum or Most Favorable Value")
knitr::kable(simpsons_paradox)
| Timeliness Statistic | Airline with Minimum or Most Favorable Value |
|---|---|
| Mean Delay Rate Across Destinations | ALASKA |
| Aggregate Delay Rate Across All Flights | AM WEST |
The delay rate calculations reveal that ALASKA has a lower propoprtion of delayed flights at each of the five destinations and so a lower average delay rate across destinations, while AM WEST has a lower proportion of delayed arrivals across all flights.
The seemingly counter-intuitive results of the flight delays comparison performed above are visualized below. The final bar plot of counts of flights arriving at the five destination airports helps to illustrate how in this case the airline with the lowest proportion of delayed flights switches when one evaluates either the data partitioned by destination or the aggregate flight counts for the two airlines. As Fung (2013) notes in the prologue to Numbersense, the given data set does not capture confounding variables or factors that might cause increased rates of delays like weather conditions, flight origin, pilot experience, and so on. AM WEST has far more flights arriving in Phoenix compared to the other destinations, where the proportion of delayed flights is relatively low for both airlines, while ALASKA has a greater share of its flights arriving in Seattle, where the delay rates are greater for both airlines. As a result, the aggregate delay rates across all flights are heavily weighted toward arrivals in Phoenix for AM WEST, where delays are relatively rare, and toward Seattle for ALASKA, where delays are relatively common, and so the trend found in the airport-by-airport comparison does not hold when one considers all flights in aggregate. Following Fung, we might conclude that delay rates may be more influenced by a lurking variable like weather than by either airline or flight destination, and that our results reflect the correlation between weather and flight destination and the different distributions of routes among airlines.
ggplot(delays, aes(Destination, Delay_Rate)) +
geom_bar(aes(fill = Airline), position = "dodge", stat = "identity") +
ylab("Delay Rate")
ggplot(agg_dly_rate, aes(Airline, Agg_Dly_Rate)) +
geom_bar(aes(fill = Airline), position = "dodge", stat = "identity") +
ylab("Aggregate Delay Rate")
delays_gathered <- delays %>% gather(Status, Flights, Delayed, On_Time)
ggplot(delays_gathered, aes(x = Airline, y = Flights, fill = Status)) +
geom_bar(position = "stack", stat = "identity") + facet_grid(~ Destination) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_discrete(labels = c("Delayed", "On Time"))
Fung, Kaiser (2013). Numbersense: How to Use Big Data to Your Advantage. McGraw-Hill: 1-15.
Wickham, Hadley (2014). Tidy Data. Journal of Statistical Software, 59(10).