Performing data transformations on a small dataset of arrival delays for two airlines across five destinations. Using the following packages:
require(tidyr)
require(dplyr)
require(stringr)
require(ggplot2)
Read the information from a .CSV file (constructed in MySQL and uploaded to Github) into R.
## city Alaska_ontime Alaska_delay AMWest_ontime AMWest_delay
## 1 Los Angeles 497 62 694 117
## 2 Phoenix 221 12 4840 415
## 3 San Diego 212 20 383 65
## 4 San Francisco 503 102 320 129
## 5 Seattle 1841 305 201 61
Here, I use dyplr’s gather and separate functions to make the previously wide data long. The original columns were named according to airline and departure status. I first lengthen the data by breaking them into a temporary variable (Condition), then divide them into two separate variables, Airline and Status.
dly2 <- dly %>%
gather("Condition", "Freq", 2:5) %>%
separate("Condition", c("Airline", "Status"), sep = "_")
dly2
## city Airline Status Freq
## 1 Los Angeles Alaska ontime 497
## 2 Phoenix Alaska ontime 221
## 3 San Diego Alaska ontime 212
## 4 San Francisco Alaska ontime 503
## 5 Seattle Alaska ontime 1841
## 6 Los Angeles Alaska delay 62
## 7 Phoenix Alaska delay 12
## 8 San Diego Alaska delay 20
## 9 San Francisco Alaska delay 102
## 10 Seattle Alaska delay 305
## 11 Los Angeles AMWest ontime 694
## 12 Phoenix AMWest ontime 4840
## 13 San Diego AMWest ontime 383
## 14 San Francisco AMWest ontime 320
## 15 Seattle AMWest ontime 201
## 16 Los Angeles AMWest delay 117
## 17 Phoenix AMWest delay 415
## 18 San Diego AMWest delay 65
## 19 San Francisco AMWest delay 129
## 20 Seattle AMWest delay 61
Next, I re-arrange the data so that the observational unit is flights out of a certain city by a certain airline. Frequency of delays and ontime departures are listed in the last two columns.
I further mutate the dataframe, adding a column with total flights and likelihood of being delayed.
dly3 <- dly2 %>%
spread(Status, Freq) %>%
mutate(total = delay + ontime, pct.dly = round((delay / total),4))
dly3
## city Airline delay ontime total pct.dly
## 1 Los Angeles Alaska 62 497 559 0.1109
## 2 Los Angeles AMWest 117 694 811 0.1443
## 3 Phoenix Alaska 12 221 233 0.0515
## 4 Phoenix AMWest 415 4840 5255 0.0790
## 5 San Diego Alaska 20 212 232 0.0862
## 6 San Diego AMWest 65 383 448 0.1451
## 7 San Francisco Alaska 102 503 605 0.1686
## 8 San Francisco AMWest 129 320 449 0.2873
## 9 Seattle Alaska 305 1841 2146 0.1421
## 10 Seattle AMWest 61 201 262 0.2328
As part of my analysis, I want to sum the total amount of ontime and delayed departures for the two airlines, across all flights in the five cities.
AL.ontime <- sum(dly3$ontime[dly3$Airline == "Alaska"])
AM.ontime <- sum(dly3$ontime[dly3$Airline == "AMWest"])
AL.dly <- sum(dly3$delay[dly3$Airline == "Alaska"])
AM.dly <- sum(dly3$delay[dly3$Airline == "AMWest"])
dly.sum <- data.frame( ontime = c(AL.ontime,AM.ontime),
delayed = c(AL.dly, AM.dly))
rownames(dly.sum) <- c("Alaska", "AMWest")
dly.sum
## ontime delayed
## Alaska 3274 501
## AMWest 6438 787
I then use this simplified dataframe to construct a mosaicplot, which reveals that the smaller carrier, Alaska Airlines has a slightly higher percentage of delayed flights.
mosaicplot(dly.sum, color = c("Blue","Red"), main = "Flight Departure Status for Two Airlines out of the SW")
Now, I calculate the mean percentage of flights delayed, grouped by city, an aggregate of the data from the two airlines.
dly4 <- dly3 %>%
group_by(city) %>%
summarise(sum(delay), sum(total), pct.dly = round(sum(delay)/sum(total),4))
dly4
## # A tibble: 5 x 4
## city `sum(delay)` `sum(total)` pct.dly
## <fct> <int> <int> <dbl>
## 1 Los Angeles 179 1370 0.131
## 2 Phoenix 427 5488 0.0778
## 3 San Diego 85 680 0.125
## 4 San Francisco 231 1054 0.219
## 5 Seattle 366 2408 0.152
My aim is to determine, and then visualize, which of the airlines perform better than average in terms of delayed departures. To do so, I repeat the aggregate percent-delayed-by-city across each two rows for the 5 cities, then subtract it from the percent-delayed-by-city-by-airline to get a residual. This residual indicates the relative performance of each airline at the five hubs, with negatives indicating a positive feature (relatively less incidence of delays).
dly3$mean.dly <- rep(dly4$pct.dly, each=2)
dly3$mean.diff <- dly3$pct.dly - dly3$mean.dly
dly3
## city Airline delay ontime total pct.dly mean.dly mean.diff
## 1 Los Angeles Alaska 62 497 559 0.1109 0.1307 -0.0198
## 2 Los Angeles AMWest 117 694 811 0.1443 0.1307 0.0136
## 3 Phoenix Alaska 12 221 233 0.0515 0.0778 -0.0263
## 4 Phoenix AMWest 415 4840 5255 0.0790 0.0778 0.0012
## 5 San Diego Alaska 20 212 232 0.0862 0.1250 -0.0388
## 6 San Diego AMWest 65 383 448 0.1451 0.1250 0.0201
## 7 San Francisco Alaska 102 503 605 0.1686 0.2192 -0.0506
## 8 San Francisco AMWest 129 320 449 0.2873 0.2192 0.0681
## 9 Seattle Alaska 305 1841 2146 0.1421 0.1520 -0.0099
## 10 Seattle AMWest 61 201 262 0.2328 0.1520 0.0808
Finally, I visualize the differential using ggplot2.
fig1 <- ggplot(dly3, aes(x = factor(city), fill = Airline, y = mean.diff)) +
geom_dotplot(binaxis = "y", stackgroups = TRUE, binpositions="all") +
geom_hline(yintercept=0, linetype="dashed", color = "red", size=1)
fig2 <- fig1 + labs(x = "Cities", y = "+/- over Mean Delay",
title = "Comparing Relative Lateness of Two Airlines", subtitle= "Residual over Mean Percentage of Delayed Flights by City")
fig2
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
This visualization reveals the Simpson’s Paradox. Whereas the mosiacplot revealed that Alaska Airlines did overally worse in the aggregate, it did better than AMWest at each of the five airports tested! This is due to the sheer volume of AMWest flights departing Pheonix which had a relatively low rate of delays, which although worse than Alaska at that location, brought down AMWest’s overall delay incidence greatly.