The CSV file was created from a MySQL script, which is saved in GitHub at:
https://github.com/kecbenson/DATA_607_Wk5/blob/master/DATA607_Wk5_airlines.sql
The output CSV file from the MySQL script is saved in GitHub at:
https://github.com/kecbenson/DATA_607_Wk5/blob/master/arrival_delays.csv
tidyr and dplyr as needed to tidy and transform the dataFirst, let’s read in the CSV file and review the data.
library(tidyverse)
library(knitr)
url <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Wk5/master/arrival_delays.csv"
raw <- read_csv(url)
kable(raw)
| Airline | On Time/Delayed | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
Second, let’s use gather and spread functions to create a tidy data frame.
df1 <- raw %>%
# gather flight counts into destination and frequency columns
gather("Dest", "Freq", 3:7) %>%
# spread delayed and ontime frequency into separate columns
spread(2, 4) %>%
# rename delayed and ontime columns
rename(Delayed = delayed, OnTime = 'on time')
df1
## # A tibble: 10 x 4
## Airline Dest Delayed OnTime
## <chr> <chr> <int> <int>
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
Third, let’s use mutate and arrange functions to create additional variables and re-order the data frame.
df2 <- df1 %>%
# add total flight count, % delayed and % ontime
mutate(Total = Delayed + OnTime, PctDelayed = round(100 * Delayed / Total, 2),
PctOnTime = round(100 * OnTime / Total, 2)) %>%
# re-arrange rows by destination then airline
arrange(Dest, Airline)
df2
## # A tibble: 10 x 7
## Airline Dest Delayed OnTime Total PctDelayed PctOnTime
## <chr> <chr> <int> <int> <int> <dbl> <dbl>
## 1 ALASKA Los Angeles 62 497 559 11.1 88.9
## 2 AM WEST Los Angeles 117 694 811 14.4 85.6
## 3 ALASKA Phoenix 12 221 233 5.15 94.8
## 4 AM WEST Phoenix 415 4840 5255 7.9 92.1
## 5 ALASKA San Diego 20 212 232 8.62 91.4
## 6 AM WEST San Diego 65 383 448 14.5 85.5
## 7 ALASKA San Francisco 102 503 605 16.9 83.1
## 8 AM WEST San Francisco 129 320 449 28.7 71.3
## 9 ALASKA Seattle 305 1841 2146 14.2 85.8
## 10 AM WEST Seattle 61 201 262 23.3 76.7
Finally, let’s use group_by and bind_rows functions to summarize the delay data by airline, and then append to the end of the data frame.
df3 <- df2 %>%
# group by airline
group_by(Airline) %>%
# summarize data by airline group
summarize(Dest = "Total", Delayed = sum(Delayed), OnTime = sum(OnTime), Total = sum(Total),
PctDelayed = round(100 * Delayed / Total, 2), PctOnTime = round(100 * OnTime / Total, 2))
# append summary rows at bottom of dataframe
df4 <- bind_rows(df2, df3)
kable(df4)
| Airline | Dest | Delayed | OnTime | Total | PctDelayed | PctOnTime |
|---|---|---|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 | 559 | 11.09 | 88.91 |
| AM WEST | Los Angeles | 117 | 694 | 811 | 14.43 | 85.57 |
| ALASKA | Phoenix | 12 | 221 | 233 | 5.15 | 94.85 |
| AM WEST | Phoenix | 415 | 4840 | 5255 | 7.90 | 92.10 |
| ALASKA | San Diego | 20 | 212 | 232 | 8.62 | 91.38 |
| AM WEST | San Diego | 65 | 383 | 448 | 14.51 | 85.49 |
| ALASKA | San Francisco | 102 | 503 | 605 | 16.86 | 83.14 |
| AM WEST | San Francisco | 129 | 320 | 449 | 28.73 | 71.27 |
| ALASKA | Seattle | 305 | 1841 | 2146 | 14.21 | 85.79 |
| AM WEST | Seattle | 61 | 201 | 262 | 23.28 | 76.72 |
| ALASKA | Total | 501 | 3274 | 3775 | 13.27 | 86.73 |
| AM WEST | Total | 787 | 6438 | 7225 | 10.89 | 89.11 |
We start by comparing the flight count for each airline, broken out by destination. Several observations can be made:
AM WEST has substantially more flights into Phoenix than ALASKA (5255 vs. 233)
ALASKA has substantially more flights into Seattle than AM WEST (2146 vs. 262)
For the other destinations, ALASKA and AM WEST have comparable flight counts
Overall, AM WEST has almost double the total number of flights compared to ALASKA (7225 vs 3775).
ggplot(df4) + geom_bar(aes(x = Airline, y = Total, fill = Airline), stat = "identity") +
facet_wrap(~ Dest, nrow = 2)
Next we compare the percentage of flights delayed for each airline, by destination. Two observations are apparent:
AM WEST has a higher delay percentage than ALASKA for each of the five destinations
However, AM WEST has a lower delay percentage than ALASKA across all five destinations in total (10.9% vs. 13.3%).
ggplot(df4) + geom_bar(aes(x = Airline, y = PctDelayed, fill = Airline), stat = "identity") +
facet_wrap(~ Dest, nrow = 2)
This exercise demonstrates that comparing (a) average statistics across subsets to (b) aggregate statistics across the entire dataset can be misleading. It is important to keep in mind the relative sizes of the subsets when making such comparisons. In particular, for the flight delay dataset:
AM WEST has a higher delay percentage than ALASKA for each of the destinations, yet across all destinations in total, AM WEST has a lower total delay percentage.
This is explained by the flight counts of each airline into each destination:
The average across all destinations for AM WEST is heavily weighted by the statistics for Phoenix, which accounts for 73% (5255 / 7225) of the total AM WEST flights. In Phoenix AM WEST has a relatively lower delay percentage of 7.9%.
The average across all destinations for ALASKA is heavily weighted by the statistics for Seattle, which accounts for 57% (2146 / 3775) of the total ALASKA flights. In Seattle ALASKA has a relatively higher delay percentage of 14.2%.
In aggregating across all flights for each airline, the average delay percentage is implicitly weighted by the flight counts into each destination. So even though AM WEST has a higher delay percentage than ALASKA for each destination individually, AM WEST’s overall average is heavily weighted by Phoenix, which results in a lower delay percentage than ALASKA across all destinations in total.