This script reads the flight delays CSV file in “wide” data structue from my GitHub repo into R.
gitURL <- "https://raw.githubusercontent.com/douglasbarley/DATA607/master/Week%205%20-%20Arrival%20Delays.csv"
delays <- read.csv(gitURL)
names(delays) <- c("Airline","status","LAX","PHX","SAN","SFO","SEA")
delays
## Airline status LAX PHX SAN SFO SEA
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
First pivot longer to move the columns for each respective destination into a single “Dest” column and the numbers of on time and delayed flights into a “count” column. Replace "" with NA in the Airline column, and then fill values forward to replace the NAs.
delays2 <- delays %>%
pivot_longer(`LAX`:`SEA`, names_to = "Dest", values_to = "count", values_drop_na = TRUE) %>%
replace_with_na_all(condition = ~.x == "") %>% # replace all "" values with NA
fill(Airline) # fill values forward in the Airline column
delays2$status <- sub(" ", "_", delays2$status)
delays2
## # A tibble: 20 x 4
## Airline status Dest count
## <chr> <chr> <chr> <int>
## 1 ALASKA on_time LAX 497
## 2 ALASKA on_time PHX 221
## 3 ALASKA on_time SAN 212
## 4 ALASKA on_time SFO 503
## 5 ALASKA on_time SEA 1841
## 6 ALASKA delayed LAX 62
## 7 ALASKA delayed PHX 12
## 8 ALASKA delayed SAN 20
## 9 ALASKA delayed SFO 102
## 10 ALASKA delayed SEA 305
## 11 AM WEST on_time LAX 694
## 12 AM WEST on_time PHX 4840
## 13 AM WEST on_time SAN 383
## 14 AM WEST on_time SFO 320
## 15 AM WEST on_time SEA 201
## 16 AM WEST delayed LAX 117
## 17 AM WEST delayed PHX 415
## 18 AM WEST delayed SAN 65
## 19 AM WEST delayed SFO 129
## 20 AM WEST delayed SEA 61
Then pivot wider to make separate columns for the “on time” and “delayed” variables and their respective values.
delays3 <- delays2 %>%
pivot_wider(names_from = status, values_from = count)
delays3
## # A tibble: 10 x 4
## Airline Dest on_time delayed
## <chr> <chr> <int> <int>
## 1 ALASKA LAX 497 62
## 2 ALASKA PHX 221 12
## 3 ALASKA SAN 212 20
## 4 ALASKA SFO 503 102
## 5 ALASKA SEA 1841 305
## 6 AM WEST LAX 694 117
## 7 AM WEST PHX 4840 415
## 8 AM WEST SAN 383 65
## 9 AM WEST SFO 320 129
## 10 AM WEST SEA 201 61
Create new variables using mutate() for total flights, percent of on time flights, and delayed flights per city.
delays3 <- mutate(delays3,
TotalPerCity = on_time + delayed,
OnTimePct = on_time / TotalPerCity,
DelayPct = delayed / TotalPerCity)
delays3
## # A tibble: 10 x 7
## Airline Dest on_time delayed TotalPerCity OnTimePct DelayPct
## <chr> <chr> <int> <int> <int> <dbl> <dbl>
## 1 ALASKA LAX 497 62 559 0.889 0.111
## 2 ALASKA PHX 221 12 233 0.948 0.0515
## 3 ALASKA SAN 212 20 232 0.914 0.0862
## 4 ALASKA SFO 503 102 605 0.831 0.169
## 5 ALASKA SEA 1841 305 2146 0.858 0.142
## 6 AM WEST LAX 694 117 811 0.856 0.144
## 7 AM WEST PHX 4840 415 5255 0.921 0.0790
## 8 AM WEST SAN 383 65 448 0.855 0.145
## 9 AM WEST SFO 320 129 449 0.713 0.287
## 10 AM WEST SEA 201 61 262 0.767 0.233
Make a copy of the dataset to show aggregates per airline. Remove the specific Destination data, group by Airline and use summarise() to compare overall airline performance by collapsing totals to a single line per airline. Once the sums are summarized, add % of on-time flights and % of delayed flights per Airline.
delaysSummary <- delays3
delaysSummary <- mutate(delaysSummary,
Dest = NULL) %>%
group_by(Airline) %>%
summarise(ot = sum(on_time), del = sum(delayed), tot_flt = sum(TotalPerCity))
delaysSummary <- mutate(delaysSummary,
Pct_OnTime = ot / tot_flt,
Pct_Delay = del / tot_flt)
delaysSummary
## # A tibble: 2 x 6
## Airline ot del tot_flt Pct_OnTime Pct_Delay
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 ALASKA 3274 501 3775 0.867 0.133
## 2 AM WEST 6438 787 7225 0.891 0.109
With per city and overall airline performance data ready, it’s time to visualize the data!
Comparing the per-city on-time performance as a percentage for both airlines, the data appears similar in shape for each city. Phoenix flights arrive on time the most, and San Francisco flights are least on time as a percentage. The visual impression is that Alaska airlines slightly edges out American West with a higher percentage of on time flights for every individual destination.
ggplot(delays3, aes(x = Dest)) + geom_col(aes(y = OnTimePct), fill = "light blue") + xlab("destinations") + ylab("% on time flights") + facet_wrap(~Airline) + labs(title = "Percent on time flights per destination by airline")
Comparing the overall on-time performance as a percentage for both airlines, it is interesting to note that overall American West appears to have a higher percentage of on time flights. It is plain in the data that American West is on time overall 89.1% of the time compared to Alaska’s 86.7% on time rate.
ggplot(delaysSummary, aes(x = Airline)) + geom_col(aes(y = Pct_OnTime), fill = "turquoise") + ylab("% on time flights") + labs(title = "Percent on time flights by airline")
There appears to be a disconnect between the per city on time rates and the overall on time rates for the airlines. If one airline, Alaska, has a greater per city on time rate for every city would it not make sense that they have an overall better on time rate? In order to answer that it begs that we dig in a little further.
We know the comparative statistics, but what if we look at the raw volumes of flights to each destination per airline?
ggplot(delays3, aes(x = Dest)) + geom_col(aes(y = on_time), fill = "light blue") + xlab("destinations") + ylab("# on time flights") + facet_wrap(~Airline) + labs(title = "# on time flights per destination by airline")
Look at how many more flights Am West has on time going to Phoenix!
So what about overall volumes for the airlines?
ggplot(delaysSummary, aes(x = Airline)) + geom_col(aes(y = ot), fill = "turquoise") + ylab("# on time flights") + labs(title = "# on time flights by airline")
Am West has almost twice the volume of on time flights overall!
With almost twice the number of flights, Am West airlines has more room to have delays per city while maintaining an overall superior on time rate over its competitors that fly smaller quantities of flights. Am West maintained an overall on time rate of 89.1%, which was 6438 out of 7225 flights that were on time. Its competitor, Alaska Air, maintained an overall on time rate of 86.7% or 3274 out of 3775 flights.
In order for Am West to drop to equal its peer with an 86.7% on time rate, it could afford to not be on time for an additional 174 flights, or only 6264 flights being on time. Paradoxically then, it appears that the more frequently you fly the more frequently you can be late without significantly impacting your on time arrival statistics.