In this exercise we import, tidy , transform and analyse flight data from a csv file that includes the counts of on time and delayed departure flights operated by Alaska air and AM WEST to 5 major cities in the united states.
The initial data set when imported looks like this
## Rows: 5
## Columns: 7
## $ V1 <chr> "Alaska", "", "", "AM WEST", ""
## $ V2 <chr> "on time", "delayed", "", "on time", "delayed"
## $ V3 <int> 497, 62, NA, 694, 117
## $ V4 <int> 221, 12, NA, 4840, 415
## $ V5 <int> 212, 20, NA, 383, 65
## $ V6 <int> 503, 102, NA, 320, 129
## $ V7 <int> 1841, 305, NA, 201, 61
We can now cleanup the empty row, add column names and add the Airline name to the second row
The Tidied data now looks like this
| Airline | Status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle | |
|---|---|---|---|---|---|---|---|
| 1 | Alaska | on time | 497 | 221 | 212 | 503 | 1841 |
| 2 | Alaska | delayed | 62 | 12 | 20 | 102 | 305 |
| 4 | AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| 5 | AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
We can now transform the data into the standard form where all observations are rows and all columns are variables
| Airline | Status | City | Count |
|---|---|---|---|
| Alaska | on time | Los Angeles | 497 |
| Alaska | on time | Phoenix | 221 |
| Alaska | on time | San Diego | 212 |
| Alaska | on time | San Francisco | 503 |
| Alaska | on time | Seattle | 1841 |
| Alaska | delayed | Los Angeles | 62 |
With the data now tidy and transformed, the fist comparison made is between the on time and delayed flight numbers for both airlines across all 5 cities so that we have a basic overview
ggplot(flight_data_tidy, aes(x = City, y = Count, color = Airline, group = Airline)) +
geom_line() +
geom_point() +
labs(x = "City", y = "Flight Count") +
facet_wrap(~ Status) + # Facet by status
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Angle x-axis
if we look at overall delays, we see that AM WEST consistently have a lot more delayed flights than Alaska, especially in Phoenix. The only exception to this plot is Seattle where AM has less delayed flights than Alaska. In addition when we plot the on time arrivals we notice that AM WEST also have a lot more on time arrivals than Alaska, especially in Phoenix, with the trend reversing again in Seattle.
If we look at just Delayed flight numbers we might be led to believe that Alaska performs better in all cases except in Seattle, and if we only look at on time arrivals, we might be led to believe AM WEST does better expect in San Francisco and Seattle.
When considering data like this its important to bring flight volume into the equation. Knowing the proportion of delayed vs on-time flights will help us better analyse the trends in this data set. It will also be helpful if we can compare the overall stats for the airlines , and then break it down city by city
When analyzing overall rates for this data its important that we calculate some baseline stats for all 5 cities being studied. The most important stat here is the Average rate of delay that has been calculated for each city.
flight_stats_city <- flight_data_tidy %>%
group_by(City) %>%
summarise(
Total_Flights = sum(Count), # Total flights (on-time + delayed)
Delayed_Flights = sum(Count[Status == "delayed"]), # Delayed flights
OnTime_Flights = sum(Count[Status == "on time"]), # On-time flights
Rate_of_Delay = Delayed_Flights / Total_Flights # Calculate the rate of delay
)
knitr::kable(flight_stats_city,"simple")
| City | Total_Flights | Delayed_Flights | OnTime_Flights | Rate_of_Delay |
|---|---|---|---|---|
| Los Angeles | 1370 | 179 | 1191 | 0.1306569 |
| Phoenix | 5488 | 427 | 5061 | 0.0778061 |
| San Diego | 680 | 85 | 595 | 0.1250000 |
| San Francisco | 1054 | 231 | 823 | 0.2191651 |
| Seattle | 2408 | 366 | 2042 | 0.1519934 |
We can also see how each airline does when taken as a whole without going city by city
flight_stats_overall <- flight_data_tidy %>%
group_by(Airline) %>%
summarise(
Total_Flights = sum(Count), # Total flights (on-time + delayed)
Delayed_Flights = sum(Count[Status == "delayed"]), # Delayed flights
OnTime_Flights = sum(Count[Status == "on time"]), # On-time flights
Rate_of_Delay = Delayed_Flights / Total_Flights # Calculate the rate of delay
)
knitr::kable(flight_stats_overall,"simple")
| Airline | Total_Flights | Delayed_Flights | OnTime_Flights | Rate_of_Delay |
|---|---|---|---|---|
| AM WEST | 7225 | 787 | 6438 | 0.1089273 |
| Alaska | 3775 | 501 | 3274 | 0.1327152 |
Here we can see that Alaska has a slightly higher overall Rate of Delay compared to Alaska
No we observe how each airline does individually by calculating their total delays compared to the total number of operated flights within each city .
flight_stats <- flight_data_tidy %>%
group_by(Airline, City) %>%
summarise(
Total_Flights = sum(Count), # Total flights (on-time + delayed)
Delayed_Flights = sum(Count[Status == "delayed"]), # Delayed flights
OnTime_Flights = sum(Count[Status == "on time"]), # On-time flights
Rate_of_Delay = Delayed_Flights / Total_Flights # Calculate the rate of delay
)
knitr::kable(flight_stats,"simple")
| Airline | City | Total_Flights | Delayed_Flights | OnTime_Flights | Rate_of_Delay |
|---|---|---|---|---|---|
| AM WEST | Los Angeles | 811 | 117 | 694 | 0.1442663 |
| AM WEST | Phoenix | 5255 | 415 | 4840 | 0.0789724 |
| AM WEST | San Diego | 448 | 65 | 383 | 0.1450893 |
| AM WEST | San Francisco | 449 | 129 | 320 | 0.2873051 |
| AM WEST | Seattle | 262 | 61 | 201 | 0.2328244 |
| Alaska | Los Angeles | 559 | 62 | 497 | 0.1109123 |
| Alaska | Phoenix | 233 | 12 | 221 | 0.0515021 |
| Alaska | San Diego | 232 | 20 | 212 | 0.0862069 |
| Alaska | San Francisco | 605 | 102 | 503 | 0.1685950 |
| Alaska | Seattle | 2146 | 305 | 1841 | 0.1421249 |
We can now bring it all together and plot the average rate of delays for Alaska and AM West and compare it against the average city rate of delay which is symbolized by the red dot in this plot
ggplot() +
# Bar plot for the rate of delay (
geom_bar(data = flight_stats, aes(x = City, y = Rate_of_Delay, fill = Airline),
stat = "identity", position = "dodge", show.legend = TRUE) +
geom_point(data = flight_stats_city, aes(x = City, y = Rate_of_Delay),
color = "red", size = 4, shape = 16, show.legend = FALSE) +
# Labels and title
labs(x = "City",
y = "Rate of Delay") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
From this graph the data on flight delays is a lot more clear. While their overall delay rates where higher, At all 5 cities Alaska performs better than both the city average and AM West. In other words, if you were to fly to one of these cities, choosing Alaska would would ensure that your rate of delays are significantly lower than the city average. This is even true in cities like Seattle and San Francisco where they have a higher volume of operated flights compared to AM West.
The overall rates that are misleading is the result of volume discrepancy when it comes to AM west among the 5 cities. In this example, Phoenix has 5,255 flights operated by AM WEST with a relatively low 7.89 percentage delay rate while their delay rates across the other 4 cities lie around 15 - 20 percent. AM WEST’s volume going to phoenix counts for 78 percent of the total volume and therefore contributes significantly to their overall delay rates bringing it down.
On the other hand Alaska’s delay rate for its highest volume city Seattle with 2146 flights is on the higher end of its average delay rates across cities. Even though its lower than the AM west volume discrepancy, it accounts for 567 percent of all of Alaska’s flights, using their overall rate of delay higher
When analyzing data its important to consider groupings of variables that might be affecting your results. This data set is a good example for understanding the Simpsons paradox where the relationship between variables reverse or disappear when grouped using another variable.