DATA607_Week5_TidyTransform

Introduction

In this exercise we import, tidy , transform and analyse flight data from a csv file that includes the counts of on time and delayed departure flights operated by Alaska air and AM WEST to 5 major cities in the united states.

Data handling

The initial data set when imported looks like this

## Rows: 5
## Columns: 7
## $ V1 <chr> "Alaska", "", "", "AM WEST", ""
## $ V2 <chr> "on time", "delayed", "", "on time", "delayed"
## $ V3 <int> 497, 62, NA, 694, 117
## $ V4 <int> 221, 12, NA, 4840, 415
## $ V5 <int> 212, 20, NA, 383, 65
## $ V6 <int> 503, 102, NA, 320, 129
## $ V7 <int> 1841, 305, NA, 201, 61

We can now cleanup the empty row, add column names and add the Airline name to the second row

The Tidied data now looks like this

	Airline	Status	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
1	Alaska	on time	497	221	212	503	1841
2	Alaska	delayed	62	12	20	102	305
4	AM WEST	on time	694	4840	383	320	201
5	AM WEST	delayed	117	415	65	129	61

We can now transform the data into the standard form where all observations are rows and all columns are variables

Airline	Status	City	Count
Alaska	on time	Los Angeles	497
Alaska	on time	Phoenix	221
Alaska	on time	San Diego	212
Alaska	on time	San Francisco	503
Alaska	on time	Seattle	1841
Alaska	delayed	Los Angeles	62

Analysis

With the data now tidy and transformed, the fist comparison made is between the on time and delayed flight numbers for both airlines across all 5 cities so that we have a basic overview

ggplot(flight_data_tidy, aes(x = City, y = Count, color = Airline, group = Airline)) +
  geom_line() +
  geom_point() +
  labs(x = "City", y = "Flight Count") +
  facet_wrap(~ Status) +  # Facet by status
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Angle x-axis

if we look at overall delays, we see that AM WEST consistently have a lot more delayed flights than Alaska, especially in Phoenix. The only exception to this plot is Seattle where AM has less delayed flights than Alaska. In addition when we plot the on time arrivals we notice that AM WEST also have a lot more on time arrivals than Alaska, especially in Phoenix, with the trend reversing again in Seattle.

If we look at just Delayed flight numbers we might be led to believe that Alaska performs better in all cases except in Seattle, and if we only look at on time arrivals, we might be led to believe AM WEST does better expect in San Francisco and Seattle.

Rate analyis

When considering data like this its important to bring flight volume into the equation. Knowing the proportion of delayed vs on-time flights will help us better analyse the trends in this data set. It will also be helpful if we can compare the overall stats for the airlines , and then break it down city by city

City Baseline

When analyzing overall rates for this data its important that we calculate some baseline stats for all 5 cities being studied. The most important stat here is the Average rate of delay that has been calculated for each city.

flight_stats_city <- flight_data_tidy %>%
  group_by(City) %>%
  summarise(
    Total_Flights = sum(Count),  # Total flights (on-time + delayed)
    Delayed_Flights = sum(Count[Status == "delayed"]),  # Delayed flights
    OnTime_Flights = sum(Count[Status == "on time"]),  # On-time flights
    Rate_of_Delay = Delayed_Flights / Total_Flights  # Calculate the rate of delay
  )

knitr::kable(flight_stats_city,"simple")

City	Total_Flights	Delayed_Flights	OnTime_Flights	Rate_of_Delay
Los Angeles	1370	179	1191	0.1306569
Phoenix	5488	427	5061	0.0778061
San Diego	680	85	595	0.1250000
San Francisco	1054	231	823	0.2191651
Seattle	2408	366	2042	0.1519934

Airline overall

We can also see how each airline does when taken as a whole without going city by city

flight_stats_overall <- flight_data_tidy %>%
  group_by(Airline) %>%
  summarise(
    Total_Flights = sum(Count),  # Total flights (on-time + delayed)
    Delayed_Flights = sum(Count[Status == "delayed"]),  # Delayed flights
    OnTime_Flights = sum(Count[Status == "on time"]),  # On-time flights
    Rate_of_Delay = Delayed_Flights / Total_Flights  # Calculate the rate of delay
  )

knitr::kable(flight_stats_overall,"simple")

Airline	Total_Flights	Delayed_Flights	OnTime_Flights	Rate_of_Delay
AM WEST	7225	787	6438	0.1089273
Alaska	3775	501	3274	0.1327152

Here we can see that Alaska has a slightly higher overall Rate of Delay compared to Alaska

Airline rates across each city

No we observe how each airline does individually by calculating their total delays compared to the total number of operated flights within each city .

flight_stats <- flight_data_tidy %>%
  group_by(Airline, City) %>%
  summarise(
    Total_Flights = sum(Count),  # Total flights (on-time + delayed)
    Delayed_Flights = sum(Count[Status == "delayed"]),  # Delayed flights
    OnTime_Flights = sum(Count[Status == "on time"]),  # On-time flights
    Rate_of_Delay = Delayed_Flights / Total_Flights  # Calculate the rate of delay
  )

knitr::kable(flight_stats,"simple")

Airline	City	Total_Flights	Delayed_Flights	OnTime_Flights	Rate_of_Delay
AM WEST	Los Angeles	811	117	694	0.1442663
AM WEST	Phoenix	5255	415	4840	0.0789724
AM WEST	San Diego	448	65	383	0.1450893
AM WEST	San Francisco	449	129	320	0.2873051
AM WEST	Seattle	262	61	201	0.2328244
Alaska	Los Angeles	559	62	497	0.1109123
Alaska	Phoenix	233	12	221	0.0515021
Alaska	San Diego	232	20	212	0.0862069
Alaska	San Francisco	605	102	503	0.1685950
Alaska	Seattle	2146	305	1841	0.1421249

We can now bring it all together and plot the average rate of delays for Alaska and AM West and compare it against the average city rate of delay which is symbolized by the red dot in this plot

ggplot() +
  # Bar plot for the rate of delay (
  geom_bar(data = flight_stats, aes(x = City, y = Rate_of_Delay, fill = Airline), 
           stat = "identity", position = "dodge", show.legend = TRUE) +  
  geom_point(data = flight_stats_city, aes(x = City, y = Rate_of_Delay), 
             color = "red", size = 4, shape = 16, show.legend = FALSE) + 
  # Labels and title
  labs(x = "City",
       y = "Rate of Delay") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Findings

From this graph the data on flight delays is a lot more clear. While their overall delay rates where higher, At all 5 cities Alaska performs better than both the city average and AM West. In other words, if you were to fly to one of these cities, choosing Alaska would would ensure that your rate of delays are significantly lower than the city average. This is even true in cities like Seattle and San Francisco where they have a higher volume of operated flights compared to AM West.

Explaining the discrepancy

The overall rates that are misleading is the result of volume discrepancy when it comes to AM west among the 5 cities. In this example, Phoenix has 5,255 flights operated by AM WEST with a relatively low 7.89 percentage delay rate while their delay rates across the other 4 cities lie around 15 - 20 percent. AM WEST’s volume going to phoenix counts for 78 percent of the total volume and therefore contributes significantly to their overall delay rates bringing it down.

On the other hand Alaska’s delay rate for its highest volume city Seattle with 2146 flights is on the higher end of its average delay rates across cities. Even though its lower than the AM west volume discrepancy, it accounts for 567 percent of all of Alaska’s flights, using their overall rate of delay higher

Conclusion

When analyzing data its important to consider groupings of variables that might be affecting your results. This data set is a good example for understanding the Simpsons paradox where the relationship between variables reverse or disappear when grouped using another variable.