Approach

In this assignment, I recreating a chart that describes the arrival delays of two airlines, Alaska and Amwest, across 5 destinations. The goal is to tidy and transform by using tidyr and dplyr functions on this wide structure data set. Once the table is in tidy format, I will perform the following analysis on the two airlines:

Source:

Proposed Steps:

  1. Data creation and loading: Chart was created on excel and converted into a .csv file before being upload on to my Github repository. After the chart is loaded on to R using the raw text file URL on Github.

  2. Data Tidying and Transformation: Using tidyr and dplyr function, I plan on to first creating a header name on the first row for the airline and their status, fill in the missing blank cells, remove the NA row, then change the wide structure of this table into a longer structure to get the tidy format.

  3. Comparison: Now that the table is in a long tidy format, I can proceed with data analysis by grouping the Airlines (Alaska and AM West) then summarize to get either the delay/arrival rates of the 2 airlines’ overall flight for comparison. After I want to know what is the difference between the two for each cities listed which I will be combine and grouping the Airline and cities up before using the summarize function to perform the mathematics equation to get the percentages for comparison.

  4. Discrepancies: With all the percentages of overall flights for both airlines and flights for each cities, I will analyze if there are any observed inconsistencies and outliers in both airline company’s’ flight performances.

Execution

(1) Data Loading

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
flight_url <-read.csv("https://raw.githubusercontent.com/meiqing39/DATA-607/refs/heads/main/Assignment5/Airline_delays.csv") #loading created csv from github repository

flight_url  #view table 
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

(2) Data Tidying and Transformation

#Rename first 2 columns that R gave random names to

flights <- flight_url |> 
  rename("Airline" = X , "Status" = X.1) |> 
  mutate(Airline = na_if(Airline,"")) |> 
  filter(row_number() != 3) |>  # drop 3rd row with missing NA values
  fill(Airline, .direction = "down")  

(3) a. Comparison of AM West Airline vs Alaska Airline Delay Percentages

long_flights <- flights |>              
  pivot_longer(cols = 3:7, names_to = "City", values_to = "Count")  #create longer tables with new variable column names 

head(long_flights)
## # A tibble: 6 × 4
##   Airline Status  City          Count
##   <chr>   <chr>   <chr>         <int>
## 1 ALASKA  on time Los.Angeles     497
## 2 ALASKA  on time Phoenix         221
## 3 ALASKA  on time San.Diego       212
## 4 ALASKA  on time San.Francisco   503
## 5 ALASKA  on time Seattle        1841
## 6 ALASKA  delayed Los.Angeles      62
Delay_airlines <- long_flights |> 
  summarize(.by = c(Airline, Status), Total_Count = sum(Count)) |>  #sum of counts based both on Airline and Status
  group_by(Airline) |> 
  mutate(Flight_percentage = (Total_Count / sum(Total_Count)) * 100) |>  #sum of total count of Airlines (Delay + ontime total flights)
  filter(Status == "delayed" )
 
#Exemplary chart for visualization
ggplot (Delay_airlines, aes(x = Airline, y= Flight_percentage, fill = Airline))+
  geom_col()+
  labs(title = "Amwest Airline vs Alaska Airline Delay Percentages",
  y = "Percentage Delayed(%)")

When looking at the overall performance of the 2 airlines across all combined flights, AM West appears more reliable. Alaska airlines has an overall delay rate of 13.3%, whereas AM West has a lower overall delay rate of about 10.9%. Based ON these percentages, a passenger might conclude that AM West is less likely to be delayed.”

(3) b. Comparison of Airline City by City

cities_delays <- long_flights |>
 mutate(.by = c(Airline, City), Total_City_Flights = sum(Count),
    CityF_percentage = (Count / Total_City_Flights) * 100) |>
    filter(Status == "delayed")

# Exemplary bar chart of Airlines flight delay percentage per city 
ggplot(cities_delays, aes(x = City, y = CityF_percentage, fill = Airline)) +
  geom_col(position = "dodge") + # dodge position to put the 2 Airline bars next to each other  for each city
  labs(title = "Flight Delay Percentage by Airline and City",
       y = "Percentage Delayed (%)",
       x = "Destination City")

Breaking the data down to compare the airlines performance per city, it revealed a different story. When comparing the 2 airlines at each city, Alaska airlines actually performs better (lower delays percentage) in all 5 cities compared to AM West. For example, if you look at San Francisco, there is the biggest gap where AM West has a much higher delay rate than Alaska, which also holds true for Los Angeles, Phoenix, San Diego, and Seattle as well.

(4) Discrepancy: Simpson’s Paradox

The definition of the Simp’son Paradox is the statistical phenomenon where a trend or association observed in several separate groups of data disappears/reverses when these groups are aggregated. This is seen in the comparison of AM West and Alaska airline flight delay rates in overall vs city by city where there is a reversal of trend depending on if it is combined data in overall total or in individual cities. In my analysis for city by city, Alaska airline had a lower delay percentage in every single city compare to AM West, For example in Phoenix, Alaska’s delay rate was 5.2% while AM West was 7.9%. In the overall combined data of all cities for each Airline, AM West appears to have the “better” or lower delay rate (10.9%) compared to Alaska (13.3%). The reason of why in this case is according to the data, AM West flies a large number of flights (4840) into Phoenix, which is a low delay city for everyone. The 4840 on time flight is such a massive number that this number effects the rest of the collected data, thus pulling AM West’s average down to make them look better when analyzing the overall aggregated data, even though Alaska is more reliable in specific locations.

Conclusion

The analysis of the flight delays for AMWest and Alaska in this assignment shows the importance of data tidying and manipulation to perform statistical analysis. Initial observation where the aggregated overall data of flight delays of each Airline suggested that AM West was the superior performer with its 10.9% delay rate compared to Alaska’s 13.3% delay rate.

However, transformation the data from a wide to long table allowed me to see the more micro scale for each cities that completed the aggregated data and to calculate the percentage by destination, tells a very different story. Alaska airlines actually outperformed AM West in every single city in terms of delay rate. The overall discrepancy is a good demonstration of Simpson’s paradox: Am West high volume of flights in Phoenix skewed the global average which mask the fact that Alaska is more punctual on a city to city context. If a passenger is deciding to choose an airline based on timing reliability for any of the 5 cities, Alaska is statistically the better choice.