In this assignment, I recreating a chart that describes the arrival delays of two airlines, Alaska and Amwest, across 5 destinations. The goal is to tidy and transform by using tidyr and dplyr functions on this wide structure data set. Once the table is in tidy format, I will perform the following analysis on the two airlines:
Percentage of either delays/arrival rates for Alaska vs Amwest overall in a table
Percentage of either delays/arrival rates for Alaska vs Amwest across 5 cities in a table
Discrepancies of Alaska vs Amwest flight performances city by city and overall (Descriptions and Explanation)
Source:
Data creation and loading: Chart was created on excel and converted into a .csv file before being upload on to my Github repository. After the chart is loaded on to R using the raw text file URL on Github.
Data Tidying and Transformation: Using tidyr and dplyr function, I plan on to first creating a header name on the first row for the airline and their status, fill in the missing blank cells, remove the NA row, then change the wide structure of this table into a longer structure to get the tidy format.
Comparison: Now that the table is in a long tidy format, I can proceed with data analysis by grouping the Airlines (Alaska and AM West) then summarize to get either the delay/arrival rates of the 2 airlines’ overall flight for comparison. After I want to know what is the difference between the two for each cities listed which I will be combine and grouping the Airline and cities up before using the summarize function to perform the mathematics equation to get the percentages for comparison.
Discrepancies: With all the percentages of overall flights for both airlines and flights for each cities, I will analyze if there are any observed inconsistencies and outliers in both airline company’s’ flight performances.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
flight_url <-read.csv("https://raw.githubusercontent.com/meiqing39/DATA-607/refs/heads/main/Assignment5/Airline_delays.csv") #loading created csv from github repository
flight_url #view table
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
#Rename first 2 columns that R gave random names to
flights <- flight_url |>
rename("Airline" = X , "Status" = X.1) |>
mutate(Airline = na_if(Airline,"")) |>
filter(row_number() != 3) |> # drop 3rd row with missing NA values
fill(Airline, .direction = "down")
long_flights <- flights |>
pivot_longer(cols = 3:7, names_to = "City", values_to = "Count") #create longer tables with new variable column names
head(long_flights)
## # A tibble: 6 × 4
## Airline Status City Count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
Delay_airlines <- long_flights |>
summarize(.by = c(Airline, Status), Total_Count = sum(Count)) |> #sum of counts based both on Airline and Status
group_by(Airline) |>
mutate(Flight_percentage = (Total_Count / sum(Total_Count)) * 100) |> #sum of total count of Airlines (Delay + ontime total flights)
filter(Status == "delayed" )
#Exemplary chart for visualization
ggplot (Delay_airlines, aes(x = Airline, y= Flight_percentage, fill = Airline))+
geom_col()+
labs(title = "Amwest Airline vs Alaska Airline Delay Percentages",
y = "Percentage Delayed(%)")
When looking at the overall performance of the 2 airlines across all combined flights, AM West appears more reliable. Alaska airlines has an overall delay rate of 13.3%, whereas AM West has a lower overall delay rate of about 10.9%. Based ON these percentages, a passenger might conclude that AM West is less likely to be delayed.”
cities_delays <- long_flights |>
mutate(.by = c(Airline, City), Total_City_Flights = sum(Count),
CityF_percentage = (Count / Total_City_Flights) * 100) |>
filter(Status == "delayed")
# Exemplary bar chart of Airlines flight delay percentage per city
ggplot(cities_delays, aes(x = City, y = CityF_percentage, fill = Airline)) +
geom_col(position = "dodge") + # dodge position to put the 2 Airline bars next to each other for each city
labs(title = "Flight Delay Percentage by Airline and City",
y = "Percentage Delayed (%)",
x = "Destination City")
Breaking the data down to compare the airlines performance per city, it revealed a different story. When comparing the 2 airlines at each city, Alaska airlines actually performs better (lower delays percentage) in all 5 cities compared to AM West. For example, if you look at San Francisco, there is the biggest gap where AM West has a much higher delay rate than Alaska, which also holds true for Los Angeles, Phoenix, San Diego, and Seattle as well.
The definition of the Simp’son Paradox is the statistical phenomenon where a trend or association observed in several separate groups of data disappears/reverses when these groups are aggregated. This is seen in the comparison of AM West and Alaska airline flight delay rates in overall vs city by city where there is a reversal of trend depending on if it is combined data in overall total or in individual cities. In my analysis for city by city, Alaska airline had a lower delay percentage in every single city compare to AM West, For example in Phoenix, Alaska’s delay rate was 5.2% while AM West was 7.9%. In the overall combined data of all cities for each Airline, AM West appears to have the “better” or lower delay rate (10.9%) compared to Alaska (13.3%). The reason of why in this case is according to the data, AM West flies a large number of flights (4840) into Phoenix, which is a low delay city for everyone. The 4840 on time flight is such a massive number that this number effects the rest of the collected data, thus pulling AM West’s average down to make them look better when analyzing the overall aggregated data, even though Alaska is more reliable in specific locations.
The analysis of the flight delays for AMWest and Alaska in this assignment shows the importance of data tidying and manipulation to perform statistical analysis. Initial observation where the aggregated overall data of flight delays of each Airline suggested that AM West was the superior performer with its 10.9% delay rate compared to Alaska’s 13.3% delay rate.
However, transformation the data from a wide to long table allowed me to see the more micro scale for each cities that completed the aggregated data and to calculate the percentage by destination, tells a very different story. Alaska airlines actually outperformed AM West in every single city in terms of delay rate. The overall discrepancy is a good demonstration of Simpson’s paradox: Am West high volume of flights in Phoenix skewed the global average which mask the fact that Alaska is more punctual on a city to city context. If a passenger is deciding to choose an airline based on timing reliability for any of the 5 cities, Alaska is statistically the better choice.