This assignment is asking to take the chart provided below, that describes arrival delays for two airlines across five destinations, and perform an analysis to compare the arrival delays for the two airlines.
The chart is presented in an image that would ultimately need to be made into a file that can be read more easily and then tidied and transformed to properly perform the analysis.
My approach is to recreate the chart in excel to output a .csv file using the wide format that it is in. I will then load it into my github repository so that I can reference it in RStudio to be read and will use a combination of tidyr and dplyr to clean and transform the data. Once the data is in a better format, then I will compare the arrival delays for each airline across the various cities and use different summary statistics to accomplish this.
I took the data presented and recreated it in a .csv file to use in this assignment. Which can be found in my github repository (https://github.com/DRA-SPS27/DATA607-Week-5-Assignments/tree/main). I am presenting a glimpse of that data below:
# Read .csv file
url<-"https://raw.githubusercontent.com/DRA-SPS27/DATA607-Week-5-Assignments/refs/heads/main/D.Atherley%20-%20Airline%20Delays%20(%235A).csv"
untidy_airline<-read.csv(url)
glimpse(untidy_airline)
## Rows: 5
## Columns: 7
## $ X <chr> "ALASKA", "", "", "AM WEST", ""
## $ X.1 <chr> "on time", "delayed", "", "on time", "delayed"
## $ Los.Angeles <int> 497, 62, NA, 694, 117
## $ Phoenix <int> 221, 12, NA, 4840, 415
## $ San.Diego <int> 212, 20, NA, 383, 65
## $ San.Francisco <int> 503, 102, NA, 320, 129
## $ Seattle <int> 1841, 305, NA, 201, 61
To get to the format I need to analyze this data, I will first need to tidy it by reformatting it and populating any missing data.
# Remove the blank row between the two airlines
untidy_airline <- untidy_airline[c(1,2,4,5), ]
untidy_airline
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
# Populate missing data. Establish new columns for Airline and Status and fill in values where missing
untidy_airline[c(2,4),1] <- untidy_airline[c(1,3),1]
colnames(untidy_airline)[1:2] <-c('Airline', 'Status')
untidy_airline
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
Now that all of the rows and columns have values, I will transform the data to long format to facilitate the analysis process. To do this, I will need to make the observations my rows and the variables my columns.
flights_long_form <- untidy_airline %>%
pivot_longer(
cols = Los.Angeles:Seattle,
names_to = "City",
values_to = "Count"
) %>%
mutate(City = str_replace_all(City, "\\.", " "))
print(flights_long_form)
## # A tibble: 20 × 4
## Airline Status City Count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San Diego 20
## 9 ALASKA delayed San Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San Diego 383
## 14 AM WEST on time San Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San Diego 65
## 19 AM WEST delayed San Francisco 129
## 20 AM WEST delayed Seattle 61
I will determine the overall flight status for both airlines.
Airlines_summary <- flights_long_form %>%
group_by(Airline, Status) %>%
summarise(Total = sum(Count), .groups = "drop") %>%
group_by(Airline) %>%
mutate(
Grand_Total = sum(Total),
Percent = round(Total/Grand_Total * 100,2)
) %>%
ungroup()
Airlines_summary
## # A tibble: 4 × 5
## Airline Status Total Grand_Total Percent
## <chr> <chr> <int> <int> <dbl>
## 1 ALASKA delayed 501 3775 13.3
## 2 ALASKA on time 3274 3775 86.7
## 3 AM WEST delayed 787 7225 10.9
## 4 AM WEST on time 6438 7225 89.1
Now that I have my overall totals for both on time and delayed flights, I will isolate just the values for the delayed flights to use in my graph comparison.
Airline_delays <- Airlines_summary %>%
filter(Status == "delayed") %>%
select(Airline, Delayed = Total, Grand_Total, Delay_Percent = Percent)
Airline_delays
## # A tibble: 2 × 4
## Airline Delayed Grand_Total Delay_Percent
## <chr> <int> <int> <dbl>
## 1 ALASKA 501 3775 13.3
## 2 AM WEST 787 7225 10.9
I will now visually display the delayed percentages.
ggplot(Airline_delays, aes(x = Airline, y = Delay_Percent, fill = Airline)) +
geom_col() +
geom_text(aes(label = paste0(round(Delay_Percent, 1), "%")), vjust = -0.5, size = 4) +
theme_minimal() +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Percentage of Delayed Flights by Airline",
subtitle = "Aggregated overall across 5 city destinations",
x = "Airline",
y = "Delay Percentage (%)"
) +
theme(legend.position = "none")
In looking at the aggregate comparison, I can conclude that AM West Airline had less delays than Alaska Airlines. Alaska Airlines was delayed overall 13.3% of the time, while AM West Airline was only delayed 10.9% of the time. If I was a traveler, I would certainly take this into consideration when booking a flight and will most likely end up booking with AM West Airline.
I will now compare the flight delays percentage for the two airlines across the five cities to determine what the discrepancies are in each city. To do that, I will need to create a new data frame for the delays in each city grouped by each airline.
city_flights_summary <- flights_long_form %>%
group_by(Airline, City) %>%
mutate(City_Total = sum(Count)) %>%
ungroup() %>%
filter(Status == "delayed") %>%
mutate(Delay_Percent = round(Count / City_Total * 100,2)) %>%
select(Airline, City, Delayed = Count, City_Total, Delay_Percent)
city_flights_summary <- city_flights_summary %>% arrange(City, Airline)
city_flights_summary
## # A tibble: 10 × 5
## Airline City Delayed City_Total Delay_Percent
## <chr> <chr> <int> <int> <dbl>
## 1 ALASKA Los Angeles 62 559 11.1
## 2 AM WEST Los Angeles 117 811 14.4
## 3 ALASKA Phoenix 12 233 5.15
## 4 AM WEST Phoenix 415 5255 7.9
## 5 ALASKA San Diego 20 232 8.62
## 6 AM WEST San Diego 65 448 14.5
## 7 ALASKA San Francisco 102 605 16.9
## 8 AM WEST San Francisco 129 449 28.7
## 9 ALASKA Seattle 305 2146 14.2
## 10 AM WEST Seattle 61 262 23.3
I will now visually display the delayed percentages for each city.
ggplot(city_flights_summary, aes(x = City, y = Delay_Percent, fill = Airline)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7) +
geom_text(aes(label = paste0(round(Delay_Percent, 1), "%")),
position = position_dodge(width = 0.8),
vjust = -0.5,
size = 3.5,
fontface = "bold") +
labs(
title = "Flight Delay Percentage by City and Airline",
subtitle = "Alaska Airlines vs. AM West across 5 Cities",
x = "Destination City",
y = "Percentage of Flights Delayed (%)",
fill = "Airline"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid.major.x = element_blank()
) +
scale_y_continuous(limits = c(0, max(city_flights_summary$Delay_Percent) + 5))
Looking at the results of my bar graph, I can see that across all 5 cities Alaska Airlines shows a lower percentage of delays. This is a very interesting finding because my results from the overall comparison showed that Alaska Airlines is often more delayed, but these results show otherwise. Seeing these results as a traveler, I would feel more comfortable knowing that in any of these cities, flying with Alaska Airline will prove to include a flight that is not delayed.
This assignment really highlighted Simpson’s Paradox, the statistical phenomenon where a trend or relationship observed within several distinct groups of data reverses or disappears when the groups are combined. In this example, even though Alaska Airlines showed to be less delayed across all of the 5 cities, what was also true is that there are often less Alaska Airline flights that fly out of those cities, compared to AM West. This reveals that the aggregated data completely masked this variable and made our results inaccurate and not representative of the actual relationship present.