To complete this assignment, I will first recreate the airline delay data in a wide-format CSV file that mirrors the original table structure with two airlines, five cities, and counts for on-time and delayed flights. I will then import the dataset into R and use tidyr and dplyr to transform the data from wide format to long format so that airline, city, flight status, and counts are organized into clear variables. After tidying the data, I will calculate total flights and compute delay percentages both overall and for each city. I will present the results using tables and/or visualizations and include written explanations interpreting the findings. Finally, I will explain the difference between the overall airline comparison and the city-by-city comparison.
One challenge may be correctly restructuring the dataset from wide to long format without mislabeling airlines or cities. I may also encounter formatting issues, such as numeric values being read as characters (especially if they contain commas), which will require cleaning and conversion. Preserving any missing or empty cells from the original structure could also require careful handling. Additionally, calculating percentages accurately will be important, since comparing raw counts alone could lead to misleading conclusions. Finally, explaining the discrepancy between overall results and city-level results may require careful interpretation to clearly demonstrate how aggregated data can produce different conclusions than grouped data.
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("tidyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
airlines_wide <- data.frame(
Airline = c("Alaska", "Alaska", "AM West", "AM West"),
Status = c("On Time", "Delayed", "On Time", "Delayed"),
Los_Angeles = c(497, 62, 694, 117),
Phoenix = c(221, 12, 4840, 415),
San_Diego = c(212, 20, 383, 65),
San_Francisco = c(503, 102, 320, 129),
Seattle = c(1841, 305, 201, 61)
)
airlines_long <- airlines_wide %>%
pivot_longer(
cols = Los_Angeles:Seattle,
names_to = "City",
values_to = "Flights"
)
head(airlines_long)
overall_summary <- airlines_long %>%
group_by(Airline, Status) %>%
summarise(Total = sum(Flights), .groups = "drop") %>%
group_by(Airline) %>%
mutate(
Total_Flights = sum(Total),
Percent = round((Total / Total_Flights) * 100, 2)
)
overall_summary
overall_delays <- overall_summary %>%
filter(Status == "Delayed")
overall_delays
city_summary <- airlines_long %>%
group_by(Airline, City, Status) %>%
summarise(Total = sum(Flights), .groups = "drop") %>%
group_by(Airline, City) %>%
mutate(
Total_Flights = sum(Total),
Percent = round((Total / Total_Flights) * 100, 2)
) %>%
filter(Status == "Delayed")
city_summary
ggplot(city_summary, aes(x = City, y = Percent, fill = Airline)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
geom_text(aes(label = paste0(Percent, "%")),
position = position_dodge(width = 0.9),
vjust = -0.3,
size = 3.5) +
labs(title = "Delay Percentage by City",
y = "Delay Percentage",
x = "City") +
theme_minimal()
After transforming the data from wide to long format, I calculated the overall delay percentages for each airline by dividing total delayed flights by total flights. The results show that AM West has a lower overall delay percentage (approximately 10.9%) compared to Alaska (approximately 13.4%). Based on the aggregated totals alone, AM West appears to perform better in terms of on-time arrivals. However, this comparison uses combined totals across all cities and does not account for how flights are distributed across destinations.
When calculating delay percentages separately for each of the five cities (Los Angeles, Phoenix, San Diego, San Francisco, and Seattle), a different pattern emerges. In every individual city, Alaska has a lower delay percentage than AM West. This means that when performance is evaluated within each destination, Alaska consistently outperforms AM West. The bar chart clearly shows that Alaska’s delay rates are lower across all five cities, even though the overall totals suggested the opposite conclusion.
##Discrepancy Between Overall and City Level Comparisons
The discrepancy between the overall comparison and the city-by-city comparison occurs because the two airlines operate different numbers of flights in different cities. AM West operates a large proportion of its flights in Phoenix, which has relatively low delay rates overall. Alaska, on the other hand, operates more flights in cities with higher delay rates. When the data is aggregated, these differences in flight distribution influence the overall percentages and make AM West appear better. However, when comparing within each city, Alaska performs better consistently.
This situation is an example of Simpson’s Paradox, where a trend observed within multiple groups reverses when the groups are combined. It demonstrates why analyzing percentages within subgroups is critical before drawing conclusions from aggregated data.
In conclusion, although AM West appears to have better overall on-time performance when examining total percentages, Alaska actually performs better in every individual city. This assignment highlights how aggregated data can be misleading and reinforces the importance of analyzing subgroup performance before making overall comparisons.