In this assignment, I will be tidying airline data and then analyzing to compare the arrival delays for the two airlines.
I created a .CSV file that includes all of the given information. It has been stored in my personal Github Repo, and I will be loading it from there.
flights <- read.csv("https://raw.githubusercontent.com/cdube89128/DATA-607/refs/heads/main/week-05/DATA607_Assignment5A_airline_delays.csv")
# Taking a look at the data
head(flights, 10)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
# Renaming Columns
colnames(flights) <- c("airline", "flight_status", "Los_Angeles", "Phoenix", "San_Diego", "San_Francisco", "Seattle")
# Removing blank rows
flights <- flights[-3,]
# Filling in Blank Airlines
flights$airline[c(2,4)] <- c("ALASKA", "AM WEST")
#Restructuring destinations
tidy_flights <- flights %>%
pivot_longer(
cols = !(airline:flight_status),
names_to = "destination",
values_to = "num"
)
kable(tidy_flights, caption = "Flights Data After Tidying")
airline | flight_status | destination | num |
---|---|---|---|
ALASKA | on time | Los_Angeles | 497 |
ALASKA | on time | Phoenix | 221 |
ALASKA | on time | San_Diego | 212 |
ALASKA | on time | San_Francisco | 503 |
ALASKA | on time | Seattle | 1841 |
ALASKA | delayed | Los_Angeles | 62 |
ALASKA | delayed | Phoenix | 12 |
ALASKA | delayed | San_Diego | 20 |
ALASKA | delayed | San_Francisco | 102 |
ALASKA | delayed | Seattle | 305 |
AM WEST | on time | Los_Angeles | 694 |
AM WEST | on time | Phoenix | 4840 |
AM WEST | on time | San_Diego | 383 |
AM WEST | on time | San_Francisco | 320 |
AM WEST | on time | Seattle | 201 |
AM WEST | delayed | Los_Angeles | 117 |
AM WEST | delayed | Phoenix | 415 |
AM WEST | delayed | San_Diego | 65 |
AM WEST | delayed | San_Francisco | 129 |
AM WEST | delayed | Seattle | 61 |
This is a lot tidier, but I think that with our goal for analysis, I would prefer to have the number of on time flights and the number of delayed flights as columns. Then I will use these for calculations.
transformed_flights <- tidy_flights %>%
pivot_wider(
id_cols = c("airline", "destination"),
names_from = "flight_status",
values_from = "num"
)
colnames(transformed_flights)[3:4] <- c("on_time", "delayed")
kable(transformed_flights, caption = "Flights Data After Transforming for Analysis")
airline | destination | on_time | delayed |
---|---|---|---|
ALASKA | Los_Angeles | 497 | 62 |
ALASKA | Phoenix | 221 | 12 |
ALASKA | San_Diego | 212 | 20 |
ALASKA | San_Francisco | 503 | 102 |
ALASKA | Seattle | 1841 | 305 |
AM WEST | Los_Angeles | 694 | 117 |
AM WEST | Phoenix | 4840 | 415 |
AM WEST | San_Diego | 383 | 65 |
AM WEST | San_Francisco | 320 | 129 |
AM WEST | Seattle | 201 | 61 |
With the data in its present state, I can more easily compute the numbers I need for analysis.
summary_flights <- transformed_flights %>%
group_by(airline) %>%
summarise(
total = sum(on_time + delayed),
pct_delayed = round(sum(delayed) / total, 2)
)
transformed_flights <- transformed_flights %>%
mutate(
total = on_time + delayed,
pct_delayed = round(delayed / total, 2)
) %>%
arrange(desc(pct_delayed))
kable(summary_flights, caption = "Flights by Airline With Calculated Delay Rate")
airline | total | pct_delayed |
---|---|---|
ALASKA | 3775 | 0.13 |
AM WEST | 7225 | 0.11 |
From this look at the data, I am seeing a higher delay rate on average from Alaska Airlines at 13%, compared to AM West at 11%.
kable(transformed_flights, caption = "Flights by Airline & Destination With Calculated Delay Rate")
airline | destination | on_time | delayed | total | pct_delayed |
---|---|---|---|---|---|
AM WEST | San_Francisco | 320 | 129 | 449 | 0.29 |
AM WEST | Seattle | 201 | 61 | 262 | 0.23 |
ALASKA | San_Francisco | 503 | 102 | 605 | 0.17 |
AM WEST | San_Diego | 383 | 65 | 448 | 0.15 |
ALASKA | Seattle | 1841 | 305 | 2146 | 0.14 |
AM WEST | Los_Angeles | 694 | 117 | 811 | 0.14 |
ALASKA | Los_Angeles | 497 | 62 | 559 | 0.11 |
ALASKA | San_Diego | 212 | 20 | 232 | 0.09 |
AM WEST | Phoenix | 4840 | 415 | 5255 | 0.08 |
ALASKA | Phoenix | 221 | 12 | 233 | 0.05 |
However, on a city by city basis, it is AM West that tends to show a higher delay rate. This surprising result may stem from the fact that AM West has almost double the number of total flights that Alaska Airlines does. In particular, it has over 20x as many flights going to Phoenix specifically. I am going to look into this really quickly by viewing the average delay rate excluding Phoenix.
summary_wo_phoenix <- transformed_flights[transformed_flights$destination != "Phoenix",1:4] %>%
group_by(airline) %>%
summarise(
total = sum(on_time + delayed),
pct_delayed = round(sum(delayed) / total, 2)
)
kable(summary_wo_phoenix, caption = "Flights by Airline With Calculated Delay Rate, Excluding Phoenix")
airline | total | pct_delayed |
---|---|---|
ALASKA | 3542 | 0.14 |
AM WEST | 1970 | 0.19 |
Yes, now I can see that the high number of flights by AM West to Phoenix in comparison to Alaska Airlines is what drives its overall average delay rate down. When I remove Phoenix, I am seeing a higher delay rate on average from AM West at 19%, whereas Alaska Airlines is at 14% (close to what they were before).
I want to take one quick visual look at this city-by-city basis.
# Re-order data for the bar chart
ordered_flights <- transformed_flights %>%
group_by(destination) %>%
summarise(avg_delay = mean(pct_delayed)) %>%
arrange(avg_delay) %>%
pull(destination)
transformed_flights <- transformed_flights %>%
mutate(destination = factor(destination, levels = ordered_flights))
ggplot(transformed_flights, aes(x = destination, y = pct_delayed, fill = airline)) +
geom_col(position = "dodge") +
labs(title = "Proportion of Delayed Flights by Airline & Destination",
y = "Delay Rate",
x = "Destination") +
scale_y_continuous(labels = scales::percent_format()) +
theme_minimal()
This makes it abundantly clear that when segmented by destination, AM West experiences a higher rate of delays than Alaska Airlines. This is despite the fact that when averaging all flights, it was Alaska Airlines that has the higher rate of delay.