Intro

In this assignment, I will be tidying airline data and then analyzing to compare the arrival delays for the two airlines.

1. Loading Source Data

I created a .CSV file that includes all of the given information. It has been stored in my personal Github Repo, and I will be loading it from there.

flights <- read.csv("https://raw.githubusercontent.com/cdube89128/DATA-607/refs/heads/main/week-05/DATA607_Assignment5A_airline_delays.csv")

# Taking a look at the data
head(flights, 10)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

2. Tidying & Transforming the data

# Renaming Columns
colnames(flights) <- c("airline", "flight_status", "Los_Angeles", "Phoenix", "San_Diego", "San_Francisco", "Seattle")

# Removing blank rows
flights <- flights[-3,]

# Filling in Blank Airlines 
flights$airline[c(2,4)] <- c("ALASKA", "AM WEST")

#Restructuring destinations
tidy_flights <- flights %>%
    pivot_longer(
      cols = !(airline:flight_status),
      names_to = "destination",
      values_to = "num"
    )

kable(tidy_flights, caption = "Flights Data After Tidying")
Flights Data After Tidying
airline flight_status destination num
ALASKA on time Los_Angeles 497
ALASKA on time Phoenix 221
ALASKA on time San_Diego 212
ALASKA on time San_Francisco 503
ALASKA on time Seattle 1841
ALASKA delayed Los_Angeles 62
ALASKA delayed Phoenix 12
ALASKA delayed San_Diego 20
ALASKA delayed San_Francisco 102
ALASKA delayed Seattle 305
AM WEST on time Los_Angeles 694
AM WEST on time Phoenix 4840
AM WEST on time San_Diego 383
AM WEST on time San_Francisco 320
AM WEST on time Seattle 201
AM WEST delayed Los_Angeles 117
AM WEST delayed Phoenix 415
AM WEST delayed San_Diego 65
AM WEST delayed San_Francisco 129
AM WEST delayed Seattle 61

This is a lot tidier, but I think that with our goal for analysis, I would prefer to have the number of on time flights and the number of delayed flights as columns. Then I will use these for calculations.

transformed_flights <- tidy_flights %>%
  pivot_wider(
    id_cols = c("airline", "destination"),
    names_from = "flight_status",
    values_from = "num"
  )

colnames(transformed_flights)[3:4] <- c("on_time", "delayed")

kable(transformed_flights, caption = "Flights Data After Transforming for Analysis")
Flights Data After Transforming for Analysis
airline destination on_time delayed
ALASKA Los_Angeles 497 62
ALASKA Phoenix 221 12
ALASKA San_Diego 212 20
ALASKA San_Francisco 503 102
ALASKA Seattle 1841 305
AM WEST Los_Angeles 694 117
AM WEST Phoenix 4840 415
AM WEST San_Diego 383 65
AM WEST San_Francisco 320 129
AM WEST Seattle 201 61

3. Analysis

With the data in its present state, I can more easily compute the numbers I need for analysis.

summary_flights <- transformed_flights %>%
  group_by(airline) %>%
  summarise(
    total = sum(on_time + delayed),
    pct_delayed = round(sum(delayed) / total, 2)
  )

transformed_flights <- transformed_flights %>%
  mutate(
    total = on_time + delayed,
    pct_delayed = round(delayed / total, 2)
  ) %>%
  arrange(desc(pct_delayed))
kable(summary_flights, caption = "Flights by Airline With Calculated Delay Rate")
Flights by Airline With Calculated Delay Rate
airline total pct_delayed
ALASKA 3775 0.13
AM WEST 7225 0.11

From this look at the data, I am seeing a higher delay rate on average from Alaska Airlines at 13%, compared to AM West at 11%.

kable(transformed_flights, caption = "Flights by Airline & Destination With Calculated Delay Rate")
Flights by Airline & Destination With Calculated Delay Rate
airline destination on_time delayed total pct_delayed
AM WEST San_Francisco 320 129 449 0.29
AM WEST Seattle 201 61 262 0.23
ALASKA San_Francisco 503 102 605 0.17
AM WEST San_Diego 383 65 448 0.15
ALASKA Seattle 1841 305 2146 0.14
AM WEST Los_Angeles 694 117 811 0.14
ALASKA Los_Angeles 497 62 559 0.11
ALASKA San_Diego 212 20 232 0.09
AM WEST Phoenix 4840 415 5255 0.08
ALASKA Phoenix 221 12 233 0.05

However, on a city by city basis, it is AM West that tends to show a higher delay rate. This surprising result may stem from the fact that AM West has almost double the number of total flights that Alaska Airlines does. In particular, it has over 20x as many flights going to Phoenix specifically. I am going to look into this really quickly by viewing the average delay rate excluding Phoenix.

summary_wo_phoenix <- transformed_flights[transformed_flights$destination != "Phoenix",1:4] %>%
  group_by(airline) %>%
  summarise(
    total = sum(on_time + delayed),
    pct_delayed = round(sum(delayed) / total, 2)
  )

kable(summary_wo_phoenix, caption = "Flights by Airline With Calculated Delay Rate, Excluding Phoenix")
Flights by Airline With Calculated Delay Rate, Excluding Phoenix
airline total pct_delayed
ALASKA 3542 0.14
AM WEST 1970 0.19

Yes, now I can see that the high number of flights by AM West to Phoenix in comparison to Alaska Airlines is what drives its overall average delay rate down. When I remove Phoenix, I am seeing a higher delay rate on average from AM West at 19%, whereas Alaska Airlines is at 14% (close to what they were before).

I want to take one quick visual look at this city-by-city basis.

# Re-order data for the bar chart
ordered_flights <- transformed_flights %>%
  group_by(destination) %>%
  summarise(avg_delay = mean(pct_delayed)) %>%
  arrange(avg_delay) %>%
  pull(destination)

transformed_flights <- transformed_flights %>%
  mutate(destination = factor(destination, levels = ordered_flights))

ggplot(transformed_flights, aes(x = destination, y = pct_delayed, fill = airline)) +
  geom_col(position = "dodge") +
  labs(title = "Proportion of Delayed Flights by Airline & Destination",
       y = "Delay Rate",
       x = "Destination") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal()

This makes it abundantly clear that when segmented by destination, AM West experiences a higher rate of delays than Alaska Airlines. This is despite the fact that when averaging all flights, it was Alaska Airlines that has the higher rate of delay.