Assignment 5A: Airline Delays

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(ggplot2)

Reading in the data

data_url <- "https://raw.githubusercontent.com/Jeovany97/Data-607/refs/heads/main/Assignment%205a/airline_delays.csv"

raw_data <- read_csv(data_url, na = c("", "NA"))
Rows: 5 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Airline, Status
dbl (5): Los Angeles, Phoenix, San Diego, San Francisco, Seattle

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
clean_data <- raw_data %>%
  #Removing empty row & filling in missing Airline names
  filter(!if_all(everything(),is.na)) %>% 
  fill(Airline, .direction = "down")

Tidying the data

tidy_flights <- clean_data %>%
  pivot_longer(
    cols = c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
    names_to = "City",
    values_to = "Flights"
  )

Calculations

Total number of delayed and on time flights

airline_summary <- tidy_flights %>%
  group_by(Airline, Status) %>%
  summarise(total_flights = sum(Flights), .groups = "drop")

kable(airline_summary, caption = "Summary of Airline Delays")
Summary of Airline Delays
Airline Status total_flights
ALASKA delayed 501
ALASKA on time 3274
AM WEST delayed 787
AM WEST on time 6438

Comparing the number of delayed and on time for each Airline

delay_comparison <- tidy_flights %>%
  group_by(Airline, Status) %>%
  summarise(total = sum(Flights), .groups = "drop") %>%
  pivot_wider(names_from = Status, values_from = total) %>%
  mutate(
    total_flights = `on time` + delayed,
    delay_rate = delayed / total_flights
  )

kable(delay_comparison, caption = "Comparsion of delays between each airline")
Comparsion of delays between each airline
Airline delayed on time total_flights delay_rate
ALASKA 501 3274 3775 0.1327152
AM WEST 787 6438 7225 0.1089273

Based on the data given we can see the AM West currently has a much larger operation nearly double the total flights. Despite this AM WEST manages to maintain a slightly lower percentage of delays. Rounding both delay rates to the hundredth that mean a ALASKA customer has 22% increase chance to experience a delay.

Comparing the the delay and arrive time for each City

city_delay_rates <- tidy_flights %>%
  group_by(Airline, City, Status) %>%
  summarise(total = sum(Flights), .groups = "drop") %>%
  pivot_wider(names_from = Status, values_from = total) %>%
  mutate(
    total_flights = `on time` + delayed,
    delay_rate = delayed / total_flights
  )
kable(city_delay_rates, caption = "Comparsion of delays between each city")
Comparsion of delays between each city
Airline City delayed on time total_flights delay_rate
ALASKA Los Angeles 62 497 559 0.1109123
ALASKA Phoenix 12 221 233 0.0515021
ALASKA San Diego 20 212 232 0.0862069
ALASKA San Francisco 102 503 605 0.1685950
ALASKA Seattle 305 1841 2146 0.1421249
AM WEST Los Angeles 117 694 811 0.1442663
AM WEST Phoenix 415 4840 5255 0.0789724
AM WEST San Diego 65 383 448 0.1450893
AM WEST San Francisco 129 320 449 0.2873051
AM WEST Seattle 61 201 262 0.2328244

While previous data suggested AM WEST to be the better airline this city specific data reveals that ALASKA has the lower data. ALASKA beats AM WEST in every city. The reason AM WEST appeared better is because the concentration in phoenix. AM WEST concentrated 73% of all their flights in the city that has lowest average. This volume bias poisoned the previous data.

Creating graph to show difference between city flights

# Creating graph
ggplot(city_delay_rates, aes(x = City, y = total_flights, fill = Airline)) +
  geom_col(position = "dodge") +
  # Titles and labels
  labs(
    title = "Flight Volume by City and Airline",
    subtitle = "Comparing the number of flights for each city",
    x = "Destination City",
    y = "Total Number of Flights",
    fill = "Airline"
  ) +
  # Clean up the theme and rotate city names for better readability
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))