Week 3 Data Dive

# Load libraries
library(dplyr)
library(ggplot2)
library(tidyr)
library(scales)

Airline_Delay_Post_COVID_2021_2023 <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")

Delay by Airline

In this section, the data is grouped by airline. For each airline, the average arrival delay and the number of observations are calculated. An airline's share of all observations estimates the probability that a randomly selected row from the dataset belongs to that airline. Airlines whose observation count falls below the first quartile are tagged as low-probability groups.

airline_summary <- Airline_Delay_Post_COVID_2021_2023 %>%
  group_by(carrier_name) %>%
  summarise(
    avg_arr_delay = mean(arr_delay, na.rm = TRUE),
    flight_count = n()  # rows (observations) per airline, not individual flights
  ) %>%
  mutate(
    probability_tag = ifelse(
      flight_count < quantile(flight_count, 0.25),
      "Low Probability", "Common"
    )
  ) %>%
  arrange(flight_count)

ggplot(airline_summary,
       aes(x = reorder(carrier_name, avg_arr_delay),
           y = avg_arr_delay,
           fill = probability_tag)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Average Arrival Delay by Airline",
    x = "Airline",
    y = "Average Arrival Delay (minutes)"
  ) +
  theme_minimal()

Airlines with the fewest observations represent low-probability events. If a row is randomly selected from the dataset, it is unlikely to belong to these airlines. These airlines may be smaller carriers or airlines with limited post-COVID operations.
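The "probability" wording can be made concrete by dividing each airline's row count by the total number of rows. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical, not the real dataset); it assumes only the `carrier_name` column used above, and `row_probability` is an illustrative name.

```r
library(dplyr)

# Hypothetical toy data: one row per observation, mirroring the real layout
toy <- data.frame(
  carrier_name = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "C")
)

airline_prob <- toy %>%
  count(carrier_name, name = "obs_count") %>%
  # Empirical probability that a randomly drawn row belongs to this airline
  mutate(row_probability = obs_count / sum(obs_count)) %>%
  arrange(row_probability)

airline_prob
```

The probabilities sum to 1 across airlines, so the rarest carriers can be read directly from the top of the sorted table.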

Hypothesis

Airlines in the “Low Probability” group will have a lower count of unique destination airports than “Common” airlines. This can be measured by counting unique airport values per carrier.
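One way this hypothesis could be checked is sketched below. It runs on made-up data (`toy` is hypothetical) and assumes that `airport_name` identifies the arrival airport; the tagging rule is the same first-quartile cutoff used in the airline summary.

```r
library(dplyr)

# Hypothetical toy data; the real check would start from the full dataset
toy <- data.frame(
  carrier_name = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D"),
  airport_name = c("X", "Y", "Z", "W", "X", "Y", "Y", "X", "X", "X")
)

airport_reach <- toy %>%
  group_by(carrier_name) %>%
  summarise(
    obs_count = n(),
    unique_airports = n_distinct(airport_name)  # distinct airports served
  ) %>%
  mutate(
    probability_tag = ifelse(
      obs_count < quantile(obs_count, 0.25),
      "Low Probability", "Common"
    )
  )

# Compare the typical number of airports served in each group
reach_by_tag <- airport_reach %>%
  group_by(probability_tag) %>%
  summarise(median_airports = median(unique_airports))

reach_by_tag
```

The median is used here because a handful of very large carriers could otherwise dominate a mean; that choice is an assumption, not part of the original analysis.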

Question

Do low-probability airlines have longer delays because they operate fewer routes with less buffer time, or are delays concentrated at specific airports? Could network structure explain unusual delays?

Delays by Airport

The data is grouped by airport to analyze how arrival delays differ across locations. Airports whose observation count falls below the first quartile are tagged as low-probability airports.

airport_summary <- Airline_Delay_Post_COVID_2021_2023 %>%
  group_by(airport_name) %>%
  summarise(
    avg_arr_delay = mean(arr_delay, na.rm = TRUE),
    avg_delayed_flights = mean(arr_del15, na.rm = TRUE),
    obs_count = n()
  ) %>%
  arrange(obs_count)
airport_summary <- airport_summary %>%
  mutate(
    probability_tag = ifelse(
      obs_count < quantile(obs_count, 0.25),
      "Low Probability", "Common"
    )
  )

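# airport_summary is sorted by obs_count ascending, so slice_head() keeps the 20 least-observed airports
ggplot(airport_summary %>% slice_head(n = 20),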
       aes(x = reorder(airport_name, avg_arr_delay),
           y = avg_arr_delay,
           fill = probability_tag)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Arrival Delays: Least-Observed Airports",
    x = "Airport",
    y = "Average Arrival Delay (minutes)",
    fill = "Airport Type"
  ) +
  theme_minimal()

Low-probability airports are likely smaller or regional airports with fewer scheduled flights. They appear less frequently in the dataset and are unlikely to be selected at random.

Hypothesis

Airports tagged as “Low Probability” will have fewer than 10 scheduled arrivals per day on average. This can be measured by dividing each airport's total arr_flights by the number of days in the months observed for that airport.
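A rough sketch of that calculation follows. It uses made-up data (`toy` is hypothetical) and assumes one row per carrier, airport, and month, with a `year` column alongside `month`; `days_in_month()` is an illustrative helper, not a function from the original analysis.

```r
library(dplyr)

# Hypothetical toy data: one row per carrier-airport-month (assumed layout)
toy <- data.frame(
  airport_name = c("Small", "Small", "Big", "Big"),
  year         = c(2022, 2022, 2022, 2022),
  month        = c(1, 2, 1, 2),
  arr_flights  = c(120, 100, 3100, 2800)
)

# Illustrative helper: days in each month, vectorized, leap years included
days_in_month <- function(year, month) {
  first_day  <- as.Date(sprintf("%d-%02d-01", as.integer(year), as.integer(month)))
  next_year  <- ifelse(month == 12, year + 1, year)
  next_month <- ifelse(month == 12, 1, month + 1)
  first_next <- as.Date(sprintf("%d-%02d-01", as.integer(next_year), as.integer(next_month)))
  as.integer(first_next - first_day)
}

arrivals_per_day <- toy %>%
  # Collapse carriers first so each airport-month contributes its days once
  group_by(airport_name, year, month) %>%
  summarise(arr_flights = sum(arr_flights, na.rm = TRUE), .groups = "drop") %>%
  mutate(days = days_in_month(year, month)) %>%
  group_by(airport_name) %>%
  summarise(avg_daily_arrivals = sum(arr_flights) / sum(days))

arrivals_per_day
```

Joining `arrivals_per_day` to the tagged airport summary would then show how many low-probability airports fall under the 10-per-day threshold.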

Question

Are low-probability airports experiencing higher delays due to limited staffing, fewer gates, or weather exposure? Does airport size correlate with both volume and average delays?

Seasonal Delays

To analyze time-based patterns, months were grouped into seasons. This allows us to determine whether delays are more common during specific times of the year.

Airline_Delay_Post_COVID_2021_2023 <- Airline_Delay_Post_COVID_2021_2023 %>%
  mutate(season = case_when(
    month %in% c(12, 1, 2) ~ "Winter",
    month %in% c(3, 4, 5) ~ "Spring",
    month %in% c(6, 7, 8) ~ "Summer",
    month %in% c(9, 10, 11) ~ "Fall",
    TRUE ~ NA_character_  # leave missing or invalid months unlabeled
  ))

season_summary <- Airline_Delay_Post_COVID_2021_2023 %>%
  group_by(season) %>%
  summarise(
    avg_arr_delay = mean(arr_delay, na.rm = TRUE),
    obs_count = n()
  ) %>%
  mutate(
    probability_tag = ifelse(
      obs_count < quantile(obs_count, 0.25),
      "Low Probability",
      "Common"
    )
  )

ggplot(season_summary,
       aes(x = season, y = avg_arr_delay, fill = probability_tag)) +
  geom_col() +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Average Arrival Delay by Season",
    x = "Season",
    y = "Average Arrival Delay (minutes)",
    fill = "Season Type"
  ) +
  theme_minimal()

Average arrival delays are higher during winter and summer than during spring and fall, which may reflect weather-related disruptions and increased travel demand.

Hypothesis

The Summer season will have a higher ratio of weather_delay to arr_delay compared to the Fall season. This can be quantified by calculating sum(weather_delay)/sum(arr_delay) for each season.
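The ratio described above could be computed along these lines. This is a sketch on made-up numbers (`toy` is hypothetical); it assumes `weather_delay` and `arr_delay` are both measured in minutes per row.

```r
library(dplyr)

# Hypothetical toy data with invented delay minutes
toy <- data.frame(
  month         = c(1, 4, 7, 7, 10, 10),
  arr_delay     = c(1000, 800, 1200, 800, 900, 600),
  weather_delay = c(100, 40, 150, 50, 30, 15)
)

weather_share <- toy %>%
  mutate(season = case_when(
    month %in% c(12, 1, 2)  ~ "Winter",
    month %in% c(3, 4, 5)   ~ "Spring",
    month %in% c(6, 7, 8)   ~ "Summer",
    month %in% c(9, 10, 11) ~ "Fall"
  )) %>%
  group_by(season) %>%
  summarise(
    # Share of total arrival-delay minutes attributed to weather
    weather_ratio = sum(weather_delay, na.rm = TRUE) / sum(arr_delay, na.rm = TRUE)
  )

weather_share
```

Because `na.rm = TRUE` is applied to the numerator and denominator separately, rows where only one of the two columns is missing can distort the ratio; dropping such rows beforehand may be safer.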

Question

Are higher delays in Summer caused by thunderstorms or increased traffic, or by operational challenges like crew availability during vacation months? How do weather and passenger volume interact?

Categorical Combinations: Airline and Season

Two categorical variables, airline and season, were examined together to identify whether all possible airline–season combinations occur in the dataset. This analysis helps determine whether airlines operate year-round or only during specific seasons.

carrier_season_counts <- Airline_Delay_Post_COVID_2021_2023 %>%
  count(carrier_name, season)

To check for missing combinations, all possible airline–season pairs were generated and compared against the observed data.

all_combinations <- expand_grid(
  carrier_name = unique(Airline_Delay_Post_COVID_2021_2023$carrier_name),
  season = unique(Airline_Delay_Post_COVID_2021_2023$season)
)

missing_combinations <- anti_join(
  all_combinations,
  carrier_season_counts,
  by = c("carrier_name", "season")
)

missing_combinations

## # A tibble: 0 × 2
## # ℹ 2 variables: carrier_name <chr>, season <chr>

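Upon investigation, there are no missing combinations of airline and season. Every airline in this dataset therefore has at least one observation in each season, which is consistent with year-round operation rather than seasonal-only service, although it does not by itself show that schedules are equally dense throughout the year.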
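A finer-grained check is to count distinct months per airline, since a carrier can appear in all four seasons while operating in only a few months. The sketch below uses made-up data (`toy` and both carrier labels are hypothetical) to show the difference.

```r
library(dplyr)

# Hypothetical toy data: one carrier flies every month, the other only four
toy <- data.frame(
  carrier_name = c(rep("Year-Round", 12), rep("Sparse", 4)),
  month        = c(1:12, 1, 4, 7, 10)
)

coverage <- toy %>%
  mutate(season = case_when(
    month %in% c(12, 1, 2)  ~ "Winter",
    month %in% c(3, 4, 5)   ~ "Spring",
    month %in% c(6, 7, 8)   ~ "Summer",
    month %in% c(9, 10, 11) ~ "Fall"
  )) %>%
  group_by(carrier_name) %>%
  summarise(
    seasons_present = n_distinct(season),  # 4 for both carriers
    months_present  = n_distinct(month)    # separates the two carriers
  )

coverage
```

Both carriers pass the season-level test, but only the month-level count reveals that one of them has gaps in its calendar.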

To keep the graph readable and within five colors, only the top five airlines by total observation count are plotted.

top_airlines <- carrier_season_counts %>%
  group_by(carrier_name) %>%
  summarise(total_flights = sum(n)) %>%
  slice_max(total_flights, n = 5, with_ties = FALSE)  # exactly five rows, even if totals tie

carrier_season_top <- carrier_season_counts %>%
  semi_join(top_airlines, by = "carrier_name")

ggplot(carrier_season_top,
       aes(x = season, y = n, fill = carrier_name)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Set2") +   # other qualitative palettes: Set1, Set3, etc.
  labs(
    title = "Observation Counts by Season for Top 5 Airlines",
    x = "Season",
    y = "Number of Observations"
  ) +
  theme_minimal()

The most common airline–season combination is SkyWest Airlines Inc. in Winter (approx. 1,800 observations), likely due to its heavy role as a regional carrier for major hubs during peak holiday travel.

The least common combinations come from smaller airlines such as Horizon Air, which have significantly fewer observations across all seasons, likely reflecting their smaller fleet size.

The most common airline–season combinations correspond to major airlines operating consistently across all seasons. Differences in counts likely reflect variations in route networks, fleet size, and travel demand rather than seasonal absence.

Conclusion

By grouping airline delay data across airlines, airports, and seasons, this analysis identifies both common and rare operational patterns. Treating low-frequency groups as low-probability events provides insight into anomalies within the dataset and highlights opportunities for further investigation into airline operations and scheduling behavior.

2026-01-30
