Flight Delays

Author

Kevin Havis

Introduction

Let’s analyze some flight data!

Our initial data is unfortunately only as an image, so we’ll start by recreating it using tribble and performing some basic clean up.

Then we’ll perform some exploratory data analysis to get an idea of delays for flights.

Our data contains on-time & delayed flight counts across two airlines and five cities on the West Coast.

Creating the data

library(tidyverse)
library(knitr)

Raw Data
# Create the data

df <- tibble::tribble(
  ~"",          ~"",       ~`Los Angeles`,  ~Phoenix, ~`San Diego`, ~`San Francisco`, ~Seattle,
  "ALASKA",     "on time", 497,             221,       212,           503,              1841,
  NA,           "delayed", 62,              12,        20,            102,              305,
  NA,           NA,        NA,              NA,        NA,            NA,               NA,
  "AM WEST",    "on time", 694,             4840,      383,           320,              201,
  NA,           "delayed", 117,             415,       65,            129,              61
)

kable(df)
Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61

Our initial data isn’t too bad, but it is certainly not tidy. We’ll address this by dropping the empty row, renaming our columns, and pivoting the dataset longer to “flatten” it.

# Drop empty row, rename columns, fill empty cells, and pivot into long format

df <- df |> 
  rename("airline" = 1, "flight_status" = 2) |> 
  slice(c(-3)) |> 
  fill(airline) |> 
  rename_with(~ str_to_lower(str_replace(.x, " ", "_")), .cols = 3:7) |> 
  pivot_longer(3:7, names_to = "city", values_to = "flight_count") |> 
  arrange(desc(flight_count))

kable(df)
airline flight_status city flight_count
AM WEST on time phoenix 4840
ALASKA on time seattle 1841
AM WEST on time los_angeles 694
ALASKA on time san_francisco 503
ALASKA on time los_angeles 497
AM WEST delayed phoenix 415
AM WEST on time san_diego 383
AM WEST on time san_francisco 320
ALASKA delayed seattle 305
ALASKA on time phoenix 221
ALASKA on time san_diego 212
AM WEST on time seattle 201
AM WEST delayed san_francisco 129
AM WEST delayed los_angeles 117
ALASKA delayed san_francisco 102
AM WEST delayed san_diego 65
ALASKA delayed los_angeles 62
AM WEST delayed seattle 61
ALASKA delayed san_diego 20
ALASKA delayed phoenix 12

We now have a nice, flat, long formatted and tidy data set. We’ll export this to a new CSV file and continue on with some analysis!

write.csv(df, "output_data.csv", row.names = FALSE)

Analysis

Let’s answer some basic questions.

We’ll start with a general overview of our data. Since it is a simple data set, we can visualize most of the features in one plot.

# Set this up so we can factor airline for our factor wrap
df |> 
  mutate() |> 

# Plot data across two facets
ggplot(
       aes(
         x = flight_count,
         y = reorder(
          str_to_title( # Make city names pretty
            str_replace_all(city, "_", " ")),
          flight_count),
         fill = flight_status)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~factor(airline, c('AM WEST', 'ALASKA'))) +
  labs(
    title = "Flight data overview",
    y = "Flights",
    x = "City",
    fill = "Flight Status"
  ) + 
  scale_fill_manual(values = c('#DB504A', '#A2A7A5')) +
  theme_minimal()

Compare delayed flights across airlines

When comparing airlines by flight status, we can see that the rate of delayed versus on time is roughly comparable between the two airlines, with AM west having about 11% of flights delayed and Alaska having 13%.

df |> 
  group_by(airline, flight_status) |> 
  summarize(flight_count = sum(flight_count), .groups = "drop") |> 
  mutate(prop = flight_count / sum(flight_count) * 100, .by = c(airline)) |> 
  kable()
airline flight_status flight_count prop
ALASKA delayed 501 13.27152
ALASKA on time 3274 86.72848
AM WEST delayed 787 10.89273
AM WEST on time 6438 89.10727
# Plot amount of delayed flights by city
ggplot(df, aes(x = airline, y = flight_count, fill = flight_status)) +
  geom_bar(stat = "identity", position = "fill") + 
  labs(
    x = "Airline",
    y = "Flights (proportion)",
    title = " Proportion of flights delayed by airline",
    fill = "Flight Status"
  ) +
  scale_fill_manual(values = c('#DB504A', '#A2A7A5')) +
  theme_minimal()

Compare delays across cities

San Francisco generally has the highest rate of delays for both airlines, especially for AM West. We can also see that Phoenix generally has the best rate of on-time flights.

This chart also demonstrates that when comparing airlines across cities, Alaska has a better rate of on time flights, which contradicts our earlier observation that AM West had more on time flights. This is known as Simpson’s Paradox.

on_time_flights <- df |> 
  group_by(airline, city) |> 
  summarise(total_flights = sum(flight_count), .groups = "drop") |>   
  left_join(df |> filter(flight_status == "on time") |>   
              group_by(airline, city) |> 
              summarise(on_time_count = sum(flight_count), .groups = "drop"), 
            by = c("airline", "city")) |> 
  mutate(percentage = (on_time_count / total_flights) * 100) |> 
  ungroup() 

# Reorder the cities by total flight count
on_time_flights <- on_time_flights |> 
  mutate(city = str_to_title(str_replace_all(city, "_", " ")), city) |> 
  mutate(city = fct_reorder(city, total_flights))

kable(on_time_flights)
airline city total_flights on_time_count percentage
ALASKA Los Angeles 559 497 88.90877
ALASKA Phoenix 233 221 94.84979
ALASKA San Diego 232 212 91.37931
ALASKA San Francisco 605 503 83.14050
ALASKA Seattle 2146 1841 85.78751
AM WEST Los Angeles 811 694 85.57337
AM WEST Phoenix 5255 4840 92.10276
AM WEST San Diego 448 383 85.49107
AM WEST San Francisco 449 320 71.26949
AM WEST Seattle 262 201 76.71756
# Graph side-by-side columns
ggplot(on_time_flights, aes(x = city, y = percentage, fill = airline)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = sprintf("%.1f%%", percentage)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 3) +
  labs(title = "Percentage of flights on time per city", 
       x = "City", 
       y = "Percentage of On Time Flights (%)",
       fill = "Airline") +
  theme_minimal() +
  scale_fill_manual(values = c('#00274A', '#007561')) # Respective airline colors

Analysis summary

In summary, despite the drastic difference in the overall number of flights, we can see that the airlines have similar delay rates to each other when compared in aggregate. However, that difference becomes far more meaningful when you evaluate it based per city, as we observed with San Francisco and Phoenix.

Conclusion

In this article we took a generated a simple dataset of flight data, tidied it into a proper long format for data analysis, then created visualizations to explore and understand how the rates of flight delays compare across airlines and cities. We identified a Simpson’s Paradox when comparing the rate of flights on time between airlines, and then subsequently per airline per city, each of which suggested a different preferred airline.