Flight Delays

Author

Kevin Havis

Introduction

Let’s analyze some flight data!

Our initial data is unfortunately only as an image, so we’ll start by recreating it using tribble and performing some basic clean up.

Then we’ll perform some exploratory data analysis to get an idea of delays for flights.

Our data contains on-time & delayed flight counts across two airlines and five cities on the West Coast.

Creating the data

library(tidyverse)
library(knitr)

# Create the data

df <- tibble::tribble(
  ~"",          ~"",       ~`Los Angeles`,  ~Phoenix, ~`San Diego`, ~`San Francisco`, ~Seattle,
  "ALASKA",     "on time", 497,             221,       212,           503,              1841,
  NA,           "delayed", 62,              12,        20,            102,              305,
  NA,           NA,        NA,              NA,        NA,            NA,               NA,
  "AM WEST",    "on time", 694,             4840,      383,           320,              201,
  NA,           "delayed", 117,             415,       65,            129,              61
)

kable(df)

		Los Angeles	Phoenix	San Diego	San Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
NA	delayed	62	12	20	102	305
NA	NA	NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
NA	delayed	117	415	65	129	61

Our initial data isn’t too bad, but it is certainly not tidy. We’ll address this by dropping the empty row, renaming our columns, and pivoting the dataset longer to “flatten” it.

# Drop empty row, rename columns, fill empty cells, and pivot into long format

df <- df |> 
  rename("airline" = 1, "flight_status" = 2) |> 
  slice(c(-3)) |> 
  fill(airline) |> 
  rename_with(~ str_to_lower(str_replace(.x, " ", "_")), .cols = 3:7) |> 
  pivot_longer(3:7, names_to = "city", values_to = "flight_count") |> 
  arrange(desc(flight_count))

kable(df)

airline	flight_status	city	flight_count
AM WEST	on time	phoenix	4840
ALASKA	on time	seattle	1841
AM WEST	on time	los_angeles	694
ALASKA	on time	san_francisco	503
ALASKA	on time	los_angeles	497
AM WEST	delayed	phoenix	415
AM WEST	on time	san_diego	383
AM WEST	on time	san_francisco	320
ALASKA	delayed	seattle	305
ALASKA	on time	phoenix	221
ALASKA	on time	san_diego	212
AM WEST	on time	seattle	201
AM WEST	delayed	san_francisco	129
AM WEST	delayed	los_angeles	117
ALASKA	delayed	san_francisco	102
AM WEST	delayed	san_diego	65
ALASKA	delayed	los_angeles	62
AM WEST	delayed	seattle	61
ALASKA	delayed	san_diego	20
ALASKA	delayed	phoenix	12

We now have a nice, flat, long formatted and tidy data set. We’ll export this to a new CSV file and continue on with some analysis!

write.csv(df, "output_data.csv", row.names = FALSE)

Analysis

Let’s answer some basic questions.

We’ll start with a general overview of our data. Since it is a simple data set, we can visualize most of the features in one plot.

# Set this up so we can factor airline for our factor wrap
df |> 
  mutate() |> 

# Plot data across two facets
ggplot(
       aes(
         x = flight_count,
         y = reorder(
          str_to_title( # Make city names pretty
            str_replace_all(city, "_", " ")),
          flight_count),
         fill = flight_status)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~factor(airline, c('AM WEST', 'ALASKA'))) +
  labs(
    title = "Flight data overview",
    y = "Flights",
    x = "City",
    fill = "Flight Status"
  ) + 
  scale_fill_manual(values = c('#DB504A', '#A2A7A5')) +
  theme_minimal()

Compare delayed flights across airlines

When comparing airlines by flight status, we can see that the rate of delayed versus on time is roughly comparable between the two airlines, with AM west having about 11% of flights delayed and Alaska having 13%.

df |> 
  group_by(airline, flight_status) |> 
  summarize(flight_count = sum(flight_count), .groups = "drop") |> 
  mutate(prop = flight_count / sum(flight_count) * 100, .by = c(airline)) |> 
  kable()

airline	flight_status	flight_count	prop
ALASKA	delayed	501	13.27152
ALASKA	on time	3274	86.72848
AM WEST	delayed	787	10.89273
AM WEST	on time	6438	89.10727

# Plot amount of delayed flights by city
ggplot(df, aes(x = airline, y = flight_count, fill = flight_status)) +
  geom_bar(stat = "identity", position = "fill") + 
  labs(
    x = "Airline",
    y = "Flights (proportion)",
    title = " Proportion of flights delayed by airline",
    fill = "Flight Status"
  ) +
  scale_fill_manual(values = c('#DB504A', '#A2A7A5')) +
  theme_minimal()

Compare delays across cities

San Francisco generally has the highest rate of delays for both airlines, especially for AM West. We can also see that Phoenix generally has the best rate of on-time flights.

This chart also demonstrates that when comparing airlines across cities, Alaska has a better rate of on time flights, which contradicts our earlier observation that AM West had more on time flights. This is known as Simpson’s Paradox.

on_time_flights <- df |> 
  group_by(airline, city) |> 
  summarise(total_flights = sum(flight_count), .groups = "drop") |>   
  left_join(df |> filter(flight_status == "on time") |>   
              group_by(airline, city) |> 
              summarise(on_time_count = sum(flight_count), .groups = "drop"), 
            by = c("airline", "city")) |> 
  mutate(percentage = (on_time_count / total_flights) * 100) |> 
  ungroup() 

# Reorder the cities by total flight count
on_time_flights <- on_time_flights |> 
  mutate(city = str_to_title(str_replace_all(city, "_", " ")), city) |> 
  mutate(city = fct_reorder(city, total_flights))

kable(on_time_flights)

airline	city	total_flights	on_time_count	percentage
ALASKA	Los Angeles	559	497	88.90877
ALASKA	Phoenix	233	221	94.84979
ALASKA	San Diego	232	212	91.37931
ALASKA	San Francisco	605	503	83.14050
ALASKA	Seattle	2146	1841	85.78751
AM WEST	Los Angeles	811	694	85.57337
AM WEST	Phoenix	5255	4840	92.10276
AM WEST	San Diego	448	383	85.49107
AM WEST	San Francisco	449	320	71.26949
AM WEST	Seattle	262	201	76.71756

# Graph side-by-side columns
ggplot(on_time_flights, aes(x = city, y = percentage, fill = airline)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = sprintf("%.1f%%", percentage)), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 3) +
  labs(title = "Percentage of flights on time per city", 
       x = "City", 
       y = "Percentage of On Time Flights (%)",
       fill = "Airline") +
  theme_minimal() +
  scale_fill_manual(values = c('#00274A', '#007561')) # Respective airline colors

Analysis summary

In summary, despite the drastic difference in the overall number of flights, we can see that the airlines have similar delay rates to each other when compared in aggregate. However, that difference becomes far more meaningful when you evaluate it based per city, as we observed with San Francisco and Phoenix.

Conclusion

In this article we took a generated a simple dataset of flight data, tidied it into a proper long format for data analysis, then created visualizations to explore and understand how the rates of flight delays compare across airlines and cities. We identified a Simpson’s Paradox when comparing the rate of flights on time between airlines, and then subsequently per airline per city, each of which suggested a different preferred airline.