library(tidyverse)
library(knitr)
Flight Delays
Introduction
Let’s analyze some flight data!
Our initial data is unfortunately only as an image, so we’ll start by recreating it using tribble
and performing some basic clean up.
Then we’ll perform some exploratory data analysis to get an idea of delays for flights.
Our data contains on-time & delayed flight counts across two airlines and five cities on the West Coast.
Creating the data
# Create the data
<- tibble::tribble(
df ~"", ~"", ~`Los Angeles`, ~Phoenix, ~`San Diego`, ~`San Francisco`, ~Seattle,
"ALASKA", "on time", 497, 221, 212, 503, 1841,
NA, "delayed", 62, 12, 20, 102, 305,
NA, NA, NA, NA, NA, NA, NA,
"AM WEST", "on time", 694, 4840, 383, 320, 201,
NA, "delayed", 117, 415, 65, 129, 61
)
kable(df)
Los Angeles | Phoenix | San Diego | San Francisco | Seattle | ||
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
NA | delayed | 62 | 12 | 20 | 102 | 305 |
NA | NA | NA | NA | NA | NA | NA |
AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
NA | delayed | 117 | 415 | 65 | 129 | 61 |
Our initial data isn’t too bad, but it is certainly not tidy. We’ll address this by dropping the empty row, renaming our columns, and pivoting the dataset longer to “flatten” it.
# Drop empty row, rename columns, fill empty cells, and pivot into long format
<- df |>
df rename("airline" = 1, "flight_status" = 2) |>
slice(c(-3)) |>
fill(airline) |>
rename_with(~ str_to_lower(str_replace(.x, " ", "_")), .cols = 3:7) |>
pivot_longer(3:7, names_to = "city", values_to = "flight_count") |>
arrange(desc(flight_count))
kable(df)
airline | flight_status | city | flight_count |
---|---|---|---|
AM WEST | on time | phoenix | 4840 |
ALASKA | on time | seattle | 1841 |
AM WEST | on time | los_angeles | 694 |
ALASKA | on time | san_francisco | 503 |
ALASKA | on time | los_angeles | 497 |
AM WEST | delayed | phoenix | 415 |
AM WEST | on time | san_diego | 383 |
AM WEST | on time | san_francisco | 320 |
ALASKA | delayed | seattle | 305 |
ALASKA | on time | phoenix | 221 |
ALASKA | on time | san_diego | 212 |
AM WEST | on time | seattle | 201 |
AM WEST | delayed | san_francisco | 129 |
AM WEST | delayed | los_angeles | 117 |
ALASKA | delayed | san_francisco | 102 |
AM WEST | delayed | san_diego | 65 |
ALASKA | delayed | los_angeles | 62 |
AM WEST | delayed | seattle | 61 |
ALASKA | delayed | san_diego | 20 |
ALASKA | delayed | phoenix | 12 |
We now have a nice, flat, long formatted and tidy data set. We’ll export this to a new CSV file and continue on with some analysis!
write.csv(df, "output_data.csv", row.names = FALSE)
Analysis
Let’s answer some basic questions.
We’ll start with a general overview of our data. Since it is a simple data set, we can visualize most of the features in one plot.
# Set this up so we can factor airline for our factor wrap
|>
df mutate() |>
# Plot data across two facets
ggplot(
aes(
x = flight_count,
y = reorder(
str_to_title( # Make city names pretty
str_replace_all(city, "_", " ")),
flight_count),fill = flight_status)) +
geom_bar(stat = 'identity') +
facet_wrap(~factor(airline, c('AM WEST', 'ALASKA'))) +
labs(
title = "Flight data overview",
y = "Flights",
x = "City",
fill = "Flight Status"
+
) scale_fill_manual(values = c('#DB504A', '#A2A7A5')) +
theme_minimal()
Compare delayed flights across airlines
When comparing airlines by flight status, we can see that the rate of delayed versus on time is roughly comparable between the two airlines, with AM west having about 11% of flights delayed and Alaska having 13%.
|>
df group_by(airline, flight_status) |>
summarize(flight_count = sum(flight_count), .groups = "drop") |>
mutate(prop = flight_count / sum(flight_count) * 100, .by = c(airline)) |>
kable()
airline | flight_status | flight_count | prop |
---|---|---|---|
ALASKA | delayed | 501 | 13.27152 |
ALASKA | on time | 3274 | 86.72848 |
AM WEST | delayed | 787 | 10.89273 |
AM WEST | on time | 6438 | 89.10727 |
# Plot amount of delayed flights by city
ggplot(df, aes(x = airline, y = flight_count, fill = flight_status)) +
geom_bar(stat = "identity", position = "fill") +
labs(
x = "Airline",
y = "Flights (proportion)",
title = " Proportion of flights delayed by airline",
fill = "Flight Status"
+
) scale_fill_manual(values = c('#DB504A', '#A2A7A5')) +
theme_minimal()
Compare delays across cities
San Francisco generally has the highest rate of delays for both airlines, especially for AM West. We can also see that Phoenix generally has the best rate of on-time flights.
This chart also demonstrates that when comparing airlines across cities, Alaska has a better rate of on time flights, which contradicts our earlier observation that AM West had more on time flights. This is known as Simpson’s Paradox.
<- df |>
on_time_flights group_by(airline, city) |>
summarise(total_flights = sum(flight_count), .groups = "drop") |>
left_join(df |> filter(flight_status == "on time") |>
group_by(airline, city) |>
summarise(on_time_count = sum(flight_count), .groups = "drop"),
by = c("airline", "city")) |>
mutate(percentage = (on_time_count / total_flights) * 100) |>
ungroup()
# Reorder the cities by total flight count
<- on_time_flights |>
on_time_flights mutate(city = str_to_title(str_replace_all(city, "_", " ")), city) |>
mutate(city = fct_reorder(city, total_flights))
kable(on_time_flights)
airline | city | total_flights | on_time_count | percentage |
---|---|---|---|---|
ALASKA | Los Angeles | 559 | 497 | 88.90877 |
ALASKA | Phoenix | 233 | 221 | 94.84979 |
ALASKA | San Diego | 232 | 212 | 91.37931 |
ALASKA | San Francisco | 605 | 503 | 83.14050 |
ALASKA | Seattle | 2146 | 1841 | 85.78751 |
AM WEST | Los Angeles | 811 | 694 | 85.57337 |
AM WEST | Phoenix | 5255 | 4840 | 92.10276 |
AM WEST | San Diego | 448 | 383 | 85.49107 |
AM WEST | San Francisco | 449 | 320 | 71.26949 |
AM WEST | Seattle | 262 | 201 | 76.71756 |
# Graph side-by-side columns
ggplot(on_time_flights, aes(x = city, y = percentage, fill = airline)) +
geom_col(position = "dodge") +
geom_text(aes(label = sprintf("%.1f%%", percentage)),
position = position_dodge(width = 0.9),
vjust = -0.5, size = 3) +
labs(title = "Percentage of flights on time per city",
x = "City",
y = "Percentage of On Time Flights (%)",
fill = "Airline") +
theme_minimal() +
scale_fill_manual(values = c('#00274A', '#007561')) # Respective airline colors
Analysis summary
In summary, despite the drastic difference in the overall number of flights, we can see that the airlines have similar delay rates to each other when compared in aggregate. However, that difference becomes far more meaningful when you evaluate it based per city, as we observed with San Francisco and Phoenix.
Conclusion
In this article we took a generated a simple dataset of flight data, tidied it into a proper long format for data analysis, then created visualizations to explore and understand how the rates of flight delays compare across airlines and cities. We identified a Simpson’s Paradox when comparing the rate of flights on time between airlines, and then subsequently per airline per city, each of which suggested a different preferred airline.