Flight Delays

Author

Kevin Havis

Introduction

Let’s analyze some flight data!

Our initial input data is unfortunately not in tidy shape, so we’ll start by parsing it into a proper format. Then we’ll perform some exploratory data analysis to get an idea of delays for flights. Our data contains on-time & delayed flight counts across two airlines and five cities on the West Coast.

Reading the initial file

library(tidyverse)
library(knitr)
# Read in the data
df <- read.csv('assign_5_input_data.csv', na.strings = c(""))

kable(df)
X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61

Our initial data isn’t too bad, but it is certainly not tidy. We’ll address this by dropping the empty row, renaming our columns, and pivoting the dataset longer to “flatten” it.

# Drop empty row, rename columns, fill empty cells, and pivot into long format

df <- df |> 
  slice(-3) |> 
  rename("airline" = "X", "flight_status" = "X.1") |> 
  fill(airline) |> 
  rename_with(~ str_to_lower(str_replace(.x, "\\.", "_")), .cols = 3:7) |> 
  pivot_longer(3:7, names_to = "city", values_to = "flight_count") |> 
  arrange(desc(flight_count))

kable(df)
airline flight_status city flight_count
AM WEST on time phoenix 4840
ALASKA on time seattle 1841
AM WEST on time los_angeles 694
ALASKA on time san_francisco 503
ALASKA on time los_angeles 497
AM WEST delayed phoenix 415
AM WEST on time san_diego 383
AM WEST on time san_francisco 320
ALASKA delayed seattle 305
ALASKA on time phoenix 221
ALASKA on time san_diego 212
AM WEST on time seattle 201
AM WEST delayed san_francisco 129
AM WEST delayed los_angeles 117
ALASKA delayed san_francisco 102
AM WEST delayed san_diego 65
ALASKA delayed los_angeles 62
AM WEST delayed seattle 61
ALASKA delayed san_diego 20
ALASKA delayed phoenix 12

We now have a nice, flat, long formatted and tidy data set. We can now do some analysis!

Analysis

Let’s answer some basic questions.

We’ll start with a general overview of our data. Since it is a simple data set, we can visualize most of the features in one plot.

# Set this up so we can factor airline for our factor wrap
df |> 
  mutate() |> 

# Plot data across two facets
ggplot(
       aes(
         x = flight_count,
         y = reorder(
          str_to_title( # Make city names pretty
            str_replace_all(city, "_", " ")),
          flight_count),
         fill = flight_status)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~factor(airline, c('AM WEST', 'ALASKA'))) +
  labs(
    title = "Flight data overview",
    y = "Flights",
    x = "City",
    fill = "Flight Status"
  ) + 
  scale_fill_manual(values = c('#E06565', '#0088CC')) +
  theme_minimal()

What city has the highest amount of delayed flights?

# Plot amount of delayed flights by city
df |> 
  filter(flight_status == "delayed") |> 
  group_by(city) |> 
  summarize(total_flights = sum(flight_count)) |> 
  
ggplot(aes
       (x = reorder(
          str_to_title(
            str_replace_all(city, "_", " ")),
          -total_flights),
        y = total_flights)) +
  geom_bar(stat = "identity", fill = '#E06565') +
  labs(
    title = "Delayed flights by city",
    y = "City",
    x = "Flights",
    fill = "Flight Status"
  ) +
  geom_text(aes(label = total_flights),vjust = 1.5, color = "#3b3e40" )+
  theme_minimal()

We can see that Phoenix has the highest amount of delayed flights.

What airline has the higher amount of on-time flights?

# Plot proportion of flight status by airline

df |> 
  group_by(airline, flight_status) |> 
  summarize(total_flights = sum(flight_count)) |> 
  
ggplot(aes
       (x = reorder(
          str_to_title(
            str_replace_all(airline, "_", " ")),
          -total_flights),
          fill = flight_status,
        y = total_flights)) +
  geom_col(position = "fill") +
  labs(
    title = "Flight status by airline",
    y = "Airline",
    x = "Flights (Proportion)",
    fill = "Flight Status"
  ) + 
  scale_fill_manual(values = c('#E06565', '#0088CC')) +
  theme_minimal()

Interestingly, the rate of flights on time versus delayed is actually quite similar between the airlines, despite the significant differences in airports visited and overall number of flights we observed earlier.

Conclusion

In this article we took a simple dataset, tidied it into a proper long format for data analysis, then created visualizations to explore and understand the data better.

In summary, if you’re flying to Phoenix, fly Alaska!