Flight Delays

Author

Kevin Havis

Introduction

Let’s analyze some flight data!

Our initial input data is unfortunately not in tidy shape, so we’ll start by parsing it into a proper format. Then we’ll perform some exploratory data analysis to get an idea of delays for flights. Our data contains on-time & delayed flight counts across two airlines and five cities on the West Coast.

Reading the initial file

library(tidyverse)
library(knitr)

# Read in the data
df <- read.csv('assign_5_input_data.csv', na.strings = c(""))

kable(df)

X	X.1	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
NA	delayed	62	12	20	102	305
NA	NA	NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
NA	delayed	117	415	65	129	61

Our initial data isn’t too bad, but it is certainly not tidy. We’ll address this by dropping the empty row, renaming our columns, and pivoting the dataset longer to “flatten” it.

# Drop empty row, rename columns, fill empty cells, and pivot into long format

df <- df |> 
  slice(-3) |> 
  rename("airline" = "X", "flight_status" = "X.1") |> 
  fill(airline) |> 
  rename_with(~ str_to_lower(str_replace(.x, "\\.", "_")), .cols = 3:7) |> 
  pivot_longer(3:7, names_to = "city", values_to = "flight_count") |> 
  arrange(desc(flight_count))

kable(df)

airline	flight_status	city	flight_count
AM WEST	on time	phoenix	4840
ALASKA	on time	seattle	1841
AM WEST	on time	los_angeles	694
ALASKA	on time	san_francisco	503
ALASKA	on time	los_angeles	497
AM WEST	delayed	phoenix	415
AM WEST	on time	san_diego	383
AM WEST	on time	san_francisco	320
ALASKA	delayed	seattle	305
ALASKA	on time	phoenix	221
ALASKA	on time	san_diego	212
AM WEST	on time	seattle	201
AM WEST	delayed	san_francisco	129
AM WEST	delayed	los_angeles	117
ALASKA	delayed	san_francisco	102
AM WEST	delayed	san_diego	65
ALASKA	delayed	los_angeles	62
AM WEST	delayed	seattle	61
ALASKA	delayed	san_diego	20
ALASKA	delayed	phoenix	12

We now have a nice, flat, long formatted and tidy data set. We can now do some analysis!

Analysis

Let’s answer some basic questions.

We’ll start with a general overview of our data. Since it is a simple data set, we can visualize most of the features in one plot.

# Set this up so we can factor airline for our factor wrap
df |> 
  mutate() |> 

# Plot data across two facets
ggplot(
       aes(
         x = flight_count,
         y = reorder(
          str_to_title( # Make city names pretty
            str_replace_all(city, "_", " ")),
          flight_count),
         fill = flight_status)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~factor(airline, c('AM WEST', 'ALASKA'))) +
  labs(
    title = "Flight data overview",
    y = "Flights",
    x = "City",
    fill = "Flight Status"
  ) + 
  scale_fill_manual(values = c('#E06565', '#0088CC')) +
  theme_minimal()

What city has the highest amount of delayed flights?

# Plot amount of delayed flights by city
df |> 
  filter(flight_status == "delayed") |> 
  group_by(city) |> 
  summarize(total_flights = sum(flight_count)) |> 
  
ggplot(aes
       (x = reorder(
          str_to_title(
            str_replace_all(city, "_", " ")),
          -total_flights),
        y = total_flights)) +
  geom_bar(stat = "identity", fill = '#E06565') +
  labs(
    title = "Delayed flights by city",
    y = "City",
    x = "Flights",
    fill = "Flight Status"
  ) +
  geom_text(aes(label = total_flights),vjust = 1.5, color = "#3b3e40" )+
  theme_minimal()

We can see that Phoenix has the highest amount of delayed flights.

What airline has the higher amount of on-time flights?

# Plot proportion of flight status by airline

df |> 
  group_by(airline, flight_status) |> 
  summarize(total_flights = sum(flight_count)) |> 
  
ggplot(aes
       (x = reorder(
          str_to_title(
            str_replace_all(airline, "_", " ")),
          -total_flights),
          fill = flight_status,
        y = total_flights)) +
  geom_col(position = "fill") +
  labs(
    title = "Flight status by airline",
    y = "Airline",
    x = "Flights (Proportion)",
    fill = "Flight Status"
  ) + 
  scale_fill_manual(values = c('#E06565', '#0088CC')) +
  theme_minimal()

Interestingly, the rate of flights on time versus delayed is actually quite similar between the airlines, despite the significant differences in airports visited and overall number of flights we observed earlier.

Conclusion

In this article we took a simple dataset, tidied it into a proper long format for data analysis, then created visualizations to explore and understand the data better.

In summary, if you’re flying to Phoenix, fly Alaska!