Assignment 5A: Airline Delays

Author

Emily El Mouaquite

Approach

I will begin by recreating the airline delay chart and exporting it to a .CSV file. I will then be able to use tidyr and dplyr to transform the data from wide to long by creating airline, status (on time/ delayed), city and flight count columns, and compare delays/ arrival rates for airlines and cities.

Code Base

# load libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

I created the airline delay .CSV file using Excel, and will begin by reading it.

# read csv
airline_df <- read.csv("airline.csv")

Populate missing cells by adding column names.

# add airline and status column names
colnames(airline_df) <- c(
  "airline",
  "status",
  "Los_Angeles",
  "Phoenix",
  "San_Diego",
  "San_Francisco",
  "Seattle"
)

Remove blank row.

# make sure all blank cells are NA
airline_df[airline_df == ""] <- NA
# remove row 3
airline_df <- airline_df[-3, ]

Populate the rest of the NA cells by filling the airline column.

airline_df <- airline_df %>%
  fill(airline)

Convert the data from wide to long.

# make sure all values are numeric, and remove commas
airline_df<- airline_df %>%
  mutate(
    across(Los_Angeles:Seattle,
           ~ as.numeric(gsub(",", "", .)))
  )
# use pivot_longer to change each city column to a singular cities column, and a flight count column 
airline_df <- airline_df %>%
  pivot_longer(
    cols = Los_Angeles:Seattle,
    names_to = "city",
    values_to = "flight_count"
  )

Delay rate comparison between airlines.

delays <- airline_df %>%
  group_by(airline) %>%
  summarise(
    delayed_count = sum(flight_count[status == "delayed"]),
    total_count = sum(flight_count)
  ) %>%
  mutate(delay_percentage = round((delayed_count / total_count) * 100, 1))

delays
# A tibble: 2 × 4
  airline delayed_count total_count delay_percentage
  <chr>           <dbl>       <dbl>            <dbl>
1 ALASKA            501        3775             13.3
2 AM WEST           787        7225             10.9

This shows that Alaska had a higher rate of delayed flights than AM West overall.

On time arrival comparison between airlines across cities.

arrivals_by_city <- airline_df %>%
  group_by(airline, city) %>%
  summarise(
    arrival_count = sum(flight_count[status == "on time"]),
    total_count = sum(flight_count)
  ) %>%
  mutate(arrival_percentage = round((arrival_count / total_count) * 100, 1))  %>%
ungroup() %>%
  arrange(desc(arrival_percentage))
`summarise()` has grouped output by 'airline'. You can override using the
`.groups` argument.
arrivals_by_city
# A tibble: 10 × 5
   airline city          arrival_count total_count arrival_percentage
   <chr>   <chr>                 <dbl>       <dbl>              <dbl>
 1 ALASKA  Phoenix                 221         233               94.8
 2 AM WEST Phoenix                4840        5255               92.1
 3 ALASKA  San_Diego               212         232               91.4
 4 ALASKA  Los_Angeles             497         559               88.9
 5 ALASKA  Seattle                1841        2146               85.8
 6 AM WEST Los_Angeles             694         811               85.6
 7 AM WEST San_Diego               383         448               85.5
 8 ALASKA  San_Francisco           503         605               83.1
 9 AM WEST Seattle                 201         262               76.7
10 AM WEST San_Francisco           320         449               71.3

This table shows that, for both airlines, the city with the highest rate of on time flights was Phoenix, and the city with the lowest rate of on time flights was San Francisco. It also shows that Alaska has a higher rate of on time flights than AM West in every city.

Conclusion

When comparing the airlines AM West and Alaska, a discrepancy rises in that Alaska has a higher rate of delayed flights overall, but when comparing on time flight rates by city, Alaska also has a higher rate of on time flights in every city. This can be explained by Simpson’s Paradox, which is when groups of data show different trends when reversed. Since AM West operates a higher number of flights in each city with lower overall delays, its overall delay rate appears lower, even though Alaska performs better in every individual city. To verify this work, one could hand calculate the counts and percentages to compare them to the table outputs.