Overview

For this assignment, I will use the information provided in the table to create a .csv file. Then, I will use tidyr and dplyr to tidy and transform the data. Once that’s done, I will perform an analysis of the arrival delays for the two airlines. I believe it will be a good learning experience with tidying data and data transformations. From a glance, tidyr has to do with data structure and will be useful for converting between a “long” and “wide” format. I will use dplyr for the data manipulation. I think using a “wide” structure and turning it into a “long” format will be interesting to learn about.

To start off, I created a .csv file for the table and uploaded it to my personal github.

Source: https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%205A/Airline%20Delays%20-%20Sheet1.csv

Creating a data frame

I created a data frame from the csv file.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(dplyr)
library(gt)

url <- "https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%205A/Airline%20Delays%20-%20Sheet1.csv"

df <- read_csv(
  file = url,
  show_col_types = FALSE,
  progress = FALSE
)
## New names:
## • `` -> `...1`
## • `` -> `...2`
head(df)
## # A tibble: 5 × 7
##   ...1    ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST on time           694    4840         383             320     201
## 5 <NA>    delayed           117     415          65             129      61

I noticed there was an empty row in the data so I will remove the empty row and add names to the first couple columns as well.

df <- df |> filter(!if_all(everything(), is.na)) |>
  setNames(replace(names(df), 1:2, c("airline", "status")))
head(df)
## # A tibble: 4 × 7
##   airline status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 AM WEST on time           694    4840         383             320     201
## 4 <NA>    delayed           117     415          65             129      61

Looking at the data, there doesn’t appear to be any missing data. Each Airline has a value for “on time” and “delayed” for each of the 5 destinations. If there was any values missing, we could replace the empty cell with 0. In order to simulate this, we’ll create a dummy data frame with some values missing and then replace the empty cell with a 0 using dplyr.

To create the dummy data frame, I set a seed to keep the sample() function consistent and then used the sample() function to replace 50% of the Los Angeles Column to NA.

dummy_df <- df
set.seed(85858)
dummy_df$`Los Angeles`[sample(1:nrow(df), nrow(df) * 0.5)] <- NA

head(dummy_df)
## # A tibble: 4 × 7
##   airline status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            NA      12          20             102     305
## 3 AM WEST on time            NA    4840         383             320     201
## 4 <NA>    delayed           117     415          65             129      61

Next I went ahead and replaced the NA with 0s.

dummy_df <- dummy_df |>
  mutate(`Los Angeles` = replace_na(`Los Angeles`, 0)) # Replace NA with 0

head(dummy_df)
## # A tibble: 4 × 7
##   airline status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed             0      12          20             102     305
## 3 AM WEST on time             0    4840         383             320     201
## 4 <NA>    delayed           117     415          65             129      61

Going back to the original data, I’ll convert the data frame from a wide to a long format using the tidyr package and the pivot_longer() function.

df <- df |>
  tidyr::fill(airline, .direction = "down") |>
  mutate(status = tolower(status), 
         status = ifelse(status %in% c("on time","on time"), "on_time", status)
  )

destination_columns <- setdiff(names(df), c("airline","status"))

long_df <- df |>
  pivot_longer(all_of(destination_columns), names_to = "destination", values_to = "count") |>
  mutate(count = as.numeric(count))
head(long_df, 20)
## # A tibble: 20 × 4
##    airline status  destination   count
##    <chr>   <chr>   <chr>         <dbl>
##  1 ALASKA  on_time Los Angeles     497
##  2 ALASKA  on_time Phoenix         221
##  3 ALASKA  on_time San Diego       212
##  4 ALASKA  on_time San Francisco   503
##  5 ALASKA  on_time Seattle        1841
##  6 ALASKA  delayed Los Angeles      62
##  7 ALASKA  delayed Phoenix          12
##  8 ALASKA  delayed San Diego        20
##  9 ALASKA  delayed San Francisco   102
## 10 ALASKA  delayed Seattle         305
## 11 AM WEST on_time Los Angeles     694
## 12 AM WEST on_time Phoenix        4840
## 13 AM WEST on_time San Diego       383
## 14 AM WEST on_time San Francisco   320
## 15 AM WEST on_time Seattle         201
## 16 AM WEST delayed Los Angeles     117
## 17 AM WEST delayed Phoenix         415
## 18 AM WEST delayed San Diego        65
## 19 AM WEST delayed San Francisco   129
## 20 AM WEST delayed Seattle          61

Next I’ll calculate the percentage of delays for each destination and display the results in a table.

# percentage of delayed flights for each destination
delay_destination <- long_df |>
  group_by(airline, destination, status) |>
  summarise(count = sum(count, na.rm = TRUE), .groups = "drop") |>
  tidyr::pivot_wider(names_from = status, values_from = count, values_fill = 0) |>
  mutate(
    total_count = delayed + on_time, 
    percent_delayed = ifelse(total_count > 0, 100 * delayed / total_count, NA_real_)
    )

delay_destination |>
  gt() |>
  cols_label(
    airline = "Airline",
    destination = "Destination",
    delayed = "Delayed",
    on_time = "On Time",
    total_count = "Total",
    percent_delayed = "Percent Delayed"
  ) |>
  tab_header(
    title = "Airline Delays",
  )
Airline Delays
Airline Destination Delayed On Time Total Percent Delayed
ALASKA Los Angeles 62 497 559 11.091234
ALASKA Phoenix 12 221 233 5.150215
ALASKA San Diego 20 212 232 8.620690
ALASKA San Francisco 102 503 605 16.859504
ALASKA Seattle 305 1841 2146 14.212488
AM WEST Los Angeles 117 694 811 14.426634
AM WEST Phoenix 415 4840 5255 7.897241
AM WEST San Diego 65 383 448 14.508929
AM WEST San Francisco 129 320 449 28.730512
AM WEST Seattle 61 201 262 23.282443

Next I’ll calculate the percentage of delays for each airline and display them in a table.

# percentage of total delayed flights
delay_airline <- delay_destination |>
  group_by(airline) |>
  summarise(
    delayed = sum(delayed), on_time = sum(on_time),
    total_count = delayed + on_time,
    percent_delayed = 100 * delayed / total_count,
    .groups = "drop"
  )

delay_airline |>
  gt() |>
  cols_label(
    airline = "Airline",
    delayed = "Delayed",
    on_time = "On Time",
    total_count = "Total",
    percent_delayed = "Percent Delayed"
  ) |>
  tab_header(
    title = "Airline Delays",
  )
Airline Delays
Airline Delayed On Time Total Percent Delayed
ALASKA 501 3274 3775 13.27152
AM WEST 787 6438 7225 10.89273

Next I’ll create a plot for the delays percentage for each destination using ggplot.

ggplot(delay_destination, aes(x = destination, y = percent_delayed, fill = airline)) +
  geom_col(position = position_dodge(width = 0.7)) +
  geom_text(
    aes(label = round(percent_delayed, 1)),  # Rounding for cleanliness
    position = position_dodge(width = 0.7),  # MUST match the geom_col width
    vjust = -0.5,                            # Move text slightly above the bar
    size = 3.5                               # Adjust font size as needed
  ) +
  labs(title = "Percentage of Delayed Flights by Destination", x = "Destination", y = "Percent Delayed")

Next I’ll create a plot for the delays percentage for each airline using ggplot.

ggplot(delay_airline, aes(airline, percent_delayed, fill = airline)) +
  geom_col() +
  geom_text(
    aes(label = round(percent_delayed, 1)), 
    vjust = -0.5,   # Positions the text just above the top of the bar
    size = 4        # Adjusts the text size
  ) +
  labs(title = "Percentage of Delayed Flights", x = "Airline", y = "Percent Delayed")

Conclusion

Looking at the two plots, we can see that the percentage of delays for each of the destinations is higher for AM WEST compared to ALASKA but the percentage of delays for the AM WEST airline altogether is lower than that of ALASKA. This is the Simpson’s paradox where the relationship between variables in separate groups reverses when you take the data as a whole. In this case, ALASKA has a higher percentage of delays in total even though they have a lower rate of delays for each destination compared to AM WEST.