Assignment 5A

Overview

For this assignment, I will use the information provided in the table to create a .csv file. Then, I will use tidyr and dplyr to tidy and transform the data. Once that’s done, I will perform an analysis of the arrival delays for the two airlines. I believe it will be a good learning experience with tidying data and data transformations. From a glance, tidyr has to do with data structure and will be useful for converting between a “long” and “wide” format. I will use dplyr for the data manipulation. I think using a “wide” structure and turning it into a “long” format will be interesting to learn about.

To start off, I created a .csv file for the table and uploaded it to my personal github.

Source: https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%205A/Airline%20Delays%20-%20Sheet1.csv

Creating a data frame

I created a data frame from the csv file.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidyr)
library(dplyr)
library(gt)

url <- "https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%205A/Airline%20Delays%20-%20Sheet1.csv"

df <- read_csv(
  file = url,
  show_col_types = FALSE,
  progress = FALSE
)

## New names:
## • `` -> `...1`
## • `` -> `...2`

head(df)

## # A tibble: 5 × 7
##   ...1    ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST on time           694    4840         383             320     201
## 5 <NA>    delayed           117     415          65             129      61

I noticed there was an empty row in the data so I will remove the empty row and add names to the first couple columns as well.

df <- df |> filter(!if_all(everything(), is.na)) |>
  setNames(replace(names(df), 1:2, c("airline", "status")))
head(df)

## # A tibble: 4 × 7
##   airline status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 AM WEST on time           694    4840         383             320     201
## 4 <NA>    delayed           117     415          65             129      61

Looking at the data, there doesn’t appear to be any missing data. Each Airline has a value for “on time” and “delayed” for each of the 5 destinations. If there was any values missing, we could replace the empty cell with 0. In order to simulate this, we’ll create a dummy data frame with some values missing and then replace the empty cell with a 0 using dplyr.

To create the dummy data frame, I set a seed to keep the sample() function consistent and then used the sample() function to replace 50% of the Los Angeles Column to NA.

dummy_df <- df
set.seed(85858)
dummy_df$`Los Angeles`[sample(1:nrow(df), nrow(df) * 0.5)] <- NA

head(dummy_df)

## # A tibble: 4 × 7
##   airline status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            NA      12          20             102     305
## 3 AM WEST on time            NA    4840         383             320     201
## 4 <NA>    delayed           117     415          65             129      61

Next I went ahead and replaced the NA with 0s.

dummy_df <- dummy_df |>
  mutate(`Los Angeles` = replace_na(`Los Angeles`, 0)) # Replace NA with 0

head(dummy_df)

## # A tibble: 4 × 7
##   airline status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed             0      12          20             102     305
## 3 AM WEST on time             0    4840         383             320     201
## 4 <NA>    delayed           117     415          65             129      61

Going back to the original data, I’ll convert the data frame from a wide to a long format using the tidyr package and the pivot_longer() function.

df <- df |>
  tidyr::fill(airline, .direction = "down") |>
  mutate(status = tolower(status), 
         status = ifelse(status %in% c("on time","on time"), "on_time", status)
  )

destination_columns <- setdiff(names(df), c("airline","status"))

long_df <- df |>
  pivot_longer(all_of(destination_columns), names_to = "destination", values_to = "count") |>
  mutate(count = as.numeric(count))
head(long_df, 20)

## # A tibble: 20 × 4
##    airline status  destination   count
##    <chr>   <chr>   <chr>         <dbl>
##  1 ALASKA  on_time Los Angeles     497
##  2 ALASKA  on_time Phoenix         221
##  3 ALASKA  on_time San Diego       212
##  4 ALASKA  on_time San Francisco   503
##  5 ALASKA  on_time Seattle        1841
##  6 ALASKA  delayed Los Angeles      62
##  7 ALASKA  delayed Phoenix          12
##  8 ALASKA  delayed San Diego        20
##  9 ALASKA  delayed San Francisco   102
## 10 ALASKA  delayed Seattle         305
## 11 AM WEST on_time Los Angeles     694
## 12 AM WEST on_time Phoenix        4840
## 13 AM WEST on_time San Diego       383
## 14 AM WEST on_time San Francisco   320
## 15 AM WEST on_time Seattle         201
## 16 AM WEST delayed Los Angeles     117
## 17 AM WEST delayed Phoenix         415
## 18 AM WEST delayed San Diego        65
## 19 AM WEST delayed San Francisco   129
## 20 AM WEST delayed Seattle          61

Next I’ll calculate the percentage of delays for each destination and display the results in a table.

# percentage of delayed flights for each destination
delay_destination <- long_df |>
  group_by(airline, destination, status) |>
  summarise(count = sum(count, na.rm = TRUE), .groups = "drop") |>
  tidyr::pivot_wider(names_from = status, values_from = count, values_fill = 0) |>
  mutate(
    total_count = delayed + on_time, 
    percent_delayed = ifelse(total_count > 0, 100 * delayed / total_count, NA_real_)
    )

delay_destination |>
  gt() |>
  cols_label(
    airline = "Airline",
    destination = "Destination",
    delayed = "Delayed",
    on_time = "On Time",
    total_count = "Total",
    percent_delayed = "Percent Delayed"
  ) |>
  tab_header(
    title = "Airline Delays",
  )

Airline	Destination	Delayed	On Time	Total	Percent Delayed
Airline Delays
ALASKA	Los Angeles	62	497	559	11.091234
ALASKA	Phoenix	12	221	233	5.150215
ALASKA	San Diego	20	212	232	8.620690
ALASKA	San Francisco	102	503	605	16.859504
ALASKA	Seattle	305	1841	2146	14.212488
AM WEST	Los Angeles	117	694	811	14.426634
AM WEST	Phoenix	415	4840	5255	7.897241
AM WEST	San Diego	65	383	448	14.508929
AM WEST	San Francisco	129	320	449	28.730512
AM WEST	Seattle	61	201	262	23.282443

Next I’ll calculate the percentage of delays for each airline and display them in a table.

# percentage of total delayed flights
delay_airline <- delay_destination |>
  group_by(airline) |>
  summarise(
    delayed = sum(delayed), on_time = sum(on_time),
    total_count = delayed + on_time,
    percent_delayed = 100 * delayed / total_count,
    .groups = "drop"
  )

delay_airline |>
  gt() |>
  cols_label(
    airline = "Airline",
    delayed = "Delayed",
    on_time = "On Time",
    total_count = "Total",
    percent_delayed = "Percent Delayed"
  ) |>
  tab_header(
    title = "Airline Delays",
  )

Airline	Delayed	On Time	Total	Percent Delayed
Airline Delays
ALASKA	501	3274	3775	13.27152
AM WEST	787	6438	7225	10.89273

Next I’ll create a plot for the delays percentage for each destination using ggplot.

ggplot(delay_destination, aes(x = destination, y = percent_delayed, fill = airline)) +
  geom_col(position = position_dodge(width = 0.7)) +
  geom_text(
    aes(label = round(percent_delayed, 1)),  # Rounding for cleanliness
    position = position_dodge(width = 0.7),  # MUST match the geom_col width
    vjust = -0.5,                            # Move text slightly above the bar
    size = 3.5                               # Adjust font size as needed
  ) +
  labs(title = "Percentage of Delayed Flights by Destination", x = "Destination", y = "Percent Delayed")

Next I’ll create a plot for the delays percentage for each airline using ggplot.

ggplot(delay_airline, aes(airline, percent_delayed, fill = airline)) +
  geom_col() +
  geom_text(
    aes(label = round(percent_delayed, 1)), 
    vjust = -0.5,   # Positions the text just above the top of the bar
    size = 4        # Adjusts the text size
  ) +
  labs(title = "Percentage of Delayed Flights", x = "Airline", y = "Percent Delayed")

Conclusion

Looking at the two plots, we can see that the percentage of delays for each of the destinations is higher for AM WEST compared to ALASKA but the percentage of delays for the AM WEST airline altogether is lower than that of ALASKA. This is the Simpson’s paradox where the relationship between variables in separate groups reverses when you take the data as a whole. In this case, ALASKA has a higher percentage of delays in total even though they have a lower rate of delays for each destination compared to AM WEST.

Assignment 5A

Long Lin

2026-03-01

Overview

Creating a data frame

Conclusion