For this assignment, I will use the information provided in the table to create a .csv file. Then, I will use tidyr and dplyr to tidy and transform the data. Once that’s done, I will perform an analysis of the arrival delays for the two airlines. I believe it will be a good learning experience with tidying data and data transformations. From a glance, tidyr has to do with data structure and will be useful for converting between a “long” and “wide” format. I will use dplyr for the data manipulation. I think using a “wide” structure and turning it into a “long” format will be interesting to learn about.
To start off, I created a .csv file for the table and uploaded it to my personal github.
I created a data frame from the csv file.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(dplyr)
library(gt)
url <- "https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%205A/Airline%20Delays%20-%20Sheet1.csv"
df <- read_csv(
file = url,
show_col_types = FALSE,
progress = FALSE
)
## New names:
## • `` -> `...1`
## • `` -> `...2`
head(df)
## # A tibble: 5 × 7
## ...1 ...2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
I noticed there was an empty row in the data so I will remove the empty row and add names to the first couple columns as well.
df <- df |> filter(!if_all(everything(), is.na)) |>
setNames(replace(names(df), 1:2, c("airline", "status")))
head(df)
## # A tibble: 4 × 7
## airline status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
Looking at the data, there doesn’t appear to be any missing data. Each Airline has a value for “on time” and “delayed” for each of the 5 destinations. If there was any values missing, we could replace the empty cell with 0. In order to simulate this, we’ll create a dummy data frame with some values missing and then replace the empty cell with a 0 using dplyr.
To create the dummy data frame, I set a seed to keep the sample() function consistent and then used the sample() function to replace 50% of the Los Angeles Column to NA.
dummy_df <- df
set.seed(85858)
dummy_df$`Los Angeles`[sample(1:nrow(df), nrow(df) * 0.5)] <- NA
head(dummy_df)
## # A tibble: 4 × 7
## airline status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed NA 12 20 102 305
## 3 AM WEST on time NA 4840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
Next I went ahead and replaced the NA with 0s.
dummy_df <- dummy_df |>
mutate(`Los Angeles` = replace_na(`Los Angeles`, 0)) # Replace NA with 0
head(dummy_df)
## # A tibble: 4 × 7
## airline status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 0 12 20 102 305
## 3 AM WEST on time 0 4840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
Going back to the original data, I’ll convert the data frame from a wide to a long format using the tidyr package and the pivot_longer() function.
df <- df |>
tidyr::fill(airline, .direction = "down") |>
mutate(status = tolower(status),
status = ifelse(status %in% c("on time","on time"), "on_time", status)
)
destination_columns <- setdiff(names(df), c("airline","status"))
long_df <- df |>
pivot_longer(all_of(destination_columns), names_to = "destination", values_to = "count") |>
mutate(count = as.numeric(count))
head(long_df, 20)
## # A tibble: 20 × 4
## airline status destination count
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on_time Los Angeles 497
## 2 ALASKA on_time Phoenix 221
## 3 ALASKA on_time San Diego 212
## 4 ALASKA on_time San Francisco 503
## 5 ALASKA on_time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San Diego 20
## 9 ALASKA delayed San Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on_time Los Angeles 694
## 12 AM WEST on_time Phoenix 4840
## 13 AM WEST on_time San Diego 383
## 14 AM WEST on_time San Francisco 320
## 15 AM WEST on_time Seattle 201
## 16 AM WEST delayed Los Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San Diego 65
## 19 AM WEST delayed San Francisco 129
## 20 AM WEST delayed Seattle 61
Next I’ll calculate the percentage of delays for each destination and display the results in a table.
# percentage of delayed flights for each destination
delay_destination <- long_df |>
group_by(airline, destination, status) |>
summarise(count = sum(count, na.rm = TRUE), .groups = "drop") |>
tidyr::pivot_wider(names_from = status, values_from = count, values_fill = 0) |>
mutate(
total_count = delayed + on_time,
percent_delayed = ifelse(total_count > 0, 100 * delayed / total_count, NA_real_)
)
delay_destination |>
gt() |>
cols_label(
airline = "Airline",
destination = "Destination",
delayed = "Delayed",
on_time = "On Time",
total_count = "Total",
percent_delayed = "Percent Delayed"
) |>
tab_header(
title = "Airline Delays",
)
| Airline Delays | |||||
| Airline | Destination | Delayed | On Time | Total | Percent Delayed |
|---|---|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 | 559 | 11.091234 |
| ALASKA | Phoenix | 12 | 221 | 233 | 5.150215 |
| ALASKA | San Diego | 20 | 212 | 232 | 8.620690 |
| ALASKA | San Francisco | 102 | 503 | 605 | 16.859504 |
| ALASKA | Seattle | 305 | 1841 | 2146 | 14.212488 |
| AM WEST | Los Angeles | 117 | 694 | 811 | 14.426634 |
| AM WEST | Phoenix | 415 | 4840 | 5255 | 7.897241 |
| AM WEST | San Diego | 65 | 383 | 448 | 14.508929 |
| AM WEST | San Francisco | 129 | 320 | 449 | 28.730512 |
| AM WEST | Seattle | 61 | 201 | 262 | 23.282443 |
Next I’ll calculate the percentage of delays for each airline and display them in a table.
# percentage of total delayed flights
delay_airline <- delay_destination |>
group_by(airline) |>
summarise(
delayed = sum(delayed), on_time = sum(on_time),
total_count = delayed + on_time,
percent_delayed = 100 * delayed / total_count,
.groups = "drop"
)
delay_airline |>
gt() |>
cols_label(
airline = "Airline",
delayed = "Delayed",
on_time = "On Time",
total_count = "Total",
percent_delayed = "Percent Delayed"
) |>
tab_header(
title = "Airline Delays",
)
| Airline Delays | ||||
| Airline | Delayed | On Time | Total | Percent Delayed |
|---|---|---|---|---|
| ALASKA | 501 | 3274 | 3775 | 13.27152 |
| AM WEST | 787 | 6438 | 7225 | 10.89273 |
Next I’ll create a plot for the delays percentage for each destination using ggplot.
ggplot(delay_destination, aes(x = destination, y = percent_delayed, fill = airline)) +
geom_col(position = position_dodge(width = 0.7)) +
geom_text(
aes(label = round(percent_delayed, 1)), # Rounding for cleanliness
position = position_dodge(width = 0.7), # MUST match the geom_col width
vjust = -0.5, # Move text slightly above the bar
size = 3.5 # Adjust font size as needed
) +
labs(title = "Percentage of Delayed Flights by Destination", x = "Destination", y = "Percent Delayed")
Next I’ll create a plot for the delays percentage for each airline using ggplot.
ggplot(delay_airline, aes(airline, percent_delayed, fill = airline)) +
geom_col() +
geom_text(
aes(label = round(percent_delayed, 1)),
vjust = -0.5, # Positions the text just above the top of the bar
size = 4 # Adjusts the text size
) +
labs(title = "Percentage of Delayed Flights", x = "Airline", y = "Percent Delayed")
Looking at the two plots, we can see that the percentage of delays for each of the destinations is higher for AM WEST compared to ALASKA but the percentage of delays for the AM WEST airline altogether is lower than that of ALASKA. This is the Simpson’s paradox where the relationship between variables in separate groups reverses when you take the data as a whole. In this case, ALASKA has a higher percentage of delays in total even though they have a lower rate of delays for each destination compared to AM WEST.