Overview

In this assignment, I use tidyverse to “tidy” a data set using pivot_longer, add some summary statistics, and then use pivot_wider to create a new comparison table to look at flight delay rates from two airlines to five airports.

Importing Data

In this code block, I import the original “wide” .csv file from my github repository into a data frame. I also rename the columns.

original_data <- read.csv("https://raw.githubusercontent.com/Marley-Myrianthopoulos/data607/main/assignment5csv")

colnames(original_data) <- c("Airline", "Status", "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")

library(knitr)

kable(original_data, format = "pipe", caption = "Original Flight Data", align = "llccccc")
Original Flight Data
Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
delayed 62 12 20 102 305
NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
delayed 117 415 65 129 61

Pivot Longer

In this code block, I use pivot_longer to tidy the data. I name the column with all the different destination airports “Destination” and the column with the number of flights to each airport “Count”. Since each airline only appeared once in the table, there are some rows that do not have the airline in the data frame (the rows corresponding to the “delayed” flight data).

library(tidyr)

tidy_data <- original_data %>%
  pivot_longer(
    cols = -c("Airline", "Status"),
    names_to = "Destination",
    values_to = "Count",
    values_drop_na = TRUE)

kable(tidy_data, format = "pipe", caption = "Tidy Flight Data (Missing Some Airline Values)", align = "lllc")
Tidy Flight Data (Missing Some Airline Values)
Airline Status Destination Count
ALASKA on time Los Angeles 497
ALASKA on time Phoenix 221
ALASKA on time San Diego 212
ALASKA on time San Francisco 503
ALASKA on time Seattle 1841
delayed Los Angeles 62
delayed Phoenix 12
delayed San Diego 20
delayed San Francisco 102
delayed Seattle 305
AM WEST on time Los Angeles 694
AM WEST on time Phoenix 4840
AM WEST on time San Diego 383
AM WEST on time San Francisco 320
AM WEST on time Seattle 201
delayed Los Angeles 117
delayed Phoenix 415
delayed San Diego 65
delayed San Francisco 129
delayed Seattle 61

Filling In Missing Data

In this code block, I use a for loop to fill in the missing airline data. If the cell is empty, it uses the value from the row above. This allows the code to “cascade” through the column, filling in the airline for each empty cell until it gets to the next airline.

for (i in 1:nrow(tidy_data)) {
  if(tidy_data$Airline[i] == "") {tidy_data$Airline[i] <- tidy_data$Airline[i-1]}
}

kable(tidy_data, format = "pipe", caption = "Final Tidy Flight Data", align = "lllc")
Final Tidy Flight Data
Airline Status Destination Count
ALASKA on time Los Angeles 497
ALASKA on time Phoenix 221
ALASKA on time San Diego 212
ALASKA on time San Francisco 503
ALASKA on time Seattle 1841
ALASKA delayed Los Angeles 62
ALASKA delayed Phoenix 12
ALASKA delayed San Diego 20
ALASKA delayed San Francisco 102
ALASKA delayed Seattle 305
AM WEST on time Los Angeles 694
AM WEST on time Phoenix 4840
AM WEST on time San Diego 383
AM WEST on time San Francisco 320
AM WEST on time Seattle 201
AM WEST delayed Los Angeles 117
AM WEST delayed Phoenix 415
AM WEST delayed San Diego 65
AM WEST delayed San Francisco 129
AM WEST delayed Seattle 61

Adding Data

In this code block, I group the flight totals by airline and destination to get the total number of flights from each airline to each city. Then I use a left join to create a new data frame with all of this information. I then add a new column representing the percentage of flights from that airline to that destination that had that status.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ purrr     1.0.1
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.3     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
flight_totals <- tidy_data %>%
  group_by(Airline, Destination) %>%
  summarise(total = sum(Count))
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
library(dplyr)

expanded_data <- left_join(tidy_data, flight_totals, by = c("Airline", "Destination"))
expanded_data$percent <- round(expanded_data$Count / expanded_data$total * 100, digits = 1)

kable(expanded_data, format = "pipe", caption = "Additional Flight Data", align = "lllccc")
Additional Flight Data
Airline Status Destination Count total percent
ALASKA on time Los Angeles 497 559 88.9
ALASKA on time Phoenix 221 233 94.8
ALASKA on time San Diego 212 232 91.4
ALASKA on time San Francisco 503 605 83.1
ALASKA on time Seattle 1841 2146 85.8
ALASKA delayed Los Angeles 62 559 11.1
ALASKA delayed Phoenix 12 233 5.2
ALASKA delayed San Diego 20 232 8.6
ALASKA delayed San Francisco 102 605 16.9
ALASKA delayed Seattle 305 2146 14.2
AM WEST on time Los Angeles 694 811 85.6
AM WEST on time Phoenix 4840 5255 92.1
AM WEST on time San Diego 383 448 85.5
AM WEST on time San Francisco 320 449 71.3
AM WEST on time Seattle 201 262 76.7
AM WEST delayed Los Angeles 117 811 14.4
AM WEST delayed Phoenix 415 5255 7.9
AM WEST delayed San Diego 65 448 14.5
AM WEST delayed San Francisco 129 449 28.7
AM WEST delayed Seattle 61 262 23.3

Comparing the Airlines

In this code block, I use pivot_wider to spread the data out again so it can be compared as percentages. Each column is a destination city, and contains the data showing each airline’s on time and delayed percentages for that city.

comparison_data <- expanded_data %>%
  pivot_wider(
    id_cols = Airline:Status,
    names_from = Destination,
    values_from = percent
  )

kable(comparison_data, format = "pipe", caption = "Airline Status Percentages by City", align = "llccccc")
Airline Status Percentages by City
Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 88.9 94.8 91.4 83.1 85.8
ALASKA delayed 11.1 5.2 8.6 16.9 14.2
AM WEST on time 85.6 92.1 85.5 71.3 76.7
AM WEST delayed 14.4 7.9 14.5 28.7 23.3

Findings and Recommendations

The table above reveals that Alaska airlines had a higher percentage of its flights classified as “on time” than American West airlines did for every destination city. This suggests that travelers seeking to avoid delays would be better off flying with Alaska, although further analysis with additional data could attempt to control for factors like time of year, time of day, origin city, and other considerations that might affect whether or not a flight is delayed such as flight duration.