Data 607 Assignment 5

Overview

In this assignment, I use tidyverse to “tidy” a data set using pivot_longer, add some summary statistics, and then use pivot_wider to create a new comparison table to look at flight delay rates from two airlines to five airports.

Importing Data

In this code block, I import the original “wide” .csv file from my github repository into a data frame. I also rename the columns.

original_data <- read.csv("https://raw.githubusercontent.com/Marley-Myrianthopoulos/data607/main/assignment5csv")

colnames(original_data) <- c("Airline", "Status", "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")

library(knitr)

kable(original_data, format = "pipe", caption = "Original Flight Data", align = "llccccc")

Original Flight Data
Airline	Status	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
	delayed	62	12	20	102	305
		NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
	delayed	117	415	65	129	61

Pivot Longer

In this code block, I use pivot_longer to tidy the data. I name the column with all the different destination airports “Destination” and the column with the number of flights to each airport “Count”. Since each airline only appeared once in the table, there are some rows that do not have the airline in the data frame (the rows corresponding to the “delayed” flight data).

library(tidyr)

tidy_data <- original_data %>%
  pivot_longer(
    cols = -c("Airline", "Status"),
    names_to = "Destination",
    values_to = "Count",
    values_drop_na = TRUE)

kable(tidy_data, format = "pipe", caption = "Tidy Flight Data (Missing Some Airline Values)", align = "lllc")

Tidy Flight Data (Missing Some Airline Values)
Airline	Status	Destination	Count
ALASKA	on time	Los Angeles	497
ALASKA	on time	Phoenix	221
ALASKA	on time	San Diego	212
ALASKA	on time	San Francisco	503
ALASKA	on time	Seattle	1841
	delayed	Los Angeles	62
	delayed	Phoenix	12
	delayed	San Diego	20
	delayed	San Francisco	102
	delayed	Seattle	305
AM WEST	on time	Los Angeles	694
AM WEST	on time	Phoenix	4840
AM WEST	on time	San Diego	383
AM WEST	on time	San Francisco	320
AM WEST	on time	Seattle	201
	delayed	Los Angeles	117
	delayed	Phoenix	415
	delayed	San Diego	65
	delayed	San Francisco	129
	delayed	Seattle	61

Filling In Missing Data

In this code block, I use a for loop to fill in the missing airline data. If the cell is empty, it uses the value from the row above. This allows the code to “cascade” through the column, filling in the airline for each empty cell until it gets to the next airline.

for (i in 1:nrow(tidy_data)) {
  if(tidy_data$Airline[i] == "") {tidy_data$Airline[i] <- tidy_data$Airline[i-1]}
}

kable(tidy_data, format = "pipe", caption = "Final Tidy Flight Data", align = "lllc")

Final Tidy Flight Data
Airline	Status	Destination	Count
ALASKA	on time	Los Angeles	497
ALASKA	on time	Phoenix	221
ALASKA	on time	San Diego	212
ALASKA	on time	San Francisco	503
ALASKA	on time	Seattle	1841
ALASKA	delayed	Los Angeles	62
ALASKA	delayed	Phoenix	12
ALASKA	delayed	San Diego	20
ALASKA	delayed	San Francisco	102
ALASKA	delayed	Seattle	305
AM WEST	on time	Los Angeles	694
AM WEST	on time	Phoenix	4840
AM WEST	on time	San Diego	383
AM WEST	on time	San Francisco	320
AM WEST	on time	Seattle	201
AM WEST	delayed	Los Angeles	117
AM WEST	delayed	Phoenix	415
AM WEST	delayed	San Diego	65
AM WEST	delayed	San Francisco	129
AM WEST	delayed	Seattle	61

Adding Data

In this code block, I group the flight totals by airline and destination to get the total number of flights from each airline to each city. Then I use a left join to create a new data frame with all of this information. I then add a new column representing the percentage of flights from that airline to that destination that had that status.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ purrr     1.0.1
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.3     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

flight_totals <- tidy_data %>%
  group_by(Airline, Destination) %>%
  summarise(total = sum(Count))

## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.

library(dplyr)

expanded_data <- left_join(tidy_data, flight_totals, by = c("Airline", "Destination"))
expanded_data$percent <- round(expanded_data$Count / expanded_data$total * 100, digits = 1)

kable(expanded_data, format = "pipe", caption = "Additional Flight Data", align = "lllccc")

Additional Flight Data
Airline	Status	Destination	Count	total	percent
ALASKA	on time	Los Angeles	497	559	88.9
ALASKA	on time	Phoenix	221	233	94.8
ALASKA	on time	San Diego	212	232	91.4
ALASKA	on time	San Francisco	503	605	83.1
ALASKA	on time	Seattle	1841	2146	85.8
ALASKA	delayed	Los Angeles	62	559	11.1
ALASKA	delayed	Phoenix	12	233	5.2
ALASKA	delayed	San Diego	20	232	8.6
ALASKA	delayed	San Francisco	102	605	16.9
ALASKA	delayed	Seattle	305	2146	14.2
AM WEST	on time	Los Angeles	694	811	85.6
AM WEST	on time	Phoenix	4840	5255	92.1
AM WEST	on time	San Diego	383	448	85.5
AM WEST	on time	San Francisco	320	449	71.3
AM WEST	on time	Seattle	201	262	76.7
AM WEST	delayed	Los Angeles	117	811	14.4
AM WEST	delayed	Phoenix	415	5255	7.9
AM WEST	delayed	San Diego	65	448	14.5
AM WEST	delayed	San Francisco	129	449	28.7
AM WEST	delayed	Seattle	61	262	23.3

Comparing the Airlines

In this code block, I use pivot_wider to spread the data out again so it can be compared as percentages. Each column is a destination city, and contains the data showing each airline’s on time and delayed percentages for that city.

comparison_data <- expanded_data %>%
  pivot_wider(
    id_cols = Airline:Status,
    names_from = Destination,
    values_from = percent
  )

kable(comparison_data, format = "pipe", caption = "Airline Status Percentages by City", align = "llccccc")

Airline Status Percentages by City
Airline	Status	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
ALASKA	on time	88.9	94.8	91.4	83.1	85.8
ALASKA	delayed	11.1	5.2	8.6	16.9	14.2
AM WEST	on time	85.6	92.1	85.5	71.3	76.7
AM WEST	delayed	14.4	7.9	14.5	28.7	23.3

Findings and Recommendations

The table above reveals that Alaska airlines had a higher percentage of its flights classified as “on time” than American West airlines did for every destination city. This suggests that travelers seeking to avoid delays would be better off flying with Alaska, although further analysis with additional data could attempt to control for factors like time of year, time of day, origin city, and other considerations that might affect whether or not a flight is delayed such as flight duration.