In this assignment, I use tidyverse to “tidy” a data set using pivot_longer, add some summary statistics, and then use pivot_wider to create a new comparison table to look at flight delay rates from two airlines to five airports.
In this code block, I import the original “wide” .csv file from my github repository into a data frame. I also rename the columns.
original_data <- read.csv("https://raw.githubusercontent.com/Marley-Myrianthopoulos/data607/main/assignment5csv")
colnames(original_data) <- c("Airline", "Status", "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")
library(knitr)
kable(original_data, format = "pipe", caption = "Original Flight Data", align = "llccccc")
Airline | Status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
delayed | 62 | 12 | 20 | 102 | 305 | |
NA | NA | NA | NA | NA | ||
AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
delayed | 117 | 415 | 65 | 129 | 61 |
In this code block, I use pivot_longer to tidy the data. I name the column with all the different destination airports “Destination” and the column with the number of flights to each airport “Count”. Since each airline only appeared once in the table, there are some rows that do not have the airline in the data frame (the rows corresponding to the “delayed” flight data).
library(tidyr)
tidy_data <- original_data %>%
pivot_longer(
cols = -c("Airline", "Status"),
names_to = "Destination",
values_to = "Count",
values_drop_na = TRUE)
kable(tidy_data, format = "pipe", caption = "Tidy Flight Data (Missing Some Airline Values)", align = "lllc")
Airline | Status | Destination | Count |
---|---|---|---|
ALASKA | on time | Los Angeles | 497 |
ALASKA | on time | Phoenix | 221 |
ALASKA | on time | San Diego | 212 |
ALASKA | on time | San Francisco | 503 |
ALASKA | on time | Seattle | 1841 |
delayed | Los Angeles | 62 | |
delayed | Phoenix | 12 | |
delayed | San Diego | 20 | |
delayed | San Francisco | 102 | |
delayed | Seattle | 305 | |
AM WEST | on time | Los Angeles | 694 |
AM WEST | on time | Phoenix | 4840 |
AM WEST | on time | San Diego | 383 |
AM WEST | on time | San Francisco | 320 |
AM WEST | on time | Seattle | 201 |
delayed | Los Angeles | 117 | |
delayed | Phoenix | 415 | |
delayed | San Diego | 65 | |
delayed | San Francisco | 129 | |
delayed | Seattle | 61 |
In this code block, I use a for loop to fill in the missing airline data. If the cell is empty, it uses the value from the row above. This allows the code to “cascade” through the column, filling in the airline for each empty cell until it gets to the next airline.
for (i in 1:nrow(tidy_data)) {
if(tidy_data$Airline[i] == "") {tidy_data$Airline[i] <- tidy_data$Airline[i-1]}
}
kable(tidy_data, format = "pipe", caption = "Final Tidy Flight Data", align = "lllc")
Airline | Status | Destination | Count |
---|---|---|---|
ALASKA | on time | Los Angeles | 497 |
ALASKA | on time | Phoenix | 221 |
ALASKA | on time | San Diego | 212 |
ALASKA | on time | San Francisco | 503 |
ALASKA | on time | Seattle | 1841 |
ALASKA | delayed | Los Angeles | 62 |
ALASKA | delayed | Phoenix | 12 |
ALASKA | delayed | San Diego | 20 |
ALASKA | delayed | San Francisco | 102 |
ALASKA | delayed | Seattle | 305 |
AM WEST | on time | Los Angeles | 694 |
AM WEST | on time | Phoenix | 4840 |
AM WEST | on time | San Diego | 383 |
AM WEST | on time | San Francisco | 320 |
AM WEST | on time | Seattle | 201 |
AM WEST | delayed | Los Angeles | 117 |
AM WEST | delayed | Phoenix | 415 |
AM WEST | delayed | San Diego | 65 |
AM WEST | delayed | San Francisco | 129 |
AM WEST | delayed | Seattle | 61 |
In this code block, I group the flight totals by airline and destination to get the total number of flights from each airline to each city. Then I use a left join to create a new data frame with all of this information. I then add a new column representing the percentage of flights from that airline to that destination that had that status.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ purrr 1.0.1
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.3 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
flight_totals <- tidy_data %>%
group_by(Airline, Destination) %>%
summarise(total = sum(Count))
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
library(dplyr)
expanded_data <- left_join(tidy_data, flight_totals, by = c("Airline", "Destination"))
expanded_data$percent <- round(expanded_data$Count / expanded_data$total * 100, digits = 1)
kable(expanded_data, format = "pipe", caption = "Additional Flight Data", align = "lllccc")
Airline | Status | Destination | Count | total | percent |
---|---|---|---|---|---|
ALASKA | on time | Los Angeles | 497 | 559 | 88.9 |
ALASKA | on time | Phoenix | 221 | 233 | 94.8 |
ALASKA | on time | San Diego | 212 | 232 | 91.4 |
ALASKA | on time | San Francisco | 503 | 605 | 83.1 |
ALASKA | on time | Seattle | 1841 | 2146 | 85.8 |
ALASKA | delayed | Los Angeles | 62 | 559 | 11.1 |
ALASKA | delayed | Phoenix | 12 | 233 | 5.2 |
ALASKA | delayed | San Diego | 20 | 232 | 8.6 |
ALASKA | delayed | San Francisco | 102 | 605 | 16.9 |
ALASKA | delayed | Seattle | 305 | 2146 | 14.2 |
AM WEST | on time | Los Angeles | 694 | 811 | 85.6 |
AM WEST | on time | Phoenix | 4840 | 5255 | 92.1 |
AM WEST | on time | San Diego | 383 | 448 | 85.5 |
AM WEST | on time | San Francisco | 320 | 449 | 71.3 |
AM WEST | on time | Seattle | 201 | 262 | 76.7 |
AM WEST | delayed | Los Angeles | 117 | 811 | 14.4 |
AM WEST | delayed | Phoenix | 415 | 5255 | 7.9 |
AM WEST | delayed | San Diego | 65 | 448 | 14.5 |
AM WEST | delayed | San Francisco | 129 | 449 | 28.7 |
AM WEST | delayed | Seattle | 61 | 262 | 23.3 |
In this code block, I use pivot_wider to spread the data out again so it can be compared as percentages. Each column is a destination city, and contains the data showing each airline’s on time and delayed percentages for that city.
comparison_data <- expanded_data %>%
pivot_wider(
id_cols = Airline:Status,
names_from = Destination,
values_from = percent
)
kable(comparison_data, format = "pipe", caption = "Airline Status Percentages by City", align = "llccccc")
Airline | Status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 88.9 | 94.8 | 91.4 | 83.1 | 85.8 |
ALASKA | delayed | 11.1 | 5.2 | 8.6 | 16.9 | 14.2 |
AM WEST | on time | 85.6 | 92.1 | 85.5 | 71.3 | 76.7 |
AM WEST | delayed | 14.4 | 7.9 | 14.5 | 28.7 | 23.3 |
The table above reveals that Alaska airlines had a higher percentage of its flights classified as “on time” than American West airlines did for every destination city. This suggests that travelers seeking to avoid delays would be better off flying with Alaska, although further analysis with additional data could attempt to control for factors like time of year, time of day, origin city, and other considerations that might affect whether or not a flight is delayed such as flight duration.