Tidying and Transforming Data

Step 1:

First, imported the necessary libraries:

library(tidyverse)
library(tidyr)
library(dplyr)
library(ggplot2)

Then, I created and imported the CSV file with flight information in a wide format, with the labels and values in an “uncleaned” state:

flights_wide <- read.csv(file="flights_wk4.csv")
glimpse(flights_wide)
## Rows: 4
## Columns: 7
## $ airline       <chr> "Alaska", "Alaska", "AM West", "AM West"
## $ flight.status <chr> "on time", "delayed", "on time", "delayed"
## $ Los.Angeles   <int> 497, 62, 694, 117
## $ Phoenix       <int> 221, 12, 4840, 415
## $ San.Diego     <int> 212, 20, 383, 65
## $ San.Francisco <int> 503, 102, 320, 129
## $ Seattle       <int> 1841, 305, 201, 61

Step 2:

To begin tidying the data, I first pivoted the dataset longer to create a column for city names:

flights_long <- flights_wide %>%
  pivot_longer(cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
               names_to = "city",
               values_to = "count")

knitr::kable(flights_long)
airline flight.status city count
Alaska on time Los.Angeles 497
Alaska on time Phoenix 221
Alaska on time San.Diego 212
Alaska on time San.Francisco 503
Alaska on time Seattle 1841
Alaska delayed Los.Angeles 62
Alaska delayed Phoenix 12
Alaska delayed San.Diego 20
Alaska delayed San.Francisco 102
Alaska delayed Seattle 305
AM West on time Los.Angeles 694
AM West on time Phoenix 4840
AM West on time San.Diego 383
AM West on time San.Francisco 320
AM West on time Seattle 201
AM West delayed Los.Angeles 117
AM West delayed Phoenix 415
AM West delayed San.Diego 65
AM West delayed San.Francisco 129
AM West delayed Seattle 61

Next, the data actually needed to be pivoted wider again based on the values in the flight.status column:

flights_tidy <- flights_long %>%
  pivot_wider(names_from = flight.status,
              values_from = count)

knitr::kable(flights_tidy)
airline city on time delayed
Alaska Los.Angeles 497 62
Alaska Phoenix 221 12
Alaska San.Diego 212 20
Alaska San.Francisco 503 102
Alaska Seattle 1841 305
AM West Los.Angeles 694 117
AM West Phoenix 4840 415
AM West San.Diego 383 65
AM West San.Francisco 320 129
AM West Seattle 201 61

Finally, I cleaned up labels and values to make analysis a bit easier:

colnames(flights_tidy)[which(names(flights_tidy) == "on time")] <- "on_time"

flights_tidy$city <- gsub("\\.", " ", flights_tidy$city)

knitr::kable(flights_tidy)
airline city on_time delayed
Alaska Los Angeles 497 62
Alaska Phoenix 221 12
Alaska San Diego 212 20
Alaska San Francisco 503 102
Alaska Seattle 1841 305
AM West Los Angeles 694 117
AM West Phoenix 4840 415
AM West San Diego 383 65
AM West San Francisco 320 129
AM West Seattle 201 61

Step 3:

Comparing the arrival delays for the 2 airlines could be done in a number of ways.

  • Calculate

Using the sum() function to calculate the total number of delays across all 5 cities shows that AM West had more total delays at 787, versus Alaska’s 501.

al_total <- sum(flights_tidy[which(flights_tidy$airline == "Alaska"), 4])
al_total
## [1] 501
am_total <- sum(flights_tidy[which(flights_tidy$airline == "AM West"), 4])
am_total
## [1] 787
  • Sort

Using dplyr’s mutate and arrange to sort the data by percentage of delayed flights shows that AM West occupied 4 of the top 5 spots.

flights_tidy <- flights_tidy %>%
  mutate(delayed_pct = delayed / (on_time + delayed))

flights_tidy <- flights_tidy %>%
  arrange(desc(delayed_pct))

knitr::kable(flights_tidy)
airline city on_time delayed delayed_pct
AM West San Francisco 320 129 0.2873051
AM West Seattle 201 61 0.2328244
Alaska San Francisco 503 102 0.1685950
AM West San Diego 383 65 0.1450893
AM West Los Angeles 694 117 0.1442663
Alaska Seattle 1841 305 0.1421249
Alaska Los Angeles 497 62 0.1109123
Alaska San Diego 212 20 0.0862069
AM West Phoenix 4840 415 0.0789724
Alaska Phoenix 221 12 0.0515021
  • Visualize

Finally, I created a stacked bar plot to visualize the proportion of delays by airline for each city.

ggplot(flights_tidy, aes(x = city, y = delayed, fill = airline)) + 
    geom_bar(position="stack", stat="identity") +
    labs(x = "City", y = "Number of Delays", fill = "Airline")

Conclusion

Based on both the total number of delays and each airline’s percentage of delayed flights, AM West was more likely to have delayed arrivals than Alaska. There was only one city in which Alaska exceeded AM West’s number of delays in absolute value (Seattle); and there, the percentage of delayed flights was still lower for Alaska at ~14%, with AM West at ~23%.