Stephanie Chiang - Week 4 Assignment

Tidying and Transforming Data

Step 1:

First, imported the necessary libraries:

library(tidyverse)
library(tidyr)
library(dplyr)
library(ggplot2)

Then, I created and imported the CSV file with flight information in a wide format, with the labels and values in an “uncleaned” state:

flights_wide <- read.csv(file="flights_wk4.csv")
glimpse(flights_wide)

## Rows: 4
## Columns: 7
## $ airline       <chr> "Alaska", "Alaska", "AM West", "AM West"
## $ flight.status <chr> "on time", "delayed", "on time", "delayed"
## $ Los.Angeles   <int> 497, 62, 694, 117
## $ Phoenix       <int> 221, 12, 4840, 415
## $ San.Diego     <int> 212, 20, 383, 65
## $ San.Francisco <int> 503, 102, 320, 129
## $ Seattle       <int> 1841, 305, 201, 61

Step 2:

To begin tidying the data, I first pivoted the dataset longer to create a column for city names:

flights_long <- flights_wide %>%
  pivot_longer(cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
               names_to = "city",
               values_to = "count")

knitr::kable(flights_long)

airline	flight.status	city	count
Alaska	on time	Los.Angeles	497
Alaska	on time	Phoenix	221
Alaska	on time	San.Diego	212
Alaska	on time	San.Francisco	503
Alaska	on time	Seattle	1841
Alaska	delayed	Los.Angeles	62
Alaska	delayed	Phoenix	12
Alaska	delayed	San.Diego	20
Alaska	delayed	San.Francisco	102
Alaska	delayed	Seattle	305
AM West	on time	Los.Angeles	694
AM West	on time	Phoenix	4840
AM West	on time	San.Diego	383
AM West	on time	San.Francisco	320
AM West	on time	Seattle	201
AM West	delayed	Los.Angeles	117
AM West	delayed	Phoenix	415
AM West	delayed	San.Diego	65
AM West	delayed	San.Francisco	129
AM West	delayed	Seattle	61

Next, the data actually needed to be pivoted wider again based on the values in the flight.status column:

flights_tidy <- flights_long %>%
  pivot_wider(names_from = flight.status,
              values_from = count)

knitr::kable(flights_tidy)

airline	city	on time	delayed
Alaska	Los.Angeles	497	62
Alaska	Phoenix	221	12
Alaska	San.Diego	212	20
Alaska	San.Francisco	503	102
Alaska	Seattle	1841	305
AM West	Los.Angeles	694	117
AM West	Phoenix	4840	415
AM West	San.Diego	383	65
AM West	San.Francisco	320	129
AM West	Seattle	201	61

Finally, I cleaned up labels and values to make analysis a bit easier:

colnames(flights_tidy)[which(names(flights_tidy) == "on time")] <- "on_time"

flights_tidy$city <- gsub("\\.", " ", flights_tidy$city)

knitr::kable(flights_tidy)

airline	city	on_time	delayed
Alaska	Los Angeles	497	62
Alaska	Phoenix	221	12
Alaska	San Diego	212	20
Alaska	San Francisco	503	102
Alaska	Seattle	1841	305
AM West	Los Angeles	694	117
AM West	Phoenix	4840	415
AM West	San Diego	383	65
AM West	San Francisco	320	129
AM West	Seattle	201	61

Step 3:

Comparing the arrival delays for the 2 airlines could be done in a number of ways.

Calculate

Using the sum() function to calculate the total number of delays across all 5 cities shows that AM West had more total delays at 787, versus Alaska’s 501.

al_total <- sum(flights_tidy[which(flights_tidy$airline == "Alaska"), 4])
al_total

## [1] 501

am_total <- sum(flights_tidy[which(flights_tidy$airline == "AM West"), 4])
am_total

## [1] 787

Sort

Using dplyr’s mutate and arrange to sort the data by percentage of delayed flights shows that AM West occupied 4 of the top 5 spots.

flights_tidy <- flights_tidy %>%
  mutate(delayed_pct = delayed / (on_time + delayed))

flights_tidy <- flights_tidy %>%
  arrange(desc(delayed_pct))

knitr::kable(flights_tidy)

airline	city	on_time	delayed	delayed_pct
AM West	San Francisco	320	129	0.2873051
AM West	Seattle	201	61	0.2328244
Alaska	San Francisco	503	102	0.1685950
AM West	San Diego	383	65	0.1450893
AM West	Los Angeles	694	117	0.1442663
Alaska	Seattle	1841	305	0.1421249
Alaska	Los Angeles	497	62	0.1109123
Alaska	San Diego	212	20	0.0862069
AM West	Phoenix	4840	415	0.0789724
Alaska	Phoenix	221	12	0.0515021

Visualize

Finally, I created a stacked bar plot to visualize the proportion of delays by airline for each city.

ggplot(flights_tidy, aes(x = city, y = delayed, fill = airline)) + 
    geom_bar(position="stack", stat="identity") +
    labs(x = "City", y = "Number of Delays", fill = "Airline")

Conclusion

Based on both the total number of delays and each airline’s percentage of delayed flights, AM West was more likely to have delayed arrivals than Alaska. There was only one city in which Alaska exceeded AM West’s number of delays in absolute value (Seattle); and there, the percentage of delayed flights was still lower for Alaska at ~14%, with AM West at ~23%.