First, imported the necessary libraries:
library(tidyverse)
library(tidyr)
library(dplyr)
library(ggplot2)
Then, I created and imported the CSV file with flight information in a wide format, with the labels and values in an “uncleaned” state:
flights_wide <- read.csv(file="flights_wk4.csv")
glimpse(flights_wide)
## Rows: 4
## Columns: 7
## $ airline <chr> "Alaska", "Alaska", "AM West", "AM West"
## $ flight.status <chr> "on time", "delayed", "on time", "delayed"
## $ Los.Angeles <int> 497, 62, 694, 117
## $ Phoenix <int> 221, 12, 4840, 415
## $ San.Diego <int> 212, 20, 383, 65
## $ San.Francisco <int> 503, 102, 320, 129
## $ Seattle <int> 1841, 305, 201, 61
To begin tidying the data, I first pivoted the dataset longer to create a column for city names:
flights_long <- flights_wide %>%
pivot_longer(cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
names_to = "city",
values_to = "count")
knitr::kable(flights_long)
airline | flight.status | city | count |
---|---|---|---|
Alaska | on time | Los.Angeles | 497 |
Alaska | on time | Phoenix | 221 |
Alaska | on time | San.Diego | 212 |
Alaska | on time | San.Francisco | 503 |
Alaska | on time | Seattle | 1841 |
Alaska | delayed | Los.Angeles | 62 |
Alaska | delayed | Phoenix | 12 |
Alaska | delayed | San.Diego | 20 |
Alaska | delayed | San.Francisco | 102 |
Alaska | delayed | Seattle | 305 |
AM West | on time | Los.Angeles | 694 |
AM West | on time | Phoenix | 4840 |
AM West | on time | San.Diego | 383 |
AM West | on time | San.Francisco | 320 |
AM West | on time | Seattle | 201 |
AM West | delayed | Los.Angeles | 117 |
AM West | delayed | Phoenix | 415 |
AM West | delayed | San.Diego | 65 |
AM West | delayed | San.Francisco | 129 |
AM West | delayed | Seattle | 61 |
Next, the data actually needed to be pivoted wider again based on the
values in the flight.status
column:
flights_tidy <- flights_long %>%
pivot_wider(names_from = flight.status,
values_from = count)
knitr::kable(flights_tidy)
airline | city | on time | delayed |
---|---|---|---|
Alaska | Los.Angeles | 497 | 62 |
Alaska | Phoenix | 221 | 12 |
Alaska | San.Diego | 212 | 20 |
Alaska | San.Francisco | 503 | 102 |
Alaska | Seattle | 1841 | 305 |
AM West | Los.Angeles | 694 | 117 |
AM West | Phoenix | 4840 | 415 |
AM West | San.Diego | 383 | 65 |
AM West | San.Francisco | 320 | 129 |
AM West | Seattle | 201 | 61 |
Finally, I cleaned up labels and values to make analysis a bit easier:
colnames(flights_tidy)[which(names(flights_tidy) == "on time")] <- "on_time"
flights_tidy$city <- gsub("\\.", " ", flights_tidy$city)
knitr::kable(flights_tidy)
airline | city | on_time | delayed |
---|---|---|---|
Alaska | Los Angeles | 497 | 62 |
Alaska | Phoenix | 221 | 12 |
Alaska | San Diego | 212 | 20 |
Alaska | San Francisco | 503 | 102 |
Alaska | Seattle | 1841 | 305 |
AM West | Los Angeles | 694 | 117 |
AM West | Phoenix | 4840 | 415 |
AM West | San Diego | 383 | 65 |
AM West | San Francisco | 320 | 129 |
AM West | Seattle | 201 | 61 |
Comparing the arrival delays for the 2 airlines could be done in a number of ways.
Using the sum()
function to calculate the total number
of delays across all 5 cities shows that AM West had more total delays
at 787, versus Alaska’s 501.
al_total <- sum(flights_tidy[which(flights_tidy$airline == "Alaska"), 4])
al_total
## [1] 501
am_total <- sum(flights_tidy[which(flights_tidy$airline == "AM West"), 4])
am_total
## [1] 787
Using dplyr
’s mutate
and
arrange
to sort the data by percentage of delayed flights
shows that AM West occupied 4 of the top 5 spots.
flights_tidy <- flights_tidy %>%
mutate(delayed_pct = delayed / (on_time + delayed))
flights_tidy <- flights_tidy %>%
arrange(desc(delayed_pct))
knitr::kable(flights_tidy)
airline | city | on_time | delayed | delayed_pct |
---|---|---|---|---|
AM West | San Francisco | 320 | 129 | 0.2873051 |
AM West | Seattle | 201 | 61 | 0.2328244 |
Alaska | San Francisco | 503 | 102 | 0.1685950 |
AM West | San Diego | 383 | 65 | 0.1450893 |
AM West | Los Angeles | 694 | 117 | 0.1442663 |
Alaska | Seattle | 1841 | 305 | 0.1421249 |
Alaska | Los Angeles | 497 | 62 | 0.1109123 |
Alaska | San Diego | 212 | 20 | 0.0862069 |
AM West | Phoenix | 4840 | 415 | 0.0789724 |
Alaska | Phoenix | 221 | 12 | 0.0515021 |
Finally, I created a stacked bar plot to visualize the proportion of delays by airline for each city.
ggplot(flights_tidy, aes(x = city, y = delayed, fill = airline)) +
geom_bar(position="stack", stat="identity") +
labs(x = "City", y = "Number of Delays", fill = "Airline")
Based on both the total number of delays and each airline’s percentage of delayed flights, AM West was more likely to have delayed arrivals than Alaska. There was only one city in which Alaska exceeded AM West’s number of delays in absolute value (Seattle); and there, the percentage of delayed flights was still lower for Alaska at ~14%, with AM West at ~23%.