Below, the packages required for data analysis and visualization are loaded.
library(magrittr)
library(tidyverse)
library(DT)
Below, we load the messy data regarding delayed vs. on time arrivals for two airlines (ALASKA and AMWEST).
my_url <- "https://raw.githubusercontent.com/geedoubledee/data607_week5/main/week5_flights.csv"
wk5_flights <- read.csv(my_url)
as_tibble(wk5_flights)
## # A tibble: 5 × 7
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## <chr> <chr> <int> <int> <int> <int> <int>
## 1 "ALASKA" "on time" 497 221 212 503 1841
## 2 "" "delayed" 62 12 20 102 305
## 3 "" "" NA NA NA NA NA
## 4 "AMWEST" "on time" 694 4840 383 320 201
## 5 "" "delayed" 117 415 65 129 61
To tidy up the messy arrivals data, we delete one unnecessary row and add airline values to two empty cells. Then we pivot the wide data into a longer format, where each row represents an observation and every column represents a variable. We rename the columns to represent these variables: airline, arrival status, destination, and count.
wk5_flights_tidy <- wk5_flights
wk5_flights_tidy <- subset(wk5_flights_tidy[-3, ])
wk5_flights_tidy[2, 1] <- "ALASKA"
wk5_flights_tidy[4, 1] <- "AMWEST"
wk5_flights_tidy %<>%
pivot_longer(cols = !starts_with("X"),
names_to = "destination",
values_to = "count")
cols <- c("airline", "arrival_status", "destination", "count")
colnames(wk5_flights_tidy) <- cols
datatable(wk5_flights_tidy, options = list(pageLength = 10))
We group the tidy data by airline and arrival status so that we can then sum the total delayed arrivals and the total on time arrivals for each airline. These sums are displayed in a summary table.
wk5_flights_tidy_analysis <- wk5_flights_tidy
wk5_flights_tidy_analysis %<>%
group_by(airline, arrival_status) %>%
summarize(total = sum(count))
datatable(wk5_flights_tidy_analysis)
Looking at the summary table, AMWEST recorded more delayed arrivals than ALASKA. However, AMWEST also recorded a greater number of arrivals total. So we need to compare the proportion of arrivals that were delayed for each airline instead. To easily compare the proportions visually without having to calculate them, we create a percent stacked bar plot.
ggplot(wk5_flights_tidy_analysis, aes(x = airline, y = total,
fill = arrival_status)) +
geom_bar(position = "fill", stat = "identity") +
ggtitle("Proportion of Delayed vs. On Time Arrivals By Airline") +
xlab("airline") +
ylab("proportion")
By looking at the percent stacked bar plot, we see that ALASKA recorded a slightly greater proportion of delayed arrivals than AMWEST.
To confirm what we see and find out how big the difference actually is, we calculate the exact proportions:
alaska_delayed <- unlist(wk5_flights_tidy_analysis[1, 3] /
sum(wk5_flights_tidy_analysis[1:2, 3]))
amwest_delayed <- unlist(wk5_flights_tidy_analysis[3, 3] /
sum(wk5_flights_tidy_analysis[3:4, 3]))
diff <- alaska_delayed - amwest_delayed
printable_vars <- list(
a = round(alaska_delayed, 3),
b = round(amwest_delayed, 3),
c = round(diff, 3)
)
Proportion of ALASKA arrivals delayed: 0.133 | Proportion of AMWEST arrivals delayed: 0.109 | Difference: 0.024
ALASKA had roughly 2% more delayed arrivals than AMWEST.