Data 607 - Week 5

Load the Packages:

Below, the packages required for data analysis and visualization are loaded.

library(magrittr)
library(tidyverse)
library(DT)

Load the Messy Data:

Below, we load the messy data regarding delayed vs. on time arrivals for two airlines (ALASKA and AMWEST).

my_url <- "https://raw.githubusercontent.com/geedoubledee/data607_week5/main/week5_flights.csv"
wk5_flights <- read.csv(my_url)
as_tibble(wk5_flights)

## # A tibble: 5 × 7
##   X        X.1       Los.Angeles Phoenix San.Diego San.Francisco Seattle
##   <chr>    <chr>           <int>   <int>     <int>         <int>   <int>
## 1 "ALASKA" "on time"         497     221       212           503    1841
## 2 ""       "delayed"          62      12        20           102     305
## 3 ""       ""                 NA      NA        NA            NA      NA
## 4 "AMWEST" "on time"         694    4840       383           320     201
## 5 ""       "delayed"         117     415        65           129      61

Tidy Up the Data:

To tidy up the messy arrivals data, we delete one unnecessary row and add airline values to two empty cells. Then we pivot the wide data into a longer format, where each row represents an observation and every column represents a variable. We rename the columns to represent these variables: airline, arrival status, destination, and count.

wk5_flights_tidy <- wk5_flights
wk5_flights_tidy <- subset(wk5_flights_tidy[-3, ])
wk5_flights_tidy[2, 1] <- "ALASKA"
wk5_flights_tidy[4, 1] <- "AMWEST"
wk5_flights_tidy %<>%
    pivot_longer(cols = !starts_with("X"),
                 names_to = "destination",
                 values_to = "count")
cols <- c("airline", "arrival_status", "destination", "count")
colnames(wk5_flights_tidy) <- cols
datatable(wk5_flights_tidy, options = list(pageLength = 10))

Analysis:

We group the tidy data by airline and arrival status so that we can then sum the total delayed arrivals and the total on time arrivals for each airline. These sums are displayed in a summary table.

wk5_flights_tidy_analysis <- wk5_flights_tidy
wk5_flights_tidy_analysis %<>%
    group_by(airline, arrival_status) %>%
    summarize(total = sum(count))
datatable(wk5_flights_tidy_analysis)

Looking at the summary table, AMWEST recorded more delayed arrivals than ALASKA. However, AMWEST also recorded a greater number of arrivals total. So we need to compare the proportion of arrivals that were delayed for each airline instead. To easily compare the proportions visually without having to calculate them, we create a percent stacked bar plot.

ggplot(wk5_flights_tidy_analysis, aes(x = airline, y = total,
                                      fill = arrival_status)) +
    geom_bar(position = "fill", stat = "identity") +
    ggtitle("Proportion of Delayed vs. On Time Arrivals By Airline") +
    xlab("airline") +
    ylab("proportion")

By looking at the percent stacked bar plot, we see that ALASKA recorded a slightly greater proportion of delayed arrivals than AMWEST.

Conclusions:

To confirm what we see and find out how big the difference actually is, we calculate the exact proportions:

alaska_delayed <- unlist(wk5_flights_tidy_analysis[1, 3] /
    sum(wk5_flights_tidy_analysis[1:2, 3]))
amwest_delayed <- unlist(wk5_flights_tidy_analysis[3, 3] /
    sum(wk5_flights_tidy_analysis[3:4, 3]))

diff <- alaska_delayed - amwest_delayed

printable_vars <- list(
    a = round(alaska_delayed, 3),
    b = round(amwest_delayed, 3),
    c = round(diff, 3)
)

Proportion of ALASKA arrivals delayed: 0.133 | Proportion of AMWEST arrivals delayed: 0.109 | Difference: 0.024

ALASKA had roughly 2% more delayed arrivals than AMWEST.