Setup the enviroment and importing the CSV

The untidy CSV data is imported from the Github Repo. To make transformations easier downstream, empty strings were converted to NA values.

#Load the libraries
library (dplyr)
library(tidyr)

#Import the CSV
mydata <- read.csv("https://raw.githubusercontent.com/johnnydrodriguez/data607_week6/main/untidycsv.csv", na.strings=c("","NA"))
knitr::kable(mydata)
X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61


Tiday Data Transformation

A series of transformations were applied to convert the untidy data to a tidy format.

#Converts data to a table for ease in piping.
mydata <- tibble::as_tibble(mydata)

#Renames columns.
mydata <- mydata %>% 
  rename(Airline = X, Status = X.1)

#Fill missing values in Airline column using preceding value.
mydata <- mydata %>%
  fill(Airline)

#Removes row with NA value in Status column
mydata <- mydata %>% 
  na.omit(Status)

#Converts the individual destination columns into a long table.
mytidydata <- mydata %>% 
  gather("Destinations", "Count",  3:7)
knitr::kable(mytidydata)
Airline Status Destinations Count
ALASKA on time Los.Angeles 497
ALASKA delayed Los.Angeles 62
AM WEST on time Los.Angeles 694
AM WEST delayed Los.Angeles 117
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12
AM WEST on time Phoenix 4840
AM WEST delayed Phoenix 415
ALASKA on time San.Diego 212
ALASKA delayed San.Diego 20
AM WEST on time San.Diego 383
AM WEST delayed San.Diego 65
ALASKA on time San.Francisco 503
ALASKA delayed San.Francisco 102
AM WEST on time San.Francisco 320
AM WEST delayed San.Francisco 129
ALASKA on time Seattle 1841
ALASKA delayed Seattle 305
AM WEST on time Seattle 201
AM WEST delayed Seattle 61
#Converts the long table into a wider one by assigning to Status values into its own columnns
mytidydata <- mytidydata %>% 
  spread("Status", "Count")
knitr::kable(mytidydata)
Airline Destinations delayed on time
ALASKA Los.Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San.Diego 20 212
ALASKA San.Francisco 102 503
ALASKA Seattle 305 1841
AM WEST Los.Angeles 117 694
AM WEST Phoenix 415 4840
AM WEST San.Diego 65 383
AM WEST San.Francisco 129 320
AM WEST Seattle 61 201


Delayed Flights Analysis

There were two methods used to calculate the ratio of delayed flights by Airline. The first method calculated the delay ratio for each airline per destination. To perform this step, the total number of flights per airline, per destination was calculated and then the ratio of delays over total flights to that destination was derived.

#Build columns for calculation
mytidydata <- mytidydata %>% 
  mutate(allflights = rowSums(across(c('delayed', 'on time')))) %>% 
  mutate(delay_ratio = delayed / allflights)

knitr::kable(mytidydata)
Airline Destinations delayed on time allflights delay_ratio
ALASKA Los.Angeles 62 497 559 0.1109123
ALASKA Phoenix 12 221 233 0.0515021
ALASKA San.Diego 20 212 232 0.0862069
ALASKA San.Francisco 102 503 605 0.1685950
ALASKA Seattle 305 1841 2146 0.1421249
AM WEST Los.Angeles 117 694 811 0.1442663
AM WEST Phoenix 415 4840 5255 0.0789724
AM WEST San.Diego 65 383 448 0.1450893
AM WEST San.Francisco 129 320 449 0.2873051
AM WEST Seattle 61 201 262 0.2328244


Per Destination, Alaska Airlines is delayed less often

The data is summarized to take the average of the delay ratios for all of the destinations. That is, sum up all ratios per destination and then divide by the number of destinations.

Alaska Airline’s delay ratio mean is 11.1%. AM West’s delay ratio is 17.7%.

AirlineRates <- mytidydata %>% 
  group_by(Airline) %>% 
  summarise(final = mean(delay_ratio)) %>% 
  arrange(final)

knitr::kable(AirlineRates)
Airline final
ALASKA 0.1118683
AM WEST 0.1776915


But when totaling number of flights, AM WEST is delayed less often

The second method to calculate this ratio is to sum the total number of flights and sum the total number of delays and derive the delay ratio using the two totals.

This method takes the table and sums the number flights and then the number of delays to create a new column of totals used to calculate the ratio.

Using this method, AM West has a 10.8% delay rate. Alaska Airlines has a delay rate of 13.2%

TotalRates <- mytidydata %>%
  group_by(Airline) %>% 
  summarise(Totalflights = sum(allflights), Totaldelays = sum(delayed)) %>% 
  mutate(ALLdelay_ratio = Totaldelays / Totalflights) %>% 
  arrange(ALLdelay_ratio)

knitr::kable(TotalRates)
Airline Totalflights Totaldelays ALLdelay_ratio
AM WEST 7225 787 0.1089273
ALASKA 3775 501 0.1327152


Conclusion - Which is correct?

In this case, it is unclear. This data represents an example of Simpson’s Paradox - where patterns/trends that appear in groups of data reverse or disappear when combined. In this case, the pattern of better delay rates for Alaska Airlines that appear when calculated per destination disappear when the values are calculated using the total number of flights and the total number of delays. Likely there are unknown/confounding variables not available in this data. Perhaps weather, distances, type of planes, and airport variables also play a role in the delays.

Source: https://en.wikipedia.org/wiki/Simpson%27s_paradox