Data607_Week6Assignment

Setup the enviroment and importing the CSV

The untidy CSV data is imported from the Github Repo. To make transformations easier downstream, empty strings were converted to NA values.

#Load the libraries
library (dplyr)
library(tidyr)

#Import the CSV
mydata <- read.csv("https://raw.githubusercontent.com/johnnydrodriguez/data607_week6/main/untidycsv.csv", na.strings=c("","NA"))
knitr::kable(mydata)

X	X.1	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
NA	delayed	62	12	20	102	305
NA	NA	NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
NA	delayed	117	415	65	129	61

Tiday Data Transformation

A series of transformations were applied to convert the untidy data to a tidy format.

#Converts data to a table for ease in piping.
mydata <- tibble::as_tibble(mydata)

#Renames columns.
mydata <- mydata %>% 
  rename(Airline = X, Status = X.1)

#Fill missing values in Airline column using preceding value.
mydata <- mydata %>%
  fill(Airline)

#Removes row with NA value in Status column
mydata <- mydata %>% 
  na.omit(Status)

#Converts the individual destination columns into a long table.
mytidydata <- mydata %>% 
  gather("Destinations", "Count",  3:7)
knitr::kable(mytidydata)

Airline	Status	Destinations	Count
ALASKA	on time	Los.Angeles	497
ALASKA	delayed	Los.Angeles	62
AM WEST	on time	Los.Angeles	694
AM WEST	delayed	Los.Angeles	117
ALASKA	on time	Phoenix	221
ALASKA	delayed	Phoenix	12
AM WEST	on time	Phoenix	4840
AM WEST	delayed	Phoenix	415
ALASKA	on time	San.Diego	212
ALASKA	delayed	San.Diego	20
AM WEST	on time	San.Diego	383
AM WEST	delayed	San.Diego	65
ALASKA	on time	San.Francisco	503
ALASKA	delayed	San.Francisco	102
AM WEST	on time	San.Francisco	320
AM WEST	delayed	San.Francisco	129
ALASKA	on time	Seattle	1841
ALASKA	delayed	Seattle	305
AM WEST	on time	Seattle	201
AM WEST	delayed	Seattle	61

#Converts the long table into a wider one by assigning to Status values into its own columnns
mytidydata <- mytidydata %>% 
  spread("Status", "Count")
knitr::kable(mytidydata)

Airline	Destinations	delayed	on time
ALASKA	Los.Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San.Diego	20	212
ALASKA	San.Francisco	102	503
ALASKA	Seattle	305	1841
AM WEST	Los.Angeles	117	694
AM WEST	Phoenix	415	4840
AM WEST	San.Diego	65	383
AM WEST	San.Francisco	129	320
AM WEST	Seattle	61	201

Delayed Flights Analysis

There were two methods used to calculate the ratio of delayed flights by Airline. The first method calculated the delay ratio for each airline per destination. To perform this step, the total number of flights per airline, per destination was calculated and then the ratio of delays over total flights to that destination was derived.

#Build columns for calculation
mytidydata <- mytidydata %>% 
  mutate(allflights = rowSums(across(c('delayed', 'on time')))) %>% 
  mutate(delay_ratio = delayed / allflights)

knitr::kable(mytidydata)

Airline	Destinations	delayed	on time	allflights	delay_ratio
ALASKA	Los.Angeles	62	497	559	0.1109123
ALASKA	Phoenix	12	221	233	0.0515021
ALASKA	San.Diego	20	212	232	0.0862069
ALASKA	San.Francisco	102	503	605	0.1685950
ALASKA	Seattle	305	1841	2146	0.1421249
AM WEST	Los.Angeles	117	694	811	0.1442663
AM WEST	Phoenix	415	4840	5255	0.0789724
AM WEST	San.Diego	65	383	448	0.1450893
AM WEST	San.Francisco	129	320	449	0.2873051
AM WEST	Seattle	61	201	262	0.2328244

Per Destination, Alaska Airlines is delayed less often

The data is summarized to take the average of the delay ratios for all of the destinations. That is, sum up all ratios per destination and then divide by the number of destinations.

Alaska Airline’s delay ratio mean is 11.1%. AM West’s delay ratio is 17.7%.

AirlineRates <- mytidydata %>% 
  group_by(Airline) %>% 
  summarise(final = mean(delay_ratio)) %>% 
  arrange(final)

knitr::kable(AirlineRates)

Airline	final
ALASKA	0.1118683
AM WEST	0.1776915

But when totaling number of flights, AM WEST is delayed less often

The second method to calculate this ratio is to sum the total number of flights and sum the total number of delays and derive the delay ratio using the two totals.

This method takes the table and sums the number flights and then the number of delays to create a new column of totals used to calculate the ratio.

Using this method, AM West has a 10.8% delay rate. Alaska Airlines has a delay rate of 13.2%

TotalRates <- mytidydata %>%
  group_by(Airline) %>% 
  summarise(Totalflights = sum(allflights), Totaldelays = sum(delayed)) %>% 
  mutate(ALLdelay_ratio = Totaldelays / Totalflights) %>% 
  arrange(ALLdelay_ratio)

knitr::kable(TotalRates)

Airline	Totalflights	Totaldelays	ALLdelay_ratio
AM WEST	7225	787	0.1089273
ALASKA	3775	501	0.1327152

Conclusion - Which is correct?

In this case, it is unclear. This data represents an example of Simpson’s Paradox - where patterns/trends that appear in groups of data reverse or disappear when combined. In this case, the pattern of better delay rates for Alaska Airlines that appear when calculated per destination disappear when the values are calculated using the total number of flights and the total number of delays. Likely there are unknown/confounding variables not available in this data. Perhaps weather, distances, type of planes, and airport variables also play a role in the delays.

Source: https://en.wikipedia.org/wiki/Simpson%27s_paradox