The untidy CSV data is imported from the Github Repo. To make transformations easier downstream, empty strings were converted to NA values.
#Load the libraries
library (dplyr)
library(tidyr)
#Import the CSV
mydata <- read.csv("https://raw.githubusercontent.com/johnnydrodriguez/data607_week6/main/untidycsv.csv", na.strings=c("","NA"))
knitr::kable(mydata)
X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
NA | delayed | 62 | 12 | 20 | 102 | 305 |
NA | NA | NA | NA | NA | NA | NA |
AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
NA | delayed | 117 | 415 | 65 | 129 | 61 |
A series of transformations were applied to convert the untidy data to a tidy format.
#Converts data to a table for ease in piping.
mydata <- tibble::as_tibble(mydata)
#Renames columns.
mydata <- mydata %>%
rename(Airline = X, Status = X.1)
#Fill missing values in Airline column using preceding value.
mydata <- mydata %>%
fill(Airline)
#Removes row with NA value in Status column
mydata <- mydata %>%
na.omit(Status)
#Converts the individual destination columns into a long table.
mytidydata <- mydata %>%
gather("Destinations", "Count", 3:7)
knitr::kable(mytidydata)
Airline | Status | Destinations | Count |
---|---|---|---|
ALASKA | on time | Los.Angeles | 497 |
ALASKA | delayed | Los.Angeles | 62 |
AM WEST | on time | Los.Angeles | 694 |
AM WEST | delayed | Los.Angeles | 117 |
ALASKA | on time | Phoenix | 221 |
ALASKA | delayed | Phoenix | 12 |
AM WEST | on time | Phoenix | 4840 |
AM WEST | delayed | Phoenix | 415 |
ALASKA | on time | San.Diego | 212 |
ALASKA | delayed | San.Diego | 20 |
AM WEST | on time | San.Diego | 383 |
AM WEST | delayed | San.Diego | 65 |
ALASKA | on time | San.Francisco | 503 |
ALASKA | delayed | San.Francisco | 102 |
AM WEST | on time | San.Francisco | 320 |
AM WEST | delayed | San.Francisco | 129 |
ALASKA | on time | Seattle | 1841 |
ALASKA | delayed | Seattle | 305 |
AM WEST | on time | Seattle | 201 |
AM WEST | delayed | Seattle | 61 |
#Converts the long table into a wider one by assigning to Status values into its own columnns
mytidydata <- mytidydata %>%
spread("Status", "Count")
knitr::kable(mytidydata)
Airline | Destinations | delayed | on time |
---|---|---|---|
ALASKA | Los.Angeles | 62 | 497 |
ALASKA | Phoenix | 12 | 221 |
ALASKA | San.Diego | 20 | 212 |
ALASKA | San.Francisco | 102 | 503 |
ALASKA | Seattle | 305 | 1841 |
AM WEST | Los.Angeles | 117 | 694 |
AM WEST | Phoenix | 415 | 4840 |
AM WEST | San.Diego | 65 | 383 |
AM WEST | San.Francisco | 129 | 320 |
AM WEST | Seattle | 61 | 201 |
There were two methods used to calculate the ratio of delayed flights by Airline. The first method calculated the delay ratio for each airline per destination. To perform this step, the total number of flights per airline, per destination was calculated and then the ratio of delays over total flights to that destination was derived.
#Build columns for calculation
mytidydata <- mytidydata %>%
mutate(allflights = rowSums(across(c('delayed', 'on time')))) %>%
mutate(delay_ratio = delayed / allflights)
knitr::kable(mytidydata)
Airline | Destinations | delayed | on time | allflights | delay_ratio |
---|---|---|---|---|---|
ALASKA | Los.Angeles | 62 | 497 | 559 | 0.1109123 |
ALASKA | Phoenix | 12 | 221 | 233 | 0.0515021 |
ALASKA | San.Diego | 20 | 212 | 232 | 0.0862069 |
ALASKA | San.Francisco | 102 | 503 | 605 | 0.1685950 |
ALASKA | Seattle | 305 | 1841 | 2146 | 0.1421249 |
AM WEST | Los.Angeles | 117 | 694 | 811 | 0.1442663 |
AM WEST | Phoenix | 415 | 4840 | 5255 | 0.0789724 |
AM WEST | San.Diego | 65 | 383 | 448 | 0.1450893 |
AM WEST | San.Francisco | 129 | 320 | 449 | 0.2873051 |
AM WEST | Seattle | 61 | 201 | 262 | 0.2328244 |
The data is summarized to take the average of the delay ratios for all of the destinations. That is, sum up all ratios per destination and then divide by the number of destinations.
Alaska Airline’s delay ratio mean is 11.1%. AM West’s delay ratio is 17.7%.
AirlineRates <- mytidydata %>%
group_by(Airline) %>%
summarise(final = mean(delay_ratio)) %>%
arrange(final)
knitr::kable(AirlineRates)
Airline | final |
---|---|
ALASKA | 0.1118683 |
AM WEST | 0.1776915 |
The second method to calculate this ratio is to sum the total number of flights and sum the total number of delays and derive the delay ratio using the two totals.
This method takes the table and sums the number flights and then the number of delays to create a new column of totals used to calculate the ratio.
Using this method, AM West has a 10.8% delay rate. Alaska Airlines has a delay rate of 13.2%
TotalRates <- mytidydata %>%
group_by(Airline) %>%
summarise(Totalflights = sum(allflights), Totaldelays = sum(delayed)) %>%
mutate(ALLdelay_ratio = Totaldelays / Totalflights) %>%
arrange(ALLdelay_ratio)
knitr::kable(TotalRates)
Airline | Totalflights | Totaldelays | ALLdelay_ratio |
---|---|---|---|
AM WEST | 7225 | 787 | 0.1089273 |
ALASKA | 3775 | 501 | 0.1327152 |
In this case, it is unclear. This data represents an example of Simpson’s Paradox - where patterns/trends that appear in groups of data reverse or disappear when combined. In this case, the pattern of better delay rates for Alaska Airlines that appear when calculated per destination disappear when the values are calculated using the total number of flights and the total number of delays. Likely there are unknown/confounding variables not available in this data. Perhaps weather, distances, type of planes, and airport variables also play a role in the delays.