Abstract

In this project our goal is to take a .csv of untidy data showing counts of whether two airlines were on time or delayed flying into multiple destinations, transform and tidy the data, and then do an analysis to compare the arrival delays for both airlines.

Preparation

Environment Prep

if (!require("dplyr")) install.packages('dplyr')
if (!require("tidyr")) install.packages('tidyr')

Data Import

airlines <- read.csv("airline.csv")
knitr::kable(airlines)
X X.1 LosAngeles Phoenix SanDiego SanFrancisco Seattle
ALASKA on time 497 221 212 503 1841
delayed 62 12 20 102 305
NA NA NA NA NA
AMWEST on time 694 4840 383 320 201
delayed 117 415 65 129 61

Data Transformation

# remove middle NA obs
airlines <- airlines[-3,]

# add in missing airline values
airlines[2,1] <- "ALASKA"
airlines[4,1] <- "AMWEST"

# rename cols
names(airlines)[c(1,2)] <- c("airline", "status")

# tidy data
tidy <- airlines %>% 
    gather(cities, count, 3:7) %>% 
    spread(status, count)

#update col
names(tidy)[4] <- "ontime"

# let's review our progress
knitr::kable(tidy)
airline cities delayed ontime
ALASKA LosAngeles 62 497
ALASKA Phoenix 12 221
ALASKA SanDiego 20 212
ALASKA SanFrancisco 102 503
ALASKA Seattle 305 1841
AMWEST LosAngeles 117 694
AMWEST Phoenix 415 4840
AMWEST SanDiego 65 383
AMWEST SanFrancisco 129 320
AMWEST Seattle 61 201

Data Analysis

Ratios by City

#let's get some delay ratios
tidy <- tidy %>%
    mutate(total=ontime + delayed, delayratio=round(delayed/total, 3)) 

#let's plot these delay ratios
dotchart(tidy$delayratio, groups = tidy$cities, labels=tidy$airline, main = "Ratio of Delayed Flights by City")

Ratios in Total

summary <- tidy %>% 
    group_by(airline) %>% 
    summarise(delayed = sum(delayed), ontime = sum(ontime), total = sum(total)) %>%
    mutate(delayratio = round(delayed/total, 3))

knitr::kable(summary)
airline delayed ontime total delayratio
ALASKA 501 3274 3775 0.133
AMWEST 787 6438 7225 0.109
#let's plot delay ratios
dotchart(summary$delayratio, labels=summary$airline, main = "Ratio of Delayed Flights Across All Cities")

Conclusion

The individual city delay ratios by airline show that ALASKA airlines had a lower individual delay ratio than AMWEST in every single city. However the aggregated data shows that AMWEST had a lower overall delay ratio, due to the larger volume of flights.

This is an example of Simpson’s paradox.