In this project our goal is to take a .csv of untidy data showing counts of whether two airlines were on time or delayed flying into multiple destinations, transform and tidy the data, and then do an analysis to compare the arrival delays for both airlines.
if (!require("dplyr")) install.packages('dplyr')
if (!require("tidyr")) install.packages('tidyr')
airlines <- read.csv("airline.csv")
knitr::kable(airlines)
| X | X.1 | LosAngeles | Phoenix | SanDiego | SanFrancisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| delayed | 62 | 12 | 20 | 102 | 305 | |
| NA | NA | NA | NA | NA | ||
| AMWEST | on time | 694 | 4840 | 383 | 320 | 201 |
| delayed | 117 | 415 | 65 | 129 | 61 |
# remove middle NA obs
airlines <- airlines[-3,]
# add in missing airline values
airlines[2,1] <- "ALASKA"
airlines[4,1] <- "AMWEST"
# rename cols
names(airlines)[c(1,2)] <- c("airline", "status")
# tidy data
tidy <- airlines %>%
gather(cities, count, 3:7) %>%
spread(status, count)
#update col
names(tidy)[4] <- "ontime"
# let's review our progress
knitr::kable(tidy)
| airline | cities | delayed | ontime |
|---|---|---|---|
| ALASKA | LosAngeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | SanDiego | 20 | 212 |
| ALASKA | SanFrancisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| AMWEST | LosAngeles | 117 | 694 |
| AMWEST | Phoenix | 415 | 4840 |
| AMWEST | SanDiego | 65 | 383 |
| AMWEST | SanFrancisco | 129 | 320 |
| AMWEST | Seattle | 61 | 201 |
#let's get some delay ratios
tidy <- tidy %>%
mutate(total=ontime + delayed, delayratio=round(delayed/total, 3))
#let's plot these delay ratios
dotchart(tidy$delayratio, groups = tidy$cities, labels=tidy$airline, main = "Ratio of Delayed Flights by City")
summary <- tidy %>%
group_by(airline) %>%
summarise(delayed = sum(delayed), ontime = sum(ontime), total = sum(total)) %>%
mutate(delayratio = round(delayed/total, 3))
knitr::kable(summary)
| airline | delayed | ontime | total | delayratio |
|---|---|---|---|---|
| ALASKA | 501 | 3274 | 3775 | 0.133 |
| AMWEST | 787 | 6438 | 7225 | 0.109 |
#let's plot delay ratios
dotchart(summary$delayratio, labels=summary$airline, main = "Ratio of Delayed Flights Across All Cities")
The individual city delay ratios by airline show that ALASKA airlines had a lower individual delay ratio than AMWEST in every single city. However the aggregated data shows that AMWEST had a lower overall delay ratio, due to the larger volume of flights.
This is an example of Simpson’s paradox.