The given chart describes arrival information for two different airlines at five different locations. The goal of the assignment is to use tidyr and dplyr as needed to tidy and transform the data and then to perform analysis to compare the arrival delays for the two airlines.
A copy of this R Markdown file and the associated .csv file are located in my Github directory at:
https://github.com/stipton/CUNY-SPS/tree/master/DATA%20607/Week%205
rm(list = ls())
library(plyr)
library(tidyr)
library(ggplot2)
library(reshape2)##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
After copying the information in a .csv file, I use the read.csv function to import the infomation into R.
arrivals.raw <- read.csv("DATA607HW5data.csv", header = TRUE, stringsAsFactors = FALSE)
arrivals.raw## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
As an initial step in cleaning up the data, I add names to the first two columns, fill in the missing airline names, and remove the empty row from the data.
names(arrivals.raw)[1:2] <- c("airline", "status")
arrivals.raw[2,1] <- "ALASKA"
arrivals.raw[5,1] <- "AM WEST"
arrivals <- arrivals.raw[-3,]
arrivals## airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
The information in the file is cross-tabulated - helpful for viewing and presenting to end users, but not the best format for performing analysis. Examining the data reveals several variables that should be separated into their own columns:
The gather function melts the wide data set into the long form.
(arrivals <- gather(arrivals, "city", "n", 3:7))## airline status city n
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 AM WEST delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 ALASKA delayed San.Francisco 102
## 15 AM WEST on time San.Francisco 320
## 16 AM WEST delayed San.Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
Since each airline and city combination has values for both on time and delayed arrivals, it is more appropriate to have columns for that information to aid in the analysis section. The spread function brings the data into two columns.
(arrivals <- spread(arrivals, status, n))## airline city delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
As one final piece of tidying, I use the make.names function to ensure that all the column names created from the data are in the correct format (to wit, the space must be removed from the on time column name).
names(arrivals) <- make.names(names(arrivals))
arrivals## airline city delayed on.time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
To begin comparing the arrival delays between the two airlines, I used the mutate function to add columns for the total number of flights per row, as well as the frequency of delayed and on-time arrivals.
(arrivals <- mutate(arrivals,
total = delayed + on.time,
freq.del = delayed / total,
freq.on = on.time / total))## airline city delayed on.time total freq.del freq.on
## 1 ALASKA Los.Angeles 62 497 559 0.11091234 0.8890877
## 2 ALASKA Phoenix 12 221 233 0.05150215 0.9484979
## 3 ALASKA San.Diego 20 212 232 0.08620690 0.9137931
## 4 ALASKA San.Francisco 102 503 605 0.16859504 0.8314050
## 5 ALASKA Seattle 305 1841 2146 0.14212488 0.8578751
## 6 AM WEST Los.Angeles 117 694 811 0.14426634 0.8557337
## 7 AM WEST Phoenix 415 4840 5255 0.07897241 0.9210276
## 8 AM WEST San.Diego 65 383 448 0.14508929 0.8549107
## 9 AM WEST San.Francisco 129 320 449 0.28730512 0.7126949
## 10 AM WEST Seattle 61 201 262 0.23282443 0.7671756
Subsetting the dataset by airline, I calculated the frequency of delayed flights for each by taking the total number of delayed flights over the total number of flights.
arrivals.alaska <- subset(arrivals, airline == "ALASKA")
arrivals.amwest <- subset(arrivals, airline == "AM WEST")
(ak.freq.del <- sum(arrivals.alaska$delayed) / sum(arrivals.alaska$total))## [1] 0.1327152
(am.freq.del <- sum(arrivals.amwest$delayed) / sum(arrivals.amwest$total))## [1] 0.1089273
freq.overall <- data.frame(airline = c("Alaska", "AM West"), freq.del = c(ak.freq.del, am.freq.del))
ggplot(freq.overall, aes(airline, freq.del, fill = airline)) +
geom_col() +
labs(title = "Overall Frequency of Delays by Airline", x = "Airline", y = "Frequency of Delays")Overall, Alaska Airlines, with 13.3% delayed arrivals for these cities, fares more poorly than AM West Airlines, with 10.9% delayed arrivals for these cities.
To see how each airline performs within each city, I constructed a barplot to compare the frequency of delays. From the barplot, we can see that the frequency of delayed arrivals is lower in each city for Alaska Airlines than AM West Airlines. This result is unexpected, given that the frequency of delayed arrivals for Alaska Airlines overall was higher than AM West Airlines.
arrivals.long <- melt(arrivals, id.vars = c("airline", "city", "freq.del"))
ggplot(arrivals.long, aes(city, freq.del, fill = airline)) +
geom_col(position = "dodge") +
labs(title = "Frequency of Delays by City and Airline", x = "City", y = "Frequency of Delays")This exercise highlights an example of perspective in statistics, where the outcome of the statistics changes depending on how you look at the problem. In such cases, it is important to evaluate what factors are at play within the data that create apparently conflicting results.
In this case, the large number of flights that AM West services into Phoenix is affecting the overall result. Because the scale of that number is much larger than the other values in the data set, the large number of on-time flights for AM West into Phoenix is able to absorb the airline’s delayed flights in other cities for the overall frequency of delays. As a result, the overall frequency of delays for AM West is lower than the overall frequency for Alaska Airlines. Breaking down the arrivals by city properly aligns the data to compare the airlines along the same factor.
As a matter of perspective, this method of comparing delays by city may be best for a consumer deciding on which plane ticket to buy. However, clever marketers (or in other cases, clever politicians) may choose the overall frequency as a truthful but slightly more opaque statistic.