The chart below describes arrival delays for two airlines across five destinations. Your task is to: Create a .CSV file (or optionally, a MySQL database!) that includes all of the information below. You’re encouraged to use a “wide” structure similar to how the information appears below, so that you can practice tidying and transformations. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. Perform analysis to compare the arrival delays for the two airlines. Your should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Los Angeles | Phoenix | San Diego | San Francisco | Seattle | ||
---|---|---|---|---|---|---|
ALASKA | On Time | 497 | 221 | 212 | 503 | 1841 |
Delayed | 62 | 12 | 20 | 102 | 305 | |
AM WEST | On Time | 694 | 4840 | 383 | 320 | 201 |
Delayed | 117 | 415 | 65 | 129 | 61 |
csv <- rbind(c(NA, NA, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
c("ALASKA", "On Time", 497, 221, 212, 503, 1841),
c(NA, "Delayed", 62, 12, 20, 102, 305),
c("AM WEST", "On Time", 694, 4840, 383, 320, 201),
c(NA, "Delayed", 117, 415, 65, 129, 61))
write.table(csv, file = "DATA_607_Assignment3.csv", sep = ",", col.names=F, row.names=F)
wide_data <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
"master/DATA%20607/DATA_607_Assignment3.csv"),
stringsAsFactors = F); wide_data
## NA. NA..1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA On Time 497 221 212 503 1841
## 2 <NA> Delayed 62 12 20 102 305
## 3 AM WEST On Time 694 4840 383 320 201
## 4 <NA> Delayed 117 415 65 129 61
library(tidyr)
library(dplyr)
long_data <- wide_data %>%
mutate(Airlines1 = NA., Airlines2=lag(NA.)) %>%
mutate(Airline = coalesce(Airlines1,Airlines2), Status = NA..1) %>%
gather("Destination", "Flights", 3:7) %>%
select(Airline:Flights) %>%
arrange(Airline, desc(Status), Destination); long_data
## Airline Status Destination Flights
## 1 ALASKA On Time Los.Angeles 497
## 2 ALASKA On Time Phoenix 221
## 3 ALASKA On Time San.Diego 212
## 4 ALASKA On Time San.Francisco 503
## 5 ALASKA On Time Seattle 1841
## 6 ALASKA Delayed Los.Angeles 62
## 7 ALASKA Delayed Phoenix 12
## 8 ALASKA Delayed San.Diego 20
## 9 ALASKA Delayed San.Francisco 102
## 10 ALASKA Delayed Seattle 305
## 11 AM WEST On Time Los.Angeles 694
## 12 AM WEST On Time Phoenix 4840
## 13 AM WEST On Time San.Diego 383
## 14 AM WEST On Time San.Francisco 320
## 15 AM WEST On Time Seattle 201
## 16 AM WEST Delayed Los.Angeles 117
## 17 AM WEST Delayed Phoenix 415
## 18 AM WEST Delayed San.Diego 65
## 19 AM WEST Delayed San.Francisco 129
## 20 AM WEST Delayed Seattle 61
The data are first pulled from GitHub in a “wide” structure format. Then then data is converted into a “long” structure using tidyr and dplyr. The mutate function is used to duplicate the first column twice (once with the cells shifted down). The mutate function is then used again with coalesce to combine the new columns into one column without null values. The gather function transforms the data from its crosstab appearance to a more normalized format. Finally, the select function is used to query the columns we want in the final “long” structure. Last, the filter function is used to split the data into the two groups that will be compared.
Performance <- long_data %>%
group_by(Airline) %>%
mutate(A_Total = sum(Flights)) %>%
group_by(Airline, Status) %>%
mutate(S_Total = sum(Flights), A_Perf = S_Total / A_Total) %>%
group_by(Airline, Destination) %>%
mutate(D_Total = sum(Flights), D_Perf = Flights / D_Total)
# Alaska Airlines and AM West Overall Performance
data.frame(Performance[c(1,10,11,20), c(1,2,7)])
## Airline Status A_Perf
## 1 ALASKA On Time 0.8672848
## 2 ALASKA Delayed 0.1327152
## 3 AM WEST On Time 0.8910727
## 4 AM WEST Delayed 0.1089273
# Alaska Airlines and AM West by Destination
Perf_by_Dest <- data.frame(Performance %>%
filter(Status == "On Time") %>%
group_by(Destination) %>%
unite(temp, D_Total, D_Perf) %>%
spread(Status, temp) %>%
select(Destination, Airline, On_Time = `On Time`) %>%
spread(Airline, On_Time) %>%
separate(ALASKA, into=c("Flights.AL", "Perf.AL"), sep = "_", convert=T) %>%
separate(`AM WEST`, into=c("Flights.AM", "Perf.AM"), sep = "_", convert=T) %>%
mutate(Difference = Perf.AL - Perf.AM) %>%
arrange(Destination)); Perf_by_Dest
## Destination Flights.AL Perf.AL Flights.AM Perf.AM Difference
## 1 Los.Angeles 559 0.8890877 811 0.8557337 0.03335399
## 2 Phoenix 233 0.9484979 5255 0.9210276 0.02747026
## 3 San.Diego 232 0.9137931 448 0.8549107 0.05888239
## 4 San.Francisco 605 0.8314050 449 0.7126949 0.11871008
## 5 Seattle 2146 0.8578751 262 0.7671756 0.09069954
First the functions group_by and mutate are used multiple times on the data in order to calculate and append aggregate values to the data set. The data for each airline are then summarized by arrival status (“On Time”, “Delayed”). Then the data are manipulated as follows in order to allow for comparisons:
After all this, the summarized comparison of performance for each airline at each destination is displayed.
Alaska Airlines’ 3775 flights were on-time 86.73% of the time. AM West’s (acquired by American Airlines) 7225 flights were on-time 89.11% of the time. The pattern persists even when looking at individual destinations. Yet for some destinations the difference in performance is larger than others. Compare Phoenix with the smallest performance difference of 0.0274703 and San Francisco with the largest performance difference of 0.1187101. A cursory review appears to indicate–holding all else equal– that Alaska Airlines performs better when a fewer number of flights are involved and that AM West performs better when larger number of flights are involved. Correlating performance on the number of flights, the data appear to support this conclusion with \(r = -0.5373998\) for Alaska Airlines and \(r = 0.7069816\) for AM West.