Assignment - Tidying and Transforming Data

Arrival Delays

		Los Angeles	Phoenix	San Diego	San Francisco	Seattle
ALASKA	On Time	497	221	212	503	1841
	Delayed	62	12	20	102	305
AM WEST	On Time	694	4840	383	320	201
	Delayed	117	415	65	129	61

Create a .CSV file that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations.

csv <- rbind(c(NA, NA, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
             c("ALASKA", "On Time", 497, 221, 212, 503, 1841),
             c(NA, "Delayed", 62, 12, 20, 102, 305),
             c("AM WEST", "On Time", 694, 4840, 383, 320, 201),
             c(NA, "Delayed", 117, 415, 65, 129, 61))

write.table(csv, file = "DATA_607_Assignment3.csv", sep = ",", col.names=F, row.names=F)

Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

wide_data <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
                             "master/DATA%20607/DATA_607_Assignment3.csv"), 
                             stringsAsFactors = F); wide_data

##       NA.   NA..1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA On Time         497     221       212           503    1841
## 2    <NA> Delayed          62      12        20           102     305
## 3 AM WEST On Time         694    4840       383           320     201
## 4    <NA> Delayed         117     415        65           129      61

library(tidyr)
library(dplyr)

long_data <- wide_data %>%
mutate(Airlines1 = NA., Airlines2=lag(NA.)) %>% 
mutate(Airline = coalesce(Airlines1,Airlines2), Status = NA..1)  %>%
gather("Destination", "Flights", 3:7) %>% 
select(Airline:Flights) %>% 
arrange(Airline, desc(Status), Destination); long_data

##    Airline  Status   Destination Flights
## 1   ALASKA On Time   Los.Angeles     497
## 2   ALASKA On Time       Phoenix     221
## 3   ALASKA On Time     San.Diego     212
## 4   ALASKA On Time San.Francisco     503
## 5   ALASKA On Time       Seattle    1841
## 6   ALASKA Delayed   Los.Angeles      62
## 7   ALASKA Delayed       Phoenix      12
## 8   ALASKA Delayed     San.Diego      20
## 9   ALASKA Delayed San.Francisco     102
## 10  ALASKA Delayed       Seattle     305
## 11 AM WEST On Time   Los.Angeles     694
## 12 AM WEST On Time       Phoenix    4840
## 13 AM WEST On Time     San.Diego     383
## 14 AM WEST On Time San.Francisco     320
## 15 AM WEST On Time       Seattle     201
## 16 AM WEST Delayed   Los.Angeles     117
## 17 AM WEST Delayed       Phoenix     415
## 18 AM WEST Delayed     San.Diego      65
## 19 AM WEST Delayed San.Francisco     129
## 20 AM WEST Delayed       Seattle      61

The data are first pulled from GitHub in a “wide” structure format. Then then data is converted into a “long” structure using tidyr and dplyr. The mutate function is used to duplicate the first column twice (once with the cells shifted down). The mutate function is then used again with coalesce to combine the new columns into one column without null values. The gather function transforms the data from its crosstab appearance to a more normalized format. Finally, the select function is used to query the columns we want in the final “long” structure. Last, the filter function is used to split the data into the two groups that will be compared.

Perform analysis to compare the arrival delays for the two airlines.

Performance <- long_data %>%
  group_by(Airline) %>%
  mutate(A_Total = sum(Flights)) %>% 
  group_by(Airline, Status) %>%
  mutate(S_Total = sum(Flights), A_Perf = S_Total / A_Total) %>% 
  group_by(Airline, Destination) %>%
  mutate(D_Total = sum(Flights), D_Perf = Flights / D_Total)

# Alaska Airlines and AM West Overall Performance
data.frame(Performance[c(1,10,11,20), c(1,2,7)])

##   Airline  Status    A_Perf
## 1  ALASKA On Time 0.8672848
## 2  ALASKA Delayed 0.1327152
## 3 AM WEST On Time 0.8910727
## 4 AM WEST Delayed 0.1089273

# Alaska Airlines and AM West by Destination
Perf_by_Dest <- data.frame(Performance %>%
  filter(Status == "On Time") %>% 
  group_by(Destination) %>% 
  unite(temp, D_Total, D_Perf) %>%
  spread(Status, temp) %>%
  select(Destination, Airline, On_Time = `On Time`) %>% 
  spread(Airline, On_Time) %>%
  separate(ALASKA, into=c("Flights.AL", "Perf.AL"), sep = "_", convert=T) %>% 
  separate(`AM WEST`, into=c("Flights.AM", "Perf.AM"), sep = "_", convert=T) %>% 
  mutate(Difference = Perf.AL - Perf.AM)  %>% 
  arrange(Destination)); Perf_by_Dest

##     Destination Flights.AL   Perf.AL Flights.AM   Perf.AM Difference
## 1   Los.Angeles        559 0.8890877        811 0.8557337 0.03335399
## 2       Phoenix        233 0.9484979       5255 0.9210276 0.02747026
## 3     San.Diego        232 0.9137931        448 0.8549107 0.05888239
## 4 San.Francisco        605 0.8314050        449 0.7126949 0.11871008
## 5       Seattle       2146 0.8578751        262 0.7671756 0.09069954

First the functions group_by and mutate are used multiple times on the data in order to calculate and append aggregate values to the data set. The data for each airline are then summarized by arrival status (“On Time”, “Delayed”). Then the data are manipulated as follows in order to allow for comparisons:

filter is used to extract only On Time data.
group_by is used to view the data by destination.
unite is used to combine two columns into one.
spread is used to flatten the data on united column.
select is used to narrow the data and rename the united column.
spread is used to flatten the data further airline.
separate is used twice to split the united column into four.
mutate is used to add a calculated column to the data.
arrange is used to sort the data.

After all this, the summarized comparison of performance for each airline at each destination is displayed.

Conclusions.

Alaska Airlines’ 3775 flights were on-time 86.73% of the time. AM West’s (acquired by American Airlines) 7225 flights were on-time 89.11% of the time. The pattern persists even when looking at individual destinations. Yet for some destinations the difference in performance is larger than others. Compare Phoenix with the smallest performance difference of 0.0274703 and San Francisco with the largest performance difference of 0.1187101. A cursory review appears to indicate–holding all else equal– that Alaska Airlines performs better when a fewer number of flights are involved and that AM West performs better when larger number of flights are involved. Correlating performance on the number of flights, the data appear to support this conclusion with \(r = -0.5373998\) for Alaska Airlines and \(r = 0.7069816\) for AM West.

Assignment - Tidying and Transforming Data

Jose Zuniga

Assignment Instructions

Arrival Delays