Ch5 Assignment – Tidying and Transforming Data

Performing data transformations on a small dataset of arrival delays for two airlines across five destinations. Using the following packages:

require(tidyr)
require(dplyr)
require(stringr)
require(ggplot2)

Read the information from a .CSV file (constructed in MySQL and uploaded to Github) into R.

##            city Alaska_ontime Alaska_delay AMWest_ontime AMWest_delay
## 1   Los Angeles           497           62           694          117
## 2       Phoenix           221           12          4840          415
## 3     San Diego           212           20           383           65
## 4 San Francisco           503          102           320          129
## 5       Seattle          1841          305           201           61

Here, I use dyplr’s gather and separate functions to make the previously wide data long. The original columns were named according to airline and departure status. I first lengthen the data by breaking them into a temporary variable (Condition), then divide them into two separate variables, Airline and Status.

dly2 <- dly %>% 
  gather("Condition", "Freq", 2:5) %>% 
  separate("Condition", c("Airline", "Status"), sep = "_")
dly2
##             city Airline Status Freq
## 1    Los Angeles  Alaska ontime  497
## 2        Phoenix  Alaska ontime  221
## 3      San Diego  Alaska ontime  212
## 4  San Francisco  Alaska ontime  503
## 5        Seattle  Alaska ontime 1841
## 6    Los Angeles  Alaska  delay   62
## 7        Phoenix  Alaska  delay   12
## 8      San Diego  Alaska  delay   20
## 9  San Francisco  Alaska  delay  102
## 10       Seattle  Alaska  delay  305
## 11   Los Angeles  AMWest ontime  694
## 12       Phoenix  AMWest ontime 4840
## 13     San Diego  AMWest ontime  383
## 14 San Francisco  AMWest ontime  320
## 15       Seattle  AMWest ontime  201
## 16   Los Angeles  AMWest  delay  117
## 17       Phoenix  AMWest  delay  415
## 18     San Diego  AMWest  delay   65
## 19 San Francisco  AMWest  delay  129
## 20       Seattle  AMWest  delay   61

Next, I re-arrange the data so that the observational unit is flights out of a certain city by a certain airline. Frequency of delays and ontime departures are listed in the last two columns.

I further mutate the dataframe, adding a column with total flights and likelihood of being delayed.

dly3 <- dly2 %>% 
  spread(Status, Freq) %>% 
  mutate(total = delay + ontime, pct.dly = round((delay / total),4))
dly3
##             city Airline delay ontime total pct.dly
## 1    Los Angeles  Alaska    62    497   559  0.1109
## 2    Los Angeles  AMWest   117    694   811  0.1443
## 3        Phoenix  Alaska    12    221   233  0.0515
## 4        Phoenix  AMWest   415   4840  5255  0.0790
## 5      San Diego  Alaska    20    212   232  0.0862
## 6      San Diego  AMWest    65    383   448  0.1451
## 7  San Francisco  Alaska   102    503   605  0.1686
## 8  San Francisco  AMWest   129    320   449  0.2873
## 9        Seattle  Alaska   305   1841  2146  0.1421
## 10       Seattle  AMWest    61    201   262  0.2328

As part of my analysis, I want to sum the total amount of ontime and delayed departures for the two airlines, across all flights in the five cities.

AL.ontime <- sum(dly3$ontime[dly3$Airline == "Alaska"])
AM.ontime <- sum(dly3$ontime[dly3$Airline == "AMWest"])
AL.dly <- sum(dly3$delay[dly3$Airline == "Alaska"])
AM.dly <- sum(dly3$delay[dly3$Airline == "AMWest"])
  
dly.sum <- data.frame( ontime = c(AL.ontime,AM.ontime),
                       delayed = c(AL.dly, AM.dly))
rownames(dly.sum) <- c("Alaska", "AMWest")
dly.sum
##        ontime delayed
## Alaska   3274     501
## AMWest   6438     787

I then use this simplified dataframe to construct a mosaicplot, which reveals that the smaller carrier, Alaska Airlines has a slightly higher percentage of delayed flights.

mosaicplot(dly.sum, color = c("Blue","Red"), main = "Flight Departure Status for Two Airlines out of the SW")

Now, I calculate the mean percentage of flights delayed, grouped by city, an aggregate of the data from the two airlines.

dly4 <- dly3 %>% 
  group_by(city) %>% 
  summarise(sum(delay), sum(total), pct.dly = round(sum(delay)/sum(total),4))
dly4
## # A tibble: 5 x 4
##   city          `sum(delay)` `sum(total)` pct.dly
##   <fct>                <int>        <int>   <dbl>
## 1 Los Angeles            179         1370  0.131 
## 2 Phoenix                427         5488  0.0778
## 3 San Diego               85          680  0.125 
## 4 San Francisco          231         1054  0.219 
## 5 Seattle                366         2408  0.152

My aim is to determine, and then visualize, which of the airlines perform better than average in terms of delayed departures. To do so, I repeat the aggregate percent-delayed-by-city across each two rows for the 5 cities, then subtract it from the percent-delayed-by-city-by-airline to get a residual. This residual indicates the relative performance of each airline at the five hubs, with negatives indicating a positive feature (relatively less incidence of delays).

dly3$mean.dly <- rep(dly4$pct.dly, each=2)
dly3$mean.diff <- dly3$pct.dly - dly3$mean.dly
dly3
##             city Airline delay ontime total pct.dly mean.dly mean.diff
## 1    Los Angeles  Alaska    62    497   559  0.1109   0.1307   -0.0198
## 2    Los Angeles  AMWest   117    694   811  0.1443   0.1307    0.0136
## 3        Phoenix  Alaska    12    221   233  0.0515   0.0778   -0.0263
## 4        Phoenix  AMWest   415   4840  5255  0.0790   0.0778    0.0012
## 5      San Diego  Alaska    20    212   232  0.0862   0.1250   -0.0388
## 6      San Diego  AMWest    65    383   448  0.1451   0.1250    0.0201
## 7  San Francisco  Alaska   102    503   605  0.1686   0.2192   -0.0506
## 8  San Francisco  AMWest   129    320   449  0.2873   0.2192    0.0681
## 9        Seattle  Alaska   305   1841  2146  0.1421   0.1520   -0.0099
## 10       Seattle  AMWest    61    201   262  0.2328   0.1520    0.0808

Finally, I visualize the differential using ggplot2.

fig1 <- ggplot(dly3, aes(x = factor(city), fill = Airline, y = mean.diff)) +
  geom_dotplot(binaxis = "y", stackgroups = TRUE, binpositions="all") + 
  geom_hline(yintercept=0, linetype="dashed", color = "red", size=1)

fig2 <- fig1 + labs(x = "Cities", y = "+/- over Mean Delay", 
                    title = "Comparing Relative Lateness of Two Airlines", subtitle= "Residual over Mean Percentage of Delayed Flights by City")

fig2
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

This visualization reveals the Simpson’s Paradox. Whereas the mosiacplot revealed that Alaska Airlines did overally worse in the aggregate, it did better than AMWest at each of the five airports tested! This is due to the sheer volume of AMWest flights departing Pheonix which had a relatively low rate of delays, which although worse than Alaska at that location, brought down AMWest’s overall delay incidence greatly.