Load Libraries

library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(data.table)

Read File

The file will be read using base R read.csv because it is small enough. As that function puts in a period where spaces are found in the column names, the names will be adjusted.

RawD <- read.csv("./DT607_HW5.csv", header = TRUE)
names(RawD) <- c("airline", "status",
                 gsub("\\.", " ", names(RawD)[3:length(RawD)]))
RawD
##   airline  status Los Angeles Phoenix San Diego San Fransisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Munge Data

I will attempt everything in both the data.table and tidyverse milieus, although I am much more comfortable in the former. If done consistently, the results should be the same, barring differences in data table and tibble implementations.

Pre-reshaping

There may be a more elegant way, but for this small data set, manual adjustment is the quickest and easiest. Sometimes, looking at the data and identifying small adjustments can make the downstream analysis easier. Adding the airline name for the delayed line before reshaping obviates the need to add it afterwards, which is more difficult.

RawD[2, 1] <- RawD[1, 1]
RawD[5, 1] <- RawD[4, 1]
RawD
##   airline  status Los Angeles Phoenix San Diego San Fransisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61

Reshape with data.table

The data.tabe package has a very fast implementation of reshape2 coded in C. Its verbs for reshaping are melt and cast. See ?melt for more information. The reshaping here will be to keep the airline and status values for every row, and melt the individual destination values into one long column—named destination—with its count of flights in a vector named count.

RawDT <- as.data.table(RawD)
DT <- melt(RawDT, id.vars = c("airline", "status"),
           variable.name = "destination", value.name = "count", na.rm = TRUE)
DT
##     airline  status   destination count
##  1:  ALASKA on time   Los Angeles   497
##  2:  ALASKA delayed   Los Angeles    62
##  3: AM WEST on time   Los Angeles   694
##  4: AM WEST delayed   Los Angeles   117
##  5:  ALASKA on time       Phoenix   221
##  6:  ALASKA delayed       Phoenix    12
##  7: AM WEST on time       Phoenix  4840
##  8: AM WEST delayed       Phoenix   415
##  9:  ALASKA on time     San Diego   212
## 10:  ALASKA delayed     San Diego    20
## 11: AM WEST on time     San Diego   383
## 12: AM WEST delayed     San Diego    65
## 13:  ALASKA on time San Fransisco   503
## 14:  ALASKA delayed San Fransisco   102
## 15: AM WEST on time San Fransisco   320
## 16: AM WEST delayed San Fransisco   129
## 17:  ALASKA on time       Seattle  1841
## 18:  ALASKA delayed       Seattle   305
## 19: AM WEST on time       Seattle   201
## 20: AM WEST delayed       Seattle    61

Reshape with tidyr

The tidyr package was re-released this month with a new set of verbs. Instead of gather and spread, the new verbs are pivot_longer and pivot_wider respectively. See ?pivot_longer for more information. As per above, the reshaping will be to keep the airline and status values for every row, and pivot the individual destination values into one longer column—named destination—with its count of flights in a vector named count.

TB <- RawD %>% pivot_longer(cols = c(-airline, -status),
                            names_to = "destination", values_to = "count",
                            values_drop_na = TRUE)
TB
## # A tibble: 20 x 4
##    airline status  destination   count
##    <fct>   <fct>   <chr>         <int>
##  1 ALASKA  on time Los Angeles     497
##  2 ALASKA  on time Phoenix         221
##  3 ALASKA  on time San Diego       212
##  4 ALASKA  on time San Fransisco   503
##  5 ALASKA  on time Seattle        1841
##  6 ALASKA  delayed Los Angeles      62
##  7 ALASKA  delayed Phoenix          12
##  8 ALASKA  delayed San Diego        20
##  9 ALASKA  delayed San Fransisco   102
## 10 ALASKA  delayed Seattle         305
## 11 AM WEST on time Los Angeles     694
## 12 AM WEST on time Phoenix        4840
## 13 AM WEST on time San Diego       383
## 14 AM WEST on time San Fransisco   320
## 15 AM WEST on time Seattle         201
## 16 AM WEST delayed Los Angeles     117
## 17 AM WEST delayed Phoenix         415
## 18 AM WEST delayed San Diego        65
## 19 AM WEST delayed San Fransisco   129
## 20 AM WEST delayed Seattle          61

Analyze Data

With only counts, the analysis will be limited to the empirical probabilities of being on-time or late. Only on-time will be shown, as being late is simply the complementary value.

Overall

This section will calculate the number of on-time flights as a percentage of the number of total flights per carrier.

In the data.table version, use is made of curly braces ({}) which allows calculations (or even functions) to be defined, created, and then used in downstream aggregations/joins/etc. Here the the total flights for each carrier is calculated as totflights and then used downstream. The outer keyby not only controls the inner call to .SD, it also controls the totflights variable.

As dplyr has the verb mutate, which allows newly-created columns to be used immediately, there is no need for intermediate calculations, and the total and totflights variables can be piped downstream right away.

DT[, {
  totflights = sum(count);
  .SD[status == "on time", .(OnTimePerc = sum(count) / totflights, totflights)]
  }, keyby = airline]
##    airline OnTimePerc totflights
## 1:  ALASKA  0.8672848       3775
## 2: AM WEST  0.8910727       7225
TB %>% group_by(airline, status) %>% summarize(total = sum(count)) %>%
  mutate(totflights = sum(total), OnTimePerc = total / totflights) %>%
  filter(status == "on time") %>% select(airline, OnTimePerc, totflights)
## # A tibble: 2 x 3
## # Groups:   airline [2]
##   airline OnTimePerc totflights
##   <fct>        <dbl>      <int>
## 1 ALASKA       0.867       3775
## 2 AM WEST      0.891       7225

By destination

This is fundamentally the same question as above, except that now a additional level of grouping is added—that of the destination.

In the data.table code, this is manually set by the outer keyby grouping and ordering variables to be destination and then airline. The total such flights are counted first and saved as totflights. Then, the data table is restricted to the on-time flights, and the count of those is divided into the total flights per airline per destination for the percentage.

The dplyr code is very similar, other than the splitting out of various verbs including the reordering.

DT1 <- DT[, {
  totflights = sum(count);
      .SD[status == "on time", .(OnTimePerc = count / totflights, totflights)]
    }, keyby = c("destination", "airline")]
DT1
##       destination airline OnTimePerc totflights
##  1:   Los Angeles  ALASKA  0.8890877        559
##  2:   Los Angeles AM WEST  0.8557337        811
##  3:       Phoenix  ALASKA  0.9484979        233
##  4:       Phoenix AM WEST  0.9210276       5255
##  5:     San Diego  ALASKA  0.9137931        232
##  6:     San Diego AM WEST  0.8549107        448
##  7: San Fransisco  ALASKA  0.8314050        605
##  8: San Fransisco AM WEST  0.7126949        449
##  9:       Seattle  ALASKA  0.8578751       2146
## 10:       Seattle AM WEST  0.7671756        262
TB1 <- TB %>% group_by(destination, airline) %>%
  mutate(totflights = sum(count), OnTimePerc = count / totflights) %>%
  filter(status == "on time") %>%
  select(destination, airline, OnTimePerc, totflights) %>% arrange(destination)
TB1
## # A tibble: 10 x 4
## # Groups:   destination, airline [10]
##    destination   airline OnTimePerc totflights
##    <chr>         <fct>        <dbl>      <int>
##  1 Los Angeles   ALASKA       0.889        559
##  2 Los Angeles   AM WEST      0.856        811
##  3 Phoenix       ALASKA       0.948        233
##  4 Phoenix       AM WEST      0.921       5255
##  5 San Diego     ALASKA       0.914        232
##  6 San Diego     AM WEST      0.855        448
##  7 San Fransisco ALASKA       0.831        605
##  8 San Fransisco AM WEST      0.713        449
##  9 Seattle       ALASKA       0.858       2146
## 10 Seattle       AM WEST      0.767        262

Graphical display

This section demonstrates that the resulting data table or tibble can be supplied to ggplot2 as an enhanced data frame for plotting purposes. The plots will be identical if the analysis was performed consistently between ecosystems.

ggplot(DT1, aes(x = destination, y = OnTimePerc, fill = airline)) +
  geom_col(position = "dodge") + scale_y_continuous(labels = percent)

ggplot(TB1, aes(x = destination, y = OnTimePerc, fill = airline)) +
  geom_col(position = "dodge") + scale_y_continuous(labels = percent)

Observations—Simpson’s Paradox

Even though Alaska Air has the higher on-time percentage for each city, AM West has the better overall on-time percentage. This counter-intuitive result is an example of Simpson’s Paradox. The outsized presence AM West has in Phoenix is the root cause. While the few Alaska Air flights out of Phoenix are more on-time than the AM West flights, those AM West flights are still more on-time than the average Alaska Air flight excluding Phoenix. Since there are so many of them, they bring AM West’s overall average up above than that of Alaska Air’s. The batting average example in the Wikipedia article linked above demonstrates this very effect.

Session Info

sessionInfo()
## R version 3.6.1 Patched (2019-08-04 r76915)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.12.2 scales_1.0.0      ggplot2_3.2.1     tidyr_1.0.0      
## [5] dplyr_0.8.3      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2       pillar_1.4.2     compiler_3.6.1   tools_3.6.1     
##  [5] zeallot_0.1.0    digest_0.6.21    evaluate_0.14    lifecycle_0.1.0 
##  [9] tibble_2.1.3     gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.0     
## [13] cli_1.1.0        yaml_2.2.0       xfun_0.9         withr_2.1.2     
## [17] stringr_1.4.0    knitr_1.25       vctrs_0.2.0      grid_3.6.1      
## [21] tidyselect_0.2.5 glue_1.3.1       R6_2.4.0         fansi_0.4.0     
## [25] rmarkdown_1.15   purrr_0.3.2      magrittr_1.5     backports_1.1.4 
## [29] htmltools_0.3.6  assertthat_0.2.1 colorspace_1.4-1 labeling_0.3    
## [33] utf8_1.1.4       stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0   
## [37] crayon_1.3.4