library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(data.table)
The file will be read using base R read.csv because it is small enough. As that function puts in a period where spaces are found in the column names, the names will be adjusted.
RawD <- read.csv("./DT607_HW5.csv", header = TRUE)
names(RawD) <- c("airline", "status",
gsub("\\.", " ", names(RawD)[3:length(RawD)]))
RawD
## airline status Los Angeles Phoenix San Diego San Fransisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
I will attempt everything in both the data.table and tidyverse milieus, although I am much more comfortable in the former. If done consistently, the results should be the same, barring differences in data table and tibble implementations.
There may be a more elegant way, but for this small data set, manual adjustment is the quickest and easiest. Sometimes, looking at the data and identifying small adjustments can make the downstream analysis easier. Adding the airline name for the delayed line before reshaping obviates the need to add it afterwards, which is more difficult.
RawD[2, 1] <- RawD[1, 1]
RawD[5, 1] <- RawD[4, 1]
RawD
## airline status Los Angeles Phoenix San Diego San Fransisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
data.tableThe data.tabe package has a very fast implementation of reshape2 coded in C. Its verbs for reshaping are melt and cast. See ?melt for more information. The reshaping here will be to keep the airline and status values for every row, and melt the individual destination values into one long column—named destination—with its count of flights in a vector named count.
RawDT <- as.data.table(RawD)
DT <- melt(RawDT, id.vars = c("airline", "status"),
variable.name = "destination", value.name = "count", na.rm = TRUE)
DT
## airline status destination count
## 1: ALASKA on time Los Angeles 497
## 2: ALASKA delayed Los Angeles 62
## 3: AM WEST on time Los Angeles 694
## 4: AM WEST delayed Los Angeles 117
## 5: ALASKA on time Phoenix 221
## 6: ALASKA delayed Phoenix 12
## 7: AM WEST on time Phoenix 4840
## 8: AM WEST delayed Phoenix 415
## 9: ALASKA on time San Diego 212
## 10: ALASKA delayed San Diego 20
## 11: AM WEST on time San Diego 383
## 12: AM WEST delayed San Diego 65
## 13: ALASKA on time San Fransisco 503
## 14: ALASKA delayed San Fransisco 102
## 15: AM WEST on time San Fransisco 320
## 16: AM WEST delayed San Fransisco 129
## 17: ALASKA on time Seattle 1841
## 18: ALASKA delayed Seattle 305
## 19: AM WEST on time Seattle 201
## 20: AM WEST delayed Seattle 61
tidyrThe tidyr package was re-released this month with a new set of verbs. Instead of gather and spread, the new verbs are pivot_longer and pivot_wider respectively. See ?pivot_longer for more information. As per above, the reshaping will be to keep the airline and status values for every row, and pivot the individual destination values into one longer column—named destination—with its count of flights in a vector named count.
TB <- RawD %>% pivot_longer(cols = c(-airline, -status),
names_to = "destination", values_to = "count",
values_drop_na = TRUE)
TB
## # A tibble: 20 x 4
## airline status destination count
## <fct> <fct> <chr> <int>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Fransisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San Diego 20
## 9 ALASKA delayed San Fransisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San Diego 383
## 14 AM WEST on time San Fransisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San Diego 65
## 19 AM WEST delayed San Fransisco 129
## 20 AM WEST delayed Seattle 61
With only counts, the analysis will be limited to the empirical probabilities of being on-time or late. Only on-time will be shown, as being late is simply the complementary value.
This section will calculate the number of on-time flights as a percentage of the number of total flights per carrier.
In the data.table version, use is made of curly braces ({}) which allows calculations (or even functions) to be defined, created, and then used in downstream aggregations/joins/etc. Here the the total flights for each carrier is calculated as totflights and then used downstream. The outer keyby not only controls the inner call to .SD, it also controls the totflights variable.
As dplyr has the verb mutate, which allows newly-created columns to be used immediately, there is no need for intermediate calculations, and the total and totflights variables can be piped downstream right away.
DT[, {
totflights = sum(count);
.SD[status == "on time", .(OnTimePerc = sum(count) / totflights, totflights)]
}, keyby = airline]
## airline OnTimePerc totflights
## 1: ALASKA 0.8672848 3775
## 2: AM WEST 0.8910727 7225
TB %>% group_by(airline, status) %>% summarize(total = sum(count)) %>%
mutate(totflights = sum(total), OnTimePerc = total / totflights) %>%
filter(status == "on time") %>% select(airline, OnTimePerc, totflights)
## # A tibble: 2 x 3
## # Groups: airline [2]
## airline OnTimePerc totflights
## <fct> <dbl> <int>
## 1 ALASKA 0.867 3775
## 2 AM WEST 0.891 7225
This is fundamentally the same question as above, except that now a additional level of grouping is added—that of the destination.
In the data.table code, this is manually set by the outer keyby grouping and ordering variables to be destination and then airline. The total such flights are counted first and saved as totflights. Then, the data table is restricted to the on-time flights, and the count of those is divided into the total flights per airline per destination for the percentage.
The dplyr code is very similar, other than the splitting out of various verbs including the reordering.
DT1 <- DT[, {
totflights = sum(count);
.SD[status == "on time", .(OnTimePerc = count / totflights, totflights)]
}, keyby = c("destination", "airline")]
DT1
## destination airline OnTimePerc totflights
## 1: Los Angeles ALASKA 0.8890877 559
## 2: Los Angeles AM WEST 0.8557337 811
## 3: Phoenix ALASKA 0.9484979 233
## 4: Phoenix AM WEST 0.9210276 5255
## 5: San Diego ALASKA 0.9137931 232
## 6: San Diego AM WEST 0.8549107 448
## 7: San Fransisco ALASKA 0.8314050 605
## 8: San Fransisco AM WEST 0.7126949 449
## 9: Seattle ALASKA 0.8578751 2146
## 10: Seattle AM WEST 0.7671756 262
TB1 <- TB %>% group_by(destination, airline) %>%
mutate(totflights = sum(count), OnTimePerc = count / totflights) %>%
filter(status == "on time") %>%
select(destination, airline, OnTimePerc, totflights) %>% arrange(destination)
TB1
## # A tibble: 10 x 4
## # Groups: destination, airline [10]
## destination airline OnTimePerc totflights
## <chr> <fct> <dbl> <int>
## 1 Los Angeles ALASKA 0.889 559
## 2 Los Angeles AM WEST 0.856 811
## 3 Phoenix ALASKA 0.948 233
## 4 Phoenix AM WEST 0.921 5255
## 5 San Diego ALASKA 0.914 232
## 6 San Diego AM WEST 0.855 448
## 7 San Fransisco ALASKA 0.831 605
## 8 San Fransisco AM WEST 0.713 449
## 9 Seattle ALASKA 0.858 2146
## 10 Seattle AM WEST 0.767 262
This section demonstrates that the resulting data table or tibble can be supplied to ggplot2 as an enhanced data frame for plotting purposes. The plots will be identical if the analysis was performed consistently between ecosystems.
ggplot(DT1, aes(x = destination, y = OnTimePerc, fill = airline)) +
geom_col(position = "dodge") + scale_y_continuous(labels = percent)
ggplot(TB1, aes(x = destination, y = OnTimePerc, fill = airline)) +
geom_col(position = "dodge") + scale_y_continuous(labels = percent)
Even though Alaska Air has the higher on-time percentage for each city, AM West has the better overall on-time percentage. This counter-intuitive result is an example of Simpson’s Paradox. The outsized presence AM West has in Phoenix is the root cause. While the few Alaska Air flights out of Phoenix are more on-time than the AM West flights, those AM West flights are still more on-time than the average Alaska Air flight excluding Phoenix. Since there are so many of them, they bring AM West’s overall average up above than that of Alaska Air’s. The batting average example in the Wikipedia article linked above demonstrates this very effect.
sessionInfo()
## R version 3.6.1 Patched (2019-08-04 r76915)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.12.2 scales_1.0.0 ggplot2_3.2.1 tidyr_1.0.0
## [5] dplyr_0.8.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 pillar_1.4.2 compiler_3.6.1 tools_3.6.1
## [5] zeallot_0.1.0 digest_0.6.21 evaluate_0.14 lifecycle_0.1.0
## [9] tibble_2.1.3 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.0
## [13] cli_1.1.0 yaml_2.2.0 xfun_0.9 withr_2.1.2
## [17] stringr_1.4.0 knitr_1.25 vctrs_0.2.0 grid_3.6.1
## [21] tidyselect_0.2.5 glue_1.3.1 R6_2.4.0 fansi_0.4.0
## [25] rmarkdown_1.15 purrr_0.3.2 magrittr_1.5 backports_1.1.4
## [29] htmltools_0.3.6 assertthat_0.2.1 colorspace_1.4-1 labeling_0.3
## [33] utf8_1.1.4 stringi_1.4.3 lazyeval_0.2.2 munsell_0.5.0
## [37] crayon_1.3.4