Data is tidy if:
Working with tidy data is advantageous, because tidy data is more amenable to transformation and analysis. Many widely-used statistical packages presume that input data will be more-or-less tidy. Combining and comparing distinct datasets is also simpler when they conform to the conditions listed above.
This assignment begins with untidy data, shown below.
untidy_flights <- read_csv('https://raw.githubusercontent.com/dmoscoe/SPS/main/DATA607/untidied.csv')
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
##
## -- Column specification --------------------------------------------------------
## cols(
## X1 = col_character(),
## X2 = col_character(),
## `Los Angeles` = col_double(),
## Phoenix = col_double(),
## `San Diego` = col_double(),
## `San Francisco` = col_double(),
## Seattle = col_double()
## )
head(untidy_flights)
## # A tibble: 5 x 7
## X1 X2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
In what follows, I transform the data into a tidy form. Then I analyze the performance of Alaska Airlines and AM WEST based on the data.
tidy_flights <- untidy_flights %>%
rename(airline = X1, status = X2) %>%
bind_cols(flight_count = NA)
tidy_flights <- tidy_flights %>%
gather(
`Los Angeles`:`Seattle`,
key = dest, value = flight_count,
na.rm = FALSE, convert = TRUE
) %>%
fill(airline, .direction = "down") %>%
drop_na()
head(tidy_flights)
## # A tibble: 6 x 4
## airline status dest flight_count
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA delayed Los Angeles 62
## 3 AM WEST on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
tidy_flights <- tidy_flights %>%
spread(key = status, value = flight_count)
tidy_flights <- tidy_flights %>%
mutate(fract_delayed = `delayed` / (`on time` + delayed))
head(tidy_flights)
## # A tibble: 6 x 5
## airline dest delayed `on time` fract_delayed
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 ALASKA Los Angeles 62 497 0.111
## 2 ALASKA Phoenix 12 221 0.0515
## 3 ALASKA San Diego 20 212 0.0862
## 4 ALASKA San Francisco 102 503 0.169
## 5 ALASKA Seattle 305 1841 0.142
## 6 AM WEST Los Angeles 117 694 0.144
The plot below shows the fraction of flights delayed at each destination for each airline. Columns are labeled with the total number of the airline’s flights to that destination contained in the dataset.
ggplot(data = tidy_flights, mapping = aes(x = fct_reorder(dest, fract_delayed), y = fract_delayed, fill = airline)) +
geom_col(position = "dodge") +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
labs(title = "Delay Rate By Destination",
subtitle = "Alaska Airlines and AM West",
x = "Destination",
y = "Fraction of Flights Delayed") +
geom_label(aes(label = `on time` + delayed), show.legend = FALSE, vjust = 1)
The plot shows that, at every destination, Alaska airlines has a lower fraction of flights delayed, and therefore better on-time performance. At its hub in Seattle, Alaska’s fraction of flights delayed is about 0.09 less than AM West’s. Even at AM West’s hub, Phoenix, Alaska outperforms.
What about overall performance?
overall_performance <- tidy_flights %>%
group_by(airline) %>%
summarize(tot_delayed = sum(delayed), tot_ontime = sum(`on time`), overall_delay_rate = round((sum(delayed)/(sum(`on time`) + sum(delayed))),2))
overall_performance
## # A tibble: 2 x 4
## airline tot_delayed tot_ontime overall_delay_rate
## * <chr> <dbl> <dbl> <dbl>
## 1 ALASKA 501 3274 0.13
## 2 AM WEST 787 6438 0.11
\(\texttt{overall_performance}\) shows that, even though Alaska demonstrates better on-time performance at each destination, its overall delay rate is higher than that of AM West. This may seem like a paradox: how could an airline with better performance at any given destination be worse overall? The apparent paradox arises from the intuition that the overall delay rate must be an average of individual delay rates. Actually, the overall delay rate is a weighted average rather than a straight average. Each delay rate is weighted by the number of flights it summarizes. For AM West, this means there is a very large weight on their low delay-rate at Phoenix. For Alaska, the largest weight lies on their relatively high delay-rate at Seattle. These weights mean that AM West’s overall delay rate is less than it would be as a straight average of its delay rates at each destination. Alaska’s delay rate is greater.
The surprising consequence is that the overall average delay rate is misleading when it comes to choosing flights with low likelihoods of delay. Even though Alaska’s overall delay-rate is higher, it remains the best choice for any destination.