Week 5 - Tidying and Transforming Data

library(tidyverse)
library(reshape2)
library(scales)

Read in CSV

First, I’ll read in the data from github. The flights data comes in .csv format, formatted exactly as provided in the assignment.

data <- read_csv('https://raw.githubusercontent.com/kac624/cuny/main/D607/data/week5_flights.csv')

## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (5): Los Angeles, Phoenix, San Diego, San Francisco,
## Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`

data

## # A tibble: 5 × 7
##   ...1    ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST on time           694    4840         383             320     201
## 5 <NA>    delayed           117     415          65             129      61

Tidy data

Next, I’ll clean the data. The series of piped commands performs the following:
1. The “spacer” row from the original data (containing all NAs) is removed.
2. NAs in the Airlines column are filled in with the value above them.
3. The counts of flights going to each destination are melted to reduce the dataframe’s width.
4. The count of on-time and delayed flights are cast as separate variables.

Finally, the columns are renamed to provide transparency and consistency.

data <- data %>%
  filter(rowSums(is.na(data)) != ncol(data)) %>%
  fill(...1, .direction = c('down')) %>%
  melt(id.vars = 1:2) %>%
  pivot_wider(id_cols = c(...1, variable),
              names_from = ...2)

colnames(data) <- c('Airline',
                    'Destination',
                    'OnTime',
                    'Delayed')

data

## # A tibble: 10 × 4
##    Airline Destination   OnTime Delayed
##    <chr>   <fct>          <dbl>   <dbl>
##  1 ALASKA  Los Angeles      497      62
##  2 AM WEST Los Angeles      694     117
##  3 ALASKA  Phoenix          221      12
##  4 AM WEST Phoenix         4840     415
##  5 ALASKA  San Diego        212      20
##  6 AM WEST San Diego        383      65
##  7 ALASKA  San Francisco    503     102
##  8 AM WEST San Francisco    320     129
##  9 ALASKA  Seattle         1841     305
## 10 AM WEST Seattle          201      61

Analysis

We first visualize the number of on-time flights from each airline, broken out by destination. However, the mismatched proportions inhibit our ability to infer much. Looking at Phoenix for example, it is difficult to tell which airline had more on time flights in a relative sense, given AM WEST’s domination at that destination.

data %>%
  melt(id.vars = 1:2) %>%
  mutate(OnTimeStatus = variable) %>%
  ggplot(aes(x = Destination, y = value, fill = OnTimeStatus)) +
  geom_col(color = 'black', position = 'dodge') +
  scale_fill_manual(values=c('chartreuse3','firebrick3')) +
  facet_grid(.~Airline) +
  coord_flip()

So, we add a new column to calculate the percentage of on-time flights for each row in the data. This gives us an “apples-to-apples” way to compare each airline’s rate of delays.

We then create a second plot, showing each airline’s on-time rate side-by-side for each destination. From this plot, a clear trend emerges: for all destinations, ALASKA airlines has a higher on-time rate than AM WEST.

data <- data %>%
  mutate(OnTimeRate = OnTime / (OnTime + Delayed))

ggplot(data, aes(x = Destination, y = OnTimeRate, fill = Airline)) +
  geom_col(color = 'black', position='dodge') +
  scale_fill_manual(values=c('deepskyblue','goldenrod1')) +
  coord_flip()

If we summarize our data to compare the two airlines across all destinations, the trend reverses. Interestingly, AM WEST’s aggregate on-time rate is higher than ALASKA’s.

data %>%
  group_by(Airline) %>% 
  summarize(TotalOnTime = sum(OnTime),
            TotalDelayed = sum(Delayed),
            TotalOnTimeRate = percent(TotalOnTime / (TotalOnTime + TotalDelayed), 0.01), 
            .groups = 'keep')

## # A tibble: 2 × 4
## # Groups:   Airline [2]
##   Airline TotalOnTime TotalDelayed TotalOnTimeRate
##   <chr>         <dbl>        <dbl> <chr>          
## 1 ALASKA         3274          501 86.73%         
## 2 AM WEST        6438          787 89.11%

Finally, we take a quick look at on-time rates for each destination. There do appear to be some significant differences. Phoenix has the highest overall on-time rate, which helps explain why AM WEST has a higher over all on-time rate (as AM WEST has a very large volume of flights into Phoenix). On the other hand, Seattle, where ALASKA dominates, has the second lowest on-time rate.

data %>%
  group_by(Destination) %>%
  summarize(TotalOnTime = sum(OnTime),
            TotalDelayed = sum(Delayed),
            TotalOnTimeRate = TotalOnTime / (TotalOnTime + TotalDelayed), 
            .groups = 'keep') %>%
  ggplot(aes(x = Destination, y = TotalOnTimeRate)) +
  geom_col(color = 'black', fill = 'cornflowerblue', position='dodge') +
  coord_flip()

Conclusion

From the above, we can infer that, overall, AM WEST has a great on-time arrival rate. However, a lot of that advantage comes from what appears to be greater efficiency at destinations to which AM WEST flies more (primarily Phoenix). When “controlling” for these destination-specific factors, ALASKA appears to have the better on-time rate.

Week 5 - Tidying and Transforming Data

Keith Colella

2023-02-26

Read in CSV

Tidy data

Analysis

Conclusion