library(tidyverse)
library(reshape2)
library(scales)
First, I’ll read in the data from github. The flights data comes in .csv format, formatted exactly as provided in the assignment.
data <- read_csv('https://raw.githubusercontent.com/kac624/cuny/main/D607/data/week5_flights.csv')
## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (5): Los Angeles, Phoenix, San Diego, San Francisco,
## Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
data
## # A tibble: 5 × 7
## ...1 ...2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Next, I’ll clean the data. The series of piped commands performs the
following:
1. The “spacer” row from the original data (containing all NAs) is
removed.
2. NAs in the Airlines column are filled in with the value above
them.
3. The counts of flights going to each destination are melted to reduce
the dataframe’s width.
4. The count of on-time and delayed flights are cast as separate
variables.
Finally, the columns are renamed to provide transparency and consistency.
data <- data %>%
filter(rowSums(is.na(data)) != ncol(data)) %>%
fill(...1, .direction = c('down')) %>%
melt(id.vars = 1:2) %>%
pivot_wider(id_cols = c(...1, variable),
names_from = ...2)
colnames(data) <- c('Airline',
'Destination',
'OnTime',
'Delayed')
data
## # A tibble: 10 × 4
## Airline Destination OnTime Delayed
## <chr> <fct> <dbl> <dbl>
## 1 ALASKA Los Angeles 497 62
## 2 AM WEST Los Angeles 694 117
## 3 ALASKA Phoenix 221 12
## 4 AM WEST Phoenix 4840 415
## 5 ALASKA San Diego 212 20
## 6 AM WEST San Diego 383 65
## 7 ALASKA San Francisco 503 102
## 8 AM WEST San Francisco 320 129
## 9 ALASKA Seattle 1841 305
## 10 AM WEST Seattle 201 61
We first visualize the number of on-time flights from each airline, broken out by destination. However, the mismatched proportions inhibit our ability to infer much. Looking at Phoenix for example, it is difficult to tell which airline had more on time flights in a relative sense, given AM WEST’s domination at that destination.
data %>%
melt(id.vars = 1:2) %>%
mutate(OnTimeStatus = variable) %>%
ggplot(aes(x = Destination, y = value, fill = OnTimeStatus)) +
geom_col(color = 'black', position = 'dodge') +
scale_fill_manual(values=c('chartreuse3','firebrick3')) +
facet_grid(.~Airline) +
coord_flip()
So, we add a new column to calculate the percentage of on-time flights for each row in the data. This gives us an “apples-to-apples” way to compare each airline’s rate of delays.
We then create a second plot, showing each airline’s on-time rate side-by-side for each destination. From this plot, a clear trend emerges: for all destinations, ALASKA airlines has a higher on-time rate than AM WEST.
data <- data %>%
mutate(OnTimeRate = OnTime / (OnTime + Delayed))
ggplot(data, aes(x = Destination, y = OnTimeRate, fill = Airline)) +
geom_col(color = 'black', position='dodge') +
scale_fill_manual(values=c('deepskyblue','goldenrod1')) +
coord_flip()
If we summarize our data to compare the two airlines across all destinations, the trend reverses. Interestingly, AM WEST’s aggregate on-time rate is higher than ALASKA’s.
data %>%
group_by(Airline) %>%
summarize(TotalOnTime = sum(OnTime),
TotalDelayed = sum(Delayed),
TotalOnTimeRate = percent(TotalOnTime / (TotalOnTime + TotalDelayed), 0.01),
.groups = 'keep')
## # A tibble: 2 × 4
## # Groups: Airline [2]
## Airline TotalOnTime TotalDelayed TotalOnTimeRate
## <chr> <dbl> <dbl> <chr>
## 1 ALASKA 3274 501 86.73%
## 2 AM WEST 6438 787 89.11%
Finally, we take a quick look at on-time rates for each destination. There do appear to be some significant differences. Phoenix has the highest overall on-time rate, which helps explain why AM WEST has a higher over all on-time rate (as AM WEST has a very large volume of flights into Phoenix). On the other hand, Seattle, where ALASKA dominates, has the second lowest on-time rate.
data %>%
group_by(Destination) %>%
summarize(TotalOnTime = sum(OnTime),
TotalDelayed = sum(Delayed),
TotalOnTimeRate = TotalOnTime / (TotalOnTime + TotalDelayed),
.groups = 'keep') %>%
ggplot(aes(x = Destination, y = TotalOnTimeRate)) +
geom_col(color = 'black', fill = 'cornflowerblue', position='dodge') +
coord_flip()
From the above, we can infer that, overall, AM WEST has a great on-time arrival rate. However, a lot of that advantage comes from what appears to be greater efficiency at destinations to which AM WEST flies more (primarily Phoenix). When “controlling” for these destination-specific factors, ALASKA appears to have the better on-time rate.