PART 1: Tidy a csv table
(flts <- read.csv('FlightTimes.csv'))
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Make those blank fields NA’s
(flts <- read.csv('FlightTimes.csv', na.strings = ""))
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Remove the NA row and fill the airline names into all rows
(flts <- flts %>%
fill('X') %>%
drop_na())
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
Name first 2 columns and pivot destinations
# Give the first 2 cols names
flts <- rename(flts, Carrier = X, Status = X.1)
# Tidy
flts <- pivot_longer(flts, Los.Angeles:Seattle, names_to = 'Destination',
values_to = 'Count')
head(flts)
## # A tibble: 6 x 4
## Carrier Status Destination Count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
Remove dots from Destination names
flts$Destination <- sapply(flts$Destination, function(x) {gsub('\\.', ' ', x)})
head(flts)
## # A tibble: 6 x 4
## Carrier Status Destination Count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
PART 2: Analyze ==> Compare the arrival delays for the two airlines
# First instinct is to group the flights by delays and carriers:
ontimeAK <- flts %>% filter(Carrier == 'ALASKA' & Status == 'on time')
delayedAK <- flts %>% filter(Carrier == 'ALASKA' & Status == 'delayed')
ontimeAW <- flts %>% filter(Carrier != 'ALASKA' & Status == 'on time')
(delayedAW <- flts %>% filter(Carrier != 'ALASKA' & Status == 'delayed'))
## # A tibble: 5 x 4
## Carrier Status Destination Count
## <chr> <chr> <chr> <int>
## 1 AM WEST delayed Los Angeles 117
## 2 AM WEST delayed Phoenix 415
## 3 AM WEST delayed San Diego 65
## 4 AM WEST delayed San Francisco 129
## 5 AM WEST delayed Seattle 61
Unless I’m only interested in a particular destination, my first thought when analyzing delays by airline is to see which airline has the highest ratio of on-time flights to delayed flights.
cat("AM WEST's on-time/delayed ratio to these destinations was",
round(sum(ontimeAW$Count) / sum(delayedAW$Count), 2),
"\nwhile ALASKA's was slightly worse, at",
round(sum(ontimeAK$Count) / sum(delayedAK$Count), 2))
## AM WEST's on-time/delayed ratio to these destinations was 8.18
## while ALASKA's was slightly worse, at 6.53
While the overall ratio is ~ 25% better for AM WEST, it’s not a sizeable enough difference that I’d use it as a measure for some kind of presentation or important decision. Probably it will be much more interesting to inspect the ratios for the individual destinations. For example, just glancing at the AM WEST delays displayed above, you can see that Phoenix has 415 delays where Seattle and San Diego each have less than 1/6 that number. Whether that’s just a function of the overall numbers of AM WEST flights to the respective destinations, or whether there’s a marked difference in on-time ratios, either way an analysis of the individual ratios will show something important pertaining to the overall stat calculated above.
But after breaking down the data into 4 tidy subsets above, it becomes clear that the one ratio of interest here, on-time vs. delayed, is harder to calculate once we’ve split the data into separate rows for on-time and delayed counts.
Rather than 20 rows of Carrier | Status | Destination | Count, it seems like what we’re after is 10 rows of Carrier | Destination | Ratio | TotalCount, where the TotalCount is the sum of the 2 counts that will get condensed into 1 new row, for each Carrier/Destination combo, and the Ratio will be calculated from those same 2 counts. This is actually just as tidy, considering our purposes, i.e. what we consider “one observation”. For us, an observation isn’t the status of one flight, it’s the on-time/delayed ratio of one airline going to one destination. The reason we need to keep the TotalCount for each of those observations is so that we can put each ratio into meaningful perspective.
# pivot back out the Status column and calculate the needed vars
(new_flts <- flts %>%
pivot_wider(names_from = Status, values_from = Count) %>%
transmute(Carrier, Destination, Ratio = `on time` / delayed,
TotalFlts = `on time` + delayed))
## # A tibble: 10 x 4
## Carrier Destination Ratio TotalFlts
## <chr> <chr> <dbl> <int>
## 1 ALASKA Los Angeles 8.02 559
## 2 ALASKA Phoenix 18.4 233
## 3 ALASKA San Diego 10.6 232
## 4 ALASKA San Francisco 4.93 605
## 5 ALASKA Seattle 6.04 2146
## 6 AM WEST Los Angeles 5.93 811
## 7 AM WEST Phoenix 11.7 5255
## 8 AM WEST San Diego 5.89 448
## 9 AM WEST San Francisco 2.48 449
## 10 AM WEST Seattle 3.30 262
The fact that this new version of our data contains a ratio column makes it seem untidy, but armed with that ratio and with the total number of flights per carrier/destination combo, it actually encodes all the info from the original csv, in the most compact form possible.
ggplot(new_flts, aes(x=Destination, y=Ratio, fill=Carrier,
width = TotalFlts/3333)) +
geom_bar(stat = 'identity') +
theme(panel.grid.major = element_blank(),
panel.background = element_blank()) +
ylab('On-Time / Late Ratio') +
theme(plot.title = element_text(hjust = 0.6)) +
ggtitle('Bar Width Proportional to Number of Flights')

since ALASKA has a better ratio into every destination, yet we saw that AM WEST wins the overall ratio. The widths of the bars above show us very clearly how this happens, since they’re proportional to the total number of flights per carrier/destination combo. In short, Phoenix is the easiest airport to fly into, and AM WEST sends the vast majority of their flights there. ALASKA, meanwhile focuses on Seattle, where it’s apparently much harder to land on-time.
So which carrier would you choose? All other things being equal, you’d want to fly ALASKA, although clearly all other things aren’t equal, since AM WEST is going to offer you a lot more flight times into Phoenix, and with 12 times more flights arriving on-time than late there, you have a pretty good chance of succeeding.
=========================================================================================