Flight Delays

PART 1: Tidy a csv table

(flts <- read.csv('FlightTimes.csv'))

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Make those blank fields NA’s

(flts <- read.csv('FlightTimes.csv', na.strings = ""))

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2    <NA> delayed          62      12        20           102     305
## 3    <NA>    <NA>          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5    <NA> delayed         117     415        65           129      61

Remove the NA row and fill the airline names into all rows

(flts <- flts %>% 
   fill('X') %>% 
   drop_na())

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

Name first 2 columns and pivot destinations

# Give the first 2 cols names
flts <- rename(flts, Carrier = X, Status = X.1)
# Tidy
flts <- pivot_longer(flts, Los.Angeles:Seattle, names_to = 'Destination',
                      values_to = 'Count')
head(flts)

## # A tibble: 6 x 4
##   Carrier Status  Destination   Count
##   <chr>   <chr>   <chr>         <int>
## 1 ALASKA  on time Los.Angeles     497
## 2 ALASKA  on time Phoenix         221
## 3 ALASKA  on time San.Diego       212
## 4 ALASKA  on time San.Francisco   503
## 5 ALASKA  on time Seattle        1841
## 6 ALASKA  delayed Los.Angeles      62

Remove dots from Destination names

flts$Destination <- sapply(flts$Destination, function(x) {gsub('\\.', ' ', x)})
head(flts)

## # A tibble: 6 x 4
##   Carrier Status  Destination   Count
##   <chr>   <chr>   <chr>         <int>
## 1 ALASKA  on time Los Angeles     497
## 2 ALASKA  on time Phoenix         221
## 3 ALASKA  on time San Diego       212
## 4 ALASKA  on time San Francisco   503
## 5 ALASKA  on time Seattle        1841
## 6 ALASKA  delayed Los Angeles      62

PART 2: Analyze ==> Compare the arrival delays for the two airlines

# First instinct is to group the flights by delays and carriers: 

ontimeAK <- flts %>% filter(Carrier == 'ALASKA' & Status == 'on time')
delayedAK <- flts %>% filter(Carrier == 'ALASKA' & Status == 'delayed')
ontimeAW <- flts %>% filter(Carrier != 'ALASKA' & Status == 'on time')
(delayedAW <- flts %>% filter(Carrier != 'ALASKA' & Status == 'delayed'))

## # A tibble: 5 x 4
##   Carrier Status  Destination   Count
##   <chr>   <chr>   <chr>         <int>
## 1 AM WEST delayed Los Angeles     117
## 2 AM WEST delayed Phoenix         415
## 3 AM WEST delayed San Diego        65
## 4 AM WEST delayed San Francisco   129
## 5 AM WEST delayed Seattle          61

Unless I’m only interested in a particular destination, my first thought when analyzing delays by airline is to see which airline has the highest ratio of on-time flights to delayed flights.

cat("AM WEST's on-time/delayed ratio to these destinations was",
    round(sum(ontimeAW$Count) / sum(delayedAW$Count), 2), 
    "\nwhile ALASKA's was slightly worse, at",
    round(sum(ontimeAK$Count) / sum(delayedAK$Count), 2))

## AM WEST's on-time/delayed ratio to these destinations was 8.18 
## while ALASKA's was slightly worse, at 6.53

While the overall ratio is ~ 25% better for AM WEST, it’s not a sizeable enough difference that I’d use it as a measure for some kind of presentation or important decision. Probably it will be much more interesting to inspect the ratios for the individual destinations. For example, just glancing at the AM WEST delays displayed above, you can see that Phoenix has 415 delays where Seattle and San Diego each have less than 1/6 that number. Whether that’s just a function of the overall numbers of AM WEST flights to the respective destinations, or whether there’s a marked difference in on-time ratios, either way an analysis of the individual ratios will show something important pertaining to the overall stat calculated above.

But after breaking down the data into 4 tidy subsets above, it becomes clear that the one ratio of interest here, on-time vs. delayed, is harder to calculate once we’ve split the data into separate rows for on-time and delayed counts.

Rather than 20 rows of Carrier | Status | Destination | Count, it seems like what we’re after is 10 rows of Carrier | Destination | Ratio | TotalCount, where the TotalCount is the sum of the 2 counts that will get condensed into 1 new row, for each Carrier/Destination combo, and the Ratio will be calculated from those same 2 counts. This is actually just as tidy, considering our purposes, i.e. what we consider “one observation”. For us, an observation isn’t the status of one flight, it’s the on-time/delayed ratio of one airline going to one destination. The reason we need to keep the TotalCount for each of those observations is so that we can put each ratio into meaningful perspective.

# pivot back out the Status column and calculate the needed vars
(new_flts <- flts %>% 
  pivot_wider(names_from = Status, values_from = Count) %>%
  transmute(Carrier, Destination, Ratio = `on time` / delayed, 
            TotalFlts = `on time` + delayed))

## # A tibble: 10 x 4
##    Carrier Destination   Ratio TotalFlts
##    <chr>   <chr>         <dbl>     <int>
##  1 ALASKA  Los Angeles    8.02       559
##  2 ALASKA  Phoenix       18.4        233
##  3 ALASKA  San Diego     10.6        232
##  4 ALASKA  San Francisco  4.93       605
##  5 ALASKA  Seattle        6.04      2146
##  6 AM WEST Los Angeles    5.93       811
##  7 AM WEST Phoenix       11.7       5255
##  8 AM WEST San Diego      5.89       448
##  9 AM WEST San Francisco  2.48       449
## 10 AM WEST Seattle        3.30       262

The fact that this new version of our data contains a ratio column makes it seem untidy, but armed with that ratio and with the total number of flights per carrier/destination combo, it actually encodes all the info from the original csv, in the most compact form possible.

ggplot(new_flts, aes(x=Destination, y=Ratio, fill=Carrier,
                     width = TotalFlts/3333)) + 
  geom_bar(stat = 'identity') + 
  theme(panel.grid.major = element_blank(),
        panel.background = element_blank()) + 
  ylab('On-Time / Late Ratio') +
  theme(plot.title = element_text(hjust = 0.6)) + 
  ggtitle('Bar Width Proportional to Number of Flights')

That’s a really nice example of Simpson’s Paradox

since ALASKA has a better ratio into every destination, yet we saw that AM WEST wins the overall ratio. The widths of the bars above show us very clearly how this happens, since they’re proportional to the total number of flights per carrier/destination combo. In short, Phoenix is the easiest airport to fly into, and AM WEST sends the vast majority of their flights there. ALASKA, meanwhile focuses on Seattle, where it’s apparently much harder to land on-time.

So which carrier would you choose? All other things being equal, you’d want to fly ALASKA, although clearly all other things aren’t equal, since AM WEST is going to offer you a lot more flight times into Phoenix, and with 12 times more flights arriving on-time than late there, you have a pretty good chance of succeeding.

Flight Delays

Ethan Haley

3/6/2021

PART 1: Tidy a csv table

Make those blank fields NA’s

Remove the NA row and fill the airline names into all rows

Name first 2 columns and pivot destinations

Remove dots from Destination names

PART 2: Analyze ==> Compare the arrival delays for the two airlines

Unless I’m only interested in a particular destination, my first thought when analyzing delays by airline is to see which airline has the highest ratio of on-time flights to delayed flights.

But after breaking down the data into 4 tidy subsets above, it becomes clear that the one ratio of interest here, on-time vs. delayed, is harder to calculate once we’ve split the data into separate rows for on-time and delayed counts.

The fact that this new version of our data contains a ratio column makes it seem untidy, but armed with that ratio and with the total number of flights per carrier/destination combo, it actually encodes all the info from the original csv, in the most compact form possible.

That’s a really nice example of Simpson’s Paradox

=========================================================================================