Load libraries

library(stringr)
library(dplyr)
library(tidyr)
library(zoo)
library(ggplot2)
library(knitr)

Get data

We will be working with a toy dataset of airlines. The csv file can be retrieved directly from Github. We can see that the data is in a “wide” format.

airline_raw <- read.csv("https://raw.githubusercontent.com/mehtablocker/cuny_607/master/airlines_HW_5.csv", stringsAsFactors = F)
airline_raw %>% kable()
X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
delayed 62 12 20 102 305
NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
delayed 117 415 65 129 61

Column headers and NAs

Let’s start by cleaning up the column names and getting rid of any NA values. We can “copy down” the missing airline names by using the na.locf() function in the zoo package.

airline_df <- airline_raw %>% rename(airline=1, status=2) %>% na.omit()
airline_df <- airline_df %>% mutate(airline=ifelse(airline=="", NA, airline))
airline_df$airline <- na.locf(airline_df$airline)
airline_df %>% kable()
airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

Reduce width and clean

We use the gather() function to go from wide to narrow. Then we use regex to clean up the city names.

airline_df <- airline_df %>% gather(key = city, value = n, Los.Angeles:Seattle)
airline_df <- airline_df %>% mutate(city=gsub("\\.", " ", city))
airline_df %>% head() %>% kable()
airline status city n
ALASKA on time Los Angeles 497
ALASKA delayed Los Angeles 62
AM WEST on time Los Angeles 694
AM WEST delayed Los Angeles 117
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12

That’s much better. Now we are ready to do some analysis.

Analysis

How many total flights did each airline provide?

airline_df %>% group_by(airline) %>% summarise(total_flights=sum(n)) %>% kable()
airline total_flights
ALASKA 3775
AM WEST 7225

We can see America West had almost twice has many flights.


What percentage of flights were delayed on each airline?

To answer that, we can spread out the status column to make our analysis easier.

airline_df_spread <- airline_df %>% 
  mutate(status=ifelse(status=="on time", "n_on_time", "n_delayed")) %>% 
  spread(status, n)
airline_df_spread %>% group_by(airline) %>% 
  summarise(total_delayed=sum(n_delayed), total_on_time=sum(n_on_time)) %>% ungroup() %>% 
  mutate(delay_pct = round(100*total_delayed/(total_delayed+total_on_time),1)) %>% kable()
airline total_delayed total_on_time delay_pct
ALASKA 501 3274 13.3
AM WEST 787 6438 10.9

Despite providing twice as many flights, America West had a lower delay percentage than Alaska.


Which airline / city combinations had the highest delay percentages?

airline_df_spread <- airline_df_spread %>% 
  mutate(n = n_delayed + n_on_time, delay_pct = round(100*n_delayed/n,1))
airline_df_spread %>% arrange(desc(delay_pct)) %>% kable()
airline city n_delayed n_on_time n delay_pct
AM WEST San Francisco 129 320 449 28.7
AM WEST Seattle 61 201 262 23.3
ALASKA San Francisco 102 503 605 16.9
AM WEST San Diego 65 383 448 14.5
AM WEST Los Angeles 117 694 811 14.4
ALASKA Seattle 305 1841 2146 14.2
ALASKA Los Angeles 62 497 559 11.1
ALASKA San Diego 20 212 232 8.6
AM WEST Phoenix 415 4840 5255 7.9
ALASKA Phoenix 12 221 233 5.2

We can see that the two data points with the highest delay percentage belong to America West despite AW having an overall lower delay percentage than Alaska.

A further inspection of the latter table, particularly the sample sizes, reveals that America West’s low overall delay percentage is largely influenced by its Phoenix data point. This is even more obvious if we plot sample size against delay percentage for America West.

airline_df_spread %>% filter(airline=="AM WEST") %>% ggplot(aes(x=n, y=delay_pct))+geom_point()

If we compare each airline by city, we see that Alaska actually outperforms America West in every city:

airline_df_spread %>% ggplot(aes(x=city, y=delay_pct, fill=airline)) + geom_bar(stat="identity", position=position_dodge())


Are the observed differences statistically significant?

Finally, we explore the important question about the role of chance as it pertains to the empirical data.

Comparing the two airlines, we see America West’s San Francisco delay percentage of 28.7% is much worse than Alaska’s San Francisco delay percentage of 16.9%. But in sample sizes of 449 and 605, respectively, is this difference statistically significant? In other words, if we assumed that both airlines actually had the same true San Francisco delay percentage, how often would we observe a difference this large (or greater) simply by chance?

We can calculate the p-values of differences between two proportions for each city. To do so, we’ll first make a dataframe joining the two airlines by city.

airlines_joined <- airline_df_spread %>% filter(airline=="ALASKA") %>% 
  left_join(airline_df_spread %>% filter(airline=="AM WEST"), by="city")
airlines_joined %>% kable()
airline.x city n_delayed.x n_on_time.x n.x delay_pct.x airline.y n_delayed.y n_on_time.y n.y delay_pct.y
ALASKA Los Angeles 62 497 559 11.1 AM WEST 117 694 811 14.4
ALASKA Phoenix 12 221 233 5.2 AM WEST 415 4840 5255 7.9
ALASKA San Diego 20 212 232 8.6 AM WEST 65 383 448 14.5
ALASKA San Francisco 102 503 605 16.9 AM WEST 129 320 449 28.7
ALASKA Seattle 305 1841 2146 14.2 AM WEST 61 201 262 23.3

Next we’ll calculate the necessary statistics (weighted average delay percentage, observed difference, and combined standard error) to obtain the p-value for each row.

airlines_joined %>% 
  mutate(weighted_pct = round(100*(n_delayed.x+n_delayed.y)/(n.x+n.y),1), 
         standard_error_pct = round(sqrt(weighted_pct*(100-weighted_pct)/n.x+weighted_pct*(100-weighted_pct)/n.y),2), 
         observed_diff = abs(delay_pct.x-delay_pct.y), 
         p_value_obs_diff = round(pnorm(-1*observed_diff, 0, standard_error_pct)*2, 2)) %>% 
  select(city, airline.x, n.x, delay_pct.x, airline.y, n.y, delay_pct.y, observed_diff, p_value_obs_diff) %>% 
  kable()
city airline.x n.x delay_pct.x airline.y n.y delay_pct.y observed_diff p_value_obs_diff
Los Angeles ALASKA 559 11.1 AM WEST 811 14.4 3.3 0.07
Phoenix ALASKA 233 5.2 AM WEST 5255 7.9 2.7 0.13
San Diego ALASKA 232 8.6 AM WEST 448 14.5 5.9 0.03
San Francisco ALASKA 605 16.9 AM WEST 449 28.7 11.8 0.00
Seattle ALASKA 2146 14.2 AM WEST 262 23.3 9.1 0.00

According to the p-values, the probabilities of observing the differences by chance alone range from 0 to 13%. (Note, this methodology takes a frequentist approach, i.e., we assume no prior information.)

Conclusion

We took a raw, “wide” dataset and cleaned it up to make it easy to use for analysis.

We than ran a few analyses and learned that America West had a lower overall delay percentage than Alaska despite having twice as many flights in this sample dataset. However, the primary reason for that is America West’s outstanding delay percentage in Phoenix over a very large number of flights.

Alaska outperformed America West in every city. And after calculating statistical significance values, the overall numbers do not appear to be heavily influenced by chance.