library(stringr)
library(dplyr)
library(tidyr)
library(zoo)
library(ggplot2)
library(knitr)
We will be working with a toy dataset of airlines. The csv file can be retrieved directly from Github. We can see that the data is in a “wide” format.
airline_raw <- read.csv("https://raw.githubusercontent.com/mehtablocker/cuny_607/master/airlines_HW_5.csv", stringsAsFactors = F)
airline_raw %>% kable()
| X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| delayed | 62 | 12 | 20 | 102 | 305 | |
| NA | NA | NA | NA | NA | ||
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| delayed | 117 | 415 | 65 | 129 | 61 |
Let’s start by cleaning up the column names and getting rid of any NA values. We can “copy down” the missing airline names by using the na.locf() function in the zoo package.
airline_df <- airline_raw %>% rename(airline=1, status=2) %>% na.omit()
airline_df <- airline_df %>% mutate(airline=ifelse(airline=="", NA, airline))
airline_df$airline <- na.locf(airline_df$airline)
airline_df %>% kable()
| airline | status | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
We use the gather() function to go from wide to narrow. Then we use regex to clean up the city names.
airline_df <- airline_df %>% gather(key = city, value = n, Los.Angeles:Seattle)
airline_df <- airline_df %>% mutate(city=gsub("\\.", " ", city))
airline_df %>% head() %>% kable()
| airline | status | city | n |
|---|---|---|---|
| ALASKA | on time | Los Angeles | 497 |
| ALASKA | delayed | Los Angeles | 62 |
| AM WEST | on time | Los Angeles | 694 |
| AM WEST | delayed | Los Angeles | 117 |
| ALASKA | on time | Phoenix | 221 |
| ALASKA | delayed | Phoenix | 12 |
That’s much better. Now we are ready to do some analysis.
How many total flights did each airline provide?
airline_df %>% group_by(airline) %>% summarise(total_flights=sum(n)) %>% kable()
| airline | total_flights |
|---|---|
| ALASKA | 3775 |
| AM WEST | 7225 |
We can see America West had almost twice has many flights.
What percentage of flights were delayed on each airline?
To answer that, we can spread out the status column to make our analysis easier.
airline_df_spread <- airline_df %>%
mutate(status=ifelse(status=="on time", "n_on_time", "n_delayed")) %>%
spread(status, n)
airline_df_spread %>% group_by(airline) %>%
summarise(total_delayed=sum(n_delayed), total_on_time=sum(n_on_time)) %>% ungroup() %>%
mutate(delay_pct = round(100*total_delayed/(total_delayed+total_on_time),1)) %>% kable()
| airline | total_delayed | total_on_time | delay_pct |
|---|---|---|---|
| ALASKA | 501 | 3274 | 13.3 |
| AM WEST | 787 | 6438 | 10.9 |
Despite providing twice as many flights, America West had a lower delay percentage than Alaska.
Which airline / city combinations had the highest delay percentages?
airline_df_spread <- airline_df_spread %>%
mutate(n = n_delayed + n_on_time, delay_pct = round(100*n_delayed/n,1))
airline_df_spread %>% arrange(desc(delay_pct)) %>% kable()
| airline | city | n_delayed | n_on_time | n | delay_pct |
|---|---|---|---|---|---|
| AM WEST | San Francisco | 129 | 320 | 449 | 28.7 |
| AM WEST | Seattle | 61 | 201 | 262 | 23.3 |
| ALASKA | San Francisco | 102 | 503 | 605 | 16.9 |
| AM WEST | San Diego | 65 | 383 | 448 | 14.5 |
| AM WEST | Los Angeles | 117 | 694 | 811 | 14.4 |
| ALASKA | Seattle | 305 | 1841 | 2146 | 14.2 |
| ALASKA | Los Angeles | 62 | 497 | 559 | 11.1 |
| ALASKA | San Diego | 20 | 212 | 232 | 8.6 |
| AM WEST | Phoenix | 415 | 4840 | 5255 | 7.9 |
| ALASKA | Phoenix | 12 | 221 | 233 | 5.2 |
We can see that the two data points with the highest delay percentage belong to America West despite AW having an overall lower delay percentage than Alaska.
A further inspection of the latter table, particularly the sample sizes, reveals that America West’s low overall delay percentage is largely influenced by its Phoenix data point. This is even more obvious if we plot sample size against delay percentage for America West.
airline_df_spread %>% filter(airline=="AM WEST") %>% ggplot(aes(x=n, y=delay_pct))+geom_point()
If we compare each airline by city, we see that Alaska actually outperforms America West in every city:
airline_df_spread %>% ggplot(aes(x=city, y=delay_pct, fill=airline)) + geom_bar(stat="identity", position=position_dodge())
Are the observed differences statistically significant?
Finally, we explore the important question about the role of chance as it pertains to the empirical data.
Comparing the two airlines, we see America West’s San Francisco delay percentage of 28.7% is much worse than Alaska’s San Francisco delay percentage of 16.9%. But in sample sizes of 449 and 605, respectively, is this difference statistically significant? In other words, if we assumed that both airlines actually had the same true San Francisco delay percentage, how often would we observe a difference this large (or greater) simply by chance?
We can calculate the p-values of differences between two proportions for each city. To do so, we’ll first make a dataframe joining the two airlines by city.
airlines_joined <- airline_df_spread %>% filter(airline=="ALASKA") %>%
left_join(airline_df_spread %>% filter(airline=="AM WEST"), by="city")
airlines_joined %>% kable()
| airline.x | city | n_delayed.x | n_on_time.x | n.x | delay_pct.x | airline.y | n_delayed.y | n_on_time.y | n.y | delay_pct.y |
|---|---|---|---|---|---|---|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 | 559 | 11.1 | AM WEST | 117 | 694 | 811 | 14.4 |
| ALASKA | Phoenix | 12 | 221 | 233 | 5.2 | AM WEST | 415 | 4840 | 5255 | 7.9 |
| ALASKA | San Diego | 20 | 212 | 232 | 8.6 | AM WEST | 65 | 383 | 448 | 14.5 |
| ALASKA | San Francisco | 102 | 503 | 605 | 16.9 | AM WEST | 129 | 320 | 449 | 28.7 |
| ALASKA | Seattle | 305 | 1841 | 2146 | 14.2 | AM WEST | 61 | 201 | 262 | 23.3 |
Next we’ll calculate the necessary statistics (weighted average delay percentage, observed difference, and combined standard error) to obtain the p-value for each row.
airlines_joined %>%
mutate(weighted_pct = round(100*(n_delayed.x+n_delayed.y)/(n.x+n.y),1),
standard_error_pct = round(sqrt(weighted_pct*(100-weighted_pct)/n.x+weighted_pct*(100-weighted_pct)/n.y),2),
observed_diff = abs(delay_pct.x-delay_pct.y),
p_value_obs_diff = round(pnorm(-1*observed_diff, 0, standard_error_pct)*2, 2)) %>%
select(city, airline.x, n.x, delay_pct.x, airline.y, n.y, delay_pct.y, observed_diff, p_value_obs_diff) %>%
kable()
| city | airline.x | n.x | delay_pct.x | airline.y | n.y | delay_pct.y | observed_diff | p_value_obs_diff |
|---|---|---|---|---|---|---|---|---|
| Los Angeles | ALASKA | 559 | 11.1 | AM WEST | 811 | 14.4 | 3.3 | 0.07 |
| Phoenix | ALASKA | 233 | 5.2 | AM WEST | 5255 | 7.9 | 2.7 | 0.13 |
| San Diego | ALASKA | 232 | 8.6 | AM WEST | 448 | 14.5 | 5.9 | 0.03 |
| San Francisco | ALASKA | 605 | 16.9 | AM WEST | 449 | 28.7 | 11.8 | 0.00 |
| Seattle | ALASKA | 2146 | 14.2 | AM WEST | 262 | 23.3 | 9.1 | 0.00 |
According to the p-values, the probabilities of observing the differences by chance alone range from 0 to 13%. (Note, this methodology takes a frequentist approach, i.e., we assume no prior information.)
We took a raw, “wide” dataset and cleaned it up to make it easy to use for analysis.
We than ran a few analyses and learned that America West had a lower overall delay percentage than Alaska despite having twice as many flights in this sample dataset. However, the primary reason for that is America West’s outstanding delay percentage in Phoenix over a very large number of flights.
Alaska outperformed America West in every city. And after calculating statistical significance values, the overall numbers do not appear to be heavily influenced by chance.