Libraries Used

library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(scales)

Reading in the csv data from GitHub

airline_data <- read.csv('https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/Homework4/airlinedata.csv')
# printing out data from raw csv
airline_data
##        X     X.1 Los.Angeles Pheonix San.Diego San.Francisco Seattle
## 1 ALASKA on time         497     221       212           503    1841
## 2        delayed          62      12        20           102     305
## 3                         NA      NA        NA            NA      NA
## 4 AMWEST on time         694    4840       383           320     201
## 5        delayed         117     415        65           129      61

Looking at delay percentages

Initially, I thought it would be helpful to take a look at percentages of delays for each location.

You can see from the data wrangling below, I first dropped the blank row, renamed some of the column headers, added “Alaska” and “AMWest” for the rows in the first column to make wrangling easier in the future, and then did a series of gather and spread functions in order to manipulate the data into a form conducive for calculating percentages. Then, I chained the percentage calculation and selected only the relevant columns for the following visualization. You can see the output of the data frame I created below, as ‘delay_prct_df’.


delay_prct_df <- airline_data %>% 
  drop_na() %>%
  rename("Airlines" = X, "Status" = X.1, "Los_Angeles" = Los.Angeles, "San_Diego" = San.Diego, "San_Francisco" = San.Francisco) %>% 
  mutate(Airlines = c("Alaska", "Alaska", "AMWest", "AMWest")) %>% 
  gather("Location", "Arrivals", 3:7) %>% 
  spread(Status, Arrivals)  %>% 
  mutate(totalflights = delayed + `on time`, delay_prct = delayed / totalflights) %>%
  mutate(prct_format = percent(delay_prct)) %>%
  select(Airlines, Location, delay_prct, prct_format)

kable(delay_prct_df, align = rep('c', 4)) %>% 
  kable_styling(bootstrap_options = c("striped"), full_width = F)
Airlines Location delay_prct prct_format
Alaska Los_Angeles 0.1109123 11.1%
Alaska Pheonix 0.0515021 5.2%
Alaska San_Diego 0.0862069 8.6%
Alaska San_Francisco 0.1685950 16.9%
Alaska Seattle 0.1421249 14.2%
AMWest Los_Angeles 0.1442663 14.4%
AMWest Pheonix 0.0789724 7.9%
AMWest San_Diego 0.1450893 14.5%
AMWest San_Francisco 0.2873051 28.7%
AMWest Seattle 0.2328244 23.3%

Plotting delay percentages by location

Now that the ‘delay_prct_df’ data frame was in a convenient format to plot, I used ggplot to plot these delay percentages by airline and city (location).

plot <- ggplot(delay_prct_df, aes(x = Airlines, y = delay_prct, fill=Airlines))
plot <- plot + scale_y_continuous(labels = scales::percent)
plot <- plot + theme(legend.position = "none")
plot <- plot + geom_bar(stat = "identity", width = 0.95, position = "stack", color="#dddddd") + ylab("Percent delayed flights") + xlab("Airline")
plot <- plot + facet_grid(. ~ Location)
plot <- plot + geom_text(aes(label=prct_format), vjust=1.5, hjust=0.45, position = position_dodge(width = 0.9), color="white", fontface="bold")
plot

From the above plot, we can see that AMWest Airlines in San Francisco had the highest percentage of delayed flights than any other location in the dataset. Additionally, Alaska Airlines had the lowest percentage of delayed flights in Pheonix. We can also see, that overall, it looks like AMWest had higher percentages of delayed flights at each location relative to Alaska Airlines.


I thought it would be helpful to also display this data as a stacked bar chart by location to get a better understanding of the total delays by this variable, and to confirm the above, that San Francisco had the highest percentage of delayed flights, when looking at the combined delayed flight percentages across both airlines. You can find a plot of this data below in a stacked bar chart.

stacked_bar_df <- airline_data %>% 
  drop_na() %>%
  rename("Airlines" = X, "Status" = X.1, "Los_Angeles" = Los.Angeles, "San_Diego" = San.Diego, "San_Francisco" = San.Francisco) %>% 
  mutate(Airlines = c("Alaska", "Alaska", "AMWest", "AMWest")) %>% 
  gather("Location", "Arrivals", 3:7) %>% 
  spread(Status, Arrivals) %>% 
  mutate(totalflights = delayed + `on time`, delay_prct = delayed / totalflights) %>%
    mutate(prct_format = percent(delay_prct)) %>%
  select(Airlines, Location, delay_prct, prct_format)
library(RColorBrewer)

# Small multiple
ggplot(stacked_bar_df, aes(fill=Airlines, y=delay_prct, x=Location)) + xlab("Location") + ylab("Percent of flights that were delayed") +
    geom_bar(position="stack", stat="identity", color="#dddddd") +
    scale_fill_brewer(palette = "YlGnBu") +
    geom_text(aes(label=prct_format), vjust=2.5, hjust=0.5, position = position_stack(), color="black", fontface="bold")

This is indeed confirmed. San Francisco had the highest percentage of delayed flights out of all other locations in the dataset, and Pheonix had the lowest percentage of delayed flights relative to all other locations in the dataset.


Average number of delays experience in each location

Next, I thought it would be interesting to take a look at the average number of delays experienced in each location. For this, I did the same data wrangling practices as above, until I decided to group the data by location and then summarise with mean calculations by location. See below:

num_delays_loc_df <- airline_data %>% 
  drop_na() %>%
  rename("Airlines" = X, "Status" = X.1, "Los_Angeles" = Los.Angeles, "San_Diego" = San.Diego, "San_Francisco" = San.Francisco) %>% 
  mutate(Airlines = c("Alaska", "Alaska", "AMWest", "AMWest")) %>% 
  gather("Location", "Arrivals", 3:7) %>% 
  spread(Status, Arrivals) %>% 
  group_by(Location) %>% 
  summarise(`Mean flights delayed` = mean(delayed))

kable(num_delays_loc_df, align = rep('c', 2)) %>% 
  kable_styling(bootstrap_options = c("striped"), full_width = F)
Location Mean flights delayed
Los_Angeles 89.5
Pheonix 213.5
San_Diego 42.5
San_Francisco 115.5
Seattle 183.0
g <- ggplot(num_delays_loc_df, aes(x = reorder(Location, `Mean flights delayed`), y = `Mean flights delayed`, label=`Mean flights delayed`)) + 
  geom_point(stat="identity", colour="#111111", fill="#00BFC4", stroke=2, size=2) + 
  geom_segment(aes(x=Location, xend=Location, y=0, yend=`Mean flights delayed`), color="#111111") +
  geom_text(color="#111111", hjust=-0.5, size=4, fontface="bold") + ylim(0, 250) +
  xlab("Location") +
  coord_flip()

g


From this plot above, we can see that when taking the average delays across both airlines by location, Pheonix had the highest average number of flights that were delayed relative to the other cities in the dataset. But, wait a minute! How can Pheonix have the highest average number of flights that were delayed relative to all other cities but also have the lowest percentage of delayed flights out of all other cities in the dataset? Well, as a reminder, the percentage of delayed flights is take from the total number of flights. See below for some data wrangling:

total_flights_df <- airline_data %>% 
  drop_na() %>%
  rename("Airlines" = X, "Status" = X.1, "Los_Angeles" = Los.Angeles, "San_Diego" = San.Diego, "San_Francisco" = San.Francisco) %>% 
  mutate(Airlines = c("Alaska", "Alaska", "AMWest", "AMWest")) %>% 
  gather("Location", "Arrivals", 3:7) %>% 
  spread(Status, Arrivals)  %>% 
  mutate(totalflights = delayed + `on time`) %>% 
  group_by(Location) %>% 
  summarise(`Total Flights` = sum(totalflights)) %>% 
  arrange(-`Total Flights`)

kable(total_flights_df, align = rep('c', 2)) %>% 
  kable_styling(bootstrap_options = c("striped"), full_width = F)
Location Total Flights
Pheonix 5488
Seattle 2408
Los_Angeles 1370
San_Francisco 1054
San_Diego 680

As you can see in the above table, since Pheonix had the highest total number of flights, it makes sense that the mean number of delayed flights turned out to be highest at this location and that the percentage of delayed flights out of the total number of flights is the lowest.


Summary statistics for both airlines

We can also take a look at other delay summary statistics for Alaska Airlines and AMWest Airlines. To do this, we can manipulate the data frame similar to above, however instead of grouping by location, we group by Airline and then summarise this data to create columns that show the number of total delays, total flights, the mean number of delays, the location with the least delays the location with the most delays, and the overall percentage of delays, by airline.

airline_stats_df <- airline_data %>% 
  drop_na() %>%
  rename("Airline" = X, "Status" = X.1, "Los_Angeles" = Los.Angeles, "San_Diego" = San.Diego, "San_Francisco" = San.Francisco) %>% 
  mutate(Airline = c("Alaska", "Alaska", "AMWest", "AMWest")) %>% 
  gather("Location", "Arrivals", 3:7) %>% 
  spread(Status, Arrivals) %>% 
  group_by(Airline) %>% 
  summarise(`Total Delays (All Locations)` = sum(delayed), 
            `Total Flights (All Locations)` = sum(delayed + `on time`),
            `Average Number of Flights Delayed` = mean(delayed), 
            `Minimum Flights Delayed in a Location` = min(delayed), 
            `Maximum Flights Delayed in a Location` = max(delayed),
            `Total Percent of Flights Delayed` = percent((sum(delayed) / sum(delayed + `on time`))))

kable(airline_stats_df, align = rep('c', 7)) %>% 
  kable_styling(bootstrap_options = c("striped"), full_width = F)
Airline Total Delays (All Locations) Total Flights (All Locations) Average Number of Flights Delayed Minimum Flights Delayed in a Location Maximum Flights Delayed in a Location Total Percent of Flights Delayed
Alaska 501 3775 100.2 12 305 13.3%
AMWest 787 7225 157.4 61 415 10.9%

Interestingly, we can see that on average, AMWest Airlines had a higher number of mean flights that were delayed than Alaska Airlines, however the percentage of delayed flights out of total flights was higher for Alaska Airlines (~13%) relative to AMWest Airlines (~11%).