DATA 607 - Week 5 Homework

Libraries Used

library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(scales)

Reading in the csv data from GitHub

airline_data <- read.csv('https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/Homework4/airlinedata.csv')
# printing out data from raw csv
airline_data

##        X     X.1 Los.Angeles Pheonix San.Diego San.Francisco Seattle
## 1 ALASKA on time         497     221       212           503    1841
## 2        delayed          62      12        20           102     305
## 3                         NA      NA        NA            NA      NA
## 4 AMWEST on time         694    4840       383           320     201
## 5        delayed         117     415        65           129      61

Looking at delay percentages

Initially, I thought it would be helpful to take a look at percentages of delays for each location.

You can see from the data wrangling below, I first dropped the blank row, renamed some of the column headers, added “Alaska” and “AMWest” for the rows in the first column to make wrangling easier in the future, and then did a series of gather and spread functions in order to manipulate the data into a form conducive for calculating percentages. Then, I chained the percentage calculation and selected only the relevant columns for the following visualization. You can see the output of the data frame I created below, as ‘delay_prct_df’.

delay_prct_df <- airline_data %>% 
  drop_na() %>%
  rename("Airlines" = X, "Status" = X.1, "Los_Angeles" = Los.Angeles, "San_Diego" = San.Diego, "San_Francisco" = San.Francisco) %>% 
  mutate(Airlines = c("Alaska", "Alaska", "AMWest", "AMWest")) %>% 
  gather("Location", "Arrivals", 3:7) %>% 
  spread(Status, Arrivals)  %>% 
  mutate(totalflights = delayed + `on time`, delay_prct = delayed / totalflights) %>%
  mutate(prct_format = percent(delay_prct)) %>%
  select(Airlines, Location, delay_prct, prct_format)

kable(delay_prct_df, align = rep('c', 4)) %>% 
  kable_styling(bootstrap_options = c("striped"), full_width = F)

Airlines	Location	delay_prct	prct_format
Alaska	Los_Angeles	0.1109123	11.1%
Alaska	Pheonix	0.0515021	5.2%
Alaska	San_Diego	0.0862069	8.6%
Alaska	San_Francisco	0.1685950	16.9%
Alaska	Seattle	0.1421249	14.2%
AMWest	Los_Angeles	0.1442663	14.4%
AMWest	Pheonix	0.0789724	7.9%
AMWest	San_Diego	0.1450893	14.5%
AMWest	San_Francisco	0.2873051	28.7%
AMWest	Seattle	0.2328244	23.3%

Plotting delay percentages by location

Now that the ‘delay_prct_df’ data frame was in a convenient format to plot, I used ggplot to plot these delay percentages by airline and city (location).

plot <- ggplot(delay_prct_df, aes(x = Airlines, y = delay_prct, fill=Airlines))
plot <- plot + scale_y_continuous(labels = scales::percent)
plot <- plot + theme(legend.position = "none")
plot <- plot + geom_bar(stat = "identity", width = 0.95, position = "stack", color="#dddddd") + ylab("Percent delayed flights") + xlab("Airline")
plot <- plot + facet_grid(. ~ Location)
plot <- plot + geom_text(aes(label=prct_format), vjust=1.5, hjust=0.45, position = position_dodge(width = 0.9), color="white", fontface="bold")
plot

From the above plot, we can see that AMWest Airlines in San Francisco had the highest percentage of delayed flights than any other location in the dataset. Additionally, Alaska Airlines had the lowest percentage of delayed flights in Pheonix. We can also see, that overall, it looks like AMWest had higher percentages of delayed flights at each location relative to Alaska Airlines.

I thought it would be helpful to also display this data as a stacked bar chart by location to get a better understanding of the total delays by this variable, and to confirm the above, that San Francisco had the highest percentage of delayed flights, when looking at the combined delayed flight percentages across both airlines. You can find a plot of this data below in a stacked bar chart.

stacked_bar_df <- airline_data %>% 
  drop_na() %>%
  rename("Airlines" = X, "Status" = X.1, "Los_Angeles" = Los.Angeles, "San_Diego" = San.Diego, "San_Francisco" = San.Francisco) %>% 
  mutate(Airlines = c("Alaska", "Alaska", "AMWest", "AMWest")) %>% 
  gather("Location", "Arrivals", 3:7) %>% 
  spread(Status, Arrivals) %>% 
  mutate(totalflights = delayed + `on time`, delay_prct = delayed / totalflights) %>%
    mutate(prct_format = percent(delay_prct)) %>%
  select(Airlines, Location, delay_prct, prct_format)

library(RColorBrewer)

# Small multiple
ggplot(stacked_bar_df, aes(fill=Airlines, y=delay_prct, x=Location)) + xlab("Location") + ylab("Percent of flights that were delayed") +
    geom_bar(position="stack", stat="identity", color="#dddddd") +
    scale_fill_brewer(palette = "YlGnBu") +
    geom_text(aes(label=prct_format), vjust=2.5, hjust=0.5, position = position_stack(), color="black", fontface="bold")

This is indeed confirmed. San Francisco had the highest percentage of delayed flights out of all other locations in the dataset, and Pheonix had the lowest percentage of delayed flights relative to all other locations in the dataset.

Average number of delays experience in each location

Next, I thought it would be interesting to take a look at the average number of delays experienced in each location. For this, I did the same data wrangling practices as above, until I decided to group the data by location and then summarise with mean calculations by location. See below:

num_delays_loc_df <- airline_data %>% 
  drop_na() %>%
  rename("Airlines" = X, "Status" = X.1, "Los_Angeles" = Los.Angeles, "San_Diego" = San.Diego, "San_Francisco" = San.Francisco) %>% 
  mutate(Airlines = c("Alaska", "Alaska", "AMWest", "AMWest")) %>% 
  gather("Location", "Arrivals", 3:7) %>% 
  spread(Status, Arrivals) %>% 
  group_by(Location) %>% 
  summarise(`Mean flights delayed` = mean(delayed))

kable(num_delays_loc_df, align = rep('c', 2)) %>% 
  kable_styling(bootstrap_options = c("striped"), full_width = F)

Location	Mean flights delayed
Los_Angeles	89.5
Pheonix	213.5
San_Diego	42.5
San_Francisco	115.5
Seattle	183.0

g <- ggplot(num_delays_loc_df, aes(x = reorder(Location, `Mean flights delayed`), y = `Mean flights delayed`, label=`Mean flights delayed`)) + 
  geom_point(stat="identity", colour="#111111", fill="#00BFC4", stroke=2, size=2) + 
  geom_segment(aes(x=Location, xend=Location, y=0, yend=`Mean flights delayed`), color="#111111") +
  geom_text(color="#111111", hjust=-0.5, size=4, fontface="bold") + ylim(0, 250) +
  xlab("Location") +
  coord_flip()

g

From this plot above, we can see that when taking the average delays across both airlines by location, Pheonix had the highest average number of flights that were delayed relative to the other cities in the dataset. But, wait a minute! How can Pheonix have the highest average number of flights that were delayed relative to all other cities but also have the lowest percentage of delayed flights out of all other cities in the dataset? Well, as a reminder, the percentage of delayed flights is take from the total number of flights. See below for some data wrangling:

total_flights_df <- airline_data %>% 
  drop_na() %>%
  rename("Airlines" = X, "Status" = X.1, "Los_Angeles" = Los.Angeles, "San_Diego" = San.Diego, "San_Francisco" = San.Francisco) %>% 
  mutate(Airlines = c("Alaska", "Alaska", "AMWest", "AMWest")) %>% 
  gather("Location", "Arrivals", 3:7) %>% 
  spread(Status, Arrivals)  %>% 
  mutate(totalflights = delayed + `on time`) %>% 
  group_by(Location) %>% 
  summarise(`Total Flights` = sum(totalflights)) %>% 
  arrange(-`Total Flights`)

kable(total_flights_df, align = rep('c', 2)) %>% 
  kable_styling(bootstrap_options = c("striped"), full_width = F)

Location	Total Flights
Pheonix	5488
Seattle	2408
Los_Angeles	1370
San_Francisco	1054
San_Diego	680

As you can see in the above table, since Pheonix had the highest total number of flights, it makes sense that the mean number of delayed flights turned out to be highest at this location and that the percentage of delayed flights out of the total number of flights is the lowest.

Summary statistics for both airlines

We can also take a look at other delay summary statistics for Alaska Airlines and AMWest Airlines. To do this, we can manipulate the data frame similar to above, however instead of grouping by location, we group by Airline and then summarise this data to create columns that show the number of total delays, total flights, the mean number of delays, the location with the least delays the location with the most delays, and the overall percentage of delays, by airline.

airline_stats_df <- airline_data %>% 
  drop_na() %>%
  rename("Airline" = X, "Status" = X.1, "Los_Angeles" = Los.Angeles, "San_Diego" = San.Diego, "San_Francisco" = San.Francisco) %>% 
  mutate(Airline = c("Alaska", "Alaska", "AMWest", "AMWest")) %>% 
  gather("Location", "Arrivals", 3:7) %>% 
  spread(Status, Arrivals) %>% 
  group_by(Airline) %>% 
  summarise(`Total Delays (All Locations)` = sum(delayed), 
            `Total Flights (All Locations)` = sum(delayed + `on time`),
            `Average Number of Flights Delayed` = mean(delayed), 
            `Minimum Flights Delayed in a Location` = min(delayed), 
            `Maximum Flights Delayed in a Location` = max(delayed),
            `Total Percent of Flights Delayed` = percent((sum(delayed) / sum(delayed + `on time`))))

kable(airline_stats_df, align = rep('c', 7)) %>% 
  kable_styling(bootstrap_options = c("striped"), full_width = F)

Airline	Total Delays (All Locations)	Total Flights (All Locations)	Average Number of Flights Delayed	Minimum Flights Delayed in a Location	Maximum Flights Delayed in a Location	Total Percent of Flights Delayed
Alaska	501	3775	100.2	12	305	13.3%
AMWest	787	7225	157.4	61	415	10.9%

Interestingly, we can see that on average, AMWest Airlines had a higher number of mean flights that were delayed than Alaska Airlines, however the percentage of delayed flights out of total flights was higher for Alaska Airlines (~13%) relative to AMWest Airlines (~11%).