Required Libraries

library(dplyr)
library(tidyr)
library(knitr)
library(stringr)
library(ggplot2)

Data Import

flights <- read.csv("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/flightinfo.csv", 
                    sep = ",", header = TRUE)
Table 1. Raw Data
X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

Data Cleanup and Tidying

# Rename fields
flights <- rename(flights, airline = X, delay = X.1)

# Tidy up data, so that every variable is in a column and every observation is in a row
flights <- flights %>% 
  gather(city, freq, 3:length(flights)) %>%   # Transfer 'city' variable into a column
  spread(delay, freq)                         # Split 'delayed'/'on time' variables into 2 columns

# Adjust city names (get rid of a period)
flights$city <- str_replace_all(flights$city, "\\.", " ")

# Remove space in the 'on time' variable name
flights <- rename(flights, ontime = `on time`)

At this point the initial data is tidy. Each variable - airline, city, delayed, ontime - is in its own column and each observation is in its own row.

airline city delayed ontime
ALASKA Los Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San Diego 20 212
ALASKA San Francisco 102 503
ALASKA Seattle 305 1841
AM WEST Los Angeles 117 694
AM WEST Phoenix 415 4840
AM WEST San Diego 65 383
AM WEST San Francisco 129 320
AM WEST Seattle 61 201

Data Analysis

Let us start with getting the ratio of delayed flights for all airlines and cities. We’ll sort the data by delay starting with the worst airline/city combo.

flights <- flights %>% 
  mutate(flights = delayed + ontime, delayed_ratio = delayed / flights) %>% 
  arrange(desc(delayed_ratio))
airline city delayed ontime flights delayed_ratio
AM WEST San Francisco 129 320 449 0.2873051
AM WEST Seattle 61 201 262 0.2328244
ALASKA San Francisco 102 503 605 0.1685950
AM WEST San Diego 65 383 448 0.1450893
AM WEST Los Angeles 117 694 811 0.1442663
ALASKA Seattle 305 1841 2146 0.1421249
ALASKA Los Angeles 62 497 559 0.1109123
ALASKA San Diego 20 212 232 0.0862069
AM WEST Phoenix 415 4840 5255 0.0789724
ALASKA Phoenix 12 221 233 0.0515021

Let us look compare how delays stack up against each other using various criteria.

# Average delay across all cities and airlines
flights %>% summarise(mean(delayed_ratio))
##   mean(delayed_ratio)
## 1           0.1447799
delay_city <- flights %>% 
  group_by(city) %>% 
  summarise(avg_delay = mean(delayed_ratio)) %>% 
  arrange(desc(avg_delay))
Table 4. Delayed Flights by City
City Avg Delay
San Francisco 0.2279501
Seattle 0.1874747
Los Angeles 0.1275893
San Diego 0.1156481
Phoenix 0.0652373
delay_airline <- flights %>% 
  group_by(airline) %>% 
  summarise(avg_delay = mean(delayed_ratio)) %>% 
  arrange(desc(avg_delay))
Table 5. Delayed Flights by Airline
Airline Avg Delay
AM WEST 0.1776915
ALASKA 0.1118683

Map delays per city and per airline.

ggplot(flights, aes(x = city, y = delayed_ratio, group = airline, color = airline, shape = airline)) + 
  geom_point(aes(shape = airline)) + 
  geom_line() + 
  labs(x = "City", y = "Delay") + 
  theme(legend.title=element_blank())

Conclusion

It appear that Alaska Airlines enjoys better performances comparing to America West Airlines for all cities we have the data for. In my opinion this is generally in line with conventional knowledge of Alaska Airlines being the better of the two airlines. Even though the Alaska Airlines seems to have less delays that America West Airlines, the pattern of delays is similar between various cities with San Francisco experiencing the most delayed flights. This indicates that delays are most likely not only due to airline performance, but also due to airport performance.