library(dplyr)
library(tidyr)
library(knitr)
library(stringr)
library(ggplot2)
flights <- read.csv("https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/flightinfo.csv",
sep = ",", header = TRUE)
| X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
# Rename fields
flights <- rename(flights, airline = X, delay = X.1)
# Tidy up data, so that every variable is in a column and every observation is in a row
flights <- flights %>%
gather(city, freq, 3:length(flights)) %>% # Transfer 'city' variable into a column
spread(delay, freq) # Split 'delayed'/'on time' variables into 2 columns
# Adjust city names (get rid of a period)
flights$city <- str_replace_all(flights$city, "\\.", " ")
# Remove space in the 'on time' variable name
flights <- rename(flights, ontime = `on time`)
At this point the initial data is tidy. Each variable - airline, city, delayed, ontime - is in its own column and each observation is in its own row.
| airline | city | delayed | ontime |
|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | San Diego | 20 | 212 |
| ALASKA | San Francisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| AM WEST | Los Angeles | 117 | 694 |
| AM WEST | Phoenix | 415 | 4840 |
| AM WEST | San Diego | 65 | 383 |
| AM WEST | San Francisco | 129 | 320 |
| AM WEST | Seattle | 61 | 201 |
Let us start with getting the ratio of delayed flights for all airlines and cities. We’ll sort the data by delay starting with the worst airline/city combo.
flights <- flights %>%
mutate(flights = delayed + ontime, delayed_ratio = delayed / flights) %>%
arrange(desc(delayed_ratio))
| airline | city | delayed | ontime | flights | delayed_ratio |
|---|---|---|---|---|---|
| AM WEST | San Francisco | 129 | 320 | 449 | 0.2873051 |
| AM WEST | Seattle | 61 | 201 | 262 | 0.2328244 |
| ALASKA | San Francisco | 102 | 503 | 605 | 0.1685950 |
| AM WEST | San Diego | 65 | 383 | 448 | 0.1450893 |
| AM WEST | Los Angeles | 117 | 694 | 811 | 0.1442663 |
| ALASKA | Seattle | 305 | 1841 | 2146 | 0.1421249 |
| ALASKA | Los Angeles | 62 | 497 | 559 | 0.1109123 |
| ALASKA | San Diego | 20 | 212 | 232 | 0.0862069 |
| AM WEST | Phoenix | 415 | 4840 | 5255 | 0.0789724 |
| ALASKA | Phoenix | 12 | 221 | 233 | 0.0515021 |
Let us look compare how delays stack up against each other using various criteria.
# Average delay across all cities and airlines
flights %>% summarise(mean(delayed_ratio))
## mean(delayed_ratio)
## 1 0.1447799
delay_city <- flights %>%
group_by(city) %>%
summarise(avg_delay = mean(delayed_ratio)) %>%
arrange(desc(avg_delay))
| City | Avg Delay |
|---|---|
| San Francisco | 0.2279501 |
| Seattle | 0.1874747 |
| Los Angeles | 0.1275893 |
| San Diego | 0.1156481 |
| Phoenix | 0.0652373 |
delay_airline <- flights %>%
group_by(airline) %>%
summarise(avg_delay = mean(delayed_ratio)) %>%
arrange(desc(avg_delay))
| Airline | Avg Delay |
|---|---|
| AM WEST | 0.1776915 |
| ALASKA | 0.1118683 |
Map delays per city and per airline.
ggplot(flights, aes(x = city, y = delayed_ratio, group = airline, color = airline, shape = airline)) +
geom_point(aes(shape = airline)) +
geom_line() +
labs(x = "City", y = "Delay") +
theme(legend.title=element_blank())
It appear that Alaska Airlines enjoys better performances comparing to America West Airlines for all cities we have the data for. In my opinion this is generally in line with conventional knowledge of Alaska Airlines being the better of the two airlines. Even though the Alaska Airlines seems to have less delays that America West Airlines, the pattern of delays is similar between various cities with San Francisco experiencing the most delayed flights. This indicates that delays are most likely not only due to airline performance, but also due to airport performance.