DATA 607 Assignment 5

Source files: https://github.com/djlofland/DATA607_F2019/tree/master/Assignment5

Assignment

Airline Data

The chart above describes arrival delays for two airlines across five destinations. Your task is to:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: The URL to the .Rmd file in your GitHub repository and the URL for your rpubs.com web page.

Project

Setup R

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(RMySQL)      # Older MysQL library (note: I could NOT get the RMariaDB lib to work)

## Loading required package: DBI

library(formattable) # Pretty print tables

Create Database

I created a MySQL database and loaded the table data from above. Here is the DB structure behind the scenes:

airline data

SQL source file: airline_data.sql

Load Data

We first want to check if the data has already been extracted from our DB and cached as a local .csv file. If the csv file is available, use that. If it’s not, then we will load the data from our DB and cache as a local csv. This allows me to develop the RMarkdown locally, where I have access to the DB, then publish and allow others to run the code without access the the DB.

Note: I created several .env variables (MYSQL_USER, MYSQL_PASSWORD, and MYSQL_HOST) in my .Renviron file in my home folder. This prevents leaking sensitive info in code.

# cached local airline csv data file
csv_fn <- "airline_data.csv"

# Some variables to help with flow control
foundCSV <- FALSE
dbError <- FALSE

# Check if we have a local copy of the data available to load
if(file.exists(csv_fn)) {
  msg <- paste('Using locally cached', csv_fn)
  foundCSV <- TRUE
} else {
  msg <- paste('Cached copy of', csv_fn, 'not found.  Try loading from DB ...')
}

print(msg)

## [1] "Using locally cached airline_data.csv"

# The local csv wasn't found - try loading the data from MySQL and saving cached copy for future runs.
if (!foundCSV) {

  # Connect to our MySQL DB
  con <- dbConnect(RMySQL::MySQL(), user=Sys.getenv("MYSQL_USER"), password=Sys.getenv("MYSQL_PASSWORD"), host=Sys.getenv("MYSQL_HOST"))

  # Attempt to connect to DATA607 DB
  result = tryCatch({
      res <- dbSendQuery(con, 'USE DATA607;')
  }, error = function(e) {
      dbError <- TRUE
  })

  # If we succcessfully connected to the DB (no errors), then attempt to query data
  if (!dbError) {
    # Load the airline data into a local DF
    sql <- "SELECT * FROM airline_data ORDER BY id;"
    res <- dbSendQuery(con, sql)

    airline_data_df <- dbFetch(res)
    
    # Save the data to a local file
    write.csv(airline_data_df,csv_fn, row.names=FALSE )
  }

# CSV file found - load into a DF    
} else {
  airline_data_df <- read.csv(csv_fn, header=TRUE)
}

# Did we encounter any errors?
if (dbError & !foundCSV) {
  msg <- "We encountered an error and couldn't load the airline data."
} else {
  msg <- "Data loaded."
}

print(msg)

## [1] "Data loaded."

Verify Airline Data

airline_data <- as_tibble(airline_data_df)
formattable(airline_data)

id	airline	status	city_los_angeles	city_phoenix	city_san_diego	city_san_francisco	city_seattle
1	ALASKA	on time	497	221	212	503	1841
2	ALASKA	delayed	62	12	20	102	305
3	AM WEST	on time	694	4840	383	320	201
4	AM WEST	delayed	117	415	65	129	61

Tidy & Transform

# Steps:
# 1. use gather() to convert from wide to long and turn city columns in to a variable
# 2. use select() to remove the unecessary id column brought in from the DB
# 3. use spread() to widen status into columns (delayed and `on time`)
airline_data2 <- airline_data %>% 
  gather(city, count, -id, -airline, -status) %>% 
  select(airline, status, city, count) %>% 
  spread(status, count)

formattable(airline_data2)

airline	city	delayed	on time
ALASKA	city_los_angeles	62	497
ALASKA	city_phoenix	12	221
ALASKA	city_san_diego	20	212
ALASKA	city_san_francisco	102	503
ALASKA	city_seattle	305	1841
AM WEST	city_los_angeles	117	694
AM WEST	city_phoenix	415	4840
AM WEST	city_san_diego	65	383
AM WEST	city_san_francisco	129	320
AM WEST	city_seattle	61	201

Add Additional Features

Add summary columns for delay_rate and ontime_rate for each Airline/City

airline_data3 <- airline_data2 %>%
  mutate(delay_rate = delayed / (delayed + `on time`),
         ontime_rate = `on time` / (delayed + `on time`),
         flights = delayed + `on time`)
formattable(airline_data3)

airline	city	delayed	on time	delay_rate	ontime_rate	flights
ALASKA	city_los_angeles	62	497	0.11091234	0.8890877	559
ALASKA	city_phoenix	12	221	0.05150215	0.9484979	233
ALASKA	city_san_diego	20	212	0.08620690	0.9137931	232
ALASKA	city_san_francisco	102	503	0.16859504	0.8314050	605
ALASKA	city_seattle	305	1841	0.14212488	0.8578751	2146
AM WEST	city_los_angeles	117	694	0.14426634	0.8557337	811
AM WEST	city_phoenix	415	4840	0.07897241	0.9210276	5255
AM WEST	city_san_diego	65	383	0.14508929	0.8549107	448
AM WEST	city_san_francisco	129	320	0.28730512	0.7126949	449
AM WEST	city_seattle	61	201	0.23282443	0.7671756	262

Analysis

Delays by Airline (Overall)

I’m curious overall how the airlines performed.

# Group the data by airline
airline_data4 <- airline_data3 %>%
  select(airline, delay_rate, ontime_rate) %>%
  group_by(airline) 

# Calculate the mean of the rates over cities
tbl <- airline_data4 %>% 
  summarize(delay = mean(delay_rate), ontime=mean(ontime_rate)) %>%
  arrange(desc(delay))

# Print the table
formattable(tbl)

airline	delay	ontime
AM WEST	0.1776915	0.8223085
ALASKA	0.1118683	0.8881317

# Barplot
ggplot(airline_data4, aes(x=reorder(airline, -delay_rate), y=delay_rate, fill=airline)) +
  ggtitle("Delay Rate by Airline (All Cities)") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_bar(position='dodge', stat="identity") + 
  xlab('Airline') + 
  ylab('Percent of Flights Delayed')

Delays by City (Overall)

Let’s see if there are differences in flight delays by city.

# Group the data by airline
airline_data4 <- airline_data3 %>%
  select(airline, city, delay_rate, ontime_rate) %>%
  group_by(city) 

# Calculate the mean of the rates over airlines
tbl <- airline_data4 %>% 
  summarize(delay = mean(delay_rate), ontime=mean(ontime_rate)) %>%
  arrange(desc(delay))

# Print the table
formattable(tbl)

city	delay	ontime
city_san_francisco	0.22795008	0.7720499
city_seattle	0.18747466	0.8125253
city_los_angeles	0.12758934	0.8724107
city_san_diego	0.11564809	0.8843519
city_phoenix	0.06523728	0.9347627

# barplot
ggplot(airline_data4, aes(x=reorder(city, -delay_rate), y=delay_rate, fill=city)) +
  ggtitle("Delay Rate by City (All Airlines)") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_bar(position='dodge', stat="identity") + 
  xlab('City') + 
  ylab('Percent of Flights Delayed')

Delays by Airline & City

Let’s look side-by-side for each city and see how the airlines performed against each other.

# Paired bar plot
ggplot(airline_data3, aes(fill=airline, y=delay_rate, x=reorder(city, -delay_rate))) +
  ggtitle("Delay Rate by City") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_bar(position='dodge', stat="identity") +
  xlab('City') + 
  ylab('Percent of Flights Delayed')

Delays vs Total Flights

I wonder if delays just track with overall traffic, i.e. as # of flights increase, there are more delays? Each point represents a single airpoint and we are plotting the delay rate vs the total number of flights by that airline into each airport.

ggplot(airline_data3, aes(y=delay_rate, x=flights, color=airline)) +
  ggtitle("Delay vs Total Flights") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  geom_point() +
  xlab('Total Flights') + 
  ylab('Delay Rate')

Conclusions

We see that overall AM WEST has significantly higher delays across all cities compared with ALASKA airlines.
There is a clear difference in delay rate by city ranging from ~8% up to ~30%.
There doesn’t seem to be any correleation between the total number of flights by an airline into a specific airport and the percent of delayed planes they experience.