Source files: https://github.com/djlofland/DATA607_F2019/tree/master/Assignment5

Assignment

Airline Data

Airline Data

The chart above describes arrival delays for two airlines across five destinations. Your task is to:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

  3. Perform analysis to compare the arrival delays for the two airlines.

  4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: The URL to the .Rmd file in your GitHub repository and the URL for your rpubs.com web page.

Project

Setup R

## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Loading required package: DBI

Create Database

I created a MySQL database and loaded the table data from above. Here is the DB structure behind the scenes:

airline data

airline data

SQL source file: airline_data.sql

Load Data

We first want to check if the data has already been extracted from our DB and cached as a local .csv file. If the csv file is available, use that. If it’s not, then we will load the data from our DB and cache as a local csv. This allows me to develop the RMarkdown locally, where I have access to the DB, then publish and allow others to run the code without access the the DB.

Note: I created several .env variables (MYSQL_USER, MYSQL_PASSWORD, and MYSQL_HOST) in my .Renviron file in my home folder. This prevents leaking sensitive info in code.

## [1] "Using locally cached airline_data.csv"
## [1] "Data loaded."

Verify Airline Data

id airline status city_los_angeles city_phoenix city_san_diego city_san_francisco city_seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
3 AM WEST on time 694 4840 383 320 201
4 AM WEST delayed 117 415 65 129 61

Tidy & Transform

airline city delayed on time
ALASKA city_los_angeles 62 497
ALASKA city_phoenix 12 221
ALASKA city_san_diego 20 212
ALASKA city_san_francisco 102 503
ALASKA city_seattle 305 1841
AM WEST city_los_angeles 117 694
AM WEST city_phoenix 415 4840
AM WEST city_san_diego 65 383
AM WEST city_san_francisco 129 320
AM WEST city_seattle 61 201

Add Additional Features

Add summary columns for delay_rate and ontime_rate for each Airline/City

airline city delayed on time delay_rate ontime_rate flights
ALASKA city_los_angeles 62 497 0.11091234 0.8890877 559
ALASKA city_phoenix 12 221 0.05150215 0.9484979 233
ALASKA city_san_diego 20 212 0.08620690 0.9137931 232
ALASKA city_san_francisco 102 503 0.16859504 0.8314050 605
ALASKA city_seattle 305 1841 0.14212488 0.8578751 2146
AM WEST city_los_angeles 117 694 0.14426634 0.8557337 811
AM WEST city_phoenix 415 4840 0.07897241 0.9210276 5255
AM WEST city_san_diego 65 383 0.14508929 0.8549107 448
AM WEST city_san_francisco 129 320 0.28730512 0.7126949 449
AM WEST city_seattle 61 201 0.23282443 0.7671756 262

Analysis

Delays vs Total Flights

I wonder if delays just track with overall traffic, i.e. as # of flights increase, there are more delays? Each point represents a single airpoint and we are plotting the delay rate vs the total number of flights by that airline into each airport.

Conclusions

  • We see that overall AM WEST has significantly higher delays across all cities compared with ALASKA airlines.
  • There is a clear difference in delay rate by city ranging from ~8% up to ~30%.
  • There doesn’t seem to be any correleation between the total number of flights by an airline into a specific airport and the percent of delayed planes they experience.