Source files: https://github.com/djlofland/DATA607_F2019/tree/master/Assignment5
Airline Data
The chart above describes arrival delays for two airlines across five destinations. Your task is to:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: The URL to the .Rmd file in your GitHub repository and the URL for your rpubs.com web page.
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Loading required package: DBI
I created a MySQL database and loaded the table data from above. Here is the DB structure behind the scenes:
airline data
SQL source file: airline_data.sql
We first want to check if the data has already been extracted from our DB and cached as a local .csv file. If the csv file is available, use that. If it’s not, then we will load the data from our DB and cache as a local csv. This allows me to develop the RMarkdown locally, where I have access to the DB, then publish and allow others to run the code without access the the DB.
Note: I created several .env variables (MYSQL_USER, MYSQL_PASSWORD, and MYSQL_HOST) in my .Renviron file in my home folder. This prevents leaking sensitive info in code.
# cached local airline csv data file
csv_fn <- "airline_data.csv"
# Some variables to help with flow control
foundCSV <- FALSE
dbError <- FALSE
# Check if we have a local copy of the data available to load
if(file.exists(csv_fn)) {
msg <- paste('Using locally cached', csv_fn)
foundCSV <- TRUE
} else {
msg <- paste('Cached copy of', csv_fn, 'not found. Try loading from DB ...')
}
print(msg)## [1] "Using locally cached airline_data.csv"
# The local csv wasn't found - try loading the data from MySQL and saving cached copy for future runs.
if (!foundCSV) {
# Connect to our MySQL DB
con <- dbConnect(RMySQL::MySQL(), user=Sys.getenv("MYSQL_USER"), password=Sys.getenv("MYSQL_PASSWORD"), host=Sys.getenv("MYSQL_HOST"))
# Attempt to connect to DATA607 DB
result = tryCatch({
res <- dbSendQuery(con, 'USE DATA607;')
}, error = function(e) {
dbError <- TRUE
})
# If we succcessfully connected to the DB (no errors), then attempt to query data
if (!dbError) {
# Load the airline data into a local DF
sql <- "SELECT * FROM airline_data ORDER BY id;"
res <- dbSendQuery(con, sql)
airline_data_df <- dbFetch(res)
# Save the data to a local file
write.csv(airline_data_df,csv_fn, row.names=FALSE )
}
# CSV file found - load into a DF
} else {
airline_data_df <- read.csv(csv_fn, header=TRUE)
}
# Did we encounter any errors?
if (dbError & !foundCSV) {
msg <- "We encountered an error and couldn't load the airline data."
} else {
msg <- "Data loaded."
}
print(msg)## [1] "Data loaded."
| id | airline | status | city_los_angeles | city_phoenix | city_san_diego | city_san_francisco | city_seattle |
|---|---|---|---|---|---|---|---|
| 1 | ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| 2 | ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| 3 | AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| 4 | AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
# Steps:
# 1. use gather() to convert from wide to long and turn city columns in to a variable
# 2. use select() to remove the unecessary id column brought in from the DB
# 3. use spread() to widen status into columns (delayed and `on time`)
airline_data2 <- airline_data %>%
gather(city, count, -id, -airline, -status) %>%
select(airline, status, city, count) %>%
spread(status, count)
formattable(airline_data2)| airline | city | delayed | on time |
|---|---|---|---|
| ALASKA | city_los_angeles | 62 | 497 |
| ALASKA | city_phoenix | 12 | 221 |
| ALASKA | city_san_diego | 20 | 212 |
| ALASKA | city_san_francisco | 102 | 503 |
| ALASKA | city_seattle | 305 | 1841 |
| AM WEST | city_los_angeles | 117 | 694 |
| AM WEST | city_phoenix | 415 | 4840 |
| AM WEST | city_san_diego | 65 | 383 |
| AM WEST | city_san_francisco | 129 | 320 |
| AM WEST | city_seattle | 61 | 201 |
Add summary columns for delay_rate and ontime_rate for each Airline/City
airline_data3 <- airline_data2 %>%
mutate(delay_rate = delayed / (delayed + `on time`),
ontime_rate = `on time` / (delayed + `on time`),
flights = delayed + `on time`)
formattable(airline_data3)| airline | city | delayed | on time | delay_rate | ontime_rate | flights |
|---|---|---|---|---|---|---|
| ALASKA | city_los_angeles | 62 | 497 | 0.11091234 | 0.8890877 | 559 |
| ALASKA | city_phoenix | 12 | 221 | 0.05150215 | 0.9484979 | 233 |
| ALASKA | city_san_diego | 20 | 212 | 0.08620690 | 0.9137931 | 232 |
| ALASKA | city_san_francisco | 102 | 503 | 0.16859504 | 0.8314050 | 605 |
| ALASKA | city_seattle | 305 | 1841 | 0.14212488 | 0.8578751 | 2146 |
| AM WEST | city_los_angeles | 117 | 694 | 0.14426634 | 0.8557337 | 811 |
| AM WEST | city_phoenix | 415 | 4840 | 0.07897241 | 0.9210276 | 5255 |
| AM WEST | city_san_diego | 65 | 383 | 0.14508929 | 0.8549107 | 448 |
| AM WEST | city_san_francisco | 129 | 320 | 0.28730512 | 0.7126949 | 449 |
| AM WEST | city_seattle | 61 | 201 | 0.23282443 | 0.7671756 | 262 |
I’m curious overall how the airlines performed.
# Group the data by airline
airline_data4 <- airline_data3 %>%
select(airline, delay_rate, ontime_rate) %>%
group_by(airline)
# Calculate the mean of the rates over cities
tbl <- airline_data4 %>%
summarize(delay = mean(delay_rate), ontime=mean(ontime_rate)) %>%
arrange(desc(delay))
# Print the table
formattable(tbl)| airline | delay | ontime |
|---|---|---|
| AM WEST | 0.1776915 | 0.8223085 |
| ALASKA | 0.1118683 | 0.8881317 |
# Barplot
ggplot(airline_data4, aes(x=reorder(airline, -delay_rate), y=delay_rate, fill=airline)) +
ggtitle("Delay Rate by Airline (All Cities)") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_bar(position='dodge', stat="identity") +
xlab('Airline') +
ylab('Percent of Flights Delayed')Let’s see if there are differences in flight delays by city.
# Group the data by airline
airline_data4 <- airline_data3 %>%
select(airline, city, delay_rate, ontime_rate) %>%
group_by(city)
# Calculate the mean of the rates over airlines
tbl <- airline_data4 %>%
summarize(delay = mean(delay_rate), ontime=mean(ontime_rate)) %>%
arrange(desc(delay))
# Print the table
formattable(tbl)| city | delay | ontime |
|---|---|---|
| city_san_francisco | 0.22795008 | 0.7720499 |
| city_seattle | 0.18747466 | 0.8125253 |
| city_los_angeles | 0.12758934 | 0.8724107 |
| city_san_diego | 0.11564809 | 0.8843519 |
| city_phoenix | 0.06523728 | 0.9347627 |
# barplot
ggplot(airline_data4, aes(x=reorder(city, -delay_rate), y=delay_rate, fill=city)) +
ggtitle("Delay Rate by City (All Airlines)") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_bar(position='dodge', stat="identity") +
xlab('City') +
ylab('Percent of Flights Delayed')Let’s look side-by-side for each city and see how the airlines performed against each other.
# Paired bar plot
ggplot(airline_data3, aes(fill=airline, y=delay_rate, x=reorder(city, -delay_rate))) +
ggtitle("Delay Rate by City") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_bar(position='dodge', stat="identity") +
xlab('City') +
ylab('Percent of Flights Delayed')I wonder if delays just track with overall traffic, i.e. as # of flights increase, there are more delays? Each point represents a single airpoint and we are plotting the delay rate vs the total number of flights by that airline into each airport.
ggplot(airline_data3, aes(y=delay_rate, x=flights, color=airline)) +
ggtitle("Delay vs Total Flights") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_point() +
xlab('Total Flights') +
ylab('Delay Rate')