The chart above describes arrival delays for two airlines across five destinations.

Your task is to:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
  3. Perform analysis to compare the arrival delays for the two airlines.
  4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: ??? The URL to the .Rmd file in your GitHub repository. and ??? The URL for your rpubs.com web page.

1. MySQL Database

I created a MySQL database data607_wk5asg and entered the provided information into a table called delays. I used a dynamic query to format the data in a wide structure, similar to a pivot table in M.S. Excel.

library(RMySQL)
library(DBI)

localuserpassword <- "root"
con <- dbConnect(MySQL(), 
                 user='root', password=localuserpassword, 
                 dbname='data607_wk5asg', 
                 host='localhost')

dbReadTable(con, "delays")
##    id airline    stat          city    n
## 1   1  ALASKA on time   Los Angeles  497
## 2   1  ALASKA on time       Phoenix  221
## 3   1  ALASKA on time     San Diego  212
## 4   1  ALASKA on time San Francisco  503
## 5   1  ALASKA on time       Seattle 1841
## 6   1    <NA> delayed   Los Angeles   62
## 7   1    <NA> delayed       Phoenix   12
## 8   1    <NA> delayed     San Diego   20
## 9   1    <NA> delayed San Francisco  102
## 10  1    <NA> delayed       Seattle  305
## 11  2 AM WEST on time   Los Angeles  694
## 12  2 AM WEST on time       Phoenix 4840
## 13  2 AM WEST on time     San Diego  383
## 14  2 AM WEST on time San Francisco  320
## 15  2 AM WEST on time       Seattle  201
## 16  2    <NA> delayed   Los Angeles  117
## 17  2    <NA> delayed       Phoenix  415
## 18  2    <NA> delayed     San Diego   65
## 19  2    <NA> delayed San Francisco  129
## 20  2    <NA> delayed       Seattle   61
dbDisconnect(con)
## [1] TRUE

2. Import .CSV file into R & use tidyr and dplyr to tidy and transform data.

library(stringr)

delayscsv <- read.csv(paste0("/Users/15082/Documents/CUNY_MSDS/DATA607/delayquery.csv"), stringsAsFactors = F) 
delayscsv$stat <- str_replace_all(delayscsv$stat, "on time", "on_time")  
head(delayscsv)
##   airline    stat Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on_time         497     221       212           503    1841
## 2    NULL delayed          62      12        20           102     305
## 3 AM WEST on_time         694    4840       383           320     201
## 4    NULL delayed         117     415        65           129      61
library(tidyr)
library(dplyr)

delays <- delayscsv %>%  
  mutate(airline=c("ALASKA", "ALASKA", "AM WEST", "AM WEST")) %>% 
  gather(city, n, 3:7) %>%
  arrange(airline, desc(stat), city, n)

knitr::kable(delays, caption = 'Airline Delays Dataset', format = "html") %>%
  kable_styling(bootstrap_options = "condensed", full_width = F, position = "left") %>%
  row_spec(row = 0:0, background = "#D4E0F7") %>%
  column_spec(column = 1, bold = T) 
Airline Delays Dataset
airline stat city n
ALASKA on_time Los.Angeles 497
ALASKA on_time Phoenix 221
ALASKA on_time San.Diego 212
ALASKA on_time San.Francisco 503
ALASKA on_time Seattle 1841
ALASKA delayed Los.Angeles 62
ALASKA delayed Phoenix 12
ALASKA delayed San.Diego 20
ALASKA delayed San.Francisco 102
ALASKA delayed Seattle 305
AM WEST on_time Los.Angeles 694
AM WEST on_time Phoenix 4840
AM WEST on_time San.Diego 383
AM WEST on_time San.Francisco 320
AM WEST on_time Seattle 201
AM WEST delayed Los.Angeles 117
AM WEST delayed Phoenix 415
AM WEST delayed San.Diego 65
AM WEST delayed San.Francisco 129
AM WEST delayed Seattle 61

To avoid future syntax errors, I first used the stringr package to replace the space in value “on time” with a “_" character. Using the dplyr package, I then applied the mutate and c() function to overwrite the airline null values with their proper names. Last, I used the gather function from tidyr to combind the multiple city columns into one field.

3. Analysis of arrival delays for the two airlines.

status <- delays %>%
  group_by(airline) %>% 
  group_by(airline, stat) %>% 
  mutate(id = row_number()) %>% 
  spread(stat, n) %>% 
  select(-id, -city) %>% 
  summarise_all(funs(sum)) %>% 
  group_by(airline) %>% 
  mutate(total = sum(delayed) + sum(on_time)) %>% 
  summarise(delay_ratio = delayed / total, on_time_ratio = on_time / total, total_flights = total)

citystatus <- delays %>% 
  unite(city_airline, city, airline, sep = "_") %>% 
  arrange(desc(city_airline)) %>% 
  group_by(city_airline) %>% 
  group_by(city_airline, stat) %>% 
  mutate(id = row_number()) %>% 
  spread(stat, n) %>% 
  select(-id) %>% 
  summarise_all(funs(sum)) %>% 
  group_by(city_airline) %>% 
  mutate(total = sum(delayed) + sum(on_time)) %>% 
  summarise(delay_ratio = delayed / total, on_time_ratio = on_time / total, total_flights = total)

I again used the dplyr and tidyr packages to calculate the percentages of delayed and on time arrivals for each airline.
- The group_by function was used to collapse repeated data within the identified columns.
- The mutate function was used to append new columns of data to allow for further synthesis.
- The spread function was used to separate the status of airline flights.
- The select function was used in conjunction with the - indicator to remove columns from the selection.
- The summarise function was used to calculate the ratio of delays and on_time arrivals into a single row of values for each element.

Airline Status
airline delay_ratio on_time_ratio total_flights
ALASKA 0.1327152 0.8672848 3775
AM WEST 0.1089273 0.8910727 7225
Airline Status by City
city_airline delay_ratio on_time_ratio total_flights
Los.Angeles_ALASKA 0.1109123 0.8890877 559
Los.Angeles_AM WEST 0.1442663 0.8557337 811
Phoenix_ALASKA 0.0515021 0.9484979 233
Phoenix_AM WEST 0.0789724 0.9210276 5255
San.Diego_ALASKA 0.0862069 0.9137931 232
San.Diego_AM WEST 0.1450893 0.8549107 448
San.Francisco_ALASKA 0.1685950 0.8314050 605
San.Francisco_AM WEST 0.2873051 0.7126949 449
Seattle_ALASKA 0.1421249 0.8578751 2146
Seattle_AM WEST 0.2328244 0.7671756 262

4. Conclusion.

In total, Alaska Airlines flights were found to arrive on-time 86.72%, compared to AM West, which arrived on-time 89.10%. However, if you look at the ratio’s in the citystatus table, you will see that AM West experiences higher delays at each city.