Tidying and Transforming Data

The chart above describes arrival delays for two airlines across five destinations.

Your task is to:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: ??? The URL to the .Rmd file in your GitHub repository. and ??? The URL for your rpubs.com web page.

1. MySQL Database

I created a MySQL database data607_wk5asg and entered the provided information into a table called delays. I used a dynamic query to format the data in a wide structure, similar to a pivot table in M.S. Excel.

library(RMySQL)
library(DBI)

localuserpassword <- "root"
con <- dbConnect(MySQL(), 
                 user='root', password=localuserpassword, 
                 dbname='data607_wk5asg', 
                 host='localhost')

dbReadTable(con, "delays")

##    id airline    stat          city    n
## 1   1  ALASKA on time   Los Angeles  497
## 2   1  ALASKA on time       Phoenix  221
## 3   1  ALASKA on time     San Diego  212
## 4   1  ALASKA on time San Francisco  503
## 5   1  ALASKA on time       Seattle 1841
## 6   1    <NA> delayed   Los Angeles   62
## 7   1    <NA> delayed       Phoenix   12
## 8   1    <NA> delayed     San Diego   20
## 9   1    <NA> delayed San Francisco  102
## 10  1    <NA> delayed       Seattle  305
## 11  2 AM WEST on time   Los Angeles  694
## 12  2 AM WEST on time       Phoenix 4840
## 13  2 AM WEST on time     San Diego  383
## 14  2 AM WEST on time San Francisco  320
## 15  2 AM WEST on time       Seattle  201
## 16  2    <NA> delayed   Los Angeles  117
## 17  2    <NA> delayed       Phoenix  415
## 18  2    <NA> delayed     San Diego   65
## 19  2    <NA> delayed San Francisco  129
## 20  2    <NA> delayed       Seattle   61

dbDisconnect(con)

## [1] TRUE

2. Import `.CSV` file into R & use `tidyr` and `dplyr` to tidy and transform data.

library(stringr)

delayscsv <- read.csv(paste0("/Users/15082/Documents/CUNY_MSDS/DATA607/delayquery.csv"), stringsAsFactors = F) 
delayscsv$stat <- str_replace_all(delayscsv$stat, "on time", "on_time")  
head(delayscsv)

##   airline    stat Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on_time         497     221       212           503    1841
## 2    NULL delayed          62      12        20           102     305
## 3 AM WEST on_time         694    4840       383           320     201
## 4    NULL delayed         117     415        65           129      61

library(tidyr)
library(dplyr)

delays <- delayscsv %>%  
  mutate(airline=c("ALASKA", "ALASKA", "AM WEST", "AM WEST")) %>% 
  gather(city, n, 3:7) %>%
  arrange(airline, desc(stat), city, n)

knitr::kable(delays, caption = 'Airline Delays Dataset', format = "html") %>%
  kable_styling(bootstrap_options = "condensed", full_width = F, position = "left") %>%
  row_spec(row = 0:0, background = "#D4E0F7") %>%
  column_spec(column = 1, bold = T)

Airline Delays Dataset
airline	stat	city	n
ALASKA	on_time	Los.Angeles	497
ALASKA	on_time	Phoenix	221
ALASKA	on_time	San.Diego	212
ALASKA	on_time	San.Francisco	503
ALASKA	on_time	Seattle	1841
ALASKA	delayed	Los.Angeles	62
ALASKA	delayed	Phoenix	12
ALASKA	delayed	San.Diego	20
ALASKA	delayed	San.Francisco	102
ALASKA	delayed	Seattle	305
AM WEST	on_time	Los.Angeles	694
AM WEST	on_time	Phoenix	4840
AM WEST	on_time	San.Diego	383
AM WEST	on_time	San.Francisco	320
AM WEST	on_time	Seattle	201
AM WEST	delayed	Los.Angeles	117
AM WEST	delayed	Phoenix	415
AM WEST	delayed	San.Diego	65
AM WEST	delayed	San.Francisco	129
AM WEST	delayed	Seattle	61

To avoid future syntax errors, I first used the stringr package to replace the space in value “on time” with a “_" character. Using the dplyr package, I then applied the mutate and c() function to overwrite the airline null values with their proper names. Last, I used the gather function from tidyr to combind the multiple city columns into one field.

3. Analysis of arrival delays for the two airlines.

status <- delays %>%
  group_by(airline) %>% 
  group_by(airline, stat) %>% 
  mutate(id = row_number()) %>% 
  spread(stat, n) %>% 
  select(-id, -city) %>% 
  summarise_all(funs(sum)) %>% 
  group_by(airline) %>% 
  mutate(total = sum(delayed) + sum(on_time)) %>% 
  summarise(delay_ratio = delayed / total, on_time_ratio = on_time / total, total_flights = total)

citystatus <- delays %>% 
  unite(city_airline, city, airline, sep = "_") %>% 
  arrange(desc(city_airline)) %>% 
  group_by(city_airline) %>% 
  group_by(city_airline, stat) %>% 
  mutate(id = row_number()) %>% 
  spread(stat, n) %>% 
  select(-id) %>% 
  summarise_all(funs(sum)) %>% 
  group_by(city_airline) %>% 
  mutate(total = sum(delayed) + sum(on_time)) %>% 
  summarise(delay_ratio = delayed / total, on_time_ratio = on_time / total, total_flights = total)

I again used the dplyr and tidyr packages to calculate the percentages of delayed and on time arrivals for each airline.
- The group_by function was used to collapse repeated data within the identified columns.
- The mutate function was used to append new columns of data to allow for further synthesis.
- The spread function was used to separate the status of airline flights.
- The select function was used in conjunction with the - indicator to remove columns from the selection.
- The summarise function was used to calculate the ratio of delays and on_time arrivals into a single row of values for each element.

Airline Status
airline	delay_ratio	on_time_ratio	total_flights
ALASKA	0.1327152	0.8672848	3775
AM WEST	0.1089273	0.8910727	7225

Airline Status by City
city_airline	delay_ratio	on_time_ratio	total_flights
Los.Angeles_ALASKA	0.1109123	0.8890877	559
Los.Angeles_AM WEST	0.1442663	0.8557337	811
Phoenix_ALASKA	0.0515021	0.9484979	233
Phoenix_AM WEST	0.0789724	0.9210276	5255
San.Diego_ALASKA	0.0862069	0.9137931	232
San.Diego_AM WEST	0.1450893	0.8549107	448
San.Francisco_ALASKA	0.1685950	0.8314050	605
San.Francisco_AM WEST	0.2873051	0.7126949	449
Seattle_ALASKA	0.1421249	0.8578751	2146
Seattle_AM WEST	0.2328244	0.7671756	262

4. Conclusion.

In total, Alaska Airlines flights were found to arrive on-time 86.72%, compared to AM West, which arrived on-time 89.10%. However, if you look at the ratio’s in the citystatus table, you will see that AM West experiences higher delays at each city.

Tidying and Transforming Data

Juliann McEachern

September 30, 2018

1. MySQL Database

2. Import `.CSV` file into R & use `tidyr` and `dplyr` to tidy and transform data.

3. Analysis of arrival delays for the two airlines.

4. Conclusion.

Tidying and Transforming Data

Juliann McEachern

September 30, 2018

1. MySQL Database

2. Import .CSV file into R & use tidyr and dplyr to tidy and transform data.

3. Analysis of arrival delays for the two airlines.

4. Conclusion.

2. Import `.CSV` file into R & use `tidyr` and `dplyr` to tidy and transform data.