Data 607 Assignment 5

Ojbectives

Transform and tidy the airline data

Perform an analysis of flight delays

Initialization

library(RMySQL)

## Loading required package: DBI

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

library(ggplot2)
library(pander)
library(stringr)
library(htmlTable)

Create untidy dataset

query <- 'select * from flights'
con <- dbConnect(MySQL(), user = 'guest_user', password = 'datascience', dbname = 'airports', host = 'localhost')
untidy <- dbGetQuery(con, query)

Begin Cleaning

pander(head(untidy))

airline	status	los_angeles	phoenix	san_diego	san_francisco	seattle
Alaska	on time	497	221	212	503	1841
	delayed	62	12	20	102	305
AM West	on time	694	4840	383	320	201
	delayed	117	415	65	129	61

rows <- nrow(untidy)
untidy$status <-  untidy$status %>% lapply(function (x) str_replace_all(x,' ', '_'))
#We want the statuses to make better column names, so they can be referenced with the "$" operator or used in a relational DB

for (i in seq (2,rows,2)){
  untidy[i,1] <- untidy[i-1,1]
}
#The airline only appears as a header.  All the null columns beneath are assumed to be the same airline.  This is common when data is displayed, but not good for storing data.

Transpose data

We want the dataframe to conform to the tidy format, so the airport and count become variables, while locations become rows. We also split the status column into separate variables, as the count of on time and delayed represent different quantities. Finally, we want to create variables for total flights and ratio delayed for further analysis

tidier <- untidy %>% gather( 'Location', 'count', 3:7) %>% spread('status', 'count')
tidier$Location <-  tidier$Location %>% lapply( function (x) str_to_title(str_replace_all(x,'_', ' '))) %>% unlist()
#this no longer needs to be one word and can be propcased
tidy <- tidier %>% mutate(total_flights = delayed + on_time) %>% mutate (ratio_delayed = delayed/total_flights)

pander(tidy)

airline	Location	on_time	delayed	total_flights	ratio_delayed
Alaska	Los Angeles	497	62	559	0.1109
Alaska	Phoenix	221	12	233	0.0515
Alaska	San Diego	212	20	232	0.08621
Alaska	San Francisco	503	102	605	0.1686
Alaska	Seattle	1841	305	2146	0.1421
AM West	Los Angeles	694	117	811	0.1443
AM West	Phoenix	4840	415	5255	0.07897
AM West	San Diego	383	65	448	0.1451
AM West	San Francisco	320	129	449	0.2873
AM West	Seattle	201	61	262	0.2328

Exploratory analysis

by_airline <- tidy %>% group_by(airline) %>% 
  summarize(total_delayed = sum(delayed), total_flights = sum(total_flights), ratio = sum(delayed)/sum(total_flights))
#Compare the overal ratios

pander(data.frame(by_airline))

airline	total_delayed	total_flights	ratio
Alaska	501	3775	0.1327
AM West	787	7225	0.1089

#pander did not work with the tibble and in my opinion looks better than htmlTable

ggplot(tidy) + geom_bar(mapping = aes(x = Location, y = ratio_delayed, fill = airline), stat = "identity") +
   labs( title = "Delay Ratio by Location", y = "Proportion Delayed")

#compare the ratios by airport

ggplot(tidy) + geom_bar(mapping = aes(x = Location, y = ratio_delayed, fill = airline), stat = "identity", 
      position ="fill") + labs( title = "Relative Delay Ratios", y = "Proportion Delayed")

#the same chart as above, but with the ratio_delayed normalized.  Can be used to see if there is a difference in delay

Statistical tests

We want to prove there is a statistically signficant difference in delay activity between airlines

t.test(tidy[tidy$airline == 'AM West', "ratio_delayed"], tidy[tidy$airline == 'Alaska', "ratio_delayed"], paired = TRUE)

## 
##  Paired t-test
## 
## data:  tidy[tidy$airline == "AM West", "ratio_delayed"] and tidy[tidy$airline == "Alaska", "ratio_delayed"]
## t = 3.8037, df = 4, p-value = 0.01904
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.01777701 0.11386950
## sample estimates:
## mean of the differences 
##              0.06582325

by_airline %>% select(2:3) %>% chisq.test()

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  .
## X-squared = 10.456, df = 1, p-value = 0.001223

Both indicate a differance at a .05 level of significance. The chi squared test is has much more statistical power because it differentiates between many observations and few observations. For instance, the T test does not differentiate between 1 delay in 4 flights and 2500 delays in 10,000 flights. This can be seen through the much high p value with this data.