# Initialization
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(RCurl)
## Loading required package: bitops
##
## Attaching package: 'RCurl'
##
## The following object is masked from 'package:tidyr':
##
## complete
# Create a data frame. I did not use the optional MySQL database; I just created a standard CSV file.
airlineURL <- getURL("https://raw.githubusercontent.com/mfarris9505/Airline_Delays/master/AirlineDelays.csv")
airline <- read.csv(text=airlineURL)
airline
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska On_Time 497 221 212 503 1841
## 2 Alaska Delayed 62 12 20 102 305
## 3 AMWest On_Time 694 4840 383 320 201
## 4 AMWest Delayed 117 415 65 129 61
The data frame created is almost identical to the file on the homework page. A couple of key differences: a header was added for the two initial columns, and Alaska and AMWest were filled in on the 2nd and 4th rows. Honestly, this was done for simplicity's sake.
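As a side note, base R's read.csv() can also read directly from an https URL on recent R versions, so the RCurl step could be skipped entirely; a minimal alternative would be:
# read.csv() accepts a URL directly (recent R versions support https connections)
airline <- read.csv("https://raw.githubusercontent.com/mfarris9505/Airline_Delays/master/AirlineDelays.csv")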
Now, while the data set is understandable at a glance, it is not well suited for manipulation in R. First, we want our values in a tidier format. The best way to reshape the data is as follows:
Tidy_Airline <- airline %>%
  gather("City", "N", 3:7) %>%
  spread("Status", "N")
Breaking these two steps down, we can see the transition we made, and why. First, the gather() function was employed:
airline %>%
  gather("City", "N", 3:7)
## Airline Status City N
## 1 Alaska On_Time Los.Angeles 497
## 2 Alaska Delayed Los.Angeles 62
## 3 AMWest On_Time Los.Angeles 694
## 4 AMWest Delayed Los.Angeles 117
## 5 Alaska On_Time Phoenix 221
## 6 Alaska Delayed Phoenix 12
## 7 AMWest On_Time Phoenix 4840
## 8 AMWest Delayed Phoenix 415
## 9 Alaska On_Time San.Diego 212
## 10 Alaska Delayed San.Diego 20
## 11 AMWest On_Time San.Diego 383
## 12 AMWest Delayed San.Diego 65
## 13 Alaska On_Time San.Francisco 503
## 14 Alaska Delayed San.Francisco 102
## 15 AMWest On_Time San.Francisco 320
## 16 AMWest Delayed San.Francisco 129
## 17 Alaska On_Time Seattle 1841
## 18 Alaska Delayed Seattle 305
## 19 AMWest On_Time Seattle 201
## 20 AMWest Delayed Seattle 61
This first step gathered all of the values (the observations) for the number of flights and grouped them by their respective flight destinations. This is an example of melting the data set. Because the city itself was one of the variables of the data set, creating a new column called City was necessary. As you can see, however, this produced an excessive number of repetitive rows, which is both difficult to read and equally difficult to process in R. This is where the spread() function came into play: we could take the two values of Status, On_Time and Delayed, and create a separate column for each of them. The final manipulated data frame looks like this:
Tidy_Airline
## Airline City Delayed On_Time
## 1 Alaska Los.Angeles 62 497
## 2 Alaska Phoenix 12 221
## 3 Alaska San.Diego 20 212
## 4 Alaska San.Francisco 102 503
## 5 Alaska Seattle 305 1841
## 6 AMWest Los.Angeles 117 694
## 7 AMWest Phoenix 415 4840
## 8 AMWest San.Diego 65 383
## 9 AMWest San.Francisco 129 320
## 10 AMWest Seattle 61 201
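As a side note, newer versions of tidyr (1.0.0 and up) supersede gather() and spread() with pivot_longer() and pivot_wider(); if those functions are available, the same reshaping could be sketched as:
# Equivalent reshaping with the newer pivot verbs (tidyr >= 1.0.0)
Tidy_Airline2 <- airline %>%
  pivot_longer(cols = 3:7, names_to = "City", values_to = "N") %>%
  pivot_wider(names_from = Status, values_from = N)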
Now that the data is organized how we want, it is simple to do some analysis. For this project, we want to determine which airline has the most delays, and which location (based solely on these two airlines) has the most delays. The first calculation that I think is most relevant is the probability of a delayed flight. As we know from basic statistics, we can compute this as the percentage of delayed flights: the total number of delays divided by the total number of flights.
Tidy_Airline <- Tidy_Airline %>% mutate(Probability = Delayed/(Delayed + On_Time) )
Tidy_Airline
## Airline City Delayed On_Time Probability
## 1 Alaska Los.Angeles 62 497 0.11091234
## 2 Alaska Phoenix 12 221 0.05150215
## 3 Alaska San.Diego 20 212 0.08620690
## 4 Alaska San.Francisco 102 503 0.16859504
## 5 Alaska Seattle 305 1841 0.14212488
## 6 AMWest Los.Angeles 117 694 0.14426634
## 7 AMWest Phoenix 415 4840 0.07897241
## 8 AMWest San.Diego 65 383 0.14508929
## 9 AMWest San.Francisco 129 320 0.28730512
## 10 AMWest Seattle 61 201 0.23282443
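As a quick way to see which airline/city combinations fare worst, the rows could also be sorted by this probability; a small sketch using dplyr's arrange():
# Sort from most to least delay-prone combination
Tidy_Airline %>%
  arrange(desc(Probability))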
We can now do some analysis, the first of which is to compare the two airlines by averaging their per-city delay probabilities. This could be done as follows:
Tidy_Airline %>% group_by(Airline) %>%
  summarise(Mean = mean(Probability))
## Source: local data frame [2 x 2]
##
## Airline Mean
## (fctr) (dbl)
## 1 Alaska 0.1118683
## 2 AMWest 0.1776915
From this summary we can see that Alaska Airlines flights are less likely to be delayed flying to these five cities.
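To check that this holds for each city individually, and not just on average, the two airlines' probabilities could be placed side by side; a sketch reusing spread() (the Difference column is just an illustrative name):
# One row per city, with each airline's delay probability as a column
Tidy_Airline %>%
  select(Airline, City, Probability) %>%
  spread(Airline, Probability) %>%
  mutate(Difference = AMWest - Alaska)
In this data set every Difference comes out positive, meaning AMWest's delay rate is higher in all five cities.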
We could also find the probability of having a delayed flight by city (assuming these are the only two airlines flying to a particular destination). This can be seen as follows:
Tidy_Airline %>% group_by(City) %>%
  summarise(Mean = mean(Probability))
## Source: local data frame [5 x 2]
##
## City Mean
## (fctr) (dbl)
## 1 Los.Angeles 0.12758934
## 2 Phoenix 0.06523728
## 3 San.Diego 0.11564809
## 4 San.Francisco 0.22795008
## 5 Seattle 0.18747466
As we can see from this data, the cities with the most delays are Seattle and San Francisco, while Phoenix is relatively low by comparison. One possible conclusion is that most delays are weather related. A brief Google search reveals that Phoenix has among the fewest cloudy days of any major city, while both Seattle and San Francisco top 100 days or more of fog and clouds per year. It should be noted, however, that Seattle has nearly double the cloudy days of San Francisco but does not have a higher rate of delay. This suggests a possible correlation between cloudy days and delays, but not causation.
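Finally, if a visual comparison is preferred, the per-city probabilities could be plotted as a grouped bar chart; a sketch assuming the ggplot2 package is installed:
library(ggplot2)
# Grouped bars: one bar per airline within each city
ggplot(Tidy_Airline, aes(x = City, y = Probability, fill = Airline)) +
  geom_col(position = "dodge") +
  labs(title = "Probability of a Delayed Flight by City and Airline",
       y = "P(Delayed)")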