# Initialization
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(RCurl)
## Loading required package: bitops
##
## Attaching package: 'RCurl'
##
## The following object is masked from 'package:tidyr':
##
## complete
# Create a data frame. I did not use the optional MySQL database; I just created a standard CSV file.
airlineURL <- getURL("https://raw.githubusercontent.com/mfarris9505/Airline_Delays/master/AirlineDelays.csv")
airline <- read.csv(text=airlineURL)
airline
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska On_Time 497 221 212 503 1841
## 2 Alaska Delayed 62 12 20 102 305
## 3 AMWest On_Time 694 4840 383 320 201
## 4 AMWest Delayed 117 415 65 129 61
The data frame created is almost identical to the file on the homework page. A couple of key differences: a header was added for the two initial columns, and Alaska and AMWest were filled in on the 2nd and 4th rows. Honestly, this was done for simplicity's sake.
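As a side note, base R's read.csv() can also read directly from an https URL on recent R versions, so the RCurl step could be skipped entirely; a minimal alternative would be:
# read.csv() accepts a URL directly (recent R versions support https connections)
airline <- read.csv("https://raw.githubusercontent.com/mfarris9505/Airline_Delays/master/AirlineDelays.csv")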
Now, while the data set is understandable at a glance, it is not well suited for manipulation in R. First, we want our values in a tidier format. The best way to reshape the data is as follows:
Tidy_Airline <- airline %>%
  gather("City", "N", 3:7) %>%
  spread("Status", "N")
Breaking these two steps down, we can see the transition we made, and why. First, the gather() function was employed:
airline %>%
  gather("City", "N", 3:7)
## Airline Status City N
## 1 Alaska On_Time Los.Angeles 497
## 2 Alaska Delayed Los.Angeles 62
## 3 AMWest On_Time Los.Angeles 694
## 4 AMWest Delayed Los.Angeles 117
## 5 Alaska On_Time Phoenix 221
## 6 Alaska Delayed Phoenix 12
## 7 AMWest On_Time Phoenix 4840
## 8 AMWest Delayed Phoenix 415
## 9 Alaska On_Time San.Diego 212
## 10 Alaska Delayed San.Diego 20
## 11 AMWest On_Time San.Diego 383
## 12 AMWest Delayed San.Diego 65
## 13 Alaska On_Time San.Francisco 503
## 14 Alaska Delayed San.Francisco 102
## 15 AMWest On_Time San.Francisco 320
## 16 AMWest Delayed San.Francisco 129
## 17 Alaska On_Time Seattle 1841
## 18 Alaska Delayed Seattle 305
## 19 AMWest On_Time Seattle 201
## 20 AMWest Delayed Seattle 61
This first step gathered all of the values (the observations) for the number of flights and grouped them by their respective flight destinations. This is an example of melting the data set. Because the city itself was one of the variables of the data set, creating a new column called City was necessary. As you can see, however, this produced an excessive number of repetitive rows, which is both difficult to read and equally difficult to process in R. This is where the spread() function came into play: we could take the two values of Status, On_Time and Delayed, and create a separate column for each of them. The final manipulated data frame looks like this:
Tidy_Airline
## Airline City Delayed On_Time
## 1 Alaska Los.Angeles 62 497
## 2 Alaska Phoenix 12 221
## 3 Alaska San.Diego 20 212
## 4 Alaska San.Francisco 102 503
## 5 Alaska Seattle 305 1841
## 6 AMWest Los.Angeles 117 694
## 7 AMWest Phoenix 415 4840
## 8 AMWest San.Diego 65 383
## 9 AMWest San.Francisco 129 320
## 10 AMWest Seattle 61 201
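As a side note, newer versions of tidyr (1.0.0 and up) supersede gather() and spread() with pivot_longer() and pivot_wider(); if those functions are available, the same reshaping could be sketched as:
# Equivalent reshaping with the newer pivot verbs (tidyr >= 1.0.0)
Tidy_Airline2 <- airline %>%
  pivot_longer(cols = 3:7, names_to = "City", values_to = "N") %>%
  pivot_wider(names_from = Status, values_from = N)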
Now that the data is organized how we want, it is simple to do some analysis. For this project, we want to determine which airline has the most delays, and which location (based solely on these two airlines) has the most delays. The first calculation that I think is most relevant is the probability of a delayed flight. As we know from basic statistics, we can compute this as the percentage of delayed flights: the total number of delays divided by the total number of flights.
Tidy_Airline <- Tidy_Airline %>% mutate(Probability = Delayed/(Delayed + On_Time) )
Tidy_Airline
## Airline City Delayed On_Time Probability
## 1 Alaska Los.Angeles 62 497 0.11091234
## 2 Alaska Phoenix 12 221 0.05150215
## 3 Alaska San.Diego 20 212 0.08620690
## 4 Alaska San.Francisco 102 503 0.16859504
## 5 Alaska Seattle 305 1841 0.14212488
## 6 AMWest Los.Angeles 117 694 0.14426634
## 7 AMWest Phoenix 415 4840 0.07897241
## 8 AMWest San.Diego 65 383 0.14508929
## 9 AMWest San.Francisco 129 320 0.28730512
## 10 AMWest Seattle 61 201 0.23282443
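As a quick way to see which airline/city combinations fare worst, the rows could also be sorted by this probability; a small sketch using dplyr's arrange():
# Sort from most to least delay-prone combination
Tidy_Airline %>%
  arrange(desc(Probability))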
We can now do some analysis, the first of which is to compare the two airlines by averaging their per-city delay probabilities. This could be done as follows:
Tidy_Airline %>% group_by(Airline) %>%
  summarise(Mean = mean(Probability))
## Source: local data frame [2 x 2]
##
## Airline Mean
## (fctr) (dbl)
## 1 Alaska 0.1118683
## 2 AMWest 0.1776915
From this summary we can see that Alaska Airlines flights are less likely to be delayed flying to these five cities.
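To check that this holds for each city individually, and not just on average, the two airlines' probabilities could be placed side by side; a sketch reusing spread() (the Difference column is just an illustrative name):
# One row per city, with each airline's delay probability as a column
Tidy_Airline %>%
  select(Airline, City, Probability) %>%
  spread(Airline, Probability) %>%
  mutate(Difference = AMWest - Alaska)
In this data set every Difference comes out positive, meaning AMWest's delay rate is higher in all five cities.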
We could also find the probability of having a delayed flight by city (assuming these are the only two airlines flying to a particular destination). This can be seen as follows:
Tidy_Airline %>% group_by(City) %>%
  summarise(Mean = mean(Probability))
## Source: local data frame [5 x 2]
##
## City Mean
## (fctr) (dbl)
## 1 Los.Angeles 0.12758934
## 2 Phoenix 0.06523728
## 3 San.Diego 0.11564809
## 4 San.Francisco 0.22795008
## 5 Seattle 0.18747466
As we can see from this data, the cities with the most delays are Seattle and San Francisco, while Phoenix is relatively low by comparison. One possible conclusion is that most delays are weather related. A brief Google search reveals that Phoenix has among the fewest cloudy days of any major city, while both Seattle and San Francisco top 100 days or more of fog and clouds per year. It should be noted, however, that Seattle has nearly double the cloudy days of San Francisco but does not have a higher rate of delay. This suggests a possible correlation between cloudy days and delays, but not causation.
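Finally, if a visual comparison is preferred, the per-city probabilities could be plotted as a grouped bar chart; a sketch assuming the ggplot2 package is installed:
library(ggplot2)
# Grouped bars: one bar per airline within each city
ggplot(Tidy_Airline, aes(x = City, y = Probability, fill = Airline)) +
  geom_col(position = "dodge") +
  labs(title = "Probability of a Delayed Flight by City and Airline",
       y = "P(Delayed)")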