Goal: To compare the flight delay across five destinations for Alaska and AM West Airlines. Initiated analysis by loading the csv file to GITHUB for global access. Then reading the file and converting it into a Dataframe for easy manupulation

library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
## Warning: package 'stringr' was built under R version 3.2.2
flight_del <- read.csv("https://raw.githubusercontent.com/ksanju0/IS607/master/flightdet.csv",header=TRUE,sep=",",fill = TRUE)
hflights_df<-tbl_df(flight_del)

Found that 2 of the columns doesn’t have any names, so assigned column names using rename() functions

rename <- function(dat, oldnames, newnames) {
  datnames <- colnames(dat)
 datnames[which(datnames %in% oldnames)] <- newnames
 colnames(dat) <- datnames
 dat
}

hflights_df<- rename(hflights_df,c("X","X.1"), c("flightName","status"))

Now, raw dataset needs to be transformed to Wide structure so that we can fetch and manipulate the data easily using tidyr and dplyr. So initiated data cleaning and tranformation using gather() which widens the destinaton and numbers related with delayed and On time flights. After gather(), applied spread() to convert On time and delayed as columns. Then, gt erroe due to “on Time” column name which has gap in between. So I again used the rename function to change the column to “onTime”. This was used then to calculate the percentDelayed as new column dataset using mutate()

hflights_df<-gather(hflights_df,"Destinations","n",3:7)
hflights_df<-spread(hflights_df,"status","n")
hflights_df<- rename(hflights_df,c("on time"),c("onTime"))
hflights_df<-mutate(hflights_df, percentDelayed=(delayed/(onTime+delayed))*100)
hflights_df
## Source: local data frame [10 x 5]
## 
##    flightName  Destinations delayed onTime percentDelayed
##        (fctr)        (fctr)   (int)  (int)          (dbl)
## 1      ALASKA   Los.Angeles      62    497      11.091234
## 2      ALASKA       Phoenix      12    221       5.150215
## 3      ALASKA     San.Diego      20    212       8.620690
## 4      ALASKA San.Francisco     102    503      16.859504
## 5      ALASKA       Seattle     305   1841      14.212488
## 6     AM WEST   Los.Angeles     117    694      14.426634
## 7     AM WEST       Phoenix     415   4840       7.897241
## 8     AM WEST     San.Diego      65    383      14.508929
## 9     AM WEST San.Francisco     129    320      28.730512
## 10    AM WEST       Seattle      61    201      23.282443

Now, I have all the required data fields in wide formate to visualize the data for airlines per destination. This will help to compare and assess the delay per destination per airlines. Used ggplot2 to provide a histogram chart to compare the most delayed flights per destination as % delayed

s <- ggplot(data = hflights_df, aes(x = Destinations, y = percentDelayed, fill = flightName))
s + geom_histogram(stat = "identity", position = "dodge") + ggtitle("Flight Delay by Destinations and Airlines") +
xlab("Destinations") + ylab("% Delayed Flights by Destinations")

select(hflights_df,flightName, Destinations, percentDelayed)
## Source: local data frame [10 x 3]
## 
##    flightName  Destinations percentDelayed
##        (fctr)        (fctr)          (dbl)
## 1      ALASKA   Los.Angeles      11.091234
## 2      ALASKA       Phoenix       5.150215
## 3      ALASKA     San.Diego       8.620690
## 4      ALASKA San.Francisco      16.859504
## 5      ALASKA       Seattle      14.212488
## 6     AM WEST   Los.Angeles      14.426634
## 7     AM WEST       Phoenix       7.897241
## 8     AM WEST     San.Diego      14.508929
## 9     AM WEST San.Francisco      28.730512
## 10    AM WEST       Seattle      23.282443

Conclusion: AM West is delayed the most in San francisco and Seattle as 28% and 23%. Flight delays are least in Phoenix for both the airlines.