Goal: To compare the flight delay across five destinations for Alaska and AM West Airlines. Initiated analysis by loading the csv file to GITHUB for global access. Then reading the file and converting it into a Dataframe for easy manupulation
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
## Warning: package 'stringr' was built under R version 3.2.2
flight_del <- read.csv("https://raw.githubusercontent.com/ksanju0/IS607/master/flightdet.csv",header=TRUE,sep=",",fill = TRUE)
hflights_df<-tbl_df(flight_del)
Found that 2 of the columns doesn’t have any names, so assigned column names using rename() functions
rename <- function(dat, oldnames, newnames) {
datnames <- colnames(dat)
datnames[which(datnames %in% oldnames)] <- newnames
colnames(dat) <- datnames
dat
}
hflights_df<- rename(hflights_df,c("X","X.1"), c("flightName","status"))
Now, raw dataset needs to be transformed to Wide structure so that we can fetch and manipulate the data easily using tidyr and dplyr. So initiated data cleaning and tranformation using gather() which widens the destinaton and numbers related with delayed and On time flights. After gather(), applied spread() to convert On time and delayed as columns. Then, gt erroe due to “on Time” column name which has gap in between. So I again used the rename function to change the column to “onTime”. This was used then to calculate the percentDelayed as new column dataset using mutate()
hflights_df<-gather(hflights_df,"Destinations","n",3:7)
hflights_df<-spread(hflights_df,"status","n")
hflights_df<- rename(hflights_df,c("on time"),c("onTime"))
hflights_df<-mutate(hflights_df, percentDelayed=(delayed/(onTime+delayed))*100)
hflights_df
## Source: local data frame [10 x 5]
##
## flightName Destinations delayed onTime percentDelayed
## (fctr) (fctr) (int) (int) (dbl)
## 1 ALASKA Los.Angeles 62 497 11.091234
## 2 ALASKA Phoenix 12 221 5.150215
## 3 ALASKA San.Diego 20 212 8.620690
## 4 ALASKA San.Francisco 102 503 16.859504
## 5 ALASKA Seattle 305 1841 14.212488
## 6 AM WEST Los.Angeles 117 694 14.426634
## 7 AM WEST Phoenix 415 4840 7.897241
## 8 AM WEST San.Diego 65 383 14.508929
## 9 AM WEST San.Francisco 129 320 28.730512
## 10 AM WEST Seattle 61 201 23.282443
Now, I have all the required data fields in wide formate to visualize the data for airlines per destination. This will help to compare and assess the delay per destination per airlines. Used ggplot2 to provide a histogram chart to compare the most delayed flights per destination as % delayed
s <- ggplot(data = hflights_df, aes(x = Destinations, y = percentDelayed, fill = flightName))
s + geom_histogram(stat = "identity", position = "dodge") + ggtitle("Flight Delay by Destinations and Airlines") +
xlab("Destinations") + ylab("% Delayed Flights by Destinations")
select(hflights_df,flightName, Destinations, percentDelayed)
## Source: local data frame [10 x 3]
##
## flightName Destinations percentDelayed
## (fctr) (fctr) (dbl)
## 1 ALASKA Los.Angeles 11.091234
## 2 ALASKA Phoenix 5.150215
## 3 ALASKA San.Diego 8.620690
## 4 ALASKA San.Francisco 16.859504
## 5 ALASKA Seattle 14.212488
## 6 AM WEST Los.Angeles 14.426634
## 7 AM WEST Phoenix 7.897241
## 8 AM WEST San.Diego 14.508929
## 9 AM WEST San.Francisco 28.730512
## 10 AM WEST Seattle 23.282443
Conclusion: AM West is delayed the most in San francisco and Seattle as 28% and 23%. Flight delays are least in Phoenix for both the airlines.