For this assignment, we are asked to create a .csv file(or MySQL database) describing arrival delays for two airlines across five destinations. We are encouraged to use a “wide” structure. We are then to read the file into R and use tidyr and dplyr to tidy and transform the data. Lastly, we should perform analysis to compare the arrival delays for the two airlines.

Step 1: We will load the our required libraries, then load the .csv file that we created using Excel and save as data frame. I’ll then do some work on tidying/reshaping the data.

require(tidyr)
## Loading required package: tidyr
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(ggplot2)
## Loading required package: ggplot2
flight.chart <- read.csv('https://raw.githubusercontent.com/komotunde/DATA607/master/Homework%203/flightchart.csv', header = TRUE, na.strings = "")
flight.chart <- data.frame(flight.chart)
flight.chart
##         X      X.1 Los.Angeles Phoenix San.Diego San.Fransicso Seattle
## 1  ALASKA  on time         497     221       212           503    1841
## 2    <NA>  delayed          62      12        20           102     305
## 3 AM WEST on time          694    4840       383           320     201
## 4    <NA>  delayed         117     415        65           129      61

My first task is to rename the columns, and fill in the NA’s with the Airline that’s above it.

require(zoo)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
colnames(flight.chart) <- c("Airline", "Status", "LA", "PHO", "San Diego", "San Fran.", "SEA")
flight.chart <- na.locf(flight.chart) #this is the only line that will require zoo.
flight.chart
##   Airline   Status  LA  PHO San Diego San Fran.  SEA
## 1  ALASKA  on time 497  221       212       503 1841
## 2  ALASKA  delayed  62   12        20       102  305
## 3 AM WEST on time  694 4840       383       320  201
## 4 AM WEST  delayed 117  415        65       129   61

Next is to reshape the data. I would like to have all the destinations(column 3-7) in a column and add an AMOUNT column for the number of times the flight was on time and delayed.

flight.chart <- gather(flight.chart, "Destination", "Amount", 3:7 )
colnames(flight.chart) <- c("Airline", "Status", "Destination", "Amount")
flight.chart <-dplyr::arrange(flight.chart, Airline)
flight.chart
##    Airline   Status Destination Amount
## 1   ALASKA  on time          LA    497
## 2   ALASKA  delayed          LA     62
## 3   ALASKA  on time         PHO    221
## 4   ALASKA  delayed         PHO     12
## 5   ALASKA  on time   San Diego    212
## 6   ALASKA  delayed   San Diego     20
## 7   ALASKA  on time   San Fran.    503
## 8   ALASKA  delayed   San Fran.    102
## 9   ALASKA  on time         SEA   1841
## 10  ALASKA  delayed         SEA    305
## 11 AM WEST on time           LA    694
## 12 AM WEST  delayed          LA    117
## 13 AM WEST on time          PHO   4840
## 14 AM WEST  delayed         PHO    415
## 15 AM WEST on time    San Diego    383
## 16 AM WEST  delayed   San Diego     65
## 17 AM WEST on time    San Fran.    320
## 18 AM WEST  delayed   San Fran.    129
## 19 AM WEST on time          SEA    201
## 20 AM WEST  delayed         SEA     61

I want to create an on time and delayed column with appropriate counts in each column.

flight.chart <- tidyr::spread(flight.chart, Status, Amount)
colnames(flight.chart) <- c("Airline", "Destination","Delayed","OnTime", "Unknown")
flight.chart   
##    Airline Destination Delayed OnTime Unknown
## 1   ALASKA          LA      62    497    <NA>
## 2   ALASKA         PHO      12    221    <NA>
## 3   ALASKA   San Diego      20    212    <NA>
## 4   ALASKA   San Fran.     102    503    <NA>
## 5   ALASKA         SEA     305   1841    <NA>
## 6  AM WEST          LA     117   <NA>     694
## 7  AM WEST         PHO     415   <NA>    4840
## 8  AM WEST   San Diego      65   <NA>     383
## 9  AM WEST   San Fran.     129   <NA>     320
## 10 AM WEST         SEA      61   <NA>     201
# I noticed that my result was split up so I did the following to fix it.
flight.chart$OnTime <- c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201)
flight.chart
##    Airline Destination Delayed OnTime Unknown
## 1   ALASKA          LA      62    497    <NA>
## 2   ALASKA         PHO      12    221    <NA>
## 3   ALASKA   San Diego      20    212    <NA>
## 4   ALASKA   San Fran.     102    503    <NA>
## 5   ALASKA         SEA     305   1841    <NA>
## 6  AM WEST          LA     117    694     694
## 7  AM WEST         PHO     415   4840    4840
## 8  AM WEST   San Diego      65    383     383
## 9  AM WEST   San Fran.     129    320     320
## 10 AM WEST         SEA      61    201     201
flight.chart$Unknown <- NULL
flight.chart
##    Airline Destination Delayed OnTime
## 1   ALASKA          LA      62    497
## 2   ALASKA         PHO      12    221
## 3   ALASKA   San Diego      20    212
## 4   ALASKA   San Fran.     102    503
## 5   ALASKA         SEA     305   1841
## 6  AM WEST          LA     117    694
## 7  AM WEST         PHO     415   4840
## 8  AM WEST   San Diego      65    383
## 9  AM WEST   San Fran.     129    320
## 10 AM WEST         SEA      61    201

Step 2: We will now perform some analysis on the data. My main goal is to figure out which airline is more on time overall.

I’ll need to ensure that my Delayed and On Time columns are numeric so I can do calculations on them

flight.chart$Delayed <- as.numeric(as.character((flight.chart$Delayed)))
flight.chart$OnTime <- as.numeric(as.character(flight.chart$OnTime))
flight.chart <- tbl_df(flight.chart)  #This was more so to confirm that those two columns were converted to numeric.
flight.chart
## # A tibble: 10 × 4
##    Airline Destination Delayed OnTime
## *    <chr>       <chr>   <dbl>  <dbl>
## 1   ALASKA          LA      62    497
## 2   ALASKA         PHO      12    221
## 3   ALASKA   San Diego      20    212
## 4   ALASKA   San Fran.     102    503
## 5   ALASKA         SEA     305   1841
## 6  AM WEST          LA     117    694
## 7  AM WEST         PHO     415   4840
## 8  AM WEST   San Diego      65    383
## 9  AM WEST   San Fran.     129    320
## 10 AM WEST         SEA      61    201

I wanted to create a column that would show the ratio of delays vs. on time for each trip. Just from skimming the latest output, we see that the Alaska flight to Phoenix had the lowest rate of delays, while the AM West flight to San Fransico had the highest delay rate.

flight.chart <- mutate(flight.chart, Ratio = (Delayed /(OnTime + Delayed)))
flight.chart <- arrange(flight.chart, desc(Ratio))
flight.chart
## # A tibble: 10 × 5
##    Airline Destination Delayed OnTime      Ratio
##      <chr>       <chr>   <dbl>  <dbl>      <dbl>
## 1  AM WEST   San Fran.     129    320 0.28730512
## 2  AM WEST         SEA      61    201 0.23282443
## 3   ALASKA   San Fran.     102    503 0.16859504
## 4  AM WEST   San Diego      65    383 0.14508929
## 5  AM WEST          LA     117    694 0.14426634
## 6   ALASKA         SEA     305   1841 0.14212488
## 7   ALASKA          LA      62    497 0.11091234
## 8   ALASKA   San Diego      20    212 0.08620690
## 9  AM WEST         PHO     415   4840 0.07897241
## 10  ALASKA         PHO      12    221 0.05150215

*In order to get a better view of which airline has more delays, I want to produce a grouped bar graph. I chose the destination to be my x-axis, and for my y-axis, I went with the delay ratio (delays/(delays+ontime)

ggplot(flight.chart,aes(x=Destination, y=Ratio, fill=factor(Airline))) +
  geom_bar(stat="identity",position="dodge") +
  scale_fill_brewer(palette = "Set1", name="Airline", breaks =c("ALASKA", "AM WEST"), labels=c("Alaska", "AM WEST")) +
  xlab("Destination") +
  ylab("Delayed Ratio")  

# from: http://stackoverflow.com/questions/17721126/simplest-way-to-do-grouped-barplot

*From this, we can easily see that overall, AM WEST has more delays than ALASKA. This is shown by every destination having a higher delay rate for AM wEST THAN Alaska.