Introduction

The objective of this assignment is to practice tidying and transforming data using R studio or MySQL.

Software Requirement

To perform the task various libraries are needed for data transformation and visualization.

library(tidyr)
library(dplyr)
library(ggplot2)

Loading Data

A csv file of the data must be created and imported into R for analytical processing.

flight_timedelay= data.frame( 
  airlines = c('alaska', 'alaska','am_West','am_West'),
  status = c('on_time', 'delayed', 'on_time','delayed'),
  losAngeles = c(497,62,694,117),
  phoenix = c(221,12,4840,415),
  sanDiego = c(212,20,383,65),
  sanFrancisco = c(503,102,320,129),
  seattle = c(1841,305,201,61)
)

write.table(flight_timedelay, file = "flight_timedelay.csv", sep = ",", col.names=T)

airlines_tb <-data.frame(read.csv("https://raw.githubusercontent.com/aliharb/R-Programming/master/flight_timedelay.csv"))

airlines_tb
##   airlines  status losAngeles phoenix sanDiego sanFrancisco seattle
## 1   alaska on_time        497     221      212          503    1841
## 2   alaska delayed         62      12       20          102     305
## 3  am_West on_time        694    4840      383          320     201
## 4  am_West delayed        117     415       65          129      61
str(airlines_tb)
## 'data.frame':    4 obs. of  7 variables:
##  $ airlines    : Factor w/ 2 levels "alaska","am_West": 1 1 2 2
##  $ status      : Factor w/ 2 levels "delayed","on_time": 2 1 2 1
##  $ losAngeles  : int  497 62 694 117
##  $ phoenix     : int  221 12 4840 415
##  $ sanDiego    : int  212 20 383 65
##  $ sanFrancisco: int  503 102 320 129
##  $ seattle     : int  1841 305 201 61

Data Analysis

Based on the data provided, analysis will be performed to find the following:

  1. transform the data to group based on delays and ontime arrival.
delays_ontime<-filter(airlines_tb,status=="on_time")
ontimes_tb<-gather(delays_ontime,'airports','flights',3:7)

cols_delay<-filter(airlines_tb,status=="delayed")
delay_tb<-gather(cols_delay,'airports','flights',3:7)

tb<-bind_rows(ontimes_tb,delay_tb)
tb
##    airlines  status     airports flights
## 1    alaska on_time   losAngeles     497
## 2   am_West on_time   losAngeles     694
## 3    alaska on_time      phoenix     221
## 4   am_West on_time      phoenix    4840
## 5    alaska on_time     sanDiego     212
## 6   am_West on_time     sanDiego     383
## 7    alaska on_time sanFrancisco     503
## 8   am_West on_time sanFrancisco     320
## 9    alaska on_time      seattle    1841
## 10  am_West on_time      seattle     201
## 11   alaska delayed   losAngeles      62
## 12  am_West delayed   losAngeles     117
## 13   alaska delayed      phoenix      12
## 14  am_West delayed      phoenix     415
## 15   alaska delayed     sanDiego      20
## 16  am_West delayed     sanDiego      65
## 17   alaska delayed sanFrancisco     102
## 18  am_West delayed sanFrancisco     129
## 19   alaska delayed      seattle     305
## 20  am_West delayed      seattle      61

Plot on time arrivals for both airlines

ggplot(ontimes_tb,aes(x=airports, y=flights)) +
  geom_bar(stat="identity",fill="green", colour="black") +
  facet_wrap(~ airlines) +
  labs(title="Comparison of Arrival On Time", x="Airports", y="On time")

Plot delays arrivals for both airlines

ggplot(delay_tb,aes(x=airports, y=flights)) +
  geom_bar(stat="identity",fill="red", colour="black") +
  facet_wrap(~ airlines) +
  labs(title="Comparison of Arrival Delayed", x="airports", y="Delayed")

From the above graphs we can determine the following:

Even though the graphs highlight important information about the data set but they didn’t show any evidence about airlines and airport performance

  1. at first, the proportionality and percentile of ever evry flights arrivals will compared.
##    airlines     airports delayed on_time total_flights delay_proportional delay_percentile
## 1    alaska   losAngeles      62     497           559         0.11091234        11.091234
## 2    alaska      phoenix      12     221           233         0.05150215         5.150215
## 3    alaska     sanDiego      20     212           232         0.08620690         8.620690
## 4    alaska sanFrancisco     102     503           605         0.16859504        16.859504
## 5    alaska      seattle     305    1841          2146         0.14212488        14.212488
## 6   am_West   losAngeles     117     694           811         0.14426634        14.426634
## 7   am_West      phoenix     415    4840          5255         0.07897241         7.897241
## 8   am_West     sanDiego      65     383           448         0.14508929        14.508929
## 9   am_West sanFrancisco     129     320           449         0.28730512        28.730512
## 10  am_West      seattle      61     201           262         0.23282443        23.282443
ggplot(airport_performance,aes(x=airports, y=airport_proportional)) +
  geom_bar(stat="identity",fill="blue", colour="black") +
  labs(title="Comparison of Airpot performance", x="airports", y="percentile")

The graph indicates that, Phoenix airport has the best performance with the lowest delay less than 8% delays.

airlines_status<-tb%>%group_by(airlines, status) %>% 
  summarise(flights = sum(flights))
airlines_performance<-spread(airlines_status,status,flights)
airlines_performance<-mutate(airlines_performance,Proportion = delayed / (on_time+delayed)) 
airlines_performance<-mutate(airlines_performance,percentile = Proportion*100) 
airlines_performance
## Source: local data frame [2 x 5]
## Groups: airlines [2]
## 
##   airlines delayed on_time Proportion percentile
##     <fctr>   <int>   <int>      <dbl>      <dbl>
## 1   alaska     501    3274  0.1327152   13.27152
## 2  am_West     787    6438  0.1089273   10.89273
ggplot(airlines_performance,aes(x=airlines, y=percentile)) +
  geom_bar(stat="identity",fill="yellow", colour="black") +
  labs(title="Comparison of Airlines Delayed", x="airlines", y="percentile")

Based on the graph, the am west airline has a better performance with less than 11% delays.