The objective of this assignment is to practice tidying and transforming data using R studio or MySQL.
To perform the task various libraries are needed for data transformation and visualization.
library(tidyr)
library(dplyr)
library(ggplot2)
A csv file of the data must be created and imported into R for analytical processing.
flight_timedelay= data.frame(
airlines = c('alaska', 'alaska','am_West','am_West'),
status = c('on_time', 'delayed', 'on_time','delayed'),
losAngeles = c(497,62,694,117),
phoenix = c(221,12,4840,415),
sanDiego = c(212,20,383,65),
sanFrancisco = c(503,102,320,129),
seattle = c(1841,305,201,61)
)
write.table(flight_timedelay, file = "flight_timedelay.csv", sep = ",", col.names=T)
airlines_tb <-data.frame(read.csv("https://raw.githubusercontent.com/aliharb/R-Programming/master/flight_timedelay.csv"))
airlines_tb
## airlines status losAngeles phoenix sanDiego sanFrancisco seattle
## 1 alaska on_time 497 221 212 503 1841
## 2 alaska delayed 62 12 20 102 305
## 3 am_West on_time 694 4840 383 320 201
## 4 am_West delayed 117 415 65 129 61
str(airlines_tb)
## 'data.frame': 4 obs. of 7 variables:
## $ airlines : Factor w/ 2 levels "alaska","am_West": 1 1 2 2
## $ status : Factor w/ 2 levels "delayed","on_time": 2 1 2 1
## $ losAngeles : int 497 62 694 117
## $ phoenix : int 221 12 4840 415
## $ sanDiego : int 212 20 383 65
## $ sanFrancisco: int 503 102 320 129
## $ seattle : int 1841 305 201 61
Based on the data provided, analysis will be performed to find the following:
delays_ontime<-filter(airlines_tb,status=="on_time")
ontimes_tb<-gather(delays_ontime,'airports','flights',3:7)
cols_delay<-filter(airlines_tb,status=="delayed")
delay_tb<-gather(cols_delay,'airports','flights',3:7)
tb<-bind_rows(ontimes_tb,delay_tb)
tb
## airlines status airports flights
## 1 alaska on_time losAngeles 497
## 2 am_West on_time losAngeles 694
## 3 alaska on_time phoenix 221
## 4 am_West on_time phoenix 4840
## 5 alaska on_time sanDiego 212
## 6 am_West on_time sanDiego 383
## 7 alaska on_time sanFrancisco 503
## 8 am_West on_time sanFrancisco 320
## 9 alaska on_time seattle 1841
## 10 am_West on_time seattle 201
## 11 alaska delayed losAngeles 62
## 12 am_West delayed losAngeles 117
## 13 alaska delayed phoenix 12
## 14 am_West delayed phoenix 415
## 15 alaska delayed sanDiego 20
## 16 am_West delayed sanDiego 65
## 17 alaska delayed sanFrancisco 102
## 18 am_West delayed sanFrancisco 129
## 19 alaska delayed seattle 305
## 20 am_West delayed seattle 61
Plot on time arrivals for both airlines
ggplot(ontimes_tb,aes(x=airports, y=flights)) +
geom_bar(stat="identity",fill="green", colour="black") +
facet_wrap(~ airlines) +
labs(title="Comparison of Arrival On Time", x="Airports", y="On time")
Plot delays arrivals for both airlines
ggplot(delay_tb,aes(x=airports, y=flights)) +
geom_bar(stat="identity",fill="red", colour="black") +
facet_wrap(~ airlines) +
labs(title="Comparison of Arrival Delayed", x="airports", y="Delayed")
From the above graphs we can determine the following:
Even though the graphs highlight important information about the data set but they didn’t show any evidence about airlines and airport performance
## airlines airports delayed on_time total_flights delay_proportional delay_percentile
## 1 alaska losAngeles 62 497 559 0.11091234 11.091234
## 2 alaska phoenix 12 221 233 0.05150215 5.150215
## 3 alaska sanDiego 20 212 232 0.08620690 8.620690
## 4 alaska sanFrancisco 102 503 605 0.16859504 16.859504
## 5 alaska seattle 305 1841 2146 0.14212488 14.212488
## 6 am_West losAngeles 117 694 811 0.14426634 14.426634
## 7 am_West phoenix 415 4840 5255 0.07897241 7.897241
## 8 am_West sanDiego 65 383 448 0.14508929 14.508929
## 9 am_West sanFrancisco 129 320 449 0.28730512 28.730512
## 10 am_West seattle 61 201 262 0.23282443 23.282443
ggplot(airport_performance,aes(x=airports, y=airport_proportional)) +
geom_bar(stat="identity",fill="blue", colour="black") +
labs(title="Comparison of Airpot performance", x="airports", y="percentile")
The graph indicates that, Phoenix airport has the best performance with the lowest delay less than 8% delays.
airlines_status<-tb%>%group_by(airlines, status) %>%
summarise(flights = sum(flights))
airlines_performance<-spread(airlines_status,status,flights)
airlines_performance<-mutate(airlines_performance,Proportion = delayed / (on_time+delayed))
airlines_performance<-mutate(airlines_performance,percentile = Proportion*100)
airlines_performance
## Source: local data frame [2 x 5]
## Groups: airlines [2]
##
## airlines delayed on_time Proportion percentile
## <fctr> <int> <int> <dbl> <dbl>
## 1 alaska 501 3274 0.1327152 13.27152
## 2 am_West 787 6438 0.1089273 10.89273
ggplot(airlines_performance,aes(x=airlines, y=percentile)) +
geom_bar(stat="identity",fill="yellow", colour="black") +
labs(title="Comparison of Airlines Delayed", x="airlines", y="percentile")
Based on the graph, the am west airline has a better performance with less than 11% delays.