Given chart describes arrival delays for two airlines across five destinations. We have to

(1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above

Environment Setup

library(stringr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.3
library(knitr)

Loading Csv file

myurl <- "https://raw.githubusercontent.com/Harpreet1984/DATA607/master/Flight.csv"
flight <- read.csv(myurl, header= TRUE,sep=",",stringsAsFactors=FALSE)
flight
##        X     X.1 LOS.ANGELES Phoenix San.diego San.Francisco Seattle
## 1 ALASKA  ontime         497     221       212           503    1841
## 2        delayed          63      12        20           102     305
## 3                         NA      NA        NA            NA      NA
## 4 Amwest  ontime         694    4840       383           320     201
## 5        delayed         117     415        65           129      61
Renaming Column headers

From the above table we see By default, the read.csv() function named first two columns “X”" and “X1”. We need to rename these with proper labels. Here i converted flight data into tibble

flightdata <- flight
library(tibble)
## Warning: package 'tibble' was built under R version 3.4.3
flightdata <- as_data_frame(flightdata)
flightdata <- flightdata %>% rename(Airline = X, Status = X.1)
flightdata
## # A tibble: 5 x 7
##   Airline Status  LOS.ANGELES Phoenix San.diego San.Francisco Seattle
##   <chr>   <chr>         <int>   <int>     <int>         <int>   <int>
## 1 ALASKA  ontime          497     221       212           503    1841
## 2 ""      delayed          63      12        20           102     305
## 3 ""      ""               NA      NA        NA            NA      NA
## 4 Amwest  ontime          694    4840       383           320     201
## 5 ""      delayed         117     415        65           129      61
(2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

Tidying DATA

Here we are removing the empty rows if any and then replacing missing values in Airline column

flightdata <-  flightdata %>% filter(Status != '')
flightdata
## # A tibble: 4 x 7
##   Airline Status  LOS.ANGELES Phoenix San.diego San.Francisco Seattle
##   <chr>   <chr>         <int>   <int>     <int>         <int>   <int>
## 1 ALASKA  ontime          497     221       212           503    1841
## 2 ""      delayed          63      12        20           102     305
## 3 Amwest  ontime          694    4840       383           320     201
## 4 ""      delayed         117     415        65           129      61
flightdata$Airline <- ifelse(flightdata$Airline == "", lag(flightdata$Airline),flightdata$Airline)
kable(flightdata) 
Airline Status LOS.ANGELES Phoenix San.diego San.Francisco Seattle
ALASKA ontime 497 221 212 503 1841
ALASKA delayed 63 12 20 102 305
Amwest ontime 694 4840 383 320 201
Amwest delayed 117 415 65 129 61

Using gather and spread and mutate function

—-gather() function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.

—-spread() function spreads a key-value pair across multiple columns.

flightdata <- flightdata %>% 
  gather(city, frequency, 3:length(flightdata))
flightdata <- flightdata %>% spread("Status",frequency)
kable(flightdata)
Airline city delayed ontime
ALASKA LOS.ANGELES 63 497
ALASKA Phoenix 12 221
ALASKA San.diego 20 212
ALASKA San.Francisco 102 503
ALASKA Seattle 305 1841
Amwest LOS.ANGELES 117 694
Amwest Phoenix 415 4840
Amwest San.diego 65 383
Amwest San.Francisco 129 320
Amwest Seattle 61 201

—– mutate() function - Mutate adds new variables and preserves existing to add total time column

flightdata <- flightdata %>% mutate(Total.time = ontime + delayed) %>%
              mutate(ontime_percentage = ontime / Total.time) %>%
              mutate(delay_percentage = delayed / Total.time)
kable(flightdata)
Airline city delayed ontime Total.time ontime_percentage delay_percentage
ALASKA LOS.ANGELES 63 497 560 0.8875000 0.1125000
ALASKA Phoenix 12 221 233 0.9484979 0.0515021
ALASKA San.diego 20 212 232 0.9137931 0.0862069
ALASKA San.Francisco 102 503 605 0.8314050 0.1685950
ALASKA Seattle 305 1841 2146 0.8578751 0.1421249
Amwest LOS.ANGELES 117 694 811 0.8557337 0.1442663
Amwest Phoenix 415 4840 5255 0.9210276 0.0789724
Amwest San.diego 65 383 448 0.8549107 0.1450893
Amwest San.Francisco 129 320 449 0.7126949 0.2873051
Amwest Seattle 61 201 262 0.7671756 0.2328244
(3) Perform analysis to compare the arrival delays for the two airlines.

Graphical representation of Data Using ggplots

Here we can perform analysis on Flight arrival status based on flights delayed and flights reached on-time.

library(ggplot2)

myggplot <- ggplot(flightdata, aes(x = city, y = delay_percentage,group = Airline,color = Airline,shape=Airline) ,xlab = "city" , ylab = "Delay") +
  geom_line() + geom_point((aes(shape=Airline)))
myggplot

Conclusion

Based on the above graphs we can say delay rates of ALASKA Airlines is less compared to AMWEST Airlines Also we see that most delays happens at the city of Sanfrancisco and least delays at Phoenix