Data 607 - Assignment 5

X	X.1	LosAngeles	Phoenix	SanDiego	SanFrancisco	Seattle
ALASKA	on time	497	221	212	503	1841
	delayed	62	12	20	102	305
						NA
AMWEST	on time	694	4840	383	320	201
	delayed	117	415	65	129	61

The table above describes arrival delays for two airlines across five destinations.

The ask is to:

Create a .CSV file that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.

1) Loading required libraries and downloading data

We load all required libraries:

library(httr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)

We download the CSV from my GitHub account replacing all empty spaces with NAs to make the clean up easier later on:

urlRemote  <- "https://raw.githubusercontent.com/"
pathGithub <- "chilleundso/DATA607/master/Assignment5/"
fileName   <- "Airline.csv"


air <- read.csv(paste0(urlRemote, pathGithub, fileName),header = TRUE, na.strings=c(""," ","NA"))

kable(air)

X	X.1	LosAngeles	Phoenix	SanDiego	SanFrancisco	Seattle
ALASKA	on time	497	221	212	503	1841
NA	delayed	62	12	20	102	305
NA	NA	NA	NA	NA	NA	NA
AMWEST	on time	694	4840	383	320	201
NA	delayed	117	415	65	129	61

2a) Clean-Up

As we can see the header for the first two columns are not very helpful so we change them to something more explanatory, secondly we remove all rows that are entirely N/As and lastly we fill all missing airline names with the name that is directly above:

#rename the first 2 column headers to "Airline" and "Status"
air <- dplyr::rename(air, Airline = X, Status = X.1)
#removing all rows that are entirely N/As (row 3):
air <- dplyr::filter_all(air,any_vars(!is.na(.)))
#we use the fill function to fill any N/A with the value directly above it in the data frame to add airline names for the second and 4th row
air <- tidyr::fill(air,Airline)
kable(air)

Airline	Status	LosAngeles	Phoenix	SanDiego	SanFrancisco	Seattle
ALASKA	on time	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
AMWEST	on time	694	4840	383	320	201
AMWEST	delayed	117	415	65	129	61

The above table now is entirely filled, however it is not in a format that is easy to be used for further investigations:

2b) Reformatting

Using gather we will generate a row specific for every tripple of airline, status and departure city (in the table below we are only showing the first 6 entries to make the report more readable)

air <- tidyr::gather(air, "Destination", "Amount", -Airline, -Status)
kable(head(air))

Airline	Status	Destination	Amount
ALASKA	on time	LosAngeles	497
ALASKA	delayed	LosAngeles	62
AMWEST	on time	LosAngeles	694
AMWEST	delayed	LosAngeles	117
ALASKA	on time	Phoenix	221
ALASKA	delayed	Phoenix	12

For our final reformatting we will use spread to move the amount of delays and on-time flights per destination and airline into the same row

air <- tidyr::spread(air,Status, Amount)
kable(air)

Airline	Destination	delayed	on time
ALASKA	LosAngeles	62	497
ALASKA	Phoenix	12	221
ALASKA	SanDiego	20	212
ALASKA	SanFrancisco	102	503
ALASKA	Seattle	305	1841
AMWEST	LosAngeles	117	694
AMWEST	Phoenix	415	4840
AMWEST	SanDiego	65	383
AMWEST	SanFrancisco	129	320
AMWEST	Seattle	61	201

3) Investigation

To make the data more easily comparable we “normalize” it by adding the delay_rate which gives the percentage of delays per total flight for each pair of destination and airline:

air$delay_rate <- air$delayed / (air$delayed+air$'on time')
kable(air)

Airline	Destination	delayed	on time	delay_rate
ALASKA	LosAngeles	62	497	0.1109123
ALASKA	Phoenix	12	221	0.0515021
ALASKA	SanDiego	20	212	0.0862069
ALASKA	SanFrancisco	102	503	0.1685950
ALASKA	Seattle	305	1841	0.1421249
AMWEST	LosAngeles	117	694	0.1442663
AMWEST	Phoenix	415	4840	0.0789724
AMWEST	SanDiego	65	383	0.1450893
AMWEST	SanFrancisco	129	320	0.2873051
AMWEST	Seattle	61	201	0.2328244

To investigate the behavior of delay rates per destination and airline we plot the delay rate for every city with both airlines next to each other in a grouped barchart (source for the code: https://www.r-graph-gallery.com/48-grouped-barplot-with-ggplot2.html)

ggplot(air, aes(fill=Airline, y=delay_rate, x=reorder(Destination, delay_rate))) + 
    geom_bar(position="dodge", stat="identity") +
    xlab('Destination') +
    ylab('Delay Rate')

Conculsion

What we can see is that for every destination AMWEST has a higher delay rate than Alaska and the order of delay (from least to most) per destination is: Phoenix, San Diego, Los Angeles, Seattle and lastly San Francisco.