X X.1 LosAngeles Phoenix SanDiego SanFrancisco Seattle
ALASKA on time 497 221 212 503 1841
delayed 62 12 20 102 305
NA
AMWEST on time 694 4840 383 320 201
delayed 117 415 65 129 61

The table above describes arrival delays for two airlines across five destinations.

The ask is to:

  1. Create a .CSV file that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

  3. Perform analysis to compare the arrival delays for the two airlines.



1) Loading required libraries and downloading data

We load all required libraries:

library(httr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)

We download the CSV from my GitHub account replacing all empty spaces with NAs to make the clean up easier later on:

urlRemote  <- "https://raw.githubusercontent.com/"
pathGithub <- "chilleundso/DATA607/master/Assignment5/"
fileName   <- "Airline.csv"


air <- read.csv(paste0(urlRemote, pathGithub, fileName),header = TRUE, na.strings=c(""," ","NA"))

kable(air)
X X.1 LosAngeles Phoenix SanDiego SanFrancisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AMWEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61


2a) Clean-Up

As we can see the header for the first two columns are not very helpful so we change them to something more explanatory, secondly we remove all rows that are entirely N/As and lastly we fill all missing airline names with the name that is directly above:

#rename the first 2 column headers to "Airline" and "Status"
air <- dplyr::rename(air, Airline = X, Status = X.1)
#removing all rows that are entirely N/As (row 3):
air <- dplyr::filter_all(air,any_vars(!is.na(.)))
#we use the fill function to fill any N/A with the value directly above it in the data frame to add airline names for the second and 4th row
air <- tidyr::fill(air,Airline)
kable(air)
Airline Status LosAngeles Phoenix SanDiego SanFrancisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AMWEST on time 694 4840 383 320 201
AMWEST delayed 117 415 65 129 61

The above table now is entirely filled, however it is not in a format that is easy to be used for further investigations:


2b) Reformatting

Using gather we will generate a row specific for every tripple of airline, status and departure city (in the table below we are only showing the first 6 entries to make the report more readable)

air <- tidyr::gather(air, "Destination", "Amount", -Airline, -Status)
kable(head(air))
Airline Status Destination Amount
ALASKA on time LosAngeles 497
ALASKA delayed LosAngeles 62
AMWEST on time LosAngeles 694
AMWEST delayed LosAngeles 117
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12

For our final reformatting we will use spread to move the amount of delays and on-time flights per destination and airline into the same row

air <- tidyr::spread(air,Status, Amount)
kable(air)
Airline Destination delayed on time
ALASKA LosAngeles 62 497
ALASKA Phoenix 12 221
ALASKA SanDiego 20 212
ALASKA SanFrancisco 102 503
ALASKA Seattle 305 1841
AMWEST LosAngeles 117 694
AMWEST Phoenix 415 4840
AMWEST SanDiego 65 383
AMWEST SanFrancisco 129 320
AMWEST Seattle 61 201


3) Investigation

To make the data more easily comparable we “normalize” it by adding the delay_rate which gives the percentage of delays per total flight for each pair of destination and airline:

air$delay_rate <- air$delayed / (air$delayed+air$'on time')
kable(air)
Airline Destination delayed on time delay_rate
ALASKA LosAngeles 62 497 0.1109123
ALASKA Phoenix 12 221 0.0515021
ALASKA SanDiego 20 212 0.0862069
ALASKA SanFrancisco 102 503 0.1685950
ALASKA Seattle 305 1841 0.1421249
AMWEST LosAngeles 117 694 0.1442663
AMWEST Phoenix 415 4840 0.0789724
AMWEST SanDiego 65 383 0.1450893
AMWEST SanFrancisco 129 320 0.2873051
AMWEST Seattle 61 201 0.2328244

To investigate the behavior of delay rates per destination and airline we plot the delay rate for every city with both airlines next to each other in a grouped barchart (source for the code: https://www.r-graph-gallery.com/48-grouped-barplot-with-ggplot2.html)

ggplot(air, aes(fill=Airline, y=delay_rate, x=reorder(Destination, delay_rate))) + 
    geom_bar(position="dodge", stat="identity") +
    xlab('Destination') +
    ylab('Delay Rate')

Conculsion

What we can see is that for every destination AMWEST has a higher delay rate than Alaska and the order of delay (from least to most) per destination is: Phoenix, San Diego, Los Angeles, Seattle and lastly San Francisco.

GitHub: https://github.com/chilleundso/DATA607/blob/master/Assignment5/Data607_Assignment5.Rmd