Background

The purpose of this assignment is to tidy and transform data. The dataset of interest describes arrival delays for two airlines across five destinations and our task is to:

  1. Create a .csv file in a “wide” structure.
  2. Read this .csv and use tidyr and dplyr to tidy and transform.
  3. Analyze arrival delays between the airlines.
  4. Provide the .rmd file, rpubs link, and descriptions of steps taken.

(1) Create a .csv.

A wide format .csv is created per assignment description and uploaded to github.

With the .csv available on github, we shift to reading from this file prior to tidying and transforming the data therein.

(2) Read, tidy, and transform.

We read the .csv (in its raw form) from github and store corresponding data in a variable named “data” (I know … very creative).

#Get URL, read .csv (in raw form) from github, and put into tabular form
url <- getURL("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/Assignment-5/Wcoast_airline_comp.csv")
data <- read.csv(text = url)
data <- tbl_df(data)
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
#Show what we're working with:
data
## # A tibble: 5 x 7
##   X         X.1       Los.Angeles Phoenix San.Diego San.Fransisco Seattle
##   <chr>     <chr>           <int>   <int>     <int>         <int>   <int>
## 1 "ALASKA"  "on time"         497     221       212           503    1841
## 2 ""        "delayed"          62      12        20           102     305
## 3 ""        ""                 NA      NA        NA            NA      NA
## 4 "AM WEST" "on time"         694    4840       383           320     201
## 5 ""        "delayed"         117     415        65           129      61

Once we’ve read from .csv and stored our data, we shift to tidying and transforming the data therein.

To start, we tidy:

#Tidy / shape our data (using tidyr)

##Provide fitting column names for the 1st and 2nd column
names(data)[names(data) == "X"] <- "Airline"
names(data)[names(data) == "X.1"] <- "Status"

##Remove rows with "NA" from dataset
data <- na.omit(data)

##Make observations from variables using gather()
##This is also when we convert from a 'wide' to 'long' format
data_long <- gather(data, "Airport", "n", 3:7, factor_key=TRUE)


##Initialize counter then fill empty Airline elements via for loop
n <- 2 

for (n in c(2,6,10,14,18)){
    data_long$Airline[n] <- "ALASKA"
    data_long$Airline[n+2] <- "AM WEST"
    n <- n+4
}

##Make variables from observations using spread()
data_long <- spread(data_long, "Status", "n")
data_long
## # A tibble: 10 x 4
##    Airline Airport       delayed `on time`
##    <chr>   <fct>           <int>     <int>
##  1 ALASKA  Los.Angeles        62       497
##  2 ALASKA  Phoenix            12       221
##  3 ALASKA  San.Diego          20       212
##  4 ALASKA  San.Fransisco     102       503
##  5 ALASKA  Seattle           305      1841
##  6 AM WEST Los.Angeles       117       694
##  7 AM WEST Phoenix           415      4840
##  8 AM WEST San.Diego          65       383
##  9 AM WEST San.Fransisco     129       320
## 10 AM WEST Seattle            61       201
qplot(x=delayed, y=`on time`, data=data_long, col= Airline, main="On Time v Delayed Flights", xlab="# of delayed flights", ylab="# of on time flights")

What started as a less clear 5x7 ‘wide’ table (“data”) is now a tidy 4x10 ‘long’ table (“data_long”) with an associated plot (above) showing the relationship between on time and delayed flights for each Airline.

Each point on the plot represents an airport, with the point of this plot being to show the range in scale (ie. # of flights) between different data in our data set.

With tidying complete, we move on to transforming our data. We’ll make use of dplyr() to transform data and perform mathematical calculations to solve for our variables. After this point, we should have all we need to clearly observe the difference in delays between the two airlines.

We can identify our high-level variables of interest as: (1) What airline has the lowest overall probability of arrival delay? (2) At each airport, which airline has the lower probability of arrival delay?

To calculate (1), we can take the total flights delayed and on time per airline, sum them together, and observe the probability of delay for each airline.

#Transform the data: dplyr

##Compute the probability of an Alaska flight arriving on time
ak <- filter(data_long, Airline == "ALASKA")
ak_on_time <- sum(ak$`on time`) / (sum(ak$delayed) + sum(ak$`on time`))
round(ak_on_time, digits = 3)
## [1] 0.867
#Compute the probability of an AM West flight arriving on time
aw_flts <- filter(data_long, Airline == "AM WEST")
aw_on_time <- sum(aw_flts$`on time`) / (sum(aw_flts$delayed) + sum(aw_flts$`on time`))
round(aw_on_time, digits = 3)
## [1] 0.891

We first observe that AM West has a better total on time probability. Meaning that for on time flight total over total number of flights (incl. delay) AM West has a better on time probability than ALASKA.

But we can’t close on this conclusion as it’s very high level. We’ve got to observe the probabilities at each individual Airport. Thus exploring, (2) At each airport, which airline has the lower probability of arrival delay?

(3) Analyze and compare.

Compare arrival delays between the two airlines.

#Initialize counter then create new column on_time_prob to store the on time probability for each Airline at each Airport via for loop
n <- 1

for (n in c(1:10)){
    data_long$on_time_prob[n] <- round(data_long$`on time`[n] / (data_long$`on time`[n] + data_long$delayed[n]), digits = 3)
    n <- n+1
}
## Warning: Unknown or uninitialised column: `on_time_prob`.
data_long
## # A tibble: 10 x 5
##    Airline Airport       delayed `on time` on_time_prob
##    <chr>   <fct>           <int>     <int>        <dbl>
##  1 ALASKA  Los.Angeles        62       497        0.889
##  2 ALASKA  Phoenix            12       221        0.948
##  3 ALASKA  San.Diego          20       212        0.914
##  4 ALASKA  San.Fransisco     102       503        0.831
##  5 ALASKA  Seattle           305      1841        0.858
##  6 AM WEST Los.Angeles       117       694        0.856
##  7 AM WEST Phoenix           415      4840        0.921
##  8 AM WEST San.Diego          65       383        0.855
##  9 AM WEST San.Fransisco     129       320        0.713
## 10 AM WEST Seattle            61       201        0.767
qplot(x=Airport, y=on_time_prob, data=data_long, col= Airline, main="On Time Probability v Airport", xlab="Airport", ylab="Probability of an On Time Arrival")

Using the table and plot above, we can observe that for every Airport, Alaskan Air has a better probability of being on time.

Thus there is a discrepancy in the data. At a macro scale it would seem AM WEST would be the better Airline to choose but when we zoom on each Airline’s performance per Airport we see that ALASKA actually performs better.

(4) Conclude and recommend.

The reason that the total on time probability favors AM WEST is because AM WEST has 4840 flights for Phoenix that are on time (recall our earlier plot), and this is its best performing Airport. If we compare this value to ALASKA’s largest volume of flights (1841 at Seattle), we see that it’s nearly 3x the volume.

While ALASKA may have a better on time probability than AM WEST for arrival in Seattle (and every other airport for that matter), AM WEST has ~3x volume on its best performing airport and ALASKA has its greatest volume on one of its lowest performing airports. Thus volume sways the total in AM WEST’s favor and paints a “different tale”.

In closing, it may seem that AM WEST is better if we just take flight totals into account but once we actually dig in and look at on time probabilities per Airline and Airport, we see that ALASKA is the better airline to choose for on time arrival. It outperforms AM WEST at every airport (including Phoenix).