The purpose of this assignment is to tidy and transform data. The dataset of interest describes arrival delays for two airlines across five destinations and our task is to:
A wide format .csv is created per assignment description and uploaded to github.
With the .csv available on github, we shift to reading from this file prior to tidying and transforming the data therein.
We read the .csv (in its raw form) from github and store corresponding data in a variable named “data” (I know … very creative).
#Get URL, read .csv (in raw form) from github, and put into tabular form
url <- getURL("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/Assignment-5/Wcoast_airline_comp.csv")
data <- read.csv(text = url)
data <- tbl_df(data)
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 5 x 7
## X X.1 Los.Angeles Phoenix San.Diego San.Fransisco Seattle
## <chr> <chr> <int> <int> <int> <int> <int>
## 1 "ALASKA" "on time" 497 221 212 503 1841
## 2 "" "delayed" 62 12 20 102 305
## 3 "" "" NA NA NA NA NA
## 4 "AM WEST" "on time" 694 4840 383 320 201
## 5 "" "delayed" 117 415 65 129 61
Once we’ve read from .csv and stored our data, we shift to tidying and transforming the data therein.
To start, we tidy:
#Tidy / shape our data (using tidyr)
##Provide fitting column names for the 1st and 2nd column
names(data)[names(data) == "X"] <- "Airline"
names(data)[names(data) == "X.1"] <- "Status"
##Remove rows with "NA" from dataset
data <- na.omit(data)
##Make observations from variables using gather()
##This is also when we convert from a 'wide' to 'long' format
data_long <- gather(data, "Airport", "n", 3:7, factor_key=TRUE)
##Initialize counter then fill empty Airline elements via for loop
n <- 2
for (n in c(2,6,10,14,18)){
data_long$Airline[n] <- "ALASKA"
data_long$Airline[n+2] <- "AM WEST"
n <- n+4
}
##Make variables from observations using spread()
data_long <- spread(data_long, "Status", "n")
data_long
## # A tibble: 10 x 4
## Airline Airport delayed `on time`
## <chr> <fct> <int> <int>
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Fransisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Fransisco 129 320
## 10 AM WEST Seattle 61 201
qplot(x=delayed, y=`on time`, data=data_long, col= Airline, main="On Time v Delayed Flights", xlab="# of delayed flights", ylab="# of on time flights")
What started as a less clear 5x7 ‘wide’ table (“data”) is now a tidy 4x10 ‘long’ table (“data_long”) with an associated plot (above) showing the relationship between on time and delayed flights for each Airline.
Each point on the plot represents an airport, with the point of this plot being to show the range in scale (ie. # of flights) between different data in our data set.
…
With tidying complete, we move on to transforming our data. We’ll make use of dplyr() to transform data and perform mathematical calculations to solve for our variables. After this point, we should have all we need to clearly observe the difference in delays between the two airlines.
We can identify our high-level variables of interest as: (1) What airline has the lowest overall probability of arrival delay? (2) At each airport, which airline has the lower probability of arrival delay?
To calculate (1), we can take the total flights delayed and on time per airline, sum them together, and observe the probability of delay for each airline.
#Transform the data: dplyr
##Compute the probability of an Alaska flight arriving on time
ak <- filter(data_long, Airline == "ALASKA")
ak_on_time <- sum(ak$`on time`) / (sum(ak$delayed) + sum(ak$`on time`))
round(ak_on_time, digits = 3)
## [1] 0.867
#Compute the probability of an AM West flight arriving on time
aw_flts <- filter(data_long, Airline == "AM WEST")
aw_on_time <- sum(aw_flts$`on time`) / (sum(aw_flts$delayed) + sum(aw_flts$`on time`))
round(aw_on_time, digits = 3)
## [1] 0.891
We first observe that AM West has a better total on time probability. Meaning that for on time flight total over total number of flights (incl. delay) AM West has a better on time probability than ALASKA.
But we can’t close on this conclusion as it’s very high level. We’ve got to observe the probabilities at each individual Airport. Thus exploring, (2) At each airport, which airline has the lower probability of arrival delay?
Compare arrival delays between the two airlines.
#Initialize counter then create new column on_time_prob to store the on time probability for each Airline at each Airport via for loop
n <- 1
for (n in c(1:10)){
data_long$on_time_prob[n] <- round(data_long$`on time`[n] / (data_long$`on time`[n] + data_long$delayed[n]), digits = 3)
n <- n+1
}
## Warning: Unknown or uninitialised column: `on_time_prob`.
## # A tibble: 10 x 5
## Airline Airport delayed `on time` on_time_prob
## <chr> <fct> <int> <int> <dbl>
## 1 ALASKA Los.Angeles 62 497 0.889
## 2 ALASKA Phoenix 12 221 0.948
## 3 ALASKA San.Diego 20 212 0.914
## 4 ALASKA San.Fransisco 102 503 0.831
## 5 ALASKA Seattle 305 1841 0.858
## 6 AM WEST Los.Angeles 117 694 0.856
## 7 AM WEST Phoenix 415 4840 0.921
## 8 AM WEST San.Diego 65 383 0.855
## 9 AM WEST San.Fransisco 129 320 0.713
## 10 AM WEST Seattle 61 201 0.767
qplot(x=Airport, y=on_time_prob, data=data_long, col= Airline, main="On Time Probability v Airport", xlab="Airport", ylab="Probability of an On Time Arrival")
Using the table and plot above, we can observe that for every Airport, Alaskan Air has a better probability of being on time.
Thus there is a discrepancy in the data. At a macro scale it would seem AM WEST would be the better Airline to choose but when we zoom on each Airline’s performance per Airport we see that ALASKA actually performs better.
The reason that the total on time probability favors AM WEST is because AM WEST has 4840 flights for Phoenix that are on time (recall our earlier plot), and this is its best performing Airport. If we compare this value to ALASKA’s largest volume of flights (1841 at Seattle), we see that it’s nearly 3x the volume.
While ALASKA may have a better on time probability than AM WEST for arrival in Seattle (and every other airport for that matter), AM WEST has ~3x volume on its best performing airport and ALASKA has its greatest volume on one of its lowest performing airports. Thus volume sways the total in AM WEST’s favor and paints a “different tale”.
In closing, it may seem that AM WEST is better if we just take flight totals into account but once we actually dig in and look at on time probabilities per Airline and Airport, we see that ALASKA is the better airline to choose for on time arrival. It outperforms AM WEST at every airport (including Phoenix).