The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines.
After replicating the file as a CSV and uploading to Github, our first step is to load into R and assess the data’s structure.
library(RCurl)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/607/master/607W5.csv")
RawTable <- read.csv(text = x)
RawTable
## X X.1 Los.Angeles Phoenix San.Diego San.Fransisco Seattle
## 1 Alaska On Time 497 221 212 503 1841
## 2 Delayed 62 12 20 102 305
## 3 AM West On Time 694 4840 383 320 201
## 4 Delayed 117 415 65 129 61
As we can see, this data violated several of our TidyData rules. Most notably, Columns 3 through 7 contain variable names and Column 2 contains multiple variables (On Time and Delayed).
We first need to add titles to Columns 1 and 2, and insert missing values into select cells in Column 1. From here, we can restructure our data using pivot_longer and pivot_wider as needed.
colnames(RawTable)[1] <- "Airline"
colnames(RawTable)[2] <- "Status"
RawTable[2,1] <- "Alaska"
RawTable[4,1] <- "AM West"
RawTable2 <- RawTable %>%
pivot_longer(c(Los.Angeles, Phoenix, San.Diego, San.Fransisco, Seattle), names_to = "City", values_to = "Flight Count")
RawTable2
## # A tibble: 20 x 4
## Airline Status City `Flight Count`
## <fct> <fct> <chr> <int>
## 1 Alaska On Time Los.Angeles 497
## 2 Alaska On Time Phoenix 221
## 3 Alaska On Time San.Diego 212
## 4 Alaska On Time San.Fransisco 503
## 5 Alaska On Time Seattle 1841
## 6 Alaska Delayed Los.Angeles 62
## 7 Alaska Delayed Phoenix 12
## 8 Alaska Delayed San.Diego 20
## 9 Alaska Delayed San.Fransisco 102
## 10 Alaska Delayed Seattle 305
## 11 AM West On Time Los.Angeles 694
## 12 AM West On Time Phoenix 4840
## 13 AM West On Time San.Diego 383
## 14 AM West On Time San.Fransisco 320
## 15 AM West On Time Seattle 201
## 16 AM West Delayed Los.Angeles 117
## 17 AM West Delayed Phoenix 415
## 18 AM West Delayed San.Diego 65
## 19 AM West Delayed San.Fransisco 129
## 20 AM West Delayed Seattle 61
TidyTable <- RawTable2 %>%
pivot_wider(names_from = Status, values_from = `Flight Count`)
TidyTable
## # A tibble: 10 x 4
## Airline City `On Time` Delayed
## <fct> <chr> <int> <int>
## 1 Alaska Los.Angeles 497 62
## 2 Alaska Phoenix 221 12
## 3 Alaska San.Diego 212 20
## 4 Alaska San.Fransisco 503 102
## 5 Alaska Seattle 1841 305
## 6 AM West Los.Angeles 694 117
## 7 AM West Phoenix 4840 415
## 8 AM West San.Diego 383 65
## 9 AM West San.Fransisco 320 129
## 10 AM West Seattle 201 61
As this table measures delayed and on-time flights across cities and airlines, we can infer that there are two different uses one might have for this table: “which airline runs on time most frequently on average across all cities?” and “which airline is more likely to arrive to a given city on time?”
We can add in a new column to answer the second question, and provide averages that answer the first question:
TidyTable$"Delayed Rate" <- (TidyTable$Delayed / (TidyTable$Delayed + TidyTable$`On Time`))
TidyTable
## # A tibble: 10 x 5
## Airline City `On Time` Delayed `Delayed Rate`
## <fct> <chr> <int> <int> <dbl>
## 1 Alaska Los.Angeles 497 62 0.111
## 2 Alaska Phoenix 221 12 0.0515
## 3 Alaska San.Diego 212 20 0.0862
## 4 Alaska San.Fransisco 503 102 0.169
## 5 Alaska Seattle 1841 305 0.142
## 6 AM West Los.Angeles 694 117 0.144
## 7 AM West Phoenix 4840 415 0.0790
## 8 AM West San.Diego 383 65 0.145
## 9 AM West San.Fransisco 320 129 0.287
## 10 AM West Seattle 201 61 0.233
sum(subset(TidyTable$Delayed, TidyTable$Airline == 'Alaska')) / sum(subset(c(TidyTable$Delayed, TidyTable$`On Time`), TidyTable$Airline == 'Alaska'))
## [1] 0.1327152
sum(subset(TidyTable$Delayed, TidyTable$Airline == 'AM West')) / sum(subset(c(TidyTable$Delayed, TidyTable$`On Time`), TidyTable$Airline == 'AM West'))
## [1] 0.1089273
AM West is only delayed on average, 11% of the time, while Alaska is delayed 13% on average. That being said, if one was traveling to Phoenix, they would likely want to fly Alaska as they are only delayed 5% of the time for that city specifically, and AM West has a delayed rate of nearly 8%.