I used a .csv (created in Excel, then saved as a .csv).
The data is shown as below. I only added two column names where they were missing (“Airline” and “Status”). I also removed the blank row in between the two airline groupings, and repeat the missing values in “Airline”.
data<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Assignment_Week5_data.csv")
data## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska On Time 497 221 212 503 1841
## 2 Alaska Delayed 62 12 20 102 305
## 3 AM West On Time 694 4840 383 320 201
## 4 AM West Delayed 117 415 65 129 61
tidyr and dplyr to tidy and transform the dataNow that I have the data, I pipe the data to gather the city columns into one column (“City”). Then I reorder the columns before I spread the data to make “On Time” and “Delayed” into separate columns populated by the “Counts” data. Then I change the column name “On Time” to “OnTime”.
Finally, I use mutate to calculate the Total number of flights per route as well as the Delayed Percentage per route for each airline. Then I assign the transformed data to the “final” variable.
library(tidyr)
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#tidy the data
tidy_data<-data %>%
gather(City,Counts,3:7) %>%
select(Airline,City,Status,Counts) %>%
spread(Status,Counts)
#change name
names(tidy_data)[4]<-"OnTime"
#transform the data for analysis
transform_data<-tidy_data %>%
mutate(Total = Delayed + OnTime) %>%
mutate(Delay_Perc = Delayed/Total)
#set as final
final<-transform_data
final## Airline City Delayed OnTime Total Delay_Perc
## 1 Alaska Los.Angeles 62 497 559 0.11091234
## 2 Alaska Phoenix 12 221 233 0.05150215
## 3 Alaska San.Diego 20 212 232 0.08620690
## 4 Alaska San.Francisco 102 503 605 0.16859504
## 5 Alaska Seattle 305 1841 2146 0.14212488
## 6 AM West Los.Angeles 117 694 811 0.14426634
## 7 AM West Phoenix 415 4840 5255 0.07897241
## 8 AM West San.Diego 65 383 448 0.14508929
## 9 AM West San.Francisco 129 320 449 0.28730512
## 10 AM West Seattle 61 201 262 0.23282443
I suspect a Simpson’s paradox here…
Let’s compare total delay percentages between the two airlines. I use the group_by function to group by Airline before summarizing the data by the Delayed Total Percentage, calculated by summing up the Delayed column and dividing that by the sum of the Total column.
final %>%
group_by(Airline) %>%
summarise(DelayedTotal_Perc=sum(Delayed) / sum(Total))## # A tibble: 2 × 2
## Airline DelayedTotal_Perc
## <fctr> <dbl>
## 1 Alaska 0.1327152
## 2 AM West 0.1089273
We see that Alaska is delayed 13.3% of the time while AM West is delayed almost 10.8% of the time for all flights combined. Now let’s compare each route individually.
final %>%
select(Airline,City,Delay_Perc) %>%
spread(Airline,Delay_Perc)## City Alaska AM West
## 1 Los.Angeles 0.11091234 0.14426634
## 2 Phoenix 0.05150215 0.07897241
## 3 San.Diego 0.08620690 0.14508929
## 4 San.Francisco 0.16859504 0.28730512
## 5 Seattle 0.14212488 0.23282443
Since each route’s delayed percentage has already been calculated, I reorder and limit the data using the select function to the columns that I need (Airline, City, and Delay Percentage). Then I use the spread function to move the Airline values to the columns, which are populated by the Delay Percentage values according to the City.
When looking at each route individually, we see that Alaska always has fewer delays as a percentage than AM West. So as I suspected, this is Simpson’s paradox. When looking at the aggregated delays, Alaska has a greater ratio of delays. However, when looking at each route individually, AM West has a greater ratio of delays for each route.
In short, we have this paradox. So if asked, “which airline has more delays?”, we’d have to clarify the question. Are we concerned with each route individually? If so, AM West has more delays as a percentage per route. Are we concerned with all flights in total, regardless of route? If so, Alaska has more delays as a percentage of total.
Let’s look at the delay percentages and standard deviations.
summary(filter(final, Airline=="Alaska")$Delay_Perc)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05150 0.08621 0.11090 0.11190 0.14210 0.16860
sd(filter(final, Airline=="Alaska")$Delay_Perc)## [1] 0.04592624
summary(filter(final, Airline=="AM West")$Delay_Perc)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07897 0.14430 0.14510 0.17770 0.23280 0.28730
sd(filter(final, Airline=="AM West")$Delay_Perc)## [1] 0.08212854
We can see that by this analysis, looking at the route percentages, AM West looks a lot worse. The minimum route delay percentage is 7% compared to 5%, the mean is 18% compared to 11%, and the maximum is 29% compared to 17%. Alaska has a delay percentage standard deviation of 5% while AM West has a standard deviation of 8%. So I suppose if one is looking for a more consistent experience, Alaska is preferrable to AM West.
Who does the most flights? In total, AM West does much more. 7225 compared to 3775.
final %>%
group_by(Airline) %>%
summarise(Total=sum(Total))## # A tibble: 2 × 2
## Airline Total
## <fctr> <int>
## 1 Alaska 3775
## 2 AM West 7225
By route, however, Alaska does have more flights to Seattle and San Francisco. The other routes have more by AM West. Phoenix has a lot more.
final %>%
select(Airline,City,Total) %>%
spread(Airline,Total) ## City Alaska AM West
## 1 Los.Angeles 559 811
## 2 Phoenix 233 5255
## 3 San.Diego 232 448
## 4 San.Francisco 605 449
## 5 Seattle 2146 262
It’s hard to say which airline is “better” in terms of delays. Instead, we should be more precise: “better” for what? Overall, arguably AM West. Each route? Alaska. Consistency among routes? Alaska. So I suppose I’d lean more towards Alaska, but neither choice is overwhelming.