The chart above (not shown) describes arrival delays for two airlines across five destinations. Your task is to:
Data for arrival delays for two airlines is presented in an untidy format requiring data cleanup in order to perform some basic analysis.
To complete all the tasks below 3 packages are required:
library(tidyr)
library(dplyr)
library(stringr)
If a package is not installed it must first be installed before it can be loaded.
Here are the steps we will follow:
Read in raw CSV data from local directory.
rawcsv <- read.csv("Assignment 5.csv", header = TRUE, sep = ",", row.names = NULL, stringsAsFactors = FALSE)
rawcsv
## X X.1 Los.Angeles Phoneix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
In original file the ‘Airline’ and ‘Status’ columns were missing headers so we need to assign column names for airline and status
colnames(rawcsv)[1] <- "Airline"
colnames(rawcsv)[2] <- "Status"
Original file had one blank row separating airlines this imported as all NAs so we will remove row with all NA (i.e. the blank row)
rawcsv_narm <- na.omit(rawcsv)
We need to add in the airline names for the two missing values (the value did not repeat in the original file).
rawcsv_narm[2, 1] <- "ALASKA"
rawcsv_narm[4, 1] <- "AM WEST"
Now the data can be further manipulated for analysis. The gather() function was used to move data from wide to tall format.
prefinalcsv <- gather(rawcsv_narm, "City", "Count", 3:7)
prefinalcsv
## Airline Status City Count
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoneix 221
## 6 ALASKA delayed Phoneix 12
## 7 AM WEST on time Phoneix 4840
## 8 AM WEST delayed Phoneix 415
## 9 ALASKA on time San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 AM WEST delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 ALASKA delayed San.Francisco 102
## 15 AM WEST on time San.Francisco 320
## 16 AM WEST delayed San.Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
Now we will remove punctuation from city names that was introduced due to spaces in some (i.e. Los Angeles became Los.Angeles).
prefinalcsv$City <- str_replace(prefinalcsv$City, "[.]", " ")
Finally, we will move Status into its on variable.
finalcsv <- spread(prefinalcsv, Status, Count)
We will examine delays for these two airlines overall and by city using the summarise() function from the dplyr package.
DelayOverall <- finalcsv %>%
group_by(Airline) %>%
summarise(TotalDelayed=sum(delayed),TotalOnTime=sum(`on time`), PercentDelayed=round((TotalDelayed/(TotalDelayed+TotalOnTime))*100, 2))
DelayCity <- finalcsv %>%
group_by(Airline, City) %>%
summarise(TotalDelayed=sum(delayed),TotalOnTime=sum(`on time`), PercentDelayed=round((TotalDelayed/(TotalDelayed+TotalOnTime))*100, 2))
DelayOverall
## # A tibble: 2 × 4
## Airline TotalDelayed TotalOnTime PercentDelayed
## <chr> <int> <int> <dbl>
## 1 ALASKA 501 3274 13.27
## 2 AM WEST 787 6438 10.89
Overall, it would appear that Alaska has a higher % of delayed arrivals than Am West at 13.27% delayed. It is worth noting that Alaska has a much smaller (nearly 50% less) total number of observations in the set.
DelayCity
## Source: local data frame [10 x 5]
## Groups: Airline [?]
##
## Airline City TotalDelayed TotalOnTime PercentDelayed
## <chr> <chr> <int> <int> <dbl>
## 1 ALASKA Los Angeles 62 497 11.09
## 2 ALASKA Phoneix 12 221 5.15
## 3 ALASKA San Diego 20 212 8.62
## 4 ALASKA San Francisco 102 503 16.86
## 5 ALASKA Seattle 305 1841 14.21
## 6 AM WEST Los Angeles 117 694 14.43
## 7 AM WEST Phoneix 415 4840 7.90
## 8 AM WEST San Diego 65 383 14.51
## 9 AM WEST San Francisco 129 320 28.73
## 10 AM WEST Seattle 61 201 23.28
When we examine this data by city, again using the summarise() function from dplyr, the trend in arrival delays seems to reverse.
For each city Am West now has the higher % of delayed arrivals with some cities being as high as 28.73% (San Francisco).
What we have stumbled upon here is a rather classic illustration of Simpson’s paradox: A trend in data that reverses when the results are conditioned by an additional grouping variable.
In this case it appears that city has a dramatic impact on the % of flights that are delayed. If we step away from the data and considered the differences in weather between the cities below this paradox makes sense (consider weather in San Diego compared to Seattle on average). Given that the distribution of total flights to each city differs by airline we can see how this difference might emerge.
If we were considering which airline to take based on historical delay information it would be important to consider city in all analyses.