The chart above describes arrival delays for two airlines across five destinations. Your task is to:
I have started by creating a .csv file with the arrival delay information provided in the assignment. I uploaded my .csv file to github to read it from an online location. The data includes blanks and it is challenging to analyze it in this format.
#install.packages("tidyverse")
#install.packages("dplyr")
require(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
require(tidyr)
AirData <- read.csv(file="https://raw.githubusercontent.com/che10vek/Data-607-Assignments/master/Assignment%205%20Data.csv", header=TRUE, sep=",")
head(AirData)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
## 6 NA NA NA NA NA
Now that we have read in our data - we need to clean up and transform it to make it possible to do analysis.
This code will get read of NAs:
newairdata <- subset(AirData, AirData$X.1 != "")
head(newairdata)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
I struggled with filling the blank cells with the value from the cell above quiet a bit. I ended up finding a function on the following webpage and adopting it a bit for my data: https://stackoverflow.com/questions/10554741/fill-in-data-frame-with-values-from-rows-above
f4 <- function(x, blank = "") {
# Find the values
if (is.function(blank)) {
isnotblank <- !blank(x)
} else {
isnotblank <- x != blank
}
# Fill down
x[which(isnotblank)][cumsum(isnotblank)]
}
newairdata$X <- f4(newairdata$X)
head (newairdata)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
The following step converts this “wide”" data set into “long” data.
air_data_long <- newairdata %>% gather(City, Count, Los.Angeles:Seattle)
air_data_long
## X X.1 City Count
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 AM WEST delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 ALASKA delayed San.Francisco 102
## 15 AM WEST on time San.Francisco 320
## 16 AM WEST delayed San.Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
I have manipulated the data further to get the “on time” and “delayed” data into columns.
air_data_long2 <- air_data_long %>% spread(X.1, Count)
air_data_long2
## X City delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
Let’s add a column with includes percentages to give us a beter understanding of delays for each city.
air_data_long2 <- mutate(air_data_long2,
PercentageDelayed = round(air_data_long2$delayed/(air_data_long2$delayed + air_data_long2$'on time')*100,2)
)
air_data_long2 <- mutate(air_data_long2,
PercentageOnTime = round(air_data_long2$'on time'/(air_data_long2$delayed+air_data_long2$'on time')*100,2)
)
air_data_long2
## X City delayed on time PercentageDelayed
## 1 ALASKA Los.Angeles 62 497 11.09
## 2 ALASKA Phoenix 12 221 5.15
## 3 ALASKA San.Diego 20 212 8.62
## 4 ALASKA San.Francisco 102 503 16.86
## 5 ALASKA Seattle 305 1841 14.21
## 6 AM WEST Los.Angeles 117 694 14.43
## 7 AM WEST Phoenix 415 4840 7.90
## 8 AM WEST San.Diego 65 383 14.51
## 9 AM WEST San.Francisco 129 320 28.73
## 10 AM WEST Seattle 61 201 23.28
## PercentageOnTime
## 1 88.91
## 2 94.85
## 3 91.38
## 4 83.14
## 5 85.79
## 6 85.57
## 7 92.10
## 8 85.49
## 9 71.27
## 10 76.72
Now let’s look at the data graphically. We do a boxplot and looks at Percentage of delayed flights for both airlines. We can see that AM West has much higher percentage delayed.
plot(air_data_long2$X, air_data_long2$PercentageDelayed)
We can also look at Percent on time and that shows us the same result. Alaska has a much higher percent of on time arrivals.
plot(air_data_long2$X, air_data_long2$PercentageOnTime)